NAGbinder: Help Page

This is the help page for NAGbinder, a server developed for predicting NAG binding sites in a protein. It provides different types of information about NAGbinder. To present this information in a structured form, we have divided it into the following topics.

Importance of NAG binding: N-Acetylglucosamine (NAG) is one of the eight crucial sugars required to maintain the optimal health and functioning of the human body.
Datasets: Datasets were generated from NAG binding protein structures in the PDB (April 2019 release).
Evaluation of models: Standard protocols were used for evaluating the models developed in this study.
Algorithm: Standard algorithms were used for developing models that predict NAG interacting residues.
Help on Web Pages: Help pages with screenshots.

Importance of NAG binding

N-Acetylglucosamine (NAG) is one of the eight essential saccharides, which play a vital role in reducing allergies and relieving symptoms of chronic diseases such as arthritis, diabetes, lupus, and kidney disease. NAG actively participates in numerous processes in the human body, such as cartilage repair, reduction of inflammation in bones and joints, tissue rebuilding, the functioning of the digestive tract and nervous system, and the transport of molecules such as thyroglobulin.

To the best of our knowledge, there is no method that can predict NAG binding sites from a protein sequence. In contrast, a number of methods have been developed to predict residues that interact with a wide range of other ligands (e.g., ATP, GTP, NAD, FAD). Thus there is a need for a method that can predict NAG interacting residues in a protein.

For the first time, we have attempted to develop machine learning based models for predicting NAG interacting residues in a protein from its amino acid sequence. To develop NAGbinder, we used well-established techniques that are commonly used to predict ligand binding sites in a protein. The server has a number of modules for predicting NAG binding sites in a protein.

Datasets

We extracted the PDB IDs of 5736 NAG binding proteins present in the April 2019 PDB release. We removed all protein chains that exhibit high similarity to other chains. We also removed proteins whose resolution is poorer than 3 Å. Finally, we obtained 231 non-redundant (40%, CD-HIT) NAG binding protein chains whose resolution is better than 3 Å.

Training and testing dataset: It contains 118 NAG binding protein chains, with 1335 NAG interacting residues and 47198 non-interacting residues.

Validation dataset: The independent dataset contains 27 NAG binding protein chains, with 650 NAG interacting residues and 27733 non-interacting residues.

Model Evaluation

In this study, we used a standard procedure for evaluating the performance of our models. First, we divided the data into training and validation datasets in a ratio of 80% to 20%. All training and testing is performed on the training dataset, and the final performance of a model is evaluated on the validation (independent) dataset. The following is a brief description of the evaluation:

Five-fold cross-validation: Five-fold cross-validation was performed to evaluate the performance of the different models developed in this study. In this technique, the dataset is divided into five sets, of which four are used to train the model and the fifth is used to test its performance. This process is repeated five times, so that each set is used once for testing. The final performance is reported as the average of the performance obtained on the five test sets.
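
The five-fold procedure described above can be sketched in plain Python as follows. This is only an illustration, not the code behind NAGbinder: the function names, the fixed random seed, and the `train_and_score` callback (which would train a model on the training indices and return its score on the test indices) are assumptions introduced here.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    # A fixed seed is an assumption of this sketch, used for reproducibility.
    random.Random(seed).shuffle(indices)
    # Build k disjoint folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_items, train_and_score, k=5):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(n_items, k)]
    return sum(scores) / len(scores)
```

Each item appears in exactly one test fold, so every item is used once for testing and four times for training.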

External Validation: The five-fold cross-validation described above is an internal validation, in which the same dataset is used for both training and testing, so a model may be over-optimized. To measure the realistic performance of our models, we also performed external validation. In external validation, a model developed on the training dataset is evaluated on the validation (independent) dataset. Because the training and validation datasets do not share sequences, the data used for training and for validation are different, and the measured performance is realistic.

Performance Measures: In this study, we used both threshold-dependent and threshold-independent parameters to evaluate the performance of our models. For threshold-dependent measures, we used the standard parameters sensitivity, specificity, and accuracy. For the threshold-independent measure, we used the area under the ROC curve (AUC) to assess overall performance.
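
The measures named above can be computed as in the following sketch. The function names, the default threshold of 0.5, and the convention that label 1 means NAG interacting and label 0 means non-interacting are assumptions made for illustration; the AUC is computed here by the pairwise-comparison definition, not necessarily the way it was computed in the study.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity and accuracy at a given score threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of interacting residues found
        "specificity": tn / (tn + fp),   # fraction of non-interacting residues rejected
        "accuracy": (tp + tn) / len(pairs),
    }


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because interacting residues are far fewer than non-interacting residues in these datasets, the threshold-independent AUC is a useful complement to accuracy.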

Algorithm

In the last decade, a number of methods have been developed for predicting ligand interacting residues in a protein. We used a similar approach to develop models for predicting NAG interacting residues. The following are the major components of the algorithm used in this study:

Machine Learning Techniques: In this study, machine learning techniques were used for developing models. The major techniques used include SVM, Random Forest, ExtraTree, KNN, MLP, and the Ridge classifier. These techniques were implemented using the Python library scikit-learn.

Pattern Size or Window: To develop models using machine learning techniques, we need fixed-length vectors. We generated overlapping patterns (segments) with a window size of 9. We also added the dummy amino acid "X" at the N-terminus and C-terminus so that a pattern can be generated for every residue. These patterns were divided into NAG interacting and non-interacting patterns based on the status of the central residue of the pattern.
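
A minimal sketch of this pattern generation is shown below. The function names are hypothetical, and two details are assumptions of the sketch: four "X" residues are added at each terminus (half of the window of 9, so that every residue can sit at the centre), and interacting positions are given as 0-based indices.

```python
def make_patterns(sequence, window=9, pad="X"):
    """Return one pattern of length `window` per residue, centred on that residue."""
    half = window // 2
    # Pad both termini with the dummy residue so terminal residues get a full window.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window] for i in range(len(sequence))]


def label_patterns(sequence, interacting_positions, window=9):
    """Pair each pattern with 1 if its central residue is NAG interacting, else 0."""
    patterns = make_patterns(sequence, window)
    return [(pattern, 1 if i in interacting_positions else 0)
            for i, pattern in enumerate(patterns)]
```

For the sequence "MKTAYIAKQR", the first pattern is "XXXXMKTAY" (centred on M) and the last is "IAKQRXXXX" (centred on R).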

Binary Profile: To represent a pattern of 9 amino acids as a numerical vector, we represent each amino acid by a vector of dimension 21 (20 types of amino acids and one for the dummy amino acid "X"). The resulting vector, known as a binary profile, has dimension 189 (21 x 9); this is a commonly used technique for representing a pattern by numbers. Binary profiles of NAG interacting and non-interacting patterns were used for developing machine learning based models.
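
The binary profile can be illustrated with the following sketch. The ordering of the amino acids within each 21-value block and the mapping of any non-standard residue to "X" are assumptions made here; the text above fixes only the dimensions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is an assumption
ALPHABET = AMINO_ACIDS + "X"           # the 21st symbol is the dummy residue


def binary_profile(pattern):
    """One-hot encode a pattern: 21 values per residue, concatenated in order."""
    vector = []
    for residue in pattern:
        one_hot = [0] * len(ALPHABET)
        # Mapping unknown symbols to "X" is an assumption of this sketch.
        symbol = residue if residue in ALPHABET else "X"
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector
```

A pattern of 9 residues yields a vector of 189 values, exactly 9 of which are 1.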

PSSM Profile: Evolutionary information provides more information than the single sequence of a protein. In this study, we generated a PSSM profile for each protein using PSI-BLAST to extract its evolutionary information. The PSSM profile was normalized, and PSSM profiles corresponding to patterns of 9 amino acids were generated. Finally, machine learning based models were developed using the PSSM profiles of the patterns.
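
The step from a PSSM to window-based feature vectors might look like the sketch below. Several details are assumptions, since the text above does not specify them: the PSSM is taken as a list of rows with 20 scores per residue, normalization uses a logistic function, and terminal positions are padded with rows of zeros. The function names are hypothetical.

```python
import math


def normalize_pssm(pssm):
    """Scale raw PSSM scores into (0, 1) with a logistic function (an assumed choice)."""
    return [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]


def pssm_patterns(pssm, window=9):
    """Return one flattened feature vector (window x 20 values) per residue."""
    half = window // 2
    n_cols = len(pssm[0])
    pad_row = [0.0] * n_cols  # zero rows for terminal padding are an assumption
    padded = [pad_row] * half + normalize_pssm(pssm) + [pad_row] * half
    return [
        [value for row in padded[i:i + window] for value in row]
        for i in range(len(pssm))
    ]
```

With a window of 9 and 20 columns, each residue is described by 180 values, which can be passed to a classifier in the same way as a binary profile.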

Help Pages

The NAGbinder server discriminates between NAG interacting and non-interacting residues in a given sequence. It is a web server trained specifically for NAG interacting residues. Prediction is based on the Position Specific Scoring Matrix (PSSM) generated from the query sequence(s): the PSSM pattern of a 9-residue window is passed to a support vector classifier (SVC). The overall accuracy of the server is ~96.00%. A probability score between 0 and 9 is calculated; the higher the score, the higher the chance that the residue is NAG interacting.

This page provides help on the different modules of the server. The following are the major modules:

Sequence: This page allows the user to enter multiple sequences in FASTA format and select the desired prediction method. The user can view the results online and can also download them as a ".csv" file.

PSSM Profile: This module allows users to predict NAG interacting residues in a given protein sequence using evolutionary information in the form of a PSSM matrix. The user is asked to submit either a single protein sequence or a small number of sequences in FASTA format. We also provide the option to upload a file containing the sequences if their number is large.
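
Both modules accept input in FASTA format. The sketch below shows one way to read such input into (header, sequence) pairs before submission; it is a generic parser written for illustration, and the function name and its handling of blank lines and lowercase letters are assumptions, not a description of the server's own input validation.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be wrapped; join them and normalise case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Checking the number and length of the parsed sequences first can help decide whether to paste them into the form or upload them as a file.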