Stand-alone Programs

 

 

Installation of GPSR 1.0 package locally

 

§  GPSR 1.0: This package was developed mainly for UNIX machines. It can be downloaded from webs.iiitd.edu.in/raghava/gpsr/

 

§  Prerequisite software for the GPSR 1.0 package

 

SVMlight: freely available software (academic). It can be downloaded (old: version 5.0; new: version 6.1) from http://download.joachims.org/svm_light/current/svm_light.tar.gz

 

HMMER: freely available software (academic). Users can download the source code from ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/

MEME/MAST: freely available software (academic). Users can download the source code from http://meme.nbcr.net/meme4_1/meme-download.html

BLAST: freely available software (academic). Users can download the source code from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml

 

cdk.jar: freely available software (academic). Users can download the source code from http://sourceforge.net/projects/cdk/

 

§  To uncompress gpsr.tar.gz, execute the following command


$ tar -zxf gpsr.tar.gz

 

§  To install package

 

$ cd gpsr

$ perl install.pl

 

During installation, the script asks for the paths of the required software, such as Perl, BLAST, SVM old (version 5.0), SVM new (version 6.1), MEME/MAST, etc. All programs are installed in the bin folder.
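Because install.pl asks for the path of each prerequisite, it can help to check beforehand which tools are already on the PATH. The following is a minimal sketch, not part of the GPSR package. The executable names in REQUIRED_TOOLS are assumptions about typical installations (for example svm_classify for SVMlight and blastpgp for PSI-BLAST) and should be adjusted to your system.

```python
import shutil

# Assumed executable names; they are not taken from the GPSR documentation.
REQUIRED_TOOLS = ["perl", "svm_classify", "blastpgp", "hmmsearch", "meme", "mast"]


def find_missing(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Print the resolved path of every tool so it can be typed into install.pl.
    for tool in REQUIRED_TOOLS:
        path = shutil.which(tool)
        print(f"{tool}: {path if path else 'NOT FOUND'}")
```

Any tool reported as NOT FOUND needs to be installed, or its full path supplied by hand when install.pl prompts for it.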

 

§  To update package

 

  $ perl update.pl

 

Once the gpsr package is installed, run update.pl instead of install.pl whenever you wish to update the package or to change or add the path of any software.

 

 

Individual Software

 

 

ESLPred2

 

Application

ESLpred2, trained on organism-specific and generalized datasets, can be used to predict eukaryotic subcellular localizations. The webserver is available at http://webs.iiitd.edu.in/raghava/eslpred2/

 

Introduction

In the post-genomic era, sequencing projects have produced millions of raw protein sequences, and their functional annotation and characterization is a pressing challenge for the scientific community, which must bridge the widening gap between the number of unknown and annotated proteins. This calls for computational methods that predict protein function quickly and economically. One fundamental and popular indirect strategy for assigning function is to identify the subcellular compartment of a protein, since localization provides important clues about function. Since PSORT, the first method developed to predict subcellular localization, many novel, improved, generalized and organism-specific methods have been developed for predicting the subcellular locations of eukaryotic and prokaryotic proteins, namely NNPSL, PSORTB, FKNN, TargetP, SubLoc, SignalP, CELLO, LOCnet, PSLpred, HSLPred, PLOC, MultiLoc, Proteome Analyst, LOCtree, TSSub, BaCelLo and Esub8, using different datasets and protein input features. In 2004, our group combined similarity search information with sequence composition based attributes (ESLPred) and achieved an accuracy of up to 88%. In ESLPred2, a systematic approach has been taken to improve the prediction of eukaryotic subcellular localization by training SVMs on PSI-BLAST generated PSSM profiles along with compositional attributes and similarity search based information. The method achieves a high success rate with good overall and average accuracy, and hence complements the existing subcellular localization prediction methods.

 

Datasets

ESLPred2 was trained on the dataset previously used to develop the BaCelLo method. The dataset was retrieved from SWISSPROT version 48.0 and divided into three subsets by kingdom: animal (2597 sequences), fungi (1198 sequences) and plant (491 sequences). A major attraction of this dataset is the stringent cut-off of 30% used to reduce the similarity between sequences. The animal and fungi datasets cover four major localizations (cytoplasm, mitochondria, nucleus and extracellular), whereas the plant dataset includes a chloroplast class in addition to these four. In addition, the RH2427 dataset was used to train a generalized model for predicting the subcellular localization of eukaryotic proteins.

 

Results

The hybrid module, which combines similarity search based information with the amino acid composition of a single sequence (whole and N-terminal) and with profiles, attained an overall accuracy of ~94% and an average accuracy of 93.1% over four localizations on the RH2427 dataset. Using this hybrid approach, cytoplasmic, mitochondrial, nuclear and extracellular proteins were predicted with accuracies of 89.6%, 90.7%, 96.4% and 95.7%, respectively. ESLpred2 also attained accuracies of 80.8%, 75.9% and 76.6% for kingdom-specific animal, fungi and plant proteins, respectively, which are the best accuracies reported to date for the same dataset. Hence, ESLpred2 offers promising features for the prediction of eukaryotic subcellular localization, coupled with kingdom-specific SVM models. An interesting feature of the method is its hybrid of different protein features (composition of the PSSM profile, whole and N-terminal composition of the sequence, and similarity search based results), which supports more reliable and accurate assignment of subcellular localization irrespective of redundancy in the training datasets. The method complements existing subcellular location prediction methods.

 

Usage of standalone version

perl eslpred2 -i <seq_file> -m <method> -k <organism> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 3 modules for the prediction of subcellular localizations such as

a) Amino acid compositions (1);

b) PSSM (2);

c) Hybrid module for AAC, PSSM and PSI-BLAST based similarity (3).

Ø  Organism defines the models based on training dataset such as

a) A for Animal dataset;

b) F for Fungi dataset;

c) P for Plant dataset;

d) G for generalized dataset (RH2427).

Ø   Output_file defines the name of output file for storing results 
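The option values listed above can be validated before the script is launched. The sketch below only assembles the argument list from the usage line and does not run anything. The helper name build_eslpred2_command is hypothetical, and the assumption that the script is invoked as `perl eslpred2` from the current directory comes from the usage line alone.

```python
VALID_METHODS = {1, 2, 3}                # 1 = AAC, 2 = PSSM, 3 = hybrid
VALID_ORGANISMS = {"A", "F", "P", "G"}   # animal, fungi, plant, generalized


def build_eslpred2_command(seq_file, method, organism, output_file):
    """Return the argv list for the eslpred2 usage line after checking options."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {sorted(VALID_METHODS)}")
    if organism not in VALID_ORGANISMS:
        raise ValueError(f"organism must be one of {sorted(VALID_ORGANISMS)}")
    return ["perl", "eslpred2",
            "-i", seq_file,
            "-m", str(method),
            "-k", organism,
            "-o", output_file]


if __name__ == "__main__":
    # Hybrid module with the generalized (RH2427) model; file names are examples.
    print(" ".join(build_eslpred2_command("query.fasta", 3, "G", "result.txt")))
```

The returned list can be passed to subprocess.run once the package is installed and the script is reachable from the working directory.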

 

Publication

Garg A and Raghava GPS (2008) ESLpred2: Improved method for predicting subcellular localization of eukaryotic proteins. BMC Bioinformatics, 9:503.

 

 

 

ESLPred

 

Application

ESLPred is an SVM-based method for the prediction of subcellular localization of eukaryotic proteins. The webserver is available at http://webs.iiitd.edu.in/raghava/eslpred/

 

Introduction

Large-scale genome sequencing projects make the interpretation of genomic sequence data increasingly important, and with it the need to functionally annotate these data. The determination of subcellular localization of a protein can provide important clues to elucidate the function of the protein. Therefore, prediction of subcellular localization of proteins is an important step in understanding the biochemical function of proteins. In the past, various methods have been developed to predict the subcellular location of proteins using different approaches. The similarity search, in which a sequence is searched against an experimentally annotated database, is a technique commonly used to assign function to a protein, including its subcellular location. This approach fails in the absence of significant similarity between query and target protein sequences. Another way to predict subcellular localization of proteins is to identify sequence motifs such as a signal peptide or nuclear localization signal. The major limitation of motif-based methods is that not all proteins residing in a compartment share universal motifs.

To overcome these limitations, in the past numerous studies have been carried out to predict subcellular localization based on the features of protein sequence. The subcellular localization prediction methods are based either on recognition of N-terminal sorting signals or on the composition of amino acids. In ESLPred, a systematic attempt has been made to achieve higher prediction accuracy for subcellular localization of eukaryotic proteins from their different features. The SVM modules were developed based on the following features of a protein: (i) amino acid composition (commonly used in the literature for classification of proteins), (ii) overall physico-chemical properties (e.g. hydrophobicity, hydrophilicity, polarity) and (iii) dipeptide compositions (e.g. ala–ala, ala–leu, val–ser). In addition, a similarity search based module, EuPSI-BLAST, was also constructed using PSI-BLAST to predict the localization of a protein. Finally, a hybrid SVM module was developed using all three features of proteins mentioned above and prediction results of EuPSI-BLAST.

 

Datasets

The dataset used in developing ESLPred was also used in the development of SubLoc and NNPSL. This dataset was generated from version 33.0 of SWISS-PROT by Reinhardt and Hubbard (RH2427). The dataset consisted of complete and non-redundant proteins with less than 90% sequence identity whose subcellular localization is experimentally determined. This dataset consisted of a total of 2427 eukaryotic proteins (1097 nuclear, 684 cytoplasmic, 321 mitochondrial and 325 extracellular proteins).

 

Results

Support vector machines (SVMs) were used to predict the subcellular location of eukaryotic proteins from their different features, such as amino acid composition, dipeptide composition and physico-chemical properties. The SVM module based on dipeptide composition performed better than the SVM modules based on amino acid composition or physico-chemical properties. In addition, PSI-BLAST was used to search the query sequence against the dataset of experimentally annotated proteins to predict its subcellular location. In order to improve the prediction accuracy, we developed a hybrid module using all features of a protein, which consisted of an input vector of 458 dimensions (400 dipeptide compositions, 33 properties, 20 amino acid compositions of the protein and 5 from PSI-BLAST output). Using this hybrid approach, the prediction accuracies of nuclear, cytoplasmic, mitochondrial and extracellular proteins reached 95.3, 85.2, 68.2 and 88.9%, respectively. The overall prediction accuracy of SVM modules based on amino acid composition, physico-chemical properties, dipeptide composition and the hybrid approach was 78.1, 77.8, 82.9 and 88.0%, respectively. The accuracy of all the modules was evaluated using a 5-fold cross-validation technique. With a reliability index of at least 3 (reliability index ≥ 3), 73.5% of predictions can be made with an accuracy of 96.4%.
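The two composition features that account for 420 of the 458 input dimensions can be sketched as follows. This is an illustration and not the ESLPred code: expressing composition as a percentage and skipping non-standard residues are assumptions. The 33 physico-chemical properties and the 5 PSI-BLAST values are not reproduced because the text does not specify them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return 20 values: percentage of each standard residue in `seq`."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


def dipeptide_composition(seq):
    """Return 400 values: percentage of each adjacent residue pair in `seq`."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:          # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [100.0 * counts[d] / total if total else 0.0 for d in DIPEPTIDES]


if __name__ == "__main__":
    sequence = "MKVLAAGIVGLLLAQ"    # made-up example sequence
    features = amino_acid_composition(sequence) + dipeptide_composition(sequence)
    print(len(features))           # 20 + 400 composition features
```

Concatenating the two lists gives the composition part of the hybrid input; the remaining 38 values would come from the property and PSI-BLAST modules described above.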

 

Usage of standalone version

perl eslpred -i <seq_file> -m <method> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 5 trained SVM modules for the prediction of subcellular localizations such as

a) Amino acid compositions (1);

b) Overall physico-chemical properties (2);

c) Dipeptide compositions (3);

d) PSI-BLAST similarity based (4);

e) Hybrid module (5).

Ø  Output_file defines the name of output file for storing results 
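Since the input file may hold one or many sequences in FASTA format, a quick check of its contents before running a prediction can prevent surprises. The reader below is a minimal sketch written for this manual, not code from the package. It assumes that each record starts with a `>` header line and that sequence lines may be wrapped.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # wrapped sequence line
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">protein_1 example", "MKVLAAGIVG", "LLLAQ", ">protein_2", "GGSTW"]
    for name, sequence in read_fasta(example):
        print(name, len(sequence))
```

To inspect a real file, pass an open file handle, for example `read_fasta(open("query.fasta"))`, where the file name is only an example.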

 

Publication

Bhasin M and Raghava GPS (2004) ESLpred: SVM Based Method for Subcellular Localization of Eukaryotic Proteins using Dipeptide Composition and PSI-BLAST. Nucleic Acids Research 32:W414-9.

 

 

HSLPred

 

Application

HSLPred is an SVM-based method for the prediction of subcellular localizations of human proteins. The webserver is available at http://webs.iiitd.edu.in/raghava/hslpred/

 

Introduction

The successful completion of the human genome project has yielded a huge amount of sequence data. Analysis of these data to extract biological information can have profound implications for biomedical research. Therefore, mining biological information from, or functionally annotating, the accumulated sequence data is a major challenge for the modern scientific community. Determining the functions of all these proteins experimentally is a difficult and time-consuming task. Traditionally, similarity search based tools have been used for the functional annotation of proteins. This approach fails when the query protein has no significant homology to proteins of known function. The function of a protein is closely related to its cellular attributes, such as subcellular localization and association with the lipid bilayer; hence, related proteins must be localized in the same cellular compartment to cooperate toward a common function. In addition, information on the localization of a protein with known function may provide insight into its involvement in specific metabolic pathways. Therefore, an attempt has been made to predict the subcellular localization of proteins in order to elucidate their function. Several methods have been devised earlier to predict the subcellular localization of eukaryotic and prokaryotic proteins using different approaches and data sets. To the best of our knowledge, there is no method for the prediction of subcellular localization of human proteins. The availability of sequence data for human genes in recent years demands a reliable and accurate method for predicting the subcellular localization of human proteins.

HSLpred is based on different features of the proteins such as amino acid and dipeptide composition of proteins. In addition, a similarity search-based module, HuPSI-BLAST, has also been developed, using PSI-BLAST to predict the localization of human proteins. Further, SVM module "hybrid1" has been developed using amino acid composition, traditional dipeptide composition, and results of PSI-BLAST prediction. The SVM modules based on higher order dipeptide compositions (i + 2, i + 3, and i + 4) and combinations of various feature-based modules have also been constructed. In addition, the performance of HSLPred has also been assessed on various mammalian and nonmammalian genomes and on an independent data set. It was observed that this method can predict the subcellular localization of human proteins and proteins from related genomes with high accuracy. In other words, our method can also be used for the prediction of subcellular localization of mammalian proteins.

 

Datasets

The dataset of human proteins used to develop HSLpred was extracted from a special release of the SWISSPROT database. The final non-redundant data set consisted of a total of 3532 human proteins (840 cytoplasmic, 315 mitochondrial, 858 nuclear, 1519 plasma membrane). The dataset is available at webs.iiitd.edu.in/raghava/hslpred.

 

Results

SVM based modules for predicting subcellular localization using traditional amino acid and dipeptide (i + 1) composition achieved overall accuracy of 76.6 and 77.8%, respectively. PSI-BLAST, when carried out using a similarity-based search against a nonredundant data base of experimentally annotated proteins, yielded 73.3% accuracy. To gain further insight, a hybrid module (hybrid1) was developed based on amino acid composition, dipeptide composition, and similarity information and attained better accuracy of 84.9%. In addition, SVM modules based on a different higher order dipeptide i.e. i + 2, i + 3, and i + 4 were also constructed for the prediction of subcellular localization of human proteins, and overall accuracy of 79.7, 77.5, and 77.1% was accomplished, respectively. Furthermore, another SVM module hybrid2 was developed using traditional dipeptide (i + 1) and higher order dipeptide (i + 2, i + 3, and i + 4) compositions, which gave an overall accuracy of 81.3%. We also developed SVM module hybrid3 (final) based on amino acid composition, traditional and higher order dipeptide compositions, and PSI-BLAST output and achieved an overall accuracy of 84.4%.
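The difference between the traditional (i + 1) and the higher order (i + 2, i + 3, i + 4) dipeptide compositions is only the distance between the two residues that form a pair. The sketch below illustrates this with a single function. It is not the HSLPred implementation: percentage scaling and the simple concatenation shown for a hybrid2-style vector are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def gapped_dipeptide_composition(seq, offset=1):
    """Return 400 percentages for pairs formed by residues i and i + offset."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - offset):
        pair = seq[i] + seq[i + offset]
        if pair in counts:          # skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [100.0 * counts[d] / total if total else 0.0 for d in DIPEPTIDES]


if __name__ == "__main__":
    sequence = "MKVLAAGIVGLLLAQ"    # made-up example sequence
    # Assumed layout: i+1, i+2, i+3 and i+4 compositions placed side by side.
    combined = []
    for offset in (1, 2, 3, 4):
        combined.extend(gapped_dipeptide_composition(sequence, offset))
    print(len(combined))
```

With offset=1 the function reduces to the traditional dipeptide composition, so the same code covers every order mentioned above.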

 

Usage of standalone version

perl hslpred -i <seq_file> -m <method> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 4 trained SVM modules for the prediction of subcellular localizations such as

a) Amino acid compositions (1);

b) Dipeptide compositions (2);

c) PSI-BLAST similarity based (3);

d) Hybrid module (a+b+c) (4).

Ø  Output_file defines the name of output file for storing results 

Publication

Garg A, Bhasin M and Raghava GP (2005) SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search. J Biol Chem 280:14427-32.

 

 

 

PSLpred

 

Application

PSLpred is an SVM-based method for the prediction of subcellular localizations of prokaryotic proteins. The webserver is available at http://webs.iiitd.edu.in/raghava/pslpred/

 

Introduction

Prokaryotes are the causative agents of most deadly diseases and widespread epidemics; hence, biologists are paying much attention to the functional annotation of prokaryotic proteins. This may further guide the determination of virulence factors as well as new patterns of resistance to antibiotic agents in pathogenic bacteria. Hence, prediction of the subcellular localization of proteins (an alternative to functional annotation) of gram-negative bacteria would be very useful in molecular biology, cell biology, pharmacology and medical science. A number of methods such as PSORT I, PSORT-B and NNPSL have been developed for predicting the subcellular localization of bacterial proteins based on different datasets and computational techniques. The accuracies reported by these methods vary between 60 and 81%. Recently, a support vector machine (SVM) based method, CELLO, trained using n-peptide compositions, has been developed for predicting the subcellular localization of bacterial proteins. This method achieved an overall accuracy of 89%, which is better than earlier methods for the subcellular localization of prokaryotic proteins. Despite the overall improved performance, CELLO predicts extracellular proteins, which may represent important virulence factors in pathogenic microorganisms, with only a fair accuracy of 78.9%. PSLpred, an SVM-based method, has been developed for the prediction of subcellular localization of prokaryotic proteins using input features such as amino acid and dipeptide composition and physico-chemical properties, along with similarity search based results.

 

Datasets

The data set used in the present study is the same as that used for developing the methods CELLO and PSORT-B. This data set was generated from SWISSPROT release 40.29 and consisted of a total of 1443 proteins belonging to different subcellular localizations. We excluded 141 proteins residing in more than one subcellular location and used the remaining 1302 proteins (248 cytoplasmic, 268 inner membrane, 244 periplasmic, 352 outer membrane and 190 extracellular) for the development of PSLpred.

 

 

Results

PSLpred is a hybrid approach-based method that integrates PSI-BLAST and three SVM modules based on compositions of residues, dipeptides and physico-chemical properties and predicts the subcellular localization of gram-negative bacterial proteins with an overall accuracy of 91.2%. The prediction accuracies of 90.7, 86.8, 90.3, 95.2 and 90.6% were attained for cytoplasmic, extracellular, inner-membrane, outer-membrane and periplasmic proteins, respectively. Furthermore, PSLpred was able to predict 74% of sequences with an average prediction accuracy of 98% at RI = 5. The performance of the hybrid module was compared with methods such as CELLO, PSORT-B, which were also developed from the same data set. It has been observed that overall performance of the hybrid module is nearly 2% higher than CELLO and 16% higher than that of PSORT-B. Hence PSLpred is more accurate for the subcellular localization of prokaryotic proteins.

 

Usage of standalone version

perl pslpred -i <seq_file> -m <method> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 5 trained SVM modules for the prediction of subcellular localizations such as

a) Amino acid compositions (1);

b) Physico-chemical properties (2);

c) Dipeptide compositions (3);

d) PSI-BLAST similarity based (4);

e) Hybrid module (5).

Ø  Output_file defines the name of output file for storing results 

 

Publication

Bhasin M, Garg A and Raghava GPS (2005) PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics 21(10):2522-4.

 

 

 

SRTPred

 

Application

SRTPred is an SVM-based method for the classification of a protein sequence as a secretory or non-secretory protein. The webserver is available at http://webs.iiitd.edu.in/raghava/srtpred/

 

Introduction

Protein secretion is a universal process that occurs in all organisms and is of tremendous importance to biological research. In pathogenic microorganisms, secretory pathways deliver virulence factors to their sites of action, release soluble extracellular enzymes into the surrounding medium, or specifically target proteins to the host cell. In several instances, protein secretion pathways are similar to those involved in the assembly of bacterial appendages. Further, several secretory proteins have been identified as major targets for drug development. Hence, an automatic method for the prediction of secretory proteins would help studies aimed at deciphering secretory pathways and could also lead to the identification of novel drug targets of value for biomedical research.

Until now, many methods have been developed for the classification and prediction of subcellular localizations of proteins based on signal peptides (SPs), mainly SignalP and pTarget. TargetP is a neural network based method that discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway and other localizations with an overall success rate of 85.3% and a sensitivity of 0.96 for non-plant secretory proteins. SignalP (version 3.0), also neural network based, achieves a high sensitivity of 0.99 and an overall accuracy of 0.93 for eukaryotic signal peptide discrimination. Although these methods achieve high prediction accuracy for classical secreted proteins, they fail on proteins without an SP. Hence, non-classical secreted proteins also require an automated prediction method. Recently, the webserver SecretomeP has been developed to predict non-classical secreted proteins, based on the idea that extracellular proteins share certain features regardless of the pathway used to secrete them. It is a neural network based method that uses several protein features, such as number of atoms, positively charged residues, propeptide cleavage sites, protein sorting, low complexity regions and transmembrane helices, as input to train the network. Despite considering a large number of protein features, the method reaches a sensitivity of only 40% at a false positive rate below 5%. To date, no method is available that predicts secretory proteins, irrespective of pathway or SP, with better accuracy. SRTPred is an automated method that predicts secretory proteins (irrespective of N-terminal SP) based on different features of the whole protein sequence.

Dataset

The data set used in the present study consisted of 6975 mammalian protein sequences, of which 3321 were extracellular proteins secreted via classical and non-classical pathways (positive examples), whereas the remaining 3654 proteins were annotated as cytoplasmic and/or nuclear (negative examples). The same dataset was previously used to develop the SecretomeP method and is publicly available at http://www.cbs.dtu.dk/services/SecretomeP-1.0/datasets.php. The sequences were extracted from the Swiss-Prot database on the basis of subcellular localization annotations in the comment block.

 

Results

SRTpred is a systematic attempt to predict secretory proteins irrespective of the presence or absence of N-terminal signal peptides (also known as classical and non-classical secreted proteins, respectively) using two machine-learning techniques: artificial neural network (ANN) and support vector machine (SVM). We trained and tested our methods on a dataset of 3321 secretory and 3654 non-secretory mammalian proteins using five-fold cross-validation. First, ANN-based modules were developed for predicting secretory proteins using 33 physico-chemical properties, amino acid composition and dipeptide composition; they achieved accuracies of 73.1%, 76.1% and 77.1%, respectively. Similarly, SVM-based modules using 33 physico-chemical properties, amino acid composition and dipeptide composition achieved accuracies of 77.4%, 79.4% and 79.9%, respectively. In addition, BLAST and PSI-BLAST modules designed for predicting secretory proteins based on similarity search achieved 23.4% and 26.9% accuracy, respectively. Finally, we developed a hybrid approach by integrating the amino acid and dipeptide composition based SVM modules with the PSI-BLAST module, which increased the accuracy to 83.2%, significantly better than the individual modules. Using the hybrid module, we also achieved a sensitivity of 60.4% at a low rate of 5% false positive predictions.

 

Usage of standalone version

perl srtpred -i <seq_file> -m <method> -t <threshold_value> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 5 trained SVM modules for the prediction of secretory proteins such as

a) Amino acid compositions (1);

b) Properties based (2);

c) Dipeptide compositions (3);

d) PSI-BLAST similarity based (4);

e) Hybrid module of a+c+d (5).

 

Ø  Threshold_value defines the selection of threshold value in the range of -1.5 to 1.5

Ø  Output_file defines the name of output file for storing results 
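The effect of the threshold option can be pictured as a simple cut on the SVM score. The sketch below is an illustration only. It assumes that a score at or above the threshold is reported as secretory and that 0.0 is a sensible default; neither detail is stated in this manual, and the function name classify is hypothetical.

```python
def classify(score, threshold=0.0):
    """Label an SVM score as secretory or non-secretory at a given threshold."""
    if not -1.5 <= threshold <= 1.5:
        raise ValueError("threshold must lie between -1.5 and 1.5")
    # Assumption: scores at or above the threshold are called secretory.
    return "secretory" if score >= threshold else "non-secretory"


if __name__ == "__main__":
    # The same made-up score changes label as the threshold becomes stricter.
    for t in (-1.0, 0.0, 1.0):
        print(t, classify(0.3, t))
```

Under these assumptions a higher threshold gives fewer but more confident secretory calls, while a lower threshold trades false positives for sensitivity.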

 

Publication

Garg A and Raghava GPS (2008) A machine learning based method for the prediction of secretory proteins using amino acid composition, their order and similarity-search. In Silico Biology 8:129-40.

 

 

OxyPred

 

Application

OxyPred is an SVM-based method to predict oxygen-binding proteins such as erythrocruorin, hemoglobin, myoglobin, hemerythrin, leghemoglobin and hemocyanin. The webserver is available at http://webs.iiitd.edu.in/raghava/oxypred/

 

Introduction

Oxygen-binding proteins are widely present in eukaryotes ranging from non-vertebrates to humans. Moreover, these proteins have also been reported in many prokaryotes and protozoans. The occurrence of oxygen-binding proteins in all kingdoms of organisms, though not in all organisms, shows their biological importance. Extensive studies on oxygen-binding proteins have categorized them into six broad types: erythrocruorin, hemerythrin, hemocyanin, hemoglobin, leghemoglobin and myoglobin, each with its own functional characteristics, structure and oxygen-binding capacity. These oxygen-binding proteins are crucial for the survival of any living organism. With the advancement in sequencing technology, the size of protein sequence databases is growing at an exponential rate. Thus, bioinformatics methods are much needed for the functional annotation of proteins, particularly for identifying oxygen-binding proteins. We have developed a reliable SVM-based method for predicting and classifying oxygen-binding proteins using different residue compositions.

 

Dataset

The sequences of oxygen-binding proteins and non-oxygen-binding proteins were obtained from the Swiss-Prot database (http://www.expasy.org/sprot/). In order to obtain a high-quality dataset, we removed all proteins annotated as “fragments”, “isoforms”, “potentials”, “similarity” or “probables” and created a non-redundant dataset in which no two proteins have a similarity of more than 90%, using PROSET software. Our final dataset consisted of 672 oxygen-binding proteins and 700 non-oxygen-binding proteins. The 672 oxygen-binding proteins were then classified into six classes, consisting of 20 erythrocruorin, 31 hemerythrin, 77 hemocyanin, 486 hemoglobin, 13 leghemoglobin and 45 myoglobin proteins.

 

Results

SVM modules were developed using amino acid composition and dipeptide composition for predicting oxygen-binding proteins and achieved maximum accuracy of 85.5% and 87.8%, respectively. Secondly, an SVM module was developed based on amino acid composition, classifying the predicted oxygen-binding proteins into six classes with accuracy of 95.8%, 97.5%, 97.5%, 96.9%, 99.4%, and 96.0% for erythrocruorin, hemerythrin, hemocyanin, hemoglobin, leghemoglobin, and myoglobin proteins, respectively. Finally, a module was developed using dipeptide composition for classifying the oxygen-binding proteins, and achieved maximum accuracy of 96.1%, 98.7%, 98.7%, 85.6%, 99.6%, and 93.3% for the above six classes, respectively.

 

Usage of standalone version

perl oxypred -i <seq_file> -m <method> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 2 trained SVM modules for the prediction such as

a) Amino acid compositions (1);

b) Dipeptide compositions (2);

Ø  Output_file defines the name of output file for storing results 

 

Publication

Muthukrishnan S, Garg A and Raghava GPS (2007) OxyPred: Prediction and Classification of Oxygen-Binding Proteins. Genomics, Proteomics & Bioinformatics 5:250-2

 

 

DPROT

 

Application

DPROT is an SVM-based method to predict disordered proteins using evolutionary information. The webserver is available at http://webs.iiitd.edu.in/raghava/dprot/

 

Introduction

Knowledge of the three-dimensional (3D) structure of a protein is essential to deduce its biological function. Since prediction of secondary structure is an intermediate step in structure determination, a number of secondary and super-secondary structure prediction methods have been developed by our group in the past. However, the past few years have seen a growing interest in structural studies of proteins that are structurally disordered, often known as disordered proteins. These proteins have been gaining attention from biologists since their involvement in various physiological disorders, including protein deposition diseases such as Alzheimer's and Parkinson's diseases, became evident. From the structural point of view, a disordered protein or disordered region lacks a specific tertiary structure and is composed of an ensemble of conformations, usually with distinct and dynamic φ and ψ angles. In their purified state at neutral pH, these proteins either have been shown experimentally or are predicted to lack ordered structure. The existence of disorder is determined by overall protein dynamics rather than by local secondary structure. These proteins are also referred to as “natively unfolded” or “intrinsically unstructured”.

Several predictors have been developed in the past for predicting disordered proteins/regions, for instance PONDR (Jones et al. 2003), DISOPRED2 (Ward et al. 2004), GlobPlot (Linding et al. 2003), DISEMBL (Linding et al. 2003), FoldIndex (Sussman et al. 2005) and RONN (Yang et al. 2005). All these predictors exploit various attributes of the protein sequence, such as amino acid composition, flexibility, charge, hydropathy, PSI-BLAST profiles, and propensities for secondary structure and random coils. On the other hand, IUPRED (Dosztanyi et al. 2005), which is based on inter-residue interactions, predicts regions that lack a well-defined 3D structure under native conditions, whilst FoldUnfold (Galzitskaya et al. 2006) predicts disordered regions by estimating the number of contacts of the whole protein. Recently, the predictor POODLE has been developed (Shimizu et al. 2007), which can predict disordered proteins with a high sensitivity of 72.3% and an accuracy of 97.7%. POODLE is based on Joachims' spectral graph transducer (SGT), a binary classifier based on semi-supervised learning. Despite such high prediction accuracy, the method seems to be insensitive for the set of partially disordered proteins. This insensitivity might be due to the use of a single protein feature, namely amino acid composition, for prediction.

The present study was undertaken to further improve the prediction performance for classifying ordered and disordered proteins by introducing a new input feature, secondary structure composition, alongside conventionally used protein features such as amino acid composition, dipeptide composition, and Position Specific Scoring Matrix (PSSM) composition. The best performance was observed for the PSSM-based module, which captures multiple sequence alignment information for the prediction of disordered proteins; hence, this module has been implemented in the web server and the standalone version.

 

 Dataset

A representative dataset of 608 proteins was used: 526 ordered and 82 disordered proteins. The same dataset was earlier used to develop the POODLE web server. Its raw data, retrieved from DisProt (version 3.3), were subsequently processed following an intensive protocol. Additionally, a dataset of 417 partially disordered proteins was used for independent testing.

 

Results

The association of structurally disordered proteins with a number of diseases has engendered enormous interest and hence demands a prediction method that would expedite their study at the molecular level. DPROT is a computational method for the prediction of disordered proteins that uses sequence and profile compositions as input features for training SVM models. First, we developed amino acid and dipeptide composition based SVM modules, which yielded sensitivities of 75.6% and 73.2% along with MCC values of 0.75 and 0.60, respectively. In addition, the use of predicted secondary structure content (coil, sheet and helix) in the form of composition values attained a sensitivity of 76.8% and an MCC of 0.77. Finally, training SVM models on the evolutionary information hidden in the multiple sequence alignment profile improved the prediction performance, achieving a sensitivity of 78% and an MCC of 0.78. Furthermore, the same SVM module, when evaluated on an independent dataset of partially disordered proteins, provided 86.6% correct predictions.
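The two measures quoted above, sensitivity and the Matthews correlation coefficient (MCC), can be computed from a confusion matrix. The following Python sketch uses the standard definitions; the function names and the counts in the usage note are hypothetical and are not taken from the DPROT evaluation.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true positives (e.g. disordered proteins) that are recovered."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, with made-up counts of 8 true positives, 2 false negatives, 85 true negatives and 5 false positives, `sensitivity(8, 2)` gives 0.8 and `mcc(8, 85, 5, 2)` gives roughly 0.66.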

 

Usage of standalone version

perl dprot -i <seq_file> -t <threshold_value> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Threshold_value defines the selection of threshold value in the range of -1.5 to 1.5

Ø  Output_file defines the name of output file for storing results 

 

Publication

Sethi D, Garg A and Raghava GPS (2008) DPROT: Prediction of Disordered Proteins using Evolutionary Information. Amino Acids 35:599-605.

 

 

NRpred

 

Application

NRpred is an SVM based tool for the classification of nuclear receptors on the basis of amino acid composition or dipeptide composition. The webserver is available at http://webs.iiitd.edu.in/raghava/nrpred/

 

Introduction

The recognition of nuclear receptors is crucial because many of them are potential drug targets for developing therapeutic strategies for diseases like breast cancer and diabetes. Nuclear receptors are one of the most abundant classes of transcriptional regulators and regulate diverse functions during reproduction, metabolism and development. They function as ligand-activated transcription factors, providing a direct link between the signaling molecules that control these processes and transcriptional responses. Nuclear receptors also share a common structural organization. All nuclear receptors consist of six distinct regions or domains: highly variable N- and C-terminal regions (A/B and F domains) that contain one or more transactivation regions, a central well-conserved DNA binding domain (C), a non-conserved hinge region (D) that contains the Nuclear Localization Signal (NLS), and a moderately conserved ligand binding domain (E) (4). The DNA binding domain (C region) of nuclear receptors consists of two zinc fingers, which act as a signature for this superfamily. The presence of these zinc fingers facilitates the recognition of nuclear receptors from genome sequences using simple similarity-based search tools like BLAST and FASTA. The major limitation of these search tools, however, is that they are not able to classify the subfamilies of nuclear receptors. According to the nucleaRDB database, nuclear receptors have been classified into seven subfamilies, which include the thyroid hormone like and estrogen like receptors. Classification of these subfamilies is difficult using phylogeny or BLAST based tools due to the scarcity of data for some subfamilies. Thus, there is a crucial need for methods that enable automated assignment of nuclear receptor subfamilies. In this report, we have made an attempt to develop a method for recognizing the subfamilies of nuclear receptors.
We were able to design a method for recognizing four subfamilies of nuclear receptors: Thyroid hormone like (TR, RAR, ROR), HNF4-like (HNF4, RXR, TLL, Coup, USP), Estrogen like (ER, ERR, GR, MR, PR, AR) and Fushi tarazu-F1 like (SFI, FTF, FTZ-F1). Sequences for the other three subfamilies are not available in significant numbers (fewer than 10). The classification of nuclear receptors into subfamilies was based on amino acid composition and dipeptide composition. Both are simple approaches for producing fixed-length patterns from protein sequences of varying length. In the past, amino acid composition has been used to predict the domain structural class and subcellular localization of proteins. Dipeptide composition is also widely used to encapsulate global information, giving a fixed pattern length of 400. It has previously been used for the prediction of subcellular localization of proteins and for fold recognition. In this study, Support Vector Machines (SVM) were applied to classify nuclear receptors.
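To make the two feature types concrete, the following Python sketch computes a 20-value amino acid composition and a 400-value dipeptide composition as percentages. It is an illustration only, not the code shipped with NRpred; the function names, the residue ordering and the choice to skip non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Percentage of each standard residue; always returns 20 values."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [100.0 * residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each overlapping residue pair; always returns 400 values."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]
```

Whatever the length of the input sequence, the two functions return vectors of length 20 and 400 respectively, which is what allows sequences of varying length to be fed to an SVM.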

 

Dataset

The data for four subfamilies of nuclear receptors was obtained from nucleaRDB database available at http://www.receptors.org/NR/. All the entries, which were not marked as fragments, were extracted from the database by text parsing method. The initial dataset had 577 sequences belonging to four subfamilies of nuclear receptors. Redundancy was reduced so that no sequence had >=90% sequence identity with any other sequence in the data set, using PROSET software. The final dataset contains 282 sequences belonging to different subfamilies of nuclear receptors.

 

Results

The performance of all classifiers was evaluated using a 5-fold cross-validation test. It was found that the different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracies of the amino acid composition and dipeptide composition based classifiers were 82.6% and 97.5%, respectively. Our results therefore show that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition.

 

Usage of standalone version

perl nrpred -i <seq_file> -m <method> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 2 trained SVM modules for the prediction such as

a) Amino acid compositions (1);

b) Dipeptide compositions (2);

Ø  Output_file defines the name of output file for storing results 

 

Publication

Bhasin M and Raghava GPS (2004) Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem 279:23262-6

 

 

PLPred

Application

PLPred is an SVM based method to predict and classify plastids. The webserver is available at http://webs.iiitd.edu.in/raghava/plpred/

 

Introduction

Plastids are characteristic plant cell organelles that perform essential biosynthetic and metabolic functions. These include photosynthetic carbon fixation and the synthesis of amino acids, fatty acids, starch and secondary metabolites such as pigments. On the basis of their structure, pigment composition (colour), metabolism and function, plastids are classified as chloroplasts in photosynthetically active tissues, chromoplasts in fruits and petals, amyloplasts in roots, etioplasts in dark-grown seedlings and elaioplasts in the seed endosperm. Although plastids are of significant biological interest, our current understanding of the metabolite functions and capacities of different plastid types is still limited. Proteomics is a powerful approach to map the complete set of plastid proteins and to infer plastid-type specific metabolite functions, yet only a few proteomic approaches have been reported. Besides being time consuming, the experimental approaches face several other constraints; for example, chloroplast proteome analysis is nearing saturation because the detection of new proteins is constrained by the highly abundant photosynthetic proteins that dominate the proteome of photosynthetically active chloroplasts. To circumvent these constraints and to increase proteome coverage, the development of highly efficient computational prediction tools is a complementary approach that can provide useful global information about the possible evolution of the plastid proteome. PLpred is an attempt in this direction: a Support Vector Machine (SVM) based two-phase prediction tool for identifying as well as classifying plastid proteins.

Various features of a protein sequence, viz. amino acid composition, dipeptide composition and Split Amino Acid Composition (SAAC), were exploited in the development of this prediction method. A similarity search based PSI-BLAST module was also developed. In addition, N-terminal and C-terminal amino acid composition based SVM modules as well as hybrid classifiers were developed in order to encapsulate more comprehensive information from a protein sequence. Finally, the best modules were selected and made available on this server for the classification of plastid proteins.

 

Dataset

Only a few proteomic approaches have been reported for inferring plastid-type specific functions, and thus very few experimentally verified plastid protein sequences are available in the public databases. Protein sequences for etioplast and chloroplast were downloaded from the PLprot database. For amyloplast and chromoplast, the whole of UniProt was searched for available sequences. A total of 1033 protein sequences were extracted from these two databases for the four plastid types. To generate data for the phase-I training process, all 1033 plastid-type protein sequences were combined to form one 'positive dataset' for developing the various phase-I prediction classifiers. To generate the 'negative dataset', we downloaded experimentally annotated sequences belonging to the cytoplasmic and nuclear cellular localizations. Because both cytoplasmic and nucleus-targeted proteins lack signal peptides in their N-terminal region, whereas plastid proteins always carry N-terminal targeting peptides, these sequences were considered a better option for creating the 'negative dataset'. Hence, the 'negative dataset' for training in phase-I consisted of 103 cytoplasmic sequences from rice, 226 cytoplasmic sequences from Arabidopsis and 704 nuclear proteins from Arabidopsis (total = 1033 sequences).

 

Results

The present prediction tool is a two-phase process and was developed in two stages. In the first stage, when a user submits a query sequence, it is first predicted as a plastid or non-plastid protein (phase-I). If the query protein is predicted as 'Plastid' in phase-I, it is passed to the next stage, the classification stage (phase-II). Here, the query protein is classified into one of the plastid-type classes (Chloroplast, Chromoplast, Etioplast or Amyloplast). To develop the hybrid module, we combined the traditional amino acid composition and the dipeptide composition with the four-parts based amino acid composition and the similarity-based PSI-BLAST approach. Thus, the SVM input vector in this case had 505 elements (20 for amino acid composition, 400 for dipeptide composition, 80 for four-parts based amino acid composition and 5 for the PSI-BLAST output in binary representation). The best results were again obtained with the RBF kernel, with an overall accuracy of 90.13% and an overall MCC of 0.76.
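The arithmetic behind the 505-element hybrid vector (20 + 400 + 80 + 5) can be sketched in Python as follows. This is an illustrative reconstruction, not the PLpred source: the helper names, the way a sequence is cut into four parts, and the meaning of each of the five PSI-BLAST bits are all assumptions, so the bits are simply passed in by the caller.

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AA, repeat=2)]


def composition(seq):
    """20 percentage values for one sequence segment."""
    n = len(seq) or 1
    return [100.0 * seq.count(a) / n for a in AA]


def dipeptide_composition(seq):
    """400 percentage values over overlapping residue pairs."""
    n = max(len(seq) - 1, 1)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [100.0 * counts[d] / n for d in DIPEPTIDES]


def four_part_composition(seq):
    """Composition of four consecutive segments: 4 x 20 = 80 values.

    Assumption: equal-sized parts, with any remainder going to the last part.
    """
    size = len(seq) // 4
    parts = [seq[:size], seq[size:2 * size], seq[2 * size:3 * size], seq[3 * size:]]
    vector = []
    for part in parts:
        vector.extend(composition(part))
    return vector


def hybrid_vector(seq, blast_bits):
    """Concatenate all features; blast_bits is a 5-element 0/1 list (hypothetical encoding)."""
    if len(blast_bits) != 5:
        raise ValueError("expected 5 PSI-BLAST bits")
    seq = seq.upper()
    return (composition(seq) + dipeptide_composition(seq)
            + four_part_composition(seq) + list(blast_bits))
```

Calling `hybrid_vector(sequence, [0, 0, 0, 0, 1])` on any protein sequence returns a list of 505 numbers, matching the pattern length stated above.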

 

Usage of standalone version

perl plpred -i <seq_file> -m <method> -t <threshold_value> -o <output_file>

Ø  Seq_file is a file containing protein sequences (single or multiple) in fasta format.

Ø  Method defines the 5 trained SVM modules for the prediction of plastids such as

a) Amino acid compositions (1);

b) Dipeptide compositions (2);

c) Split four parts compositions (3);

d) PSI-BLAST similarity based (4);

e) Hybrid module of a+b+c+d (5).

Ø  Threshold_value defines the selection of threshold value in the range of -1.5 to 1.5

Ø  Output_file defines the name of output file for storing results 

 

 

AntiBP

Application:

AntiBP is a server that predicts whether a peptide possesses antibacterial properties or not. The web server can be accessed at http://webs.iiitd.edu.in/raghava/antibp.

 

Introduction

Antibacterial peptides are important components of the innate immune system, used by the host to protect itself from different types of pathogenic bacteria. Over the last few decades, the search for new drugs and drug targets has prompted an interest in these antibacterial peptides. We analyzed 486 antibacterial peptides, obtained from antimicrobial peptide database APD, in order to understand the preference of amino acid residues at specific positions in these peptides. It was observed that certain types of residues are preferred over others in antibacterial peptides, particularly at the N and C terminus. These observations encouraged us to develop a method for predicting antibacterial peptides in proteins from their amino acid sequence.

 

Results

First, the N-terminal residues were used for predicting antibacterial peptides using Artificial Neural Network (ANN), Quantitative Matrices (QM) and Support Vector Machine (SVM), which resulted in an accuracy of 83.63%, 84.78% and 87.85%, respectively. Then, the C-terminal residues were used for developing prediction methods, which resulted in an accuracy of 77.34%, 82.03% and 85.16% using ANN, QM and SVM, respectively. Finally, ANN, QM and SVM models were developed using N and C terminal residues, which achieved an accuracy of 88.17%, 90.37% and 92.11%, respectively. All the models developed in this study were evaluated using five-fold cross validation technique. These models were also tested on an independent or blind dataset.

 

Usage of standalone version

antibp -i <seq_file> -t <terminus> -a <approach> -t <threshold> -o <output_file>

-i      inputFile (in Fasta format)

-t         Terminus to be used for prediction (N or C or NC).

-a        Approach used for prediction (SVM or ANN or QM).

-t      [optional]   Threshold for Prediction [default value is 0 for SVM approach, 0.6 for ANN and -0.2 for QM based approach]

-o     Output Result file

 

 

Reference

Sneh Lata, B K Sharma, G P S Raghava. Analysis and prediction of antibacterial peptides. BMC Bioinformatics 2007, 8:263.

 

 

PolyApred

 

Application

PolyApred is a support vector machine (SVM) based method for the prediction of polyadenylation signal (PAS) in human DNA sequence. The webserver is available at

http://webs.iiitd.edu.in/raghava/polyapred/

 

Introduction:

The polyadenylation signal plays a key role in determining the site for addition of the polyadenylated tail to nascent mRNA, and mutations in it are reported in many diseases. Identification of poly(A) sites is important for determining gene boundaries such as the last exon and the 3′ UTR, which plays a critical role in mRNA stability and localization. In the past, a number of methods have been developed for predicting poly(A) signals in a given nucleotide sequence by exploiting nucleotide features around the PAS.

In this method we used region-specific nucleotide frequencies around the PAS and achieved the highest accuracy.

 

Dataset

The investigations were performed on two different datasets: (a) a positive dataset containing 2327 sequences, each 206 nt long with the poly(A) signal at the centre (101 to 106 nt); (b) a negative dataset containing 2333 sequences, each 206 nt long, extracted from coding regions of genes that have AATAAA at the centre (101 to 106 nt).

 

Results

In this study, Support Vector Machine (SVM) models were developed for predicting poly(A) signals in a DNA sequence using 100 nucleotides each upstream and downstream of the signal. We introduced a novel split nucleotide frequency technique, and the models thus developed achieved maximum Matthews correlation coefficients (MCC) of 0.58, 0.69, 0.70 and 0.69 using mononucleotide, dinucleotide, trinucleotide and tetranucleotide frequencies, respectively. Finally, a hybrid model developed using a combination of dinucleotide, 2nd order dinucleotide and tetranucleotide frequencies achieved a maximum MCC of 0.72. Moreover, on independent datasets this model achieved a precision ranging from 75.8% to 95.7% with a sensitivity of 57%, which is better than any other known method.

 

 

Usage of standalone version

perl polyapred -i inputFile -t threshold -o Output_Result_File

 

-i          inputFile (in Fasta format)

-t          Threshold for SVM based [default = 0]

-o         Output Result file

 

PolyApred program: a query sequence (in Fasta format) passes through the following steps

 

1.      bin/fasta2sfasta: present in the bin directory of the package to make multi-fasta format sequence into simple-fasta (SFASTA) format

2.      bin/mot_polya: takes 100 nt upstream and 100 nt downstream of a putative PAS signal (six nt). Each 100 nt long sequence is divided into two equal regions (50 nt each).

3.      bin/freq_polya: calculates the nucleotide frequency of each region and makes the input for SVM.

4.      SVM classify with model svm_models/polyapred/model_polya
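Steps 2 and 3 above can be sketched in Python as follows. This is not the code of bin/mot_polya or bin/freq_polya; it is a minimal illustration that assumes the putative signal is the hexamer AATAAA, uses mononucleotide frequencies only, and skips hexamers that do not have a full 100 nt on both sides. All function names are hypothetical.

```python
PAS = "AATAAA"   # assumed hexamer; the package may accept other signal variants
FLANK = 100      # nt taken upstream and downstream of the signal
REGION = 50      # each flank is split into two regions of this size


def find_pas_windows(seq):
    """Return (0-based position, upstream flank, downstream flank) for each usable hexamer."""
    seq = seq.upper()
    windows = []
    start = seq.find(PAS)
    while start != -1:
        up_start = start - FLANK
        down_end = start + len(PAS) + FLANK
        if up_start >= 0 and down_end <= len(seq):
            windows.append((start, seq[up_start:start], seq[start + len(PAS):down_end]))
        start = seq.find(PAS, start + 1)
    return windows


def region_frequencies(flank):
    """Mononucleotide frequencies (A, C, G, T) for each 50 nt region of a flank."""
    features = []
    for i in range(0, len(flank), REGION):
        region = flank[i:i + REGION]
        features.extend(region.count(n) / len(region) for n in "ACGT")
    return features


def pas_features(seq):
    """One 16-value vector (4 regions x 4 nucleotides) per putative signal."""
    return [(pos, region_frequencies(up) + region_frequencies(down))
            for pos, up, down in find_pas_windows(seq)]
```

With this layout, a 206 nt sequence carrying the hexamer at positions 101 to 106 yields exactly one feature vector. The higher-order (di-, tri- and tetranucleotide) frequencies mentioned in the Results would extend `region_frequencies` in the same way.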

 

           

Publication

Ahmed F, Kumar M, Raghava GPS. (2009) Prediction of polyadenylation signals in human DNA sequences using nucleotide frequencies. In Silico Biology. 9:007.

 

 

ABCpred

 

Application

The aim of ABCpred server is to predict B cell epitope(s) in an antigen sequence, using artificial neural network. The webserver is available at

http://webs.iiitd.edu.in/raghava/abcpred/

 

Introduction

B-cell epitopes play a vital role in the development of peptide vaccines, in diagnosis of diseases, and also for allergy research. Experimental methods used for characterizing epitopes are time consuming and demand large resources. The availability of epitope prediction method(s) can rapidly aid experimenters in simplifying this problem. The standard feed-forward (FNN) and recurrent neural network (RNN) have been used in this study for predicting B-cell epitopes in an antigenic sequence.

 

Dataset

B-cell epitopes were obtained from the B cell epitope database (BCIPEP), which contains 2479 continuous epitopes, including 654 immunodominant and 1617 immunogenic epitopes. All identical epitopes and non-immunogenic peptides were removed, leaving 700 unique, experimentally proved continuous B cell epitopes. The dataset covers a wide range of pathogenic groups such as viruses, bacteria, protozoa and fungi. The final dataset consists of 700 B-cell epitopes and 700 non-epitopes or random peptides (equal length and same frequency, generated from SWISS-PROT).

Result

The server is able to predict epitopes with 65.93% accuracy using a recurrent neural network. Users can select a window length of 10, 12, 14, 16, 18 or 20 as the predicted epitope length. The results are presented in tabular form, giving the sequence name, pattern, prediction score and position.

 

Usage of standalone version

perl abcpred -i inputFile -t threshold -w 16 -o Output_Result_File

 

-i          inputFile (in Fasta format)

-t          Threshold [ 0.1 to 1,  default = 0.5]

-w       Window length [10, 12, 14, 16, 18, or 20]

-o         Output Result file

 

ABCpred program: a query sequence (in Fasta format) passes through the following steps

 

1.      bin/fasta2sfasta: present in the bin directory of the package to make multi-fasta format sequence into simple-fasta (SFASTA) format

2.      bin/seq2motif_simple: create motifs by sliding window of defined length

3.      bin/motif2_binsnns.pl: make binary input for ANN from the motif file

4.      ANN classify with model svm_models/abcpred/neural10- neural 20
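The sliding-window and binary-encoding steps above can be illustrated with a short Python sketch. It is not the code of bin/seq2motif_simple or bin/motif2_binsnns.pl; the function names and the 20-bit one-hot layout per residue are assumptions made for illustration.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering for the one-hot encoding


def sliding_motifs(seq, window=16):
    """Return (1-based start position, motif) for every full-length window."""
    seq = seq.upper()
    return [(i + 1, seq[i:i + window]) for i in range(len(seq) - window + 1)]


def binary_encode(motif):
    """Encode each residue as 20 bits with a single 1; unknown residues stay all zero."""
    vector = []
    for residue in motif:
        bits = [0] * len(AA)
        index = AA.find(residue)
        if index >= 0:
            bits[index] = 1
        vector.extend(bits)
    return vector
```

Under these assumptions, a window length of 16 gives a network input of 16 x 20 = 320 values per motif, and a sequence of length L produces L - 16 + 1 motifs.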

 

Publication:

Saha, S and Raghava G.P.S. (2006) Prediction of Continuous B-cell Epitopes in an Antigen Using Recurrent Neural Network. Proteins,65(1),40-48.

 

 

 

 

WebCdk

 

Application:

In drug development and QSAR model building, there is a need to calculate various descriptors of a molecule. These descriptors are used as a feature vector. Many free as well as paid software packages are generally used for descriptor calculation. The present program is a Java based standalone version of the WebCdk web server, which uses the CDK descriptor calculation tool, making it more user friendly and faster, with support for commonly used molecular file formats. The program can calculate geometrical, electrical, topological and constitutional descriptors for a given molecule and can handle many molecules at a time.

 

Running webCdk locally:

 

Program to calculate descriptors (topological, electrical, geometrical and constitutional) for files in smile, mol and sdf format.

 

README.txt                                  README file enclosed in gpsr/src/webcdk

test.smi, test.mol, test.sdf                A few test files

1) /gpsr/src/webcdk/runWebCdk               Main program to run

2) /gpsr/src/webcdk/packageWebCdk.class     Java program required by the main Perl program; CLASSPATH should be defined for this

3) /gpsr/src/webcdk/packageWebCdk.java      Source code for the 'packageWebCdk' program

4) /gpsr/src/webcdk/cdk-1.0.3.jar           CDK jar file on which WebCdk is based

 

 

Important Instructions for running this program

 

To run this program, the user needs to have Java and CDK installed on the system and the path for Java set accordingly.

The user can check the Java path by typing 'env' at the command prompt.

Java should appear in the 'PATH' field, e.g. /usr/java/jdk1.6.0_06/bin:/usr/java/jdk1.6.0_06/jre/bin

Set the path by adding it to the .bashrc or .bash_profile file, or, for the bash shell, type 'export PATH=$PATH:/usr/jdk/jdk1.5.0_06/bin/:/usr/jdk/jdk1.5.0_06/jre1.6.0_06/bin/:/usr/jdk/jdk1.5.0_06/jre/bin/' (JAVA Directory)

Add the webcdk directory to your path with 'export PATH=$PATH:/gpsr/src/webcdk/'

 

## Setting CLASSPATH for CDK

 

1)      The CDK should be in the 'CLASSPATH' field.

The user can check this with 'env', or add it to the CLASSPATH field of the .bashrc or .bash_profile file.

For the bash shell, type 'export CLASSPATH=$CLASSPATH:/home/user/gpsr/src/webcdk/cdk-1.0.3.jar' (CDK Directory)

 

2)      Add the webcdk installation directory to the CLASSPATH as mentioned above:

export CLASSPATH=$CLASSPATH:/home/user/gpsr/src/webcdk/

 

3)      e.g. for the bash shell, type 'export CLASSPATH=$CLASSPATH:/home/user/gpsr/src/webcdk/packageWebCdk.class'   (webcdk package Directory)

 

Without setting the path for Java and the CLASSPATH, the user will probably get an error like 'Exception in thread "main" java.lang.NoClassDefFoundError: packageWebCdk'

 

Usage:

 

perl runWebCdk -i inputFile -f fileFormat -d descriptor -o resultFile

 

-i                                  InputFile in smile, mol or sdf format

-f    [smile|mol|sdf]     Input file format; 'smile' for smile, 'mol' for mol and 'sdf' for sdf file format

-d   [a|t|e|g|c]               Descriptor to calculate; 'a' for all 178 descriptors, 't' for topological, 'e' for electrical, 'g' for geometrical and 'c' for constitutional descriptors

-o                                OutPut result file

 

Reference:

 

http://webs.iiitd.edu.in:8081/webcdk/

 

 

NADbinder

 

Application:

 

The program predicts the NAD (nicotinamide adenine dinucleotide) binding residues in a protein. NAD, along with other small molecular cofactors such as FAD and ATP, plays a very important role in the regulation of enzyme activity. We hope that the present tool will aid the advancement of ligand-protein interaction studies.

 

Introduction:

 

In the post-genomic era, understanding protein-ligand interaction is very promising because many proteins recruit small molecular ligands or co-factors such as ATP, NAD and FAD for their function. The first step in understanding protein-ligand interaction is to analyze the binding of these ligands to specific amino acid residues. In the present study we have developed two modules for the prediction of NAD binding residues. One approach uses binary features and the second uses evolutionary information; both are coupled with an SVM (support vector machine) to classify an amino acid residue as interacting or non-interacting. We obtained the initial dataset of NAD interacting proteins from the PDB (Protein Data Bank) using the LPC (Ligand Protein Contact) tool. This gave 1545 amino acid sequences reported to bind NAD, but with redundancy. After reducing the redundancy at a cutoff of 40%, we were left with just 195 sequences. With binary features we achieved an accuracy of 74%, while evolutionary information in the form of a PSSM (Position Specific Scoring Matrix) gave an accuracy of 85%.

 

Program description:

Usage:

perl nadbinder -i inputFile -m method -t threshold -o Output_Result_File

-i                      inputFile (in Fasta format)

-m [b|p]            [optional]      'b' for Binary method   'p' for PSSM based prediction [default=p]

-t                      [optional]   Threshold for SVM based Prediction [default =-0.2]

-o                     Output Result file

 

 

Supporting programs needed: (for each program detail description see the package documentation)

 

There are two approaches used in the program

  1. Binary approach -

Here a binary feature is used as input for the SVM, i.e. sequences are converted into a matrix of 1s and 0s with a vector size of 21 x window length, where each residue in turn sits at the center of the motif.

The steps are as follows-

  1. bin/fasta2sfasta -i seq_temp -o seq_temp.sfasta (sequence is converted to SFASTA format)
  2. bin/seq2motif -i seq_temp.sfasta -w 17 -x y -o temp.mot (SFASTA seq to motifs with X aa)
  3. bin/motif2bin -i temp.mot -x y -o temp.bin (Motif to Binary conversion, i.e. 0 and 1 format)
  4. bin/col2svm -i temp.bin -o temp.svm -s 0 (column format)
  5. svm classify with svm_models/NADbinder/model_binary_nadbinder model file
  6. bin/count_pred_binary_center.pl -p temp.score -s seq_temp.sfasta -o result -t threshold (for comparing the predicted score and threshold for each center residue).
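The motif and binary conversion steps of the binary approach can be sketched in Python as follows. This is an illustration rather than the code of bin/seq2motif or bin/motif2bin: it assumes that the 21st symbol is the padding residue 'X', that terminal residues are padded with 'X' so that every residue can sit at the center of a 17-residue window, and that unknown residues are also mapped to 'X'. The function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues plus the assumed padding symbol X


def residue_windows(seq, window=17):
    """One window per residue, padded with X so each residue is at the center."""
    half = window // 2
    seq = seq.upper()
    padded = "X" * half + seq + "X" * half
    return [padded[i:i + window] for i in range(len(seq))]


def encode_window(motif):
    """21 bits per position, giving a 21 x window length vector."""
    vector = []
    for residue in motif:
        bits = [0] * len(ALPHABET)
        index = ALPHABET.index(residue) if residue in ALPHABET else len(ALPHABET) - 1
        bits[index] = 1
        vector.extend(bits)
    return vector
```

For a window of 17 this gives 21 x 17 = 357 values per residue, and a sequence of length L yields L windows.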

 

  2. PSSM based approach -

Here the query sequence is converted into a PSSM matrix, which is parsed, normalized and converted into patterns according to the window size, so that the matrix values of each center residue are flanked by the values of its adjacent residues, giving a vector size of 20 x window length for SVM input.

 

  1. bin/seq2pssm_imp -i seq_temp -o temp.pssm -d swissprot (query sequence is converted into PSSM matrix by using blastpgp and makemat programs)
  2. bin/pssm2pat -i temp.pssm -w 17 -o temp.pat (PSSM to patterns of 17 window size)
  3. bin/col2svm -i temp.pat -o temp.svm -s 0 (column to SVM readable format, 20x17 vector)
  4. svm classify with svm_models/NADbinder/model_pssm_nadbinder model file
  5. bin/count_pred_binary_center.pl -p temp.score -s seq_temp.sfasta -o result -t threshold (for comparing the predicted score and threshold for each center residue).
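The pattern-generation step (PSSM rows to 20 x 17 vectors) can be sketched in Python as follows. This does not reproduce bin/pssm2pat: the logistic normalisation and the zero padding at the sequence termini are assumptions, and the PSSM is taken as an already parsed list of 20-value rows, one per residue.

```python
import math


def normalise(row):
    """Squash raw PSSM scores into (0, 1); the logistic function is an assumption."""
    return [1.0 / (1.0 + math.exp(-value)) for value in row]


def pssm_patterns(pssm, window=17):
    """Return one flattened 20 x window vector per residue, centered on that residue."""
    half = window // 2
    padding = [[0.0] * 20 for _ in range(half)]  # assumed zero rows beyond the termini
    rows = padding + [normalise(row) for row in pssm] + padding
    patterns = []
    for i in range(len(pssm)):
        flat = []
        for row in rows[i:i + window]:
            flat.extend(row)
        patterns.append(flat)
    return patterns
```

Each returned pattern has 20 x 17 = 340 values, matching the vector size quoted in step 3.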

 

Sample input:

 

>1AF3_B|PDBID_CHAIN_SEQUENCE

MSQSNRELVVDFLSYKLSQKGYSWSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLADSPAVNGATGHSS

SLDAREVIPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPGTAYQSFEQVVNELFRDGVNWGRIVAFFSF

GGALCVESVDKEMQVLVSRIASWMATYLNDHLEPWIQENGGWDTFVDLYG

 

Sample output:

 

lowercase: non-interacting residues ;  UPPERCASE followed by '*' : INTERACTING RESIDUES

>1AF3_B_PDBID_CHAIN_SEQUENC     Length = 196

msqS*nR*E*lV*vdF*lsyklS*qkG*Y*sW*S*qfS*dveenrteaP*eetepereT*psainG*npswhL*adsP*aV*ngatG*hsssldarevi

pmaavK*qalR*eaG*D*eF*elryrrafsD*L*tsqlhitpG*taY*Q*sF*eqV*vnE*lF*R*D*G*V*nwG*rI*V*aF*F*sF*gG*A*l

cV*esvdK*emqvL*vsR*iaswM*A*T*Y*lnD*hlE*pwiqE*N*G*gW*D*tF*V*dL*yg

 

Detail Residue wise View

------------------------------------------------

Pos     Residue  Score          Prediction

------------------------------------------------

1       m       -0.58103577     non-interacting

2       s       -0.47046973     non-interacting

3       q       -0.46166293     non-interacting

4       S*      0.29743143      INTERACTING

5       n       -0.29965193     non-interacting

6       R*      -0.18128689     INTERACTING

7       E*      0.26258585      INTERACTING

8       l       -0.41980352     non-interacting

9       V*      -0.18798209     INTERACTING

10      v       -0.30472188     non-interacting

11      d       -0.74795141     non-interacting

12      F*      0.15331554      INTERACTING

13      l       -1.3255221      non-interacting

14      s       -0.43654671     non-interacting

15      y       -0.30118048     non-interacting

16      k       -1.0193331      non-interacting

17      l       -1.0320422      non-interacting

18      S*      0.3288692       INTERACTING

19      q       -0.70314709     non-interacting

20      k       -0.60352651     non-interacting

21      G*      0.46728328      INTERACTING

22      Y*      0.76776858      INTERACTING
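The relation between the residue-wise scores and the annotated sequence line in the sample output can be expressed as a small Python sketch. It assumes that a residue is reported as interacting when its score is greater than or equal to the threshold (default -0.2, as listed in the usage); the function name is hypothetical and this is not the code of bin/count_pred_binary_center.pl.

```python
def annotate(seq, scores, threshold=-0.2):
    """Lowercase for non-interacting residues, UPPERCASE plus '*' for interacting ones."""
    if len(seq) != len(scores):
        raise ValueError("one score per residue is required")
    annotated = []
    for residue, score in zip(seq, scores):
        if score >= threshold:
            annotated.append(residue.upper() + "*")
        else:
            annotated.append(residue.lower())
    return "".join(annotated)
```

Applied to the first 22 residues and scores of the table above, `annotate` returns `msqS*nR*E*lV*vdF*lsyklS*qkG*Y*`, which is the start of the annotated sequence shown in the sample output.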

Reference:

webs.iiitd.edu.in/raghava/NADbinder

 

 

 

MITPRED

 

Application:

 

The program classifies any query protein as mitochondrial or non-mitochondrial. The stand-alone version is very useful for running the whole proteome of an organism for annotation purposes.

 

Introduction:

MitPred is the stand-alone version of a web server specifically trained to predict proteins that are destined to localize in mitochondria, particularly in yeast and animals. The prediction is made on the basis of either the occurrence of Pfam domain(s), homology to experimentally annotated proteins, or ab initio prediction from amino acid composition. The domain search is done by HMMER (hidden Markov model based search), while the homology search is done by BLAST. Since both of these methods rely on the presence of experimentally annotated examples, whose absence can be limiting, SVM based prediction is also provided.

 

Programs needed to run Mitpred locally:

 

Main program, mitpred, present in bin folder of the package

 

Usage: perl mitpred -i inputFile -m model -t threshold -o Output_Result_File

 

-i          inputFile (in Fasta format)

-m        Model ('svm' for SVM based or 'blast' for BLAST based prediction or 'pfam' for Pfam based)

-t          [optional]   Threshold for SVM based [default =0.5]

            E-value selected for Blast based model [default=1e-4]

-o         Output Result file

 

Mitpred program is for the prediction of mitochondrial proteins. It offers 3 models/methods-

 

  1. SVM based
  2. Blast search + SVM based
  3. Pfam search + SVM based

 

1)      SVM Based : in this model query sequence (in Fasta format) leads to the following path-

Split amino acid composition is taken as the feature vector: the query sequence is divided into three parts (split), and the composition of each part (n, rest and c) is calculated and given to the SVM.

(For each program details see the package documentation)

 

  1. bin/fasta2sfasta : present in the bin directory of the package; converts multi-FASTA sequences into simple-FASTA (SFASTA) format
  2. bin/pro2aac_nt : calculates the composition of the N-terminal 25 amino acids
  3. bin/pro2aac_rest : calculates the composition excluding the N-terminal 25 and C-terminal 25 amino acids
  4. bin/pro2aac_ct : calculates the composition of the C-terminal 25 amino acids
  5. bin/add_cols : adds all 3 composition files (output of steps 2, 3 and 4) in 2 steps, creating a vector of 3*20, i.e. 60, values
  6. bin/col2svm : converts this vector to SVM-readable format
  7. SVM classify with model svm_models/mitpred/model_file
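The split amino acid composition used as the SVM feature can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of the bin/pro2aac_* programs: the function names are hypothetical, composition is assumed to be a percentage over the 20 standard amino acids, and the behaviour for sequences shorter than 50 residues (where the terminal windows overlap) is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def composition(segment):
    """Percent composition of a segment over the 20 standard amino acids."""
    segment = segment.upper()
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * segment.count(aa) / total for aa in AMINO_ACIDS]


def split_composition(sequence, n_term=25, c_term=25):
    """Concatenate N-terminal, middle ("rest") and C-terminal compositions."""
    n_part = sequence[:n_term]
    c_part = sequence[-c_term:] if c_term > 0 else ""
    rest = sequence[n_term:len(sequence) - c_term]
    # 3 parts x 20 amino acids = 60 values (assumed layout: N, rest, C)
    return composition(n_part) + composition(rest) + composition(c_part)


if __name__ == "__main__":
    query = "M" * 25 + "A" * 30 + "K" * 25  # toy sequence, not a real protein
    print(len(split_composition(query)))
```

The resulting 60 values correspond to the vector that is then converted to SVM-readable format and passed to the classifier.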

 

2)      BLAST Search + SVM:

 

This is a hybrid approach. The query sequence is first searched with BLAST against the database of mitochondrial and non-mitochondrial proteins, 'mitp_dbase' (present in the blastdb/mitpred/ folder), and the result is parsed and checked for the top hits.

If BLAST returns a positive or negative hit (the database sequences are already tagged as positive or negative), the query is directly assigned as mitochondrial or non-mitochondrial respectively. If BLAST returns no hit, the query is subjected to ab initio SVM prediction using the split amino acid composition feature described above.

 

Programs needed:  (for each program details see the package documentation)

 

  1. bin/fasta2sfasta : present in the bin directory of the package; converts multi-FASTA sequences into simple-FASTA (SFASTA) format
  2. Convert each SFASTA sequence to a FASTA file, as BLAST takes one sequence at a time, in FASTA format.
  3. /blastall -p blastp -i inputSequence -e threshold -d /mitpred/mitp_dbase -o blast.out : runs BLAST
  4. perl bin/take_blast_hit blast.out > take_blast_temp : parses the BLAST output
  5. If the hit is positive or negative, declare the query as mitochondrial or non-mitochondrial respectively
  6. If there is no hit, run the SVM as above
  7. Compare the predicted score with the selected threshold; if the score is greater than or equal to the threshold the prediction is positive, otherwise negative
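The decision logic of steps 5 to 7 can be sketched in Python as below. This is a minimal illustration, not the mitpred source: the function name, the label strings and the idea of passing the SVM as a callable (so that it only runs when BLAST gives no hit) are assumptions made for the example.

```python
def hybrid_predict(blast_label, svm_scorer, threshold=0.5):
    """Combine a BLAST hit label with an SVM fallback.

    blast_label: "positive", "negative" or None when BLAST found no hit.
    svm_scorer:  zero-argument callable returning the SVM score; it is
                 only called when there is no BLAST hit.
    """
    if blast_label == "positive":
        return "Mitochondrial"
    if blast_label == "negative":
        return "Non-mitochondrial"
    score = svm_scorer()
    # Score greater than or equal to the threshold is positive (step 7)
    return "Mitochondrial" if score >= threshold else "Non-mitochondrial"


if __name__ == "__main__":
    print(hybrid_predict("negative", lambda: 0.9))
    print(hybrid_predict(None, lambda: 0.7))
```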

 

3)      Pfam Search + SVM:

 

In this module the query is first searched with the hmmpfam program against a profile database (mitpred/mitpred_v2.hmm) of mitochondrial and non-mitochondrial domains created by hmmbuild. The result is then parsed, and if a mitochondrial or non-mitochondrial domain is hit, the query is declared mitochondrial or non-mitochondrial respectively.

If no domain, or no MitPred-curated domain, is found, the SVM prediction is used as described above.

 

Programs needed: (for each program details see the package documentation)

 

  1. /hmmpfam -E 1e-5 /mitpred/mitpred_v2.hmm inputFastaSeq >temp_pfam : Pfam search
  2. perl bin/parse_result_hmm . > temp_hmm_parse_result : Parsing output
  3. perl bin/domain_mitpred . > temp_domain : Domain assignment with /mitpred/domain.dat file

The domain assignment program looks up the hit domain in the file /mitpred/domain.dat, in which each domain is already classified as mitochondrial, non-mitochondrial or shared.

     If a shared domain or no domain is found, the program runs the SVM as above.
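The domain assignment step can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the logic of bin/domain_mitpred: the format of domain.dat is not described here, so the table is modelled as a plain dictionary, the domain names are made up, and treating conflicting hits (both classes present) as undecided is an assumption.

```python
def assign_by_domains(hit_domains, domain_classes):
    """Return a class from domain hits, or None if the SVM should decide.

    hit_domains:    domain names reported by the Pfam search.
    domain_classes: mapping of domain name to "mitochondrial",
                    "non-mitochondrial" or "shared" (hypothetical layout).
    """
    classes = {domain_classes[d] for d in hit_domains if d in domain_classes}
    if classes == {"mitochondrial"}:
        return "mitochondrial"
    if classes == {"non-mitochondrial"}:
        return "non-mitochondrial"
    # Shared, conflicting, unknown or no domains: fall back to the SVM
    return None


if __name__ == "__main__":
    table = {"DOM_A": "mitochondrial", "DOM_B": "non-mitochondrial", "DOM_C": "shared"}
    print(assign_by_domains(["DOM_A"], table))
    print(assign_by_domains(["DOM_C"], table))
```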

 

Reference:

Webserver: webs.iiitd.edu.in/raghava/mitpred

Kumar M, Verma R, Raghava GPS. (2005) Prediction of mitochondrial proteins using support vector machine and hidden Markov model. J Biol Chem. 281:5357-63.

 

 

 

 

NpPred

 

Application:

 

NpPred predicts nuclear and non-nuclear proteins. It aids in the annotation of uncharacterized proteins.

 

Introduction:

NpPred is a method for predicting nuclear proteins. It was developed on a non-redundant dataset consisting of 2710 nuclear and 7662 non-nuclear proteins. During development, a number of SVM-based methods were built using various types of composition (amino acid, dipeptide, split), achieving a maximum accuracy of 85.47% under five-fold cross-validation. With a hybrid approach (Pfam domain and SVM) the accuracy increased to 94.61%. The method performed better than existing methods when evaluated on independent datasets obtained from BaCelLo (Pierleoni et al., 2006) and NucPred (Brameier et al., 2007). Two approaches are provided for prediction: (a) an SVM module developed using the amino acid composition of the N-terminal 25 residues and of the remaining residues, and (b) a hybrid approach combining the SVM module and HMM profiles. We hope this method will be useful for researchers working in the field of genome annotation.

 

Programs needed to run NpPred locally:

 

Main program, nppred, present in bin folder of the package

 

Usage: perl bin/nppred -i inputFile -m model -t threshold -o Output_Result_File

 

-i      inputFile (in Fasta format)

-m    [optional]   Model ['svm' for SVM based or 'pfam' for Pfam based]

-t      [optional]   Threshold for SVM based [default =0.5]

-o     Output Result file

 

NpPred program is for the prediction of nuclear proteins. It offers 2 models/methods-

 

  1. SVM based
  2. Pfam search + SVM based

 

1)      SVM based: in this model the query sequence (in FASTA format) is processed as follows.

Split amino acid composition is taken as the feature vector: the query sequence is divided into two parts (the N-terminal 25 residues and the rest), and the composition of each part is calculated and given to the SVM.

(For each program details see the package documentation)

  1. bin/fasta2sfasta : present in the bin directory of the package; converts multi-FASTA sequences into simple-FASTA (SFASTA) format
  2. bin/pro2aac_nt : calculates the composition of the N-terminal 25 amino acids
  3. bin/pro2aac_rest : calculates the composition excluding the N-terminal 25 amino acids
  4. bin/add_cols : adds the 2 composition files (output of steps 2 and 3), creating a vector of 2*20, i.e. 40, values
  5. bin/col2svm : converts this vector to SVM-readable format
  6. SVM classify with model svm_models/nppred/model_file
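The conversion to SVM-readable format can be illustrated with a short Python sketch. SVMlight reads one example per line, written as a target value followed by index:value pairs with feature indices starting at 1. The function below is a hypothetical stand-in for bin/col2svm, not its actual code, and using 0 as the target value for unlabelled query sequences is an assumption.

```python
def to_svmlight_line(values, label=0):
    """Format a feature vector as one SVMlight input line.

    values: feature values in column order (e.g. the 40 composition values).
    label:  target value written first; 0 is used here as a placeholder
            for a query whose class is unknown (assumption).
    """
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1))
    return f"{label} {pairs}"


if __name__ == "__main__":
    print(to_svmlight_line([0.5, 0.0, 12.0]))
```

Each query sequence yields one such line, and the file of lines is what the SVM classify step consumes.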

           

2)      Pfam Search + SVM:

 

In this module the query is first searched with the hmmpfam program against a profile database (nppred/nppred.hmm) of nuclear and non-nuclear domains created by hmmbuild. The result is then parsed, and if a nuclear or non-nuclear domain is hit, the query is declared nuclear or non-nuclear respectively.

If no domain, or no NpPred-curated domain, is found, the SVM prediction is used as described above.

 

Programs needed: (for each program details see the package documentation)

 

  1. /hmmpfam -E 1e-5 /nppred/nppred.hmm inputFastaSeq >temp_pfam : Pfam search
  2. perl bin/parse_result_hmm . > temp_hmm_parse_result : Parsing output
  3. perl bin/domain_nppred . > temp_domain : Domain assignment with /nppred/domain.dat file

The domain assignment program looks up the hit domain in the file /nppred/domain.dat, in which each domain is already classified as nuclear, non-nuclear or shared.

     If a shared domain or no domain is found, the program runs the SVM as above.

 

Reference:

http://webs.iiitd.edu.in/raghava/nppred/

 

Kumar, M. and Raghava, G.P.S. Prediction of Nuclear Proteins using SVM and HMM Models. BMC Bioinformatics. 2009 Jan 19; 10(1):22.

 

 

Pprint

 

Application:

 

Many proteins and other factors are involved in gene regulation. These factors are an important area of research in modern biology because of their direct relevance to many diseases. This program addresses a question biologists commonly ask: which residues of an RNA-interacting protein interact with RNA?

 

Introduction:

Pprint (Prediction of Protein-RNA Interaction) is the stand-alone version of a web server for predicting the RNA-binding residues of a protein. The prediction is done by an SVM model trained on PSSM profiles generated by a PSI-BLAST search of the 'swissprot' protein database. The SVM model was trained and tested on a set of 86 non-homologous protein chains with 5-fold cross-validation. It predicted RNA-interacting amino acids with an accuracy of 75.53% and an MCC of 0.44 during training and testing.
It takes an amino acid sequence in FASTA format as input and predicts the RNA-interacting residues. In the output, residues of the query sequence predicted as RNA-interacting are printed in lowercase and non-interacting residues in uppercase, as in the sample output. Below the amino acid sequence, the residue-wise prediction is also given in tabular format. This table contains three columns: (i) amino acid residue, (ii) SVM score and (iii) prediction. The prediction depends on the threshold value specified by the user. The default threshold is -0.2. To get predictions with fewer false positives, the user should choose a higher threshold; for fewer false negatives, a lower threshold.

 

Usage:

 

perl bin/pprint -i inputFile -t threshold -o Output_Result_File

-i      inputFile (in Fasta format)

-t      [optional]   Threshold for SVM based Prediction [default = -0.2]

-o     Output Result file

 

Programs Needed:

 

To run pprint the user needs the following programs (for details of each program see the program manual):

            pprint:  Main running Program present in the bin directory of the package

           

  1. bin/fasta2sfasta  : present in the bin directory of the package to make multi-fasta format sequence into simple-fasta (SFASTA) format ( for details see the package documentation)
  2. bin/seq2pssm_imp_pprint : program to generate PSSM matrix from sequences
  3. bin/pssm2pat_pprint : program to generate defined length patterns from pssm matrix
  4. bin/col2svm : for converting pssm column values to svm readable format
  5. model_pprint : present in /svm_models/pprint/ folder , to run SVM classify and get predicted scores
  6. Compare the predicted score with threshold selected for each motif and predict Interacting or Non-interacting

 

Sample output

 

pprint:: Prediction of RNA-interacting residues Result ##  No of sequences = 2 ##  Threshold = -0.2

 

Lowercase: Interacting residues        Uppercase: Non-interacting residues

 

>1AF3_B_PDBID_CHAIN_SEQUENC     Length = 196 amino acids

msqSNRELVVDFlSykLSqKgySwSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLADSPAVNGATGHSSSLDAREV

IPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPGTAyQSFEQVVNELFrdgvNwGrIVAFFSfGGALCVESVDKEM

QVLVSRIASWMATYLNDHLEPwIQengGWDTFVDLYG

 

Residue wise detail prediction

 

Amino Acid      SVM Score       Prediction

 

M                     0.054416444     Interacting

S                      -0.024923589    Interacting

Q                     -0.16793406     Interacting

S                      -0.66043539     Non-Interacting

N                     -0.55410943     Non-Interacting

R                     -0.29945995     Non-Interacting

E                      -1.0637555      Non-Interacting

L                      -0.40939491     Non-Interacting

V                     -0.72646616     Non-Interacting

V                     -1.5926368      Non-Interacting

D                     -0.90241588     Non-Interacting

F                      -0.61350159     Non-Interacting

L                      0.72456004      Interacting

S                      -0.39280286     Non-Interacting

Y                     -0.084172651    Interacting

K                     0.91123144      Interacting

L                      -0.74135724     Non-Interacting

. . . . . . . .. . . . . . . . . . . . .. . . . .  . ..  . . . .. . . . .  . .. . . . .. . . .. .
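How the residue-wise scores map onto the cased sequence line can be sketched in Python. The scores below are the first 17 rows of the sample table; the function name is hypothetical, and treating a score equal to the threshold as interacting is an assumption, since the sample does not contain such a case.

```python
# First 17 residues and SVM scores copied from the sample output table
SAMPLE_RESIDUES = "MSQSNRELVVDFLSYKL"
SAMPLE_SCORES = [
    0.054416444, -0.024923589, -0.16793406, -0.66043539, -0.55410943,
    -0.29945995, -1.0637555, -0.40939491, -0.72646616, -1.5926368,
    -0.90241588, -0.61350159, 0.72456004, -0.39280286, -0.084172651,
    0.91123144, -0.74135724,
]


def render_prediction(sequence, scores, threshold=-0.2):
    """Lowercase residues predicted as interacting, uppercase the others."""
    rendered = []
    for residue, score in zip(sequence, scores):
        interacting = score >= threshold  # tie handling is an assumption
        rendered.append(residue.lower() if interacting else residue.upper())
    return "".join(rendered)


if __name__ == "__main__":
    print(render_prediction(SAMPLE_RESIDUES, SAMPLE_SCORES))
```

With the default threshold of -0.2 this reproduces the start of the sequence line in the sample output; raising the threshold turns more residues to uppercase (non-interacting).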

 

Reference:

webs.iiitd.edu.in/raghava/pprint

Kumar, M., Gromiha, M.M. and Raghava, G.P.S. Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins: Structure, Function and Bioinformatics. 2008 Apr; 71(1):189-94.

 

 

 

SPpred

 

Application:

Protein solubility is an important issue in over-expression studies in Escherichia coli, because heterologous proteins may not be soluble enough to show activity or may form protein aggregates. The computational tool SPpred has therefore been developed to predict the solubility of a protein before experiments are carried out.

 

Introduction:

SPpred (Protein Solubility prediction) is a program for predicting the solubility of a protein on over-expression in Escherichia coli. The prediction is done by an SVM model based on split amino acid composition. The SVM model was trained and tested on a set of 192 proteins with 5-fold cross-validation. The prediction accuracy and MCC are ~75% and 0.504 respectively during training and testing.
It takes an amino acid sequence in FASTA format as input and predicts whether the given protein is soluble or forms inclusion bodies on over-expression. The prediction depends on the threshold value specified by the user. The default threshold is -0.1. To get predictions with fewer false positives, the user should choose a higher threshold; for fewer false negatives, a lower threshold.

 

Method:

1) First the query sequence is split into 4 parts.

2) The amino acid composition of each part is calculated.

3) The 4 feature vectors are joined into a single vector of size 80.

4) The vector is converted to SVM-readable format with the col2svm program and given to the SVM for classification.

5) The prediction is made from the prediction score and the threshold selected by the user.
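Steps 1 to 3 can be sketched in Python as follows. How SPpred actually divides the sequence is not stated here, so splitting into four contiguous parts of near-equal length is an assumption, as are the function names and the use of percent composition.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def composition(segment):
    """Percent composition of a segment over the 20 standard amino acids."""
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * segment.upper().count(aa) / total for aa in AMINO_ACIDS]


def split_into_parts(sequence, parts=4):
    """Cut a sequence into contiguous parts of near-equal length (assumed)."""
    base, extra = divmod(len(sequence), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(sequence[start:end])
        start = end
    return chunks


def four_part_composition(sequence):
    """Join the compositions of the 4 parts: 4 x 20 = 80 values."""
    features = []
    for chunk in split_into_parts(sequence, 4):
        features.extend(composition(chunk))
    return features


if __name__ == "__main__":
    print(split_into_parts("ABCDEFGHIJ"))
    print(len(four_part_composition("ACDEFGHIKLMNPQRSTVWY" * 2)))
```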

 

Usage:

perl bin/sppred -i <i/p file> -t <threshold> -o <o/p file>

-i      Input Sequence in FASTA format

-t      SVM threshold [default = -0.1]

-o      Output Result File

Reference

http://webs.iiitd.edu.in/raghava/sppred/

 

 

ISSPred

 

Application:

The ISSPred program identifies inteins (protein splicing) and their N- and C-terminal splice sites. The program has 3 different modules: classification of intein-containing and non-intein proteins, prediction of intein domains, and prediction of their splice sites.

 

Introduction:

Protein post-translational modification (PTM) is a common phenomenon in biology that regulates the function of proteins. Protein splicing is a unique PTM in that it leads to cleavage of the protein into internal (intein domain) and flanking (extein domain) fragments. The extein sequences later ligate together to form a fully functional, active protein. Identification of inteins and their splice sites aids in the annotation of uncharacterized proteins. In this study, attempts have been made to predict intein proteins, domains and their splice sites. To predict intein proteins, we analyzed the amino acid composition of intein proteins/domains and observed a preference for certain types of residues. Support Vector Machine (SVM) models were developed for predicting intein proteins using amino acid and dipeptide composition, achieving maximum MCC of 0.63 and 0.77 respectively. Second, SVM models were developed for predicting intein domains in proteins using amino acid and dipeptide composition, achieving maximum MCC of 0.76 and 0.87 respectively. Finally, SVM models were developed for predicting splice sites using different window lengths, achieving maximum MCC of 0.87 and 0.93 for N-splice and C-splice sites respectively. This study is the first attempt to predict intein proteins, domains and their splice sites. Based on the above models a prediction server, ISSPred, has been developed, which is available at http://webs.iiitd.edu.in/raghava/isspred/.

 

Usage:

perl isspred -i inPutFile -p prediction -m model -t threshold -o outPutFile

 

-i          InputFile (multi-fasta format)

-p         Prediction [d|p|s]

            ['d' for intein domain; 'p' for intein protein and 's' for intein N-C splice site prediction]

-m        [optional]         Model to use [a|d|n|c|nc] [default is 'd' if -p is 'd' or 'p', and 'nc' if -p is 's']

            ['a' = amino acid composition, 'd' = dipeptide composition if -p is 'd' or 'p']

            ['n' = N splice, 'c' = C splice and 'nc' = both N and C splice site prediction if -p is 's']

-t          [optional]         Threshold, e.g. -1.0 to +1.0 [default is -0.4 if -p is 'd'; -0.6 if -p is 'p' and -0.9 if -p is 's']

-o         Output result File
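The way the defaults of -m and -t depend on -p can be summarised in a small Python sketch. It only restates the option table above; the function and dictionary names are hypothetical, and rejecting an invalid combination with an error is an assumption about how such input would be handled.

```python
DEFAULT_MODEL = {"d": "d", "p": "d", "s": "nc"}
DEFAULT_THRESHOLD = {"d": -0.4, "p": -0.6, "s": -0.9}
VALID_MODELS = {"d": {"a", "d"}, "p": {"a", "d"}, "s": {"n", "c", "nc"}}


def resolve_options(prediction, model=None, threshold=None):
    """Fill in the -m and -t defaults that depend on the -p value."""
    if prediction not in DEFAULT_MODEL:
        raise ValueError("prediction must be one of d, p or s")
    if model is None:
        model = DEFAULT_MODEL[prediction]
    if model not in VALID_MODELS[prediction]:
        raise ValueError(f"model {model!r} is not valid with -p {prediction}")
    if threshold is None:
        threshold = DEFAULT_THRESHOLD[prediction]
    return model, threshold


if __name__ == "__main__":
    print(resolve_options("s"))
    print(resolve_options("p", model="a"))
```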

 

Program description: (for detailed description see the package documentation)

ISSPred differs from most of the other programs in that it predicts a splice site (i.e. a position between two amino acid residues), unlike the prediction for the central residue of a window used in Pprint, NADbinder etc.

 

ISSPred has 3 modules-

Amino acid composition:

This feature has been used for both Intein domain and Intein Protein prediction.

1) bin/pro2aac -i seq.sfasta -o seq.comp  (sequence to amino acid composition)

2) bin/col2svm -i seq.comp -o seq.svm -s 0 (column to SVM Readable format)

3) svm_classify seq.svm svm_models/isspred/model_dipep_prot seq.pred > svmlog.out

4) bin/count_pred_isspred -s seq.sfasta -p seq.pred -o outPutFile -t threshold

 

Dipeptide composition:

 

Feature has been used for both Intein domain and Intein Protein prediction.

 

1) bin/pro2dpc -i seq.sfasta -o seq.comp (sequence to dipeptide composition)

2) bin/col2svm -i seq.comp -o seq.svm -s 0 (column to SVM-readable format)

3) svm_classify seq.svm svm_models/isspred/model_dipep_prot seq.pred > svmlog.out

4) bin/count_pred_isspred -s seq.sfasta -p seq.pred -o outPutFile -t threshold

 

Splice (N or C) binary patterns:

            Feature has been used in N and C terminal Splice site prediction.

1) bin/seq2motif -i seq.sfasta -w 16 -o seq.motif (sequence to motif)

2) bin/motif2bin -i seq.motif -x n -o seq.bin (motif to binary)

3) bin/col2svm -i seq.bin -o seq.svm -s 0 (column to SVM readable format)

4) svm_classify seq.svm /svm_models/isspred/model_nsplice seq.pred >svmlog.out

5) /bin/count_pred_binary -p seq.pred -s seq.sfasta -m seq.motif -o outPutFile -t thres -a N-Splice
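The idea behind steps 1 and 2 (windows around each candidate splice site, then a binary pattern) can be sketched in Python. This is an illustration under several assumptions, not the code of bin/seq2motif or bin/motif2bin: a window of 16 is taken as 8 residues on each side of the junction, sequence ends are padded with a dummy residue 'X', and each residue is encoded as 21 bits (20 amino acids plus 'X').

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues plus dummy 'X' (assumed)


def junction_windows(sequence, width=16, pad="X"):
    """One window per junction between residue j-1 and residue j."""
    half = width // 2
    padded = pad * half + sequence.upper() + pad * half
    # padded[j:j + width] holds the `half` residues before junction j
    # followed by the `half` residues after it
    return [padded[j:j + width] for j in range(1, len(sequence))]


def binary_pattern(window):
    """One-hot encode a window: 21 bits per position."""
    bits = []
    for residue in window:
        residue = residue if residue in ALPHABET else "X"
        bits.extend(1 if residue == aa else 0 for aa in ALPHABET)
    return bits


if __name__ == "__main__":
    windows = junction_windows("ACDEFGHIKLMNPQRSTVWY")
    print(windows[0])
    print(len(binary_pattern(windows[0])))
```

Because the window is centred on a junction rather than on a residue, an even width gives the same number of residues on both sides of the candidate site.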

 

Reference

http://webs.iiitd.edu.in/raghava/isspred/

 

 

GSTPred

 

Application:

GSTPred is the stand-alone version of the GSTPred web server for predicting Glutathione S-transferase (GST) proteins. GSTPred is trained on a generalized dataset of GST proteins and can be used to predict GST proteins. The web server is available at http://webs.iiitd.edu.in/raghava/gstpred/

 

Introduction:

Glutathione S-transferases (GSTs) are a group of ubiquitous and multifunctional enzymes found in both prokaryotes and eukaryotes. An important function of GSTs is to make a cell drug resistant, through avoidance of apoptotic cell death, altered expression of multi-drug resistance-associated proteins, altered drug metabolism or uptake, and/or over-expression of GSTs. GSTs are involved in drug resistance by (i) participating in the detoxification process with GSH, (ii) increasing the pumping of drug molecules out of the cell, or (iii) inhibiting MAP kinase pathways. Over-expression of specific GSTs in mammalian cells causes resistance to anti-cancer drugs (alkylating agents used in cancer chemotherapy). We developed, for the first time, an SVM-based model for predicting GST proteins.

 

Datasets:

All sequences used in this study were downloaded from the Swissprot database. All proteins were manually examined to retain only sequences with high-quality annotation: all sequences labelled as 'fragment' or annotated as putative or by similarity were removed. This gave a total of 137 proteins experimentally annotated as 'GST protein'. Sequence redundancy was then removed using CD-HIT so that no two proteins have more than 90% sequence identity. The final dataset contains 107 GST protein sequences. The negative dataset was compiled by randomly selecting 107 proteins that were experimentally annotated as non-GST proteins and that did not have more than 90% sequence identity. Because the aim was a broad-spectrum method for GST prediction, both prokaryotic and eukaryotic (plant, fungi, animal) proteins were used.

 

Results: A dataset of GST and non-GST proteins was used for training, and the performance of the method was evaluated with five-fold cross-validation. First, SVM-based methods were developed using amino acid and dipeptide composition, achieving maximum accuracies of 91.59% and 95.79% respectively. In addition, an SVM-based method using tripeptide composition achieved a maximum accuracy of 97.66%, which is better than the accuracy achieved by HMM-based searching (96.26%). Based on this study the web server GSTPred has been developed.

 

Usage of standalone version

 

perl gstpred -i <inputFile> -t <threshold > -m <mode> -o <output_Result_File>

 

-i          inputFile (in Fasta format)

-m        mode (monopeptide, dipeptide or tripeptide composition based)

-t          Threshold for SVM based [default = 0]

-o         Output Result file

GSTPred program: the query sequence (in FASTA format) is processed by the following steps

gpsr_home/bin/fasta2sfasta -i $input_file -o gst.sfa

gpsr_home/bin/pro2dpc -i gst.sfa -o gst00.aac

gpsr_home/bin/col2svm -i gst00.aac -o gst01.svm -s 0

gpsr_svm_classify  gst01.svm gpsr_home/src/svm_models/gstpredmodel gst01.pr
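The -m option selects monopeptide, dipeptide or tripeptide composition; the pipeline above shows the dipeptide case (pro2dpc). A generic k-peptide composition can be sketched in Python as below. This is an illustrative sketch, not the code of the package programs: the function name is hypothetical, and expressing composition as a percentage of all overlapping k-peptides, ignoring k-peptides with non-standard residues, is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def kpeptide_composition(sequence, k):
    """Percent composition over all 20**k possible k-peptides.

    k = 1 gives 20 values, k = 2 gives 400 and k = 3 gives 8000.
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-peptides with non-standard residues
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(keys)
    return [100.0 * counts[key] / total for key in keys]


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, len(kpeptide_composition("MKTAYIAKQRQISFVKSHFSRQ", k)))
```

The vector length grows as 20 to the power k, which is why the tripeptide model works with 8000 features per protein.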

 

Publication: Nitish Mishra, Manish Kumar, Dr. G. P. S. Raghava Support Vector Machine based Prediction of Glutathione S-transferases. Protein and Peptide Letters. 14 (6), 2007, pp. 575-580

 

 

 

TBPred

 

Application:

 

The program classifies a query protein of Mycobacterium sp. into one of four subcellular locations. The stand-alone version is useful for annotating the whole proteome of an organism.

 

Introduction:

 

TBPred is the stand-alone version of a web server trained to predict the subcellular locations of mycobacterial proteins. There are four prediction methods: amino acid composition, dipeptide composition, Position Specific Scoring Matrix (PSSM), and a hybrid method. In the hybrid method the result of MAST (Motif Alignment and Search Tool) is given preference; if MAST returns no hit, the decision is taken on the basis of PSSM composition. TBPred assigns a protein to one of four locations: cytoplasmic, integral membrane, secretory, or membrane attached by lipid anchor.

Programs needed to run TBpred locally:

 

Main program, tbpred, present in bin folder of the package

 

NOTE: the tbpred program additionally requires the MEME/MAST software. During installation the user must provide the local path of the respective software or tools.

 

Usage: perl tbpred -i inputFile -m method -o Output_Result_File -e E-valueThreshold

 

 

-i          inputFile (in Fasta format)

 

-m        Method

            <A> for Amino acid composition based method

            <D> for Dipeptide composition based method

            <P> for PSSM composition based method

            <H> for hybrid (MEME/MAST and PSSM) based method

 

-o         Output Result file

 

-e         E- value threshold for Hybrid based method [default =0.001]

            Applied only when -m <H> is used.

      

 

tbpred program is for the prediction of mycobacterial proteins’ locations. It offers 4 different SVM based models/methods-

 

 

Amino acid composition

Dipeptide composition

PSSM composition

Hybrid method

 

The following programs are run to complete the prediction:

1) Amino acid composition

 

 (For each program’s detail see the package documentation)

·         bin/fasta2sfasta  : program to make fasta format into single fasta format

·         bin/pro2aac : calculate whole protein amino acid composition

·         bin/col2svm: Converting composition file to SVM readable format

·         SVM classify with model svm_models/tbpred/model_file

 

2) Dipeptide composition

 

            (For each program details see the package documentation)

·         bin/fasta2sfasta  : program to make fasta format into single fasta format

·         bin/pro2dpc : calculate whole protein dipeptide composition

·         bin/col2svm: Converting composition file to SVM readable format

·         SVM classify with model svm_models/tbpred/model_file

 

3) PSSM composition

           

·         bin/fasta2sfasta  : program to make fasta format into single fasta format

·         bin/seq2pssm_imp : calculate position specific scoring matrix for the protein

·         bin/pssm_n1 : normalization of the scores got from seq2pssm_imp

·         bin/pssm_comp : for each protein seq it forms a fixed length composition pattern of 400

·         bin/col2svm: Converting composition file to SVM readable format

·         SVM classify with model svm_models/tbpred/model_file
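The fixed-length PSSM composition of 400 values can be sketched in Python as below. This is a sketch under assumptions, not the code of bin/pssm_n1 or bin/pssm_comp: the logistic normalization, the division by sequence length and the function names are all assumed for illustration. The PSSM is modelled as one row of 20 scores per residue of the query.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def normalize(score):
    """Map a raw PSSM score into (0, 1); logistic form is an assumption."""
    return 1.0 / (1.0 + math.exp(-score))


def pssm_composition(sequence, pssm):
    """Collapse an L x 20 PSSM into a 20 x 20 = 400 value vector.

    Rows belonging to the same residue type are summed column by column,
    then divided by the sequence length (assumed scaling).
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    sums = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        i = index.get(residue)
        if i is None:  # skip non-standard residues
            continue
        for j, score in enumerate(row):
            sums[i][j] += normalize(score)
    length = max(len(sequence), 1)
    return [value / length for row in sums for value in row]


if __name__ == "__main__":
    toy_pssm = [[0] * 20, [0] * 20]  # toy scores, not PSI-BLAST output
    print(len(pssm_composition("AC", toy_pssm)))
```

Whatever the protein length, the result has 400 values, which is what allows proteins of different sizes to be given to the same SVM model.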

 

4) Hybrid method (MAST/PSSM)

·         mast program

 

            If MAST is unable to classify at the given E-value threshold, the decision is taken from the PSSM composition based method.

Reference:

Webserver: webs.iiitd.edu.in/raghava/tbpred

 

Citation:

Rashid M, Saha S, Raghava GPS (2007): Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics, 8:337.

 

 

PSEAPred2

 

Application:

The program classifies a Plasmodium falciparum query protein as secretory or non-secretory. The stand-alone version is useful for annotating the whole proteome of an organism.

 

Introduction:

PSEAPred2 is the stand-alone version of a web server trained to predict whether Plasmodium falciparum proteins are secretory or non-secretory. The prediction is made using a support vector machine (SVM) based method and a motif-based method.

 

Usage of standalone version

perl pseapred2 -i inputFile -t threshold -o Output_Result_File

 

-i          inputFile (in Fasta format)

 

-t          Threshold for SVM based [default =0.0]

 

-o         Output Result file

 

 

 

The pseapred2 program predicts secretory proteins. It uses 2 methods:

Motif based

SVM based

 

Motif based: this method runs the MAST software, which takes a FASTA file as input and produces a MAST output file.

 

SVM based: in this method the query sequence (in FASTA format) is processed as follows. The results are given for 3 models.

 Programs needed:  (for each program details see the package documentation)

 

bin/fasta2sfasta: present in the bin directory of the package to make multi-fasta format sequence into single fasta.

bin/pro2dpc -i file_sfasta -o comp_aa

bin/pro2aac_split -i file_sfasta -o comp_split -n3

bin/col_mult.pl -i comp_split -o comp_split_main -n 0.01

bin/add_cols -i comp_aa -c comp_split_main -o comp_final

bin/col2svm -i comp_final -o comp_svmpat -s 0

SVM classify with model svm_models/pseapred2/model-hyb svm_temp > svm.out

 

 

Subsequent steps followed: (model2)

 

  bin/pro2aac -i file_sfasta -o comp_aa1

  bin/col_mult.pl -i comp_aa1 -o comp_aa1_final -n 0.01

  bin/col2svm -i comp_aa1_final -o comp_svmpat1 -s 0

  SVM classify with model svm_models/pseapred2/main-model svm_temp1 > svm1.out

 

Model3:

/bin/pro2aac -i file_sfasta -o comp_aa11

/bin/col_mult.pl -i comp_aa11 -o comp_aa11_final -n 0.01

/bin/col2svm -i comp_aa11_final -o comp_svmpat11 -s 0

 

Reference:

 

Webserver: webs.iiitd.edu.in/raghava/pseapred2