Supplementary Frame for RSLpred

PREDICTION APPROACHES FOLLOWED

Several approaches, each based on different protein features, were used to predict the subcellular localization of rice proteins in the present investigation (see detailed results in the supplementary material). However, only the five best classifiers are provided to end users for real-time prediction. Although the PSSM-based module is statistically the best of all the modules developed, it is somewhat slower because it runs PSI-BLAST searches against the non-redundant database; for larger analyses, users may therefore opt for a faster module such as the amino acid composition-based classifier. For more flexibility, four other well-performing modules are provided. For example, a user who wishes to use terminal-based information from the query sequence may opt for the split amino acid composition module, which is based on the N-terminal, centre and C-terminal composition of the protein sequence. A user who wishes to exploit the sequence order effects of the query sequence may use the dipeptide composition-based module. The method followed for developing these five classifiers is briefly discussed here:

Amino acid composition is the fraction of each amino acid in a protein. Calculating this traditional amino acid composition generates a 20-dimensional input vector, which was used to train four SVM models, one for each of the four subcellular localizations.
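As a rough illustration of the 20-dimensional vector described above, the following minimal sketch computes the fraction of each standard amino acid in a sequence. The function name, the alphabetical residue order and the choice to ignore non-standard residues (such as X) in the denominator are assumptions made for this example, not details taken from RSLpred.

```python
# Illustrative sketch only; names and residue ordering are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by one-letter code


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard amino acid."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # assumption: non-standard residues are not counted
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = amino_acid_composition("AACD")
    print(len(vector), vector[:3])
```

A vector like this could then be passed to any SVM implementation; the training setup itself is not shown here.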

Dipeptide composition was used to encapsulate the global information about each protein sequence, giving a fixed-length pattern of 400 (20 x 20) values. This representation captures the amino acid composition along with the local order of the amino acids.
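The 400-value pattern can be sketched as the fraction of each ordered residue pair among all overlapping dipeptides in the sequence. This is a minimal example under assumptions: the function name, the ordering of the 400 dipeptides and the skipping of pairs that contain non-standard residues are choices made here for illustration.

```python
# Illustrative sketch only; names and dipeptide ordering are assumptions.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each overlapping dipeptide."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


if __name__ == "__main__":
    vector = dipeptide_composition("ACAC")  # dipeptides: AC, CA, AC
    print(len(vector), vector[DIPEPTIDES.index("AC")])
```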

Hybrid approach - I. To improve the prediction accuracy, we adopted various hybrid approaches that combine different features of a protein sequence. In the first step, we developed a hybrid module by combining amino acid composition and dipeptide composition. The resulting SVM input vector had 420 dimensions (20 for amino acid composition and 400 for dipeptide composition).
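One plausible way to build the 420-dimensional pattern is simple concatenation of the two composition vectors. The sketch below assumes that layout (amino acid composition first, dipeptide composition second); the function names and the ordering are illustrative assumptions, not confirmed details of the RSLpred implementation.

```python
# Illustrative sketch only; the concatenation order is an assumption.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def _fractions(counts, keys):
    """Convert raw counts into fractions over the given keys (zeros if none occur)."""
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def hybrid_vector(sequence):
    """Return 20 amino acid fractions followed by 400 dipeptide fractions."""
    seq = sequence.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return _fractions(mono, AMINO_ACIDS) + _fractions(di, DIPEPTIDES)


if __name__ == "__main__":
    print(len(hybrid_vector("MKTAYIAKQR")))
```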

Split Amino Acid Composition (SAAC). Further, we divided each protein sequence into three parts, viz. the N-terminal part (25 residues), the centre portion and the C-terminal part (25 residues). The amino acid composition was calculated for each part separately, giving a 60-dimensional (20 x 3) SVM input vector.
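The three-part split can be sketched as follows. The 25-residue terminal length comes from the text; the function names, the N/centre/C ordering of the output and the decision to reject sequences too short to have a centre portion are assumptions made for this example.

```python
# Illustrative sketch only; handling of short sequences is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _composition(segment):
    """Fraction of each standard amino acid in one segment."""
    counts = [segment.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def split_amino_acid_composition(sequence, terminal_length=25):
    """Return a 60-element list: N-terminal, centre and C-terminal compositions."""
    seq = sequence.upper()
    if len(seq) <= 2 * terminal_length:
        raise ValueError("sequence too short to split into three parts")
    n_term = seq[:terminal_length]
    centre = seq[terminal_length:-terminal_length]
    c_term = seq[-terminal_length:]
    return _composition(n_term) + _composition(centre) + _composition(c_term)


if __name__ == "__main__":
    demo = "A" * 25 + "C" * 10 + "D" * 25
    print(len(split_amino_acid_composition(demo)))
```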

Position Specific Scoring Matrix-based SVM is another module, constructed from the evolutionary information stored in a matrix called the PSSM, which is used for detecting distantly related proteins by sequence comparison. The idea of using a PSSM extracted from sequence profiles generated by PSI-BLAST as input information was first proposed by David Jones. This information is expressed in a position-specific scoring table (profile), which is created from a group of sequences previously aligned by PSI-BLAST. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence. It differs from other commonly used methods of sequence comparison because any number of known sequences can be used to construct the profile, allowing more information to be used in testing the target sequence.

The PSSM of a protein sequence, extracted from the PSI-BLAST profile, was used to generate a 400-dimensional input vector for the SVM by summing all rows in the PSSM that correspond to the same amino acid in the primary sequence. After that, every element in this input vector was divided by the length of the sequence and then scaled to the range 0–1 using a standard linear (min-max) function: (X - minimum)/(maximum - minimum), where X is the individual PSSM score of each amino acid.
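The row-summing step can be sketched as below for a PSSM that has already been parsed into a list of 20-score rows, one row per residue. Several points are assumptions for illustration: the function name, the column order (here alphabetical, which must be matched to the actual PSSM file), the skipping of non-standard residues, and taking the minimum and maximum over the 400 elements of the vector itself, since the text does not say which values they are drawn from.

```python
# Illustrative sketch only; column order and min/max scope are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_400_vector(sequence, pssm, alphabet=AMINO_ACIDS):
    """Sum PSSM rows by residue type, divide by length, then min-max scale to 0-1."""
    seq = sequence.upper()
    if len(seq) != len(pssm):
        raise ValueError("PSSM must have one row per residue")
    index = {aa: i for i, aa in enumerate(alphabet)}
    n = len(alphabet)
    sums = [[0.0] * n for _ in range(n)]  # one accumulated row per amino acid type
    for residue, row in zip(seq, pssm):
        if len(row) != n:
            raise ValueError("each PSSM row must have one score per amino acid")
        if residue not in index:
            continue  # assumption: rows for non-standard residues are skipped
        target = sums[index[residue]]
        for j, score in enumerate(row):
            target[j] += score
    flat = [value / len(seq) for block in sums for value in block]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [0.0] * len(flat)  # avoid division by zero for a constant vector
    return [(value - lo) / (hi - lo) for value in flat]


if __name__ == "__main__":
    toy_pssm = [[1] + [0] * 19, [0, 2] + [0] * 18]  # made-up scores for "AC"
    print(len(pssm_400_vector("AC", toy_pssm)))
```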

RELIABILITY INDEX (R.I.)

The reliability index (RI) is a commonly used measure that tells users how much confidence to place in a prediction. In this study, we followed the simple strategy of Hua, S. and Sun, Z. (2001) for assigning the RI, which indicates the level of certainty in the prediction for a particular sequence. The RI was assigned according to the difference between the highest and second-highest SVM output scores. We therefore computed the reliability score of our prediction method based on the hybrid approach using the following equation: