Supplementary Material
Benchmark Data
The training/testing dataset contains 910 proteins classified into 4 subcellular locations (chloroplast, cytoplasm, mitochondria and nucleus) according to the CC -!- SUBCELLULAR LOCATION (comment type) line in the UniProt Knowledgebase (Swiss-Prot / TrEMBL). No protein has more than 90% sequence identity to any other protein in the same subset (subcellular location). The UniProt Knowledgebase entry IDs of these 910 protein sequences are available for download (Click here).
RSLpred predictions
The complete rice proteome was downloaded from two sources (www.ebi.ac.uk/integr8/ and www.tigr.org) for performing RSLpred predictions. As the number of protein entries differed between the two sources, predictions were made on both datasets (EBI and TIGR) separately, using model files from the faster, traditional amino acid composition-based module. The predictions at the high score threshold (>0.5) are presented here (see text for details).
1. Click here to download the UniProt Knowledgebase IDs of protein sequences predicted by RSLpred on the EBI proteome dataset.
2. Click here to download the TIGR Locus Identifiers of protein sequences predicted by RSLpred on the TIGR proteome dataset.
Additional information on various modules developed
In the present study, several classifiers were developed for predicting the subcellular localization of rice proteins. The individual results of each module are briefly described here:
Amino acid composition
This traditional composition-based SVM module achieved an overall accuracy of 81.43% with the RBF kernel (g=200, C=3, j=5). The results obtained are shown below:
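To make the 20-dimensional input of this module concrete, the following is a minimal Python sketch of an amino acid composition vector. It is not the code used in the study: the function name, the use of fractions rather than percentages, and the choice to ignore non-standard residues are all assumptions made for illustration.

```python
# Hypothetical helper, not taken from RSLpred itself.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    # Assumption: residues outside the 20 standard letters (e.g. X) are ignored.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]

vector = amino_acid_composition("MKVLAAGIVALLLA")
print(len(vector))  # 20 values, one per residue type
```

Each protein then contributes one such vector, paired with its location label, to SVM training.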
Dipeptide composition was used to exploit sequence-order information in predicting subcellular localization. The best results were obtained with the RBF kernel (g=225, C=2, j=2). The SVM module achieved an overall accuracy of 80.88%, almost on par with the amino acid composition-based SVM module. The results obtained after 5-fold cross-validation for dipeptide composition are shown below:
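The dipeptide encoding described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the original implementation: the names are hypothetical, the 400 adjacent residue pairs are counted with overlap, and pairs containing non-standard residues are skipped.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 adjacent residue pairs."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]

features = dipeptide_composition("ACAC")
print(len(features))  # 400
```

Unlike the plain composition, this vector distinguishes "AC" from "CA", which is how local sequence order enters the SVM input.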
Pseudo amino acid composition
The PseAA approach not only reflects the total amino acid composition but also incorporates, to a considerable degree, sequence-order effects through a series of correlation factors. This representation, which also gives a fixed pattern length of 400 (20 x 20), encompasses the amino acid composition along with the pseudo order of amino acids. Here, various pseudo dipeptides (also called higher order dipeps), namely i + 2, i + 3, i + 4 and i + 5, were generated in order to observe the interaction of the ith residue with the 3rd, 4th, 5th and 6th residue in the sequence. The frequencies of all the (i + n) dipeps were then combined and divided by the sum of all possible dipeptides of each (i + n), again forming a combined fixed-length pattern of 400 for use in SVM. We designated this as cumulative higher order dipep composition. It achieved an overall accuracy of 82.97% (g = 300, C = 2, j = 7), an increase of about 2% over the (i + 1) dipeptide composition-based SVM module. The results obtained after 5-fold cross-validation for the cumulative PseAA composition-based approach are shown below:
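One possible reading of the cumulative higher order dipep composition is sketched below in Python. It pools the counts of residue pairs (i, i + n) over several offsets n and divides by the total number of pairs counted. Which offsets were actually pooled is not fully specified above, so the default `offsets=(2, 3, 4, 5)` is an assumption, as are the function and variable names.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def cumulative_higher_order_dipep(sequence, offsets=(2, 3, 4, 5)):
    """Pool (i, i + n) pair counts over all offsets into one 400-vector."""
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    for n in offsets:
        # Pair residue i with the residue n positions further along.
        for i in range(len(seq) - n):
            pair = seq[i] + seq[i + n]
            if pair in counts:  # assumption: skip non-standard residues
                counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]

features = cumulative_higher_order_dipep("MKVLAAGIVALLLA")
print(len(features))  # 400, regardless of how many offsets are pooled
```

Pooling keeps the input length at 400 rather than 400 per offset, which matches the fixed pattern length stated above.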
Four parts composition based
Here, the protein sequence was divided into four parts, giving a fixed SVM pattern length of 80 (20 x 4), in order to capture more global information about each protein sequence. The best results were achieved with the RBF kernel (g=10, C=3, j=1). This four parts composition-based SVM module achieved an overall accuracy of 81.10%, which was also on par with the amino acid composition-based SVM module. The results obtained after 5-fold cross-validation for four parts composition are shown below:
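A minimal Python sketch of the four parts encoding follows. How the sequence was divided is not stated above, so equal-length quarters (with the last part absorbing any remainder) are an assumption, and all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Fraction of each standard residue within one segment."""
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / total for aa in AMINO_ACIDS]

def four_part_composition(sequence, parts=4):
    """Concatenate the compositions of consecutive parts (20 values each)."""
    seq = sequence.upper()
    size = len(seq) // parts
    vector = []
    for k in range(parts):
        start = k * size
        # Assumption: the final part takes any leftover residues.
        end = (k + 1) * size if k < parts - 1 else len(seq)
        vector.extend(composition(seq[start:end]))
    return vector

features = four_part_composition("AAAACCCCDDDDEEEE")
print(len(features))  # 80
```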
PSI-BLAST is another module developed, in which a query sequence was searched by similarity against the non-redundant database of 910 classified proteins. PSI-BLAST was used instead of standard BLAST because it can detect remote homologies. It carries out an iterative search in which sequences found in one round are used to build a score model for the next round of searching. The module returns the subcellular localization, the SWISS-PROT number and the sequence of the protein similar to the query sequence. Three iterations of PSI-BLAST were carried out at a cut-off E-value of 0.001. This module could predict any of the four localizations depending upon the similarity of the query protein to the proteins in the dataset. The module would return "unknown subcellular localization" if no significant similarity was obtained. With PSI-BLAST, an overall accuracy of 68.35% was achieved. The individual accuracies obtained for the four types of subcellular localization are shown below:
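The decision step of this module can be illustrated with a short Python sketch that operates on already parsed search hits. The hit records, their field names and the rule of taking the hit with the lowest E-value are assumptions made for illustration; running PSI-BLAST itself and parsing its output are not shown.

```python
E_VALUE_CUTOFF = 0.001  # cut-off E-value stated in the text
UNKNOWN = "unknown subcellular localization"

def assign_location(hits, cutoff=E_VALUE_CUTOFF):
    """Return the location of the most significant hit, or UNKNOWN."""
    # Assumption: a hit counts as significant when its E-value <= cutoff.
    significant = [hit for hit in hits if hit["evalue"] <= cutoff]
    if not significant:
        return UNKNOWN
    best = min(significant, key=lambda hit: hit["evalue"])
    return best["location"]

# Hypothetical hits against the 910-protein database (IDs are made up).
example_hits = [
    {"id": "HIT_1", "evalue": 2e-30, "location": "Chloroplast"},
    {"id": "HIT_2", "evalue": 5e-4, "location": "Nucleus"},
    {"id": "HIT_3", "evalue": 0.8, "location": "Cytoplasm"},
]
print(assign_location(example_hits))
print(assign_location([]))
```

A query with no hit under the cut-off falls through to the "unknown" answer, so this module can abstain where the SVM modules always predict one of the four classes.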
Position Specific Scoring Matrix
This PSSM-based SVM module achieved the best overall accuracy among all the methods we attempted, 87.10%, with the RBF kernel (g=45, c=2, j=6). The individual accuracies obtained for the four types of subcellular localization are shown below:
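How a variable-length PSSM is reduced to a fixed-length SVM input is not described in this section. One common scheme, shown here purely as an assumed illustration, sums the PSSM rows belonging to each residue type to obtain a 20 x 20 = 400 vector. The column order, the division by sequence length and all names below are assumptions.

```python
# Assumed column order of the 20 substitution scores in each PSSM row.
ORDER = "ARNDCQEGHILKMFPSTWYV"

def pssm_to_400(sequence, pssm_rows):
    """Sum PSSM rows by residue type and flatten to 400 values."""
    index = {aa: k for k, aa in enumerate(ORDER)}
    sums = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm_rows):
        k = index.get(residue)
        if k is None:  # assumption: skip non-standard residues
            continue
        for j, score in enumerate(row):
            sums[k][j] += score
    # Assumption: divide by sequence length so long proteins do not dominate.
    length = max(len(sequence), 1)
    return [value / length for row in sums for value in row]

# Toy two-residue example with made-up scores.
toy = pssm_to_400("AR", [[1] * 20, [2] * 20])
print(len(toy))  # 400
```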
Hybrid approach-I
This combined amino acid and dipeptide composition-based SVM classifier achieved an overall accuracy of 82.53% (kernel=RBF, g=100, c=6, j=2), about 1% higher than the amino acid composition-based SVM method. The individual accuracies obtained for the four types of subcellular localization are shown below:
Hybrid approach-II
Secondly, we developed another hybrid module by combining amino acid composition and the PSSM. The SVM input vector length was 420 (20 for amino acid composition and 400 for PSSM). The best results were obtained with the RBF kernel (g=45, c=2, j=4), with an overall accuracy of 84.84%, about 2% higher than the hybrid approach-I SVM method. The individual accuracies obtained for the four types of subcellular localization are shown below:
Hybrid approach-III
Further, we attempted another hybrid module by combining amino acid composition, dipeptide composition and the PSSM. The SVM input vector length increased to 820 (20 for amino acid composition, 400 for dipeptide composition and 400 for PSSM). The best results were obtained with the RBF kernel (g=35, c=4, j=2), with an overall accuracy of 84.51%, almost on par with hybrid approach-II. However, none of the hybrid approaches attempted exceeded the overall accuracy obtained with the PSSM-based SVM module alone. The individual accuracies obtained for the four types of subcellular localization are shown below:
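The hybrid inputs can be understood as simple concatenations of the individual feature blocks. The Python sketch below uses dummy placeholder values only to check the stated input lengths; the helper name is hypothetical, and any scaling applied before concatenation in the study is not shown.

```python
def hybrid_vector(*feature_blocks):
    """Concatenate feature blocks into one SVM input vector."""
    vector = []
    for block in feature_blocks:
        vector.extend(block)
    return vector

# Dummy placeholder blocks with the lengths given in the text.
aac = [0.05] * 20        # amino acid composition
dipep = [0.0025] * 400   # dipeptide composition
pssm = [0.0] * 400       # PSSM-derived features

hybrid_2 = hybrid_vector(aac, pssm)         # amino acid + PSSM
hybrid_3 = hybrid_vector(aac, dipep, pssm)  # amino acid + dipeptide + PSSM
print(len(hybrid_2), len(hybrid_3))  # 420 820
```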
N-terminal amino acid composition
Most proteins have sorting signals that rely on the presence of an N-terminal targeting sequence that is recognized by the translocation machinery. These signals are responsible for targeting proteins to various subcellular localizations in the cell. Therefore, we developed an SVM module based on the N-terminal amino acid composition of each protein, giving an SVM input vector of length 20. The SVM module was developed at various N-terminal residue lengths (10, 15, 20, 25 and 30 amino acids) in order to achieve maximum accuracy. The best results were obtained at a length of 25 residues with the RBF kernel (g=30, c=1, j=3), with an overall accuracy of 70.88%. The individual accuracies obtained for the four types of subcellular localization are shown below:
C-terminal amino acid composition
We also developed an SVM module based on the C-terminal amino acid composition of each protein, giving an SVM input vector of length 20. Here, we also varied the C-terminal residue length (10, 15, 20, 25 and 30 amino acids) in order to achieve maximum accuracy. The best results were obtained with the RBF kernel (g=40, c=1, j=3), with an overall accuracy of 64.18%. This accuracy was far lower than that of the N-terminal module, which indicates the presence of sorting signals in the N-terminal region of a protein sequence. The individual accuracies obtained for the four types of subcellular localization are shown below:
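The terminal composition modules differ from the whole-sequence composition only in the stretch of sequence that is counted. A minimal Python sketch, with hypothetical names and with the assumption that sequences shorter than the requested length are used in full, could look like this:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def terminal_composition(sequence, length=25, terminus="N"):
    """Composition of the first (N) or last (C) `length` residues."""
    seq = sequence.upper()
    if terminus == "N":
        segment = seq[:length]
    else:
        # Guard against sequences shorter than `length`.
        segment = seq[max(len(seq) - length, 0):]
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / total for aa in AMINO_ACIDS]

protein = "A" * 25 + "C" * 25  # toy sequence
n_term = terminal_composition(protein, length=25, terminus="N")
c_term = terminal_composition(protein, length=25, terminus="C")
print(len(n_term), len(c_term))  # 20 20
```

Looping `length` over 10, 15, 20, 25 and 30 and retraining each time reproduces the kind of length scan described above.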
Splitted Amino Acid Composition (SAAC)
The best results were obtained with the RBF kernel (g=6, c=2, j=2), with an overall accuracy of 79.78%. Among the terminal-based SVM modules, this method had the highest accuracy. The individual accuracies obtained for the four types of subcellular localization are shown below:
N-terminal + Remaining part amino acid composition
We also attempted an SVM module based on the division of the protein into two parts, viz. the N-terminus and the remaining part of the sequence. The amino acid composition was calculated separately for both parts, giving an SVM input vector of length 40 (2 x 20). A maximum overall accuracy of 79.23% was obtained with the RBF kernel (g=18, c=2, j=3). The individual accuracies obtained for the four types of subcellular localization are shown below:
C-terminal + Remaining part amino acid composition
Similarly, we attempted an SVM module based on the division of the protein into two parts, viz. the C-terminus and the remaining part of the sequence. The amino acid composition was calculated separately for both parts, giving an SVM input vector of length 40 (2 x 20). A maximum overall accuracy of 76.48% was obtained with the RBF kernel (g=15, c=2, j=2). This accuracy was lower than that of the N-terminal + remaining part composition-based module. The individual accuracies obtained for the four types of subcellular localization are shown below:
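The two-part encodings (terminus plus remaining sequence) can be sketched in Python as below. The length of the terminal part used in these modules is not given above, so `terminal_length=25` is only an assumed default, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Fraction of each standard residue within one segment."""
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / total for aa in AMINO_ACIDS]

def split_composition(sequence, terminal_length=25, terminus="N"):
    """Return 40 values: terminal composition followed by the remainder."""
    seq = sequence.upper()
    if terminus == "N":
        terminal, rest = seq[:terminal_length], seq[terminal_length:]
    else:
        cut = max(len(seq) - terminal_length, 0)
        terminal, rest = seq[cut:], seq[:cut]
    return composition(terminal) + composition(rest)

n_split = split_composition("AAACCC", terminal_length=3, terminus="N")
c_split = split_composition("AAACCC", terminal_length=3, terminus="C")
print(len(n_split), len(c_split))  # 40 40
```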