Algorithm behind TBpred

About Dataset:
The current dataset of mycobacterial proteins has been developed from          along with their subcellular localization. Out of 1365 proteins, those carrying the non-experimental qualifier "by similarity" were excluded, leaving 882 proteins. Among the 13 different subcellular compartments, the 4 major sites containing a reasonable number of samples were selected.

Subcellular Localization                         Sample Number
1. Cytoplasmic                                   340
2. Integral Membrane                             402
3. Secreted                                      50
4. Attached to the membrane by lipid anchor      60

Support Vector Machine (SVM):
SVMlight has been used in the present study in classification mode. Several of its parameters can be tuned to obtain optimum results. Of the built-in kernels, three have been used: linear, polynomial and RBF. Subcellular localization prediction is a multi-class problem. For each protein feature, four SVM models have been developed, one for each subcellular localization. The nth SVM model learns from the samples of the nth class with positive labels and from all other samples with negative labels. An unknown sample is assigned to the location with the maximum of the four scores generated by the four location-specific models.
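The one-versus-rest scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names, the shortened location label "lipid anchor" and the example scores are hypothetical, and the scores stand in for the decision values that the four trained SVM models would return.

```python
# Sketch of one-vs-rest labelling and max-score prediction.
# All names are hypothetical; the scores stand in for SVM decision values.

LOCATIONS = [
    "cytoplasmic",
    "integral membrane",
    "secreted",
    "lipid anchor",
]


def one_vs_rest_labels(sample_locations, target):
    """Return +1 for samples of the target location and -1 for all others."""
    return [1 if loc == target else -1 for loc in sample_locations]


def predict_location(scores):
    """Pick the location whose model produced the highest score.

    `scores` maps each location to the decision value of its model.
    """
    return max(scores, key=scores.get)


# Training labels for the "secreted" model on four example proteins.
labels = one_vs_rest_labels(
    ["cytoplasmic", "secreted", "integral membrane", "secreted"], "secreted"
)

# Made-up decision values for one unknown protein.
predicted = predict_location(
    {
        "cytoplasmic": -0.8,
        "integral membrane": 0.3,
        "secreted": 1.1,
        "lipid anchor": -0.2,
    }
)
```

Taking the maximum of the raw scores assumes that the four models produce values on a comparable scale, which is the usual simplification in one-versus-rest setups.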

Evaluation of prediction performance of TBpred:
The performance of this method is evaluated by the 5-fold cross-validation technique. The whole dataset is partitioned into 5 sets in such a manner that no two proteins from different sets show sequence similarity greater than 36%. Training is done on four sets and the remaining set is used for testing. In order to test every protein, this process is carried out 5 times, each time using a distinct set for testing. The performance of the different SVM modules has been evaluated by calculating accuracy and the Matthews correlation coefficient (MCC) by the following equations:

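\[
\mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)}
\]

\[
\mathrm{MCC}(x) = \frac{p(x)\,n(x) - u(x)\,o(x)}{\sqrt{[p(x)+u(x)]\,[p(x)+o(x)]\,[n(x)+u(x)]\,[n(x)+o(x)]}}
\]

where x can be any of the four subcellular locations (cytoplasmic, integral membrane, secreted, or attached to the membrane by a lipid anchor), exp(x) is the number of sequences observed in location x, p(x) is the number of correctly predicted sequences of location x, n(x) is the number of correctly predicted sequences not of location x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences.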
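The evaluation procedure can be sketched as follows. The sketch makes two assumptions. First, a hypothetical mapping `cluster_of` already groups proteins so that any two proteins above the similarity threshold share a cluster; how that clustering is produced is not shown here. Second, accuracy and MCC follow the standard definitions in terms of the counts above, Accuracy(x) = p(x)/exp(x) and MCC(x) = (p(x)n(x) - u(x)o(x)) / sqrt((p(x)+u(x))(p(x)+o(x))(n(x)+u(x))(n(x)+o(x))), with exp(x) = p(x) + u(x). All function names are illustrative.

```python
import math
from collections import defaultdict


def assign_folds(cluster_of, n_folds=5):
    """Assign proteins to folds so that each cluster stays in one fold.

    `cluster_of` maps protein ID -> cluster ID (assumed to be precomputed).
    Largest clusters are placed first, each into the currently smallest fold.
    """
    clusters = defaultdict(list)
    for protein, cluster in cluster_of.items():
        clusters[cluster].append(protein)

    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def per_class_metrics(true_labels, predicted_labels, location):
    """Return (accuracy, mcc) for one location x."""
    p = n = u = o = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == location and guess == location:
            p += 1  # correctly predicted as x
        elif truth != location and guess != location:
            n += 1  # correctly predicted as not x
        elif truth == location:
            u += 1  # under-predicted: a sequence of x was missed
        else:
            o += 1  # over-predicted: wrongly assigned to x
    observed = p + u  # exp(x)
    accuracy = p / observed if observed else 0.0
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denom if denom else 0.0
    return accuracy, mcc


# Toy example with made-up labels.
true = ["cyt", "cyt", "sec", "sec", "mem", "mem"]
pred = ["cyt", "sec", "sec", "sec", "mem", "cyt"]
cyt_accuracy, cyt_mcc = per_class_metrics(true, pred, "cyt")
```

In a full run, the predictions from the five test folds would be pooled before the per-location counts are taken, so that every protein is counted exactly once.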

Various Prediction Approaches:
Three main approaches, based on different features of proteins, have been studied.

  • Amino Acid Composition is the fraction of each amino acid present in a protein. The SVM is trained with a 20-dimensional input vector for each protein. The overall prediction accuracy of this SVM module (kernel: RBF, g = 0.1, c = 600, j = 5) was 82.51%.

    Subcellular Localization                    Accuracy (%)    MCC
    Cytoplasmic                                 88.82           0.77
    Integral Membrane                           86.07           0.71
    Secreted                                    44.00           0.57
    Attached to membrane by a lipid anchor      55.00           0.58

  • Dipeptide Composition is the fraction of each of the 400 (20*20) possible types of dipeptide in the protein sequence. Here the SVM is trained with a fixed pattern length of 400. The overall prediction accuracy of this SVM module (kernel: polynomial, d = 1, c = 200, j = 1) was 80.39%.

    Subcellular Localization                    Accuracy (%)    MCC
    Cytoplasmic                                 89.41           0.72
    Integral Membrane                           81.09           0.67
    Secreted                                    50.00           0.60
    Attached to membrane by a lipid anchor      50.00           0.57

  • Position-Specific Scoring Matrix (PSSM): A PSSM is a type of scoring matrix used in protein BLAST searches in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment. Thus, a Tyr-Trp substitution at position A of an alignment may receive a very different score than the same substitution at position B. This is in contrast to position-independent matrices such as PAM and BLOSUM, in which the Tyr-Trp substitution receives the same score regardless of the position at which it occurs.

    From the PSSM obtained for each protein sequence, an SVM pattern has been made. The input vector contains 400 dimensions. The overall accuracy achieved by this SVM module (kernel: RBF, g = 2, c = 50, j = 1) was 86.62%.

    Subcellular Localization                    Accuracy (%)    MCC
    Cytoplasmic                                 94.71           0.85
    Integral Membrane                           87.81           0.80
    Secreted                                    44.00           0.48
    Attached to membrane by a lipid anchor      68.33           0.69
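The three feature types above can be illustrated with a short Python sketch. The amino acid and dipeptide compositions follow directly from their definitions. The PSSM function is an assumption: the text only states that the input vector has 400 dimensions, so the sketch uses one common scheme (summing the PSSM rows that belong to the same residue type into a 20 x 20 matrix and dividing by the sequence length), which may differ from what TBpred actually does. All function names are hypothetical, and residues outside the 20 standard amino acids are simply skipped.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def amino_acid_composition(sequence):
    """Fraction of each of the 20 amino acids (20 values)."""
    residues = [aa for aa in sequence.upper() if aa in INDEX]
    counts = [0] * 20
    for aa in residues:
        counts[INDEX[aa]] += 1
    total = len(residues)
    return [c / total for c in counts] if total else [0.0] * 20


def dipeptide_composition(sequence):
    """Fraction of each of the 400 ordered dipeptides (400 values)."""
    seq = sequence.upper()
    counts = [0] * 400
    total = 0
    for first, second in zip(seq, seq[1:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first] * 20 + INDEX[second]] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * 400


def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into 400 values (assumed scheme).

    Rows belonging to the same residue type are summed into a 20 x 20
    matrix, which is divided by the sequence length and flattened.
    """
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        if residue in INDEX:
            for j, score in enumerate(row):
                matrix[INDEX[residue]][j] += score
    length = len(sequence) or 1
    return [value / length for row in matrix for value in row]


# Toy inputs; the PSSM rows are made-up numbers, not real PSI-BLAST output.
aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
pssm_vector = pssm_to_400("AC", [[1] * 20, [2] * 20])
```

Each function returns a fixed-length vector regardless of protein length, which is what allows proteins of different sizes to be fed to the same SVM.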