About PSLpred


Knowledge of a protein's subcellular localization is key to elucidating its function, because proteins that cooperate towards a common biological function are usually located in the same subcellular compartment. Eukaryotic cells have evolved highly elaborate subcellular compartments, but prokaryotes (Gram-negative bacteria) also have five major subcellular localizations (outer membrane, inner membrane, periplasm, cytoplasm, and extracellular), each specialized for distinct biochemical processes. Because prokaryotes are the causative agents of many deadly diseases and widespread epidemics, biologists are paying close attention to the functional annotation of prokaryotic proteins. Such annotation may further guide the identification of virulence factors as well as new patterns of antibiotic resistance in pathogenic bacteria. Hence, prediction of protein subcellular localization in Gram-negative bacteria would be very useful in molecular biology, cell biology, pharmacology, and medical science. In the present study, a systematic attempt has been made to develop an SVM-based method for predicting the subcellular localization of prokaryotic proteins.

The data set
The data set used in the present work is the same as that used by Yu et al (2004) to develop the method CELLO. The same data set was previously used by Gardy et al (2003) to develop the method PSORT-B. It was generated from SWISS-PROT release 40.29 (Bairoch and Apweiler, 2000) and consists of a total of 1443 proteins, of which 1302 are localized to a single subcellular site and 141 are resident at multiple locations. In the present study, the 141 proteins residing in more than one subcellular location were excluded, and the 1302 proteins with a single subcellular localization (248 cytoplasmic, 268 inner membrane, 244 periplasmic, 352 outer membrane, and 190 extracellular) were used for the prediction of subcellular localization of prokaryotic proteins.

Support Vector Machines:
SVMs have previously been used for the prediction of subcellular localization of eukaryotic proteins with remarkable success. In the present study, SVMlight, a freely downloadable SVM package, was used to predict the subcellular localization of proteins. The software lets users define a number of parameters and choose among inbuilt kernel functions, including linear, RBF and polynomial. The prediction of subcellular localization is a multi-class classification problem, so we constructed N SVMs for N-class classification. Here, the number of classes is five for prokaryotic proteins. The ith SVM was trained with all the samples of the ith class labelled positive and the proteins of the remaining subcellular localizations labelled negative. This kind of SVM is known as a one-versus-rest SVM (1-v-r SVM). In this way, five SVMs were constructed for the subcellular localization of prokaryotic proteins to cytoplasm, extracellular, inner-membrane, outer-membrane, and periplasm. An unknown sample was assigned to the class corresponding to the SVM with the highest output score. Machine learning techniques are more successful when the input patterns are of fixed length. Therefore, the present study considers different protein features that yield a fixed-length format, such as amino acid composition and dipeptide composition.
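
The one-versus-rest decision rule described above can be sketched as follows. This is a minimal illustration, not PSLpred code: the function name `classify` and the example scores are made up, and the five binary SVMs are assumed to have already produced one output score each.

```python
# Minimal sketch of the one-versus-rest decision rule.
# The example scores are invented for illustration; they are not PSLpred output.
LOCATIONS = ["cytoplasmic", "extracellular", "inner-membrane",
             "outer-membrane", "periplasmic"]


def classify(scores):
    """Return the location whose binary SVM gave the highest output score."""
    if set(scores) != set(LOCATIONS):
        raise ValueError("expected exactly one score per location")
    return max(LOCATIONS, key=lambda loc: scores[loc])


example_scores = {
    "cytoplasmic": -1.20,
    "extracellular": -0.35,
    "inner-membrane": -0.90,
    "outer-membrane": 0.85,
    "periplasmic": 0.10,
}
print(classify(example_scores))
```

The winning class is simply the argmax over the five scores, so no calibration between the individual SVMs is assumed here.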

Amino acid Composition
Amino acid composition is the fraction of each amino acid in a protein. The fraction of each of the 20 natural amino acids was calculated as the number of occurrences of that amino acid divided by the total number of amino acids in the protein.
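
A minimal sketch of this calculation is given below. The function name and the alphabetical residue ordering are assumptions, and so is the handling of non-standard symbols (such as X), which are left out of the total here.

```python
# Sketch: 20-dimensional amino acid composition vector.
# Assumption: residues outside the 20 standard amino acids are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


composition = amino_acid_composition("AACD")
print(len(composition), composition[0])
```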


Composition of physico-chemical properties
A set of 33 physico-chemical properties was used to represent each protein. The values of each physico-chemical property for the 20 amino acids were normalized between 0 and 1 using the standard conversion formula. The input vector thus has 33 scalar values, each representing the average value of a distinct physico-chemical property over the protein.
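
The sketch below illustrates the idea for a single property. Two assumptions are made: the "standard conversion formula" is taken to be min-max scaling, and the Kyte-Doolittle hydropathy scale stands in for one of the 33 properties, which are not listed here. The function names are illustrative.

```python
# Sketch: average of a normalized physico-chemical property over a sequence.
# Assumptions: min-max scaling to [0, 1]; Kyte-Doolittle hydropathy is used
# only as a stand-in property, since the 33 properties are not listed.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale the property values of the 20 amino acids to the range [0, 1]."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}


def property_averages(sequence, scales):
    """Return one average per normalized scale (33 values for 33 scales)."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [sum(scale[aa] for aa in residues) / len(residues) for scale in scales]


hydropathy = min_max_normalize(KYTE_DOOLITTLE)
print(property_averages("MKKLLPTAAAGLLLLAAQPAMA", [hydropathy]))
```

Passing a list of 33 normalized scales would give the 33-dimensional input vector described above.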


Dipeptide composition
Dipeptide composition was used to encapsulate the global information about each protein sequence, giving a fixed pattern length of 400. This representation encompasses information about amino acid composition along with the local order of amino acids. The fraction of each dipeptide was calculated as the number of occurrences of that dipeptide divided by the total number of dipeptides in the sequence.
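
A minimal sketch of the 400-dimensional dipeptide vector follows. It assumes overlapping dipeptides (a sequence of length L contributes L-1 of them) and skips any pair containing a non-standard residue; the names are illustrative.

```python
# Sketch: 400-dimensional dipeptide composition vector.
# Assumptions: overlapping dipeptides are counted, and pairs containing a
# non-standard residue are skipped.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 dipeptides, in DIPEPTIDES order."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


vector = dipeptide_composition("ACAC")
print(len(vector), vector[DIPEPTIDES.index("AC")])
```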


PSI-BLAST
A PSI-BLAST module was developed to predict the subcellular localization of proteins, in which a query sequence is searched against a database of proteins using PSI-BLAST. The database consists of the 1302 sequences belonging to the five major subcellular locations. PSI-BLAST was used instead of standard BLAST because it can detect remote homologies (Altschul et al, 1990). It carries out an iterative search in which the sequences found in one round of searching are used to build a score model for the next round. Three iterations of PSI-BLAST were carried out at a cut-off E-value of 0.001. This module can predict any of the five localizations (cytoplasmic, inner-membrane, periplasmic, outer-membrane, and extracellular), depending on the similarity of the query protein to the proteins in the database. The module returns “unknown subcellular localization” if no significant similarity is obtained.
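
The decision step of such a module might look like the sketch below. It does not run PSI-BLAST; it only shows how already-parsed hits could be turned into a prediction. The rule of taking the localization of the most significant hit is an assumption, and the protein identifiers and function name are hypothetical.

```python
# Sketch: turning parsed PSI-BLAST hits into a localization call.
# Assumptions: hits are (subject_id, e_value) pairs that were parsed elsewhere,
# and the query inherits the localization of its most significant hit.
UNKNOWN = "unknown subcellular localization"


def localization_from_hits(hits, annotations, e_value_cutoff=0.001):
    """Return the localization of the best significant hit, or UNKNOWN."""
    significant = [
        (e_value, subject_id)
        for subject_id, e_value in hits
        if e_value <= e_value_cutoff and subject_id in annotations
    ]
    if not significant:
        return UNKNOWN
    _, best_id = min(significant)
    return annotations[best_id]


# Hypothetical database annotations and hits, for illustration only.
annotations = {"protA": "periplasmic", "protB": "cytoplasmic"}
print(localization_from_hits([("protA", 1e-30), ("protB", 1e-5)], annotations))
print(localization_from_hits([("protB", 0.5)], annotations))
```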


Hybrid SVM module
Recently, our group (Bhasin and Raghava, 2001) introduced the concept of a hybrid SVM module for the prediction of subcellular localization of eukaryotic proteins and achieved remarkable success. In the present study, the same approach was used to construct a hybrid SVM module. The hybrid SVM module combines the information from several protein features: amino acid composition, composition of physico-chemical properties, dipeptide composition, and PSI-BLAST output. The SVM was provided with an input vector of 459 dimensions, consisting of 20 for amino acid composition, 33 for physico-chemical properties, 400 for dipeptide composition, and six for PSI-BLAST output.
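
Assembling the 459-dimensional input can be sketched as a simple concatenation. How the six PSI-BLAST values are encoded is not stated above; the sketch assumes a one-hot vector over the five locations plus "unknown", and all names are illustrative.

```python
# Sketch: building the 459-dimensional hybrid input vector (20 + 33 + 400 + 6).
# Assumption: the six PSI-BLAST values are a one-hot encoding of the five
# locations plus "unknown"; the actual encoding is not given in the text.
LOCATIONS = ["cytoplasmic", "extracellular", "inner-membrane",
             "outer-membrane", "periplasmic"]


def encode_blast_output(prediction):
    """One-hot encode a PSI-BLAST call over five locations plus 'unknown'."""
    labels = LOCATIONS + ["unknown"]
    if prediction not in labels:
        raise ValueError(f"unexpected PSI-BLAST call: {prediction!r}")
    return [1.0 if prediction == label else 0.0 for label in labels]


def hybrid_vector(aa_comp, physchem, dipep, blast_prediction):
    """Concatenate the four feature groups into a single input vector."""
    if (len(aa_comp), len(physchem), len(dipep)) != (20, 33, 400):
        raise ValueError("expected feature groups of length 20, 33 and 400")
    return (list(aa_comp) + list(physchem) + list(dipep)
            + encode_blast_output(blast_prediction))


# Placeholder feature values, used only to show the resulting length.
vector = hybrid_vector([0.05] * 20, [0.5] * 33, [0.0025] * 400, "periplasmic")
print(len(vector))
```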

Evaluation of PSLpred
In the present study, a 5-fold cross-validation technique was adopted to evaluate the performance of the various SVM modules. In this technique, the data set is partitioned randomly into five equally sized sets. Training and testing are carried out five times, each time using one distinct set for testing and the remaining four sets for training. To assess prediction performance, accuracy and the Matthews correlation coefficient (MCC) were calculated as described by Hua and Sun (2001), using the following equations.
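
The equation images are not reproduced in this text. Assuming the definitions follow Hua and Sun (2001), which the symbols defined below match, the per-location accuracy and MCC would take the following form:

```latex
\mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)}

\mathrm{MCC}(x) = \frac{p(x)\,n(x) - u(x)\,o(x)}
{\sqrt{\bigl(p(x)+u(x)\bigr)\bigl(p(x)+o(x)\bigr)\bigl(n(x)+u(x)\bigr)\bigl(n(x)+o(x)\bigr)}}
```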




where x can be any subcellular location (cytoplasmic, inner membrane, periplasmic, outer membrane, or extracellular), exp(x) is the number of sequences observed in location x, p(x) is the number of correctly predicted sequences of location x, n(x) is the number of correctly predicted sequences not of location x, u(x) is the number of under-predicted sequences, and o(x) is the number of over-predicted sequences.
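
The quantities defined above map directly onto counts over the true and predicted labels. The sketch below assumes that per-location accuracy is p(x)/exp(x) and that MCC has the usual Matthews form; the function name and toy labels are illustrative.

```python
# Sketch: per-location accuracy and MCC from true and predicted labels.
# Assumptions: accuracy(x) = p(x) / exp(x), and MCC has the usual Matthews form.
import math


def per_location_metrics(true_labels, predicted_labels, location):
    """Return (accuracy, mcc) for one location x."""
    pairs = list(zip(true_labels, predicted_labels))
    p = sum(t == location and q == location for t, q in pairs)  # correct, in x
    n = sum(t != location and q != location for t, q in pairs)  # correct, not in x
    u = sum(t == location and q != location for t, q in pairs)  # under-predicted
    o = sum(t != location and q == location for t, q in pairs)  # over-predicted
    observed = p + u  # exp(x): sequences observed in location x
    accuracy = p / observed if observed else float("nan")
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denominator if denominator else 0.0
    return accuracy, mcc


# Toy labels, for illustration only.
true = ["cytoplasmic", "cytoplasmic", "periplasmic", "periplasmic"]
pred = ["cytoplasmic", "periplasmic", "periplasmic", "periplasmic"]
print(per_location_metrics(true, pred, "cytoplasmic"))
```

Returning 0.0 when the denominator vanishes is a convention chosen for this sketch, not something specified above.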

Reliability Index
The reliability index (RI) measures the level of certainty in the prediction for a particular sequence, and so helps users judge how much confidence to place in a prediction. The strategy used for assigning the RI is similar to that used previously by our group. The RI was assigned according to the difference between the highest and second-highest SVM output scores. The reliability index for the hybrid approach based methods was calculated using the following equation.
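
Since the equation itself is not reproduced here, the sketch below only illustrates the idea of deriving an index from the gap between the two highest scores. The binning (`bin_width`, a maximum of 5) is a placeholder chosen for illustration and is not the published PSLpred formula.

```python
# Sketch: reliability index from the gap between the two highest SVM scores.
# Assumption: bin_width and the linear binning are placeholders for
# illustration only; they are not the published formula.
def score_margin(scores):
    """Difference between the highest and second-highest output scores."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]


def reliability_index(scores, bin_width=0.5, max_ri=5):
    """Map the score margin onto an integer index from 1 to max_ri."""
    margin = score_margin(scores)
    return min(max_ri, int(margin / bin_width) + 1)


# Made-up scores from five one-versus-rest SVMs.
print(reliability_index([0.85, 0.10, -0.35, -0.90, -1.20]))
```

A larger margin means the winning SVM is further ahead of its nearest competitor, which is why it is treated as a more reliable call.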



RESULTS: The detailed results obtained after 5-fold cross-validation for all the SVM modules developed in the present study are as follows:


  • Amino acid composition
  • An SVM module based on the amino acid composition of a protein achieved its best results with the RBF kernel (g=100, c=2, j=1). The amino acid composition yields a 20-dimensional input vector for each protein sequence; these vectors were used to train five SVM models, one for each of the five subcellular localizations. The composition based SVM module achieved an overall accuracy of 86%.


    Subcellular localization   Accuracy (%)   MCC
    Cytoplasmic                87.1           0.80
    Extracellular              77.9           0.81
    Inner-membrane             86.9           0.87
    Outer-membrane             93.5           0.76
    Periplasmic                79.9           0.83

  • Composition of physico-chemical properties
  • The composition of physico-chemical properties of a protein sequence yields an input vector of 33 dimensions for each sequence. The overall accuracy of the properties based SVM module is 83%, about 3% lower than that of the amino acid composition based SVM module.

    Subcellular localization   Accuracy (%)   MCC
    Cytoplasmic                83.5           0.77
    Extracellular              75.8           0.77
    Inner-membrane             85.8           0.83
    Outer-membrane             87.8           0.82
    Periplasmic                78.3           0.73

  • Dipeptide composition
  • Dipeptide composition is considered a better feature than amino acid composition because it encapsulates global as well as local information about the sequence. To capture the frequency as well as the local order of residues in proteins, we also constructed an SVM module based on dipeptide composition. It uses a fixed-length input vector of 400 dimensions. The dipeptide composition based SVM module with the RBF kernel (g=300, C=2) achieved an overall accuracy of 86%.

    Subcellular localization   Accuracy (%)   MCC
    Cytoplasmic                87.1           0.78
    Extracellular              73.7           0.79
    Inner-membrane             85.8           0.89
    Outer-membrane             93.8           0.88
    Periplasmic                84.0           0.77

     


  • PSI-BLAST
  • The performance of the PSI-BLAST based module was also evaluated through 5-fold cross-validation. This module performed worse than the other modules developed in the present study: it predicted the subcellular localization of the proteins with an overall accuracy of 68%.

     
    Subcellular localization   Accuracy (%)
    Cytoplasmic                34.3
    Extracellular              79.5
    Inner-membrane             59.7
    Outer-membrane             93.8
    Periplasmic                65.5
     


  • Hybrid based approach
  • A hybrid module based on all features of the proteins and the output of PSI-BLAST was developed. This hybrid module (g=25, C=4) achieved an overall accuracy of 91.2%, which is 5-8% higher than the individual composition based modules. This suggests that the hybrid module encapsulates more information, which in turn improves prediction accuracy. These results indicate that predicting the subcellular localization of proteins requires a wide range of information about a protein.


    Subcellular localization   Accuracy (%)   MCC
    Cytoplasmic                90.7           0.86
    Extracellular              86.8           0.88
    Inner-membrane             90.3           0.90
    Outer-membrane             95.2           0.95
    Periplasmic                90.6           0.89

     



  • Reliability Index
  • To assess prediction reliability, RI assignment was carried out for the hybrid module. As the RI curve shows, accuracies of 90% and 98.1% were obtained at RI=4 and RI=5, respectively. About 74% of the sequences have RI=5. Hence, the present method can annotate the subcellular localization of prokaryotic proteins reliably.


     


Comparison with existing methods
The performance of the hybrid module developed in the present study was compared with that of CELLO and PSORT-B, which were developed from the same data set. The overall accuracy of the hybrid module is nearly 2% higher than that of CELLO and 16% higher than that of PSORT-B. Hence, the present method appears to be more accurate for predicting the subcellular localization of prokaryotic proteins.