Determining the subcellular localization of a protein is considered to be the most reliable and important way to elucidate its function. Over the last few years, numerous computational methods have been developed to predict the subcellular locations of proteins; they differ in the computational techniques, input features and datasets on which they are based. These include PSORTB, NNPSL, TargetP, LOCSVMPSI, SignalP, ESLpred, CELLO, PSLpred, SubLoc and HSLPred.

The expansion of raw protein sequence databases in the post-genomic era and the availability of freshly annotated sequences for major localizations motivated us to introduce "ESLpred2", a new, improved version of our earlier eukaryotic subcellular localization prediction method. The earlier method was trained on a roughly 10-year-old and highly redundant dataset (referred to as the RH2427 dataset); the new version also includes a recently generated, highly non-redundant, kingdom-specific dataset (used for developing the BaCelLo method). Furthermore, a systematic approach has been taken to improve prediction quality by using PSSM profiles generated from PSI-BLAST along with compositional attributes and similarity-search based information. The present method has achieved the highest success rate for subcellular localization prediction, with good overall and average accuracy, and hence complements other existing subcellular localization prediction methods.

Why a new, improved version?

The ESLpred method, developed in 2004, predicts four types of eukaryotic localization (cytoplasmic, mitochondrial, nuclear and extracellular) with a good accuracy of 88%. In addition, ESLpred achieved the highest success rate on the same dataset when compared with other popular methods such as SubLoc, NNPSL, Markov models and Fuzzy-k-NN. Although LOCSVMPSI has attained a higher accuracy of 90%, the prediction accuracy of ESLpred for nuclear proteins is much better. However, the growing sequence databases and the availability of annotated sequences for other major localizations prompted us to develop a new version of ESLpred that covers new localizations. It was also necessary to add new input features that could enhance the accuracy of subcellular localization prediction. The input features used in the present study are briefly described below.

 

Input Used

Amino acid composition (whole and N-terminal)          

In this work, the amino acid compositions of the whole sequence and of the N-terminal region (of length 20 amino acids) were used together as an input feature for SVM model training, giving an input vector of 40 dimensions.
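The 40-dimensional composition vector described above can be sketched as follows. This is a minimal illustration, not the ESLpred2 implementation: the function names, the alphabetical residue order and the choice to express composition as a fraction of sequence length are all assumptions.

```python
# Hypothetical helper names; the residue order is an arbitrary (alphabetical) choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    # Assumption: counts are divided by the full sequence length,
    # so non-standard residues lower the total slightly.
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def composition_features(seq, n_term_len=20):
    """Whole-sequence composition (20 values) followed by the
    composition of the first n_term_len residues (20 values)."""
    return composition(seq) + composition(seq[:n_term_len])


features = composition_features("MKT" * 10)
print(len(features))  # 40
```

If the sequence is shorter than 20 residues, the N-terminal slice is simply the whole sequence, so both halves of the vector are identical.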

PSSM generated by PSI-BLAST

In the present study, an attempt was made to use the position-specific scoring matrix (PSSM) generated by PSI-BLAST as an input feature for training the SVM. The PSI-BLAST search was carried out against the non-redundant dataset available at NCBI, and the sequences found in one round of the search were used to build a score model for the next round. After three iterations with a cut-off E-value of 0.001, PSI-BLAST generated the highest-scoring PSSM as part of the prediction process. The matrix consisted of 21 × M elements, where M is the length of the target sequence, and each element represents the frequency of occurrence of each of the 20 amino acids at one position in the alignment.

Next, each element of the matrix (20 × M) was scaled to the range 0-1 using a sigmoid function.
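The scaling step can be illustrated with a short sketch. It assumes the standard logistic form 1 / (1 + e^-x); the text only says "sigmoid function", so the exact form used by ESLpred2 is an assumption, as are the function names. The PSSM is represented here as a list of M rows with 20 scores each.

```python
import math


def sigmoid(x):
    """Standard logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scale_pssm(pssm):
    """Map every PSSM score into the open interval (0, 1)."""
    return [[sigmoid(score) for score in row] for row in pssm]


# Toy two-position example with two columns instead of 20, for brevity.
scaled = scale_pssm([[0, 2], [-2, 0]])
print(scaled[0][0])  # 0.5
```

A score of 0 maps to 0.5, positive scores move toward 1 and negative scores toward 0, so the relative ordering of scores is preserved.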

Further, to obtain an input of fixed length, the normalized PSSM (20 × M) was used to generate a 400-dimensional input vector by summing all rows of the PSSM that correspond to the same amino acid in the sequence. Each element of this input vector was then divided by the length of the protein sequence. The result is a matrix of 20 × 20 elements, that is, 400 values.
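The conversion from a variable-length PSSM to a fixed 400-dimensional vector can be sketched as below. This is an illustration under stated assumptions: the PSSM is held as M rows of 20 already-scaled values, the column order follows the usual PSI-BLAST ASCII output, positions with non-standard residues are skipped, and the function name is hypothetical.

```python
# Column order commonly seen in PSI-BLAST ASCII PSSM output (assumed here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(seq, scaled_pssm):
    """Sum PSSM rows by the residue at each position, then divide by
    sequence length, giving a flattened 20 x 20 = 400-value vector."""
    if len(seq) != len(scaled_pssm):
        raise ValueError("PSSM must have one row per residue")
    if not seq:
        raise ValueError("empty sequence")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    totals = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(seq.upper(), scaled_pssm):
        i = index.get(residue)
        if i is None:
            continue  # assumption: ignore non-standard residues such as X
        for j, value in enumerate(row):
            totals[i][j] += value
    return [value / len(seq) for row in totals for value in row]


# Toy example: two alanine positions and one arginine position.
vec = pssm_composition("AAR", [[0.2] * 20, [0.4] * 20, [0.9] * 20])
print(len(vec))  # 400
```

The first 20 values of the vector come from all alanine positions, the next 20 from all arginine positions, and so on; amino acids absent from the sequence contribute rows of zeros.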

Similarity-search based module (EuPSI-BLAST)

Besides generating the PSSM, PSI-BLAST was also used to carry out a similarity-based search against a local, in-built database of proteins from the different localizations. For the RH2427 dataset, we used the same module (EuPSI-BLAST) that was designed previously by our group for the ESLpred method. For the BaCelLo datasets, new modules were generated by carrying out similarity-based searches against the local datasets of 2597 animal, 1198 fungal and 491 plant proteins, respectively. Three iterations of PSI-BLAST were carried out at a cut-off E-value of 1 × 10^-8.

Hybrid-approach based module

The hybrid-approach based SVM module incorporates sequence composition (whole and N-terminal), profile composition and similarity-search based results.
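One way to combine the three kinds of information into a single SVM input is simple concatenation, sketched below. The text does not state how the similarity-search result is encoded, so the one-hot encoding with an extra "no hit" slot, the location labels (taken from the four RH2427 localizations) and all function names are assumptions made for illustration.

```python
# The four localizations covered by the RH2427 dataset.
LOCATIONS = ["cytoplasm", "mitochondria", "nuclear", "extracellular"]


def encode_similarity_hit(hit_location, locations=LOCATIONS):
    """One-hot encode the location of the best similarity-search hit.
    The last slot is set when there is no hit (assumed encoding)."""
    vec = [0.0] * (len(locations) + 1)
    if hit_location in locations:
        vec[locations.index(hit_location)] = 1.0
    else:
        vec[-1] = 1.0
    return vec


def hybrid_vector(aa_comp, pssm_comp, hit_location):
    """Concatenate the 40 composition values, the 400 PSSM-profile values
    and the encoded similarity-search result."""
    if len(aa_comp) != 40 or len(pssm_comp) != 400:
        raise ValueError("expected 40 composition and 400 PSSM values")
    return list(aa_comp) + list(pssm_comp) + encode_similarity_hit(hit_location)


v = hybrid_vector([0.0] * 40, [0.0] * 400, "nuclear")
print(len(v))  # 445
```

With this encoding the hybrid vector has 40 + 400 + 5 = 445 dimensions for the four-location dataset; a dataset with a different number of localizations would change only the last block.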

Comparison with ESLpred

ESLpred2 is an improved version of our previous eukaryotic subcellular localization prediction method, ESLpred. ESLpred had already achieved better accuracy than methods such as SubLoc and NNPSL. An interesting feature of ESLpred2 is its hybrid of protein features (composition of the PSSM profile, whole and N-terminal composition of the sequence, and similarity-search based results), which helps assign the subcellular localization of proteins more reliably and with high accuracy, irrespective of redundancy in the training datasets. The present method is able to complement all existing subcellular location prediction methods. In addition, on the same dataset that was used to develop ESLpred, the present study attained a higher accuracy of ~94%, which is ~6% higher than that achieved by ESLpred.