Detailed Algorithm & Dataset Information
Dataset Information:- The dataset used for training in this study is the same as that of NNPSL and SubLoc. It was generated by Reinhardt and Hubbard from SWISS-PROT release 33.0 and contains only complete proteins whose subcellular localization has been determined experimentally. The dataset was redundancy-reduced so that no protein has more than 90% sequence identity with any other protein in the dataset. We obtained only the dataset of 2427 eukaryotic proteins, which consists of 1097 nuclear, 684 cytoplasmic, 321 mitochondrial and 325 extracellular proteins.

Support Vector Machine
The support vector machine (SVM) is a universal approximator based on statistical learning and optimisation theory. SVMs are particularly attractive for biological analysis because of their ability to handle noise, large datasets and large input spaces. They have been shown to perform well in protein secondary structure prediction, prediction of MHC and TAP binders, and analysis of microarray data. The basic idea of the SVM can be described as follows: first, the inputs are formulated as feature vectors; second, these feature vectors are mapped into a feature space using a kernel function; third, a division is computed in the feature space that optimally separates the two classes of training vectors. The SVM seeks a global hyperplane that separates the two classes of examples in the training set while avoiding overfitting. The hyperplane found by the SVM is the one that maximises the separating margin between the two classes; this property makes the SVM superior to many other classifiers based on artificial intelligence. The basic idea of the SVM is depicted below.
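As a minimal illustration of this idea only (not the SVM_light setup used in this work), the sketch below trains a binary SVM with an RBF kernel on toy feature vectors using scikit-learn; all data and class sizes here are hypothetical.

# Illustrative sketch: a binary SVM with an RBF kernel on toy feature vectors,
# using scikit-learn rather than the SVM_light package used in the paper.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical 20-dimensional feature vectors (e.g. composition-style features)
# for a positive class (+1) and a negative class (-1).
X_pos = rng.normal(loc=0.06, scale=0.01, size=(50, 20))
X_neg = rng.normal(loc=0.04, scale=0.01, size=(50, 20))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# The RBF kernel implicitly maps the feature vectors into a feature space,
# where a maximum-margin separating hyperplane is computed.
clf = SVC(kernel="rbf", gamma=50, C=1000)
clf.fit(X, y)

# decision_function returns the signed distance from the hyperplane,
# i.e. a single score per example, as described for each SVM module below.
print(clf.decision_function(X[:5]))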
In this study, we have used SVM_light to predict the subcellular localization of proteins. The software is freely downloadable from http://www.cs.cornell.edu/People/tj/svm_light/. It enables users to define a number of parameters and to choose among inbuilt kernel functions, including linear, RBF, polynomial (of a given degree) or a user-defined kernel.
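SVM_light reads training examples in a sparse "label index:value" text format. The helper below is a sketch of writing such a file from fixed-length feature vectors; the file name and the toy vectors are illustrative, not taken from the paper.

# Sketch: write feature vectors in the sparse "label index:value" format
# read by SVM_light's svm_learn program. File names here are hypothetical.
def write_svm_light(path, vectors, labels):
    with open(path, "w") as handle:
        for label, vector in zip(labels, vectors):
            features = " ".join(
                f"{i}:{value:.6f}"
                for i, value in enumerate(vector, start=1)
                if value != 0.0
            )
            handle.write(f"{label:+d} {features}\n")

# Example: two toy 4-dimensional vectors, one positive and one negative.
write_svm_light("train.dat", [[0.1, 0.0, 0.3, 0.6], [0.5, 0.2, 0.0, 0.3]], [1, -1])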
For classifying a protein into a particular localization (out of the four), we constructed four SVMs, known as the cytoplasmic, nuclear, extracellular and mitochondrial SVM modules. Each SVM module is a binary SVM that produces a single score for each protein. A particular SVM (for example, the cytoplasmic one) is trained with all samples of the cytoplasmic class labelled positive and the remaining samples labelled negative. The SVM modules for nuclear, extracellular and mitochondrial proteins are trained in the same way.
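This one-versus-rest scheme can be sketched as follows: one binary label set per location, and the final prediction is the location whose module gives the highest score. The scoring function itself is left abstract here, and the example scores are hypothetical.

# Sketch of the one-versus-rest scheme used by the four SVM modules.
LOCATIONS = ["cytoplasmic", "nuclear", "extracellular", "mitochondrial"]

def one_vs_rest_labels(true_locations, target):
    # +1 for proteins of the target location, -1 for all the rest.
    return [1 if loc == target else -1 for loc in true_locations]

def predict_location(scores):
    # scores: dict mapping each location to its SVM module's output score
    # (e.g. the decision value produced by svm_classify).
    return max(scores, key=scores.get)

# Hypothetical scores for one protein from the four trained modules.
print(predict_location({"cytoplasmic": -0.4, "nuclear": 1.2,
                        "extracellular": -1.1, "mitochondrial": 0.3}))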
Evaluation of Modules:- The performance of all modules developed in this method was evaluated using five-fold cross-validation. In five-fold cross-validation the dataset is divided into five sets of equal size. The training and testing of every module (BLAST and the various SVMs) is carried out five times, each time using one distinct set for testing and the remaining four sets for training. The performance of each module is assessed by calculating the accuracy and Matthews correlation coefficient (MCC). The formulas for accuracy and MCC are shown below, where x can be any subcellular location (nuclear, cytoplasmic, extracellular or mitochondrial), exp(x) is the number of sequences observed in location x, p(x) is the number of correctly predicted sequences of location x, n(x) is the number of correctly predicted sequences not of location x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences.
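The equations themselves are not reproduced on this page; the sketch below implements the standard formulation that is consistent with the definitions above (accuracy(x) = p(x)/exp(x), and a per-location MCC built from p, n, u and o), so treat it as a reconstruction rather than a verbatim copy of the paper's equations.

# Reconstruction of per-location accuracy and MCC from the quantities
# defined in the text: exp(x), p(x), n(x), u(x), o(x).
from math import sqrt

def accuracy(p, exp):
    # Percentage of sequences of location x that are predicted correctly.
    return 100.0 * p / exp

def mcc(p, n, u, o):
    # Matthews correlation coefficient for location x.
    denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0

# Hypothetical counts for one location.
print(accuracy(p=250, exp=321), mcc(p=250, n=2000, u=71, o=60))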
Prediction Approaches:- In this study the authors developed modules based on support vector machines (SVM), similarity search (BLAST) and a combination of SVM and BLAST. The SVM-based modules were trained with various inputs that encapsulate the global information of the protein sequence for more accurate prediction. The various SVM and BLAST modules developed in this study are described below.
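A common way to encode such global information is through fixed-length composition vectors; the sketch below shows amino acid and dipeptide composition as an example encoding. It is given for illustration under the assumption that composition-style inputs are among those used; the exact inputs of each module are described in the paper.

# Sketch: amino acid composition (20 values) and dipeptide composition
# (400 values) as fixed-length global descriptors of a protein sequence.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    length = len(sequence)
    return [100.0 * sequence.count(aa) / length for aa in AMINO_ACIDS]

def dipeptide_composition(sequence):
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = len(pairs)
    return [100.0 * pairs.count(a + b) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]

# Hypothetical short sequence, used only to show the vector lengths.
seq = "MKTLLLTLVVVTIVCLDLGYT"
print(len(aa_composition(seq)), len(dipeptide_composition(seq)))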
Table S1: list of physico-chemical properties.
In the hybrid approach the best results were obtained with the RBF kernel; we also used the polynomial and linear kernels, but the results were poorer in comparison to the RBF kernel. The values of gamma and the regularization parameter C were optimised to 50 and 1000, respectively. The prediction results are shown below in tabular form.
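In SVM_light, training with an RBF kernel and these parameter values corresponds to the standard svm_learn options -t 2 (RBF kernel), -g (gamma) and -c (C). The commands below are a sketch only, and the file names are hypothetical placeholders.

# Sketch: invoking SVM_light's svm_learn / svm_classify with an RBF kernel,
# gamma = 50 and C = 1000. File names (train.dat, test.dat, pred.out and the
# model file) are placeholders, not those used by the authors.
import subprocess

subprocess.run(["svm_learn", "-t", "2", "-g", "50", "-c", "1000",
                "train.dat", "cytoplasmic.model"], check=True)
subprocess.run(["svm_classify", "test.dat", "cytoplasmic.model", "pred.out"],
               check=True)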
Comparative performance of Modules
The systematic increase in prediction accuracy and MCC, from the module based only on amino acid composition to the module based on the hybrid approach, is depicted below.
Reliability index
The calculation of a reliability index is important for knowing how reliable a prediction is. In this study we followed the simple strategy of Hua, S. and Sun, Z. (2001) for assigning the reliability index (RI). The RI is assigned on the basis of the difference between the highest and the second-highest output scores of the SVMs for the various subcellular localizations. A curve of prediction accuracy against reliability index equal to a particular value is plotted below.
The curve shows that the expected accuracy of sequences with RI=3 is 94.4%, which is better than that of existing methods. A further calculation shows that nearly 74% of sequences have RI greater than 2, and the expected accuracy for these sequences is 96.4%.
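Following that description, the RI can be computed from the gap between the two highest module scores. The binning used below (scaling the difference and clamping it to 1-5) is an illustrative assumption in the spirit of Hua and Sun (2001), not the exact mapping used in the paper.

# Sketch: reliability index (RI) from the difference between the highest and
# second-highest SVM module scores. The scaling and the 1..5 range are
# assumptions made for illustration only.
def reliability_index(scores, scale=2.0, max_ri=5):
    ordered = sorted(scores.values(), reverse=True)
    diff = ordered[0] - ordered[1]
    return min(max_ri, int(diff * scale) + 1)

# Hypothetical scores from the four modules for one protein.
print(reliability_index({"cytoplasmic": -0.4, "nuclear": 1.2,
                         "extracellular": -1.1, "mitochondrial": 0.3}))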