Algorithm

    The data set used in the current study (6975 sequences) is the same as that used by Bendtsen et al. (2004) to develop the SecretomeP method. The sequences were extracted from the Swiss-Prot database on the basis of the subcellular localization annotations in the comment block. Proteins annotated as extracellular mammalian proteins, secreted via classical and non-classical pathways, were taken as positive examples (3321 sequences), whereas the remaining 3654 proteins, annotated as residing in the cytoplasm and/or the nucleus, were taken as negative examples. Further details of the data set are given in Bendtsen et al. (2004).

    Neural network architecture

    The neural network architecture was generated, and the networks were trained, with the freely available simulation package SNNS, version 4.2, from Stuttgart University. SNNS allows the resulting networks to be exported as an ANSI C function for use in stand-alone code. A logistic activation function is used. At the start of each simulation, the weights are initialized with random values. Training has been carried out using error back-propagation with a sum-of-squares error function as well as a mean-square error function. The learning parameter has been set to 0.001. The magnitude of the error sum on the training and test sets is monitored in each training cycle, and the final number of cycles is taken as the point at which the network converges during training.
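The training procedure described above can be illustrated with a small from-scratch sketch. It is not SNNS and not the network used in the study: the single hidden layer, the hidden-layer size, the weight range, the toy patterns and all function names are assumptions made for illustration. Only the logistic activation, the random initial weights, back-propagation on a squared error, the learning parameter of 0.001 and the per-cycle monitoring of training and test error come from the text.

```python
import math
import random


def logistic(x):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, rng):
    """Random initial weights; the last weight of every unit is its bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, x):
    """Return the hidden activations and the single output activation."""
    hidden, output = network
    # zip() stops at the shorter sequence, so the trailing bias is added separately.
    h = [logistic(sum(w * v for w, v in zip(ws, x)) + ws[-1]) for ws in hidden]
    y = logistic(sum(w * v for w, v in zip(output, h)) + output[-1])
    return h, y


def sum_squared_error(network, patterns):
    """Sum of squared errors over a list of (input, target) patterns."""
    return sum((t - forward(network, x)[1]) ** 2 for x, t in patterns)


def train_cycle(network, patterns, learning_rate):
    """One pass of online error back-propagation over all patterns."""
    hidden, output = network
    for x, target in patterns:
        h, y = forward(network, x)
        # Output error signal for squared error with a logistic unit
        # (the constant factor 2 is absorbed into the learning rate).
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden error signals use the output weights before they are updated.
        delta_hid = [delta_out * output[j] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        for j, hj in enumerate(h):
            output[j] -= learning_rate * delta_out * hj
        output[-1] -= learning_rate * delta_out
        for ws, dj in zip(hidden, delta_hid):
            for i, xi in enumerate(x):
                ws[i] -= learning_rate * dj * xi
            ws[-1] -= learning_rate * dj


def train(network, train_set, test_set, learning_rate=0.001, cycles=200):
    """Train and record the error sum on both sets after every cycle."""
    history = []
    for _ in range(cycles):
        train_cycle(network, train_set, learning_rate)
        history.append((sum_squared_error(network, train_set),
                        sum_squared_error(network, test_set)))
    return history


# Toy patterns (hypothetical, two inputs each) just to exercise the code.
rng = random.Random(0)
toy_train = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
             ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
toy_test = [([0.9, 0.9], 1.0), ([0.1, 0.2], 0.0)]
net = init_network(n_inputs=2, n_hidden=3, rng=rng)
initial_error = sum_squared_error(net, toy_train)
history = train(net, toy_train, toy_test)
```

Each entry of `history` holds the training and test error sums for one cycle, so the cycle at which both curves flatten can be read off as the stopping point.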

    Support Vector Machines

    In the present study, SVMlight, a freely downloadable SVM package, has been used for the classification of secretory proteins. The software lets the user set a number of parameters and choose among built-in kernel functions, including linear, RBF and polynomial. Machine learning techniques are more successful when the input patterns are of fixed length. Therefore, the present study considers several approaches that generate fixed-length patterns from different features of a protein: amino acid composition, composition of physico-chemical properties and dipeptide composition.
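SVMlight reads each training example as one text line of the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The helper below is a minimal sketch of how a fixed-length pattern could be written in that format; the function name, the choice of +1 for secretory proteins and the omission of zero-valued features are assumptions made here, not details given in the text.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' (indices start at 1)."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    # Zero-valued features are left out, which the sparse format allows.
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)


# Hypothetical three-feature pattern for a positive (assumed: secretory) example.
example_line = to_svmlight_line(1, [0.5, 0.0, 0.25])
```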

    Composition of physico-chemical properties

    Thirty-three physico-chemical properties (e.g. hydrophobicity, hydrophilicity, polarity) were used to represent the proteins, as in our group's recent work on predicting the subcellular localization of eukaryotic proteins (Bhasin and Raghava, 2004). For each property, the values of the 20 amino acids were normalized to lie between 0 and 1 using the standard conversion formula. The input vector has 33 scalar values, each the average value of one physico-chemical property over the protein.
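As a rough illustration of this feature, the sketch below normalizes one property scale and averages it over a sequence. Two things are assumed rather than taken from the text: that the "standard conversion formula" is min-max scaling, (x - min) / (max - min), and that the Kyte-Doolittle hydropathy scale can stand in for one of the 33 properties, whose actual tables are not listed here. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used only as an illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale the 20 amino acid values of one property to the range 0..1."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (v - lo) / (hi - lo) for aa, v in scale.items()}


def property_composition(sequence, normalized_scales):
    """Average each normalized property over the residues of a protein."""
    # Residues outside the 20 standard amino acids are skipped (an assumption).
    residues = [aa for aa in sequence.upper() if aa in normalized_scales[0]]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [sum(scale[aa] for aa in residues) / len(residues)
            for scale in normalized_scales]


normalized_hydropathy = min_max_normalize(HYDROPATHY)
# With all 33 normalized scales in the list, the result would have 33 values.
vector = property_composition("IR", [normalized_hydropathy])
```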

    Amino acid composition

    Amino acid composition is the fraction of each amino acid in a protein. The fractions of all 20 natural amino acids were calculated using equation 1.

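    $$\text{Fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of all amino acids in the protein}} \qquad (1)$$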

    where i can be any of the 20 amino acids.
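A minimal sketch of the amino acid composition follows. The function name and the alphabetical ordering of the 20 residues are arbitrary choices, and dividing by the number of standard residues only (so that ambiguous symbols such as X are ignored) is an assumption, not something stated in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids, one-letter codes


def amino_acid_composition(sequence):
    """Return the 20 amino acid fractions of a protein sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # standard residues only (assumption)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


composition = amino_acid_composition("AAC")
```

For the toy sequence `AAC`, the fractions of A and C are 2/3 and 1/3, and the remaining 18 entries are zero.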

    Dipeptide composition

    Dipeptide compositions (e.g. ala-ala, ala-leu), which give a fixed pattern length of 400 (20 × 20), encapsulate global information about each protein sequence. This representation captures the amino acid composition along with the local order of the amino acids. The fraction of each dipeptide was calculated using equation 2.

    $$\text{Fraction of } dep(i) = \frac{\text{total number of } dep(i)}{\text{total number of dipeptides in the protein}} \qquad (2)$$

    where dep(i) is one of the 400 dipeptides.
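The dipeptide composition can be sketched in the same way, counting overlapping pairs of adjacent residues. The function name and the ordering of the 400 dipeptides are arbitrary, and skipping pairs that contain a non-standard residue is an assumption made for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dp: k for k, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return the 400 dipeptide fractions of a protein sequence."""
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    for k in range(len(seq) - 1):
        pair = seq[k:k + 2]
        if pair in DIPEPTIDE_INDEX:  # pairs with non-standard residues are skipped
            counts[DIPEPTIDE_INDEX[pair]] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [count / total for count in counts]


dipeptide_vector = dipeptide_composition("AAC")
```

The toy sequence `AAC` contains two dipeptides, AA and AC, so each gets a fraction of 0.5.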

    SRT-BLAST and SRT-PSI-BLAST

    In the present study, SRT-BLAST and SRT-PSI-BLAST modules were also developed to search a query protein against a database of secretory and non-secretory sequences using BLAST and PSI-BLAST, respectively. PSI-BLAST was used in addition to standard BLAST because it can detect remote homologies (Altschul et al., 1990). It carries out an iterative search in which the sequences found in one round are used to build a score model for the next round. Three iterations of PSI-BLAST were carried out at an E-value cut-off of 0.001. Depending on the similarity of the query protein to the proteins in the database, the module classifies the query, or returns "unknown classification" if no significant similarity is found.
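One plausible reading of this decision step is sketched below. The text does not spell out the exact rule, so taking the class of the hit with the lowest E-value is an assumption, as are the function name and the representation of hits as already-parsed `(class_label, e_value)` pairs. Running BLAST or PSI-BLAST and parsing its output are outside the sketch.

```python
def classify_by_best_hit(hits, evalue_cutoff=0.001):
    """Assign the class of the most significant database hit, if there is one.

    hits: list of (class_label, e_value) pairs, one per database hit.
    """
    significant = [hit for hit in hits if hit[1] <= evalue_cutoff]
    if not significant:
        return "unknown classification"
    best_label, _ = min(significant, key=lambda hit: hit[1])
    return best_label


# Hypothetical hit lists for two query proteins.
call_a = classify_by_best_hit([("secretory", 1e-30), ("non-secretory", 1e-5)])
call_b = classify_by_best_hit([("secretory", 0.5)])
```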

    Hybrid SVM module

    Previously, hybrid SVM modules have achieved remarkable success in predicting the subcellular localization of proteins. In the present study, the same approach was used to construct a hybrid SVM module that integrates several kinds of information about a protein: amino acid composition, dipeptide composition and PSI-BLAST output. The SVM was given an input vector of 423 dimensions: 20 for amino acid composition, 400 for dipeptide composition and 3 for PSI-BLAST output.
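The 423-dimensional input can be assembled by simple concatenation, as in the sketch below. How the PSI-BLAST output is turned into three numbers is not stated in the text; the one-of-three encoding used here (secretory, non-secretory, unknown) is a guess, and the names `PSI_BLAST_CODES` and `hybrid_vector` are hypothetical.

```python
# Assumed encoding of the PSI-BLAST call as three binary features.
PSI_BLAST_CODES = {
    "secretory": [1.0, 0.0, 0.0],
    "non-secretory": [0.0, 1.0, 0.0],
    "unknown": [0.0, 0.0, 1.0],
}


def hybrid_vector(aa_composition, dipeptide_composition, psi_blast_call):
    """Concatenate 20 + 400 + 3 features into one 423-dimensional vector."""
    if len(aa_composition) != 20:
        raise ValueError("expected 20 amino acid fractions")
    if len(dipeptide_composition) != 400:
        raise ValueError("expected 400 dipeptide fractions")
    return (list(aa_composition) + list(dipeptide_composition)
            + PSI_BLAST_CODES[psi_blast_call])


# Placeholder compositions (uniform values) just to show the layout.
vector = hybrid_vector([0.05] * 20, [0.0025] * 400, "unknown")
```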

    Table 1. Detailed results obtained using different SVM-based modules, PSI-BLAST and BLAST

    Module                             Sensitivity (%)  Specificity (%)  Accuracy (%)  MCC
    Composition of Properties (NN)     73.0             73.2             73.1          0.46
    Composition of Properties (SVM)    74.7             80.1             77.4          0.60
    Composition of Amino Acids (NN)    69.0             82.5             76.1          0.52
    Composition of Amino Acids (SVM)   76.2             82.6             79.4          0.59
    Composition of Dipeptides (NN)     70.0             83.4             77.1          0.54
    Composition of Dipeptides (SVM)    77.0             82.2             79.9          0.59
    BLAST                              22.4             30.9             23.4          -----
    PSI-BLAST                          20.2             26.3             26.9          -----
    Hybrid (SVM)                       78.9             87.1             83.2          0.66
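The four measures reported in Table 1 are not defined on this page. Assuming the usual definitions from a two-class confusion matrix (true/false positives and negatives), they could be computed as in the sketch below. The function name and the toy counts are hypothetical, and returning an MCC of 0 when its denominator is zero is a convention chosen here.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy (in %) plus MCC from raw counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; use 0 then.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


# Toy counts: 10 positive and 10 negative examples.
sens, spec, acc, mcc = performance_measures(tp=8, tn=9, fp=1, fn=2)
```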

     

Institute of Microbial Technology