The data set used in the current study (6975 sequences) is the same as that used by Bendtsen et al. (2004) to develop the SecretomeP method. The sequences were extracted from the Swiss-Prot database on the basis of the subcellular localization annotations in the comment block. Mammalian proteins annotated as extracellular were taken as positive examples (3321 sequences), secreted via classical and non-classical pathways, whereas the remaining 3654 proteins, annotated as residing in the cytoplasm and/or the nucleus, were taken as negative examples. Further details of the data set are given in Bendtsen et al. (2004).

Neural network architecture

The neural networks were implemented and their architecture generated with SNNS version 4.2, a freely available simulation package from Stuttgart University. SNNS allows the resulting networks to be exported as an ANSI C function for use in stand-alone code. A logistic activation function was used. At the start of each simulation, the weights were initialized with random values. Training was carried out by error back-propagation using a sum of squared errors function as well as a mean squared error function. The learning parameter was set to 0.001. The magnitude of the error sum on the training and test sets was monitored in each training cycle. The final number of cycles was set at the point where the network converged during training.
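The training procedure described above can be illustrated with a minimal pure-Python sketch. It is not the SNNS implementation: the single hidden layer, its size, the toy data and all function names are assumptions made for illustration. Only the logistic activation, the random initial weights, the squared-error objective and the learning parameter of 0.001 come from the text.

```python
import math
import random


def logistic(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, rng):
    """Random initial weights; the last entry of each row is a bias (assumed layout)."""
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    """Return hidden activations and the single output for input vector x."""
    hidden = [logistic(sum(w * v for w, v in zip(row, x)) + row[-1])
              for row in w_hidden]
    out = logistic(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, out


def sum_squared_error(data, w_hidden, w_out):
    """Sum of squared errors over a list of (input, target) pairs."""
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)


def train_epoch(data, w_hidden, w_out, rate=0.001):
    """One cycle of online error back-propagation (constant factor 2 folded into rate)."""
    for x, target in data:
        hidden, out = forward(x, w_hidden, w_out)
        delta_out = (target - out) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            w_out[j] += rate * delta_out * h
        w_out[-1] += rate * delta_out
        for j, row in enumerate(w_hidden):
            for i, v in enumerate(x):
                row[i] += rate * delta_hidden[j] * v
            row[-1] += rate * delta_hidden[j]


# Toy data (hypothetical): two input features, target 1 = secretory, 0 = non-secretory.
toy_data = [([0.9, 0.1], 1.0), ([0.1, 0.9], 0.0),
            ([0.8, 0.2], 1.0), ([0.2, 0.8], 0.0)]
rng = random.Random(0)
w_hidden, w_out = init_network(n_inputs=2, n_hidden=3, rng=rng)
error_before = sum_squared_error(toy_data, w_hidden, w_out)
for _ in range(200):
    train_epoch(toy_data, w_hidden, w_out, rate=0.001)
error_after = sum_squared_error(toy_data, w_hidden, w_out)
```

In practice the error would be tracked on a separate test set in every cycle, as described above, and training stopped once it no longer improves.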

Support Vector Machines

In the present study, SVMlight, a freely downloadable SVM package, was used to classify secretory proteins. The software lets users set a number of parameters and choose among built-in kernel functions, including linear, RBF and polynomial. Machine learning techniques work best when the input patterns are of fixed length. Therefore, several approaches that generate fixed-length patterns from different protein features were considered: amino acid composition, composition of physico-chemical properties and dipeptide composition.
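As an illustration of how a fixed-length pattern can be handed to SVMlight, the sketch below writes one example in the sparse text format that SVMlight reads (a target value followed by index:value pairs with indices starting at 1). The function name and the choice to omit zero-valued features are assumptions for illustration, not details taken from the study.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight input line.

    label: positive for secretory, otherwise non-secretory (assumed coding).
    features: fixed-length list of floats; zero entries are omitted (sparse format).
    """
    target = "+1" if label > 0 else "-1"
    pairs = " ".join(f"{index}:{value:.6g}"
                     for index, value in enumerate(features, start=1)
                     if value != 0.0)
    return f"{target} {pairs}".rstrip()


example_line = to_svmlight_line(1, [0.5, 0.0, 0.25])
```

A file of such lines, one per protein, could then be passed to the SVMlight training program with the desired kernel option.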

Composition of physico-chemical properties

Proteins were represented by 33 physico-chemical properties (e.g. hydrophobicity, hydrophilicity, polarity), as used recently by our group for the prediction of subcellular localization of eukaryotic proteins (Bhasin and Raghava, 2004). The values of each physico-chemical property for the 20 amino acids were normalized between 0 and 1 using the standard conversion formula. The input vector has 33 scalar values, each representing the average value of one physico-chemical property over the protein.
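A minimal sketch of this feature is given below. It assumes that the "standard conversion formula" is min-max scaling, (x - min) / (max - min), which the text does not state explicitly. The Kyte-Doolittle hydropathy scale is used only as an illustrative stand-in for one property; the 33 scales actually used are not listed here, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale a property scale to [0, 1] (assumed form of the conversion formula)."""
    lowest, highest = min(scale.values()), max(scale.values())
    return {aa: (value - lowest) / (highest - lowest)
            for aa, value in scale.items()}


def property_composition(sequence, normalized_scales):
    """Return the average value of each normalized property over the sequence.

    Residues missing from a scale (e.g. X) are skipped, which is an assumption.
    """
    features = []
    for scale in normalized_scales:
        values = [scale[aa] for aa in sequence.upper() if aa in scale]
        if not values:
            raise ValueError("sequence contains no standard amino acids")
        features.append(sum(values) / len(values))
    return features


normalized = min_max_normalize(KYTE_DOOLITTLE)
example_features = property_composition("IR", [normalized])
```

With all 33 normalized scales in the list, the returned vector would have the 33 values described above.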

Amino acid composition

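Amino acid composition is the fraction of each amino acid in a protein. The fraction of each of the 20 natural amino acids was calculated using equation 1.

$$\text{fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \qquad (1)$$

where i can be any of the 20 amino acids.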
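The amino acid composition can be computed as in the following sketch. The function name, the alphabetical one-letter ordering of the output vector and the decision to ignore non-standard residues are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 natural amino acids, one-letter codes


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each amino acid.

    Non-standard characters (e.g. X, B, Z) are ignored, which is an assumption.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


composition = amino_acid_composition("AACD")
```

For the toy sequence AACD, the entries for A, C and D are 0.5, 0.25 and 0.25, and all other entries are 0.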

Dipeptide composition

Dipeptide composition (e.g. Ala-Ala, Ala-Leu) gives a fixed pattern length of 400 (20 × 20) and encapsulates global information about each protein sequence. This representation captures the amino acid composition along with the local order of the amino acids. The fraction of each dipeptide was calculated using equation 2.

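$$\text{fraction of } dep(i) = \frac{\text{total number of } dep(i)}{\text{total number of dipeptides in the protein}} \qquad (2)$$

where dep(i) is one of the 400 dipeptides.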
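A sketch of the dipeptide composition follows. It assumes that dipeptides are counted over overlapping adjacent residue pairs and that pairs containing a non-standard residue are skipped; neither detail is stated in the text, and the names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each dipeptide.

    Overlapping adjacent pairs are counted (assumption); pairs containing a
    non-standard residue are ignored (assumption).
    """
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [pair for pair in pairs if pair in DIPEPTIDE_INDEX]
    if not pairs:
        raise ValueError("sequence contains no valid dipeptides")
    counts = [0] * len(DIPEPTIDES)
    for pair in pairs:
        counts[DIPEPTIDE_INDEX[pair]] += 1
    return [count / len(pairs) for count in counts]


dipeptide_features = dipeptide_composition("AAC")
```

For the toy sequence AAC the two dipeptides AA and AC each receive a fraction of 0.5.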

SRT-BLAST and SRT-PSI-BLAST

In the present study, SRT-BLAST and SRT-PSI-BLAST modules were also developed to search a query protein against a database of secretory and non-secretory sequences using BLAST and PSI-BLAST, respectively. PSI-BLAST was used in addition to standard BLAST because it can detect remote homologies (Altschul et al., 1990). It carries out an iterative search in which the sequences found in one round are used to build a score model for the next round. Three iterations of PSI-BLAST were carried out at an E-value cut-off of 0.001. The module classifies the query protein according to its similarity to the proteins in the database and returns “unknown classification” if no significant similarity is found.
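The classification step of such a module might look like the sketch below. The exact decision rule is not given in the text, so taking the class of the most significant hit is an assumption, as are the function name and the (class label, E-value) representation of hits. Running BLAST or PSI-BLAST itself is outside the scope of the sketch.

```python
def classify_from_hits(hits, evalue_cutoff=0.001):
    """Assign a class from similarity-search hits.

    hits: list of (class_label, e_value) tuples for database proteins,
          e.g. ("secretory", 1e-20). This representation is hypothetical.
    Returns the class of the hit with the lowest E-value at or below the
    cut-off, or "unknown" when no hit is significant.
    """
    significant = [(e_value, label) for label, e_value in hits
                   if e_value <= evalue_cutoff]
    if not significant:
        return "unknown"
    return min(significant)[1]


prediction = classify_from_hits([("non-secretory", 1e-5), ("secretory", 1e-30)])
no_hit = classify_from_hits([("secretory", 0.5)])
```

Here the first query is assigned to the secretory class because its best hit is a secretory protein, and the second is returned as unknown because its only hit is above the cut-off.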

Hybrid SVM module

Hybrid SVM modules have previously achieved remarkable success in the prediction of subcellular localization of proteins. In the present study, the same approach was used to construct a hybrid SVM module. The module integrates several kinds of information about a protein: amino acid composition, dipeptide composition and PSI-BLAST output. The SVM was provided with an input vector of 423 dimensions, consisting of 20 for amino acid composition, 400 for dipeptide composition and 3 for PSI-BLAST output.
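The assembly of the 423-dimensional input vector can be sketched as follows. How the three PSI-BLAST values are encoded is not specified in the text; the one-hot encoding over secretory, non-secretory and unknown used here is an assumption, and the function and constant names are hypothetical.

```python
PSI_BLAST_CLASSES = ("secretory", "non-secretory", "unknown")  # assumed encoding order


def hybrid_vector(aa_composition, dipeptide_composition, psi_blast_class):
    """Concatenate 20 + 400 + 3 features into one 423-element input vector."""
    if len(aa_composition) != 20:
        raise ValueError("expected 20 amino acid composition values")
    if len(dipeptide_composition) != 400:
        raise ValueError("expected 400 dipeptide composition values")
    if psi_blast_class not in PSI_BLAST_CLASSES:
        raise ValueError(f"unexpected PSI-BLAST class: {psi_blast_class!r}")
    # One-hot encoding of the PSI-BLAST result (assumption).
    one_hot = [1.0 if psi_blast_class == name else 0.0
               for name in PSI_BLAST_CLASSES]
    return list(aa_composition) + list(dipeptide_composition) + one_hot


# Placeholder feature values, for illustration only.
vector = hybrid_vector([0.05] * 20, [0.0025] * 400, "secretory")
```

The resulting list has 423 entries, with the last three carrying the PSI-BLAST result.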

Table 1. Detailed results obtained using different NN- and SVM-based modules, BLAST and PSI-BLAST

Module	Sensitivity (%)	Specificity (%)	Accuracy (%)	MCC
Composition of Properties (NN)	73.0	73.2	73.1	0.46
Composition of Properties (SVM)	74.7	80.1	77.4	0.60
Composition of Amino Acids (NN)	69.0	82.5	76.1	0.52
Composition of Amino Acids (SVM)	76.2	82.6	79.4	0.59
Composition of Dipeptides (NN)	70.0	83.4	77.1	0.54
Composition of Dipeptides (SVM)	77.0	82.2	79.9	0.59
BLAST	22.4	30.9	23.4	-----
PSI-BLAST	20.2	26.3	26.9	-----
Hybrid (SVM)	78.9	87.1	83.2	0.66
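The measures reported in Table 1 are not defined in this section. The sketch below uses their commonly used definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), which may differ in detail from those used in the study. The function name and the example counts are hypothetical.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Return sensitivity, specificity and accuracy (in %) and the MCC.

    Standard confusion-matrix definitions are assumed. MCC is returned as 0.0
    when its denominator is zero (a common convention, assumed here).
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical counts, not taken from the study.
sens, spec, acc, mcc = performance_measures(tp=8, tn=9, fp=1, fn=2)
```

MCC ranges from -1 to 1 and, unlike accuracy, takes all four cells of the confusion matrix into account, which makes it a useful summary when the two classes differ in size.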