The data set used in the current study (6975 sequences) is the same as that
used by Bendtsen et al. (2004) to develop the SecretomeP method. These
sequences were extracted from the Swiss-Prot database on the basis of the
subcellular localization annotations in the comment block. Proteins annotated
as extracellular mammalian proteins, secreted via classical and non-classical
pathways, were taken as positive examples (3321 sequences), whereas the
remaining 3654 proteins, annotated as residing in the cytoplasm and/or the
nucleus, were taken as negative examples. Further details of the dataset are
given in Bendtsen et al. (2004).
Neural network architecture
The neural network was implemented, and its architecture for the learning
process generated, with SNNS version 4.2, a freely available simulation package
from Stuttgart University. SNNS allows the resulting networks to be
incorporated into an ANSI C function for use in stand-alone code. A logistic
activation function was used. At the start of each simulation, the weights were
initialized with random values. Training was carried out by error
back-propagation, using a sum-of-squares error function as well as a
mean-square error function. The learning parameter was set to 0.001. The
magnitude of the error sum on the training and test sets was monitored in each
training cycle, and the final number of cycles was taken as the point at which
the network converged during training.
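The training procedure described above can be sketched in a few lines. This is not the SNNS implementation: it is a plain-Python network with one hidden layer, written only to illustrate logistic units, random initial weights, back-propagation on the sum-of-squares error, and per-cycle error monitoring. The hidden-layer size, the weight range, the toy data and the function names are assumptions; only the activation function, the error function and the learning parameter of 0.001 come from the text.

```python
import math
import random


def logistic(x):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, targets, n_hidden=3, lr=0.001, cycles=300, seed=0):
    """Train a one-hidden-layer network; return the error sum of each cycle."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Random initial weights; the last entry of each weight list is the bias.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    errors = []
    for _ in range(cycles):
        sse = 0.0
        for x, t in zip(samples, targets):
            # Forward pass (zip stops before the trailing bias entry).
            hid = [logistic(sum(w * v for w, v in zip(row, x)) + row[-1])
                   for row in w_hid]
            out = logistic(sum(w * h for w, h in zip(w_out, hid)) + w_out[-1])
            sse += (out - t) ** 2
            # Backward pass: gradients of the squared error for this pattern.
            d_out = 2.0 * (out - t) * out * (1.0 - out)
            d_hid = [d_out * w_out[j] * hid[j] * (1.0 - hid[j])
                     for j in range(n_hidden)]
            # Weight updates, applied after each pattern.
            for j in range(n_hidden):
                w_out[j] -= lr * d_out * hid[j]
                for i in range(n_in):
                    w_hid[j][i] -= lr * d_hid[j] * x[i]
                w_hid[j][-1] -= lr * d_hid[j]
            w_out[-1] -= lr * d_out
        errors.append(sse)  # monitored once per training cycle
    return errors


# Toy data (hypothetical): the target simply copies the first input.
toy_samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
toy_targets = [0.0, 0.0, 1.0, 1.0]
toy_errors = train(toy_samples, toy_targets)
print(toy_errors[0], toy_errors[-1])
```

In practice the same error sum would also be computed on a held-out test set in every cycle, and training stopped once it no longer improves.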
Support Vector Machines
In the present study, SVMlight, a freely downloadable SVM package, was used
for the classification of secretory proteins. The software lets the user define
a number of parameters and choose among built-in kernel functions, including
linear, RBF and polynomial. Machine learning techniques work best when the
input patterns are of fixed length. Therefore, several approaches that generate
fixed-length patterns were considered, each based on a different feature of a
protein: amino acid composition, composition of physico-chemical properties,
and dipeptide composition.
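As an illustration of how a fixed-length pattern can be handed to SVMlight, the sketch below writes one feature vector in the sparse text format that SVMlight reads (a target value followed by `index:value` pairs with indices starting at 1 and in increasing order). The function name and the example values are hypothetical, and the choice of +1 for secretory and -1 for non-secretory proteins is an assumption, not something stated in the text.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...'.

    label    -- +1 or -1 (assumed here: +1 secretory, -1 non-secretory)
    features -- fixed-length list of floats; zero entries are omitted,
                which the sparse format permits
    """
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0.0]
    return (f"{label:+d} " + " ".join(pairs)).rstrip()


# Hypothetical three-dimensional pattern for a positive example.
print(to_svmlight_line(1, [0.5, 0.0, 0.25]))  # +1 1:0.5 3:0.25
```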
Thirty-three physico-chemical properties (e.g. hydrophobicity,
hydrophilicity, polarity) were used to represent the proteins, as done recently
by our group for the prediction of subcellular localization of eukaryotic
proteins (Bhasin and Raghava, 2004). The values of each physico-chemical
property for the 20 amino acids were normalized between 0 and 1 using the
standard conversion formula. The input vector therefore has 33 scalar values,
each representing the average value of one physico-chemical property over the
protein.
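A minimal sketch of this representation follows. Two points are assumptions: the "standard conversion formula" is taken to be min-max scaling, (x - min) / (max - min), and the Kyte-Doolittle hydropathy scale is used as a stand-in for one property, since the 33 scales themselves are not listed here. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in for one of the
# 33 properties (the text does not list the scales it used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale a property over the 20 amino acids to the range [0, 1]."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}


def property_composition(sequence, normalized_scales):
    """Average each normalized property over the residues of a protein.

    Returns one value per scale, so 33 scales give a 33-dimensional vector.
    """
    if not sequence:
        raise ValueError("empty sequence")
    return [sum(scale[aa] for aa in sequence) / len(sequence)
            for scale in normalized_scales]


scales = [min_max_normalize(HYDROPATHY)]  # would hold 33 scales in practice
print(property_composition("IR", scales))  # [0.5]
```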
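Amino acid composition is the fraction of each amino acid in a protein.
The fraction of each of the 20 natural amino acids was calculated using
equation 1:

$$\text{fraction of amino acid } i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \qquad (1)$$

where i can be any amino acid.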
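The amino acid composition defined above can be computed as in the following sketch. The function name, the residue ordering and the handling of non-standard residue codes (they are counted in the sequence length but contribute to no component) are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of amino acid fractions."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    total = len(sequence)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]


vector = amino_acid_composition("AAC")
print(len(vector))  # 20
```

For a sequence made up only of the 20 standard residues the components sum to 1.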
Dipeptide composition (e.g. ala-ala, ala-leu), which gives a fixed pattern
length of 400 (20 × 20), encapsulates the global information about each protein
sequence. This representation captures the amino acid composition together
with the local order of the amino acids. The fraction of each dipeptide was
calculated using equation 2:

$$\text{fraction of } dep(i) = \frac{\text{total number of } dep(i)}{\text{total number of dipeptides in the protein}} \qquad (2)$$

where dep(i) is any one of the 400 dipeptides.
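The dipeptide composition can be sketched in the same way. The code assumes that dipeptides are counted as overlapping pairs of adjacent residues, so a protein of length L contains L - 1 dipeptides; this convention, the function name and the ordering of the 400 components are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide fractions."""
    sequence = sequence.upper()
    total = len(sequence) - 1  # number of overlapping adjacent pairs
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


vector = dipeptide_composition("AAC")
print(len(vector))  # 400
```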
In the present study, the SRT-BLAST and SRT-PSI-BLAST modules were also
developed to search a query protein against a database of secretory and
non-secretory sequences using BLAST and PSI-BLAST, respectively. PSI-BLAST was
used in addition to standard BLAST because it can detect remote homologies
(Altschul et al., 1990). It carries out an iterative search in which the
sequences found in one round are used to build a scoring model for the next
round. Three iterations of PSI-BLAST were carried out at a cut-off E-value of
0.001. Depending on the similarity of the query protein to the proteins in the
database, the module either classifies the protein or returns “unknown
classification” if no significant similarity is found.
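The decision step of such a similarity-based module might look like the sketch below. It does not run BLAST or PSI-BLAST; it only shows how a list of already parsed hits could be turned into a class label. The rule used here (take the class of the most significant hit) and the reuse of 0.001 as the significance threshold are assumptions, as are the function name and the label strings.

```python
def classify_by_similarity(hits, evalue_cutoff=0.001):
    """Classify a query from its database hits.

    hits -- list of (class_label, e_value) tuples, one per database hit,
            where class_label is the annotation of the matched sequence
    Returns the label of the most significant hit, or "unknown" when no
    hit reaches the cut-off.
    """
    significant = [hit for hit in hits if hit[1] <= evalue_cutoff]
    if not significant:
        return "unknown"
    best_label, _ = min(significant, key=lambda hit: hit[1])
    return best_label


# Hypothetical hit lists for two queries.
print(classify_by_similarity([("secretory", 1e-30), ("non-secretory", 1e-5)]))
print(classify_by_similarity([("secretory", 0.5)]))
```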
Previously, SVM modules based on a hybrid approach have achieved remarkable
success in the prediction of subcellular localization of proteins. In the
present study, the same approach was used to construct a hybrid SVM module
that integrates several kinds of information about a protein: amino acid
composition, dipeptide composition, and PSI-BLAST output. The SVM was provided
with an input vector of 423 dimensions, consisting of 20 for amino acid
composition, 400 for dipeptide composition, and 3 for the PSI-BLAST output.
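Assembling the 423-dimensional input can be sketched as a simple concatenation. The text does not say how the three PSI-BLAST dimensions are encoded; the one-hot encoding over secretory, non-secretory and unknown used below is an assumption, as are the function and label names.

```python
# Assumed one-hot encoding of the PSI-BLAST result (not given in the text).
PSIBLAST_ENCODING = {
    "secretory": [1.0, 0.0, 0.0],
    "non-secretory": [0.0, 1.0, 0.0],
    "unknown": [0.0, 0.0, 1.0],
}


def hybrid_vector(aa_composition, dipeptide_composition, psiblast_label):
    """Concatenate 20 + 400 + 3 features into one 423-dimensional vector."""
    if len(aa_composition) != 20:
        raise ValueError("expected 20 amino acid composition values")
    if len(dipeptide_composition) != 400:
        raise ValueError("expected 400 dipeptide composition values")
    return (list(aa_composition)
            + list(dipeptide_composition)
            + PSIBLAST_ENCODING[psiblast_label])


# Placeholder compositions, used only to show the resulting dimensions.
example = hybrid_vector([0.05] * 20, [0.0025] * 400, "unknown")
print(len(example))  # 423
```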
Table 1. Detailed results obtained using the different NN- and SVM-based
modules, BLAST and PSI-BLAST
| Method | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
| Composition of Properties (NN) | 73.0 | 73.2 | 73.1 | 0.46 |
| Composition of Properties (SVM) | 74.7 | 80.1 | 77.4 | 0.60 |
| Composition of Amino Acids (NN) | 69.0 | 82.5 | 76.1 | 0.52 |
| Composition of Amino Acids (SVM) | 76.2 | 82.6 | 79.4 | 0.59 |
| Composition of Dipeptides (NN) | 70.0 | 83.4 | 77.1 | 0.54 |
| Composition of Dipeptides (SVM) | 77.0 | 82.2 | 79.9 | 0.59 |
| BLAST | 22.4 | 30.9 | 23.4 | ----- |
| PSI-BLAST | 20.2 | 26.3 | 26.9 | ----- |
| Hybrid (SVM) | 78.9 | 87.1 | 83.2 | 0.66 |
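The four measures reported in Table 1 can be computed from the counts of true and false predictions. The sketch below assumes the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC); the definitions are not restated in this section, and the function name and example counts are hypothetical.

```python
import math


def evaluate(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %, accuracy %, MCC).

    tp, fn -- secretory proteins predicted correctly / incorrectly
    tn, fp -- non-secretory proteins predicted correctly / incorrectly
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a whole row or column of the table is empty;
    # 0.0 is returned in that case by convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical counts for a small test set of 20 proteins.
print(evaluate(tp=8, fn=2, tn=9, fp=1))
```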