Experimental Data and Pattern generation:-
The prediction of 20S proteasome cleavage sites was based on digests
of yeast enolase I and bovine β-casein by the 20S proteasome. The data
for the enolase I and β-casein digests were obtained from Toes et al.,
2001 and Emmerich et al., 2000, respectively. The proteasome generated
109 fragments from enolase I using 136 distinct cleavage sites, and
63 fragments from β-casein using 48 distinct cleavage sites. To generate
patterns for the computational classifiers, the residues of enolase
I and β-casein were divided into cleavage and non-cleavage sites. The
residue on the N-terminal side of an experimentally proven cleavage
site (the P1 residue) is assigned as a cleavage site, whereas all
remaining residues are assigned as non-cleavage sites. We used a
sequence window of seven amino acids. The central amino acid of a
window is designated a cleavage site if the P1 residue occurs at the
center of the window; otherwise it is designated a non-cleavage site.
The actual cleavage occurs between the central residue and the
following (C-terminal) residue. For each window, the classifier
predicts whether the central residue is a cleavage or a non-cleavage
site. For SVMs, cleavage and non-cleavage sites are labeled "+1" and
"-1", respectively. Each amino acid is represented by 21 binary
positions (conventional sparse encoding). The 20 amino acids are coded
as A=100000000000000000000, G=010000000000000000000 and so on. The
21st bit is added to handle windows that extend past the incomplete
or terminal parts of proteins. The patterns for all classifiers (SVMs,
WEKA, PEBLS) were generated using the same amino acid encoding.
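The window extraction and sparse encoding described above can be sketched in Python as follows. This is a minimal illustration, not the code used in this work: the residue order beyond A and G, the padding symbol "X", mapping unknown residues to the 21st bit, and all function names are assumptions. The output line follows the "label index:value" layout of SVMlight input files, with 1-based feature indices in increasing order.

```python
# Hypothetical residue order: only A (bit 1) and G (bit 2) are fixed by the text.
AMINO_ACIDS = "AGVLIPFWMSTCYNQDEKRH"
PAD = "X"          # assumed placeholder for positions beyond the protein termini
WINDOW = 7         # window size used in the text
HALF = WINDOW // 2


def encode_residue(res):
    """Return the 21-bit sparse encoding of one residue.

    Bits 1-20 encode the amino acids; bit 21 is set for padding
    (and, as an assumption, for any unrecognised symbol).
    """
    bits = [0] * 21
    idx = AMINO_ACIDS.find(res)
    bits[idx if idx >= 0 else 20] = 1
    return bits


def windows(sequence, p1_positions):
    """Yield (label, window) for every residue of the sequence.

    p1_positions holds 0-based indices of P1 residues, i.e. residues
    immediately N-terminal of an experimentally proven cleavage.
    """
    padded = PAD * HALF + sequence + PAD * HALF
    for i in range(len(sequence)):
        label = 1 if i in p1_positions else -1
        yield label, padded[i:i + WINDOW]


def encode_window(window):
    """Concatenate the residue encodings: 7 * 21 = 147 binary values."""
    vec = []
    for res in window:
        vec.extend(encode_residue(res))
    return vec


def to_svmlight_line(label, vec):
    """Format one pattern as an SVMlight input line (non-zero features only)."""
    feats = " ".join(f"{i + 1}:1" for i, v in enumerate(vec) if v)
    return f"{label:+d} {feats}"
```

For a toy sequence "MAGKV" with a cleavage after G (index 2), `windows("MAGKV", {2})` yields five windows, of which only the one centered on G is labeled +1.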
Independent dataset test: -
Furthermore, as a demonstration of practical application and as an
unbiased test, the SVM-based method was evaluated on an independent
dataset. The dataset was obtained from Saxova et al., 2003, where it
was used to evaluate the existing prediction methods. It consists of
three proteins, SSX2, Nef and RU1, collected from the literature. None
of the patterns generated from these proteins occurs in the training
or testing sets used in the development of the method.
Support Vector Machine (SVM): -
A support vector machine (SVM) is a promising margin classifier that
draws on theory from both statistics and optimization. SVMs are trained
with a learning algorithm from optimization theory and implement a
learning bias derived from statistical learning theory. An SVM
constructs an optimal hyperplane in a high-dimensional feature space;
this hyperplane defines the boundary that maximizes the margin between
the data samples of the two classes, which gives better generalization.
SVMs are increasingly used for biological problems, including
prediction of MHC binders, secondary structure prediction and
microarray analysis. SVMs perform well compared with other learning
algorithms because they effectively control the classifier's capacity
and the associated potential for overfitting. In this work SVMlight,
an implementation of SVM in the C language, was used to predict
cleavage and non-cleavage sites. The SVMlight package can be downloaded
from http://www.joachims.org for non-commercial or academic use.
SVMlight has options to select or define a kernel and to vary the
kernel parameters. The relevant mathematical principle of SVM can be
briefly described as follows. The n training examples (cleavage and
non-cleavage) are represented by the vectors (Xc, Xn). The SVM maps
the examples to a high-dimensional feature space and tries to find
the hyperplane for which the margin between the cleavage and
non-cleavage vectors is maximal.
[Figure: schematic representation of the process]

An SVM usually maps the objects to the high-dimensional space by
computing the corresponding kernel between two objects X and Y, which
is defined by the following equation.
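
In the usual formulation, and assuming a feature map Φ into the high-dimensional space (the symbol Φ is notation introduced here for illustration, not taken from the original), the kernel is the inner product of the two mapped objects:

```latex
K(X, Y) = \Phi(X) \cdot \Phi(Y)
```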

The kernels most widely used for biological problems are the polynomial
and RBF kernels, and both have been used in this work. The polynomial
kernel is defined by the following equation, where d specifies the
degree of the polynomial kernel.
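
A common form of the polynomial kernel is sketched here with a scaling factor s and an offset r, the two additional polynomial-kernel parameters that SVMlight exposes. That this is the exact form used in this work is an assumption:

```latex
K(X, Y) = (s \, X \cdot Y + r)^{d}
```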

The RBF kernel is defined by the following equation.
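
The standard form of the RBF kernel is given here as a sketch, with the width parameter written as γ (assumed to correspond to the gamma factor tuned in this work):

```latex
K(X, Y) = \exp\left(-\gamma \, \lVert X - Y \rVert^{2}\right)
```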

The prediction models are fine-tuned by changing the parameter C,
which controls the trade-off between the training error and the margin.
The performance of the different models was measured on the testing
set. The model with the minimum error and the maximum correlation
coefficient on the test set is considered the best model. This best
model was then applied to the blind dataset, which is unseen by the
prediction model.
Evaluation of Method Performance: -
Five-set cross-validation was used to develop and evaluate the
prediction methods. In five-set cross-validation, five subsets are
constructed randomly from the dataset. For SVM, WEKA and PEBLS, each
run uses a training set and a testing set: the training set consists
of four subsets and testing is done on the remaining fifth subset.
This process is repeated five times so that each subset is used once
for testing and four times for training. In this way five models are
generated. The final prediction results are averaged over the five
testing sets.
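
The splitting scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed and the round-robin assignment of shuffled patterns to subsets are hypothetical choices, not details from the original work.

```python
import random


def five_set_splits(patterns, n_sets=5, seed=0):
    """Yield (train, test) pairs for five-set cross-validation.

    The patterns are shuffled once, dealt into n_sets subsets, and each
    subset is used exactly once as the testing set while the remaining
    subsets together form the training set.
    """
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment gives subsets whose sizes differ by at most one.
    subsets = [shuffled[i::n_sets] for i in range(n_sets)]
    for k in range(n_sets):
        test = subsets[k]
        train = [p for j, s in enumerate(subsets) if j != k for p in s]
        yield train, test


def average(scores):
    """Average a per-fold score (e.g. MCC) over the testing sets."""
    return sum(scores) / len(scores)
```

Each of the five (train, test) pairs would be used to train and evaluate one model, and the five test-set scores would then be passed to `average`.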
Performance Measures:
Six parameters were used in the present work to measure the
performance of the 20S proteasome prediction method, as described by
Kesmir et al., 2002. These six parameters are derived from four
scalar quantities: TP (number of correctly identified cleavage sites),
TN (number of correctly identified non-cleavage sites), FP
(experimentally verified non-cleavage sites predicted as cleavage
sites) and FN (experimentally verified cleavage sites predicted as
non-cleavage sites). The parameters derived from these values are
as follows: i) sensitivity, the percentage of correctly identified
cleavage sites; ii) specificity, the percentage of correctly identified
non-cleavage sites; iii) PPV and NPV, the positive and negative
predictive values, respectively; iv) accuracy, the percentage of
correctly identified cleavage and non-cleavage sites; and v) Matthews
correlation coefficient (MCC), a more stringent measure of prediction
accuracy that accounts for both under- and over-predictions. These
parameters are calculated by using the following equations:
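
The standard definitions of these six measures in terms of TP, TN, FP and FN are given here as a sketch. They are assumed to match the equations of Kesmir et al., 2002, with the percentages obtained by multiplying by 100:

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN} \times 100 \\
\text{Specificity} &= \frac{TN}{TN + FP} \times 100 \\
\text{PPV} &= \frac{TP}{TP + FP} \times 100 \\
\text{NPV} &= \frac{TN}{TN + FN} \times 100 \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \times 100 \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                   {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned}
```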
Results:
After many hours of computation, the parameters of the various kernels
were optimised to obtain the best results. The best results were
obtained under five-set cross-validation. A second-degree polynomial
kernel is able to discriminate between cleavage and non-cleavage
sites. The values of the C, r and s parameters were also optimised.
The gamma factor of the RBF kernel was optimised to 0.01, and the
regularization parameter C was optimised to 1. The SVM trained on
in vitro digestion data can predict 84% of the cleavage sites. Further
evaluation of this method on an independent dataset achieved an MCC
of 0.43, which outperformed all the existing proteasomal cleavage
prediction methods.
References:
- Altuvia,Y. and Margalit,H. (2000) Sequence signals for generation
  of antigenic peptides by the proteasome: implications for proteasomal
  cleavage mechanism. J. Mol. Biol., 295, 879-890.
- Craiu,A., Akopian,T., Goldberg,A. and Rock,K.L. (1997) Two distinct
  proteolytic processes in the generation of a major histocompatibility
  complex class I-presented peptide. Proc. Natl Acad. Sci. USA, 94,
  10850-10855.
- Emmerich,N.P., Nussbaum,A.K., Stevanovic,S., Priemer,M., Toes,R.E.,
  Rammensee,H.G. and Schild,H. (2000) The Human 26 S and 20 S
  Proteasomes Generate Overlapping but Different Sets of Peptide
  Fragments from a Model Protein Substrate. J. Biol. Chem., 275,
  21140-21148.
- Groll,M., Ditzel,L., Lowe,J., Stock,D., Bochtler,M., Bartunik,H.D.
  and Huber,R. (1997) Structure of 20S proteasome from yeast at 2.4 Å
  resolution. Nature, 386, 463-471.
- Holzhutter,H.G., Frommel,C. and Kloetzel,P.M. (1999) A Theoretical
  Approach Towards the Identification of Cleavage-determining Amino
  Acid Motifs of the 20 S Proteasome. J. Mol. Biol., 286, 1251-1265.
- Kuttler,C., Nussbaum,A.K., Dick,T.P., Rammensee,H.G., Schild,H. and
  Hadeler,K.P. (2000) An Algorithm for the Prediction of Proteasomal
  Cleavages. J. Mol. Biol., 298, 417-429.
- Lucchiari-Hartz,M., Van Endert,P.M., Lauvau,G., Maier,R.,
  Meyerhans,A., Mann,D., Eichmann,K. and Niedermann,G. (2000) Cytotoxic
  T Lymphocyte Epitopes of HIV-1 Nef: Generation of Multiple Definitive
  Major Histocompatibility Complex Class I Ligands by Proteasomes.
  J. Exp. Med., 191, 239-252.
- Morel,S., Levy,F., Burlet-Schiltz,O., Brasseur,F., Probst-Kepper,M.,
  Peitrequin,A.L., Monsarrat,B., Van Velthoven,R., Cerottini,J.C.,
  Boon,T. et al. (2000) Processing of some antigens by the standard
  proteasome but not by the immunoproteasome results in poor
  presentation by dendritic cells. Immunity, 12, 107-117.
- Nussbaum,A.K., Kuttler,C., Hadeler,K.P., Rammensee,H.G. and Schild,H.
  (2001) PAProC: a prediction algorithm for proteasomal cleavages
  available on the WWW. Immunogenetics, 53, 87-94.
- Witten,I.H. and Frank,E. (1999) Data Mining: Practical Machine
  Learning Tools and Techniques with Java Implementations. Morgan
  Kaufmann, San Francisco.
- Cost,S. and Salzberg,S. (1993) A weighted nearest neighbor algorithm
  for learning with symbolic features. Machine Learning, 10, 57-78.
- Joachims,T. (1999) Making large-Scale SVM Learning Practical. In
  Scholkopf,B., Burges,C. and Smola,A. (eds), Advances in Kernel
  Methods - Support Vector Learning. MIT Press, Cambridge,
  Massachusetts; London, England.
- Cristianini,N. and Shawe-Taylor,J. (2000) Support Vector Machines
  and other kernel-based learning methods. Cambridge University Press,
  The Edinburgh Building, Cambridge, CB2 2RU, UK.
- Singh,H. and Raghava,G.P.S. (2003) ProPred1: Prediction of
  Promiscuous MHC Class-I binding sites. Bioinformatics, 19, 1009-1014.
- Uebel,S. and Tampe,R. (1999) Specificity of the proteasome and the
  TAP transporter. Curr. Opin. Immunol., 11, 203-208.
- Bhasin,M. and Raghava,G.P.S. (2003) A hybrid method for Prediction
  of MHC Class-I restricted T cell epitopes (submitted).
- Saxova,P., Buus,S., Brunak,S. and Kesmir,C. (2003) Predicting
  proteasomal cleavage sites: a comparison of available methods.
  Int. Immunol., 15, 781-787.
- Kesmir,C., Nussbaum,A.K., Schild,H., Detours,V. and Brunak,S. (2002)
  Prediction of proteasome cleavage motifs by neural networks.
  Protein Eng., 15, 287-296.
- Harris,J.L., Alper,P.B., Li,J., Rechsteiner,M. and Backes,B.J. (2001)
  Substrate specificity of the human proteasome. Chem. Biol., 8,
  1131-1141.