a neural network based MHC Class I Binding Peptide Prediction Server


nHLAPred
A neural network based MHC Class-I Binding Peptide Prediction Server

Prediction Algorithm

Introduction:
Identifying MHC class I binding peptides is crucial for preventing serious diseases such as cancer. Experimentally identifying peptides that bind specifically to MHC molecules requires a binding assay for each peptide, which is laborious and time-consuming. Computational methods for MHC binder or CTL epitope prediction are a good alternative because they reduce the number of peptides that must be synthesized for wet-lab experiments.
In the past, a number of methods have been developed for predicting MHC binders in an antigen sequence on the basis of rules that govern the binding of a peptide to an MHC molecule. Broadly, these methods are based on

All the methods described above have a few limitations:-

Most of these methods have been developed only for one or two alleles and are not able to predict promiscuous MHC binders that can bind to many alleles.
These methods have been developed using information from a limited number of binders and non-binders.

To overcome some of these limitations, we have systematically developed a method covering a large number of MHC class I alleles for which sufficient training data are available. In the first step, we developed a quantitative matrix based method for 47 alleles that have at least 15 binders. The data used to derive the matrix for each allele were obtained from the latest release of the MHCBN database. Because neural networks require more training data, we selected from these 47 alleles the 30 alleles for which at least 40 MHC binders are available. It has been proposed that a combination of two prediction approaches (quantitative matrices and ANNs) can give better predictions than either approach alone (Gulukota et al., 1997). Thus, for these 30 alleles, we have combined the artificial neural network and quantitative matrix based approaches. Another aim of this study is to cover as many alleles as possible. Therefore, we have also included 20 alleles whose quantitative matrices are already available in the literature (BIMAS and ProPred1). The overall structure of nHLAPred is shown in the diagram below.

The architecture of the nHLAPred server. The server consists of two major parts: (I) ComPred and (II) ANNPred.

ANNPred

The prediction in this part of the server is based solely on artificial neural networks. The prediction results obtained using artificial neural networks are better than those of motif and weight matrix based prediction. The major constraint of neural network prediction is that it requires a large amount of data on MHC binders and non-binders for training. We were able to develop the neural network based prediction method for only 30 alleles.

ComPred

This is a comprehensive platform for predicting MHC binders from an antigenic sequence for 67 different MHC alleles. The prediction for 30 alleles is based on the hybrid approach of artificial neural networks and quantitative matrices. The prediction for the remaining 37 alleles is based on quantitative matrices only. The matrices for 17 of these 37 alleles were generated in the present study, and the remaining matrices were obtained from the BIMAS server. The predicted MHC binders are refined to potential T cell epitopes by locating proteasomal cleavage sites.

Stepwise description of Prediction Algorithm.

Extraction and preprocessing of Data.
Training of Artificial Neural Networks (ANNs).
Generation of Quantitative matrices (QM).
Hybrid approach based on combination of ANNs and QM.
Refining of MHC binders to Potential CTL epitopes.

Extraction and preprocessing of Data:-

The dataset of MHC binders used for training the artificial neural networks and for generating the quantitative matrices was obtained from MHCBN, a comprehensive database of MHC binding and non-binding peptides (Bhasin et al., 2002). Only MHC binders of 9 amino acids were used for both purposes. For each MHC allele, the dataset of MHC non-binders is obtained from MHCBN wherever available; otherwise it is prepared from the SWISS-PROT database by randomly choosing peptides of 9 amino acids. The complete dataset of each MHC allele has a nearly equal number of MHC binders and non-binders.

Input data for ANNs:- The input to the artificial neural network of each MHC allele is a single sequence in binary representation. In this representation, each amino acid at each window position is encoded by a group of 21 input units: 20 units code for the 20 natural amino acids at that position, and one is used when the moving window overlaps the amino- or carboxyl-terminal end of the peptide. In each group of 21 input units, the unit corresponding to the amino acid at that window position is set to 1 and all other units are set to 0. For example, alanine and glycine are represented as follows:
Alanine: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Glycine: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In this manner, a single peptide of 9 amino acids is represented by 189 (9 x 21) input units to the neural network.
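The encoding described above can be sketched in a few lines of Python. This is only an illustration: the text fixes alanine as the first unit and glycine as the second, but the order of the remaining amino acids and the "-" symbol used here for the terminal-overlap unit are assumptions, as is the function name encode_peptide.

```python
# Unit order: A and G come first, as in the example above. The order of
# the remaining 18 residues and the "-" terminal symbol are assumptions.
UNIT_ORDER = "AGCDEFHIKLMNPQRSTVWY-"

def encode_peptide(peptide):
    """Return the binary (one-hot) input vector for a 9-residue peptide."""
    if len(peptide) != 9:
        raise ValueError("expected a peptide of 9 amino acids")
    vector = []
    for residue in peptide:
        units = [0] * len(UNIT_ORDER)          # 21 units per position
        units[UNIT_ORDER.index(residue)] = 1   # ValueError on unknown residue
        vector.extend(units)
    return vector

encoded = encode_peptide("ILKEPVHGV")
print(len(encoded))  # 9 positions x 21 units
```

Exactly one unit is set in each group of 21, so the vector for any 9-mer contains nine 1s.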

Training of Artificial Neural Networks (ANNs):-

The neural networks were implemented using the Stuttgart Neural Network Simulator, SNNS 4.2 (Zell and Mamier, 1997). A linear activation function is used. At the start of each simulation, the weights are initialized with random values. Training is carried out using error back-propagation with a sum of squared errors (SSE) function. The magnitude of the SSE on the training and test sets is monitored after each cycle. The final number of cycles is taken as the point where the network converges. The neural network for each MHC allele is trained in the same manner. Determining the number of cycles at which the trained networks gave maximum accuracy took hundreds of hours of computation. The final parameters are as follows:

Final number of epochs: 2000.
Number of hidden layers: 1.
Number of neurons per hidden layer: 10.
Linear learning rate: 0.01.
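To make the training setup concrete, the following is a minimal pure-Python sketch of a network with one hidden layer trained by online back-propagation on the SSE. It is not the SNNS implementation used by the server. The function name train_network, the weight initialization range and the use of bias terms are assumptions. The activation is taken as the identity function, following the text literally, and the gradient is that of half the squared error.

```python
import random

def train_network(samples, targets, n_hidden=10, learning_rate=0.01,
                  epochs=2000, seed=0):
    """Train a one-hidden-layer network; return weights and SSE per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Random initial weights; the last entry of each row is a bias (assumed).
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    sse_per_epoch = []
    for _ in range(epochs):
        sse = 0.0
        for x, target in zip(samples, targets):
            # Forward pass with identity (linear) activations.
            hidden = [sum(w * v for w, v in zip(row, x)) + row[-1]
                      for row in w_hidden]
            output = sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1]
            error = output - target
            sse += error ** 2
            # Backward pass: propagate the error before updating w_out[j].
            for j in range(n_hidden):
                delta_hidden = error * w_out[j]
                w_out[j] -= learning_rate * error * hidden[j]
                for i, v in enumerate(x):
                    w_hidden[j][i] -= learning_rate * delta_hidden * v
                w_hidden[j][-1] -= learning_rate * delta_hidden
            w_out[-1] -= learning_rate * error
        sse_per_epoch.append(sse)  # monitored after each cycle
    return w_hidden, w_out, sse_per_epoch
```

The returned SSE history corresponds to the quantity monitored after each cycle; in practice the samples would be 189-unit vectors and the targets 1 for binders and 0 for non-binders.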

The accuracy of the trained artificial neural networks in classifying MHC binders and non-binders is evaluated for every MHC allele using the "Leave One Out Cross Validation (LOOCV)" test. LOOCV is n-fold cross-validation where n is the number of training examples: each example is held out in turn, the network is trained on the remaining n-1 examples, and the held-out example is used for testing. This is the most thorough and most extreme version of cross-validation. During testing, a cutoff value is set for the trained network of each MHC allele and the scores that the network assigns to peptides are compared to this cutoff. A peptide with a score above the cutoff is predicted as a binder, whereas a peptide with a score below the cutoff is predicted as a non-binder.
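The evaluation loop can be sketched as follows. The train and predict callables are hypothetical placeholders for whatever model is being evaluated, and loocv_accuracy is an illustrative name rather than part of the server.

```python
def loocv_accuracy(examples, labels, train, predict, cutoff):
    """Leave-one-out cross-validation accuracy with a score cutoff.

    train(examples, labels) returns a model; predict(model, example)
    returns a score. Both are supplied by the caller.
    """
    correct = 0
    for i in range(len(examples)):
        # Hold out example i and train on the remaining n-1 examples.
        model = train(examples[:i] + examples[i + 1:],
                      labels[:i] + labels[i + 1:])
        score = predict(model, examples[i])
        predicted_binder = score > cutoff
        if predicted_binder == bool(labels[i]):
            correct += 1
    return correct / len(examples)
```

Because the model is retrained n times, the cost grows with the dataset size, which is one reason this form of validation is computationally expensive.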

Generation of Quantitative matrices:-

The quantitative matrices consist of a table holding, for each of the 21 amino acids (including "X") at each position, the sequence-weighted frequency of that amino acid in the dataset of MHC binders divided by the corresponding expected frequency of that amino acid in the non-binder dataset. The binder dataset for each MHC allele is generated by obtaining MHC binders of 9 amino acids from the MHCBN database. An equal number of non-binders is obtained from the same database if available; otherwise 9-mer peptides are randomly chosen from the SWISS-PROT database. The quantitative matrices are addition matrices, where the score of a peptide is calculated by summing the scores of each residue at its specific position along the peptide sequence. For example, the score of peptide "ILKEPVHGV" is calculated as follows.

Score= I(1)+L(2)+K(3)+E(4)+P(5)+V(6)+H(7)+G(8)+V(9) (1)
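The matrix construction and equation (1) can be sketched as below. Several details are assumptions made only for illustration: the counts are unweighted (no sequence weighting), the expected frequency is taken as the overall residue frequency in the non-binders, and a pseudocount is added to avoid division by zero. The names build_matrix and additive_score are hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus "X"

def build_matrix(binders, non_binders, pseudocount=1.0):
    """Per-position ratio of binder frequency to non-binder frequency."""
    # Background frequencies over all positions of the non-binders (assumed).
    background = Counter(res for pep in non_binders for res in pep)
    total_bg = sum(background.values()) + pseudocount * len(ALPHABET)
    total_obs = len(binders) + pseudocount * len(ALPHABET)
    matrix = []
    for position in range(len(binders[0])):
        counts = Counter(pep[position] for pep in binders)
        column = {}
        for res in ALPHABET:
            observed = (counts[res] + pseudocount) / total_obs
            expected = (background[res] + pseudocount) / total_bg
            column[res] = observed / expected
        matrix.append(column)
    return matrix

def additive_score(peptide, matrix):
    """Equation (1): sum of the matrix entries of each residue."""
    return sum(matrix[pos][res] for pos, res in enumerate(peptide))
```

A peptide resembling the binders used to build the matrix receives a higher additive score than one resembling the non-binders.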

Peptides with a score above the cutoff score at a particular threshold are predicted as MHC binders. A few matrices are also obtained from the literature (BIMAS and ProPred1). These matrices are mostly multiplication matrices. For these matrices, the score of a peptide such as "ILKEPVHGV" is calculated as follows:

Peptide score= I(1) * L(2) * K(3) * E(4) * P(5) * V(6) * H(7) * G(8) * V(9) (2)
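Equation (2) differs from equation (1) only in how the per-position entries are combined. A minimal sketch, assuming the matrix is stored as a list of per-position dictionaries (a representation chosen here for illustration):

```python
def multiplicative_score(peptide, matrix):
    """Equation (2): product of the matrix entries of each residue."""
    score = 1.0
    for position, residue in enumerate(peptide):
        score *= matrix[position][residue]
    return score
```

With a multiplication matrix, a single very small entry can drive the whole score toward zero, whereas in an addition matrix each position contributes independently to the sum.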

Threshold score: Determining the threshold or cutoff score is an integral part of matrix based prediction, and a cutoff score must be calculated for each matrix. Calculating the threshold score requires a sufficient amount of data. The threshold score for the matrix of each MHC allele is determined as follows:

Firstly, all overlapping peptides of 9 amino acids are generated from all the proteins present in the SWISS-PROT database.

Secondly, the scores of these natural SWISS-PROT peptides are calculated using the new quantitative matrices of the different MHC alleles. The peptides are sorted in descending order of score. The top 1% of the peptides are extracted, and the minimum score among them is taken as the threshold score at 1%. The threshold scores at other thresholds, such as 2%, 3%, etc., are calculated similarly.

Peptides scoring above the cutoff score at a particular threshold are predicted as binders, whereas peptides scoring below the threshold score are predicted as non-binders.
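The threshold procedure above can be sketched as follows. The function names are illustrative, and the handling of very small score lists (always keeping at least one peptide) is an assumption.

```python
def overlapping_peptides(protein, length=9):
    """All overlapping windows of the given length in a protein sequence."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]

def threshold_at(scores, percent):
    """Minimum score among the top `percent` % of the scored peptides."""
    ranked = sorted(scores, reverse=True)          # descending order
    n_top = max(1, int(len(ranked) * percent / 100))
    return ranked[n_top - 1]

# Toy example with 200 made-up scores (1, 2, ..., 200).
scores = list(range(1, 201))
print(threshold_at(scores, 1))  # top 1% = 2 peptides (200 and 199)
```

In the real procedure the scores would come from applying one allele's matrix to every 9-mer generated from SWISS-PROT, so the threshold at 1% is the score exceeded by only about 1% of natural peptides.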

Hybrid approach based on combination of ANNs and QM

We have combined the machine learning technique (ANNs) and the statistical method (QM) to improve prediction accuracy. Gulukota et al. (1997) proposed that artificial neural network and quantitative matrix based methods complement each other, leading to a reduction in false predictions.

Diagrammatic representation of combining ANNs and QM

Refining of MHC binders to Potential CTL epitopes

The MHC binders produced by the first level of filtration are refined to T cell epitopes using proteasomal matrices. The standard proteasomal and immunoproteasomal matrices are obtained from the ProPred1 server. The matrices were originally derived from the work of Toes et al., 2001. Like the quantitative matrices, they are addition matrices.

Stepwise prediction of Proteasomal cleavage

Overlapping peptides of 12 amino acids are obtained from the antigenic protein.

The score of each peptide is calculated using the proteasomal and/or immunoproteasomal matrices.

Peptides scoring above the cutoff score at the selected threshold are predicted to have a proteasomal cleavage site at their center position, i.e. six positions away from the amino-terminal position.

The method compares the carboxyl-terminal position of each predicted MHC binder with the proteasomal cleavage positions in the antigenic protein. MHC binders whose carboxyl-terminal position coincides with a proteasomal cleavage site are predicted as potential T cell epitopes.
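The refinement steps above can be sketched as follows. Positions are 0-based residue indices, and the cleavage site is taken to be the sixth residue of each 12-residue window; this reading of "six positions away from the amino terminal position" is an assumption, as are the function names and the list-of-dictionaries matrix representation.

```python
def predict_cleavage_sites(protein, cleavage_matrix, cutoff,
                           window=12, site_offset=5):
    """Indices of residues predicted to sit at a proteasomal cleavage site."""
    sites = set()
    for start in range(len(protein) - window + 1):
        peptide = protein[start:start + window]
        # Addition matrix: sum the entry of each residue at its position.
        score = sum(cleavage_matrix[pos][res]
                    for pos, res in enumerate(peptide))
        if score > cutoff:
            sites.add(start + site_offset)  # sixth residue of the window
    return sites

def refine_binders(binder_starts, cleavage_sites, length=9):
    """Keep binders whose C-terminal residue coincides with a cleavage site."""
    return [start for start in binder_starts
            if start + length - 1 in cleavage_sites]
```

Here binder_starts holds the start index of each predicted 9-mer binder, so the returned list contains only those binders that the proteasome is predicted to release with the correct carboxyl terminus.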
