BETATPRED2: Prediction of beta turns by neural networks and multiple alignment

BetaTPred2: Prediction of β-turns in proteins using neural networks and multiple alignment

Department of Computational Biology, IMTECH, Chandigarh

About Server

β-turn

Definition

A β-turn is a region of the protein involving four consecutive residues where the polypeptide chain folds back on itself by nearly 180 degrees (Lewis et al., 1971, 1973; Kuntz, 1972; Crawford et al., 1973; Chou & Fasman, 1974). It is these chain reversals which give a protein its globularity rather than linearity.

The β-turn was originally identified, in model building studies, by Venkatachalam (1968). He proposed three distinct conformations based on phi, psi values (designated I, II and III) along with their related turns (mirror images), which have the phi, psi signs reversed (I', II' and III'). Each of these could form a hydrogen bond between the main chain C=O(i) and N-H(i+3). Subsequently, Lewis et al. (1973) examined the growing number of three-dimensional protein structures and suggested a more general definition of a β-turn: the distance between Calpha(i) and Calpha(i+3) is less than 7 Å and the residues involved are not helical. They found that 25% of their extended β-turns did not possess the intraturn hydrogen bond suggested by Venkatachalam. To include the new data they extended the classification of β-turns to 10 distinct types (I, I', II, II', III, III', IV, V, VI and VII). These classes were defined not only by phi, psi angles, but also by less stringent criteria. Richardson (1981) has since reappraised the situation and suggested that there are only 6 distinct types (I, I', II, II', VIa and VIb) based on phi, psi ranges, along with a miscellaneous category IV. The Richardson classification is the system most widely used at present. The two main types of β-turns are Type I and Type II, with their mirror images I' and II'.

Type I & II turns

Figure: The two main types of β-turns (Type I and Type II)

A β-turn consists of four consecutive residues at positions i, i+1, i+2 and i+3 which are not part of an alpha-helix; the distance between Calpha(i) and Calpha(i+3) is less than 7 Å (Richardson, 1981; Rose et al., 1985) and the turn leads to a reversal of the protein chain. β-turns may or may not be accompanied by the NH(i+3)-CO(i) hydrogen bond connecting the main chain atoms, i.e. the CO of the ith residue and the NH of the (i+3)th residue in the turn (Lewis et al., 1973; Nemethy and Scheraga, 1980).
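The geometric part of this definition can be sketched in a few lines of Python. This is only an illustration of the Calpha(i)-Calpha(i+3) distance test, not the procedure used by the server: the function names, the input layout (a list of Calpha coordinates plus a list of helix flags) and the choice to reject a window if any of its four residues is helical are all assumptions made here.

```python
import math

def ca_distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates, in angstroms."""
    return math.dist(a, b)

def candidate_turns(ca_coords, helix_flags, cutoff=7.0):
    """Return start indices i of four-residue windows i..i+3 that pass the test.

    ca_coords   : list of (x, y, z) C-alpha coordinates, one per residue
    helix_flags : list of bools, True where a residue is assigned as helical
    cutoff      : maximum Calpha(i)-Calpha(i+3) distance in angstroms
    """
    starts = []
    for i in range(len(ca_coords) - 3):
        # Assumption: reject the window if any of its four residues is helical.
        if any(helix_flags[i:i + 4]):
            continue
        if ca_distance(ca_coords[i], ca_coords[i + 3]) < cutoff:
            starts.append(i)
    return starts
```

In practice the helix flags would come from a secondary structure assignment of the known structure; here they are simply supplied by the caller.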

Turns play an important role in globular proteins from both structural and functional points of view. A polypeptide chain cannot fold into a compact structure without turns. Also, turns usually occur on the exposed surface of proteins and hence probably represent antigenic sites or are involved in molecular recognition. For these reasons, the prediction of β-turns in proteins is an important element of secondary structure prediction.

Prediction Methods

Various methods for the prediction of tight turns and their types are reviewed by Chou. The existing prediction methods can be classified into four categories:

Site-Independent Model
1-4 & 2-3 correlation model
Sequence coupled model
Neural Network based methods

Site-Independent Model

This model is based upon the knowledge of amino acid preferences at individual positions in β-turns and does not consider any coupling between the residues in the sequence. A simple empirical method based on this concept was first introduced by Lewis et al. (1971) and Chou & Fasman (1977). None of the methods described below involves coupling between the residues forming the turn, and thus they can all be classified as Site-Independent Models.

Chou-Fasman algorithm: It is based on calculating positional frequencies and conformational parameters, and on the product of the derived amino acid probabilities at each of the four positions in the turn.
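The idea of a site-independent product can be sketched as follows. This is a minimal illustration, not the Chou-Fasman program itself: the function names are hypothetical, and the frequency table passed in must come from the caller. The numbers shown in the usage note are placeholders invented for the example, not published parameters.

```python
def turn_product(window, positional_freq, default=0.05):
    """Product of position-specific frequencies for one four-residue window.

    window          : string of exactly four one-letter residue codes
    positional_freq : list of four dicts, one per turn position (i .. i+3),
                      mapping residue code -> frequency at that position
    default         : frequency used for residues missing from a table
                      (an assumption of this sketch)
    """
    if len(window) != 4:
        raise ValueError("a beta-turn window has exactly four residues")
    score = 1.0
    for position, residue in enumerate(window):
        score *= positional_freq[position].get(residue, default)
    return score

def scan_sequence(sequence, positional_freq):
    """Score every four-residue window; returns (start index, product) pairs."""
    return [(i, turn_product(sequence[i:i + 4], positional_freq))
            for i in range(len(sequence) - 3)]
```

With a placeholder table such as `[{"N": 0.2}, {"P": 0.3}, {"G": 0.2}, {"G": 0.1}]`, the window `NPGG` gets the product 0.2 × 0.3 × 0.2 × 0.1. Because each position contributes independently, no coupling between residues enters the score, which is what makes the model site-independent.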

Thornton's algorithm: Later, in 1988, the conformational potentials, positional potentials and turn-type-dependent positional potentials were recalculated by Wilmot & Thornton using a dataset of 59 proteins for the prediction of Type I and II β-turns.

GORBTURN (v3.0): In 1990, a β-turn prediction program called GORBTURN (v1.0) was developed using the directional parameters of Gibrat et al. in combination with equivalent parameters derived from the work of Garratt et al. to eliminate potential helix- and strand-forming residues from the β-turn prediction. In 1994, GORBTURN (v1.0) was improved by incorporating the positional frequencies calculated by Thornton from a dataset of 205 protein chains, resulting in GORBTURN (v3.0), which predicts different types of β-turns (Type I, I', II, II', VIII and non-specific).

1-4 & 2-3 Correlation Model

In 1997, an entirely new model, the so-called 1-4 & 2-3 correlation model, was proposed by Chou to predict β-turns in proteins based on residue coupling. In this model, the coupling effect between the 1st and the 4th residues and that between the 2nd and the 3rd residues were given special consideration.

Sequence Coupled Model

In the same year, 1997, a sequence coupled model was proposed by Chou for the prediction of β-turns in proteins. It was based on a first-order Markov chain involving conditional probabilities. Both the 1-4 & 2-3 correlation model and the sequence coupled model were confined to distinguishing β-turn from non-turn, i.e. they could predict only one of two possibilities. So, in the same year, Chou & Blinn extended the sequence coupled model to the prediction of different types of β-turns (Type I, I', II, II', VI, VIII).

Neural Network based Method

At present, BTPRED (Shepherd et al., 1999) is the only method based on neural networks. It was developed on a set of 300 non-homologous protein domains with a resolution of 2.0 Å or better. A neural network is used to predict whether a given residue is part of a beta-turn or not. A filtering network is used to improve the accuracy, and the individual turn type is predicted using a separate neural network for each turn type. For each amino acid, BTPRED uses secondary structure information obtained from the PHDsec program (Rost and Sander, 1993; Rost and Sander, 1994) rather than the amino acid type alone.

Evaluation of Methods

Recently, we evaluated all the existing beta-turn prediction methods on a data set of 426 non-homologous protein chains. Click here to check the results of the evaluation and the ranking of existing methods.

In addition, we have developed a web server, BTEVAL, for assessing the performance and ranking of beta-turn prediction methods, including new algorithms. Evaluation of a method can be carried out on a single protein or on a number of proteins. The server provides a clean data set of 426 protein chains together with seven subsets of these proteins, and users can evaluate their method on any subset or on the complete set. This allows users to perform a comprehensive assessment of their method and to compare it with other existing methods.

Neural Networks

Definition

Also referred to as connectionist architectures, parallel distributed processing, and neuromorphic systems, an artificial neural network (ANN) is an information-processing paradigm inspired by the way the densely interconnected, parallel structure of the mammalian brain processes information. Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of the ANN paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements that are analogous to neurons and are tied together with weighted connections that are analogous to synapses.

There are multitudes of different types of ANNs. Some of the more popular include the multilayer perceptron which is generally trained with the backpropagation of error algorithm, learning vector quantization, radial basis function, Hopfield, and Kohonen, to name a few. Some ANNs are classified as feedforward while others are recurrent (i.e., implement feedback) depending on how data is processed through the network. Another way of classifying ANN types is by their method of learning (or training), as some ANNs employ supervised training while others are referred to as unsupervised or self-organizing. Unsupervised algorithms essentially perform clustering of the data into similar groups based on the measured attributes or features serving as inputs to the algorithms. This is analogous to a student who derives the lesson totally on his or her own. ANNs can be implemented in software or in specialized hardware.

Learning algorithm and architecture

The neural network used here is the standard feed-forward network, which always passes information in the forward direction. The learning algorithm is the standard backpropagation of error, which minimizes the difference between the actual output of the network and the desired output for patterns in the training set. Training is done on a set of 426 non-homologous protein chains using 7-fold cross-validation.

Two feed-forward networks are used. In the first step, the sequence-to-structure network is trained, where the occurrence of various residues in a window of 9 amino acids is correlated with the output for the central residue. The predicted turn/non-turn output from the first network is then combined with predicted secondary structure information as input to the second, filtering structure-to-structure network, which is trained to filter out unreasonably isolated residues that are predicted as turns.

To build the neural network architecture and run the learning process, the freely available simulation package SNNS version 4.2 is used. Click here to download SNNS (v4.2).

For both single sequence and multiple alignment input, the network uses a sliding window of 9 amino acids, and a prediction is made for the central residue. With single sequence input, each residue is encoded by a vector of 21 units, where only one unit is set to 1 for a particular residue and the rest are set to zero. With multiple alignment profile input, the position specific scoring matrix from PSI-BLAST (Altschul et al., 1997) is used as input to the neural network. The matrix has 21 X M elements, where M is the length of the target sequence, so each residue is encoded by 21 real numbers (rather than by a one and 20 zeros).
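The single sequence encoding can be sketched as below. This is an illustrative reconstruction rather than the server's code. In particular, the text does not say what the 21st unit stands for; this sketch assumes it marks window positions that fall beyond the chain termini, and it also maps unknown residue codes onto that unit. All names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PAD = "-"                              # assumed meaning of the 21st unit
ALPHABET = AMINO_ACIDS + PAD

def one_hot(residue):
    """21-unit vector with a single 1 for the given residue."""
    vector = [0] * len(ALPHABET)
    index = ALPHABET.find(residue)
    # Assumption: anything not in the alphabet shares the padding unit.
    vector[index if index >= 0 else len(ALPHABET) - 1] = 1
    return vector

def encode_windows(sequence, window=9):
    """One input vector of window * 21 units per residue of the sequence."""
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    inputs = []
    for centre in range(len(sequence)):
        vector = []
        for residue in padded[centre:centre + window]:
            vector.extend(one_hot(residue))
        inputs.append(vector)
    return inputs
```

Each returned vector has 9 × 21 = 189 entries, exactly nine of which are 1. For profile input, the 21 binary values per position would be replaced by the 21 real-valued scores of the corresponding matrix column.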

The network topology is written as 9(21)-10-1, which denotes a network with 3 layers: an input layer with a window size of 9 residues and 21 input units per window position, a hidden layer with 10 nodes, and an output layer with 1 unit. The total number of units in the input layer is 9 X 21 = 189. The second (structure-to-structure) network also uses a window size of 9 amino acids. Each residue is encoded by 4 units: one unit for the turn/non-turn prediction of the first network, set to either 1 or 0, and the remaining 3 units for the 3 secondary structure states (helix, strand and coil) predicted by PSIPRED (Jones, 1999).
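To make the layer sizes concrete, the following sketch builds a 9(21)-10-1 network and a second network with 9 × 4 = 36 inputs, and runs a forward pass. It is a plain-Python illustration, not the SNNS configuration used by the server. The logistic activation, the random initial weights and the hidden layer size of 10 for the second network are assumptions, and all names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Logistic activation (assumed; the text does not name the function)."""
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Small random weights, one row per unit, plus one bias per unit."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(inputs, weights, biases):
    """Weighted sum followed by the activation, for every unit of a layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def network_forward(inputs, layers):
    """Pass the input vector through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

WINDOW, UNITS_PER_RESIDUE, HIDDEN = 9, 21, 10
rng = random.Random(0)

# First network, 9(21)-10-1: 189 inputs -> 10 hidden -> 1 output.
first_net = [init_layer(WINDOW * UNITS_PER_RESIDUE, HIDDEN, rng),
             init_layer(HIDDEN, 1, rng)]

# Second network: 9 positions x 4 units = 36 inputs.
# The hidden layer size of 10 is an assumption of this sketch.
second_net = [init_layer(WINDOW * 4, HIDDEN, rng),
              init_layer(HIDDEN, 1, rng)]
```

Calling `network_forward(vector, first_net)` on a 189-unit input returns a one-element list holding the output unit's value between 0 and 1. Training by backpropagation is not shown.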

Both the first and the second network have a single output unit, the value of which is 1 for a beta-turn residue and 0 for a non-beta-turn residue. The learning process consists of altering the weights of the connections between units in response to a teaching signal that provides information about the correct classification. The quantity minimized is the sum of squared errors (SSE) between the actual output and the desired output. During testing, a cutoff value is set for each network and the output produced by the network is compared with it. If the output is greater than the cutoff value, the residue is taken as a beta-turn residue; otherwise it is considered a non-beta-turn residue. For each network, the cutoff value is adjusted so that it yields the highest accuracy for that network.
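The cutoff adjustment can be illustrated with a simple scan over candidate values. This is a sketch under stated assumptions, not the procedure actually used: "accuracy" is taken here to mean the fraction of residues classified correctly, the candidate grid of 0.01 to 0.99 is arbitrary, and the function names are hypothetical.

```python
def accuracy_at(outputs, labels, cutoff):
    """Fraction of residues classified correctly at a given cutoff.

    outputs : network output values, one per residue
    labels  : 1 for observed beta-turn residues, 0 otherwise
    """
    correct = sum((output > cutoff) == bool(label)
                  for output, label in zip(outputs, labels))
    return correct / len(labels)

def best_cutoff(outputs, labels, candidates=None):
    """Return the candidate cutoff giving the highest accuracy.

    When several cutoffs tie, the lowest one is returned.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda c: accuracy_at(outputs, labels, c))
```

To avoid an optimistic estimate, the cutoff would normally be chosen on data that is not used for the final assessment.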

Multiple Alignment

Prediction from a multiple alignment of protein sequences rather than from a single sequence has long been recognized as a way to improve prediction accuracy (Cuff and Barton, 1999). During evolution, residues with similar physico-chemical properties are conserved if they are important to the fold or function of the protein. The availability of large families of homologous sequences revolutionised secondary structure prediction: traditional methods, when applied to a family of proteins rather than to a single sequence, proved much more accurate at identifying core secondary structure elements.

The same approach is used here for the prediction of beta-turns, combining a neural network with multiple alignment information. The network is trained on the position specific scoring matrices generated by PSI-BLAST (run as part of PSIPRED).

PSI-BLAST

In PSI-BLAST (Position Specific Iterative BLAST) (Altschul et al., 1997), the sequences extracted from a BLAST search are aligned and a statistical profile is derived from the multiple alignment. The profile is then used as a query for the next search, and this loop is iterated a number of times that is controlled by the user. For more information, click here.

The PSIPRED method has been used for secondary structure prediction. It uses PSI-BLAST to detect distant homologues of a query sequence and to generate a position specific scoring matrix as part of the prediction process (Jones, 1999). These intermediate PSI-BLAST generated position specific scoring matrices are used as direct input to the neural network during training. The matrix has 21 X M elements, where M is the length of the target sequence, and each element represents the likelihood of that particular residue substitution at that position in the template. It is a sensitive scoring system, which involves the probabilities with which amino acids occur at various positions.

Performance Measures

Here, four different parameters are used to measure the performance of BetaTPred2 as described by Shepherd et al. (1999).

The predictive performance of a method is expressed by the following four parameters:

1. Qtotal, the percentage of correctly classified residues, is defined as

Qtotal = ((p + n) / t) × 100

where p is the number of correctly classified beta-turn residues, n is the number of correctly classified non-beta-turn residues and t is the total number of residues in a protein. Qtotal, also known as 'prediction accuracy', is simply the total percentage of correct predictions. One difficulty with this measure is that it does not take into account the disparity between the number of beta-turn residues (around 25%) and non-turn residues. Hence, it is possible to get a Qtotal score of about 75% by the trivial strategy of predicting all residues to be non-turn residues, and information about turns can be lost because of the dominance of non-turn residues. The Matthews Correlation Coefficient remedies this problem.

2. MCC, the Matthews Correlation Coefficient, is defined as

MCC = (p × n − o × u) / sqrt((p + o) × (p + u) × (n + o) × (n + u))

where p is the number of correctly classified beta-turn residues, n is the number of correctly classified non-beta-turn residues, o is the number of non-beta-turn residues incorrectly classified as beta-turn residues and u is the number of beta-turn residues incorrectly classified as non-beta-turn residues. It is a measure that accounts for both over- and under-predictions.

3. Qpredicted is defined as

Qpredicted = (p / (p + o)) × 100

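Qpredicted is the percentage of beta-turn predictions that are correct, i.e. the proportion of residues predicted as beta-turns that are observed beta-turns. This corresponds to the positive predictive value (precision). It is sometimes loosely referred to as specificity, but it should not be confused with the proportion of non-turn residues correctly predicted as non-turns.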

4. Qobserved is defined as

Qobserved = (p / (p + u)) × 100

Qobserved is the percentage of observed beta-turns that are correctly predicted. Otherwise known as sensitivity, it is the proportion of beta-turn residues that have been correctly predicted as beta-turns.

Thus, prediction accuracy is measured at the residue level, i.e. in terms of the percentage of individual amino acids predicted correctly.
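The four measures can be computed from residue-level predictions as sketched below, using p, n, o and u as defined above. The formulas in the code are the standard forms of these measures and are assumed here to match the definitions of Shepherd et al. (1999). The function names, and the choice to return 0 when a denominator is zero, are choices made for this sketch.

```python
import math

def confusion_counts(predicted, observed):
    """Count p, n, o, u from per-residue turn (1) / non-turn (0) labels."""
    p = n = o = u = 0
    for pred, obs in zip(predicted, observed):
        if pred and obs:
            p += 1          # turn correctly predicted
        elif not pred and not obs:
            n += 1          # non-turn correctly predicted
        elif pred and not obs:
            o += 1          # over-prediction
        else:
            u += 1          # under-prediction
    return p, n, o, u

def performance(p, n, o, u):
    """Return Qtotal, MCC, Qpredicted and Qobserved for the given counts."""
    t = p + n + o + u
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return {
        "Qtotal": 100.0 * (p + n) / t,
        # Assumption: report 0 when a measure is undefined (zero denominator).
        "MCC": (p * n - o * u) / denominator if denominator else 0.0,
        "Qpredicted": 100.0 * p / (p + o) if (p + o) else 0.0,
        "Qobserved": 100.0 * p / (p + u) if (p + u) else 0.0,
    }
```

For a protein with 25 turn residues out of 100, predicting every residue as non-turn gives p = 0, n = 75, o = 0, u = 25, so Qtotal is 75 while MCC and Qobserved are 0. This is the imbalance problem that MCC is meant to expose.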
