βhairPred - β-hairpin prediction server for protein sequences

METHOD

The BhairPred server is based on the support vector machine (SVM) machine learning technique. It uses single sequence information, evolutionary profiles, predicted and observed secondary structure (obtained from PSI-PRED and DSSP respectively), and predicted and observed accessibility values (obtained from NetASA and DSSP respectively). The methods were trained and tested on a dataset of 2880 proteins, and their performance was evaluated on a dataset of 534 proteins used by Thornton (PNAS, 2002). The best prediction results were obtained with a hybrid approach that combined prediction results from the evolutionary profile, predicted secondary structure and accessibility.

Beta-Hairpin Dataset
2880 protein chains were selected from the PDB, and the secondary structure of each amino acid was assigned using DSSP. Stretches of amino acids that form sheet-coil-sheet (ECE) regions were extracted. Amino acids forming β-hairpins in these proteins were identified using PROMOTIF. ECE patterns that PROMOTIF assigned as hairpins were taken as positive examples and the remaining patterns as negative examples. Analysis showed that hairpin length varies between 5 and 22 residues and that the majority are 17 amino acids long. Hence the pattern length was fixed at 17 residues, with a maximum coil region of 10 residues and a minimum sheet length of 3 residues. For patterns shorter than 17 residues, flanking residues were added to reach the required length. The final dataset has 5102 hairpins and 5131 non-hairpins.
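
The extraction of ECE patterns described above can be sketched roughly as follows. This is a minimal Python illustration, not the server's own code. It assumes the DSSP assignment has already been reduced to a three-state string (E = sheet, C = coil, H = helix). The function names, the centring of the 17-residue window on the ECE region, and the clamping of the window at chain ends are all assumptions made for the example.

```python
from itertools import groupby

WINDOW = 17      # fixed pattern length
MAX_COIL = 10    # maximum coil length between the two strands
MIN_SHEET = 3    # minimum length of each strand


def runs(ss):
    """Collapse a secondary-structure string into (state, start, end) runs; end is exclusive."""
    out, pos = [], 0
    for state, group in groupby(ss):
        length = len(list(group))
        out.append((state, pos, pos + length))
        pos += length
    return out


def ece_windows(ss):
    """Return (start, end) index pairs of 17-residue windows around each ECE region."""
    if len(ss) < WINDOW:
        return []
    r = runs(ss)
    windows = []
    # Slide over every three consecutive runs and keep sheet-coil-sheet triples.
    for (s1, a1, b1), (s2, a2, b2), (s3, a3, b3) in zip(r, r[1:], r[2:]):
        if (s1, s2, s3) != ("E", "C", "E"):
            continue
        if b1 - a1 < MIN_SHEET or b3 - a3 < MIN_SHEET or b2 - a2 > MAX_COIL:
            continue
        # Assumption: centre the window on the ECE region, so short regions are
        # completed with flanking residues, and clamp the window to the chain.
        centre = (a1 + b3) // 2
        start = max(0, min(centre - WINDOW // 2, len(ss) - WINDOW))
        windows.append((start, start + WINDOW))
    return windows


example = "CCCC" + "EEEE" + "CCC" + "EEEE" + "CCCCCCCC"
print(ece_windows(example))  # [(1, 18)]
```

Each returned window would then be labelled positive or negative according to whether PROMOTIF assigns the region as a hairpin. That step is not shown here.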

SVM Models
Different input features were used to develop the SVM models. Models were constructed using single sequence information, the PSI-BLAST evolutionary profile, secondary structure and accessibility. Secondary structure was either predicted or observed, obtained from PSI-PRED [http://bioinf.cs.ucl.ac.uk/psipred/] and DSSP [ftp://ftp.embl-heidelberg.de/pub/databases/dssp] respectively. Accessibility was either observed or predicted, obtained from DSSP and the NetASA server (http://www.netasa.org) respectively.

Feature Representation:

Sequence based model - Each amino acid is encoded as a 21-dimensional binary vector. The sequence-based SVM model was therefore trained on binary input patterns of dimension 21*17.
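
A minimal Python sketch of this binary encoding is given below. The text does not say what the 21st position stands for. The sketch assumes it is a dummy symbol "X" for unknown or padding residues, and the name encode_window is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
ALPHABET = AMINO_ACIDS + "X"           # assumed 21st symbol: unknown/padding residue


def encode_window(window):
    """One-hot encode each residue with 21 bits and concatenate the results."""
    vector = []
    for residue in window:
        bits = [0] * len(ALPHABET)
        index = ALPHABET.find(residue)
        # Anything outside the 20 standard residues falls into the last slot.
        bits[index if index >= 0 else len(ALPHABET) - 1] = 1
        vector.extend(bits)
    return vector


pattern = encode_window("ACDEFGHIKLMNPQRST")  # a 17-residue window
print(len(pattern))  # 357
```

For a 17-residue window this yields 21*17 = 357 values, exactly 17 of which are 1.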

Evolutionary profile based model - The evolutionary profile was taken from the PSSM generated by a PSI-BLAST search against the NR protein database.
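
As a rough illustration, a 17-residue window of a PSSM could be turned into an SVM input pattern as shown below. The text does not state how the PSSM scores were scaled or how large the resulting pattern is. The logistic scaling, the 20 scores per row, and the names scale and profile_vector are assumptions made for this sketch only.

```python
import math

WINDOW = 17


def scale(score):
    """Squash a raw PSSM score into (0, 1). The logistic form is an assumption."""
    return 1.0 / (1.0 + math.exp(-score))


def profile_vector(pssm_rows, start):
    """Flatten WINDOW consecutive PSSM rows, beginning at index `start`, into one list."""
    if start < 0 or start + WINDOW > len(pssm_rows):
        raise ValueError("window does not fit inside the profile")
    rows = pssm_rows[start:start + WINDOW]
    return [scale(score) for row in rows for score in row]


# Toy profile: 30 residues, 20 substitution scores each, all zero.
toy_pssm = [[0] * 20 for _ in range(30)]
print(len(profile_vector(toy_pssm, 5)))  # 340
```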

Model based on accessibility profile along with amino acid information - Accessibility was added to the binary representation, giving a final input vector of size 22*17. Predicted accessibility was obtained from the NetASA server (http://www.netasa.org). Buried residues are represented by 0 and exposed residues by 1. Accessibility assigned by DSSP (ftp://ftp.embl-heidelberg.de/pub/databases/dssp) was used in the ideal case where the structure is already known.
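
The 22*17 pattern can be sketched in Python as follows. The sketch assumes the accessibility states are already available as one 0/1 flag per residue (0 = buried, 1 = exposed), whether predicted or taken from DSSP. The meaning of the 21st amino acid symbol and the name encode_with_accessibility are assumptions.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues + assumed 21st "unknown" symbol


def encode_with_accessibility(window, exposed):
    """Return 22 values per residue: 21 one-hot bits followed by 1 accessibility bit."""
    if len(window) != len(exposed):
        raise ValueError("need one accessibility flag per residue")
    vector = []
    for residue, flag in zip(window, exposed):
        bits = [1 if residue == symbol else 0 for symbol in ALPHABET]
        if sum(bits) == 0:
            bits[-1] = 1  # non-standard residue goes to the last slot
        vector.extend(bits)
        vector.append(1 if flag else 0)  # 0 = buried, 1 = exposed
    return vector


flags = [1, 0] * 8 + [1]  # toy accessibility states for 17 residues
pattern = encode_with_accessibility("ACDEFGHIKLMNPQRST", flags)
print(len(pattern))  # 374
```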

Secondary structure based model - For predicted secondary structure (obtained from PSI-PRED; http://bioinf.cs.ucl.ac.uk/psipred/), the values of all three states (helix, sheet and coil) were added. Secondary structure assigned by DSSP is represented in binary notation. The final input pattern is thus of size 17*24 units.
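
A minimal Python sketch of the 17*24 pattern follows. It assumes each residue contributes 21 amino acid bits plus three secondary structure values in the order helix, sheet, coil. For DSSP these three values are a one-hot code. For predicted structure they are whatever per-state values the predictor reports, supplied by the caller. The names DSSP_ONE_HOT and encode_with_structure are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues + assumed 21st "unknown" symbol

# Binary notation for an assigned three-state structure: (helix, sheet, coil).
DSSP_ONE_HOT = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}


def encode_with_structure(window, states):
    """Return 24 values per residue: 21 one-hot bits + (helix, sheet, coil) values."""
    if len(window) != len(states):
        raise ValueError("need one (helix, sheet, coil) triple per residue")
    vector = []
    for residue, triple in zip(window, states):
        bits = [1 if residue == symbol else 0 for symbol in ALPHABET]
        if sum(bits) == 0:
            bits[-1] = 1  # non-standard residue goes to the last slot
        vector.extend(bits)
        vector.extend(triple)
    return vector


window = "ACDEFGHIKLMNPQRST"
assigned = [DSSP_ONE_HOT[s] for s in "CCEEEECCCEEEECCCC"]  # observed structure
predicted = [(0.1, 0.7, 0.2)] * 17                         # toy per-state values

print(len(encode_with_structure(window, assigned)))   # 408
print(len(encode_with_structure(window, predicted)))  # 408
```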

Hybrid approach - In the final step, the patterns of all the above models (multiple alignment, accessibility and secondary structure) were used together for training.