SVM-based Hemoglobin Prediction

A dataset was assembled to examine the effectiveness of the new prediction method. The hemoglobin protein data were extracted from SWISSPROT, where 1570 hemoglobin protein sequences were available. The sequences carried annotations such as "by similarity", "fragments", "potential" and "experimental"; these were filtered to produce a high-quality dataset for predicting hemoglobin proteins. Sequence redundancy was then reduced with the PROSET software, so that no two sequences in the dataset shared more than 90% sequence identity. After filtering, the final dataset contained 525 hemoglobin sequences. A non-hemoglobin dataset of 1266 sequences was also constructed.

Support Vector Machine

The support vector machine (SVM) is a novel machine learning method based on the statistical learning theory presented by V. N. Vapnik. It has been applied successfully to numerous classification and pattern recognition problems, such as text categorization, image recognition and bioinformatics. Training an SVM yields a globally optimal solution to the classification problem, whereas neural networks trained with gradient-based algorithms can converge to a local optimum. SVM_light is a freely available package written by Joachims, which can be downloaded from http://ais.gmd.de/~thorsten/svm_light/. SVM_light was used to predict hemoglobin proteins. The SVM modules were developed from the following features of a protein: 1. amino acid composition, 2. dipeptide composition, and 3. hybrid composition.

i) Amino acid composition:
The amino acid composition is the fraction of each of the 20 amino acids in a protein. It represents the protein as a 20-dimensional vector.
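The computation can be illustrated with a minimal Python sketch. The function name, the residue ordering and the decision to ignore non-standard letters (such as X, B or Z) are assumptions made for illustration; the source does not describe how such residues were handled.

```python
# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional list with the fraction of each amino acid."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Assumption: letters outside the 20 standard residues are skipped.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        # Empty (or fully non-standard) sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

For the toy sequence "AAC", the resulting vector has 2/3 in the position of A, 1/3 in the position of C and zero elsewhere.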

ii) Dipeptide Composition:
The dipeptide composition represents a protein as a vector of 400 dimensions (20 x 20 possible amino acid pairs). It encapsulates information about the fraction of amino acids as well as their local order.
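A minimal Python sketch of this feature follows. It assumes that dipeptides are counted over overlapping adjacent residue pairs and that pairs containing a non-standard letter are skipped; neither detail is stated in the source, and the names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional list with the fraction of each dipeptide."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing non-standard letters are skipped.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

For the toy sequence "ACAC" the overlapping pairs are AC, CA and AC, so the entry for AC is 2/3 and the entry for CA is 1/3. Unlike the amino acid composition, this distinguishes AC from CA, which is how local order enters the feature.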

iii) Hybrid Composition:

The hybrid composition combines the amino acid composition, the dipeptide composition and the PSI-BLAST output. For the hybrid module, the SVM input vector has 423 dimensions: 20 from the amino acid composition, 400 from the dipeptide composition and 3 from the PSI-BLAST output.
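As a rough sketch, the 423-dimensional vector can be built by concatenation and written in the sparse text format that SVM_light reads (a target value followed by index:value pairs with indices starting at 1). The function names are hypothetical, and the three PSI-BLAST values are treated as opaque placeholders because the source does not define them.

```python
def hybrid_vector(aa_comp, dipep_comp, psiblast_features):
    """Concatenate the three feature groups into one 423-dimensional list."""
    if len(aa_comp) != 20 or len(dipep_comp) != 400 or len(psiblast_features) != 3:
        raise ValueError("expected 20 + 400 + 3 feature values")
    return list(aa_comp) + list(dipep_comp) + list(psiblast_features)


def to_svmlight_line(label, vector):
    """Format one example as '<target> <index>:<value> ...'.

    Feature indices start at 1 and appear in increasing order;
    zero-valued features are omitted (sparse representation).
    """
    parts = [f"{i}:{value:.6g}" for i, value in enumerate(vector, start=1) if value != 0]
    return f"{label:+d} " + " ".join(parts)


# Usage with made-up placeholder values (not data from the study).
example = hybrid_vector([0.05] * 20, [0.0] * 400, [1.0, 0.0, 0.5])
line = to_svmlight_line(+1, example)  # +1 = hemoglobin, -1 = non-hemoglobin
```

Features 1 to 20 then hold the amino acid composition, 21 to 420 the dipeptide composition and 421 to 423 the PSI-BLAST values.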

Prediction quality was examined by 5-fold cross-validation. In this technique the dataset was partitioned randomly into 5 equal subsets. Training and testing were carried out five times, each time using one subset for testing and the other four subsets for training. The accuracy of the results is commonly measured by the numbers of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). For the prediction system, the sensitivity, specificity, total prediction accuracy and Matthews correlation coefficient (MCC) were calculated with the following equations.
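The partitioning step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' procedure: the function name and fixed random seed are hypothetical, and when the number of examples is not divisible by 5 the subsets differ in size by at most one.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds rounds."""
    indices = list(range(n_samples))
    # Random partition; the seed only makes the sketch reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds subsets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test


# 525 hemoglobin + 1266 non-hemoglobin sequences = 1791 examples.
splits = list(five_fold_splits(525 + 1266))
```

Each example appears in exactly one test subset, so every sequence is predicted once by a model that did not see it during training.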

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \]
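
The four measures can be computed directly from the confusion counts, as in the minimal Python sketch below. The function name is illustrative, and returning 0.0 when a denominator is zero is an assumed convention that the source does not specify.

```python
import math


def prediction_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""

    def safe_div(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts for illustration only; these are not results from the study.
metrics = prediction_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these illustrative counts the sensitivity is 0.9, the specificity 0.8 and the accuracy 0.85. The MCC ranges from -1 to +1 and stays informative when the two classes differ in size, as they do here (525 versus 1266 sequences).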