Dataset Information
The dataset was used to examine the effectiveness of the new prediction method. The hemoglobin protein data were extracted from SWISSPROT. There were 1570 hemoglobin protein sequences available in the database. The hemoglobin sequences carried annotations such as by similarity, fragments, potential and experimental; these were filtered to produce a high-quality dataset for predicting hemoglobin proteins. The sequence redundancy of the dataset was reduced with the PROSET software, so that no two sequences in the dataset had >90% sequence identity to each other. After filtering, the final dataset contained 525 hemoglobin sequences. A non-hemoglobin dataset of 1266 sequences was also prepared.
Support Vector Machine
Support vector machine (SVM) is a novel machine learning method based on the statistical learning theory presented by V. N. Vapnik. It has been successfully applied to numerous classification and pattern recognition problems such as text categorization, image recognition and bioinformatics. SVM training yields a globally optimal solution to a classification problem, whereas neural networks trained with gradient-based algorithms can converge to local optima. SVM_light is a freely downloadable package written by Joachims, available from http://ais.gmd.de/~thorsten/svm_light/. SVM_light was used to predict hemoglobin proteins. The SVM modules were developed based on the following features of a protein:
1. Amino acid composition, 2. Dipeptide composition, and 3. Hybrid composition.
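SVM_light reads training examples as text lines of the form `<target> <feature>:<value> ...`, with feature numbers in increasing order and zero-valued features optionally omitted. The sketch below shows one way a feature vector could be written in that format. The function name `to_svmlight_line` and the choice of +1 for hemoglobin and -1 for non-hemoglobin are assumptions made for illustration, not details given in the text.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' (1-based indices).

    label:    positive for the hemoglobin class, otherwise negative
              (this labeling convention is an assumption).
    features: sequence of numeric feature values.
    """
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # sparse format: zero-valued features are skipped
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical three-feature example, for illustration only.
    print(to_svmlight_line(1, [0.5, 0.0, 0.25]))
```

Writing one such line per protein gives a file that can be passed to the SVM_light training program.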
i) Amino Acid Composition:
The amino acid composition represents a protein as a 20-dimensional vector. Each component is the fraction of one amino acid in the protein, that is, the number of residues of that type divided by the total number of residues.
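As a minimal sketch of this feature, the function below computes the 20 fractions for a sequence given in one-letter code. The alphabetical residue ordering and the decision to ignore non-standard letters (such as X) are assumptions; the text does not specify either.

```python
# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each amino acid."""
    seq = sequence.upper()
    # Count only the 20 standard residues; other letters are ignored.
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    composition = amino_acid_composition("AACD")
    print(dict(zip(AMINO_ACIDS, composition)))
```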
ii) Dipeptide Composition:
The dipeptide composition represents a protein as a vector of
400 dimensions, one for each ordered pair of amino acids. It
encapsulates information about the fraction of amino acids as
well as their local order.
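A minimal sketch of the dipeptide feature is given below. It counts overlapping adjacent residue pairs and divides by the number of counted pairs. The pair ordering and the skipping of pairs that contain a non-standard letter are assumptions, since the text does not state them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each dipeptide."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard letters
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / total for dp in DIPEPTIDES]


if __name__ == "__main__":
    # Toy sequence: pairs are AC, CA, AC.
    vector = dipeptide_composition("ACAC")
    print(len(vector), vector[DIPEPTIDES.index("AC")])
```

Because the pairs are ordered, "AC" and "CA" occupy different positions, which is how the vector retains local order information.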
iii) Hybrid Composition:
The hybrid composition combines the amino acid composition, the
dipeptide composition and the PSI-BLAST output. For the hybrid
module the SVM input vector has 423 dimensions: 20 from the amino
acid composition, 400 from the dipeptide composition and 3 from
the PSI-BLAST output.
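The assembly of the 423-dimensional input can be sketched as a simple concatenation. The text does not say what the three PSI-BLAST values are, so they are treated here as an opaque list supplied by the caller; the function name `hybrid_vector` and the order of concatenation are likewise assumptions.

```python
def hybrid_vector(aa_comp, dipep_comp, psiblast_features):
    """Concatenate the three feature groups into one 423-element list.

    aa_comp:           20 amino acid fractions
    dipep_comp:        400 dipeptide fractions
    psiblast_features: 3 values derived from PSI-BLAST output
                       (their definition is not given in the text)
    """
    expected = (("aa_comp", aa_comp, 20),
                ("dipep_comp", dipep_comp, 400),
                ("psiblast_features", psiblast_features, 3))
    for name, values, size in expected:
        if len(values) != size:
            raise ValueError(f"{name} must have {size} values, got {len(values)}")
    return list(aa_comp) + list(dipep_comp) + list(psiblast_features)


if __name__ == "__main__":
    # Placeholder inputs, for illustration only.
    vector = hybrid_vector([0.05] * 20, [0.0025] * 400, [0.0, 0.0, 1.0])
    print(len(vector))
```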
Evaluation of Performance
The prediction quality was examined with 5-fold cross-validation.
In this technique the dataset was partitioned randomly into 5
equal-sized subsets. Training and testing were carried out five
times, each time using one subset for testing and the other 4
subsets for training. Prediction results are commonly measured by
the numbers of True Positives (TP), True Negatives (TN), False
Positives (FP) and False Negatives (FN). For the prediction system,
the sensitivity, specificity, total prediction accuracy and Matthews
correlation coefficient (MCC) were calculated with the following
equations.
Sensitivity = TP / (TP+FN),
Specificity = TN / (TN+FP),
Accuracy = (TP+TN) / (TP+TN+FP+FN) and
MCC = ((TP*TN) - (FP*FN)) / √((TP+FN)*(TP+FP)*(TN+FP)*(TN+FN)).
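The evaluation procedure can be sketched as follows: one helper deals shuffled indices into five folds, and another computes the four measures from the confusion counts. The function names, the fixed random seed, and the convention of returning 0.0 when a denominator is zero are assumptions; the counts in the usage lines are made up for illustration and are not results from the study.

```python
import math
import random


def five_fold_partitions(n_items, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into 5 near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::5] for k in range(5)]


def evaluate(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from the counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    folds = five_fold_partitions(12)
    for k, test_fold in enumerate(folds):
        # The other four folds together form the training set.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        print(k, sorted(test_fold), len(train))
    # Hypothetical counts, for illustration only.
    print(evaluate(tp=40, tn=45, fp=5, fn=10))
```

In a full run, the counts from the five test folds would be accumulated or averaged before the measures are reported.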