Dataset Information
The dataset was obtained from the SuperSite encyclopedia and contains 247 GTP-binding protein chains. It was filtered at 90% sequence identity using the program CD-HIT, which produces a non-redundant dataset using a greedy incremental algorithm. After removing redundancy, the final dataset contained 55 protein chains. The PDB IDs were then used in Ligand Protein Contact (LPC: Edelman M 1998) to extract all chains that interact with GTP, together with their interacting residues.
Binary Patterns
Each amino acid was represented as a binary string of length 20, in which
19 positions are "0" and one unique position is set to "1". For example, the
amino acid alanine (A) can be represented as follows:
A = 10000000000000000000
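The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the alphabetical ordering in `AMINO_ACIDS` and the function names are assumptions, since the text only fixes that A maps to the first position.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
# The source only shows that A occupies the first position.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_pattern(residue):
    """Return the 20-character binary string for one amino acid."""
    # str.index raises ValueError for residues outside the 20 standard ones.
    index = AMINO_ACIDS.index(residue.upper())
    return "".join("1" if i == index else "0" for i in range(len(AMINO_ACIDS)))


def encode_sequence(sequence):
    """Encode every residue of a sequence as its binary pattern."""
    return [binary_pattern(residue) for residue in sequence]


print("A =", binary_pattern("A"))  # A = 10000000000000000000
```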
Support Vector Machine
Support vector machine (SVM) is a machine learning method based on the statistical learning theory presented by V. N. Vapnik. It has been successfully applied to numerous classification and pattern recognition problems, such as text categorization, image recognition and bioinformatics. SVM training yields a globally optimal solution, whereas neural networks rely on gradient-based training algorithms whose solution to a classification problem may be only locally optimal. SVM_light is a freely downloadable package written by Thorsten Joachims, available from http://ais.gmd.de/~thorsten/svm_light/. SVM_light was used to predict the GTP interacting proteins. The SVM modules were developed based on the binary patterns of amino acids and on PSSM patterns.
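SVM_light reads training examples as text lines of the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The helper below is a hypothetical sketch of how a feature vector (binary or PSSM based) could be written in that format; the function name and the choice to omit zero-valued features are assumptions, not part of the original pipeline.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' (1-based indices)."""
    target = "+1" if label > 0 else "-1"
    # The format is sparse, so zero-valued features can be left out.
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)


print(to_svmlight_line(1, [1, 0, 0, 0.5]))  # +1 1:1 4:0.5
```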
PSSM generated by PSI-BLAST
In the present study, an attempt was made to use the position-specific scoring
matrix (PSSM) generated by PSI-BLAST as an input feature for training the SVM. The PSI-BLAST search was carried out against the non-redundant dataset available at NCBI, and the sequences found in one round of search were used to build a score model for the next round of searching. After three iterations with a cut-off E-value of 0.001, it generated a PSSM having the highest score as a part of the prediction process. The matrix consisted of 20 × M elements, where M is the length of the target sequence, and each element represents the frequency of occurrence of each of the 20 amino acids at one position in the alignment.
Next, each element of the matrix (20 × M) was scaled to the range 0-1 using a sigmoid function.
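As an illustration of this scaling step, the sketch below applies the standard logistic sigmoid, 1 / (1 + e^(-x)), to every PSSM element. It assumes the PSSM is held as a list of M rows (one per sequence position) with 20 scores each; that layout and the function names are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normalize_pssm(pssm):
    """Scale every element of a PSSM (M rows of 20 scores) into the range 0-1."""
    return [[sigmoid(value) for value in row] for row in pssm]
```

A raw score of 0 maps to 0.5, strongly negative scores approach 0, and strongly positive scores approach 1.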
Further, to obtain an input of fixed length, the normalized PSSM (20 × M) was used to generate a 400-dimensional input vector by summing all rows of the PSSM that correspond to the same amino acid in the sequence. Finally, each element of this input vector was divided by the length of the protein sequence. The result is a matrix of 20 × 20 elements, that is, 400 values.
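The fixed-length vector described above can be sketched as follows. This is one reading of the text, under stated assumptions: the normalized PSSM is a list of M rows of 20 values, residues follow the alphabetical order in `AMINO_ACIDS`, and non-standard residues are skipped. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering


def pssm_to_400(sequence, norm_pssm):
    """Collapse an M x 20 normalized PSSM into a 400-dimensional vector."""
    if len(sequence) != len(norm_pssm):
        raise ValueError("sequence and PSSM must have the same length")
    # totals[r] accumulates the PSSM rows of all positions whose residue is r.
    totals = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, norm_pssm):
        r = AMINO_ACIDS.find(residue)
        if r < 0:
            continue  # assumption: ignore non-standard residues such as X
        for c, value in enumerate(row):
            totals[r][c] += value
    length = len(sequence)
    # Flatten the 20 x 20 matrix and divide each element by the sequence length.
    return [value / length for row in totals for value in row]
```

Because the output size no longer depends on M, proteins of different lengths yield vectors of the same dimension.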
Evaluation of Performance
Prediction quality was examined using 5-fold cross-validation.
In this technique, the dataset was partitioned randomly into
5 equal sets. Training and testing were carried out five times,
each time using one set for testing and the other 4 sets for
training. The accuracy of results is commonly measured in terms
of True Positives (TP), True Negatives (TN), False Positives (FP)
and False Negatives (FN). The total prediction accuracy, Matthews
correlation coefficient (MCC), sensitivity and specificity were
calculated using the following equations.
Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP),
Accuracy = (TP + TN) / (TP + TN + FP + FN) and
MCC = ((TP * TN) - (FP * FN)) / sqrt((TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)).
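The evaluation procedure and the four measures above can be sketched as below. This is only an illustration: the function names, the fixed random seed and the convention of returning 0.0 when a denominator is zero are assumptions, and fold sizes differ by at most one when the dataset size is not divisible by 5.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly partition the indices 0..n-1 into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def _ratio(numerator, denominator):
    # Assumed convention: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and MCC from the four counts."""
    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Each fold serves once as the test set; the other four form the training set.
folds = k_fold_indices(55)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
```

For example, `evaluate(40, 45, 5, 10)` gives a sensitivity of 0.8, a specificity of 0.9 and an accuracy of 0.85.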