Dataset Information
In this study, we used a dataset of human-HIV protein-protein interactions (PPIs) from the NCBI HIV-Human protein interaction database. The database gives two categories of protein interaction: direct and indirect. We took the human proteins that are known to interact directly with HIV-1. For the negative dataset, we took random mitochondrial, cytoplasmic and nuclear proteins with 'Human' as the source organism from the UniProt database.
Prediction Approaches
Four main prediction approaches were used in this study.
Amino Acid Composition (AAC)
In this approach, the percentage composition of all 20 amino acids was calculated, which in turn was used to derive the weight corresponding to each amino acid. The weights were obtained by subtracting the composition data. To score an unknown protein, its composition is calculated and each value is multiplied by the corresponding weight. The 20 values obtained in this way are summed to give the cumulative score.
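The scoring scheme described above can be sketched in Python as follows. This is a minimal illustration, not the study's implementation: the function names are hypothetical, and the weight is assumed here to be the mean composition in the positive set minus the mean composition in the negative set, which the text does not state explicitly.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the percentage composition of each standard amino acid."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return {a: 100.0 * residues.count(a) / len(residues) for a in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the percentage composition over a set of sequences."""
    comps = [aac(s) for s in sequences]
    return {a: sum(c[a] for c in comps) / len(comps) for a in AMINO_ACIDS}


def aac_weights(positive, negative):
    """ASSUMPTION: weight = mean positive composition - mean negative composition."""
    pos = mean_composition(positive)
    neg = mean_composition(negative)
    return {a: pos[a] - neg[a] for a in AMINO_ACIDS}


def aac_score(sequence, weights):
    """Multiply each composition value by its weight and sum the 20 products."""
    comp = aac(sequence)
    return sum(comp[a] * weights[a] for a in AMINO_ACIDS)
```

Under this assumption, a higher cumulative score means the composition of the query protein is closer to that of the positive set.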
Dipeptide Composition (DPC)
In this approach, the percentage composition of all 400 (20×20) dipeptides was calculated, which in turn was used to derive the weight corresponding to each dipeptide. The weights were obtained by subtracting the composition data. To score an unknown protein, its dipeptide composition is calculated and each value is multiplied by the corresponding weight. The 400 values obtained in this way are summed to give the cumulative score.
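A minimal Python sketch of the dipeptide composition is given below. It assumes that dipeptides are counted as overlapping pairs of adjacent residues and that pairs containing a non-standard residue are skipped; both choices, and the function names, are illustrative rather than taken from the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dpc(sequence):
    """Return the 400 dipeptide percentages in the order of DIPEPTIDES."""
    seq = sequence.upper()
    # ASSUMPTION: overlapping windows of two adjacent residues.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for p in pairs:
        counts[p] += 1
    return [100.0 * counts[d] / len(pairs) for d in DIPEPTIDES]


def dpc_score(sequence, weights):
    """Multiply each of the 400 values by its weight and sum the products."""
    return sum(c * w for c, w in zip(dpc(sequence), weights))
```

Here `weights` is expected to be a list of 400 numbers in the same order as `DIPEPTIDES`.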
Split Amino Acid Composition (SAAC)
In SAAC, a sequence was divided into non-overlapping fragments and the amino acid composition of each fragment was calculated independently. Thus, the dimension of the final input vector was N×20, where N is the number of fragments. In this study, V3 sequences were divided into two parts (N = 2), generating 40 input dimensions. All these input vectors were used to develop SVM models.
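The construction of the N×20 vector can be illustrated with the short Python sketch below. The text does not say where the fragment boundaries fall, so the sketch assumes fragments of nearly equal length; the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fragment_composition(fragment):
    """Percentage composition of one fragment (all zeros if it is empty)."""
    if not fragment:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * fragment.count(a) / len(fragment) for a in AMINO_ACIDS]


def saac(sequence, n_fragments=2):
    """Concatenate the compositions of n non-overlapping fragments."""
    seq = "".join(r for r in sequence.upper() if r in AMINO_ACIDS)
    # ASSUMPTION: fragments are as equal in length as possible.
    size, extra = divmod(len(seq), n_fragments)
    vector, start = [], 0
    for i in range(n_fragments):
        end = start + size + (1 if i < extra else 0)
        vector.extend(fragment_composition(seq[start:end]))
        start = end
    return vector  # length n_fragments * 20
```

With the default `n_fragments=2`, the function returns the 40-dimensional vector described above.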
Domain Based Approach
In this approach, we identified the domains found exclusively in a set of proteins. These domains were identified using the 'hmmpfam' tool of the iprscan software. We used these domains as input vectors for the SVM and predicted the human proteins that physically interact with HIV.
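One plausible way to turn domain annotations into SVM input vectors is a binary presence/absence encoding, sketched below in Python. The text does not specify the encoding used, so this is only an assumption; the function names are hypothetical, and the domain assignments are expected to have been parsed from the hmmpfam output beforehand.

```python
def domain_vocabulary(protein_domains):
    """Collect every domain identifier seen in the dataset, in sorted order."""
    return sorted({d for domains in protein_domains.values() for d in domains})


def domain_vector(domains, vocabulary):
    """ASSUMPTION: encode a protein as 1/0 flags, one per known domain."""
    present = set(domains)
    return [1 if d in present else 0 for d in vocabulary]
```

For example, with a vocabulary of three domains, a protein carrying the first and third would be encoded as `[1, 0, 1]`.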
Support Vector Machine
The support vector machine (SVM) is a machine learning technique based on the statistical learning theory presented by V. N. Vapnik. It has been successfully applied to numerous classification and pattern recognition problems, such as text categorization, image recognition and bioinformatics. SVM training yields a globally optimal solution, whereas neural networks rely on gradient-based training algorithms to reach a solution for a classification problem. An implementation is freely available as the SVM_light package, written by Thorsten Joachims (1999), and can be downloaded from Joachims' website. The software enables the user to define a number of parameters and to select from a choice of built-in kernel functions, including linear, polynomial and radial basis function (RBF) kernels.
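SVM_light reads its training and test examples from plain-text files with one example per line, written as a target value followed by `index:value` pairs with indices starting at 1 in increasing order. The Python sketch below shows how a feature vector might be written in that format; the helper name and the choice to omit zero-valued features are illustrative.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' for SVM_light."""
    if label not in (1, -1):
        raise ValueError("label must be +1 (positive) or -1 (negative)")
    parts = ["+1" if label == 1 else "-1"]
    # Feature indices start at 1; zero-valued features may be left out.
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)
```

A file of such lines can then be passed to the `svm_learn` and `svm_classify` programs that ship with SVM_light.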
Evaluation Parameters
The prediction quality was examined using the 5-fold cross-validation technique. In this technique, the relevant dataset was partitioned randomly into five equal sets. Training and testing were carried out five times, each time using one set for testing and the other four sets for training. The accuracy of the results is commonly measured in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of the prediction system were calculated using the following equations:
Sensitivity = (TP / (TP+FN))×100,
Specificity = (TN / (TN+FP))×100,
Accuracy = ((TP+TN) / (TP+TN+FP+FN))×100,
MCC = ((TP×TN)-(FP×FN)) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
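The evaluation procedure can be sketched in Python as follows. The first function partitions the sample indices randomly into five folds and yields each train/test split; the second computes the four measures from the confusion counts. The function names, the fixed random seed and the convention of returning an MCC of 0 when its denominator is zero are assumptions made for illustration.

```python
import math
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def evaluate(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy (all in %) and MCC.

    Assumes both classes are present, so TP+FN and TN+FP are non-zero.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # ASSUMPTION: report MCC as 0 when the denominator vanishes.
    mcc = ((tp * tn) - (fp * fn)) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc
```

In a full run, the confusion counts from the five test folds would be obtained from the classifier's predictions and passed to `evaluate`, either per fold or pooled.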