HIVcoPRED

Server for prediction of Coreceptor usage by HIV-1
Algorithm


Dataset Information
We extracted all HIV-1 V3 sequence IDs, across all subtypes, from the Los Alamos HIV sequence database (www.hiv.lanl.gov/); this gave 5181 R5-tropic and 612 X4-tropic sequences. Removing duplicate sequences from each set left a non-redundant dataset of 1799 R5-tropic and 246 X4-tropic sequences. We also added 352 R5X4-tropic sequences to the CXCR4 dataset, giving a final X4 dataset of 598 (246+352) sequences. Each sequence was unique within the R5-tropic dataset and within the X4-tropic dataset.
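The dataset construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and it assumes that duplicates are exact string matches after trimming whitespace and upper-casing, which the text does not specify.

```python
def unique(seqs):
    """Drop duplicate sequences, keeping the first occurrence of each.

    Assumption: two sequences are duplicates when they are identical
    after trimming whitespace and upper-casing.
    """
    return list(dict.fromkeys(s.strip().upper() for s in seqs))


def build_datasets(r5, x4, r5x4):
    """Return (R5 dataset, X4 dataset) as lists of unique sequences.

    R5X4 (dual-tropic) sequences are placed in the X4 dataset,
    as described in the text.
    """
    r5_set = unique(r5)
    x4_set = unique(list(x4) + list(r5x4))
    return r5_set, x4_set


# Toy example with made-up strings (not real V3 sequences).
r5_set, x4_set = build_datasets(["AAA", "aaa ", "CCC"], ["GGG", "GGG"], ["TTT"])
print(len(r5_set), len(x4_set))
```

With the counts reported above, the X4 dataset would contain 246 + 352 = 598 sequences.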

Prediction Approaches
Several approaches were used in this study, e.g. Amino Acid Composition (AAC), Dipeptide Composition (DPC), Split Amino Acid Composition (SAAC), Binary based, Two Sample Logo (TSL) based, Binary+TSL based and Hybrid (SAAC+BLAST). Since the SAAC and Hybrid approaches performed better than the others, the final prediction models are based on these two approaches only.

Split Amino Acid Composition (SAAC) based approach
In SAAC, a sequence is divided into non-overlapping fragments and the amino acid composition of each fragment is calculated independently. Thus, the dimension of the final input vector is N×20, where N is the number of fragments. In this study, V3 sequences were divided into two parts (N = 2), giving 40 input dimensions. These input vectors were used to develop the SVM models.
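A minimal sketch of the SAAC calculation is given below. Several details are assumptions, since the text does not state them: the composition is expressed as a percentage, the sequence is split into near-equal parts with any extra residue going to the earlier part, and residues outside the 20 standard amino acids count towards fragment length only. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(fragment):
    """Percent composition of the 20 standard amino acids in one fragment."""
    n = len(fragment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * fragment.count(a) / n for a in AMINO_ACIDS]


def saac(sequence, n_parts=2):
    """Split amino acid composition: n_parts x 20 features.

    Assumption: the sequence is cut into n_parts contiguous,
    non-overlapping fragments of near-equal length.
    """
    seq = sequence.strip().upper()
    size, extra = divmod(len(seq), n_parts)
    features, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        features.extend(aac(seq[start:end]))
        start = end
    return features


# Toy sequence: the first half is all A, the second half is all C.
vec = saac("AAAACCCC")
print(len(vec))          # 40 features for N = 2
print(vec[0], vec[21])   # A in part 1, C in part 2
```

Keeping the two halves separate lets the model see where in the V3 loop a residue is enriched, which plain AAC averages away.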

Hybrid (SAAC+BLAST) approach
In this study, we introduced a different approach for predicting coreceptor usage that integrates the best SVM model (SAAC) with BLAST. In this hybrid approach, prediction is done in four steps:
(1) The SAAC of the query sequence is calculated and passed to the best SVM model to generate an SVM score.
(2) The query sequence is searched with BLAST against the dataset sequences (1799/598), and the E-value of the most similar sequence is recorded.
(3) The SVM score is placed alongside the E-value.
(4) Depending on the E-value of the BLAST output, the SVM score is modified: (I) if the matched sequence is CCR5 and the E-value is "-17 or below" (i.e. -18, -19), "1" is added to the SVM score; (II) similarly, if the matched sequence is CXCR4 and the E-value is "-17 or below" (i.e. -18, -19), "1" is subtracted from the SVM score.
This is a distinct way of combining the features of the SAAC-based SVM model and the BLAST search. The final score is used to predict the status of the query sequence, so the best of both approaches is integrated into a single output used for prediction. See Hybrid Approach.
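The score adjustment in step (4) can be sketched as below. This is an illustration under stated assumptions, not the server's code: "-17 or below" is read as an E-value exponent, i.e. E-value ≤ 1e-17; the final decision threshold of 0 and the convention that higher scores mean R5 are assumptions (the text only says the final score is used for prediction); and all function and parameter names are hypothetical.

```python
def hybrid_score(svm_score, hit_class=None, hit_evalue=None, evalue_cutoff=1e-17):
    """Adjust an SVM score using the best BLAST hit.

    hit_class     -- "CCR5" or "CXCR4" for the most similar sequence,
                     or None when BLAST returns no hit
    hit_evalue    -- E-value of that hit
    evalue_cutoff -- assumption: "-17 or below" means E-value <= 1e-17
    """
    if hit_class is None or hit_evalue is None:
        return svm_score
    if hit_evalue <= evalue_cutoff:
        if hit_class == "CCR5":
            return svm_score + 1.0
        if hit_class == "CXCR4":
            return svm_score - 1.0
    return svm_score


def classify(score, threshold=0.0):
    """Assumption: scores above the threshold are called R5, others X4."""
    return "R5" if score > threshold else "X4"


# A weak negative SVM score is overturned by a strong CCR5 BLAST hit.
final = hybrid_score(-0.5, hit_class="CCR5", hit_evalue=1e-19)
print(final, classify(final))
```

A hit with a weaker E-value (for example 1e-10) leaves the SVM score unchanged, so BLAST only overrides the SVM when the match is very close.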

Support Vector Machine (SVM)
The SVM is a widely used machine learning technique, freely available as the SVM_light package written by Thorsten Joachims (1999). The software lets the user set a number of parameters and choose among built-in kernel functions, including linear, polynomial and radial basis function (RBF) kernels. SVMs are based on the statistical learning theory presented by V.N. Vapnik and have been successfully applied to numerous classification and pattern recognition problems, such as text categorization, image recognition and bioinformatics. SVM training yields a globally optimal solution to the classification problem, whereas neural networks trained with gradient-based algorithms may converge only to a local optimum. The SVM_light package can be downloaded from Joachims' website. Here, we used SVM_light to predict coreceptor usage by HIV-1. SVM models were developed on input features such as AAC, DPC, SAAC, Hybrid and binary patterns, all derived from V3 amino acid sequence information. Finally, the SAAC and Hybrid based models were implemented in the web server for online prediction.
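As an illustration of how feature vectors reach SVM_light, the sketch below writes one example in the package's sparse input format: a target value followed by `index:value` pairs, with indices starting at 1 and in increasing order. The labelling of R5 as +1 and X4 as -1 is an assumption, the helper name is hypothetical, and the kernel and parameters used for the final models are not stated here.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM_light input line.

    label    -- +1 or -1 (assumption: +1 = R5, -1 = X4)
    features -- dense list of feature values, e.g. a 40-value SAAC vector
    Zero-valued features are omitted, as the sparse format allows.
    """
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


# Three-feature toy vector; a SAAC vector would have 40 entries.
print(to_svmlight_line(1, [0.0, 50.0, 2.5]))
```

Files of such lines are what the package's `svm_learn` and `svm_classify` programs read; the kernel type is selected with the `-t` option of `svm_learn`.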

Evaluation of Performance
Prediction quality was examined using 5-fold cross-validation. In this technique, the dataset is partitioned randomly into 5 equal-sized sets. Training and testing are carried out five times, each time using one set for testing and the other 4 sets for training. Performance is measured from the numbers of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). Sensitivity, Specificity, Accuracy and the Matthews correlation coefficient (MCC) were calculated using the following equations:

Sensitivity = (TP / (TP+FN))×100,

Specificity = (TN / (TN+FP))×100,

Accuracy = ((TP+TN) / (TP+TN+FP+FN))×100,

MCC = ((TP×TN)-(FP×FN)) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
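The evaluation procedure and the four measures can be sketched as follows. The fold assignment (shuffle, then deal indices round-robin), the fixed random seed and the choice to return 0 when a denominator is zero are assumptions made for illustration; the function names are hypothetical.

```python
import math
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def _ratio(numerator, denominator):
    """Safe division; assumption: report 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def performance(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy in percent, plus MCC."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * _ratio(tp, tp + fn),
        "specificity": 100.0 * _ratio(tn, tn + fp),
        "accuracy": 100.0 * _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up confusion counts, for illustration only.
print(performance(tp=40, tn=40, fp=10, fn=10))

for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

MCC is reported alongside accuracy because the R5 and X4 datasets differ in size, and accuracy alone can look high when the larger class dominates.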

Department of Computational Biology, Indraprastha Institute of Information Technology,New Delhi,India