Carbohydrate epitope prediction server

From the IEDB database we obtained 240 carbohydrate epitopes in SMILES format that were reported to be antigenic. These carbohydrate molecules came from a range of sources, from bacteria to mammals. Using the Babel program, the molecules were converted to different formats with hydrogen atoms added. They were then subjected to 3D conversion and energy minimization using the V-life program, which rejected a number of molecules because of errors in the molecules, reducing the set to 219 antigenic carbohydrate molecules. These formed the positive dataset. As a counterpart, we compiled 571 non-antigenic carbohydrate molecules from PDB and KEGG structures from edible plants and non-pathogenic microorganisms, after removing entries containing the terms 'antigen', 'epitope', 'antibody', 'immune' or 'allergy'. The positive and negative molecules together constituted the main dataset (mdset). A second negative dataset (rdset) was created using molecular clustering; it contains molecules that are similar in structure to the positive carbohydrates. This was done to check the models for prediction bias, and it yielded 261 negative molecules. These data were used for the calculation of descriptors with different software packages, including CDK ( and Padel (
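The keyword-based exclusion used when compiling the negative set can be illustrated with a minimal sketch. It assumes each candidate entry carries a free-text description and that matching is a case-insensitive substring test; the record identifiers and descriptions below are hypothetical, and the actual filtering procedure used for the server is not described beyond the list of terms.

```python
# Terms named in the text; an entry mentioning any of them is excluded.
EXCLUDE_TERMS = ("antigen", "epitope", "antibody", "immune", "allergy")


def is_candidate_negative(description: str) -> bool:
    """Return True if the description contains none of the excluded terms.

    Assumption: case-insensitive substring matching on a free-text field.
    """
    text = description.lower()
    return not any(term in text for term in EXCLUDE_TERMS)


# Hypothetical records: (identifier, free-text description).
records = [
    ("mol_1", "Sucrose from sugar cane"),
    ("mol_2", "Blood group antigen trisaccharide"),
    ("mol_3", "Cellulose fragment, plant cell wall"),
    ("mol_4", "Ligand bound to antibody Fab fragment"),
]

negatives = [rid for rid, desc in records if is_candidate_negative(desc)]
print(negatives)  # ['mol_1', 'mol_3']
```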
Descriptor calculation:
To derive the structure-activity relationship of each molecule, we calculated descriptors using two software packages, CDK and Padel. The Chemistry Development Kit (CDK), implemented in an in-house program WebCdk (, is a Java-based open-source library for structural chemo- and bioinformatics projects, with which we computed 178 descriptors. However, it does not separate the descriptors into 2D, 3D or fingerprint classes. As a second package we used Padel, which calculates more than 800 descriptors with separate classes for 2D, 3D and fingerprints.
Feature selection through Weka:
CfsSubsetEval was used as the evaluator; it considers the individual predictive ability of each attribute along with the degree of redundancy between them. Subsets of attributes that are highly correlated with the class while having low intercorrelation are preferred. The search over subsets was then carried out with a genetic algorithm (Goldberg, D.E. 1989). During feature selection, missing values, zero values etc. were handled by Weka, which led to 218 and 41 selected descriptors for V-life and WebCdk respectively.
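The preference for subsets that correlate with the class but not with each other can be made concrete with the correlation-based feature selection merit, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below is an illustration only: it assumes absolute Pearson correlation as the correlation measure (Weka's implementation may use a different one), scores a single subset rather than performing the genetic search, and all function names and data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(features, labels):
    """CFS merit of a feature subset; `features` is a list of columns."""
    k = len(features)
    if k == 0:
        return 0.0
    # Mean feature-class correlation.
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    # Mean feature-feature correlation (0 when the subset has one feature).
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


labels = [0, 0, 1, 1]
informative = [0.1, 0.2, 0.9, 1.0]   # tracks the class closely
irrelevant = [0.0, 1.0, 0.0, 1.0]    # uncorrelated with the class

print(cfs_merit([informative], labels))
print(cfs_merit([informative, informative], labels))  # redundant copy adds nothing
print(cfs_merit([informative, irrelevant], labels))   # irrelevant feature lowers merit
```

In this toy case a duplicated feature leaves the merit unchanged and an irrelevant feature reduces it, which is the behaviour the evaluator is meant to reward and penalize.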
Weka Classifiers:
In this study, Random Forest performed well in predicting carbohydrate epitopes compared with SVM-based methods (LibSVM, SMO), neural networks and discriminant analysis. This tree-based method builds on Breiman's bagging concept, in which successive trees do not depend on earlier trees; each tree is constructed independently using a bootstrap sample of the data set.
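The bagging idea, independent models each trained on a bootstrap sample and combined by majority vote, can be sketched as follows. This is a toy illustration, not the Weka classifier: the base learner is a hypothetical one-feature threshold rule standing in for a decision tree, the data are invented, and the random feature selection at each split that Random Forest adds on top of bagging is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Toy base learner (stand-in for a tree): threshold midway between class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # Degenerate bootstrap sample: always predict the only class seen.
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    positive_above = mean_pos >= mean_neg
    return lambda x: int((x >= threshold) == positive_above)


def bagged_predict(models, x):
    """Majority vote over independently trained models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Hypothetical (feature value, class label) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.8, 1), (0.9, 1), (1.0, 1), (1.1, 1)]

# Each model sees its own bootstrap sample and never depends on the others.
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 0.05), bagged_predict(models, 1.2))
```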
Performance Measures:
The performance of the models developed in this study was computed using threshold-dependent as well as threshold-independent parameters. All methods and models were evaluated using 5-fold cross-validation, with the following equations.
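For readers unfamiliar with 5-fold cross-validation, the splitting step can be sketched as below: the data are shuffled once, divided into five folds, and each fold serves as the test set exactly once while the remaining four are used for training. The function name and seed are hypothetical, the split is unstratified, and the sample count of 790 is simply the sum of the 219 positive and 571 negative molecules of the main dataset; the partitioning actually used in the study may differ.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 219 positives + 571 negatives in the main dataset.
for train, test in k_fold_indices(790, k=5):
    print(len(train), len(test))  # 632 158 for every fold
```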
Precision and recall:

Precision is a measure of exactness: the number of true positives (tp) divided by the total number of elements labeled as the positive class (i.e. true positives + false positives (fp)).
Precision = tp / (tp + fp)

Recall is the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. true positives + false negatives (fn)). Recall is equivalent to the sensitivity of the model.
Recall = tp / (tp + fn)

F measure - Generally there is an inverse relationship between precision and recall. These two parameters are therefore not discussed in isolation but combined into a single measure, termed the F measure (F), which is the harmonic mean of precision and recall.
F = 2 * (precision * recall) / (precision + recall)

Accuracy - All three parameters above consider true positives, false positives and false negatives but take no account of true negatives (tn). Accuracy measures the proportion of correct results of the model and considers both true positives and true negatives in the population.
Accuracy = (tp + tn) / (tp + tn + fp + fn)
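The four threshold-dependent measures can be computed directly from the confusion-matrix counts. The sketch below is a minimal illustration; the function names are hypothetical, the counts are invented for the example and are not results from this study, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp, fp):
    """tp / (tp + fp); 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """tp / (tp + fn); 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp, tn, fp, fn):
    """(tp + tn) / (tp + tn + fp + fn)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Invented counts for illustration only.
tp, fp, fn, tn = 40, 10, 20, 130
print(round(precision(tp, fp), 4))         # 0.8
print(round(recall(tp, fn), 4))            # 0.6667
print(round(f_measure(tp, fp, fn), 4))     # 0.7273
print(round(accuracy(tp, tn, fp, fn), 4))  # 0.85
```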

ROC Plot - As a threshold-independent parameter we use the ROC (Receiver Operating Characteristic) plot, which plots sensitivity against (1 - specificity) and yields the area under the curve (AUC).
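As an illustration of how the AUC follows from the ROC plot, the sketch below sweeps the decision threshold from the highest score to the lowest, records (1 - specificity, sensitivity) at each step, and integrates with the trapezoidal rule. The function names, scores and labels are hypothetical, and the input is assumed to contain at least one positive and one negative example; this is not necessarily how Weka computes the value.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points for a descending threshold sweep."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented classifier scores and true labels (1 = antigenic).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(roc_points(scores, labels)), 4))  # 0.8889
```

An AUC of 1.0 corresponds to a perfect ranking of positives above negatives, while 0.5 corresponds to a ranking no better than chance.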