Dataset
Data on NAD-binding proteins were obtained from the Protein Data Bank (PDB). We extracted 555 PDB IDs of NAD-binding entries from SuperSite by entering 'NAD' in the cofactor search field. We submitted these PDB IDs to the Ligand Protein Contact (LPC) server and obtained 1556 amino acid chains with contact details.
In the present study we used a cutoff of 6 Å to define an NAD-interacting residue (NIR), to allow for experimental noise. Using CD-HIT, we retained only non-redundant protein chains, such that no two chains shared more than 40% sequence similarity; the resulting main dataset contains 195 NAD-interacting protein chains. In total, it contains 4772 NIRs out of 65783 amino acid residues.
Window patterns of different lengths (15, 17 and 19) were created from the sequence data and converted into binary patterns. Each residue was represented by a vector of dimension 21 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), covering the 20 amino acids and one dummy amino acid 'X'. A pattern of window length W was therefore represented by a vector of dimension 21 × W.
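The binary encoding described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the residue ordering (chosen so that Ala maps to the first position, as in the example above), the padding of chain termini with the dummy residue 'X', and the function names `one_hot` and `encode_windows` are all assumptions.

```python
# Assumed ordering: 20 standard amino acids (Ala first) followed by dummy 'X'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "X"


def one_hot(residue):
    """Return a 21-dimensional binary vector for one residue."""
    vec = [0] * len(ALPHABET)
    idx = ALPHABET.find(residue)
    if idx == -1:
        # Assumption: unknown symbols are treated as the dummy residue 'X'.
        idx = ALPHABET.index("X")
    vec[idx] = 1
    return vec


def encode_windows(sequence, window=17):
    """Encode every residue of `sequence` as a 21 x W binary pattern.

    Each pattern is centred on one residue; positions beyond the chain
    termini are filled with 'X' (an assumption of this sketch).
    """
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    patterns = []
    for i in range(len(sequence)):
        features = []
        for residue in padded[i:i + window]:
            features.extend(one_hot(residue))
        patterns.append(features)
    return patterns
```

Under these assumptions, a chain of length N yields N patterns, one per central residue, each of length 21 × W (315 for W = 15).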
For the Position Specific Scoring Matrix (PSSM) feature, we generated a PSSM for each sequence with PSI-BLAST and then converted the matrix into SVM-readable patterns of vector dimension 20 × W.
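As a rough illustration of the 20 × W PSSM pattern, the sketch below flattens a sliding window of PSSM rows into one feature vector per residue. It assumes the PSI-BLAST output has already been parsed into a list of rows with 20 scores each; the zero-filled padding at the chain termini and the name `pssm_windows` are assumptions of this sketch, and no score scaling is applied because the text does not specify one.

```python
def pssm_windows(pssm, window=17):
    """Turn a parsed PSSM (one row of 20 scores per residue) into
    one flattened 20 x W feature vector per residue."""
    half = window // 2
    zero_row = [0] * 20  # assumed padding for positions past the termini
    padded = [zero_row] * half + [list(row) for row in pssm] + [zero_row] * half
    patterns = []
    for i in range(len(pssm)):
        features = []
        for row in padded[i:i + window]:
            features.extend(row)
        patterns.append(features)
    return patterns
```

Each returned vector has length 20 × W, with the scores of the central residue in the middle block of 20 values.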
Support Vector Machine & Evaluation Procedures
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.
Machine learning tools have proved useful and successful in the identification of molecular patterns.
Previously, SVMs have been applied successfully to protein structure prediction, B-cell and T-cell epitope prediction, identification of MHC-binding peptides, subcellular localization, etc.
In the present study, a freely downloadable SVM package, svm_light version 6.01, was used to exploit different sequence features such as binary patterns and PSSMs.
To evaluate the predictions we used threshold-dependent measures: sensitivity, specificity and Matthews correlation coefficient (MCC).
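The three measures follow their standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The sketch below shows one way to compute them at a given score threshold; the function names, the default threshold of 0, the labelling of interacting residues with positive values, and the convention of returning 0 when a denominator is zero are assumptions rather than details taken from the study.

```python
import math


def confusion_counts(scores, labels, threshold=0.0):
    """Count TP, FP, TN, FN for scores at a given threshold.

    A residue is predicted as interacting when its score >= threshold.
    Labels > 0 mark true interacting residues (assumed +1/-1 coding).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        actual_positive = label > 0
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive and not actual_positive:
            fp += 1
        elif not predicted_positive and actual_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

Because the counts depend on the threshold, sweeping `threshold` over the range of classifier scores shows how sensitivity and specificity trade off against each other.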
Copyright 2009. All rights reserved.