Algorithm
Dataset
Intein data were obtained from the InBase database. The data comprise intein sequences with their corresponding N- and C-terminal splice sites and cover all three kingdoms of life: Archaea (147), prokaryotes (201) and eukaryotes (89).
From these, we selected a total of 69 experimentally verified inteins, as annotated by InBase.
For intein prediction, these 69 sequences formed the positive dataset. The negative dataset consisted of 600 protein sequences randomly selected from three species, Archaeoglobus fulgidus (Archaea), Neisseria meningitidis (prokaryote) and Drosophila melanogaster (eukaryote), chosen because InBase reports no intein sequence in any of them.
For intein splice site prediction, we collected the N- and C-terminal splice sites as motifs of 16 amino acids, which formed the positive dataset. The negative dataset consisted of 16-amino-acid motifs obtained from the corresponding protein sequences by the sliding window method. All protein sequences were obtained from the Swiss-Prot database.
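The sliding window step can be illustrated with a minimal Python sketch. The text does not state the step size or how windows overlapping a true splice site were handled, so the step of one residue, the exclusion of only the exact annotated start positions, and the function names below are all assumptions made for illustration.

```python
def sliding_windows(sequence, width=16):
    """Return (start, window) pairs for every contiguous window of `width` residues.

    Assumption: the window advances one residue at a time.
    """
    return [(i, sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


def negative_windows(sequence, positive_starts, width=16):
    """Return the windows whose 0-based start is not an annotated splice-site window.

    Assumption: only windows starting exactly at a positive position are excluded.
    """
    excluded = set(positive_starts)
    return [window
            for start, window in sliding_windows(sequence, width)
            if start not in excluded]
```

A sequence shorter than the window width yields no windows, so very short proteins contribute nothing to the negative set under this scheme.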
Support Vector Machine & Evaluation Procedures
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Machine learning tools have proved useful and successful in the identification of molecular patterns.
SVMs have previously been applied successfully to protein structure prediction, B-cell and T-cell epitope prediction, identification of MHC-binding peptides, subcellular localization, etc.
In the present study, a freely downloadable SVM package, SVMlight, was used to exploit different sequence features: amino acid composition, dipeptide composition and binary patterns.
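The three feature types can be sketched as follows. This is an illustrative encoding under common conventions, not necessarily the exact one used in the study: the 20-letter alphabet order, the handling of non-standard residues (ignored), and all function names are assumptions. The last helper writes a vector in the sparse `label index:value` text format that SVMlight reads, with feature indices starting at 1.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """20-dimensional vector: fraction of each residue in the sequence."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: fraction of each ordered residue pair."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = max(len(seq) - 1, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def binary_pattern(window):
    """One-hot encoding: 20 bits per position (320 values for a 16-residue motif)."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return vector


def to_svmlight_line(label, features):
    """Format one example as a sparse SVMlight input line, omitting zero features."""
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)
```

Composition features are length-independent, which suits whole-protein classification, whereas the binary pattern preserves residue positions and is the natural choice for fixed-length splice site motifs.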
To evaluate predictions, we used threshold-dependent measures such as sensitivity and specificity.
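For reference, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), where the counts depend on the score threshold used to call a prediction positive. The short Python sketch below computes both from SVM decision scores; the function name, the +1/-1 label convention and the default threshold of 0 are assumptions chosen for illustration.

```python
def sensitivity_specificity(scores, labels, threshold=0.0):
    """Return (sensitivity, specificity) at the given score threshold.

    labels: +1 for positive examples, -1 for negative examples (assumed convention).
    A score greater than or equal to `threshold` is counted as a positive prediction.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label > 0:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    # Undefined ratios (no positives or no negatives) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Lowering the threshold can only increase sensitivity and decrease specificity, which is why both measures are reported together at a stated threshold.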