Datasets used

Our data are divided into two datasets:
Main dataset –
This includes 1805 positive training sequences and 3593 negative sequences from Swiss-Prot, plus an independent dataset comprising 303 positive sequences and 300 negative sequences from Swiss-Prot. This dataset was used in all analyses.
Alternative dataset –
This comprises the same 1805 positive sequences as the Main dataset and 12541 negative sequences from TrEMBL, plus an independent dataset with the same positive independent sequences as the Main dataset and 1000 negative sequences from TrEMBL.

Amino acid preference analysis

Amino acid composition plot of different therapeutic peptides with toxin peptides

Prediction Approach

In the present study, the SVM classifier from the freely available SVM_light package was used. The package is powerful and user-friendly, and it lets us adjust the parameters and choose among kernel functions such as linear, polynomial, RBF and sigmoid.

Input features for SVM

In this study we used various features as SVM input for the prediction of toxic peptides.

1. Amino Acid Composition: Amino acid composition is the fraction of each amino acid present in a peptide. It yields a vector of 20 values, one per amino acid, which is used as SVM input.
2. Dipeptide Composition: Dipeptide composition is the fraction of each dipeptide (AA, AC, AD and so on). It captures the composition as well as the local order of the residues in the peptide. It yields a vector of 20x20 (400) values.
3. Binary Profile pattern: In a binary profile, each amino acid is represented by a binary vector of 20 values. For a peptide of length n, this yields nx20 binary values, which were used as SVM input (as shown in the figure below).
4. Motif-based profile: We discovered sequence-based motifs in toxin peptides with the help of the MEME-MAST software and used this information for the prediction of query peptide sequences (as shown in the figure below).
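
The first three feature types can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the alphabetical residue order, and the handling of very short peptides are all assumptions made for the example.

```python
from itertools import product

# Assumed ordering of the 20 standard amino acids (illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(peptide):
    """Return 20 fractions, one per standard amino acid."""
    n = len(peptide)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [peptide.count(aa) / n for aa in AMINO_ACIDS]


def dipeptide_composition(peptide):
    """Return 400 fractions, one per ordered pair of adjacent residues."""
    # Overlapping pairs: "AAA" contains the dipeptide "AA" twice.
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]


def binary_profile(peptide):
    """Return n x 20 binary values (one-hot per residue), flattened."""
    profile = []
    for residue in peptide:
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile
```

Because the binary profile grows with peptide length, it only gives a fixed-size SVM input when all peptides (or the windows taken from them) have the same length n.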

Hybrid Method

We observed that a number of motifs are present in the toxic peptides, so we used this motif information for the prediction of toxic peptides. Motifs in toxic peptides were identified with the MEME software, and query sequences were then searched against the toxic peptide motif list with the MAST software. If a hit is found for a peptide, its SVM score is increased by 5, so it is predicted as a toxic peptide irrespective of the SVM threshold. This approach increases the reliability of our prediction method.
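
The score adjustment described above can be sketched as follows. The function names and the default threshold of 0.0 are hypothetical, and the motif hit is passed in as a plain boolean rather than obtained by running MAST.

```python
# Bonus added to the SVM score when a toxic-peptide motif is found (from the text).
MOTIF_BONUS = 5.0


def hybrid_score(svm_score, has_motif_hit):
    """Return the SVM score, raised by the motif bonus when a motif hit exists."""
    return svm_score + MOTIF_BONUS if has_motif_hit else svm_score


def predict_toxic(svm_score, has_motif_hit, threshold=0.0):
    """Predict toxicity from the hybrid score; the threshold is an assumed default."""
    return hybrid_score(svm_score, has_motif_hit) > threshold
```

For example, a peptide with a raw SVM score of -1.5 is predicted as non-toxic on its own, but as toxic once a motif hit raises its score to 3.5. The bonus overrides the threshold only as long as raw scores and thresholds stay well within 5 units of each other, which this sketch assumes.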

Quantitative Matrix

A quantitative matrix was generated on the basis of the contribution of every residue at each position, using the probability or frequency of each amino acid at a particular position. The performance was evaluated using the five-fold cross-validation technique.
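
One simple way to build such a matrix is shown below. The text does not give the exact weighting scheme, so this sketch makes several assumptions: all training peptides have the same length, each cell is the plain frequency of a residue at a position among the toxic peptides, and a query is scored by summing the cells it matches. All names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_quantitative_matrix(peptides):
    """Return one {residue: frequency} dict per position (equal-length peptides)."""
    length = len(peptides[0])
    counts = [{aa: 0 for aa in AMINO_ACIDS} for _ in range(length)]
    for peptide in peptides:
        for pos, residue in enumerate(peptide):
            if residue in counts[pos]:  # skip non-standard residues
                counts[pos][residue] += 1
    total = len(peptides)
    return [{aa: c / total for aa, c in column.items()} for column in counts]


def score_peptide(peptide, matrix):
    """Sum the matrix entries matched by each residue of the query peptide."""
    return sum(matrix[pos].get(residue, 0.0) for pos, residue in enumerate(peptide))
```

A query whose score exceeds a chosen cutoff would then be predicted as toxic; the cutoff itself would have to be tuned during cross-validation.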

Evaluation of Performance

The five-fold cross-validation technique was used. Four sets are used for training and the remaining one is used for testing; the process is repeated five times so that each set is used once for testing. The performance of the different SVM modules was evaluated by calculating accuracy and Matthews correlation coefficient (MCC).
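
The splitting scheme and the two measures can be sketched as below. The interleaved assignment of samples to folds is an assumption (the text does not say how the sets were formed), and the function names are illustrative. Accuracy and MCC use their standard definitions in terms of true/false positives and negatives.

```python
import math


def five_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Assumed scheme: sample i goes to fold i % n_folds.
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than random) to 1 (perfect prediction), which makes it more informative than accuracy when the positive and negative sets differ in size, as they do here.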