Algorithm of imRNA



This page describes the algorithms and methods used to develop models for predicting immunomodulatory RNA sequences.

Dataset

1. Positive dataset: This includes 602 sequences that have been experimentally shown to be immunomodulatory. The majority come from the database and the rest were collected from patents.

2. Negative dataset: This includes 520 sequences.

3. Distribution of sequences for training-testing and validation: In each of the positive and negative datasets, ~80% of the sequences were placed in the Training-Testing dataset used to build prediction models, and the remaining ~20% formed the Independent Dataset on which the prediction models were validated.
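The split described above can be sketched as follows. This is only an illustration: the page does not say how sequences were assigned to each part, so the random shuffle, the fixed seed, the rounding, and the helper name `split_dataset` are all assumptions, and the placeholder identifiers stand in for the real sequences.

```python
import random

def split_dataset(sequences, train_fraction=0.8, seed=0):
    """Shuffle a list of sequences and split it into two parts.

    Assumption: a random ~80/20 split; the original procedure is not documented.
    """
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for the 602 positive and 520 negative sequences.
positives = [f"pos_{i}" for i in range(602)]
negatives = [f"neg_{i}" for i in range(520)]

# Positive and negative sets are split separately, as the text describes.
pos_train, pos_indep = split_dataset(positives)
neg_train, neg_indep = split_dataset(negatives)
print(len(pos_train), len(pos_indep), len(neg_train), len(neg_indep))
```

Splitting each class on its own keeps the positive-to-negative ratio roughly the same in the Training-Testing and Independent datasets.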

Compilation of Datasets

Pic-1


Support Vector Machine based methods

In the present study, the SVM classifier from the freely available SVM_light package was used. This package is powerful and user-friendly, and it allows the parameters and the kernel function (linear, polynomial, RBF or sigmoid) to be adjusted.
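SVM_light reads training and test examples from plain-text files with one example per line, in the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The sketch below shows one way a feature vector could be written in that format. The helper name `to_svmlight_line` and the choice to omit zero-valued features are assumptions made for illustration, not the code used on this server.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, which the sparse SVM_light format permits.
    """
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{value:.6g}"
             for i, value in enumerate(features, start=1)
             if value != 0]
    return " ".join([target] + pairs)

# A positive example with three hypothetical feature values.
example = to_svmlight_line(+1, [12.5, 0.0, 6.25])
print(example)  # +1 1:12.5 3:6.25
```

A file of such lines can then be passed to `svm_learn`, where the `-t` option selects the kernel (0 linear, 1 polynomial, 2 RBF, 3 sigmoid), and the resulting model file to `svm_classify`. The parameter values used for the models on this page are not stated here.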

Evaluation or Performance

The five-fold cross-validation technique was used. The data are divided into five sets; four sets are used for training and the remaining one is used for testing, and the process is repeated five times so that each set is used once for testing. The performance of the different SVM modules was evaluated by calculating accuracy and the Matthews correlation coefficient (MCC).
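As a minimal sketch of this evaluation scheme, the code below builds the five train/test partitions and computes accuracy and MCC from the counts of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). The function names, the interleaved assignment of examples to folds, and returning accuracy as a fraction rather than a percentage are assumptions. The examples are assumed to have been shuffled beforehand.

```python
import math

def five_fold_indices(n_items, n_folds=5):
    """Yield (train_indices, test_indices); each fold serves as the test set once."""
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Each of 10 hypothetical examples appears in exactly one test fold.
for train, test in five_fold_indices(10):
    print(sorted(test), len(train))
```

MCC ranges from -1 to +1 and, unlike accuracy, stays informative when the positive and negative classes differ in size.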

Input features for SVM

In this study, we used various features as SVM input for the prediction of immunomodulatory RNAs.

1. Trinucleotide Composition: Trinucleotide composition is the percentage of each of the 64 possible trinucleotides within a given query sequence. This yields a vector of 64 values, one per trinucleotide, which is used as SVM input.

2. Binary Pattern: This attribute represents the local order of the nucleotides within a sequence. In this notation, A, C, G and T are represented as 1,0,0,0; 0,1,0,0; 0,0,1,0 and 0,0,0,1 respectively. Since the minimum sequence length in the dataset is 17, binary patterns of length 17 taken from the 5' and 3' ends of the dataset sequences were used.
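The two feature types above can be sketched as follows. Several details are assumptions, since the page does not specify them: trinucleotides are counted over overlapping windows, U is treated as T so that RNA sequences match the A/C/G/T notation, and the function names are illustrative. With four values per nucleotide, a window of 17 nucleotides gives 4 x 17 = 68 binary values.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 64 trinucleotides in a fixed order: AAA, AAC, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def normalise(seq):
    """Upper-case the sequence and map U to T (an assumption for RNA input)."""
    return seq.upper().replace("U", "T")

def trinucleotide_composition(seq):
    """Return 64 percentages, one per trinucleotide, over overlapping windows."""
    seq = normalise(seq)
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for window in windows:
        if window in counts:  # windows with other letters are not counted
            counts[window] += 1
    total = len(windows)
    return [100.0 * counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]

def binary_pattern(seq, length=17, end="5"):
    """One-hot encode the first (5') or last (3') `length` nucleotides."""
    seq = normalise(seq)
    if len(seq) < length:
        raise ValueError("sequence is shorter than the requested window")
    window = seq[:length] if end == "5" else seq[-length:]
    return [bit for base in window for bit in ONE_HOT[base]]

query = "ACGUACGUACGUACGUACGU"  # hypothetical 20-nucleotide query
print(len(trinucleotide_composition(query)))   # 64
print(len(binary_pattern(query, end="5")))     # 68
print(len(binary_pattern(query, end="3")))     # 68
```

Fixing the window at 17 nucleotides gives every sequence a binary vector of the same length, which an SVM requires.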