CancerPred

A server for the prediction of Cancer lectins
Help


Sequence Name:

The user can give a name to the sequence submitted for prediction.

E-mail Address:

Enter the e-mail address to which the prediction results should be sent. This field is mandatory if you want to receive the server output by e-mail.

SVM
This method is based on a Support Vector Machine (SVM). Depending on the threshold value the user chooses, the SVM classifies the unknown protein as a cancerlectin or a non-cancerlectin. The default threshold is -0.3. For higher specificity at the cost of sensitivity, specify a higher threshold; for higher sensitivity at the cost of specificity, choose a lower threshold. The outcome therefore depends on the trade-off between sensitivity and specificity.

Amino acid composition calculation
In this method, the percentage composition of all 20 amino acids was calculated and used to derive a weight corresponding to each amino acid. To classify an unknown protein, its amino acid composition is calculated and each value is multiplied by the corresponding weight. The 20 resulting values are summed to give a cumulative score. If the cumulative score is less than -0.3, the protein is classified as a non-cancerlectin; if the score is greater than -0.3, it is predicted to be a cancerlectin.
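The scoring described above can be sketched in Python as follows. This is only an illustration of the composition and weighted-sum steps: the function names are hypothetical, the weights must come from a trained model (none are published on this page), and the server's actual SVM implementation may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in the sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Assumption: non-standard symbols (X, B, Z, ...) are ignored.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def cumulative_score(composition, weights):
    """Multiply each composition value by its weight and sum the products."""
    if len(composition) != len(weights):
        raise ValueError("composition and weights must have the same length")
    return sum(c * w for c, w in zip(composition, weights))


def classify(score, threshold=-0.3):
    """Scores above the threshold are called cancerlectin."""
    return "cancerlectin" if score > threshold else "non-cancerlectin"
```

The same `cumulative_score` and `classify` steps apply unchanged to longer feature vectors, as long as one weight is supplied per feature.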

Dipeptide composition calculation
In this method, the percentage composition of all 400 dipeptides was calculated and used to derive a weight corresponding to each dipeptide. This was done by subtracting the composition data. To classify an unknown protein, its dipeptide composition is calculated and each value is multiplied by the corresponding weight. The 400 resulting values are summed to give a cumulative score. If the cumulative score is less than -0.3, the protein is classified as a non-cancerlectin; if the score is greater than -0.3, it is predicted to be a cancerlectin.
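A minimal Python sketch of the dipeptide composition step is given below. The function name and the choice to skip pairs containing non-standard residues are assumptions made for illustration, not a description of the server's code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs: AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the percentage of each of the 400 dipeptides in the sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing non-standard residues are skipped.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]
```

The resulting 400 values would then be multiplied by their weights and summed to give the cumulative score.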

Split composition calculation:
As the name "split" suggests, here a protein is split into two or four equal parts and the amino acid composition of each part is calculated. The percentage composition of all 40 or 80 split amino acid features (for 2 or 4 parts, respectively) was calculated and used to derive a weight corresponding to each split amino acid feature. To classify an unknown protein, its split amino acid composition is calculated and each value is multiplied by the corresponding weight. The 40 or 80 resulting values are summed to give a cumulative score. If the cumulative score is less than -0.3, the protein is classified as a non-cancerlectin; if the score is greater than -0.3, it is predicted to be a cancerlectin.
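The split composition can be sketched in Python as below. The function names are hypothetical, and the handling of sequences whose length is not divisible by the number of parts (integer boundaries, so parts may differ by one residue) is an assumption; this page does not specify it.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(fragment):
    """Percentage of each standard amino acid within one fragment."""
    n = len(fragment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * fragment.count(aa) / n for aa in AMINO_ACIDS]


def split_composition(sequence, parts=2):
    """Concatenate the compositions of `parts` consecutive fragments.

    Returns 40 values for parts=2 and 80 values for parts=4.
    """
    if parts not in (2, 4):
        raise ValueError("parts must be 2 or 4")
    seq = sequence.upper()
    n = len(seq)
    # Assumption: fragment boundaries are computed with integer division.
    bounds = [i * n // parts for i in range(parts + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(composition(seq[start:end]))
    return features
```

For example, `split_composition("AAAACCCC", parts=2)` gives a 40-value vector in which the first half is entirely alanine and the second half entirely cysteine.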

Position Specific Scoring Matrix
In the present study, an attempt was made to use the Position Specific Scoring Matrix (PSSM) generated by PSI-BLAST as an input feature for training the SVM. The PSI-BLAST search was carried out against the non-redundant data set available at SwissProt, and the sequences found in one round of searching were used to build a score model for the next round. After three iterations with a cut-off E-value of 0.001, PSI-BLAST generated a PSSM having the highest score as a part of the prediction process. The matrix consisted of 21 x M elements, where M is the length of the target sequence, and each element represents the frequency of occurrence of each of the 20 amino acids at one position in the alignment. Next, each element of the matrix (20 x M) was scaled to the range 0-1. Further, in order to make the input a fixed length, these normalized PSSMs (20 x M) were used to generate a 400-dimensional input vector by summing all rows of the PSSM that correspond to the same amino acid in the sequence. Finally, each element of this input vector was divided by the length of the protein sequence. This results in a matrix of 20 x 20 elements, that is, the 400 input features.
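The reduction of an M-row PSSM to a fixed 400-dimensional vector can be sketched in Python as follows. The function name is hypothetical, the PSSM is assumed to have already been parsed into one list of 20 scores per sequence position, and the logistic function used for the 0-1 scaling is an assumption, since this page does not say which scaling was applied.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_400(sequence, pssm):
    """Collapse a PSSM (one row of 20 scores per residue) into 400 features."""
    if len(sequence) == 0 or len(pssm) != len(sequence):
        raise ValueError("PSSM needs one row per residue of a non-empty sequence")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    # summed[a][j]: total scaled score in column j over all positions
    # whose residue in the query sequence is amino acid a.
    summed = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must hold 20 scores")
        if residue not in index:
            continue  # assumption: non-standard residues are skipped
        target = summed[index[residue]]
        for j, score in enumerate(row):
            # Assumed scaling to 0-1: logistic function of the raw score.
            target[j] += 1.0 / (1.0 + math.exp(-score))
    length = len(sequence)
    # Flatten the 20 x 20 table row by row and divide by sequence length.
    return [value / length for row in summed for value in row]
```

Whatever the length of the query protein, the output always has 400 elements, which is what allows it to be used as SVM input.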

PROSITE Domains and PSSM:
PROSITE is a database of families and domains found in various types of proteins. InterProScan is a Perl-based standalone tool that combines different protein signatures into a single platform, and PROSITE is an integral part of InterProScan. In this work, we applied version 4.3 of the iprscan tool to run the PROSITE-based ProfileScan method on the full dataset of cancerlectins and non-cancerlectins. We generated a vector of 414 dimensions containing 400 PSSM features and 14 features for selected PROSITE domains. The 14 selected PROSITE domains were: PS50049, PS50217, PS50287, PS50915, PS51127, PS50927, PS50228, PS50068, PS50092, PS50234, PS50853, PS50948, PS51115 and PS51117. Of these 14 PROSITE domains, seven (PS50049, PS50217, PS50287, PS50915, PS51127, PS50927, PS50228) were specific to cancerlectins, while the other seven (PS50068, PS50092, PS50234, PS50853, PS50948, PS51115, PS51117) were abundant in non-cancerlectins. An SVM-based classifier developed using these 414 features achieved an accuracy of 69.09% with an MCC of 0.38.
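One way to assemble the 414-dimensional vector is sketched below in Python. The domain accessions are the 14 listed above; the function name is hypothetical, and encoding each domain as a binary presence/absence flag is an assumption, since this page does not state how the domain features were encoded.

```python
# The 14 selected PROSITE profiles, in the order listed on this page.
PROSITE_DOMAINS = [
    "PS50049", "PS50217", "PS50287", "PS50915", "PS51127", "PS50927",
    "PS50228", "PS50068", "PS50092", "PS50234", "PS50853", "PS50948",
    "PS51115", "PS51117",
]


def hybrid_vector(pssm_features, domain_hits):
    """Append 14 domain flags to the 400 PSSM features (414 values total).

    pssm_features: the 400 PSSM-derived values for the protein.
    domain_hits:   PROSITE accessions reported for the protein,
                   for example parsed from the scan output.
    """
    if len(pssm_features) != 400:
        raise ValueError("expected 400 PSSM features")
    hits = set(domain_hits)
    # Assumed encoding: 1.0 if the domain was found, 0.0 otherwise.
    domain_features = [1.0 if d in hits else 0.0 for d in PROSITE_DOMAINS]
    return list(pssm_features) + domain_features
```

Accessions outside the 14 selected domains are simply ignored, so the vector length stays fixed at 414.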

SVM threshold:
The prediction threshold is the most important parameter of the prediction. The CancerPred server accepts thresholds in the range -1 to +1 (default = -0.3). If the prediction score of the query sequence is greater than the specified threshold, the sequence is predicted to be a cancerlectin; otherwise it is predicted to be a non-cancerlectin. To obtain predictions with fewer false positives, choose a higher threshold. To obtain predictions with fewer false negatives, choose a lower threshold. In summary, a high threshold gives high specificity and a low threshold gives high sensitivity.
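The effect of the threshold on sensitivity and specificity can be illustrated with a short Python sketch. The function name is hypothetical and the scores in the usage note are made-up values chosen for illustration, not CancerPred output.

```python
def sensitivity_specificity(scores, labels, threshold=-0.3):
    """Compute (sensitivity, specificity) at a given score threshold.

    scores: prediction scores, one per protein.
    labels: True for known cancerlectins, False for non-cancerlectins.
    """
    tp = fn = tn = fp = 0
    for score, is_cancerlectin in zip(scores, labels):
        predicted_positive = score > threshold
        if is_cancerlectin and predicted_positive:
            tp += 1
        elif is_cancerlectin:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

With the illustrative scores `[0.8, 0.1, -0.2, -0.5, -0.9, 0.4]`, of which the first three belong to cancerlectins, the default threshold of -0.3 gives a sensitivity of 1.0 and a specificity of about 0.67, while a threshold of 0.5 gives a sensitivity of about 0.33 and a specificity of 1.0.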

If you still have any questions or suggestions, please contact us.


Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India