
              

Dataset:
            A representative dataset of 608 proteins, consisting of 526 ordered and 82 disordered proteins, was used. This dataset was previously used to develop the POODLE web server (Shimizu et al., 2007). The raw data were retrieved from DisProt (version 3.3) and then processed following an intensive protocol. A second dataset of 417 partially disordered proteins was also used for testing.

Support Vector Machines
            SVMs are universal approximators based on statistical learning and optimization theory. They support both regression and classification tasks and can handle multiple continuous and categorical variables. To construct an optimal hyperplane, an SVM employs an iterative training algorithm that minimizes an error function.
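
For reference, the error function minimized during training is commonly written as the soft-margin objective below. This is the standard textbook formulation, not a statement taken from the original text. The symbols w, b, \xi_i and the trade-off constant C are introduced here for illustration, with y_i in {+1, -1} as the class label and x_i as the fixed-length feature vector of protein i.

```latex
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n
```

A larger C penalizes training errors more heavily, at the cost of a narrower margin.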

            In the present study, SVMlight, a freely downloadable SVM package, was used for the classification of disordered and ordered proteins. The software lets users define a number of parameters and choose among built-in kernel functions, including linear, RBF and polynomial. Machine learning techniques are more successful when the input patterns are of fixed length. Therefore, different prediction approaches based on different protein features, such as amino acid composition, dipeptide composition, PSSM composition and secondary structure (SS) composition, were used to generate fixed-length patterns.
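
The composition features mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the choice of percentages as the scale, the dropping of non-standard residues, and the labels (+1 for disordered, -1 for ordered) are all assumptions. The output line follows SVMlight's sparse input format, `<label> <index>:<value> ...`.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return 20 percentages, one per standard residue (hypothetical helper)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]  # drop non-standard symbols
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 percentages, one per ordered residue pair (hypothetical helper)."""
    # Simplification: non-standard residues are removed before pairs are counted.
    clean = "".join(r for r in seq.upper() if r in AMINO_ACIDS)
    total = len(clean) - 1  # number of overlapping pairs
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[clean[i:i + 2]] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def to_svmlight_line(label, features):
    """Format one example in SVMlight's sparse format with 1-based feature indices."""
    pairs = " ".join(
        f"{i}:{v:.4f}" for i, v in enumerate(features, start=1) if v != 0
    )
    return f"{label:+d} {pairs}"


if __name__ == "__main__":
    seq = "MKKLLPTAAAGLLLLAAQPAMA"  # made-up sequence, for illustration only
    features = amino_acid_composition(seq) + dipeptide_composition(seq)
    print(len(features))  # 420 values regardless of sequence length
    print(to_svmlight_line(+1, features)[:60])
```

Whatever the protein length, each sequence maps to a vector of the same size (20, 400, or 420 if both are concatenated), which is what makes the patterns usable as SVM input.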

Position Specific Scoring Matrices

Position-specific iterated BLAST (PSI-BLAST) is a feature of BLAST 2.0 in which a profile, or position-specific scoring matrix (PSSM), is constructed automatically from a multiple alignment of the highest-scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment: highly conserved positions receive high scores, and weakly conserved positions receive scores near zero. The profile is then used to perform a second BLAST search, and the results of each iteration are used to refine the profile in subsequent rounds. This iterative search strategy increases sensitivity.
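
To make the scoring idea concrete, the sketch below builds a toy PSSM from a small gap-free alignment using log-odds scores. It is a deliberate simplification and not the PSI-BLAST algorithm: the uniform background frequency, the fixed pseudocount, and the function name `toy_pssm` are assumptions made here, whereas PSI-BLAST itself uses sequence weighting, data-dependent pseudocounts and empirical background frequencies.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length, gap-free sequences and a uniform background
    (1/20 per residue); both are simplifications for illustration.
    """
    n_cols = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(n_cols):
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom  # smoothed frequency
            row[aa] = math.log2(freq / background)  # log-odds vs. background
        pssm.append(row)
    return pssm


if __name__ == "__main__":
    # Made-up alignment: columns 1-2 are fully conserved, columns 3-4 are variable.
    alignment = ["ACDE", "ACGH", "ACKL", "ACMN"]
    pssm = toy_pssm(alignment)
    print("conserved column, A:", round(pssm[0]["A"], 2))
    print("variable column,  D:", round(pssm[2]["D"], 2))
```

In this toy example the conserved residue scores clearly higher than any residue in a variable column, and residues never observed in a column score slightly below zero.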

Performance:

            To assess the performance of the methods, we used several standard parameters, briefly described below.
(i) The sensitivity, or percent coverage of disordered proteins, is the percentage of disordered proteins correctly predicted as disordered: Sensitivity = TP / (TP + FN) × 100.
(ii) The specificity, or percent coverage of globular (ordered) proteins, is the percentage of ordered proteins correctly predicted as ordered: Specificity = TN / (TN + FP) × 100.
(iii) The accuracy is the percentage of all proteins that are correctly predicted: Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100.
(iv) The Matthews correlation coefficient (MCC) ranges from -1 to +1, with +1 indicating perfect prediction: MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
Here TP and TN are the correctly predicted positive (disordered) and negative (ordered) proteins, respectively. FP and FN are the wrongly predicted disordered and ordered proteins, respectively.
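
The four measures follow the standard confusion-matrix definitions and can be computed as in the sketch below. The function name, the returned dictionary keys and the convention of returning 0 when a denominator is zero are choices made here for illustration. The counts in the usage example are made up and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy (in %) plus MCC from raw counts.

    tp/fn: disordered proteins predicted as disordered/ordered.
    tn/fp: ordered proteins predicted as ordered/disordered.
    """
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    specificity = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention if undefined
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=8, tn=85, fp=5, fn=2).items():
        print(f"{name}: {value:.3f}")
```

Because the dataset is imbalanced (far more ordered than disordered proteins), accuracy alone can look high even for a poor predictor. MCC accounts for all four counts and is therefore the more informative single number.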