STEPS TO RUN A PREDICTION JOB ON GlycoPP
Input your sequence: GlycoPP accepts sequences in two ways: paste them directly into the text box, or upload a text file using the 'BROWSE' option. Before uploading, please ensure that the sequences are in FASTA format (Example Sequence) and use the single-letter amino acid code. A single submission may contain no more than 500 sequences.
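The input requirements above can be checked locally before submitting. The following is a minimal sketch, not the server's own validation: the function names are hypothetical, and it assumes that only the 20 standard one-letter codes are acceptable in a submitted sequence.

```python
MAX_SEQUENCES = 500  # submission limit stated in the instructions
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard codes only


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Parse the text and raise ValueError if it breaks a stated requirement."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences exceed the limit of {MAX_SEQUENCES}")
    for header, sequence in records:
        unexpected = sorted(set(sequence) - VALID_RESIDUES)
        if unexpected:
            raise ValueError(f"{header}: unexpected characters {unexpected}")
    return records
```

Calling check_submission on the contents of the text file returns the parsed records, or raises an error that names the first problem found.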
Select prediction programme: During training, N-glycosites were best predicted by BPP-based methods, whereas O-glycosites were best predicted by the PPP method on balanced datasets. The CPP method performed better for O-glycosite prediction on realistic datasets (where the unglycosylated residues in a sequence far outnumber the glycosylated ones). The user can select any of the following prediction approaches:
Binary profile of patterns (BPP): In this approach, sequence patterns with a fixed length of 21 residues are converted into binary form. Each residue of a pattern is represented by a vector of dimension 21 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), with one position for each of the 20 amino acids and one for the dummy amino acid "X".
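The binary encoding can be illustrated with a short sketch. The residue order beyond Ala and Cys is an assumption (alphabetical by one-letter code, with "X" last), as is the use of "X" to pad windows that run past the sequence termini and to stand in for any unrecognised character; the function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids + dummy "X" (order assumed)
WINDOW = 21  # pattern length stated in the text


def window_around(sequence, center, window=WINDOW):
    """Return the pattern centred on a 0-based position, padded with 'X'."""
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    return padded[center:center + window]


def binary_profile(pattern):
    """Encode a 21-residue pattern as a flat 21 x 21 = 441-element 0/1 vector."""
    if len(pattern) != WINDOW:
        raise ValueError(f"pattern must have {WINDOW} residues")
    features = []
    for residue in pattern:
        one_hot = [0] * len(ALPHABET)
        # Assumption: characters outside the alphabet are treated as "X".
        index = ALPHABET.index(residue if residue in ALPHABET else "X")
        one_hot[index] = 1
        features.extend(one_hot)
    return features
```

For a candidate site at 0-based position p, binary_profile(window_around(sequence, p)) yields the feature vector for that site under these assumptions.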
Composition profile of patterns (CPP): The composition profile of a pattern is the percentage frequency of each amino acid in a fixed-length sequence pattern. The percent composition of each of the 20 natural amino acids in a fixed-length sequence pattern was calculated using the following equation:
$$\mathrm{comp}(i) = \frac{R_i}{N} \times 100$$
where comp(i) is the percent composition of amino acid residues of type i, R_i is the number of residues of type i, and N is the total number of residues in the fixed-length sequence pattern.
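As a worked illustration of the percent composition defined above, the sketch below counts each of the 20 amino acids in a pattern and divides by the pattern length. The function name and the output order (alphabetical by one-letter code) are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # output order is an assumption


def composition_profile(pattern):
    """Return 20 values: 100 * (count of residue type i) / (pattern length)."""
    total = len(pattern)
    if total == 0:
        raise ValueError("pattern must not be empty")
    return [100.0 * pattern.count(residue) / total for residue in AMINO_ACIDS]
```

For the four-residue pattern "AACD", this gives 50 for Ala, 25 each for Cys and Asp, and 0 for the remaining residue types.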
PSSM profile of patterns (PPP): This feature is also called evolutionary information. Multiple sequence alignment information, in the form of a position-specific scoring matrix (PSSM), was used as the input feature for this learning model. Each target sequence was searched against Swiss-Prot with the PSI-BLAST program to generate its alignment profile, or PSSM. Three iterations of PSI-BLAST were run for each protein with an e-value cut-off of 0.001 to generate the profile matrices. The PSSM contains the probability of occurrence of each type of amino acid at each residue position of the protein sequence. Finally, the rows of the full-length PSSM that correspond to a fixed-length sequence pattern are extracted; this sub-matrix is called the PSSM profile of patterns.
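The final extraction step can be sketched as follows, assuming the full-length PSSM is already available as a list of rows (one row of 20 values per residue). The window length of 21 is borrowed from the BPP description, and filling positions beyond the sequence termini with zero rows is an assumption; neither is stated for PPP, and the function name is hypothetical.

```python
def pssm_profile(pssm, center, window=21):
    """Flatten the PSSM rows of the window centred on a 0-based position."""
    if not pssm:
        raise ValueError("PSSM must contain at least one row")
    half = window // 2
    zero_row = [0.0] * len(pssm[0])  # assumption: zero padding past the termini
    features = []
    for position in range(center - half, center + half + 1):
        row = pssm[position] if 0 <= position < len(pssm) else zero_row
        features.extend(row)
    return features
```

With 20 columns per row and a 21-residue window, each candidate site is described by 420 values.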
Hybrid Approaches: In view of the current understanding that glycosylation occurs on folded proteins in prokaryotes, we also provide hybrid models:
a) BPP+ASA: Predicted Average Surface Accessibility (ASA) information is used together with the binary profile of patterns (BPP) as input features.
b) PPP+ASA: Predicted Average Surface Accessibility (ASA) information is used in combination with the PSSM profile of patterns (PPP) as input features.
In our analysis, the prediction results of the hybrid approaches were slightly better than those of the solo models described above; however, the hybrid approaches take longer in the learning process.
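One plausible way to build a hybrid input is to append the predicted ASA values for the window to the BPP or PPP feature vector. The text does not say how ASA is encoded, so the sketch below rests on an assumption: one predicted ASA value per window position, concatenated after the profile features. The function name is hypothetical.

```python
def hybrid_features(profile_features, asa_window):
    """Concatenate a BPP or PPP feature vector with per-residue ASA values."""
    # Assumption: ASA enters the model as one number per window position.
    return list(profile_features) + [float(value) for value in asa_window]
```

Under this assumption, a 441-element BPP vector for a 21-residue pattern would grow to 462 elements.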
Select SVM threshold: The user may choose among various thresholds to customize the run. The threshold controls the trade-off between false positive and false negative predictions. For high-confidence predictions (a lower probability of false positives), choose a high threshold. At a low threshold, the probability of false negative predictions is very low. At the default threshold (0.0), the rates of false negative and false positive predictions were nearly equal during training of the SVM model.
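The effect of the threshold can be illustrated with made-up numbers. In the sketch below, a site is reported when its SVM score is at or above the threshold; whether the server uses "at or above" or "strictly above" is an assumption, and the positions and scores are invented for the example.

```python
def call_sites(scored_sites, threshold=0.0):
    """Keep the (position, score) pairs whose score reaches the threshold."""
    return [(position, score) for position, score in scored_sites
            if score >= threshold]


# Invented (position, SVM score) pairs, for illustration only.
scores = [(12, 1.35), (47, 0.20), (88, -0.10), (140, -0.85)]

default_calls = call_sites(scores)                   # threshold 0.0
strict_calls = call_sites(scores, threshold=1.0)     # fewer, higher-confidence calls
lenient_calls = call_sites(scores, threshold=-0.5)   # more calls, fewer missed sites
```

Raising the threshold shrinks the list of reported sites, and lowering it lengthens the list, which matches the trade-off described above.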
Submit the job: Click on the "Submit" button. The status of your job (either 'queued' or 'running') is displayed and updated continuously until the job finishes, at which point the server produces a tabular output with the predictions and their corresponding SVM scores.
EXAMPLE OUTPUT FOR EXAMPLE SEQUENCE:
No of sequences:1
Threshold Selected: 0.0
>Protein Length = 272
Potential N-Linked Glycosylated Sites: