Supplementary information for EGPRED

Supplementary Information

INTRODUCTION

Burset and Guigo (1996) suggested that combining the outputs of several programs may be beneficial. Different ab initio programs capture different aspects of coding DNA when predicting exons in genomic sequences. Though the coding-level accuracy of recent programs is very high (Rogic et al., 2001), gene-level accuracy is very poor, with fewer than 50% of predicted genes corresponding exactly to actual genes. This gap gives an advantage to combining predictions: a program that is usually the worse predictor can, in a given case, predict successfully where other programs fail.

However, simple combinations are not effective. The number of exons detected by all programs is considerably smaller than the number detected by any single program, so keeping only those exons increases specificity but decreases sensitivity (coverage). Taking exons predicted by at least one program increases sensitivity but decreases specificity.
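The trade-off between the two simple combinations can be illustrated with a minimal sketch. It assumes that exons are represented as (start, end) tuples and that an exon counts as correct only when both boundaries match exactly. The function names and the toy coordinates are invented for illustration and are not taken from any of the cited programs.

```python
def combine_and(*predictions):
    """Keep only exons (start, end) reported identically by every program."""
    return sorted(set.intersection(*(set(p) for p in predictions)))


def combine_or(*predictions):
    """Keep every exon reported by at least one program."""
    return sorted(set.union(*(set(p) for p in predictions)))


def exon_sn_sp(predicted, actual):
    """Exon-level sensitivity and specificity, counting exact matches only."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    sensitivity = correct / len(actual) if actual else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy coordinates, invented for illustration only.
actual = [(100, 200), (300, 400), (500, 600)]
program_a = [(100, 200), (300, 400), (700, 800)]
program_b = [(100, 200), (500, 600), (900, 950)]

and_exons = combine_and(program_a, program_b)
or_exons = combine_or(program_a, program_b)
print("AND:", and_exons, exon_sn_sp(and_exons, actual))
print("OR: ", or_exons, exon_sn_sp(or_exons, actual))
```

In this toy case the intersection keeps a single correct exon (high specificity, low sensitivity), while the union recovers all three actual exons at the cost of two false positives.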

Based on this finding, Murakami and Takagi (1998) developed five combinatorial methods for gene prediction: the AND-method, OR-method, HIGHEST-method, RULE-method and BOUNDARY-method. Further improvement in prediction accuracy was achieved more recently by Rogic et al. (2002), who developed three new strategies for combining ab initio predictions: the Exon Union-Intersection (EUI), Exon Union-Intersection with Reading Frame Consistency (EUI-frame), and Gene Intersection (GI) methods. These methods gave a significant improvement in prediction accuracy.

Based on these findings, there is still room for further improvement. However, any attempt to increase the average accuracy should not lose the advantage gained by the earlier EUI, EUI-frame and GI methods. Combinatorial approaches that use both homology search and pattern recognition were proposed by Snyder and Stormo (1995), Kulp et al. (1997), Xu and Uberbacher (1997) and Solovyev and Salamov (1997).

Similarity searches against protein databases can recover the known protein-coding regions in a genomic sequence. This method is especially helpful where homologs are present in the database and ab initio predictors fail to predict these exons. However, similarity searches are not without their own faults: pseudogenes and certain non-coding DNA can sometimes be predicted as exons. The advantage lies in the correct identification of gene boundaries where known homologs are present in the database.

We decided to use a similarity search against an intron database to reduce the number of false positives generated by other programs. This is a delicate approach, since introns are known to have no homology among themselves. It is known, however, that exons show little or no similarity to introns.

It follows that a similarity search against an intron database should not affect sensitivity significantly, because exons and introns are not similar. However, any known intronic sequences predicted as exons by the different programs can be modified or removed, thereby improving the prediction accuracy.
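One way to picture how an intron match could modify or remove a predicted exon is interval subtraction. The sketch below is an assumption made for illustration, not the EGPRED implementation: it treats exons and intron HSPs as inclusive (start, end) coordinates on the query sequence, and the `trim_exon` function and its `min_len` parameter are hypothetical.

```python
def trim_exon(exon, intron_hsps, min_len=1):
    """Subtract intron-matching intervals from one exon (inclusive coordinates).

    Returns the remaining exon pieces; an empty list means the exon is removed.
    """
    start, end = exon
    pieces = []
    cursor = start  # first position not yet assigned to a piece or an HSP
    for h_start, h_end in sorted(intron_hsps):
        if h_end < cursor or h_start > end:
            continue  # HSP lies outside what remains of the exon
        if h_start > cursor:
            pieces.append((cursor, h_start - 1))  # keep the part before the HSP
        cursor = max(cursor, h_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))  # keep the part after the last HSP
    # Drop fragments shorter than the (hypothetical) minimum length.
    return [(s, e) for s, e in pieces if e - s + 1 >= min_len]


print(trim_exon((100, 200), [(180, 250)]))  # HSP overlaps the 3' end
print(trim_exon((100, 200), [(90, 210)]))   # HSP covers the whole exon
print(trim_exon((100, 200), []))            # no intron match
```

An exon with no intron match is returned unchanged, which is why sensitivity is expected to be largely unaffected.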

Here we have tested our two-way strategy on a non-homologous dataset, HMR195 (Rogic et al., 2002).

TWO-WAY SIMILARITY SEARCH STRATEGY: CONCEPT

The initial step is to predict gene structure using the Genscan and HMMgene programs with respect to an organism model.
Combinatorial methods by Rogic et al. (2002) are used to combine the ab initio predictions.
The BLASTX program is used to search against a protein database. Predicted HSPs are filtered using a criterion of 85% identity. The boundaries of HSPs are taken as exon boundaries.
The BLASTX results are combined with exons from Rogic’s combinatorial methods using a simple OR-based technique. Preference is given to ab initio predictions when assigning the boundaries of overlapping exons.
All the predicted combination exons are searched against the intron database using the BLASTN program, and the HSPs are used to parse the exons to obtain possibly more accurate exon boundaries.
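The HSP filtering and OR-based combination steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EGPRED code: HSPs are assumed to be (start, end, percent_identity) tuples already mapped to query coordinates, the 85% criterion is read as "at least 85%", and the function names are hypothetical.

```python
def filter_hsps(hsps, min_identity=85.0):
    """Keep HSPs at or above the identity cutoff; return their boundaries."""
    return [(start, end) for start, end, identity in hsps if identity >= min_identity]


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def or_combine(ab_initio_exons, hsp_exons):
    """Union of both exon sets; overlapping exons keep the ab initio boundaries."""
    combined = list(ab_initio_exons)
    for hsp in hsp_exons:
        # An HSP is added only when no ab initio exon already covers that region.
        if not any(overlaps(hsp, exon) for exon in ab_initio_exons):
            combined.append(hsp)
    return sorted(combined)


# Toy data, invented for illustration only.
blastx_hsps = [(50, 90, 92.0), (150, 260, 88.5), (400, 450, 60.0)]
ab_initio = [(140, 250), (300, 350)]

hsp_exons = filter_hsps(blastx_hsps)
print(or_combine(ab_initio, hsp_exons))
```

Because non-overlapping HSPs enter the result with their own boundaries, exon-level accuracy depends on how closely HSP ends coincide with true splice sites.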

EVALUATION OF COMBINATION BASED GENE FINDING PROGRAMS

Dataset and Perl scripts for EUI, EUI-frame and GI methods:
The HMR195 dataset was used for evaluation of the programs. It was obtained from the site http://www.cs.ubc.ca/labs/beta/genefinding/. Perl scripts implementing the combination methods of Sanja Rogic were also downloaded from the same site.
GENSCAN developed by Burge and Karlin, 1997
Genscan uses a number of different algorithms to predict introns, exons (leading, internal, and terminal), donor and acceptor splice sites, and polyadenylation sites. The highest-scoring arrangement of these categories is then used to predict the actual gene composition. Genscan supports polygenic genomic DNA input and scans both strands for coding regions. Genscan has an additional feature that draws a GIF representation of the resulting prediction, showing all putative exons in their respective positions on either strand, whether they are leading, internal or terminal, and a simplified scoring scheme. Genscan was developed for use on vertebrate sequences, but has been used with some success on maize, Arabidopsis and Drosophila.
HMMGENE developed by Krogh, 1997
HMMgene is a program for prediction of genes in anonymous DNA. The program predicts whole genes, so the predicted exons always splice correctly. It can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences. HMMgene can also be used to predict splice sites and start/stop codons. The program is based on a hidden Markov model, which is a probabilistic model of the gene structure. This means that all predictions have associated probabilities that reflect how confident the program is in them. Apart from reporting the best prediction, HMMgene can also report the N best gene predictions for a sequence. This is useful if there are several equally likely gene structures, and may even indicate alternative splicing. The output is a prediction of partial or complete genes in the sequences.
Databases for BLASTX and BLASTN
The SWISS-PROT, NON-REDUNDANT protein and Expressed Sequence Tags (ESTs) databases were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nin.gov/blast/db/) as part of the genome-wide similarity search web servers GWBLAST (http://webs.iiitd.edu.in/raghava/gwblast/) and GWFASTA (http://webs.iiitd.edu.in/raghava/gwfasta/).
The Intron database was downloaded from http://intron.bic.nus.edu.sg/introndb/introndb.html.

Sequences belonging to the HMR195 dataset were removed from all the databases so that the combination method would not gain an advantage over ab initio methods by using these sequences.

RESULTS FROM THE EVALUATION

Abbreviations: Sen, Sensitivity; Spec, Specificity; AC, Approximate Correlation; CC, Correlation Coefficient; ESn, Exon Sensitivity; ESp, Exon Specificity; MG, Missed Genes; GS, Genscan; HMM, HMMgene; EUI, Exon Union-Intersection; EUI-frame, Exon Union-Intersection with Reading Frame Consistency; GI, Gene Intersection.
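For reference, the nucleotide-level measures abbreviated above are conventionally computed from counts of true and false positives and negatives, following the definitions of Burset and Guigo (1996). The sketch below assumes those standard definitions (in gene finding, Specificity is TP / (TP + FP)) and assumes that every denominator is non-zero. The function name and the example counts are invented for illustration.

```python
import math


def nucleotide_metrics(tp, tn, fp, fn):
    """Nucleotide-level Sen, Spec, CC and AC from confusion-matrix counts.

    Assumes all four marginal totals are non-zero.
    """
    sen = tp / (tp + fn)   # fraction of coding nucleotides that are predicted
    spec = tp / (tp + fp)  # fraction of predicted nucleotides that are coding
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    # Average conditional probability, rescaled to the range [-1, 1].
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2
    return {"Sen": sen, "Spec": spec, "CC": cc, "AC": ac}


# Example counts, invented for illustration only.
print(nucleotide_metrics(tp=80, tn=900, fp=20, fn=20))
```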

Initially, all the programs were tested individually for their performance on the HMR195 dataset. The BLASTX program was used against the Non-Redundant and SWISS-PROT protein databases, while BLASTN was used against the Expressed Sequence Tags (ESTs) database. Only the forward-strand predictions were considered for performance evaluation. The results are shown in Table 1.

The ab initio programs were better predictors at both the nucleotide level and the exon level. Among similarity search programs, BLASTX against SWISS-PROT was marginally less accurate than BLASTX against the Non-Redundant (NR) protein database. At the exon level, however, SWISS-PROT was a much better option than NR. BLASTN against EST was even less accurate than BLASTX against proteins at both the nucleotide and exon levels. A more striking aspect is the higher number of missed genes that results from similarity searches.

One conclusion that can be drawn from this result is that ab initio predictors predict exon boundaries more accurately than similarity searches.

Since BLASTX against SWISS-PROT gave nearly the same accuracy as BLASTX against NR, and also had better exon-level accuracy than BLASTX against NR, we decided to use SWISS-PROT for further evaluation. Since BLASTN against EST had a very low probability of correct prediction (specificity), further evaluation using EST was avoided.

In our next step, BLASTN against the intron database was used to modify predictions from all ab initio programs. Results are shown in Table 2. For all the ab initio programs there was an ~2% increase in Specificity without any effect on Sensitivity. The exception was SWISS-PROT, where a marginal increase in Sensitivity was noted, together with an increase in the number of actual genes that were missed.

In our next phase, we evaluated the performance when results from BLASTX against SWISS-PROT were integrated with ab initio programs. The results from this evaluation are shown in Table 3. All combinations showed a ~3-4% increase in Sensitivity (coverage). Concurrently, a general ~2% decrease in Specificity was noted for all programs. One interesting aspect was that the number of missed genes decreased considerably for all combinations. The maximum benefit was seen in the combination with the Gene Intersection method, where the number of missed genes decreased from 15 to 5. This shows that combining ab initio prediction with similarity search against protein databases can improve prediction accuracy.

In our last step, we used the results from BLASTN against the intron database to modify results from the combination of ab initio programs and BLASTX against the SWISS-PROT database. Results are shown in Table 4. The specificity of all combinations increased by ~4-5% compared with the combination of just SWISS-PROT and ab initio programs. One drawback noted here is a significant decrease in exon-level specificity. This is because the combinations are made through a simple OR-based method. In this method preference is given to all ab initio predictions. If similarity searches give predictions other than those made by the ab initio programs, they are included in the combination. However, as seen in Table 1, similarity searches do not give correct exon boundaries. Here the exon boundaries are just the HSP (High-scoring Segment Pair) boundaries.

The advantage of this strategy lies in a significant increase in the number of genes predicted, without a decrease in the probability of correct prediction (Specificity) at the nucleotide level, as seen in Table 4.