| Supplementary Information |
Burset and Guigo (1996) suggested that it may be beneficial to combine the outputs of several programs. Different ab initio programs capture different aspects of coding DNA when predicting exons in genomic sequences. Although the coding-level accuracy of recent programs is very high (Rogic et al., 2001), the gene-level accuracy is poor, with less than 50% of predicted genes corresponding exactly to actual genes. This gap makes combining predictions advantageous: a program that is usually the worse predictor can still predict correctly in a given case where other programs fail.
However, simple combinations are not effective. The number of exons detected by all programs is considerably smaller than the number detected by any single program, so keeping only these exons increases specificity but decreases sensitivity (coverage). Conversely, taking exons predicted by at least one program increases sensitivity but decreases specificity.
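The trade-off between the two simple combinations can be illustrated with a minimal sketch. The coordinates below are invented toy data, not taken from HMR195, and the function names (`and_combine`, `or_combine`, `exon_sn_sp`) are hypothetical. An exon is counted as correct only when both boundaries match exactly.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def and_combine(a: List[Exon], b: List[Exon]) -> Set[Exon]:
    """Keep only exons predicted identically by both programs."""
    return set(a) & set(b)


def or_combine(a: List[Exon], b: List[Exon]) -> Set[Exon]:
    """Keep every exon predicted by at least one program."""
    return set(a) | set(b)


def exon_sn_sp(predicted: Set[Exon], actual: Set[Exon]) -> Tuple[float, float]:
    """Exon-level sensitivity and specificity with exact boundary matching."""
    correct = len(predicted & actual)
    sn = correct / len(actual) if actual else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


# Toy example: four real exons, two programs with partly different predictions.
actual = {(100, 200), (300, 400), (500, 600), (700, 800)}
prog_a = [(100, 200), (300, 400), (900, 950)]
prog_b = [(100, 200), (500, 600), (1000, 1100)]

and_set = and_combine(prog_a, prog_b)
or_set = or_combine(prog_a, prog_b)

print("AND:", exon_sn_sp(and_set, actual))  # high specificity, low sensitivity
print("OR: ", exon_sn_sp(or_set, actual))   # higher sensitivity, lower specificity
```

In this toy case the AND combination keeps a single, correct exon, while the OR combination recovers three of the four real exons at the cost of two false positives.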
Based on this finding, Murakami and Takagi (1998) developed five combinatorial methods for gene prediction: the AND-method, OR-method, HIGHEST-method, RULE-method and BOUNDARY-method. Further improvement in prediction accuracy was achieved by Rogic et al. (2002), who developed three new strategies for combining ab initio predictions: the Exon Union-Intersection (EUI), Exon Union-Intersection with Reading Frame Consistency (EUI-frame), and Gene Intersection (GI) methods. These methods gave a significant improvement in prediction accuracy.
These findings leave room for further improvement. However, any attempt to increase the average accuracy should not lose the advantage gained by the EUI, EUI-frame and GI methods.
Combinatorial approaches that use both homology search and pattern recognition were proposed by Snyder and Stormo (1995), Kulp et al. (1997), Xu and Uberbacher (1997) and Solovyev and Salamov (1997).
Similarity searches against protein databases can recover known protein-coding regions from a genomic sequence. This approach is especially helpful where homologs are present in the database and ab initio predictors fail to predict the corresponding exons. However, similarity searches have faults of their own: pseudogenes and certain non-coding DNA can sometimes be reported as exons. Their advantage lies in the correct identification of gene boundaries where known homologs are present in the database.
We decided to use a similarity search against an intron database to reduce the number of false positives generated by other programs. This is a delicate approach, since introns are known to share no homology with one another. It is also known, however, that exons show little or no similarity to introns.
It follows that a similarity search against an intron database should not affect sensitivity significantly, because true exons will not resemble introns. However, any known intronic sequence predicted as an exon by the different programs can be modified or removed, thereby improving prediction accuracy.
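The idea of modifying or removing predicted exons that match known introns can be sketched as interval subtraction. This is only an illustration under stated assumptions: the intron hits are assumed to be already parsed into inclusive coordinates on the query sequence, the trimming rule and the `min_len` cut-off are hypothetical choices, and the names `subtract_hits` and `filter_with_intron_hits` are invented for this example. The text above does not specify how exons were modified.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def subtract_hits(exon: Interval, hits: List[Interval]) -> List[Interval]:
    """Return the parts of a predicted exon not covered by any intron hit."""
    start, end = exon
    pieces: List[Interval] = []
    cursor = start  # first position not yet accounted for
    for h_start, h_end in sorted(hits):
        if h_end < cursor or h_start > end:
            continue  # hit does not touch the remaining part of the exon
        if h_start > cursor:
            pieces.append((cursor, h_start - 1))
        cursor = max(cursor, h_end + 1)
        if cursor > end:
            break
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces


def filter_with_intron_hits(
    exons: List[Interval], hits: List[Interval], min_len: int = 1
) -> List[Interval]:
    """Trim predicted exons by intron hits; drop pieces shorter than min_len."""
    kept: List[Interval] = []
    for exon in exons:
        for s, e in subtract_hits(exon, hits):
            if e - s + 1 >= min_len:
                kept.append((s, e))
    return kept


# Toy example: one exon is trimmed, one is removed, one is untouched.
predicted = [(100, 200), (300, 400), (500, 600)]
intron_hits = [(150, 250), (290, 410)]
filtered = filter_with_intron_hits(predicted, intron_hits)
print(filtered)
```

Trimming in this way ignores reading-frame consistency, so a real pipeline would likely need an additional check before accepting a shortened exon.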
Here we tested our two-way strategy on a non-homologous dataset, HMR195 (Rogic et al., 2002).
Abbrv: Sen Sensitivity; Spec Specificity; AC Approximate Correlation; CC Correlation Coefficient; ESn Exon Sensitivity; ESp Exon Specificity; MG Missed Genes; GS Genscan; HMM HMMgene; EUI Exon-Union Intersection; EUI-frame Exon-Union Intersection with Reading Frame consistency; GI Gene Intersection;
Initially, all programs were tested individually for their performance on the HMR195 dataset. BLASTX was run against the Non-Redundant (NR) and SWISS-PROT protein databases, while BLASTN was run against the Expressed Sequence Tag (EST) database. Only forward-strand predictions were considered for performance evaluation. The results are shown in Table 1.
The ab initio programs were the better predictors at both the nucleotide level and the exon level. Among the similarity search programs, BLASTX against SWISS-PROT was marginally less accurate than BLASTX against the NR protein database at the nucleotide level. At the exon level, however, SWISS-PROT was a much better option than NR. BLASTN against EST was less accurate than BLASTX against proteins at both the nucleotide and the exon level. A more striking aspect is the higher number of missed genes that results from similarity searches.
One conclusion that can be drawn from this result is that ab initio predictors predict exon boundaries more accurately than similarity searches.
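For reference, the nucleotide-level measures abbreviated as Sen, Spec, AC and CC can be computed as in the sketch below, which follows the standard definitions of Burset and Guigo (1996). Whether the tables were produced with exactly these conventions is an assumption; the function names and the toy coordinates are hypothetical. Note that specificity here is TP / (TP + FP), as is customary in gene prediction.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def covered_positions(intervals: List[Interval]) -> Set[int]:
    """Expand intervals into the set of nucleotide positions they cover."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_metrics(
    predicted: List[Interval], actual: List[Interval], seq_len: int
) -> Dict[str, Optional[float]]:
    """Nucleotide-level Sn, Sp, CC and AC; None where a measure is undefined."""
    pred = covered_positions(predicted)
    act = covered_positions(actual)
    tp = len(pred & act)           # coding, predicted coding
    fp = len(pred - act)           # non-coding, predicted coding
    fn = len(act - pred)           # coding, predicted non-coding
    tn = seq_len - tp - fp - fn    # non-coding, predicted non-coding

    def ratio(num: float, den: float) -> Optional[float]:
        return num / den if den else None

    sn = ratio(tp, tp + fn)
    sp = ratio(tp, tp + fp)
    cc = ratio(
        tp * tn - fn * fp,
        math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    )
    # AC is the average conditional probability rescaled to the range [-1, 1].
    terms = [sn, sp, ratio(tn, tn + fp), ratio(tn, tn + fn)]
    ac = None if None in terms else (sum(terms) / 4 - 0.5) * 2
    return {"sn": sn, "sp": sp, "cc": cc, "ac": ac}


# Toy example: a 1000 nt sequence with one 100 nt exon, predicted 20 nt off.
metrics = nucleotide_metrics([(121, 220)], [(101, 200)], seq_len=1000)
print(metrics)
```

In this toy case both sensitivity and specificity are 0.8, since 80 of the 100 real coding nucleotides are predicted and 80 of the 100 predicted nucleotides are real.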
Because BLASTX against SWISS-PROT gave nearly the same nucleotide-level accuracy as BLASTX against NR, and better accuracy at the exon level, we decided to use SWISS-PROT for further evaluation. Because BLASTN against EST had a very low probability of correct prediction (specificity), EST was not evaluated further.
In the next step, BLASTN against the intron database was used to modify the predictions of all ab initio programs. Results are shown in Table 2. For all ab initio programs there was a ~2% increase in specificity without any effect on sensitivity. The exception was SWISS-PROT, where a marginal increase in sensitivity was noted together with an increase in the number of actual genes that were missed.
In the next phase, we evaluated performance when the results of BLASTX against SWISS-PROT were integrated with those of the ab initio programs. The results of this evaluation are shown in Table 3. All combinations showed a ~3-4% increase in sensitivity (coverage). Concurrently, a general ~2% decrease in specificity was noted for all programs. Notably, the number of missed genes decreased considerably for all combinations. The maximum benefit was seen with the Gene Intersection method, where the number of missed genes decreased from 15 to 5. This shows that combining ab initio prediction with similarity searches against protein databases can improve prediction accuracy.
In the last step, we used the results of BLASTN against the intron database to modify the combinations of ab initio programs with BLASTX against the SWISS-PROT database. Results are shown in Table 4. The specificity of all combinations increased by ~4-5% compared with the combination of SWISS-PROT and ab initio programs alone. One drawback is a significant decrease in exon-level specificity. This arises because the combinations are made with a simple OR-based method. In this method, preference is given to all ab initio predictions. If the similarity searches give predictions other than those made by the ab initio programs, these are included in the combination. However, as seen in Table 1, similarity searches do not give correct exon boundaries; here the exon boundaries are simply the HSP (High-scoring Segment Pair) boundaries.
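The OR-based combination described above can be sketched as follows. The interpretation that "predictions other than those made by the ab initio programs" means HSPs that do not overlap any ab initio exon is an assumption, as are the function names and the toy coordinates.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def or_combine_with_preference(
    ab_initio: List[Interval], hsps: List[Interval]
) -> List[Interval]:
    """Keep all ab initio exons; add only HSPs that overlap none of them."""
    combined = list(ab_initio)
    for hsp in hsps:
        if not any(overlaps(hsp, exon) for exon in ab_initio):
            combined.append(hsp)
    return sorted(combined)


# Toy example: the real exon (500, 600) is missed by the ab initio program
# but partly covered by an HSP with inexact boundaries.
actual = [(100, 200), (300, 400), (500, 600)]
ab_initio = [(100, 200), (300, 400)]
hsps = [(120, 180), (510, 590)]

combined = or_combine_with_preference(ab_initio, hsps)
exact_matches = [exon for exon in combined if exon in actual]
print(combined)
print(len(exact_matches), "of", len(combined), "predicted exons match exactly")
```

The added HSP recovers most nucleotides of the missed exon, which helps nucleotide-level sensitivity, but because its boundaries are not the true exon boundaries it counts as a wrong exon and lowers exon-level specificity.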
The advantage of this strategy lies in a significant increase in the number of genes predicted, without a decrease in the probability of correct prediction (specificity) at the nucleotide level, as seen in Table 4.