Supplementary Information: Human Chromosome 13
Predictions are available only in tabular format
Introduction:
To demonstrate the capability of EGPred, we analyzed part of human chromosome 13, which has recently been sequenced (Dunham et al., 2004). Human chromosome 13 is the largest acrocentric human chromosome and is estimated to contain 633 genes with a total of 4266 exons, excluding those from pseudogenes (Dunham et al., 2004). An initial analysis was performed as described in Methods. A total of 96175021 bp, spanning positions 17918001 to 114093021, was analyzed to predict genes.
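The reported length of the analyzed region can be checked against its coordinates. The following is a minimal sketch that assumes the coordinates are 1-based and inclusive at both ends, which the text does not state explicitly. The function and variable names are illustrative only.

```python
# Coordinates and length as reported in the text.
# Assumption: positions are 1-based and the interval is closed at both ends.
REGION_START = 17_918_001
REGION_END = 114_093_021
REPORTED_LENGTH = 96_175_021


def inclusive_length(start: int, end: int) -> int:
    """Return the number of bases in the closed interval [start, end]."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


region_length = inclusive_length(REGION_START, REGION_END)
length_is_consistent = region_length == REPORTED_LENGTH
print(region_length, length_is_consistent)
```

Under the inclusive-coordinate assumption the computed length agrees with the reported total.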
Genes were predicted using the two combinations implemented in EGPred, one based on Genscan and one based on HMMgene. Applying these two strategies to human chromosome 13 produced four sets of putative genes (two for each strand), which are available above and summarized in the table below. The Genscan-based EGPred strategy predicted 2125 multi-exon genes, 406 single-exon genes and 2065 partial genes. The HMMgene-based EGPred strategy predicted 4000 multi-exon genes, 220 single-exon genes and 2705 partial genes. Surprisingly, more than 70% of the exons predicted by EGPred are not reported in the annotation. Since EGPred uses similarity to protein sequences, a large fraction of the predicted genes is likely to be protein coding. However, the results suggest that all predictions from the similarity-based approach are also predicted by the ab initio approach. A considerable proportion of genes is estimated to be absent from current databases, so the predictions may also include novel protein-coding genes.
A direct computation of the sensitivity and specificity of the program from the available public domain annotation for human chromosome 13 is not possible, for two main reasons. First, transcripts of different genes overlap (see the public domain annotation file above). Second, most public domain annotations are manually curated at a final stage on the basis of available EST, cDNA or protein information. Since almost half of the genes and their products have not yet been identified, such curation will inadvertently result in incomplete data.
While EGPred has been demonstrated to be reliable, its success depends critically on the accuracy of the underlying programs, and continued improvements in gene prediction algorithms should improve future EGPred results.
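The gene counts above can be turned into per-category shares for each strategy. This is a minimal sketch that assumes the total number of predicted genes for a strategy is the sum of its multi-exon, single-exon and partial genes. The text does not state the totals, and the dictionary layout and function name are illustrative only.

```python
# Gene counts as reported in the text for the two EGPred strategies.
gene_counts = {
    "Genscan-based": {"multi-exon": 2125, "single-exon": 406, "partial": 2065},
    "HMMgene-based": {"multi-exon": 4000, "single-exon": 220, "partial": 2705},
}


def category_shares(counts: dict) -> tuple:
    """Return (total, {category: percent of total}) for one strategy.

    Assumption: the total is the sum of the three listed categories.
    """
    total = sum(counts.values())
    shares = {name: 100.0 * value / total for name, value in counts.items()}
    return total, shares


totals = {}
shares_by_strategy = {}
for strategy, counts in gene_counts.items():
    total, shares = category_shares(counts)
    totals[strategy] = total
    shares_by_strategy[strategy] = shares
    formatted = ", ".join(f"{name}: {pct:.1f}%" for name, pct in shares.items())
    print(f"{strategy}: {total} genes ({formatted})")
```

Under this assumption the totals are 4596 genes for the Genscan-based strategy and 6925 for the HMMgene-based strategy.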
Predictions were made using a similarity-based approach against the RefSeq protein database and the Intron database, in combination with two different ab initio predictors, Genscan and HMMgene. The number in each column header is the column number. The numbers in brackets in columns 2-4 give the percentage of total predicted genes, and the number in brackets in column 8 gives the percentage of the total analyzed nucleotide sequence of human chromosome 13. Columns 9-11 give the number of exons predicted by the ab initio approach only, by the similarity-based approach only, and by both approaches, respectively. The last column shows the number of predicted exons in each category that match exons in the public domain annotation. These matches include exact, partial and overlapping exon matches.
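The counting of predicted exons that match annotated exons can be illustrated as follows. This is a minimal sketch, not the procedure used to build the table. It assumes exons are 1-based inclusive (start, end) pairs on the same strand, and because the text does not define how partial and overlapping matches differ, it merges them into a single "overlap" class. All names and coordinates are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based and inclusive.
Exon = Tuple[int, int]


def match_type(predicted: Exon, annotated: Exon) -> str:
    """Classify one predicted exon against one annotated exon."""
    if predicted == annotated:
        return "exact"
    p_start, p_end = predicted
    a_start, a_end = annotated
    # Two closed intervals share at least one base when each one starts
    # no later than the other one ends.
    if p_start <= a_end and a_start <= p_end:
        return "overlap"  # covers both partial and overlapping matches
    return "none"


def count_matched(predicted: List[Exon], annotated: List[Exon]) -> int:
    """Count predicted exons that match at least one annotated exon."""
    return sum(
        any(match_type(p, a) != "none" for a in annotated)
        for p in predicted
    )


# Made-up coordinates, for illustration only.
annotated_exons = [(100, 200), (500, 650)]
predicted_exons = [(100, 200), (180, 260), (300, 400), (640, 700)]
matched = count_matched(predicted_exons, annotated_exons)
print(matched)
```

The pairwise comparison is quadratic in the number of exons. For chromosome-scale data, sorting the intervals or using an interval tree would be the usual alternative.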


