Computational Tools for Genome Annotation

Sequencing techniques are advancing rapidly, and the number of sequenced genomes is increasing exponentially as a result. One of the major challenges in contemporary science is to annotate the available sequence data. Annotation identifies the coding regions in a genome and their physical locations. It also provides the number and spatial distribution of repeat regions and evolutionary information about whole genomes.
Several computational tools have been developed to cut down the time and expense involved in experimental annotation. Computational resources at CRDD are classified into the following categories:

Servers integrated at CRDD

Server | Description
FTG | A web server for locating probable protein-coding regions in a nucleotide sequence using a Fourier transform approach (Issac, B., Singh, H., Kaur, H. and Raghava, G.P.S. (2002) Bioinformatics 18:196).
EGPred | This server predicts genes (protein-coding regions), including introns and exons, in eukaryotic genomes using similarity-aided (double) and consensus ab initio methods (Issac B, Raghava GP. (2004) Genome Res. 14(9):1756-66).
FTGPred | A web server for predicting genes in a DNA sequence.
GWBLAST | A genome-wide BLAST server. It allows users to search their sequences against sequenced genomes and annotated proteomes, and integrates various tools for analysing BLAST search results.
SVMgene | A support vector machine (SVM) based approach to identify the protein-coding regions in human genomic DNA.
SRF | Spectral Repeat Finder (SRF) is a program to find repeats through an analysis of the power spectrum of a given DNA sequence. By repeat we mean the repeated occurrence of a segment of N nucleotides within a DNA sequence. SRF is an ab initio technique, as no prior assumptions need to be made regarding the repeat length, its fidelity, or whether the repeats are in tandem or not (Sharma D, Issac B, Raghava GP, Ramaswamy R. (2004) Bioinformatics 20(9):1405-12).
GWFASTA | Genome Wise Sequence Similarity Search using FASTA. It allows users to search their sequences against sequenced genomes and their product proteomes, and integrates various tools for analysing FASTA search results (Issac, B. and Raghava, G.P.S. (2002) Biotechniques 33:548-56).
GeneBench | A suite of datasets and tools for evaluating gene prediction methods.
MyPattern | MyPattern Finder is a program for detecting a 'motif' in a DNA sequence using an exact search method (Option A (1.0)) or an alignment technique (Option B (1.0)).
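
The Fourier-based idea behind servers such as FTG and SRF can be illustrated with a small sketch. Protein-coding DNA tends to show a periodicity of three bases, which appears as a peak in the power spectrum at frequency 1/3. The code below is only an illustration under stated assumptions, not the implementation used by FTG or SRF: the function names `period3_power` and `sliding_period3`, the window size, the step and the normalisation by window length are all hypothetical choices.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences and divided by the sequence length.
    (The normalisation is an assumption made for this sketch.)"""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier coefficient of the 0/1 indicator of this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * k / 3)
                    for k, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n

def sliding_period3(seq, window=60, step=3):
    """Return (start, power) pairs for windows sliding along seq."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

# Toy sequence: a strongly 3-periodic stretch flanked by 4-periodic DNA.
toy = "ACGT" * 36 + "ATG" * 48 + "ACGT" * 36
profile = dict(sliding_period3(toy))
print(profile[0], profile[144])
```

Windows with a high value are candidates for coding regions in this simplified picture. A real tool would also need a significance threshold and handling of ambiguous bases.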

Meta-servers, web-servers and mirroring of web-servers and databases

Name | Can be Used For | Algorithm | References
GeneMark | Archaea, Metagenomes, Eukaryotes, Viruses, Phages, Plasmids, EST and cDNA | Hidden Markov model | Besemer J. and Borodovsky M. Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
GeneHacker | Microbial genomes | Markov model | Yada T., Hirosawa M. DNA Res., 3, 335-361 (1996). Syst. Mol. Biol. pp. 252-260 (1996). Syst. Mol. Biol. pp. 354-357 (1997).
GeneWalker | Human | Hidden Markov model |
HMMgene (v. 1.1) | Vertebrate and C. elegans | Hidden Markov model | A. Krogh: In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
Chemgenome2.0 | Prokaryotes | Ab initio method | Poonam Singhal, B. Jayaram, Surjit B. Dixit and David L. Beveridge. Prokaryotic Gene Finding based on Physicochemical Characteristics of Codons Calculated from Molecular Dynamics Simulations. Biophysical Journal, 2008, Volume 94, Issue 11, 4173-4183.
Softberry Server | Bacteria, Viruses and Eukaryotes | HMM and similarity-based searches | Solovyev V.V., Salamov A.A., Lawrence C.B. (Nucl. Acids Res., 1994, 22, 24, 5156-5163).
Gene ID | Animal, Human, Plants, Fungi, Protists | Neural network | Blanco et al., Genome Research 6(4):511-515 (2000).
GenScan | Vertebrates, Arabidopsis, Maize | Ab initio method | Burge and Karlin (1998) Curr. Opin. Struct. Biol. 8, 346-354.
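
Several of the servers above rely on Markov models. As a rough illustration of the underlying idea only, the sketch below trains two first-order Markov chains, one on example coding sequences and one on example non-coding sequences, and scores a query by its log-likelihood ratio. None of this is taken from GeneMark, GeneHacker or any other listed tool: the function names, the pseudocount and the toy training sequences are assumptions, and real gene finders use considerably richer models.

```python
import math

BASES = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        s = s.upper()
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in BASES}
    return model

def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over adjacent base pairs.
    Positive scores favour the coding model, negative the non-coding one."""
    seq = seq.upper()
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a in coding and b in coding[a]:
            score += math.log(coding[a][b] / noncoding[a][b])
    return score

# Toy training data (made up for illustration, not real annotations).
coding_model = train_markov(["ATGGCCGCGGCC" * 5])
noncoding_model = train_markov(["ATATTTAATATA" * 5])

print(log_odds("GCCGCG", coding_model, noncoding_model))
print(log_odds("ATATTA", coding_model, noncoding_model))
```

With these toy models the GC-rich query scores above zero and the AT-rich query scores below zero. The pseudocount keeps transitions unseen in training from producing zero probabilities.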

Web Interface on Libraries

Standalone Software
Name | Can be Used For | Algorithm | References
GenomeThreader | Plants | Similarity-based gene prediction program in which additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments. | Gremme et al. Information and Software Technology, 47(15):965-978, 2005
JIGSAW (formerly "Combiner") | Eukaryotes | Combines multiple sources of evidence (output from gene finders, splice site prediction programs and sequence alignments) to predict gene models. | Allen et al. Genome Biology 2007, 7(Suppl):S9.; Allen and Salzberg Bioinformatics 21(18): 3596-3603, 2005; Allen et al. Genome Research, 14(1), 2004.
GlimmerHMM | Eukaryotes | GlimmerHMM is based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, it additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also uses Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single). | Majoros et al. Bioinformatics 20 2878-2879, 2004
GeneZilla | Eukaryotes | GeneZilla is based on the Generalized Hidden Markov Model (GHMM). It evolved out of the ab initio eukaryotic gene finder TIGRscan, which was developed at The Institute for Genomic Research. | GeneZilla (formerly "TIGRscan") is briefly described in: Majoros W, et al. (2004) Bioinformatics 20, 2878-2879. The novel decoding algorithm used by GeneZilla is described in: Majoros W. et al. (2005) BMC Bioinformatics 5:616.
Twinscan/N-SCAN (Ver 4.1.2) | | TWINSCAN extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. | TWINSCAN: Gross and Brent. J Comput Biol. 2006 Mar;13(2):379-93. Korf I, N-SCAN: Flicek et al Bioinformatics. 2001;17 Suppl 1:S140-8.
Manatee | Prokaryotic and eukaryotic genomes | Manatee is a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. The Manatee interface allows biologists to quickly identify genes and make high-quality functional assignments using a multitude of genome analysis tools. These tools include, but are not limited to, GO classifications, BER and BLAST search data, paralogous families, and annotation suggestions generated from automated analysis. | NA
EvoGene | NA | Alignment of multiple genomic sequences | Pedersen and Hein. Bioinformatics (in press)
CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) | Prokaryotic | CRITICA combines traditional approaches to the problem with a novel comparative analysis. If, in a nucleotide alignment, a pair of ORFs can be found in which the conceptual translated products are more conserved than would be expected from the amount of conservation at the nucleotide level, this is evolutionary evidence that the DNA sequences are protein coding. Regions found by this method are used to generate traditional dicodon frequencies for further analysis and to predict probable protein-coding regions. | Badger and Olsen. Molecular Biology and Evolution, 16(4):512-524. 1999.
sgp2 | | Sgp2 predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx, a sequence similarity search program, with geneid, an "ab initio" gene prediction program. | Parra et al. Genome Research 13(1):108-117 (2003)
Phat | Eukaryotes (Homo sapiens, Plasmodium falciparum, Plasmodium vivax) | Phat is an HMM-based gene finder, originally developed for gene finding in Plasmodium falciparum. | Unpublished
EuGene | Eukaryotes | EuGène exploits probabilistic models such as Markov models to discriminate coding from non-coding sequences and effective splice sites from false splice sites (using various mathematical models). | LNCS 2066, pp. 111-125, 2001
AUGUSTUS | Eukaryotic genomic sequences | It allows protein homology information to be used in the prediction. | Stanke and Waack (2003) Bioinformatics, Vol. 19, Suppl. 2, pages ii215-ii225
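
The comparative idea described for CRITICA, that translated products of coding ORFs are more conserved than the underlying nucleotides, can be illustrated with a toy calculation. The sketch below is not CRITICA's statistic. It assumes two gap-free, in-frame, equal-length ORFs and simply compares nucleotide identity with amino-acid identity under the standard genetic code. The function names and the example ORFs are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

def translate(orf):
    """Conceptual translation; unknown codons become 'X'."""
    orf = orf.upper()
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, len(orf) - len(orf) % 3, 3))

def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def conservation(orf1, orf2):
    """Return (nucleotide identity, amino-acid identity) for aligned ORFs."""
    return (identity(orf1.upper(), orf2.upper()),
            identity(translate(orf1), translate(orf2)))

# Two made-up ORFs that differ only by synonymous substitutions.
nt_id, aa_id = conservation("ATGGCTCTGAAA", "ATGGCCTTAAAG")
print(nt_id, aa_id)
```

Here the proteins are identical although a third of the nucleotides differ, which is the kind of pattern that points towards a protein-coding region. A real method would need a statistical model of how much protein conservation to expect at a given nucleotide divergence.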

Databases

Name | Description
GeneCards | A database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is especially useful for those working in functional genomics and proteomics who are searching for information. The data are collected with knowledge discovery and data mining techniques and accessed by means of a proprietary Guidance System that makes more or less intelligent suggestions to the user about where and how the information may be retrieved.
TRANSFAC | TRANSFAC is a transcription factor database. It compiles data about gene regulatory DNA sequences and the protein factors binding to them. On this basis, programs are developed that help to identify putative promoter or enhancer structures and to suggest their features.
The EpoDB (Erythropoiesis Database) | A database of genes that relate to vertebrate red blood cells. A detailed description of EpoDB can be found in Chapter 5. The database includes DNA sequences, structural features and potential transcription factor binding sites.
PlantProm DB | A database of plant promoters.
RegulonDB | RegulonDB provides curated information on gene organization and regulation in E. coli. Current information is provided at the gene, operon and regulon level. Future expansion will include information on regulation beyond transcription initiation.