Computational Tools for Genome Annotation

Sequencing techniques are becoming steadily more advanced, and the number of sequenced genomes is increasing exponentially as a result. One of the major challenges in contemporary science is annotating the available sequence data. Annotation identifies the coding regions in a genome and their physical locations. It also describes the number and spatial distribution of repeat regions and provides evolutionary information about whole genomes.
Several computational tools have been developed to cut down the time and expense involved in experimental annotation. The computational resources at CRDD are classified into the following categories:

Servers integrated at CRDD

Server	Description
FTG	A web server for locating probable protein-coding regions in a nucleotide sequence using a Fourier transform approach (Issac, B., Singh, H., Kaur, H. and Raghava, G.P.S. (2002) Bioinformatics 18:196).
EGPred	This server predicts genes (protein-coding regions) in eukaryotic genomes, including introns and exons, using similarity-aided (double) and consensus ab initio methods (Issac B, Raghava GP. (2004) Genome Res. 14(9):1756-66).
FTGPred	A web server for predicting genes in a DNA sequence.
GWBLAST	A genome-wide BLAST server. It allows users to search their sequences against sequenced genomes and annotated proteomes. It integrates various tools for the analysis of BLAST searches.
SVMgene	A support vector machine (SVM) based approach to identifying the protein-coding regions in human genomic DNA.
SRF	Spectral Repeat Finder (SRF) is a program to find repeats through an analysis of the power spectrum of a given DNA sequence. By repeat we mean the repeated occurrence of a segment of N nucleotides within a DNA sequence. SRF is an ab initio technique as no prior assumptions need to be made regarding either the repeat length, its fidelity, or whether the repeats are in tandem or not (Sharma D, Issac B, Raghava GP, Ramaswamy R. (2004) Bioinformatics. 20(9):1405-12)
GWFASTA	Genome Wise Sequence Similarity Search using FASTA. It allows users to search their sequences against sequenced genomes and their product proteomes. It integrates various tools for the analysis of FASTA searches (Issac, B. and Raghava, G.P.S. (2002) Biotechniques 33:548-56).
GeneBench	A suite of datasets and tools for evaluating gene prediction methods.
MyPattern	MyPattern Finder is a program for detecting a 'motif' in a DNA sequence using either an exact search method (Option A (1.0)) or an alignment technique (Option B (1.0)).
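
Fourier-based tools such as FTG and SRF rely on the fact that periodic structure in a DNA sequence shows up as a peak in its power spectrum; protein-coding regions in particular tend to show a peak at period 3. The sketch below illustrates that general idea only and is not the algorithm of either server. The function names, the 351-base default window, and the normalisation by window length are assumptions made for this example.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences and normalised by sequence length.
    (Illustrative measure; the normalisation is an assumption.)"""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient at frequency 1/3 of the 0/1 indicator
        # sequence that marks where `base` occurs.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n

def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

if __name__ == "__main__":
    periodic = "ATG" * 30   # perfect period-3 signal
    flat = "A" * 90         # no period-3 signal
    print(period3_power(periodic), period3_power(flat))
```

Windows whose score stands well above the background are candidates for closer inspection; a real gene finder would combine such a signal with other evidence before calling a coding region.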

Meta-servers, web-servers and mirroring of web-servers and databases

Name	Can be Used For	Algorithm	References
GeneMark	Archaea, Metagenomes, Eukaryotes, Viruses, Phages, Plasmids, EST and cDNA	Hidden Markov model	Besemer J. and Borodovsky M. Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
GeneHacker	Microbial genomes	Markov model	Yada T., Hirosawa M. DNA Res., 3, 335-361 (1996). Syst. Mol. Biol. pp. 252-260 (1996). Syst. Mol. Biol. pp. 354-357 (1997).
GeneWalker	Human	Hidden Markov model
HMMgene (v. 1.1)	Vertebrates and C. elegans	Hidden Markov model	A. Krogh: In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
Chemgenome2.0	Prokaryotes	Ab initio method	Poonam Singhal, B. Jayaram, Surjit B. Dixit and David L. Beveridge. Prokaryotic Gene Finding based on Physicochemical Characteristics of Codons Calculated from Molecular Dynamics Simulations. Biophysical Journal, 2008, 94(11), 4173-4183.
Softberry Server	Bacteria, viruses and eukaryotes	HMM and similarity-based searches	Solovyev V.V., Salamov A.A., Lawrence C.B. (Nucl. Acids Res., 1994, 22, 24, 5156-5163).
Gene ID	Animals, humans, plants, fungi, protists	Neural network	Blanco et al., Genome Research 6(4):511-515 (2000).
GenScan	Vertebrates, Arabidopsis, Maize	Ab initio method	Burge and Karlin (1998) Curr. Opin. Struct. Biol. 8, 346-354.

Web Interface on Libraries

Standalone Software

Name	Can be Used For	Algorithm	References
GenomeThreader	Plants	Similarity-based gene prediction program where additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments	Gremme et al. Information and Software Technology, 47(15):965-978, 2005
JIGSAW (formerly "Combiner")	Eukaryotes	Combines multiple sources of evidence (output from gene finders, splice site prediction programs and sequence alignments) to predict gene models	Allen et al. Genome Biology 2007, 7(Suppl):S9; Allen and Salzberg Bioinformatics 21(18): 3596-3603, 2005; Allen et al. Genome Research, 14(1), 2004.
GlimmerHMM	Eukaryotes	GlimmerHMM is based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, it also incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It uses Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single).	Majoros et al. Bioinformatics 20 2878-2879, 2004
GeneZilla	Eukaryotes	GeneZilla is based on the Generalized Hidden Markov Model (GHMM). It evolved out of the ab initio eukaryotic gene finder TIGRscan, which was developed at The Institute for Genomic Research.	GeneZilla (formerly "TIGRscan") is briefly described in: Majoros W, et al. (2004) Bioinformatics 20, 2878-2879. The novel decoding algorithm used by GeneZilla is described in: Majoros W. et al. (2005) BMC Bioinformatics 5:616.
Twinscan/N-SCAN (Ver 4.1.2)		TWINSCAN extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation.	TWINSCAN: Korf, Flicek et al. Bioinformatics. 2001;17 Suppl 1:S140-8. N-SCAN: Gross and Brent. J Comput Biol. 2006 Mar;13(2):379-93.
Manatee	Prokaryotic and eukaryotic genomes	Manatee is a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. The Manatee interface allows biologists to quickly identify genes and make high-quality functional assignments using a multitude of genome analysis tools. These tools include, but are not limited to, GO classifications, BER and BLAST search data, paralogous families, and annotation suggestions generated from automated analysis.	NA
EvoGene	NA	alignment of multiple genomic sequences	Pedersen and Hein. Bioinformatics (in press)
CRITICA (Coding Region Identification Tool Invoking Comparative Analysis)	Prokaryotic	CRITICA combines traditional approaches to the problem with a novel comparative analysis. If, in a nucleotide alignment, a pair of ORFs can be found in which the conceptually translated products are more conserved than would be expected from the amount of conservation at the nucleotide level, this is evolutionary evidence that the DNA sequences are protein coding. Regions found by this method are used to generate traditional dicodon frequencies for further analysis and to predict probable protein-coding regions.	Badger and Olsen. Molecular Biology and Evolution, 16(4):512-524. 1999.
sgp2		Sgp2 predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx, a sequence similarity search program, with geneid, an "ab initio" gene prediction program.	Parra et al. Genome Research 13(1):108-117 (2003)
Phat	Eukaryotes (Homo sapiens, Plasmodium falciparum, Plasmodium vivax)	Phat is an HMM-based gene finder, originally developed for gene finding in Plasmodium falciparum.	Unpublished
EuGene	Eukaryotes	EuGène exploits probabilistic models such as Markov models to discriminate coding from non-coding sequences and true splice sites from false ones (using various mathematical models).	LNCS 2066, pp. 111-125, 2001
AUGUSTUS	Eukaryotic genomic sequences	It can use protein homology information in the prediction.	Stanke and Waack (2003) Bioinformatics, Vol. 19, Suppl. 2, pages ii215-ii225
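
Several of the programs listed above use Markov models to discriminate coding from non-coding sequence. The sketch below shows the basic idea in its simplest form: train one Markov chain on known coding sequence and one on non-coding sequence, then score a query by its log-odds ratio. It is not the implementation of any listed tool. The function names, the first-order default, the add-one smoothing, and the toy training strings are all assumptions made for this example, and the code assumes sequences contain only A, C, G and T.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.
    Returns a function prob(context, base) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding, noncoding, order=1):
    """Sum of log(P_coding / P_noncoding) over every position that has
    a full context. Positive scores favour the coding model."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score

if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding = train_markov(["ATGGCCGCCGCCGCC" * 3])
    noncoding = train_markov(["ATATATATATTTAATA" * 3])
    for query in ("GCCGCCGCC", "ATATATAT"):
        print(query, log_odds(query, coding, noncoding))
```

Real gene finders typically use higher-order, codon-phase-aware models trained on large annotated datasets, and embed scores like this one inside an HMM or GHMM that also models splice sites and exon and intron lengths.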

Databases

Name	Description
GeneCards	A database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is especially useful for those working in functional genomics and proteomics who are searching for information. The data are collected using knowledge discovery and data mining techniques and accessed through a proprietary guidance system that suggests to the user where and how the information may be retrieved.
TRANSFAC	TRANSFAC is a transcription factor database. It compiles data about gene regulatory DNA sequences and protein factors binding to them. On this basis, programs are developed that help to identify putative promoter or enhancer structures and to suggest their features.
The EpoDB (Erythropoiesis Database)	A database of genes that relate to vertebrate red blood cells. A detailed description of EpoDB can be found in Chapter 5. The database includes DNA sequences, structural features and potential transcription factor binding sites.
PlantProm DB	A database of plant promoters.
RegulonDB	RegulonDB provides curated information on gene organization and regulation in E. coli. Current information is provided on the gene, operon and regulon level. Future expansion will include information on regulation beyond transcription initiation.