Software_and

GENE IDENTIFICATION UTILITY SITES

HMMgene
HMMgene is a program for prediction of genes in anonymous DNA. The
methods used are described in the paper: A. Krogh: Two methods for improving
performance of an HMM and their application for gene finding. In Proc. of Fifth
Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T.
et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186. The program predicts whole
genes, so the predicted exons always splice correctly. It can predict several whole
or partial genes in one sequence, so it can be used on whole cosmids or even longer
sequences. HMMgene can also be used to predict splice sites and start/stop codons.
If some features of a sequence are known, such as hits to ESTs, proteins, or repeat
elements, these regions can be locked as coding or non-coding and then the program
will find the best gene structure under these constraints. The program is based on
a hidden Markov model, which is a probabilistic model of the gene structure. This
means that all predictions have associated probabilities that reflect how confident
the program is in them. Apart from reporting the best prediction, HMMgene can also
report the N best gene predictions for a sequence. This is useful if there are
several equally likely gene structures and may even indicate alternative splicing.
HMMgene takes an input file with one or more DNA sequences in FASTA format. It also
has a few options for changing the default behavior of the program. The output is a
prediction of partial or complete genes in the sequences, written in a standardized
format that is easily read by other programs. It specifies the location of all the
predicted genes and their coding regions, along with scores for whole genes and for
individual exons.
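The idea of finding the single most probable labeling of a sequence under a
hidden Markov model can be sketched with the Viterbi algorithm on a toy two-state
model. This is only an illustration: the states, the probabilities, and the
function name viterbi are assumptions made for this example and have nothing to
do with HMMgene's actual model or trained parameters.

```python
import math

# Toy two-state HMM. All probabilities are made-up illustrative values,
# not HMMgene's trained parameters.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return (best_state_path, log_probability) for a non-empty DNA string."""
    # scores[s] = best log probability of any state path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    best = scores[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best


path, logp = viterbi("ATATATATATGCGCGCGCGC")
```

Because the score of the best path is a (log) probability, the same machinery
can attach a confidence value to a prediction, which is the property the
description above relies on.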
FramePlot
FramePlot is a web-based tool for predicting protein-coding regions in
bacterial DNA with a high G+C content, such as Streptomyces. The graphical output
provides for easy distinction of protein-coding regions from non-coding regions.
The plot is a clickable map. Clicking on an ORF provides not only the nucleotide
sequence but also its deduced amino acid sequence. These sequences can then be
compared to the NCBI sequence database over the Internet. The program is freely
available for academic purposes at http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl.
tRNAScan

tRNA detection in genome sequences. tRNAscan-SE detects ~99% of
eukaryotic nuclear or prokaryotic tRNA genes, with a false positive rate of
less than one per 15 gigabases, and with a search speed of about 30 kb/
second. It was implemented for large-scale human genome sequence analysis,
but is applicable to other DNAs as well.
NetGene
Artificial neural networks have been combined with a rule-based system
to predict intron splice sites in the dicot plant Arabidopsis thaliana. A two-step
prediction scheme, where a global prediction of the coding potential regulates a
cutoff level for a local prediction of splice sites, is refined by rules based on
splice site confidence values, prediction scores, coding context, and distances
between potential splice sites. In this approach, the predictions of splice sites
mutually affect each other in a non-local manner. The combined approach drastically
reduces the large number of false positive splice sites that normally plague splice
site prediction. An analysis of the errors made by the networks in the first step
of the method revealed a previously unknown feature, a frequent T-tract prolongation
containing cryptic acceptor sites in the 5' end of exons. The method presented here
has been compared to three other approaches, GeneFinder, GeneMark, and Grail.
Overall the method presented here is an order of magnitude better. We show that the
new method is able to find a donor site in the coding sequence for the jellyfish
Green Fluorescent Protein, exactly at the position that was experimentally observed
in A. thaliana transformants. Predictions for alternatively spliced genes are also
presented, together with examples of genes from other dicots, monocots, and algae.
The method has been made available through electronic mail (NetPlantGene@cbs.dtu.dk)
or the WWW at http://www.cbs.dtu.dk/NetPlantGene.html
ORFFinder
The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool
which finds all open reading frames of a selectable minimum size in a user's
sequence or in a sequence already in the database. This tool identifies all open
reading frames using the standard or alternative genetic codes. The deduced amino
acid sequence can be saved in various formats and searched against the sequence
database using the WWW BLAST server. The ORF Finder should be helpful in preparing
complete and accurate sequence submissions. It is also packaged with the Sequin
sequence submission software.
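The core task described above, locating every open reading frame of at least a
chosen minimum size, can be sketched in a few lines. This is a minimal
illustration and not the ORF Finder implementation: it assumes the standard
genetic code, treats an ORF as an ATG followed in frame by the first stop codon,
and the names find_orfs and min_length are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=30):
    """Return a list of (strand, frame, start, end) for ATG..stop ORFs.

    Only ORFs of at least min_length nucleotides (stop codon included) are
    kept. Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        results.append((strand, frame, start, i + 3))
                    start = None
    return results


orfs = find_orfs("CCATGAAATTTGGGTAACC", min_length=15)
```

Supporting alternative genetic codes would amount to swapping the set of stop
(and start) codons for the ones defined by the chosen code.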
BCM GeneFinder

Gene Finder is a web-based gene-prediction tool available freely
from the Baylor College of Medicine. BCM's tool allows one to interact
directly, provided the submissions are less than 7 Kb, or to send queries
in via e-mail. There are a number of algorithms available separately at BCM,
which collectively are called FGENEH and perform 'gene prediction' by exon
assembly. BCM also offers two stand-alone UNIX programs for use with human
and bacterial data.
URL: http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
Grail

GRAIL is a suite of tools designed to provide analysis and putative
annotation of DNA sequences both interactively and through the use of
automated computation. The capabilities of GRAIL are available by several
methods. These include an e-mail server at Oak Ridge National Laboratory
(ORNL), which processes DNA sequence(s) contained in e-mail messages, a
web-based utility, and an interactive graphical X-based client-server system
called Xgrail, which supports a wide range of analysis tools, including gene
modeling. There are several versions of Grail: GRAIL 1 uses a neural network
described in PNAS 88, 11261-11265, which recognizes coding potential within
a fixed size (100 base) window. It evaluates coding potential without looking
for additional features (information such as splice junctions, etc). GRAIL
1a is an updated version of GRAIL 1. It uses a fixed-length window to locate
the potential coding regions and then evaluates a number of discrete
candidates of different lengths around each potential coding region, using
information from the two 60-base regions adjacent to that coding region, to
find the "best" boundaries for that coding region. GRAIL 2 uses
variable-length windows tailored to each potential exon candidate, defined as
an open reading frame bounded by a pair of start/donor, acceptor/donor or
acceptor/stop sites. This scheme facilitates the use of more genomic context
information (splice junctions, translation starts, non-coding scores of
60-base regions on either side of a putative exon) in the exon recognition
process. GRAIL 2 is therefore not appropriate for sequences without genomic
context (when the regions adjacent to an exon are not present). The following
organisms are supported: Human, Mouse, Arabidopsis, Drosophila, and E. coli.
URL (X client): ftp://128.219.9.76/pub/xgrail/sun/ver1.2
URL (web): http://grail.lsd.ornl.gov/Grail-1.3/
GeneMark

GeneMark, which uses a Hidden Markov Model (HMM) approach, was
originally written for use on bacterial genomes, which of course have no
introns. After a number of published successes, the authors have modified
a version to work with eukaryotes, although it is primarily for long exons
and ESTs. Training sets either exist or are being developed for these
eukaryotes: Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana,
Drosophila melanogaster, and Chlamydomonas reinhardtii. GeneMark is
web-based, with both client submission forms and e-mail servers, but the
stand-alone version can be obtained by academic and not-for-profit
institutions by signing a license.
URL (client version): http://genemark.biology.gatech.edu/GeneMark/index.html
Stand-alone version (contact): lukashin@amber.biology.gatech.edu
Genie

Genie uses a statistical model of genes in DNA. A Generalized
Hidden Markov Model (GHMM) provides the framework for describing the
grammar of a legal parse of a DNA sequence. Probabilities are assigned to
transitions between states in the GHMM and to the generation of each
nucleotide base given a particular state. Machine learning techniques are
applied to optimize these probabilities using a standardized gene data set.
URL: http://www.fruitfly.org/seq_tools/genie.html
GENSCAN

Genscan is another client-server arrangement which offers complete
gene prediction, in that it uses a number of different algorithms to predict
introns, exons (leading, internal, and terminal), donor and acceptor splice
sites, and polyadenylation sites. The highest scoring arrangement of these
categories is then used to predict actual gene composition. Genscan supports
polygenic genomic DNA input, and scans both strands for coding regions.
Genscan has an additional feature that draws a GIF representation of the
resultant prediction, showing all putative exons in their respective positions
on either strand, and whether they are leading, internal or terminal, and a
simplified scoring scheme. Genscan was developed for use on vertebrate
sequences, but has been used with some success on maize, Arabidopsis and
Drosophila. Whether other training/background sets will be implemented is
unclear. I am in the process of contacting the author in an attempt to obtain
the stand-alone source code, but as it stands, it is purely web-based, and
can support submissions up to 200 Kb. Larger sequences can be submitted
through their e-mail server.
URL: http://gnomic.stanford.edu/~chris/GENSCANW.html
FGENES

Pattern-based human gene structure prediction (multiple genes, both
chains). The algorithm is based on pattern recognition of different types of
exons, promoters, and polyA signals, and uses dynamic programming to find the
optimal combination of them, constructing a set of gene models along a given
sequence.
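The dynamic programming step can be illustrated with a much simplified version
of the problem: given candidate exons with scores, pick the non-overlapping set
with the highest total score. This is a sketch under stated assumptions, not the
FGENES algorithm, which also has to respect reading frame, exon type order,
promoters, and polyA signals. The function name best_exon_chain and the example
scores are invented for illustration.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Return (total_score, chain) for the best set of non-overlapping exons.

    exons is a list of (start, end, score) tuples with end exclusive.
    """
    exons = sorted(exons, key=lambda e: e[1])  # process in order of end position
    ends = [e[1] for e in exons]
    # best[i] = (score, chain) achievable using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # k = number of earlier exons that end at or before this exon's start,
        # so best[k] is the best solution compatible with taking this exon.
        k = bisect_right(ends, start, 0, i)
        take_score = best[k][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[k][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]


candidates = [(0, 100, 5.0), (50, 150, 8.0), (150, 300, 4.0), (120, 200, 3.0)]
total, chain = best_exon_chain(candidates)
```

Each candidate is examined once after sorting, so the cost grows roughly as
n log n in the number of candidates rather than with the number of possible
combinations.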
FGENES-M

Pattern-based prediction of multiple variants of human gene structure.
The algorithm outputs several suboptimal variants of the predicted gene structure.
In the current WWW server version, up to 15 variant structures, each of which may
contain multiple genes, are provided. Like FGENES, it is based on pattern
recognition of different types of exons, promoters, and polyA signals, and uses
dynamic programming to find the optimal combination of them, constructing a set
of gene models along a given sequence.
Splice

A neural-network-based program to find possible 5' and 3' splice
sites.
Procrustes
Given a genomic sequence and a set of candidate exons, the spliced alignment
algorithm explores all possible exon assemblies and finds a chain of exons with the
best fit to a related target protein. The set of candidate exons is constructed by
selection of all blocks between candidate acceptor and donor sites (i.e. between AG
dinucleotide at intron-exon boundary and GU dinucleotide at exon-intron boundary)
and further filtration of this set. To avoid losing true exons, the filtration
procedure is made very gentle, and the resulting set of blocks may contain a large
number of false exons. Instead of trying to identify the correct exons by
statistical methods, PROCRUSTES considers all possible chains of candidate exons
and finds a chain with the maximum global similarity to the target protein. The
number of exon assemblies is huge; however, the spliced alignment algorithm is
fast enough to process large genomic fragments (up to 180,000 nucleotides)
containing multi-exon genes (more than 30 exons). After the highest-scoring
exon assembly is found, the hope is that it represents the correct exon-intron
structure. This is almost guaranteed if a protein sufficiently similar to the one
encoded in the analyzed fragment is available (99% correlation between predicted
and actual genes with mammalian targets). Tests are
reported in Gelfand, Mironov and Pevzner, 1996. At the postprocessing step PROCRUSTES
assigns the guaranteed level of correlation between the predicted and the actual
protein. In the Las-Vegas version PROCRUSTES generates a set of suboptimal spliced
alignments and uses it to assess the confidence level for the complete predicted
gene or individual exons. In many cases there are no reasons to believe that the
analyzed genomic fragment contains a complete gene. PROCRUSTES has a local spliced
alignment mode that should be used in such situations.
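The construction of the candidate exon set, every block lying between a
candidate acceptor (AG) and a candidate donor (GU, written GT in DNA), can be
sketched as follows. This covers only the enumeration step, before any
filtration or spliced alignment, and the function name candidate_blocks and the
min_len parameter are assumptions made for this example.

```python
def candidate_blocks(seq, min_len=1):
    """Return (start, end) pairs for all blocks between an AG and a GT.

    A block starts immediately after an AG dinucleotide (end of an intron)
    and ends immediately before a GT dinucleotide (start of the next intron).
    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


dna = "TTAGCCCGTAA"
blocks = candidate_blocks(dna)
exon_sequences = [dna[s:e] for s, e in blocks]
```

The number of blocks grows with the product of the number of AG and GT
occurrences, which is why the resulting set contains many false exons and why
the choice among them is left to the alignment against the target protein.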
GenePrimer

This software implements an algorithm for experimental gene
identification by multiple PCR amplifications described in Sze S.-H., Roytberg
M.A., Gelfand M.S., Mironov A.A., Astakhova T.V. and Pevzner P.A. (1998)
Algorithms and software for support of gene identification experiments.
Bioinformatics, 14, 14-19. Since current algorithms for gene recognition
make mistakes, biologists have to perform experimental gene identification
to eliminate errors in predictions. Conventional approaches amount to
`guessing' PCR primers on top of unreliable gene predictions and frequently
lead to wasted experimental effort. An algorithm which eliminates the
need for gene verification in some cases is the Las Vegas algorithm for gene
recognition. The algorithm locates a set of PCR primers which relatively
uniformly cover the exons and can be used for RT-PCR and further sequencing
of (unknown) mRNA.
MIRAGE

MIRAGE (Molecular Informatics Resource for the Analysis of Gene
Expression) is a web site dedicated to methodologies, tools, and
technologies relating to information in the study of gene expression. MIRAGE is an
experimental web resource of the Institute for Transcriptional Informatics
(IFTI), Pittsburgh PA 15230-2556 USA.
TransTerm
TransTerm is a database of sequence contexts around the stop and start
codons of many species found in GenBank. TransTerm also contains codon usage data
for these same species and summary statistics for the sequences analysed.
PLACE
PLACE is a database of motifs found in plant cis-acting regulatory DNA
elements, all from previously published reports. It covers vascular plants only.
In addition to the motifs originally reported, their variations in other genes or
in other plant species reported later are also compiled. The PLACE database also
contains a brief description of each motif and relevant literature with PubMed ID
numbers. DDBJ/EMBL/GenBank nucleotide sequence database accession numbers will
also be included.
NNPP
The function of the eukaryotic promoter as an initiator for transcription
is one of the most complex processes in molecular biology. It has been shown that
multiple functional sites in the primary DNA are involved in the polymerase binding
process. These elements, such as TATA-box, GC-box, CAAT-box and the transcription
start site, are known to function as binding sites for transcription factors and
other proteins that are involved in the initiation process. These promoter elements
are present in various combinations separated by various distances in sequence. A
neural network is trained to recognize promoter elements until it reaches a local
minimum. Then the pruning procedure deletes those weights in the network that add
the lowest predictive value to the overall prediction. After pruning, the neural
network is retrained until it is stuck again in a minimum. This procedure is
repeated until a defined error level is reached. Eventually, the pruned neural network
gives clues about the importance of specific positions in the promoter element by
studying the remaining weights.
FastM/ModelInspector

A program for the generation of models for regulatory regions in
DNA sequences. FastM uses the TRANSFAC 4.0 matrices.
MatInd and MatInspector

Search for potential transcription factor binding sites in your
own sequences with the matrix search program MatInspector using the
TRANSFAC 4.0 matrices.
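Matrix-based searching of this kind can be sketched with a generic log-odds
weight matrix scan. This is a textbook-style illustration and not
MatInspector's own scoring scheme; the count matrix below is hypothetical (it
is not a TRANSFAC entry), and the names log_odds_matrix and scan, the
pseudocount, and the uniform background are assumptions made for this example.

```python
import math

# Hypothetical 4-position count matrix (one dict of base counts per position).
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(seq, matrix, threshold):
    """Return (position, site, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


hits = scan("GGTATAGG", log_odds_matrix(COUNTS), threshold=5.0)
```

The threshold controls the trade-off between missed sites and false positives,
so in practice it would be tuned per matrix rather than fixed as it is here.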
UTRscan

Understanding the basic mechanisms of cell growth, differentiation
and response to environmental stimuli, i.e. the program controlling the
temporal and spatial order of molecular events, is becoming a real challenge
in molecular biology. Indeed, although most of the regulatory elements are
thought to be embedded in the non-coding part of the genomes, nucleotide
databases are biased by the presence of expressed sequences mostly
corresponding to the protein coding portion of the genes. Among non-coding
regions, the 5' and 3' untranslated regions (5'-UTR and 3'-UTR) of
eukaryotic mRNAs have often been experimentally demonstrated to contain
sequence elements crucial for many aspects of gene regulation and
expression. The program UTRscan looks for UTR functional elements by searching
through user submitted sequence data for the patterns defined in the
UTRsite collection. UTRsite is a collection of functional sequence patterns
located in 5' or 3' UTR sequences.
Artemis
Artemis is a DNA sequence viewer and annotation tool that allows
visualisation of sequence features and the results of analyses within the context of the
sequence, and its six-frame translation. Artemis is written in Java, reads EMBL or
GENBANK format sequences and feature tables, and can work on sequences of any size
from a few kb to entire genomes of 5 Mb or more. Given an EMBL accession number,
Artemis can also read an entry directly from the EBI using CORBA.
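The six-frame translation mentioned above can be sketched as follows. This is
an independent illustration in Python rather than anything taken from Artemis
(which is written in Java); it assumes the standard genetic code, and the
function names translate and six_frame_translation are invented for this
example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Laying the six protein strings out alongside the nucleotide sequence is what
lets a viewer show features and stop codons in the context of every frame.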

------------------------------------------------------------------------------
##############################################################################
------------------------------------------------------------------------------
OTHER USEFUL SITES