GENE IDENTIFICATION UTILITY SITES


HMMgene 
           HMMgene is a program for prediction of genes in anonymous DNA. The 
methods used are described in the paper: A. Krogh: Two methods for improving 
performance of an HMM and their application for gene finding. In Proc. of Fifth 
Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. 
et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186. The program predicts whole 
genes, so the predicted exons always splice correctly. It can predict several whole
or partial genes in one sequence, so it can be used on whole cosmids or even longer
sequences. HMMgene can also be used to predict splice sites and start/stop codons. 
If some features of a sequence are known, such as hits to ESTs, proteins, or repeat
elements, these regions can be locked as coding or non-coding and then the program 
will find the best gene structure under these constraints. The program is based on
a hidden Markov model, which is a probabilistic model of the gene structure. This 
means that all predictions have associated probabilities that reflect how confident
the program is in them. Apart from reporting the best prediction, HMMgene can also
report the N best gene predictions for a sequence. This is useful if there are 
several equally likely gene structures and may even indicate alternative splicing. 
HMMgene takes an input file with one or more DNA sequences in FASTA format. It also
has a few options for changing the default behavior of the program. The output is a
prediction of partial or complete genes in the sequences. The output is in a
standardized format that is easily read by other programs, which specifies the location
of all the predicted genes and their coding regions and scores for whole genes as
well as exon scores. 
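
As a rough illustration of how a hidden Markov model yields a single best labelling
of a sequence, the sketch below runs the Viterbi algorithm on a toy two-state model
(coding vs. non-coding). The states, all probabilities, and the function name are
invented for this example; HMMgene's real model is far richer and is not reproduced
here.

```python
import math

# Toy two-state HMM. Every number below is made up for illustration only.
STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich state
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # GC-rich state
}


def viterbi(seq):
    """Return (best state path, its log probability) for a non-empty DNA string."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]


if __name__ == "__main__":
    labels, logp = viterbi("ATATATATGCGCGCGCGC")
    print(labels, round(logp, 3))
```

The log probability returned with the path is the kind of score that lets such a
program rank alternative predictions against each other.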
FramePlot 
           FramePlot is a web-based tool for predicting protein-coding regions in 
bacterial DNA with a high G+C content, such as Streptomyces. The graphical output 
provides for easy distinction of protein-coding regions from non-coding regions. 
The plot is a clickable map. Clicking on an ORF provides not only the nucleotide 
sequence but also its deduced amino acid sequence. These sequences can then be 
compared to the NCBI sequence database over the Internet. The program is freely 
available for academic purposes at http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. 
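
The description above does not say how FramePlot scores the reading frames. Frame
analysis of high-G+C genomes is commonly based on the G+C content at the third codon
position, which tends to be very high in the true reading frame. The sketch below
computes that statistic for the three forward frames. It is an assumed illustration
with a hypothetical function name, not FramePlot's own code.

```python
def third_position_gc(seq, frame):
    """Fraction of G or C at third codon positions of `seq` read in frame 0, 1 or 2."""
    seq = seq.upper()
    # Third positions of complete codons are at frame + 2, frame + 5, ...
    thirds = seq[frame + 2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


if __name__ == "__main__":
    dna = "ATGATCATG"
    for frame in range(3):
        print(frame, third_position_gc(dna, frame))
```

In practice the value would be computed in a sliding window along the sequence and
plotted for each frame, so that coding regions stand out as stretches where one
frame is clearly higher than the others.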
tRNAScan 
          tRNA detection in genome sequences. tRNAscan-SE detects ~99% of
eukaryotic nuclear or prokaryotic tRNA genes, with a false positive rate of less
than one per 15 gigabases, and with a search speed of about 30 kb/second. It was
implemented for large-scale human genome sequence analysis, but is applicable to
other DNAs as well.
NetGene
           Artificial neural networks have been combined with a rule-based system to
predict intron splice sites in the dicot plant Arabidopsis thaliana. A two-step
prediction scheme, where a global prediction of the coding potential regulates a
cutoff level for a local prediction of splice sites, is refined by rules based on
splice site confidence values, prediction scores, coding context, and distances
between potential splice sites. In this approach, the predictions of splice sites
mutually affect each other in a non-local manner. The combined approach drastically
reduces the large number of false positive splice sites that normally haunt splice
site prediction. An analysis of the errors made by the networks in the first step of
the method revealed a previously unknown feature, a frequent T-tract prolongation
containing cryptic acceptor sites in the 5' end of exons. The method presented here
has been compared to three other approaches, GeneFinder, GeneMark, and Grail.
Overall the method presented here is an order of magnitude better. We show that the
new method is able to find a donor site in the coding sequence for the jellyfish
Green Fluorescent Protein, exactly at the position that was experimentally observed
in A. thaliana transformants. Predictions for alternatively spliced genes are also
presented, together with examples of genes from other dicots, monocots, and algae.
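
The two-step idea described above, in which a global coding signal sets the cutoff
applied to a local splice site score, can be sketched roughly as follows. This is
not the NetGene implementation: the function name, the cutoff values, the flank
size, and the rule "lower the cutoff where coding potential drops" are all
assumptions made for illustration.

```python
def call_donor_sites(local_scores, coding_potential,
                     high_cutoff=0.8, low_cutoff=0.4, min_drop=0.3, flank=2):
    """Return positions accepted as donor sites under a context-dependent cutoff.

    local_scores[i]     : hypothetical local splice site score at position i (0..1)
    coding_potential[i] : hypothetical global coding score at position i (0..1)
    """
    calls = []
    for i, score in enumerate(local_scores):
        left = coding_potential[max(0, i - flank):i]
        right = coding_potential[i + 1:i + 1 + flank]
        # A donor site marks an exon-to-intron transition, so coding potential
        # is expected to fall from the left flank to the right flank.
        if left and right:
            drop = sum(left) / len(left) - sum(right) / len(right)
        else:
            drop = 0.0
        # Step 1 (global) regulates the cutoff used in step 2 (local).
        cutoff = low_cutoff if drop >= min_drop else high_cutoff
        if score >= cutoff:
            calls.append(i)
    return calls


if __name__ == "__main__":
    local = [0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.1, 0.9]
    coding = [1, 1, 1, 0, 0, 0, 0, 0]
    print(call_donor_sites(local, coding))
```

In this toy input the two positions with a local score of 0.5 are treated
differently: only the one sitting on a drop in coding potential passes, which is
how a context-dependent cutoff suppresses false positives.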
The method has been made available through electronic mail
(NetPlantGene@cbs.dtu.dk) or the WWW at http://www.cbs.dtu.dk/NetPlantGene.html
ORFFinder
           The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool
that finds all open reading frames of a selectable minimum size in a user's
sequence or in a sequence already in the database. This tool identifies all open
reading frames using the standard or alternative genetic codes. The deduced amino
acid sequence can be saved in various formats and searched against the sequence
database using the WWW BLAST server. The ORF Finder should be helpful in preparing
complete and accurate sequence submissions. It is also packaged with the Sequin
sequence submission software.
BCM GeneFinder
          Gene Finder is a web-based gene-prediction tool available freely
from the Baylor College of Medicine. BCM's tool allows one to interact directly,
provided the submissions are less than 7 kb, or to send queries in via e-mail.
There are a number of algorithms available separately at BCM, which collectively
are called FGENEH and perform 'gene prediction' by exon assembly. BCM also offers
two stand-alone UNIX programs for use with human and bacterial data.
URL: http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
Grail
          GRAIL is a suite of tools designed to provide analysis and putative
annotation of DNA sequences both interactively and through the use of automated
computation. The capabilities of GRAIL are available by several methods. These
include an e-mail server at Oak Ridge National Laboratory (ORNL), which processes
DNA sequence(s) contained in e-mail messages, a web-based utility, and an
interactive graphical X-based client-server system called Xgrail, which supports a
wide range of analysis tools, including gene modeling. There are several versions
of GRAIL. GRAIL 1 uses a neural network described in PNAS 88, 11261-11265, which
recognizes coding potential within a fixed-size (100 base) window. It evaluates
coding potential without looking for additional features (such as splice
junctions). GRAIL 1a is an updated version of GRAIL 1. It uses a fixed-length
window to locate the potential coding regions and then evaluates a number of
discrete candidates of different lengths around each potential coding region, using
information from the two 60-base regions adjacent to that coding region, to find
the "best" boundaries for that coding region. GRAIL 2 uses variable-length windows
tailored to each potential exon candidate, defined as an open reading frame bounded
by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme
facilitates the use of more genomic context information (splice junctions,
translation starts, non-coding scores of 60-base regions on either side of a
putative exon) in the exon recognition process. GRAIL 2 is therefore not
appropriate for sequences without genomic context (when the regions adjacent to an
exon are not present). The following organisms are supported: Human, Mouse,
Arabidopsis, Drosophila, and E. coli.
URL: X client: ftp://128.219.9.76/pub/xgrail/sun/ver1.2
     web: http://grail.lsd.ornl.gov/Grail-1.3/
GeneMark
          GeneMark, which uses a Hidden Markov Model (HMM) approach, was
originally written for use on bacterial genomes, which of course have no introns.
After a number of published successes, the authors have modified a version to work
with eukaryotes, although it is primarily for long exons and ESTs. Training sets
either exist or are being developed for these eukaryotes: Homo sapiens,
Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, and
Chlamydomonas reinhardtii. GeneMark is web-based, with both client submission forms
and e-mail servers, but the stand-alone version can be obtained by academic and
not-for-profit institutions by signing a license.
URL: client version: http://genemark.biology.gatech.edu/GeneMark/index.html
     stand-alone: lukashin@amber.biology.gatech.edu
Genie
         Genie uses a statistical model of genes in DNA. A Generalized
Hidden Markov Model (GHMM) provides the framework for describing the grammar of a
legal parse of a DNA sequence. Probabilities are assigned to transitions between
states in the GHMM and to the generation of each nucleotide base given a particular
state. Machine learning techniques are applied to optimize these probabilities
using a standardized gene data set.
URL: http://www.fruitfly.org/seq_tools/genie.html
GENSCAN
         Genscan is another client-server arrangement which offers complete
gene prediction, in that it uses a number of different algorithms to predict
introns, exons (leading, internal, and terminal), donor and acceptor splice sites,
and polyadenylation sites. The highest-scoring arrangement of these categories is
then used to predict actual gene composition. Genscan supports polygenic genomic
DNA input, and scans both strands for coding regions. Genscan has an additional
feature that draws a GIF representation of the resultant prediction, showing all
putative exons in their respective positions on either strand, whether they are
leading, internal or terminal, and a simplified scoring scheme. Genscan was
developed for use on vertebrate sequences, but has been used with some success on
maize, Arabidopsis and Drosophila. Whether other training/background sets will be
implemented is unclear. I am in the process of contacting the author in an attempt
to obtain the stand-alone source code, but as it stands, it is purely web-based,
and can support submissions up to 200 kb. Larger sequences can be submitted through
their e-mail server.
URL: http://gnomic.stanford.edu/~chris/GENSCANW.html
FGENES
        Pattern-based human gene structure prediction (multiple genes, both
chains). The algorithm is based on pattern recognition of different types of exons,
promoters, and polyA signals; dynamic programming then finds the optimal
combination of these, constructing a set of gene models along a given sequence.
FGENES-M
        Pattern-based prediction of multiple variants of human gene structure.
The algorithm outputs several suboptimal variants of the predicted gene structure.
In the current WWW server variant, up to 15 structures of a gene or of multiple
genes are provided. It is similar to FGENES and is based on pattern recognition of
different types of exons, promoters, and polyA signals; dynamic programming then
finds the optimal combination of these, constructing a set of gene models along a
given sequence.
Splice
        A neural-network-based program to find possible 5' and 3' splice
sites.
Procrustes
          Given a genomic sequence and a set of candidate exons, the spliced
alignment algorithm explores all possible exon assemblies and finds a chain of
exons with the best fit to a related target protein. The set of candidate exons is
constructed by selection of all blocks between candidate acceptor and donor sites
(i.e. between AG dinucleotide at intron-exon boundary and GU dinucleotide at
exon-intron boundary) and further filtration of this set. To avoid losing true
exons, the filtration procedure is made very gentle, and the resulting set of
blocks may contain a large number of false exons. Instead of trying to identify the
correct exons by statistical methods, PROCRUSTES considers all possible chains of
candidate exons and finds a chain with the maximum global similarity to the target
protein. The number of exon assemblies is huge; however, the spliced alignment
algorithm is fast enough to process large genomic fragments (up to 180,000
nucleotides) containing multi-exon genes (more than 30 exons). After the
highest-scoring exon assembly is found, the hope is that it represents the correct
exon-intron structure. This is almost guaranteed if a protein sufficiently similar
to the one encoded in the analyzed fragment is available (99% correlation between
predicted and actual genes with mammalian targets). Tests are reported in Gelfand,
Mironov and Pevzner, 1996. At the postprocessing step PROCRUSTES assigns the
guaranteed level of correlation between the predicted and the actual protein. In
the Las-Vegas version PROCRUSTES generates a set of suboptimal spliced alignments
and uses it to assess the confidence level for the complete predicted gene or
individual exons. In many cases there is no reason to believe that the analyzed
genomic fragment contains a complete gene. PROCRUSTES has a local spliced alignment
mode that should be used in such situations.
GenePrimer
        This software implements an algorithm for experimental gene
identification by multiple PCR amplifications described in Sze S.-H., Roytberg
M.A., Gelfand M.S., Mironov A.A., Astakhova T.V. and Pevzner P.A. (1998) Algorithms
and software for support of gene identification experiments. Bioinformatics, 14,
14-19. Since current algorithms for gene recognition make mistakes, biologists have
to perform experimental gene identification to eliminate errors in predictions.
Conventional approaches amount to `guessing' PCR primers on top of unreliable gene
predictions and frequently lead to wasted experimental effort. An algorithm which
eliminates the need for gene verification in some cases is the Las Vegas algorithm
for gene recognition. The algorithm locates a set of PCR primers which cover the
exons relatively uniformly and can be used for RT-PCR and further sequencing of
(unknown) mRNA.
MIRAGE
         MIRAGE (Molecular Informatics Resource for the Analysis of Gene
Expression) is a web site dedicated to methodologies, tools, and technologies
relating to information in the study of gene expression. MIRAGE is an experimental
web resource of the Institute for Transcriptional Informatics (IFTI), Pittsburgh PA
15230-2556 USA.
TransTerm
           TransTerm is a database of sequence contexts around the stop and start
codons of many species found in GenBank. TransTerm also contains codon usage data
for these same species and summary statistics for the sequences analysed.
PLACE
           PLACE is a database of motifs found in plant cis-acting regulatory DNA
elements, all from previously published reports. It covers vascular plants only. In
addition to the motifs originally reported, their variations in other genes or in
other plant species reported later are also compiled. The PLACE database also
contains a brief description of each motif and relevant literature with PubMed ID
numbers. DDBJ/EMBL/GenBank nucleotide sequence database accession numbers will also
be included.
NNPP
           The function of the eukaryotic promoter as an initiator for transcription
is one of the most complex processes in molecular biology. It has been shown that
multiple functional sites in the primary DNA are involved in the polymerase binding
process. These elements, such as the TATA-box, GC-box, CAAT-box and the
transcription start site, are known to function as binding sites for transcription
factors and other proteins that are involved in the initiation process. These
promoter elements are present in various combinations separated by various
distances in sequence. A neural network is trained to recognize promoter elements
until it reaches a local minimum. Then the pruning procedure deletes those weights
in the network that add the lowest predictive value to the overall prediction.
After pruning, the neural network is retrained until it is stuck again in a
minimum. This procedure is repeated until a defined error level is reached.
Eventually, studying the remaining weights of the pruned neural network gives clues
about the importance of specific positions in the promoter element.
FastM/ModelInspector
         A program for the generation of models for regulatory regions in
DNA sequences. FastM uses the TRANSFAC 4.0 matrices.
MatInd and MatInspector
         Search for potential transcription factor binding sites in your
own sequences with the matrix search program MatInspector using the TRANSFAC 4.0
matrices.
UTRscan
         Understanding the basic mechanisms of cell growth, differentiation
and response to environmental stimuli, i.e. the program controlling the temporal
and spatial order of molecular events, is becoming a real challenge in molecular
biology. Indeed, although most of the regulatory elements are thought to be
embedded in the non-coding part of the genomes, nucleotide databases are biased by
the presence of expressed sequences mostly corresponding to the protein-coding
portion of the genes. Among non-coding regions, the 5' and 3' untranslated regions
(5'-UTR and 3'-UTR) of eukaryotic mRNAs have often been experimentally demonstrated
to contain sequence elements crucial for many aspects of gene regulation and
expression. The program UTRscan looks for UTR functional elements by searching
through user-submitted sequence data for the patterns defined in the UTRsite
collection. UTRsite is a collection of functional sequence patterns located in 5'
or 3' UTR sequences.
Artemis
          Artemis is a DNA sequence viewer and annotation tool that allows
visualisation of sequence features and the results of analyses within the context
of the sequence and its six-frame translation. Artemis is written in Java, reads
EMBL or GENBANK format sequences and feature tables, and can work on sequences of
any size from a few kb to entire genomes of 5 Mb or more. Given an EMBL accession
number, Artemis can also read an entry directly from the EBI using CORBA.
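
For readers unfamiliar with the term, a six-frame translation reads the sequence in
three frames on each strand. The sketch below shows the idea using the standard
genetic code. It is an illustrative Python sketch with hypothetical function names,
not Artemis code (Artemis itself is written in Java).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Map frame labels +1..+3 and -1..-3 to their protein translations."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons appear as "*", so open reading frames show up as runs between
asterisks in any of the six frames.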
------------------------------------------------------------------------------
##############################################################################
------------------------------------------------------------------------------
OTHER USEFUL SITES