HELP AND DOCUMENTATION
NAME
This field is required. The default name `query' is used if the user does
not supply a name for the query.
Paste your sequence
The DNA sequence can be pasted into the text area. If the sequence is in
one of the standard formats (EMBL, FASTA, GENBANK, etc.), set
`INPUT-FORMAT' to `FORMATTED'; if it is just the raw nucleotide sequence,
set `INPUT-FORMAT' to `NON-FORMATTED'.
UPLOAD-FILE
A file containing a nucleotide sequence, either in one of the standard
formats or as a non-formatted DNA sequence, can be uploaded using this
option. Users cannot upload a sequence file and paste a sequence at the
same time to have both analyzed together.
INPUT-FORMAT
The program recognizes the standard formats (EMBL, FASTA, GENBANK, etc.).
It uses the ReadSeq program developed by Dr. Don Gilbert, Biology Dept.,
Indiana University, to read the input sequence, and so accepts most
commonly used sequence formats. Users should select `FORMATTED' if the
input sequence is in any of these standard formats, or `NON-FORMATTED'
if the input is just the nucleotide sequence.
OUTPUT-FORMAT
The server gives the result in one of two formats, TABULAR or GRAPHICAL.
In TABULAR OUTPUT it gives the Power (the values of the Fourier spectrum)
at different frequencies for GENESCAN, or m1, m2 and m3 for ZCurve (the
LENGTHEN-SHUFFLE algorithm).
In GRAPHICAL OUTPUT it gives a plot of Power vs frequency for GENESCAN,
and a plot of m1 vs m3 for ZCurve. For FTG, the TABULAR OUTPUT is a table
of Power at different frequencies and the GRAPHICAL OUTPUT is a spectrum
of Power vs frequency.
FROM_&_TO
Users can select a particular region of the input sequence to analyze
through these fields.
Tips:
To analyze the complete sequence without knowing the number of the last
base, enter `1' in the `FROM' field and leave the `TO' field empty.
To analyze a region, for example 230-590, enter `230' in the `FROM' field
and `590' in the `TO' field.
Do not leave the `FROM' field empty.
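The FROM and TO rules above can be sketched in a few lines of Python. This
is only an illustration, assuming 1-based, inclusive positions (as the
230-590 example suggests); `select_region' is a hypothetical helper and
not part of the server.

```python
def select_region(sequence, start=1, end=None):
    """Return bases start..end (1-based, inclusive).

    end=None mimics an empty TO field and means "up to the last base".
    """
    if start < 1:
        raise ValueError("FROM must be 1 or greater")
    if end is None:
        end = len(sequence)
    if end < start:
        raise ValueError("TO must not be smaller than FROM")
    return sequence[start - 1:end]


seq = "GGATACACTTTAGAG"
print(select_region(seq, 1))      # FROM=1, TO empty: the whole sequence
print(select_region(seq, 3, 6))   # FROM=3, TO=6
```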
ALGORITHM
There are three algorithms, each with a windowed variant. The algorithms
are GENESCAN, LENGTHEN-SHUFFLE and FTG; their variants are GENESCAN-WINDOW,
LENGTHEN-SHUFFLE-WINDOW and FTG-WINDOW.
The WINDOW options are provided for convenience: a long DNA sequence that
may contain multiple protein-coding regions can be scanned with them. Once
candidate protein-coding regions are identified, their periodicity can be
confirmed using the original versions of the algorithms.
GENESCAN
This algorithm uses a Fourier technique based on a distinctive feature
of protein-coding regions, the 3-base periodicity. The signature of
this (and of any other) periodicity can be observed most directly through
Fourier analysis.
A sequence of N nucleotides may be formally viewed as a symbol string,
{xj, j=1,2,...,N}, where xj is one of the four symbols A, T, G and C,
and denotes the occurrence of that particular nucleotide in position j.
One can define a binary indicator function or projection operator Ua
which selects the elements of the sequence that are equal to the symbol
a, namely Ua(xj)=1 if xj is a and 0 otherwise. Applying the operators UA,
UT, UG and UC successively to a DNA sequence yields four binary sequences,
as illustrated below:
Sequence   GGATACACTTTAGAG
Apply UA   001010100001010
Apply UT   000100001110000
Apply UG   110000000000101
Apply UC   000001010000000
Figure 1.
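The projection in Figure 1 can be reproduced with a short Python sketch.
The function name `indicator_sequences' is chosen here for illustration
only.

```python
def indicator_sequences(sequence):
    """Map each base a in A, T, G, C to its binary sequence Ua(xj)."""
    seq = sequence.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ATGC"}


u = indicator_sequences("GGATACACTTTAGAG")
for base in "ATGC":
    # One row per operator, in the same order as Figure 1
    print("Apply U" + base, "".join(str(bit) for bit in u[base]))
```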
Thus, any DNA sequence can be converted to four binary sequences, which
can then be Fourier analyzed in the normal manner to examine correlations
between the symbols. The total Fourier spectrum of the DNA sequence
is the sum of these individual spectra, namely;
-----------(1)
where the discrete frequency f=k/N, with k=1,2,...,N/2. Sa(f) is the
partial spectrum corresponding to the symbol a=A, G, C, or T. The average
of the total spectrum, S^, can be calculated from the frequency of
occurrence, ρa, of each symbol (a=A, T, G, C) as;
-----------(2)
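Equations (1) and (2) appear only as numbered placeholders in this text
version. A form consistent with the definitions above is sketched below in
LaTeX. The 1/N^2 normalisation and the sign of the exponent are
assumptions (neither changes the position of a peak), and \bar{S} stands
for S^.

```latex
S(f) = \sum_{a \in \{A,T,G,C\}} S_a(f), \qquad
S_a(f) = \frac{1}{N^2} \left| \sum_{j=1}^{N} U_a(x_j)\, e^{2\pi i f j} \right|^2
\tag{1}

\bar{S} = \frac{1}{N} \left( 1 + \frac{1}{N} - \sum_{a} \rho_a^2 \right)
\tag{2}
```

With this normalisation, (2) agrees with the exact mean of S(f) over
k=1,...,N-1 up to terms of order 1/N^2, which follows from Parseval's
relation for the four indicator sequences.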
For protein-coding sequences from a variety of organisms, the Fourier
spectrum [equation (1)] reveals the characteristic periodicity of three as
a distinct peak at frequency f=1/3. No such peak above noise level is
apparent for non-protein-coding sequences such as rRNA, intergenic spacers
and introns, which have a flat Fourier spectrum devoid of any periodicity.
The signal-to-noise ratio of the peak at f=1/3, denoted P, is given as;
P = S(1/3) / S^ -----------(3)
P=4 is used as the discriminator between coding and non-coding sequences.
For a detailed description of the algorithm please refer to the original
paper (Tiwari et al., 1997).
The academic version of the program is available for distribution and
can be accessed at http://202.41.10.146/GS.html
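A minimal NumPy sketch of the GENESCAN measure described above: it builds
the four indicator sequences, sums their power spectra and divides the
value at f=1/3 by the average of the spectrum. It is not the distributed
program. The 1/N^2 scaling, the closed form used for the average, and the
names `total_spectrum' and `peak_to_noise' are assumptions made for
illustration.

```python
import numpy as np

BASES = "ATGC"


def total_spectrum(sequence):
    """Return (f, S) for f = k/N, k = 1..N//2, summed over the four bases."""
    seq = np.array(list(sequence.upper()))
    n = len(seq)
    total = np.zeros(n)
    for base in BASES:
        u = (seq == base).astype(float)                # indicator sequence Ua
        total += np.abs(np.fft.fft(u)) ** 2 / n ** 2   # assumed 1/N^2 scaling
    k = np.arange(1, n // 2 + 1)
    return k / n, total[k]


def peak_to_noise(sequence):
    """Ratio of S(1/3) to the average spectrum (assumed closed form)."""
    seq = sequence.upper()
    n = len(seq)
    if n < 3 or n % 3:
        raise ValueError("length must be a positive multiple of 3 "
                         "so that f=1/3 lies on the frequency grid")
    _, s = total_spectrum(seq)
    rho = np.array([seq.count(b) / n for b in BASES])  # base frequencies
    s_bar = (1.0 + 1.0 / n - np.sum(rho ** 2)) / n
    return s[n // 3 - 1] / s_bar                       # k = N/3 is at index N/3 - 1


periodic = "ATG" * 50        # perfectly 3-periodic toy sequence
flat = "ACGT" * 75           # period 4, no component at f=1/3
print(peak_to_noise(periodic), peak_to_noise(flat))
```

The toy inputs are artificial; on real sequences the ratio is compared
with the P=4 discriminator mentioned above.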
LENGTHEN-SHUFFLE
Because the window used in the gene-finding process is short (usually
100bp or so), the Fourier measure applied directly to it has limited
success. For a longer sequence, >1024bp, it is easier to detect the
periodicity with the FFT algorithm. This algorithm addresses the problem
by lengthening the short sequence before the transform is applied.
FORMAT OF Z CURVES:
Consider a DNA sequence with N bases read from the 5'-end to the 3'-end.
Beginning from the first base, inspect the sequence one base at a time. Let
the number of steps be denoted by n, i.e. n=1,2,...,N. In the nth step,
count the cumulative numbers of the bases A, C, G and T, respectively,
occurring in the subsequence from the first to the nth base in the DNA
sequence inspected. Denote the four non-negative integers by An, Cn, Gn,
and Tn, respectively. The Z curve consists of a series of nodes Pn
(n=1,2,...,N) whose coordinates are denoted by xn, yn, and zn. It was
shown that
xn = 2(An + Gn) - n,
yn = 2(An + Cn) - n,        n=0,1,2,...,N -----------(4)
zn = 2(An + Tn) - n,
where A0=C0=G0=T0=0 and thus x0=y0=z0=0. The Z curve of the DNA sequence
inspected is defined as the line connecting the nodes P0 (i.e. the
origin), P1, P2,...,PN one by one. We then define;
Δxn = xn - x(n-1),
Δyn = yn - y(n-1),        n=1,2,...,N ------------(5)
Δzn = zn - z(n-1),
where Δxn, Δyn and Δzn can only take the values 1 or -1. Δxn is equal
to 1 when the nth base is A or G (purine), or -1 when the nth base is C or
T (pyrimidine); Δyn is equal to 1 when the nth base is A or C (amino-type),
or -1 when the nth base is G or T (keto-type); Δzn is equal to 1 when the
nth base is A or T (weak hydrogen bond), or -1 when the nth base is G or C
(strong hydrogen bond). Therefore, a DNA sequence can be decomposed into
three series of digital signals, consisting of 1 or -1, each of which has
a clear biological meaning. The first series, Δxn, represents the
distribution of purine/pyrimidine bases along the DNA sequence. The second
series, Δyn, represents the distribution of amino/keto bases along the
sequence. Similarly, the third series, Δzn, represents the distribution of
bases with weak/strong hydrogen bonds along the sequence.
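Equations (4) and (5) can be checked with a small Python sketch. The names
`STEP', `z_curve_steps' and `z_curve_nodes' are illustrative; the sketch
assumes the input contains only A, C, G and T (any other symbol raises a
KeyError).

```python
from itertools import accumulate

# (dx, dy, dz) contributed by each base, following the rules in the text:
# dx: +1 purine (A, G); dy: +1 amino (A, C); dz: +1 weak H-bond (A, T)
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def z_curve_steps(sequence):
    """Return the three +1/-1 series (dx, dy, dz), one value per base."""
    steps = [STEP[base] for base in sequence.upper()]
    return tuple([s[axis] for s in steps] for axis in range(3))


def z_curve_nodes(sequence):
    """Return the nodes P0..PN as (x, y, z) tuples, with P0 at the origin."""
    dx, dy, dz = z_curve_steps(sequence)
    xs, ys, zs = ([0] + list(accumulate(d)) for d in (dx, dy, dz))
    return list(zip(xs, ys, zs))


nodes = z_curve_nodes("GGATACACTTTAGAG")
print(nodes[:3], nodes[-1])
```

Building the nodes as running sums of the steps is equivalent to
equation (4), because each base adds 1 to the counts inside the bracket
and every step subtracts 1 through the -n term.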
A LENGTHEN-SHUFFLE FOURIER TRANSFORM:
The relatively short DNA sequence of length D (<150bp) is first lengthened
by repeating the sequence K times, where K=1200/D. A bogus periodicity of
D will then be observed in the power spectrum of the FFT. To eliminate this
bogus periodicity while keeping the periodicity of 3 unchanged, the
lengthened sequence is shuffled M times with three consecutive bases as a
unit. A typical value of M used here is 10,000.
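One possible reading of the lengthen and shuffle steps is sketched below
in Python. Several details are assumptions, since the text does not fix
them: K is taken as the integer part of 1200/D, "shuffled M times" is
interpreted as M random swaps of 3-base units, and the helper names are
hypothetical.

```python
import random


def lengthen(sequence, target=1200):
    """Repeat the sequence K times, with K = target // D (at least 1)."""
    d = len(sequence)
    if d == 0:
        raise ValueError("empty sequence")
    return sequence * max(1, target // d)


def shuffle_triplets(sequence, swaps=10000, seed=None):
    """Swap randomly chosen 3-base units `swaps` times.

    A trailing remainder of 1 or 2 bases is left in place.
    """
    rng = random.Random(seed)
    cut = len(sequence) - len(sequence) % 3
    units = [sequence[i:i + 3] for i in range(0, cut, 3)]
    if len(units) < 2:
        return sequence
    for _ in range(swaps):
        i, j = rng.randrange(len(units)), rng.randrange(len(units))
        units[i], units[j] = units[j], units[i]
    return "".join(units) + sequence[cut:]


short = "ATGGCC" * 20                      # toy input with D = 120
extended = shuffle_triplets(lengthen(short), seed=1)
print(len(extended))
```

Because whole triplets are moved, every base keeps its position within
its 3-base unit, which is why the periodicity of 3 survives the shuffle.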
As mentioned above, based on the format of the Z curve, any DNA sequence
can be transformed into three series of digital signals, Δxn, Δyn and Δzn,
to which the FFT algorithm is applied. The power spectrum for each series
is calculated as follows:
-------------(6)
where PC(f) is the power spectrum associated with ΔCn, which stands for
any of Δxn, Δyn and Δzn. Three values are then obtained: m1=Px(N/3),
m2=Py(N/3) and m3=Pz(N/3). The Fisher linear discriminant equation is
applied to these values to make the coding/non-coding decision.
For a detailed description of the algorithm please refer to the original
paper (Yan et al., 1998).
The program is available on request from C.-T. Zhang (ctzhang@tju.edu.cn).
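The three values m1, m2 and m3 can be illustrated with NumPy as follows.
This is a sketch under stated assumptions: the power spectrum is taken as
the unnormalised squared magnitude of the FFT (the scaling of equation (6)
is not given in this text), the sequence length is a multiple of 3, and
`z_powers_at_one_third' is a hypothetical name. The Fisher discriminant
coefficients are not given here, so the decision step is left out.

```python
import numpy as np

# (dx, dy, dz) per base: purine/pyrimidine, amino/keto, weak/strong H-bond
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def z_powers_at_one_third(sequence):
    """Return (m1, m2, m3): power of the dx, dy, dz series at index N/3."""
    seq = sequence.upper()
    n = len(seq)
    if n < 3 or n % 3:
        raise ValueError("length must be a positive multiple of 3")
    steps = np.array([STEP[b] for b in seq], dtype=float)   # shape (N, 3)
    power = np.abs(np.fft.fft(steps, axis=0)) ** 2          # assumed scaling
    m1, m2, m3 = power[n // 3]
    return float(m1), float(m2), float(m3)


print(z_powers_at_one_third("ATG" * 100))    # strongly 3-periodic toy input
print(z_powers_at_one_third("ACGT" * 75))    # period 4, nothing at N/3
```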
FTG
The FTG algorithm combines the properties of both the GENESCAN and
LENGTHEN-SHUFFLE algorithms to improve the accuracy of gene prediction.
The drawback of the GENESCAN algorithm is that the spectrum of a short DNA
sequence, say <150bp (a commonly used window size), is not clear, so the
periodicity becomes incoherent. Though the LENGTHEN-SHUFFLE algorithm
tries to address this problem, it has the drawback of not indicating the
type of periodicity that a DNA sequence has. FTG tries to overcome these
limitations by combining the essential parts of the two algorithms,
GENESCAN and LENGTHEN-SHUFFLE, so as to amplify the three-base periodicity
of the DNA sequence.
FTG takes a short nucleotide sequence of length D (where D<=1200) and
amplifies it by repeating it K times (where K=(1200/D)+1). The bogus
periodicity of D is removed by shuffling the sequence M times (where
M=10000). The extended DNA sequence is then considered as a symbol string,
{xj, j=1,2,...,N}, where xj is one of the four symbols A, C, G and T, and
denotes the occurrence of that particular nucleotide in position j. Now
define a binary indicator function or projection operator Ua which selects
the elements of the sequence that are equal to the symbol a, namely
Ua(xj)=1 if xj is a and 0 otherwise. Applying the operators UA, UT, UG and
UC successively to a DNA sequence yields four binary sequences, as
illustrated in Figure 1.
The four binary sequences obtained can then be Fourier analyzed in the
normal manner. The Fourier spectrum of the DNA sequence is calculated using
equation 1, while the average spectrum is calculated using equation 2.
The peak at f=1/3 is then located and its signal-to-noise ratio is
obtained using equation 3.
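The FTG steps above can be strung together in one short Python sketch. It
is an illustration, not the server's code. The assumptions are integer
division in K=(1200/D)+1, "shuffling M times" read as M random swaps of
3-base units, trimming the extended sequence to a multiple of 3, and the
1/N^2 scaling and closed-form average for the spectrum. `ftg_peak_to_noise'
is a hypothetical name.

```python
import random

import numpy as np


def ftg_peak_to_noise(sequence, target=1200, swaps=10000, seed=None):
    """Lengthen, shuffle by triplets, then return S(1/3) / average spectrum."""
    seq = sequence.upper()
    d = len(seq)
    if d < 3 or d > target:
        raise ValueError("expected a sequence of 3 to 1200 bases")
    extended = seq * (target // d + 1)              # K = (1200/D) + 1
    cut = len(extended) - len(extended) % 3         # drop a 1-2 base tail
    units = [extended[i:i + 3] for i in range(0, cut, 3)]
    rng = random.Random(seed)
    for _ in range(swaps):                          # M random triplet swaps
        i, j = rng.randrange(len(units)), rng.randrange(len(units))
        units[i], units[j] = units[j], units[i]
    shuffled = np.array(list("".join(units)))
    n = len(shuffled)
    peak, sum_rho_sq = 0.0, 0.0
    for base in "ATGC":
        u = (shuffled == base).astype(float)        # indicator sequence Ua
        peak += np.abs(np.fft.fft(u)[n // 3]) ** 2 / n ** 2
        sum_rho_sq += u.mean() ** 2
    s_bar = (1.0 + 1.0 / n - sum_rho_sq) / n        # assumed closed form
    return peak / s_bar


p = ftg_peak_to_noise("ATGGCC" * 15, seed=7)        # toy input, D = 90
print(p)
```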
GENESCAN-WINDOW, LENGTHEN-SHUFFLE-WINDOW and FTG-WINDOW
The user can analyze long DNA sequences using the WINDOW option of each
algorithm. The user has to specify a Step-size and a Window length for
these three algorithms. The program takes overlapping Windows separated by
the Step-size, analyzes each Window and outputs the values for that Window.
The advantage of this option is that the whole input sequence is analyzed
continuously, window by window, saving precious time.
Step-Size_&_Window
Consider a DNA sequence 10000 bp long. Suppose I want to analyze it with
overlapping windows of 150 bp, taking a new window every 5 bp. Then I
would give a Step-size of 5 and a Window of 150 in the submission form.
1--ACGTGCTAGCTGATGCTAGTGC---100--CATCGACTAGCATCAGCTACAGCTACGATCAGCACTGATC----10000
|----------------------|            1st Window (Window-length)
   |----------------------|         2nd Window
      |----------------------|      3rd Window
|--|                                Step-size
It is useful to remember that a smaller Step-size samples the sequence
more densely, so the results are more reliable.
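The window positions implied by a given Step-size and Window can be
listed with a short Python sketch. It assumes 1-based start positions and
that a final window which would run past the end of the sequence is
dropped; the server may treat the tail differently. Both function names
are hypothetical.

```python
def window_starts(sequence_length, window=150, step=5):
    """Return the 1-based start position of every full-length window."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    return list(range(1, sequence_length - window + 2, step))


def iter_windows(sequence, window, step):
    """Yield (start, subsequence) pairs for each window."""
    for start in window_starts(len(sequence), window, step):
        yield start, sequence[start - 1:start - 1 + window]


starts = window_starts(10000, window=150, step=5)
print(len(starts), starts[:3], starts[-1])
```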
TABULAR_OUTPUT
GENESCAN
The tabular output option for GENESCAN outputs the Power at different
frequencies. The average of the spectrum is also output, along with the
Peak-to-noise ratio at f=1/3. Statistics such as nucleotide composition
and dinucleotide content are also computed by the program.
LENGTHEN-SHUFFLE
The LENGTHEN-SHUFFLE algorithm outputs m1, m2 and m3 as a table, along
with the sequence composition analysis report.
FTG
The tabular output option for FTG outputs the Power at different
frequencies. The average of the spectrum is also output, along with the
Peak-to-noise ratio at f=1/3. Nucleotide composition and dinucleotide
content are also computed by the program.
GENESCAN-WINDOW
For each window analyzed, the GENESCAN-WINDOW option outputs the Peak
or Power at f=1/3. The start-point in the table refers to the position
of the first base of the window. The usual composition analysis of the
DNA sequence is also given.
LENGTHEN-SHUFFLE-WINDOW
For each window of the DNA sequence m1, m2, and m3 are computed and
output by the program for this option. Position in this table refers
to the first base position of the window in the input DNA sequence.
FTG-WINDOW
The position in the table refers to the first base of the window and
the corresponding Power refers to the Peak at f=1/3 for that window.
Normal nucleotide compositional analysis is also given.
GRAPHICAL_OUTPUT
GENESCAN
A plot of Power [S(f)] vs frequency (f) is output with this option.
The periodicity of three is visible as a peak at f=0.33 (1/3).
The average of the spectrum and the Peak at f=1/3 are given.
A periodicity of 10 is visible as a peak at f=0.10 (1/10). Other
periodicities can be observed in the same way with this plot.
LENGTHEN-SHUFFLE
Plots of m3 vs m1, m3 vs m2, m2 vs m1, are output with this option.
FTG
This plot is quite similar to the plot obtained from the GENESCAN option,
except that the FTG option works best for short nucleotide sequences.
GENESCAN-WINDOW
The Window version of the GENESCAN algorithm gives a plot of the Peak at
f=1/3 for each window vs position along the sequence. The red horizontal
line is the default threshold for the coding/non-coding decision. Where
the curve rises above this threshold, the region is considered coding.
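A sketch of how the threshold rule above might be applied to windowed
output, in Python. Two points are assumptions: the threshold value of 4
(the P=4 discriminator given for GENESCAN; the value of the red line is
not stated here) and the merging of overlapping windows into regions,
which is this sketch's own choice. `coding_regions' is a hypothetical
name, and the start positions are expected in ascending order.

```python
def coding_regions(starts, powers, window, threshold=4.0):
    """Merge windows whose value exceeds the threshold into base ranges.

    Returns a list of (first_base, last_base) tuples, 1-based inclusive.
    """
    regions = []
    for start, power in zip(starts, powers):
        if power <= threshold:
            continue
        end = start + window - 1
        if regions and start <= regions[-1][1]:
            # Window overlaps the previous region: extend it
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


starts = [1, 6, 11, 16, 31]
powers = [1.0, 5.0, 6.0, 2.0, 7.0]      # made-up values for illustration
print(coding_regions(starts, powers, window=10))
```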
LENGTHEN-SHUFFLE-WINDOW
Three graphs of sequence length vs m1, m2, and m3, respectively, are
given. In addition, a plot of m3 vs m1 for the different windows is given.
FTG-WINDOW
This graph is quite similar to the GENESCAN-WINDOW graph, except that
it gives a plot of Power at f=1/3 for overlapping windows of size less
than 1200 (ideally less than 150).