SPECTRAL REPEAT FINDER - ALGORITHM V 1.0

Consider a DNA sequence length n,

α1 α2 α3...... αn

where αi is the base (A, T, G or C) at position i. The structure of symbolic correlations within the sequence is most simply explored through correlation functions of the type

Cαβ(r)=<Uα(i)Uβ(i+r)>;

α, β ε {A,T,G,C}........................................(1)

where Uα(i) = 1 if αi, the symbol at position i, is α, and 0 otherwise. Uα(i) is termed the 'indicator function' or 'projection operator' (Voss, 1992, Tiwari et al., 1997) and <..> denoted an average over the string (note that Cαβ(r) is not normalized). In the past several years, the self-correlation,

C(r) = ΣαCαα(r).......................................(2)

has been most commonly studied. If there is a nontrivial N-base correlation such as is caused by a repetitive unit, the correlation function has a near recurrence at this period, namely, C(r) ˜ C(N+r) (Herzel et al., 1999). It is computationally simpler to consider the Fourier Transform of C(r), which is an entirely equivalent method of examining correlations. One commonly used definition for the power spectrum (Silverman and Linsker, 1986, Li et al., 1994) is

S(f) = Σα1/n2|Σnj=1Uα(j)e2πifj|2.....................(3)

A peak at frequency f = 1/N in the spectrum indicates that the correlation function has N-base periodicity; see Li (1997) for a detailed discussion of the Fourier spectrum an its connection to the correlation function. It should be pointed out that while a N-base correlation does not necessarily imply the presence of a repeat unit of length N, the converse is true: the existence of repeats of length N will give rise to a peak at frequency f = 1/N. (In this context, recall that the 3-base periodicity in coding sequences gives a sharp peak in the power spectrum at frequency 1/3 (Fickett, 182, Tiwari et al., 1997) though, of course, coding regions are not composed of 3-base repeats).

Our study of the power spectrum of a number of sequences suggests a simple method to discover hidden periodicities. This applies in both instances, when the repeatsare tandem or dispersed, as well as when they are imperfect or 'noisy'. For a sequence string of length n, periodicities of the order of n/2 can in principle show up in the power spectrum which is computed at frequencies f = k/2n, k = 1,2,.....n. We first identify those freuqencies fi such that

S(fi)/Š > T.....................................................(4)

where the spectral average,
Š = 1/n {1 + 1/n - Σαρ2α}...........................(5)

2α is the frequency of nucleotide α in the sequence).

The significance of any spectral line should be assessed with respect to the spectral average, namely through its signal-to-noise (S/N) ratio. Peaks with S/N greater than 2, typically begin to be significant, but from a series of studies, we find that since random fluctuations can frequently drive a spectral line to this level, a more prudent criterion would require a higher threshold (T); this threshold is similar to the discriminator used in the gene finding method (Tiwari et al., 1997), and is quite insensitive to window and/or sequence length. We find from a number of simulations that a threshold of S/N =4 is useful (Tiwari et al., 1997), Ramachandran and Ramaswamy, 1999, aggarwal and Ramaswamy, 2002) and this makes it possible to locate repeats, which comprise as little as 2% of the total sequence.

Having found candidate repeat lengths Ni = 1/fi, the signal-to-noise ratio of the spectral peak at frequency fi is computed in a sliding window along the sequnece. As for coding regions (Tiwari et al., 1997) it is simple to identify the regions containing the repeats as those where the S/N exceeds the threshold. Since the length of the repeat (1/fi) and the region containing the repeats are both completely specified, the actual repeats can be easily identified by testing possible Ni-mers in the region for the occurrence of multiple copies. This can be done by exact enumeration or even by a heuristic local alignment method.

We summarize the SRF algorithm in the following steps:

Step 1: Input a DNA sequence of length n.

Step 2: Compute the power spectrum, S( f ), and the spectral average, Š, of the entire sequence.

Step 3: Identify all peaks with S(fi)/ Š > T (the threshold, here chosen to be 4). For each frequency fi so identified, there are potential repeats of length Ni = 1/fi.

Step 4: For each of these, compute the Pm( j ) = S(fi)/ Š in a sliding window of length m centered on position j in the sequence. Regions containing the repeat of length Ni can be identified directly as those where Pm( j ) exceeds the threshold.

Step 5: Since both the repeat length, Ni, and its location are known, an exact method (step 6) is used to identify the repeat units.

Step 6: Consider all Ni-mers in the repeat region and identify those occurring most frequently by local alignment. This automatically makes it possible to allow for insertions and deletions to any desired level.