Spectral Repeat Finder

Consider a DNA sequence length n,

α₁ α₂ α₃...... α_n

where α_i is the base (A, T, G or C) at position i. The structure of symbolic correlations within the sequence is most simply explored through correlation functions of the type

C_αβ(r)=<U_α(i)U_β(i+r)>;

α, β ε {A,T,G,C}........................................(1)

where U_α(i) = 1 if α_i, the symbol at position i, is α, and 0 otherwise. U_α(i) is termed the 'indicator function' or 'projection operator' (Voss, 1992, Tiwari et al., 1997) and <..> denoted an average over the string (note that C_αβ(r) is not normalized). In the past several years, the self-correlation,

C(r) = Σ_αC_αα(r).......................................(2)

has been most commonly studied. If there is a nontrivial N-base correlation such as is caused by a repetitive unit, the correlation function has a near recurrence at this period, namely, C(r) ˜ C(N+r) (Herzel et al., 1999). It is computationally simpler to consider the Fourier Transform of C(r), which is an entirely equivalent method of examining correlations. One commonly used definition for the power spectrum (Silverman and Linsker, 1986, Li et al., 1994) is

S(f) = Σ_α1/n²|Σⁿ_j=1U_α(j)e^2πifj|².....................(3)

A peak at frequency f = 1/N in the spectrum indicates that the correlation function has N-base periodicity; see Li (1997) for a detailed discussion of the Fourier spectrum an its connection to the correlation function. It should be pointed out that while a N-base correlation does not necessarily imply the presence of a repeat unit of length N, the converse is true: the existence of repeats of length N will give rise to a peak at frequency f = 1/N. (In this context, recall that the 3-base periodicity in coding sequences gives a sharp peak in the power spectrum at frequency 1/3 (Fickett, 182, Tiwari et al., 1997) though, of course, coding regions are not composed of 3-base repeats).

Our study of the power spectrum of a number of sequences suggests a simple method to discover hidden periodicities. This applies in both instances, when the repeatsare tandem or dispersed, as well as when they are imperfect or 'noisy'. For a sequence string of length n, periodicities of the order of n/2 can in principle show up in the power spectrum which is computed at frequencies f = k/2n, k = 1,2,.....n. We first identify those freuqencies f_i such that

S(f_i)/Š > T.....................................................(4)

where the spectral average,
Š = 1/n {1 + 1/n - Σ_αρ²_α}...........................(5)

(ρ²_α is the frequency of nucleotide α in the sequence).

The significance of any spectral line should be assessed with respect to the spectral average, namely through its signal-to-noise (S/N) ratio. Peaks with S/N greater than 2, typically begin to be significant, but from a series of studies, we find that since random fluctuations can frequently drive a spectral line to this level, a more prudent criterion would require a higher threshold (T); this threshold is similar to the discriminator used in the gene finding method (Tiwari et al., 1997), and is quite insensitive to window and/or sequence length. We find from a number of simulations that a threshold of S/N =4 is useful (Tiwari et al., 1997), Ramachandran and Ramaswamy, 1999, aggarwal and Ramaswamy, 2002) and this makes it possible to locate repeats, which comprise as little as 2% of the total sequence.

Having found candidate repeat lengths N_i = 1/f_i, the signal-to-noise ratio of the spectral peak at frequency f_i is computed in a sliding window along the sequnece. As for coding regions (Tiwari et al., 1997) it is simple to identify the regions containing the repeats as those where the S/N exceeds the threshold. Since the length of the repeat (1/f_i) and the region containing the repeats are both completely specified, the actual repeats can be easily identified by testing possible N_i-mers in the region for the occurrence of multiple copies. This can be done by exact enumeration or even by a heuristic local alignment method.

We summarize the SRF algorithm in the following steps:

Step 1: Input a DNA sequence of length n.

Step 2: Compute the power spectrum, S( f ), and the spectral average, Š, of the entire sequence.

Step 3: Identify all peaks with S(f_i)/ Š > T (the threshold, here chosen to be 4). For each frequency f_i so identified, there are potential repeats of length N_i = 1/f_i.

Step 4: For each of these, compute the P_m( j ) = S(f_i)/ Š in a sliding window of length m centered on position j in the sequence. Regions containing the repeat of length N_i can be identified directly as those where P_m( j ) exceeds the threshold.

Step 5: Since both the repeat length, N_i, and its location are known, an exact method (step 6) is used to identify the repeat units.

Step 6: Consider all N_i-mers in the repeat region and identify those occurring most frequently by local alignment. This automatically makes it possible to allow for insertions and deletions to any desired level.