In some fragments of the set of aligned sequences there exists a similarity between all or some layers (sequence fragments) of the alignment. These regions of similarity without gaps are called MOTIFS and characterized by the length (the number of letters in each fragment) and thickness (the number of layers in a motif).
The degree of similarity between motif layers is determined by probabilistic formulas and is characterized by the POWER of a motif It is based on a power of a pair motifs. Analogous reasoning allows one to introduce the alignment power. In various procedures of the algorithm either it or the traditional sum of mismatch weights serves as an alignment measure.
In the fist step of our algorithm for each pair of sequences all similar fragments are found by the DotHelix procedure. They form motifs of thickness 2. Similarities between sequences and motifs of thickness 2 form motifs of thickness 3. Further, motifs of thickness 4 are obtained from motifs of thickness 2 and 3 by linking them with suitable sequence fragments or motifs.
The alignment result can be represented as a chain of consistent (with regards to the order of letters in sequences) motifs interrupted by the necessary number of gaps. If an alignment cannot be extended by addition of new motifs at either end, it is called TOTAL. If an alignment can be extended at at least one end, it is called LOCAL alignment.
Often close neighboring motifs would extend one another if some gaps are introduced. Such region (called SUPERMOTIF) can have a greater similarity between fragments forming it, than individual motifs. Supermotifs and motifs are instances of local alignments. Note that in course of search for functionally important regions the knowledge of motifs and supermotifs seem to be of a greater importance for a biologist than an alignment itself. Thus the procedure for motif and supermotif determination has an independent value.
The algorithm proceeds in four stages:
The last stage can be implemented in two modes. The first method is an automated input of motifs and supermotifs ordered by their quality (greedy algorithm). The second method is a variant of the dynamic programming, that employs a similarity measure on pairwise motifs. In this method the quality of a total alignment is determined by the sum of mismatch weights accounting for gap penalties.
L S S ( W{b(r1,l), b(r2,l)} - M ) l 1<r1<r2<K A = ------------------------------------------- (1) sqrt (DL)where S is sign for sum, A is the power of the considered motif, L is its length, i.e. the common length of fragments forming it, K is the motif thickness, that is the number of layers in it, b(r,l) are the letters (residues) forming the r-th layer, W{A,C} is the weight of substitution of a letter A to a letter C, characterizing the similarity of the corresponding residues, M and D are user-defined constants. Increase of the power corresponds to increase of similarity between motif layers.
For random independent Bernoulli sequences, M should be mean substitution weight {M(W)}, D - the variance of sum of weights situated in one column of a motif. Then the power A can be considered as a random variable whose mean equals 0 and variance equals 1. When such choice of the constants m and D is made, the power A is the normalized sum over all motif positions of weights for pairwise substitutions of letters situated in one column. The probabilistic meaning of A is following:
ln{P(A>A0} is equivalent to -(A0 /2)**2, (we use sign "**" to designate raising to power).
Biological sequences, especially similar ones, cannot be assumed to be independent. Thus in some cases it is useful to set M to values larger than mean of substitution weights. It allows not to mark as similar substantially non-similar fragments.
The notion of power as defined by formula (1) is a base for other definitions: corrected motif power, conditional power, alignment power. The latter values are the ones directly used in the program.
The formula for A_cor is based on the following informal reasoning. Assume for simplicity that all sequences have a similar length n. The number of possible relative shifts of sequences) when motifs of thickness k are considered has the order n**(k-1), and it is approximately n**(k-1) times larger than that for motifs of thickness 2, for which no correction is performed. Thus the correction should decrease the corresponding probability P(A) at n**(k-2) times. n order to do that it is necessary to decrease a half of the squared power by (k-2)*ln(n). Thus we obtain the following formula for the corrected power:
A_cor = sqrt{A**2 - 2*(k-2)*ln(n)}.
One of the reasons for introducing the gap penalty is the correction similar to that in the corrected motif power. The problem is that the number of possible alignments sharply increases as the allowed number of gaps increases. Thus for random independent Bernoulli sequences the maximum of the sum of weights along the alignment path increases, and thus the power of alignments with many gaps is too large. In order to compensate for this effect, the gap penalty is introduced. Unfortunately, unlike the situation of the corrected motif power, it is difficult to estimate theoretically the gap penalty, since sums of weights along different path are strongly dependent.
Thresholds for power and lengths of motifs are set by a user.
Sets of motifs can be constructed employing different
matrices.
The described procedure is rather slow, since it requires to process a very large number of relative sequence shifts. In order to accelerate this process, it is possible to consider only the most promising shifts, which are processed further. Of course, this pre-processing can lead to the loss of some motifs. Such sorting of registers is performed by a Lipman-Pearson type procedure (Lipman and Pearson, 1985).
Then the sum of these values for all window positions for a given
shift is computed. To select a given shift the ratio this sum to
maxsum should be more then user-defined value (parameter "Coincidence
ratio").
A user-defined number of shifts is subject for further
processing.
Generally speaking, in order to find all possible motifs of different
lengths, all comparison registers should be analyzed, and for each of
them the set of non-intersecting sufficiently powerful motifs should
be found, as it is done in DotHelix procedure. However, this procedure
requires a prohibitively large time even for a rather small number of
sequences k, since the number of registers is proportional to n**(k-1),
where n is the average sequence length. In order to avoid the full
register search, we employ a procedure for motif determination, as stated
in Section 1. This procedure is based on the fact that in a sufficiently
powerful motif all or almost all pairs of layers are significally similar,
so a multiple motif should be the union of pair motifs.
Consider this algorithm in more detail. It is sufficient to describe the procedure of linking of two motifs (or a motif and a sequence fragment). If the motifs have no common layers, the DotHelix procedure is applied (each motif is considered as a unit). Linking of a fragment to a motif is performed in a similar manner. If motifs have common layers, then (if it is possible), the natural correspondence between layers is established, and then DotHelix is applied to thus defined comparison register.
In all three cases prior to processing by DotHelix motifs are slightly extended in order to allow more precise identification of arising motif boundaries.
The result of this procedure is a set of motifs of various length. Each motif is characterized by the corrected power A_cor computed when the motif was constructed. Motifs with the power exceeding some thresholds are retained.
Incomplete thickness of a motif means that only some sequences possess a good similarity in this region.
Intermediate storage of motifs is based on the stack principle, so
that only some best (with regards to the corrected power) motifs are
retained. When a new motif is constructed, the stack is reordered and
the worst of the stored motifs is deleted. It allows to save memory
and computation time.
The accelerating procedure described in Section 3 can be applied to
the construction of multiple motifs as well. It can be further
accelerated by considering color matches only in conserved
positions of motif (i.e. positions colored almost entirely in one
color). The percentage of one colored letters in a conservative position
is user-definded parameter "Min. letter frequence".
In the greedy method first the most powerful motif is considered, then it is linked with such motif, that the power of the constructed supermotif exceeds powers of both motifs belonging to it. Then the obtained supermotif is linked to the third motif (if it is possible) etc. This procedure is performed while the supermotif power increases.
In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages.
It should be noted, that not all supermotifs can be constructed by these procedures. Construction of algorithms that would allow us to construct all existing supermotifs is an object of further research.
This algorithm can be generalized in the following way (the derived procedure would not be a greedy one). First one of t best (super)motifs is taken, where t is defined by a user. Then each obtained partial alignment (currently consisting of a single supermotif) is linked by a new supermotif. This is done by one of $t$ ways so that the power of the alignment if possible increases. The procedure continues while there are supermotifs that can be linked to an alignment. If t=1, this is the above described greedy procedure, while if t equals to the number of motifs in the stack, this procedure reduces to the full search. As t increases, the computation time also increases, but alignment quality improves.
We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Alignment quality is measured by its "price" that is defined as a sum of centered weights W(al,bl) of mismatches along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow it in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum price of alignments starting at the beginning of the sequences and ending at this motif. We do not consider all transitions from a current motif to subsequent ones: among motifs having close registers only transition to the nearest one is allowed.
If many sequences are aligned, a complete alignment also can be represented as a chain of pairwise motifs (since a multiple motif can be decomposed into pairwise ones). However, unlike the previous case, alignment depends not only on the last aligned motif, but on some previous motifs. Thus in this case the dynamic programming principle cannot be applied.
In this connection we introduce the following notion. Each partial alignment is set in correspondence with its FRONT defined as the set of the rightmost in each sequence pairwise motifs belonging to the alignment, and their relative positions in the alignment. Thus in order to define a front, it is necessary, first, to set for each sequence such motif, that one of its layers belongs to this sequence, and, second, to define the position of this motif in the alignment relative to the initial position of the leftmost motif in this front. The initial position of any frontal motif in a sequence, to which it is ascribed, should be situated rightwards relative to beginnings in this sequence of all other frontal motifs. One motif can be ascribed to two sequences (recall that all motifs are pairwise ones).
It is simple to demonstrate that further construction depends only on the front of the already aligned part. Thus a dynamic programming procedure is applicable.
It is clear, that as the number of motifs increases, the number of fronts increases at a very fast rate (roughly as (N/t)**t , where N is the number of motifs, t is the number of sequences). Thus it seems to be necessary to bound the number of fronts by a reasonable stack size so that alignment of natural sequences would be possible.
A set of source sequences at FASTA format look like following:
>[seq_name1] First sequence at one-letter code including spaces, possibly in several lines >[seq_name2] Second sequence at one-letter code including spaces, possibly in several lines ............The sign ">" should stand in the first position of a
sequence_name
line.
Example:
>s1 GAATCCAG GGAATTCC GACA >s2 AAATAAACCCGGGGGAA GG AAAAA >S3 GGGGGGGGGGGGGGGGGTTTTTTTTTTTAAA