MULTIPLE ALIGNMENT (algorithm)

The multiple alignment of several biopolymer sequences means a set of sequences of similar lengths, each of them is obtained by insertion of gaps into the corresponding data sequence: these gaps correspond either to deletions of letters in the given sequence, or to insertions of letters in other sequences.

In some fragments of the set of aligned sequences there exists a similarity between all or some layers (sequence fragments) of the alignment. These regions of similarity without gaps are called MOTIFS and characterized by the length (the number of letters in each fragment) and thickness (the number of layers in a motif).

The degree of similarity between motif layers is determined by probabilistic formulas and is characterized by the POWER of a motif It is based on a power of a pair motifs. Analogous reasoning allows one to introduce the alignment power. In various procedures of the algorithm either it or the traditional sum of mismatch weights serves as an alignment measure.

In the fist step of our algorithm for each pair of sequences all similar fragments are found by the DotHelix procedure. They form motifs of thickness 2. Similarities between sequences and motifs of thickness 2 form motifs of thickness 3. Further, motifs of thickness 4 are obtained from motifs of thickness 2 and 3 by linking them with suitable sequence fragments or motifs.

The alignment result can be represented as a chain of consistent (with regards to the order of letters in sequences) motifs interrupted by the necessary number of gaps. If an alignment cannot be extended by addition of new motifs at either end, it is called TOTAL. If an alignment can be extended at at least one end, it is called LOCAL alignment.

Often close neighboring motifs would extend one another if some gaps are introduced. Such region (called SUPERMOTIF) can have a greater similarity between fragments forming it, than individual motifs. Supermotifs and motifs are instances of local alignments. Note that in course of search for functionally important regions the knowledge of motifs and supermotifs seem to be of a greater importance for a biologist than an alignment itself. Thus the procedure for motif and supermotif determination has an independent value.

The algorithm proceeds in four stages:

  1. construction of pairwise motifs;
  2. construction of multiple motifs (of thickness exceeding 2);
  3. forming of supermotifs from these motifs;
  4. construction of multiple alignments from previously obtained motifs and supermotifs and choice of the best alignment.

The last stage can be implemented in two modes. The first method is an automated input of motifs and supermotifs ordered by their quality (greedy algorithm). The second method is a variant of the dynamic programming, that employs a similarity measure on pairwise motifs. In this method the quality of a total alignment is determined by the sum of mismatch weights accounting for gap penalties.

POWER

As it has been said above, we define the quality of various local similarities based on probabilistic considerations. The most fundamental is the notion of the MOTIF POWER. It is defined by the following formula:

       L
       S S         ( W{b(r1,l), b(r2,l)} - M )
       l 1<r1<r2<K
A = -------------------------------------------    (1)
               sqrt (DL)

where S is sign for sum, A is the power of the considered motif, L is its length, i.e. the common length of fragments forming it, K is the motif thickness, that is the number of layers in it, b(r,l) are the letters (residues) forming the r-th layer, W{A,C} is the weight of substitution of a letter A to a letter C, characterizing the similarity of the corresponding residues, M and D are user-defined constants. Increase of the power corresponds to increase of similarity between motif layers.

For random independent Bernoulli sequences, M should be mean substitution weight {M(W)}, D - the variance of sum of weights situated in one column of a motif. Then the power A can be considered as a random variable whose mean equals 0 and variance equals 1. When such choice of the constants m and D is made, the power A is the normalized sum over all motif positions of weights for pairwise substitutions of letters situated in one column. The probabilistic meaning of A is following:

ln{P(A>A0} is equivalent to -(A0 /2)**2, (we use sign "**" to designate raising to power).

Biological sequences, especially similar ones, cannot be assumed to be independent. Thus in some cases it is useful to set M to values larger than mean of substitution weights. It allows not to mark as similar substantially non-similar fragments.

The weights W{A,C} form a weight matrix defined by a user. Its choice can vary. In the case of protein sequences one of the most widely used is the Dayhoff matrix in which weights are set based on substitution frequencies in already known homologous proteins. In other cases weights correspond to physical-chemical properties of amino acids. Simultaneous use of several matrices in course of the alignment construction also is possible, in this case the algorithm links together independently obtained motifs.

The notion of power as defined by formula (1) is a base for other definitions: corrected motif power, conditional power, alignment power. The latter values are the ones directly used in the program.

Corrected power

The power A characterizes the probability to obtain the given sum of weights for an individual motif taken from Bernoulli sequences. However, as the thickness K increases, the number of possible motifs becomes larger for a fixed length. Thus for Bernoulli sequences it is simpler to encounter powerful motifs among thick ones than among thin ones. So thick motifs would prevail in the list of the most powerful motifs and their importance would be overstated. In order to avoid that, we perform a correction substituting the power A to the CORRECTED POWER (A_cor).

The formula for A_cor is based on the following informal reasoning. Assume for simplicity that all sequences have a similar length n. The number of possible relative shifts of sequences) when motifs of thickness k are considered has the order n**(k-1), and it is approximately n**(k-1) times larger than that for motifs of thickness 2, for which no correction is performed. Thus the correction should decrease the corresponding probability P(A) at n**(k-2) times. n order to do that it is necessary to decrease a half of the squared power by (k-2)*ln(n). Thus we obtain the following formula for the corrected power:

A_cor = sqrt{A**2 - 2*(k-2)*ln(n)}.

Alignment power

We characterize similarity of sequence fragments participating in an alignment (partial or total) by its POWER. It is defined analogously to the motif power A (formula 1). When weights are summed along the alignment path, the weight of a pair of corresponding residues b(l,r1) and b(l,r2) is taken from the matrix (as in (1)) if this pair occurs in a motif, and equals the mean weight M(W) otherwise. If one of the symbols b(l,r1) and b(l,r2) is a gap, then the weight equals the user-defined gap penalty which in any case does not exceed M(W). In the formula for the alignment power L is set to the alignment length.

One of the reasons for introducing the gap penalty is the correction similar to that in the corrected motif power. The problem is that the number of possible alignments sharply increases as the allowed number of gaps increases. Thus for random independent Bernoulli sequences the maximum of the sum of weights along the alignment path increases, and thus the power of alignments with many gaps is too large. In order to compensate for this effect, the gap penalty is introduced. Unfortunately, unlike the situation of the corrected motif power, it is difficult to estimate theoretically the gap penalty, since sums of weights along different path are strongly dependent.

CONSTRUCTION OF PAIRWAISE MOTIFS

As it was stated, the first step of our procedure is construction of pairwise motifs. This is performed by the DOTHELIX procedure (Leontovich et al, 1993).

Thresholds for power and lengths of motifs are set by a user. Sets of motifs can be constructed employing different matrices.

When the standard constant M=M(W) is chosen, a large number of the obtained motifs would be noise, despite their formal statistical significance. This fact, obstacling further alignment construction by increase of the search, is caused by the null hypothesis of independence of the sequence being aligned (indeed, the very desire to align sequences is an evidence of their dependence!). Clearly, for dependent (similar) sequences the mean mismatch weight increases as compared to formula (2). The program allows a user to account for this fact by increasing M (by setting parameter "Min. homology ratio" in interval from 0 to 1: 0.01 is recommended). Experiments demonstrate that this procedure allows one to filter out a majority of noise motifs.

Thus it is possible to find motifs for various M and different weight matrices. A user can lump together all obtained motifs, and it increases alignment possibilities.

The described procedure is rather slow, since it requires to process a very large number of relative sequence shifts. In order to accelerate this process, it is possible to consider only the most promising shifts, which are processed further. Of course, this pre-processing can lead to the loss of some motifs. Such sorting of registers is performed by a Lipman-Pearson type procedure (Lipman and Pearson, 1985).

Consider it in more detail. First the letters (residues) are divided into several classes ("colors"). This coloring is consistent with the weight matrix: substitution weights within a class should be sufficiently large. For each shift of sequences the following procedure is performed: for each position of a window (of length, say, 7) the value t(t-1)/2 is computed, where t is the number of color matches within this window.

Then the sum of these values for all window positions for a given shift is computed. To select a given shift the ratio this sum to maxsum should be more then user-defined value (parameter "Coincidence ratio").

The described characteristic is preferable as compared to the more widely used number of perfect matches or number of color matches (that is, the sum of t), since it is sensitive to determination of shifts setting in correspondence locally similar regions. On the other hand, our procedure is sufficiently fast, since the above characteristic can be computed using preformed looking tables of all color pairs (the distance between which does not exceed the window length) for all sequences.

A user-defined number of shifts is subject for further processing.

CONSTRUCTION OF MULTIPLE MOTIFS

Generally speaking, in order to find all possible motifs of different lengths, all comparison registers should be analyzed, and for each of them the set of non-intersecting sufficiently powerful motifs should be found, as it is done in DotHelix procedure. However, this procedure requires a prohibitively large time even for a rather small number of sequences k, since the number of registers is proportional to n**(k-1), where n is the average sequence length. In order to avoid the full register search, we employ a procedure for motif determination, as stated in Section 1. This procedure is based on the fact that in a sufficiently powerful motif all or almost all pairs of layers are significally similar, so a multiple motif should be the union of pair motifs.

Consider this algorithm in more detail. It is sufficient to describe the procedure of linking of two motifs (or a motif and a sequence fragment). If the motifs have no common layers, the DotHelix procedure is applied (each motif is considered as a unit). Linking of a fragment to a motif is performed in a similar manner. If motifs have common layers, then (if it is possible), the natural correspondence between layers is established, and then DotHelix is applied to thus defined comparison register.

In all three cases prior to processing by DotHelix motifs are slightly extended in order to allow more precise identification of arising motif boundaries.

The result of this procedure is a set of motifs of various length. Each motif is characterized by the corrected power A_cor computed when the motif was constructed. Motifs with the power exceeding some thresholds are retained.

Incomplete thickness of a motif means that only some sequences possess a good similarity in this region.

Intermediate storage of motifs is based on the stack principle, so that only some best (with regards to the corrected power) motifs are retained. When a new motif is constructed, the stack is reordered and the worst of the stored motifs is deleted. It allows to save memory and computation time.

Speaking metaphorically, the suggested method first selects potential similarity zones based on pairwise comparisons, and then some of them become more visible due to their similarity with fragments of other sequences, while other are not supported by additional similarities and disappear from the stack.

The accelerating procedure described in Section 3 can be applied to the construction of multiple motifs as well. It can be further accelerated by considering color matches only in conserved positions of motif (i.e. positions colored almost entirely in one color). The percentage of one colored letters in a conservative position is user-definded parameter "Min. letter frequence".

CONSTRUCTION OF ALIGNMENTS

(supermotifs and total alignments)
As it was noted above, each alignment (partial or total) is a consistent chain of motifs. Each aligned sequence consists of fragments similar to fragments of other sequences (and thus belonging to motifs), fragments not belonging to any motif, and gaps. Since it is hardly possible that biological sequences can possess "negative" similarity (one with a negative power), i.e. unprobable in a random sequence dissimilarity, it is reasonable to assume that the order of letters and gaps in motif-free fragments is unimportant. In our program in such regions first letters are written and then the necessary number of gaps is added. The only exclusion from this rule is the beginning of the sequence, which is indented to the right.

Construction of supermotifs

At the first step of construction of the total alignment supermotifs are formed. Supermotif is defined as a partial alignment whose power exceeds powers of all its subalignments. In particular, its power exceeds powers of all motifs belonging to it. In the program two methods for linking of motifs into supermotifs are implemented: greedy (direct) and cluster.

In the greedy method first the most powerful motif is considered, then it is linked with such motif, that the power of the constructed supermotif exceeds powers of both motifs belonging to it. Then the obtained supermotif is linked to the third motif (if it is possible) etc. This procedure is performed while the supermotif power increases.

In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages.

It should be noted, that not all supermotifs can be constructed by these procedures. Construction of algorithms that would allow us to construct all existing supermotifs is an object of further research.

Construction of total alignment

Consider now construction of a total alignment. As it has been mentioned in Section 1, three methods are currently implemented in the program: greedy and dynamic programming.

Greedy procedure

Construction of total alignments by the greedy procedure is similar to the greedy construction of supermotifs, but it is not required that the power of obtained partial alignments necessarily increase. The procedure can be performed in several steps. First a total alignment of the most powerful motifs and supermotifs is constructed. Then it is supplemented by a new set of motifs and the alignment is modified without disturbing already aligned fragments. These new motifs can be obtained by either softening requirements to the power, or with the use of some other matrix, or allowing a lower homology threshold, or by combining some of the above parameters. Thus we obtain the second alignment, add to it more motifs, etc. Usually at the first step the most popular Dayhoff matrix is used.

This algorithm can be generalized in the following way (the derived procedure would not be a greedy one). First one of t best (super)motifs is taken, where t is defined by a user. Then each obtained partial alignment (currently consisting of a single supermotif) is linked by a new supermotif. This is done by one of $t$ ways so that the power of the alignment if possible increases. The procedure continues while there are supermotifs that can be linked to an alignment. If t=1, this is the above described greedy procedure, while if t equals to the number of motifs in the stack, this procedure reduces to the full search. As t increases, the computation time also increases, but alignment quality improves.

Dynamic programming

A dynamic programming procedure have been implemented for construction of a total alignment. First we describe this method in the case of two sequences.

We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Alignment quality is measured by its "price" that is defined as a sum of centered weights W(al,bl) of mismatches along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow it in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum price of alignments starting at the beginning of the sequences and ending at this motif. We do not consider all transitions from a current motif to subsequent ones: among motifs having close registers only transition to the nearest one is allowed.

If many sequences are aligned, a complete alignment also can be represented as a chain of pairwise motifs (since a multiple motif can be decomposed into pairwise ones). However, unlike the previous case, alignment depends not only on the last aligned motif, but on some previous motifs. Thus in this case the dynamic programming principle cannot be applied.

In this connection we introduce the following notion. Each partial alignment is set in correspondence with its FRONT defined as the set of the rightmost in each sequence pairwise motifs belonging to the alignment, and their relative positions in the alignment. Thus in order to define a front, it is necessary, first, to set for each sequence such motif, that one of its layers belongs to this sequence, and, second, to define the position of this motif in the alignment relative to the initial position of the leftmost motif in this front. The initial position of any frontal motif in a sequence, to which it is ascribed, should be situated rightwards relative to beginnings in this sequence of all other frontal motifs. One motif can be ascribed to two sequences (recall that all motifs are pairwise ones).

It is simple to demonstrate that further construction depends only on the front of the already aligned part. Thus a dynamic programming procedure is applicable.

It is clear, that as the number of motifs increases, the number of fronts increases at a very fast rate (roughly as (N/t)**t , where N is the number of motifs, t is the number of sequences). Thus it seems to be necessary to bound the number of fronts by a reasonable stack size so that alignment of natural sequences would be possible.

FASTA format

A set of source sequences at FASTA format look like following:

>[seq_name1]
First sequence at one-letter code including spaces, possibly in several
lines
>[seq_name2]
Second sequence at one-letter code including spaces, possibly in several
lines
............ 
The sign ">" should stand in the first position of a sequence_name line.

Example:

>s1
GAATCCAG GGAATTCC
GACA
>s2
AAATAAACCCGGGGGAA
GG AAAAA
>S3
GGGGGGGGGGGGGGGGGTTTTTTTTTTTAAA