DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 8, 1992

DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 8, 1992 ON THE OPTIMALITY OF THE DAYHOFF MATRIX FOR COMPUTING THE SIMILARITY SCORE BETWEEN FRAGMENTS OF BIOLOGICAL SEQUENCES

A.M. Leontovich

Institute of Physico-chemical Problems in Biology, laboratory bldg. A,
Moscow State University, Moscow 119899, Russia

abstract. One of the most important problems of the biological sequence analysis is the search for similar sequences or sequence fragments and measuring the observed similarity. One of the most widespread and theoretically substantiated definitions of the similarity measure is that by Staden. The exact numerical value of this measure depends on the choice of the letter substitution weight matrix that characterizes the similarity of individual letters (nucleotides or amino acid residues) constituting the compared sequences. In the case of amino acid sequences the most often used matrix is that by Dayhoff. Its construction is based on the statistics of amino acid substitutions in a set of a priori related (but sufficiently divergent) proteins. In this paper some theoretical corroboration for the Dayhoff matrix are presented. In particular, a statement is proved showing that in some sense the Dayhoff matrix is an optimal matrix for the search of similar fragments.

In the analysis of biological sequences (polynucleotide or polypeptide) it is often necessary to understand whether there is a similarity between these sequences or their fragments, and, if such a similarity exists, to estimate how large it is. One of the most widespread and theoretically substantiated definitions of a similarity measure is that by Staden [1]. The exact numerical value of this measure depends on the choice of letter substitution weights; these weights characterize the similarity of individual letters (nucleotides or amino acid residues), that constitute the compared sequences.

The choice of a matrix can vary for different situations. The simplest matrix is the unit matrix with weights equal 1 if the letters coincide and 0 if the letters differ. The unit matrix is usually considered as the weight matrix if polynucleotide sequences are compared. But in the case of polypeptide sequences other matrices are mostly used [2, p. 103]. Sometimes the choice of weights depends on certain physical chemistry considerations (for instance, the hydrophobicity of amino acids, the charge, and so on) [3]. Another method of the weight matrix construction stems from the genetic code table [2, p. 103], [4]. Finally, a very often (probably the most often) used matrix is the Dayhoff matrix [4], whose construction is based on the statistics of letter (amino acid) substitutions (it seems that biologists do not use this approach for analysis of polynucleotide sequences). In this matrix relatively large weights correspond to letter pairs with the tendency for mutual substitution (of course, large weights correspond also to coinciding letters).

The last choice (the Dayhoff matrix) is the object of this paper. It will be shown that in some sense the Dayhoff matrix is an optimal matrix for the search of similar fragments.

First, let us recall the definition of the Staden similarity measure for fragments.

Consider two biological sequences, either polynucleotide (DNA or RNA), or polypeptide (proteins). Let a₁,...,a_N1 and b₁,...,b_N2 be the letters constituting these sequences. If the polynucleotide sequences are being considered, then each of letters a_r, b_s is one of the four nucleotide bases (A, G, C, T/U). If polypeptide sequences are being considered, then a_r, b_s belong to the set of 20 amino acids. Denote by p_a, q_b the frequencies of the letters a and b in the first and the second sequences respectively, that is

(1)

where N¹_a is the number of occurences of the letter a in the first sequence (i.e. the number of subscripts l for which a_l = a,
1 l n₁), N²_b is the number of occurences of the letter b in the second sequence, and N₁ and N₂ are lengths of sequences.

For each pair of letters a, b we introduce the number r _a,b that characterizes the similarity of the letters a and b. This number is called the weight of the substitution of the letter a to the letter b. The weights p_a,b form the matrix P = ||r _a,b||. For polynucleotide sequences the matrix P is of dimension 4x4, while for polypeptides it is of dimension 20 x 20. Denote by m and D the mean and the variance of substitution weights in considered sequences:

Let (a_i, a_i+1,..., a_i+L-1) and (b_j, b_j+1,... ,b_j+L-1) be a pair of the sequence fragments of length L. Consider

This value A was suggested by Staden as a similarity measure between the compared fragments (a_i,a_i+1,... ,a_i+L-1) and (b_j,b_j+1,...,b _j+L-1).

If we assume that the sequences (a₁,...,a_N1), (b₁,...,b_N2) are independent Bernoulli sequences with letter frequencies p_a, q_b respectively, then the weights r _{a(i+l), b(j+l)} are random variables with the mean m and the variance D, and these random variables are independent for various l, 0 < l < L — 1. Then the similarity score A defined by formula (2) is a random variable. Its mean equals 0 and its variance equals 1. For L —> the distribution of this variable A tends to the standard normal distribution function N(0,1), that is,

For large L this formula allows us to find the probability P of the event that the sum of substitution weights for independent Bernoulli sequences of the length L is not less than the value of the corresponding sum for the fragments under comparison. This probability P is uniquely determined by A, i.e. P = P(A). Hence A is a natural similarity measure for fragments.

Now we describe the method to construct the Dayhoff matrix (as the author understands it).

A set of protein families is considered. Each family consists of presumably related proteins. In each case the presumption of relatedness is based not only on the similarity of the corresponding sequences, but rather on the biological considerations. However, the similarity between the proteins of a given family is certain but not too large, so that the sequences are not nearly identical. On the other hand, the proteins are similar enough in order to allow an unique alignment, in which the homologous fragments of the sequences are situated below each other. Such alignment is based not only on the sequences themselves, but also on the biological and chemical properties of the proteins.

Consider a pair of proteins a ,b belonging to one family. Align the protein sequences. After the alignment lengths of the sequences become equal. Denote by N the number of the alignment positions occupied by two amino acids (and not by an amino acid and a gap). Next, denote by N_aa the number of positions that are occupied by the amino acid a in the sequence a and an arbitrary amino acid (but not a deletion) in the sequence b . Similarly, denote by N_bb the number of the positions occupied by b in b and any amino acid in a . Finally, denote by N_aba
b the number of positions occupied by a in a and b in b . Consider

where a, b is an arbitrary pair of amino acids. (If r _aba
b > 1, than there is an observable tendency of the amino acid b to occupy the same position as the amino acid a). For fixed a, b the values r _aba
b are averaged for all protein pairs a , b belonging to same family, and then averaged for all families from the set. The obtained averaged values are considered as the weights r _àb in the Dayhoff matrix. (Actually, in the real Dayhoff matrix all values are then multiplied by 10 and rounded to integers, but this is not important, since, by (2), the procedure of simultaneous multiplication of all weights by a same number does not change the value A).

Note that despite the fact that r _aba
b and r _baa
b are, in general, different, the matrix P obtained after the averaging is symmetric (that is, r _ab = r _ba), since the averaging is made over all protein pairs and both the pair (a , b ) and the pair (b ,a ) are considered.

The Dayhoff matrix often used by biologists is presented in Table 1 (this matrix is taken from [6, Table 1.2]).

In the Dayhoff matrix, large values of weights r _ab correspond to pairs of amino acids a, b that have a tendency for mutual substitution (in particular, diagonal elements r _aa have large values). This seems to be perfectly natural. But is there a more formal foundation for the Dayhoff method and formula (3). Here we present a statement that, in our opinion, partially validates this formula.

Table 1. The Dayhoff weight matrix

Gly
Pro
Asp
Glu
Ala
Asn
Gin
Ser
Thr
Lys
Arg
His
Val
He
Met
Cys
Leu
Phe
Tyr
Trp

Gly
29

Pro
12
14

Asp
11
10
23

Glu
11
11
19
20

Ala
13
14
12
12
14

`Asn
11
10
14
12
12
17

Gin
10
11
14
14
11
12
21

Ser
12
11
12
11
13
13
11
13

Thr
10
10
10
10
12
12
11
13
18

Lys
7
8
10
10
9
12
11
10
9
24

Arg
4
4
7
7
5
10
13
7
6
22
75

His
6
6
9
8
8
13
12
9
9
11
20
59

Val
7
8
7
8
10
8
9
9
11
8
5
6
22

He
6
6
6
7
8
7
8
8
10
7
5
7
21
25

Met
5
6
6
6
8
7
8
8
9
8
10
6
16
16
26

Cys
6
4
5
5
7
6
5
11
9
4
2
3
11
9
12
166

Leu
4
4
5
6
6
5
6
6
7
6
4
6
15
18
21
5
40

Phe
2
2
2
3
4
3
3
4
5
3
4
7
7
11
11
2
12
70

Tyr
1
1
1
1
2
2
2
3
2
3
3
7
3
6
6
1
6
66
137

Trp
1
1
1
1
1
2
2
2
2
2
3
15
2
4
4
1
3
41
46
414

Let us have a pair of fragments (a₁,...,a_L), (b₁,...,b_L) of the same length L. Denote by n¹_a the number of positions in the first fragment occupied by the letter a, by n²_b the number of positions in the second fragment occupied by the letter b, and by n_àb the number of position occupied by a in the first fragment and b in the second fragment (i.e. the number of positions l such that a_l= a, b_l = b). Following (2), the similarity measure for the two fragments is

where

Each choice of weights r _àb corresponds to a value of the similarity measure computed by formulas (4) and (5), A = À{r _àb}. The larger is the value À{r _àb} (for the given pair of fragments), the more sensitive is the matrix P = ||r _àb|| for search of similarity between these fragments. The question is, which matrix P corresponds to the maximal value of À{r _àb} (for a given pair of fragments)? The answer is given by the following theorem.

Theorem. Let

Then for each weight matrix ||r _àb|| we have

Proof. Note first that, by (4) and (6), multiplication of all weights r _àb by the a constant and adding a constant to all weights r _àb does not change A{r _àb}. Hence it is sufficient to find a set of weights {r _àb} that maximizes A{r _àb} assuming that the following conditions are satisfied:

Therefore, we have a conditional extremum problem. By (8), this problem is evidently equivalent to the problem of maximization of a more simple variable

under conditions (8). Using the method of Lagrange factors it is easy to discover that the maximum value of S{r àb} corresponds to

where the factors l , m are found from conditions (8). This expression implies the left inequality in (7). Next, from (5) and (6) we easily see that

and the proof is completed.

Suppose that the letter frequencies in the fragments equal the corresponding frequencies in the entire sequences, that is

(so the fragments have average frequency characteristics). It follows from (6), (7), and (9) that in this case

The evident similarity between formulas (3) and (10) justifies the following interpretation of the above theorem: the choice of weights for the letter (amino acid) substitutions with the Dayhoff matrix is close to optimal if the compared sequences (proteins) have the intermediate similarity that is close to the similarity level in the protein families used as learning sets in the construction of the Dayhoff matrix. If the similarity level is different, some other matrix should be used.

As an example we consider the case of nearly identical sequences. For such sequences the values n_a¹, n_a², n_aa are close to each other, while the values ï_àb are small compared to n_a¹, n_a² for à a ¹ b. Formula (10) implies that in this case it is useful to choose as a weight matrix the diagonal matrix with elements inversely proportional to letter frequencies. (However, for such almost coinciding sequences the choice of the weight matrix is of no great importance since similar fragments would be surely found with the unit matrix, Dayhoff matrix, or many other matrices.)

Formulas (6) and (10) suggest possible modifications of the Staden approach to the definition of the similarity measure for fragments. In these modifications for the letter substitutions are not set once and forever, but depend of the sequences being considered. Two approaches are possible.

In the first approach the recomputation of weights is made during process of determination of similarity between fragments. This recomputation of weights depends on the sequences as a whole, but not on the fragments under consideration. First, some standard weight matrix ||r _àb|| (e.g., the Dayhoff matrix) is considered and similar fragments are found. Then the alignment is performed. Hence, the similarity level of those two sequences is determined. If necessary, weights are redefined using the following procedure. Denote by N the number of the alignment positions occupied by letters in both sequences (and not a letter and a gap). Denote by N_a¹ the number of positions occupied by the letter a in the first sequence and by an arbitrary letter in the second sequence; similarly, denote by n_b ² the number of the positions occupied by by the letter b in the second sequence. Finally, denote by N_ab the number of the positions occupied by a in the first sequence and by b in the second sequence. The new weights are then defined by the formula

(cf. (3), (10); note that the matrix ||r ^’ _àb|| is symmetric since for that very reason its elements in (12) are defined as half sums of the two terms). Then the more accurate search of similar fragment is performed using these new more adequate weights
r '_ab.

When this procedure is applied to independent Bernoulli sequences, the distribution of the random variable A' characterizing the similarity between fragments differs from the distribution of the corresponding random variable A defined using the preset weight matrix r _àb. In particular, it is evident that the distribution of A' is shifted to the right compared to the distribution for A, and that although A' has an asymptotically normal distribution N(0, 1) for L —> , N / L —> , it tends to the limit distribution more slowly than A. These facts should be considered during interpretation of observed results.

The described approach has an evident drawback compared to the standard Staden method. It is too complicated. Indeed, in order to find the weights ||r ^'_ab||, the usual Staden procedure is to be performed and then the sequences are to be aligned, which is a laborous task.

In the second approach the weights depend directly on the fragments that are compared, and not on the entire sequences from which the fragments were taken. Namely, it is suggested to use as the weights the values r _àb maximizing the similarity measure between the fragments (see (6)). Then the similarity measure between the fragments A^f is defined by formula (7), so that A^f =

A{r ^f_àb}. In this case letter frequencies (that are parts of (6) and (7)) can be computed using either the entire sequences (see (1)) or the fragments under consideration. (See (9). The weights r ^f_àb and the similarity measure A^f = À{r ^f _àb} are defined by formulas (10) and (11).) It is more convenient to compute not the similarity measure A^f, but its square A^f ². In this approach the recomputation of the weights is not performed, so it is less cumbersome. Moreover, the simplicity of formulas (7), (11) also is attractive.

To interpret the experimental results it is necessary to understand the nature of the distribution of the variable A^f (or A^f ²) when independent Bernoulli sequences are compared. Evidently, the distribution of A^f is fundamentally different from the distributions of A and A' discussed above. For example, the value A^f (unlike A and A') is automatically non-negative (see (7)). Of course, the distribution of A^f for L —> does not tend to the standard normal distribution N(0, 1).

Consider in more detail the case when the letter frequencies are computed using the entire sequences (i.e. by formula (1)). We analyse the distribution of A^f ². By (7), A^f ² is the sum of 20² = 400 positive terms

where

It is not difficult to compute

It can be shown that the variables x _ab, are asymptotically normal for L —> . (Therefore, A^f ² looks like the x² with 400 degrees of freedom, but by (15), the variances of x _ab differ from 1, and, what is even more important, by (16) these values are mutually dependent.) Formulas (14) and (15) imply that

In the case when all letters have equal frequencies p_a = q_b = 1/20, it is possible to find also the limit value (for L —> ) for the variance Var(A^f ²). Namely, computing the eigenvalues of the covariation matrix (15), (16), one sees that one of the eigenvalues equals 0, while all others equal 1. Using the asymptotic normality of the variables x _ab, it is possible to prove that in this case

Var(A^f ²) —> 2(20² - 1) = 798 for L —> .

In the general case we cannot compute the variance Var(A^f ²), since the eigenvalues of the covariation matrix cannot be found. However, the limit value of the variance would be certainly close to 798, even if the letter frequencies are not close to one another.

The variable A^f ² is not, of course, asymptotically normal for L —> (as mentioned above, it is more close to the variable x² with 399 degrees of freedom). However, since the number of terms in (13) is very large, the distribution of A^f ² is close to normal. Therefore in applications one can assume that A^f ² has a normal distribution with the mean 399 and the variance 798.

Since the mean E(A^f ²) is very large, the distribution of the variable A^f also is close to normal with the mean close to

and the variance close to

A similar situation for the distribution of (A^f ² and (A^f occures in the case when the letter frequencies are computed from the fragments, i.e. by formula (9). However, in this case the mean Å((A^f ²) and the variance Var((A^f ²) are slightly different:

E((A^f ²) = 20² - 2 • 20 + 1 = 361, Var((A^f ²) ~ 2 • 19³/21 ~ 653

(in the case of equal letter frequencies the variance coincides with the above value). As we have mentioned earlier, this approach has some advantages compared to the previous one (when the weights depend on the entire sequences and not on the compared fragments). But it has also serious drawbacks. One of them is that the value of (A^f ² is large not only if the fragments under comparison are very similar, but also when they are very dissimilar (i.e. when there is a tendency for non-similar letters to occupy same positions). Therefore, after computing the value (A^f ² it is necessary to find out why this value is large (due to the similarity or the dissimilarity of the fragments). The other drawback is that this approach (contrary to both the standard Staden technique and the first alternative technique) cannot be used by the DotHelix algorithm for the fast search of similar fragments in sequences [5].

Hence, both methods have serious drawbacks (besides the fact that they have not been studied analytically in any detail). On the other hand, the problem of the weight choice seems to be not too important for biologists. Nevertheless, the possibility of such approaches is worth bearing in mind.

References

[1] R. Staden, An interactive graphics program for comparing and aligning nucleic acid and ammo acid sequences, Nucl. Acids Res. 10 (1982), 2951-2961.

[2] R. F. Doolittle, Of URFs and ORFs. A primer on how to analyze derived amino acid sequences, University Science Books, Mill Valley, 1986.

[3] W. M. Fitch, An improved method of testing for evolutionary homology, J. Mol. Biol. 16 (1966), 9-16.

[4] M. D. Dayhoff, Atlas of protein sequence and structure, vol. 5, suppl. 3, Natl. Biomed. Res. Found., Washington, D.C., 1978.

[5] A. M. Leontovich, L. I. Brodsky, and A. E. Gorbalenya, Construction of the complete map of local similarity for two polymers, Biopolimery i Kletka 6 (1990), no. 6, 14-21. (Russian)

[6] G. E. Shulz and R. H. Schirmer, Principles of protein structure, Springer-Verlag, New York, 1979.

	Gly	Pro	Asp	Glu	Ala	Asn	Gin	Ser	Thr	Lys	Arg	His	Val	He	Met	Cys	Leu	Phe	Tyr	Trp
Gly	29
Pro	12	14
Asp	11	10	23
Glu	11	11	19	20
Ala	13	14	12	12	14
`Asn	11	10	14	12	12	17
Gin	10	11	14	14	11	12	21
Ser	12	11	12	11	13	13	11	13
Thr	10	10	10	10	12	12	11	13	18
Lys	7	8	10	10	9	12	11	10	9	24
Arg	4	4	7	7	5	10	13	7	6	22	75
His	6	6	9	8	8	13	12	9	9	11	20	59
Val	7	8	7	8	10	8	9	9	11	8	5	6	22
He	6	6	6	7	8	7	8	8	10	7	5	7	21	25
Met	5	6	6	6	8	7	8	8	9	8	10	6	16	16	26
Cys	6	4	5	5	7	6	5	11	9	4	2	3	11	9	12	166
Leu	4	4	5	6	6	5	6	6	7	6	4	6	15	18	21	5	40
Phe	2	2	2	3	4	3	3	4	5	3	4	7	7	11	11	2	12	70
Tyr	1	1	1	1	2	2	2	3	2	3	3	7	3	6	6	1	6	66	137
Trp	1	1	1	1	1	2	2	2	2	2	3	15	2	4	4	1	3	41	46	414