Notes on searching by similarity

Homepage

Belozersky Institute

GeneBee

Russian EMBnet Node

SEARCHING FOR HOMOLOGIES ALONG A SEQUENCE DATABANKS

COMPARISION SENSITIVITY of the GENEBEE SEARCH, BLAST and FASTA
REFERENCES
ALGORITHM
PARAMETERS

REFERENCES:

Brodsky L.I., Vasiliev A.V., Kalaidzidis Ya.L., Osipov Yu.S., Tatuzov R.L., Feranchuk S.I. GeneBee: the program package for biopolymer structure analysis, 1992, Dimacs, 8, 127-139. [Word97 doc]
Leontovich A.M. On the optimality of the Dayhoff matrix for computing the similarity score between fragments of biological sequences, 1992, Dimacs, 8, 1-9. [ Word97 doc ]
Leontovich A.M.,Brodsky L.I., Gorbalenya A.E. Construction of the full local similarity map for two biopolymers, 1993, Biosystems, 30,57-63.[Word97 doc]
Brodsky L.I., Ivanov V.V., Kalaidzidis Ya.L., Leontovich A.M., Nikolaev V.K., Feranchuk S.I., Drachev V.A. GeneBee-NET:Internet-based server for analyzing biopolymers structure, 1995, Biochemistry, 60, 8, 923-928. [Word97 doc]

ALGORITHM

Normally protein sequence (alignment) goes against protein databanks (SWISSPROT and PDB), but since the comletion of the protein databanks falls far behind the adds to the databank of nucleotide sequences (EMBL), there is the possibility of screening the protein sequence along desirable parts of EMBL (with translation of the direct and complementary chains in six reading frames).

Because of possible long duration of the screening (from 2 minutes to 24 hours) depending on the query sequence/alignment length and the volume of treated databank the result will be sent to you by E-mail along with it's WWW demonstration.

By default we suppose that the protein sequence/alignment should be analized against protein databank (SWISSPROT) and the nucleotide sequence/alignment - against complete nucleotide databank (EMBL). In protein case it will be taken Dayhoff matrix by default for similarity calculations and aminoacids will be divided to several groups ("colors") according to their physico-chemical properties.

The query sequence/alignment should be written in one-letter code (low or upper case) and can be divided to several strings. In case of alignment it have to be blank string between sequental batches and at the begining of every batch all sequence names should be repeated:

KAG2_CAVPO    -----------------------------CGGVLVDPQWVLTAAHCIND--S---N
KAG_PIG       -----------------------------CGGVLVNPKWVLTAAHCKND--N---Y
PLMN_PIG      parvvggcvsiphswpwqislryryrgHFCGGTLISPEWVLTAKHC----------
TRYP_PIG      -----------------------nsgsHFCGGSLINSQWVVSAAHCYKS--R---I
UROK_PIG      -----------------------------CGGSLISPCWVVSATHCFINYQQKEDY

KAG2_CAVPO    QVKLGRHNLFEDEDTA-----QHFLVSQSVPHPDFN
KAG_PIG       EVGWLRHNLFENENTA-----QFFGVTADFPHPGFN
PLMN_PIG      ------------lekssspssykvilgaheeyhlge
TRYP_PIG      QVRLGEHNIDVLEGNE-----QFINAAKIITHPNFN
UROK_PIG      IVYLGRQTLHSSTHGEMKFEVEKLILHEDYSADSLA

There are two variants of searching for homologies:

between a query sequence and the databank sequences or
between an alignment and the databank ones.

As a rule you will want to perform screening the protein sequence/alignment along protein databank (SWISSPROT) and screening of nucleotide one along single (or several) of EMBL parts, but since the completion of the protein databank falls far behind the adds to the databank of nucleotide sequences, the program sports the possibility of screening the protein sequence along the nucleotide bank (with translation of the direct and complementary chains in 6 reading frames).

The algoritm of searching for homologies of a sequence (or alignment) along a databank consists of a sequential tests of all sequences in a databank, selecting those, which either form a good alignment with the search one, or have a significant fragment, homologous to it.

The procedure consists of three steps. The first one finds those SHIFTS of the sequences, when homologous fragments are to coincide. The second one selects best local similarities without gaps (the "MOTIFS") for each shift, and the third step constructs the best total and local ALIGNMENTS of two sequences (or the query alignment and a databank sequence).

SHIFTS

The determination of sequence shifts is performed by the modified Lipman - Pearson procedure. It marks the shifts for which sufficiently large number of matching "equivalent" patterns can be observed. In most cases this method indeed discovers the shifts that lead to matching of similar fragments. The main advantage of this approach is that the search is rather fast (method of Looking table).

Let's consider it in details. On one hand, the letters (residues) are divided into several classes ("COLORS"). This coloring is consistent with the weight matrix: substitution weights within a class should be sufficiently large. For each shift of sequences the following procedure is performed: for each position of a window (of length, say, 7) the value T(T-1)/2 is computed, where T is the number of color matches within this window. Then the sum of these values for all window positions for a given shif t is computed. This characteristic is preferable as compared to the more widely used number of perfect matches or number of color matches (that is, the sum of T), since it is sensitive to determination of shifts setting in correspondence locally similar regions. On the other hand, our procedure is sufficiently fast, since the above characteristic can be computed using preformed looking table of all color pairs (the distance between which does not exceed the window length) for query sequence/ alignment. A user-defined number of best shifts is subject for further processing.

COLORS

These are following the standard groupes of aminoacids ("PROTEIN" colors):

A and C	are usually grouped together for having a small side chain,
C	is the only to build S-S bridges,
D,E,N,Q	a carboxil or an amid qroup in the side chain,
I,L,M,V	hydrophobic Interaction of the I kind,
F,Y,W	hydrophobic Interaction of the II kind,
H	positively charged and has a reactional ability,
K,R	positively carged side chain,
P	can't make hydrogen bonds,
S,T	make non-peptide hydrogen bonds in the side chain.

For DNA/RNA sequences only T and U are united in one group - all others are in separate groups ("DNA" colors).

MOTIFS

Motifs are sets of similar fragments of different sequences, having the same length. Those can be paired or multiple: the similarities between fragments of two sequences in the first case and of several sequences in the other.

Non-contradictional sets of motifs for the given pair of sequences can be united at local alignment (with gaps).

Mostly important motif/alignment characteristic is its statistical value in some definite statistical system. When supposing independent letter appearance, we can get the probability distribution of the given statistic, and, taking it into consideration, see the probability of fixed value of the choosen statistic, calculated for the given motif. This procedure selects less probable motifs and alignments (from the hypothesis of random letter appearance).

Traditionally, the mainly used statistic is the so called "sum of weights for changes" according to the MATRIX of residue's similarities. Different variants of this statistic can be formed, using different matrices (Dayhoff matrix, matrices of chemical and physical properties: hydrophobic, polarity, participation in secondary and 3D structure elements etc.). It is common to use the Standard Square Deviation (SD) units for measuring probability. We shall call the probability, measured in a such way, the POWER of the motif/alignment.

Power of motif highly depends on frequences of letters in comparing sequences. If frequences of letters in selected stretches of matching are significantly deviate from values in begining of calculations (for example the stretch is polyA), then it's necessary to recalculate the power with new frequences and this will decrease power of such unsignificant motif as, for example, the match of polyA stretch in query sequence with polyA stretch in databank sequence
( "Motif frequency recalc" parameter, "YES" is recommended ).

DOTHELIX PROCEDURE

We use DOTHELIX procedure to separate statistically significant motifs. Giving a fixed shift of two sequences (or alignments), it marks a set of non-overlapping motifs with the following property: the first motif has the maximum of the power (least possibility of appearing for random sequences of the statistic "sum of weights for letter changes along given matrix"), the second has the best power for the set of motifs, non-overlapping with the first and etc. Boundaries of the corresponding fragments can be found exactly since extending or shortening of fragments by one letter necessarily decreases their similarity (power).

The most important advantage of the suggested approach as compared with the traditional one (looking at the given sequence shift by window) is the following. First, the method is more sensitive (all motifs will be discovered independently on the difference at their lengths). Second, the method is more precise: exact boundaries of similar fragments are found and the power characterizing the similarity of the fragments is computed.

Dothelix procedure is not much less effective in comparison with "window method": as a rule it demands N * ln(N) operation istead of N for window method, but in sophisticated cases the number could be equal N * N. To eliminate such possibilities there is procedure parameter ("Accurate DotHelix") that bound the number of operations (case "NO").

If user marks several matrices, then Dothelix procedure will be performed for all of them separately. All the motifs with power more than threshold will be used for the best ALIGNMENTS construction.

ALIGNMENTS

On the next step of the examination of the current pair of sequences (or query alignment and databank sequence) two alignments are found: the best integral alignment (according to the sum of the residual exchange weights along total length) and the best local alignment of them (in the power terms). The both are built from the motifs previously found (for several matrices). If one of the following happens: the power of the local alignment or the alignment is higher than the threshold, then the sequence from the databank and the local alignment itself are saved. The results of the program are presented in two lists: the separate list of the best local alignments and a list of the best integral alignments. The inclusion of the two lists is brought up by the fact, that they show two biologically different senses: the presence of a best local alignment shows the existence of a good and, maybe, functionally significant pattern, while the integral alignment shows, there could be an evolutional relation of the two sequences.

Each aligned sequense consists of motif's fragments (fragments of query sequence/alignment similar to fragments of databank sequences), fragments not belonging to any motif, and gaps. At the first step supermotifs are formed. Supermotif is defined as a partial alignment whose power exceeds powers of all its local alignments. In particular, its power exceeds powers of all motifs belonging to it. In the program two methods for linking motifs into supermotifs are implemented: cluster method for local alignment and dinamic (for total alignment).

In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages.

It have been implemented a dynamic programming procedure for construction of a total alignment of query and databank sequences. We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Now total alignment quality is measured by its "price" that is defined as a sum of centered weights for letter exchange along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow given in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum price of alignments starting at the beginning of the sequences and ending at this motif. As result the best total alignment is constructed. If it's PERCENT OF HOMOLOGY (see PARAMETERS) and its power are more than thresholds then the alignment will be included in result list.

RESULT

The result holds the selected sequences and the alignments formed. Each motif of such alignment is built from a fragment of the search sequence (or alignment) and a fragment of the databank one.

If the screening goes with query alignment, then in the results will be not pair motifs: from one side it will be there all the variant of "thickening" of the search alignment because of all the newly found homologous fragments, and all the sequences, taking part in the new motifs, from the other side.

Example:

 SUPERMOTIF  9    Power  12.901

psuste      (11)  WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ---------pldvilk
                  **+*******++****+*++**+++*+++**+**++
DMSTEGHAb   (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQhwt*s*srrstvprac

psuste      pafdsssglfyddekkwygmiharyirsergvndihrkymrgdfescpniscnrkntl

DMSTEGHAb   sttmrksgta*ftpdtsgrsva*micteti*eetlnrvrispvigrtpcqwasamcga

psuste      pvglRSTAHAVKTTSILKLIHSHVRAQLSGHLLNAAAELETAPGQPT-----------
                *+++*++*+++******+*+*++*.****+++++*+++**.*+
DMSTEGHAb   sqr*RSTAHAVKTTSILKLIHSHVRAQLPGHLLNAAAELETAPGRPTvsnspnivlvv


psuste      ---------SLCFXLHQKALMPLKSPKSSPKNIKYSSS
                     +*** **+*+*****+**++**.*  +.+
DMSTEGHAb   f*tkrlnlqSLCFLLHQKALMPLKSPKSSPKKIGSSAS



 SUPERMOTIF  10    Power  12.619

psuste      (11)  WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ
                  **+*******++****+*++**+++*+++**+**++
DMSTEGHAb   (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQ

The signs between aligned sequences shows the quality of the letter exchage according to matrix selected the first.

PARAMETERS

In the dialog window you should set the following parameters of the program:

The result can be sent to you by E-mail along with it's WWW demonstration.
list of analized databanks. The following list of sequence databanks is possible to select:
Protein databanks:
1. databank of protein sequences SWISSPROT,
2. Broohaven PDB - sequences from databank of biopolymer 3D structures,
Subdivisions of nucleotide sequence databank EMBL:
1. primates,
2. rodentes,
3. mammalians, except primates and rodents,
4. vertebrates, except mammalians,
5. invertebrates,
6. viruces,
7. synthetic plasmids,
8. plants, including fungies,
9. expression tags,
10. patents,
11. bacteries,
12. unannotated sequences,

query sequence (should be in one-letter code format):
[sequence in one-letter code]

for example:

        MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAA
        KSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRN AKLKPVYDSL DAVRRAAL
        INMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA
        YK

or query alignment which should be represented as following:


ACRO_PIG      --------------------------------------------------------
EL1_PIG       ggteaqrnswpsqislqyrsgsswahtCGGTLIRQNWVMTAAHCVDRELTFRVVVG
EL2_PIG       ggedarpnswpwqvslqydssgqwrhtCGGTLVDQSWVLTAAHCISSSRTYRVVLG
FA9_CAVPO     FRVVGGEDAKPGQFPWQVLLNGETEAFCGGSIVNEKWIVTAAHCILPGIKIEVVAG
FA9_PIG       IRIVGGENAKPGQFPWQVLLNGKIDAFCGGSIINEKWVVTAAHCIEPGVKITVVAG
KAG2_CAVPO    ---------------------------CGGVLVDPQWVLTAAHCINDS--NQVKLG
KAG_PIG       ---------------------------CGGVLVNPKWVLTAAHCKNDN--YEVGWL
PLMN_PIG      maensktspiarmrdvvlfekriylsecktgngknyrgttsktksgvicqkwsvss
TRYP_PIG      ddkivggytcaansipyqvslnsgshfCGGSLINSQWVVSAAHCYKSR--IQVRLG
UROK_PIG      kkfqgehceidtsqtcfegnghsyrgkantntggrpclpwnsatvllntyhahrpd


ACRO_PIG      ----------------------------
EL1_PIG       EHNLNQNDGTEQYVGVQKIVVHPYWNTD
EL2_PIG       RHSLSTNEPGSLAVKVSKLVVHQDWNSN
FA9_CAVPO     KHNIEKKEDTEQRRNVTQIILHHSYNAS
FA9_PIG       EYNTEETEPTEQRRNVIRAIPHHSYNAT
KAG2_CAVPO    RHNLFEDEDTAQHFLVSqdfta------
KAG_PIG       RHNLFENENTAQFFGVTADFPHPGFnls
PLMN_PIG      phipkyspekfplagleenycrnpdnde
TRYP_PIG      EHNIDVLEGNEQFINAAKIITHPNF---
UROK_PIG      alqlglgkhnycrnpdnqrrpwcyvqvg

the method of similarity selection: Best local similarity (local alignments) and/or Integral similarity (total alignments) - both methods are recommended;

minimum length of motifs selected (7 is recommended);

the motif's minimum power (3 - 5 is recommended);

minimum power of selected local alignemnts (6 - 9 is recommended);

minimum percent of homology (in part of complete similarity) for integral alignments found (0.15 - 0.25 is recommended);

Looking table technique uses its own set of the parameters:

the desired percent of coinciding (from the maximum in the case of complete similarity) "one-color" pair-patterns on the selected shift (in 1/100 of 1) (0.02 for protein and 0.08 for nucleotide query sequence);

the selected number of best shifts (5-10 is recommended);

the length of parts, into which the studied bank sequence is split (100 i s recommended);

maximum length between the elements of a "colored" pair (7 is recommended)

the percent of concervativeness of a position in the query alignment) This value defines, which positions in the alignment should be counted as conservative and is used, while building the Looking table for query alignment (0.7 is recommended).

The additional search parameters are:

gap penalty (in the units of standard deviation) (1 is recommended);

the width of the register equivalence zone (used, when searching for integral homologies) (3 is recommended);

required minimum percent of homology for each motif (0.01 is recommended);

whether translation is used, or not for nucleotide sequences into amino-acid ones (including stop-codons) and the genetic code (Code) of translation. Either any bank sequence can be translated in three reading frames (Only direct mode), or the complementary sequence can be translated with it (Direct and copml. mode). This option is used when comparing protein sequences with a nucleotide bank. In this case the resulting motifs will contain translated sequences (stop-codons will be represented as dummy symbols {"*"}).
This prompt allows you to receive full descriptions of the sequences homologous to your query sequence.
This prompt brings you 3D coordinates of the sequences homologous to your query if they come from PDB bank.
Restrict the number of matching sequences reported to the number specified, default limit is 50 sequences.
Setting the borders of the sequence (or alignment) zone (Start & End), for which the searching for homologies search will be done.
Strand. Option sets which frames will be processed. If 'Only forward' is choosen then three forward frames will be processed in the case protein against nucleotide databanks and single forward strand will be processed in the case nucleotide against nucleotide databanks. If 'Both' is choosen all six frames will be processed in the case protein against nucleotide databanks and both forward and backward strands will be processed in the case nucleotide against nucleotide databanks.
Clusterization type. This option has sense only in the case nucleotide against protein databanks OR protein against nucleotide databanks. If 'Each frame separately' is chosen then found motifs will be clustered (into supermotifs) separatly for each frame. If 'Codirectional joinly' is choosen then motifs found on codirectional frames (forward and backward) will be clustered joinly. So it is possible obtain supermotif containing motifs from 2 or 3 forward (or backward) frames. The reason of this option - propable errors in query or databanks sequences.

The same dialog window sets the DotHelix mode (accurate and simplified) and the list of amino-acid residue similarity MATRICES, which will be used to search for motifs. If more than one matrix is selected, then the motifs are searched for all of the defined types of similarity. Currently it's possible to use the following three matrices:

Dayhoff matrix (modified 250 PAM matrix from Atlas of Protein sequence and structure, (1978), v.5, suppl. 3, pp.345-358):

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   12
C    8 22
D   10  5 14
E   10  5 13 14
F    6  6  4  5 19
G   11  7 11 10  5 15
H    9  7 11 11  8  8 16
I    9  8  8  8 11  7  8 15
K    9  5 10 10  5  8 10  8 15
L    8  4  6  7 12  6  8 12  7 16
M    9  5  7  8 10  7  8 12 10 14 16
N   10  6 12 11  6 10 12  8 11  7  8 12
P   11  7  9  9  5  9 10  8  9  7  8  9 16
Q   10  5 12 12  5  9 13  8 11  8  9 11 10 14
R    8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
S   11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
T   11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
V   10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
W    4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
Y    7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20

BLOSUM62 matrix constructed from 2000 multiple aligned blocks (equivalent to 160 PAM, Henikoff S. et al, Proc. Nat. Acad. Sci. USA, (1992), v.89, pp.10915 -10919):

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A    8
C    4 13
D    2  1 10
E    3  0  6  9
F    2  2  1  1 10
G    4  1  3  2  1 10
H    2  1  3  4  3  2 12
I    3  3  1  1  4  0  1  8
K    3  1  3  5  1  2  3  1  9
L    3  3  0  1  4  0  1  6  2  8
M    3  3  1  2  4  1  2  5  3  6  9
N    2  1  5  4  1  4  5  1  4  1  2 10
P    3  1  3  3  0  2  2  1  3  1  2  2 11
Q    3  1  4  6  1  2  4  1  5  2  4  4  3  9
R    3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
S    5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
T    4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
V    4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
W    1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
Y    2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11

Johnson matrix based on AA replacements in 65 sets structure aligned 3D structures (M.S.Johnson et al., J.Mol.Biol. (1993), v.233, pp. 716 - 738) :

    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  16
C   6 26
D   8  0 18
E   9  3 12 18
F   7  5  3  3 20
G   9  2  8  7  1 18
H   7  2  9  7  8  7 22
I   8  2  5  5 10  4  5 18
K   9  1  8 11  4  6 10  5 17
L   6  1  2  4 12  3  5 12  6 17
M   8  5  4  7  9  5  7 12  8 14 21
N   8  2 12  9  6  8 11  5 10  5  6 18
P   9  1  9  8  5  7  5  4  9  7  0  7 20
Q   9  3  9 12  3  7 11  3 11  5  9  9  6 19
R   8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
S  10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
T   9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
V   9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
W   4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
Y   6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20

We should remind, that the strategy of parameter selection is very individual and depends highly on the problem solved.

The results, got from the main computer, will contain selected sequences and homologies found, the query sequence/alignment and the parameters of search.