SEARCHING FOR HOMOLOGIES ALONG A SEQUENCE DATABANKSCOMPARISION SENSITIVITY of the GENEBEE SEARCH, BLAST and FASTA
Normally protein sequence (alignment) goes against protein databanks (SWISSPROT and PDB), but since the comletion of the protein databanks falls far behind the adds to the databank of nucleotide sequences (EMBL), there is the possibility of screening the protein sequence along desirable parts of EMBL (with translation of the direct and complementary chains in six reading frames).
Because of possible long duration of the screening (from 2 minutes to 24 hours) depending on the query sequence/alignment length and the volume of treated databank the result will be sent to you by E-mail along with it's WWW demonstration.
By default we suppose that the protein sequence/alignment should be analized against protein databank (SWISSPROT) and the nucleotide sequence/alignment - against complete nucleotide databank (EMBL). In protein case it will be taken Dayhoff matrix by default for similarity calculations and aminoacids will be divided to several groups ("colors") according to their physico-chemical properties.
The query sequence/alignment should be written in one-letter code (low or upper case) and can be divided to several strings. In case of alignment it have to be blank string between sequental batches and at the begining of every batch all sequence names should be repeated:
KAG2_CAVPO -----------------------------CGGVLVDPQWVLTAAHCIND--S---N KAG_PIG -----------------------------CGGVLVNPKWVLTAAHCKND--N---Y PLMN_PIG parvvggcvsiphswpwqislryryrgHFCGGTLISPEWVLTAKHC---------- TRYP_PIG -----------------------nsgsHFCGGSLINSQWVVSAAHCYKS--R---I UROK_PIG -----------------------------CGGSLISPCWVVSATHCFINYQQKEDY KAG2_CAVPO QVKLGRHNLFEDEDTA-----QHFLVSQSVPHPDFN KAG_PIG EVGWLRHNLFENENTA-----QFFGVTADFPHPGFN PLMN_PIG ------------lekssspssykvilgaheeyhlge TRYP_PIG QVRLGEHNIDVLEGNE-----QFINAAKIITHPNFN UROK_PIG IVYLGRQTLHSSTHGEMKFEVEKLILHEDYSADSLA
There are two variants of searching for homologies:
As a rule you will want to perform screening the protein sequence/alignment along protein databank (SWISSPROT) and screening of nucleotide one along single (or several) of EMBL parts, but since the completion of the protein databank falls far behind the adds to the databank of nucleotide sequences, the program sports the possibility of screening the protein sequence along the nucleotide bank (with translation of the direct and complementary chains in 6 reading frames).
The algoritm of searching for homologies of a sequence (or alignment) along a databank consists of a sequential tests of all sequences in a databank, selecting those, which either form a good alignment with the search one, or have a significant fragment, homologous to it.
The procedure consists of three steps. The first one finds those SHIFTS of the sequences, when homologous fragments are to coincide. The second one selects best local similarities without gaps (the "MOTIFS") for each shift, and the third step constructs the best total and local ALIGNMENTS of two sequences (or the query alignment and a databank sequence).
The determination of sequence shifts is performed by the modified Lipman - Pearson procedure. It marks the shifts for which sufficiently large number of matching "equivalent" patterns can be observed. In most cases this method indeed discovers the shifts that lead to matching of similar fragments. The main advantage of this approach is that the search is rather fast (method of Looking table).
Let's consider it in details. On one hand, the letters (residues) are divided into several classes ("COLORS"). This coloring is consistent with the weight matrix: substitution weights within a class should be sufficiently large. For each shift of sequences the following procedure is performed: for each position of a window (of length, say, 7) the value T(T-1)/2 is computed, where T is the number of color matches within this window. Then the sum of these values for all window positions for a given shif t is computed. This characteristic is preferable as compared to the more widely used number of perfect matches or number of color matches (that is, the sum of T), since it is sensitive to determination of shifts setting in correspondence locally similar regions. On the other hand, our procedure is sufficiently fast, since the above characteristic can be computed using preformed looking table of all color pairs (the distance between which does not exceed the window length) for query sequence/ alignment. A user-defined number of best shifts is subject for further processing.
These are following the standard groupes of aminoacids ("PROTEIN" colors):
For DNA/RNA sequences only T and U are united in one group - all others are in separate groups ("DNA" colors).
Motifs are sets of similar fragments of different sequences, having the same length. Those can be paired or multiple: the similarities between fragments of two sequences in the first case and of several sequences in the other.
Mostly important motif/alignment characteristic is its statistical value in some definite statistical system. When supposing independent letter appearance, we can get the probability distribution of the given statistic, and, taking it into consideration, see the probability of fixed value of the choosen statistic, calculated for the given motif. This procedure selects less probable motifs and alignments (from the hypothesis of random letter appearance).
Traditionally, the mainly used statistic is the so called "sum of weights for changes" according to the MATRIX of residue's similarities. Different variants of this statistic can be formed, using different matrices (Dayhoff matrix, matrices of chemical and physical properties: hydrophobic, polarity, participation in secondary and 3D structure elements etc.). It is common to use the Standard Square Deviation (SD) units for measuring probability. We shall call the probability, measured in a such way, the POWER of the motif/alignment.
Power of motif highly depends on frequences of
letters in comparing sequences. If frequences of letters in selected stretches of matching
are significantly deviate from values in begining of calculations (for example the stretch
is polyA), then it's necessary to recalculate the power with new frequences and this will
decrease power of such unsignificant motif as, for example, the match of polyA stretch in
query sequence with polyA stretch in databank sequence
We use DOTHELIX procedure to separate statistically significant motifs. Giving a fixed shift of two sequences (or alignments), it marks a set of non-overlapping motifs with the following property: the first motif has the maximum of the power (least possibility of appearing for random sequences of the statistic "sum of weights for letter changes along given matrix"), the second has the best power for the set of motifs, non-overlapping with the first and etc. Boundaries of the corresponding fragments can be found exactly since extending or shortening of fragments by one letter necessarily decreases their similarity (power).
The most important advantage of the suggested approach as compared with the traditional one (looking at the given sequence shift by window) is the following. First, the method is more sensitive (all motifs will be discovered independently on the difference at their lengths). Second, the method is more precise: exact boundaries of similar fragments are found and the power characterizing the similarity of the fragments is computed.
Dothelix procedure is not much less effective in comparison with "window method": as a rule it demands N * ln(N) operation istead of N for window method, but in sophisticated cases the number could be equal N * N. To eliminate such possibilities there is procedure parameter ("Accurate DotHelix") that bound the number of operations (case "NO").
On the next step of the examination of the current pair of sequences (or query alignment and databank sequence) two alignments are found: the best integral alignment (according to the sum of the residual exchange weights along total length) and the best local alignment of them (in the power terms). The both are built from the motifs previously found (for several matrices). If one of the following happens: the power of the local alignment or the alignment is higher than the threshold, then the sequence from the databank and the local alignment itself are saved. The results of the program are presented in two lists: the separate list of the best local alignments and a list of the best integral alignments. The inclusion of the two lists is brought up by the fact, that they show two biologically different senses: the presence of a best local alignment shows the existence of a good and, maybe, functionally significant pattern, while the integral alignment shows, there could be an evolutional relation of the two sequences.
Each aligned sequense consists of motif's fragments (fragments of query sequence/alignment similar to fragments of databank sequences), fragments not belonging to any motif, and gaps. At the first step supermotifs are formed. Supermotif is defined as a partial alignment whose power exceeds powers of all its local alignments. In particular, its power exceeds powers of all motifs belonging to it. In the program two methods for linking motifs into supermotifs are implemented: cluster method for local alignment and dinamic (for total alignment).
In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages.
It have been implemented a dynamic programming procedure for construction of a total alignment of query and databank sequences. We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Now total alignment quality is measured by its "price" that is defined as a sum of centered weights for letter exchange along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow given in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum price of alignments starting at the beginning of the sequences and ending at this motif. As result the best total alignment is constructed. If it's PERCENT OF HOMOLOGY (see PARAMETERS) and its power are more than thresholds then the alignment will be included in result list.
The result holds the selected sequences and the alignments formed. Each motif of such alignment is built from a fragment of the search sequence (or alignment) and a fragment of the databank one.
If the screening goes with query alignment, then in the results will be not pair motifs: from one side it will be there all the variant of "thickening" of the search alignment because of all the newly found homologous fragments, and all the sequences, taking part in the new motifs, from the other side.
SUPERMOTIF 9 Power 12.901 psuste (11) WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ---------pldvilk **+*******++****+*++**+++*+++**+**++ DMSTEGHAb (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQhwt*s*srrstvprac psuste pafdsssglfyddekkwygmiharyirsergvndihrkymrgdfescpniscnrkntl DMSTEGHAb sttmrksgta*ftpdtsgrsva*micteti*eetlnrvrispvigrtpcqwasamcga psuste pvglRSTAHAVKTTSILKLIHSHVRAQLSGHLLNAAAELETAPGQPT----------- *+++*++*+++******+*+*++*.****+++++*+++**.*+ DMSTEGHAb sqr*RSTAHAVKTTSILKLIHSHVRAQLPGHLLNAAAELETAPGRPTvsnspnivlvv psuste ---------SLCFXLHQKALMPLKSPKSSPKNIKYSSS +*** **+*+*****+**++**.* +.+ DMSTEGHAb f*tkrlnlqSLCFLLHQKALMPLKSPKSSPKKIGSSAS SUPERMOTIF 10 Power 12.619 psuste (11) WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ **+*******++****+*++**+++*+++**+**++ DMSTEGHAb (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQ
The signs between aligned sequences shows the quality of the letter exchage according to matrix selected the first.
In the dialog window you should set the following parameters of the program:
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAA KSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRN AKLKPVYDSL DAVRRAAL INMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA YK
or query alignment which should be represented as following:
ACRO_PIG -------------------------------------------------------- EL1_PIG ggteaqrnswpsqislqyrsgsswahtCGGTLIRQNWVMTAAHCVDRELTFRVVVG EL2_PIG ggedarpnswpwqvslqydssgqwrhtCGGTLVDQSWVLTAAHCISSSRTYRVVLG FA9_CAVPO FRVVGGEDAKPGQFPWQVLLNGETEAFCGGSIVNEKWIVTAAHCILPGIKIEVVAG FA9_PIG IRIVGGENAKPGQFPWQVLLNGKIDAFCGGSIINEKWVVTAAHCIEPGVKITVVAG KAG2_CAVPO ---------------------------CGGVLVDPQWVLTAAHCINDS--NQVKLG KAG_PIG ---------------------------CGGVLVNPKWVLTAAHCKNDN--YEVGWL PLMN_PIG maensktspiarmrdvvlfekriylsecktgngknyrgttsktksgvicqkwsvss TRYP_PIG ddkivggytcaansipyqvslnsgshfCGGSLINSQWVVSAAHCYKSR--IQVRLG UROK_PIG kkfqgehceidtsqtcfegnghsyrgkantntggrpclpwnsatvllntyhahrpd ACRO_PIG ---------------------------- EL1_PIG EHNLNQNDGTEQYVGVQKIVVHPYWNTD EL2_PIG RHSLSTNEPGSLAVKVSKLVVHQDWNSN FA9_CAVPO KHNIEKKEDTEQRRNVTQIILHHSYNAS FA9_PIG EYNTEETEPTEQRRNVIRAIPHHSYNAT KAG2_CAVPO RHNLFEDEDTAQHFLVSqdfta------ KAG_PIG RHNLFENENTAQFFGVTADFPHPGFnls PLMN_PIG phipkyspekfplagleenycrnpdnde TRYP_PIG EHNIDVLEGNEQFINAAKIITHPNF--- UROK_PIG alqlglgkhnycrnpdnqrrpwcyvqvg
The additional search parameters are:
The same dialog window sets the DotHelix mode (accurate and simplified) and the list of amino-acid residue similarity MATRICES, which will be used to search for motifs. If more than one matrix is selected, then the motifs are searched for all of the defined types of similarity. Currently it's possible to use the following three matrices: