1992, Dimacs, 8, 127-139.

GeneBee: the Program Package for Biopolymer Structure Analysis

L.I. Brodsky, A.V. Vasilyev, Ya.L. Kalaydzidis, Yu.S. Osipov,

A.R.L.Tatuzov, S.I. Feranchuk

The GeneBee package is the result of a joint effort of mathematicians, biologists and programmers, which is why it has become the most widely used package for biopolymer structure analysis in Russia. We gratefully list the names of those whose work made this result possible: P. Avdanin, K. M. Chumakov, D. R. Davydov, A. L. Drachev, T. V. Dracheva, G. K. Frank, A. V. Gorbalenya, E. V. Koonin, A. M. Leontovich, I. Ya. Vakhutinskii, A. Voloboi.

Contents

I. Introduction

II. Main procedures

1. Local similarities and pairwise sequence alignment
2. Multiple sequence alignment
3. Statistical analysis of alignments
4. Scanning a sequence over the bank and mapping potentially important regions
5. Motif and pattern search in the bank. Pattern (site) mapping of a sequence
6. Prediction of the protein secondary structure
7. Prediction of protein-coding regions in nucleotide sequences
8. Construction of phylogenetic trees
9. Three-dimensional protein structure and work with it
10. Search for local similarities in three-dimensional structures
11. Additional procedures
12. Parallel computations and the use of transputers

I. Introduction

The GeneBee package is primarily devoted to the analysis of amino acid sequences and three-dimensional protein structures. The goal is, first, to extract information about the evolutionary and functional properties of a protein by comparing its primary sequence with databanks of known proteins and, second, to study the principles of protein folding into a three-dimensional structure, which can be used to draw conclusions about protein function. This line of research requires an integrated program library that can work with both primary sequence databanks and banks of three-dimensional structures. The main procedures of this library can evidently also be used in the analysis of nucleotide sequences.

The software is being developed in two directions. The first is the creation of convenient procedures for simultaneous work with banks of primary and spatial structures. This includes representing, in the corresponding databases, information about structurally and functionally important fragments and patterns, taking into account the statistical information obtained about sequences and three-dimensional structure data. The second direction is the development of a multiple alignment procedure and of statistical analysis of sufficiently large protein families, revealing, in particular, conservative positions, correlations between individual positions, and patterns common to all sequences. These and other properties of sequences should be used to predict large-scale protein structure, functionally important regions, and secondary and three-dimensional structure. Analogous problems can be considered for nucleotide sequences.

The present package has already been described in a number of papers [1-4]. A general description of the package was presented in [1], while papers [2-4] described some of its modules. Since these publications, however, the package has been extended and developed further. In particular, modules working with three-dimensional biopolymer structures were introduced. Other modules were rewritten and new procedures were added to them. Many changes were based on the experience of biologists who used the package.

The package is oriented to the IBM PC, for two reasons. First, an IBM PC equipped with transputers and a high-resolution graphics card can compete with a workstation while remaining a desktop personal computer. Second, molecular biologists in Russia currently have access mostly to computers of this type.

GeneBee includes its own built-in database format and program facilities for database handling. Databases can either be produced by the user by converting ASCII sequence files, or be generated by a special import program from the standard databases EMBL, SWISS-PROT, PIR, GenBank and the Brookhaven bank. The GeneBee format was designed for maximally compact storage of sequences and additional information, which is crucial for PC users. Conversion from the GeneBee format back into one of the standard formats is also possible.

The large amount of computation in biopolymer structure analysis places heavy demands on CPU time (particularly if the sequences are numerous or long), so reducing the computation time is an important problem. The traditional approach is to use expensive hardware (supercomputers are employed and specialized chips are being developed). The alternative approach is parallel computing, which in the current situation seems both cheaper and simpler to develop. GeneBee is oriented to the use of transputers.

Database handling in GeneBee is based on the principles employed in the popular package Norton Commander (NC). The internal structure of the database can be regarded as a natural extension of the DOS file system. The user simultaneously observes the contents of two directories or sub-databases in two NC-like panels. Not only files but also descriptors of individual entries within sub-databases are displayed. As in NC, the user can view, rename, select and unselect individual entries and sub-databases, create new sub-databases, merge them, and so on. A special built-in editor for individual sequences and sequence alignments is available. It is also possible to search sequence entries for user-defined keywords.

II. Main Procedures

2.1. Local similarities and pairwise sequence alignment. Construction of an optimal alignment is the central problem in the entire field of biological sequence analysis. When there are reasons to suspect that given sequences are similar, it is important to determine the degree of this similarity and to delineate similar segments with maximum precision.

The alignment problem is based on construction of the full local similarity map. In this map similar fragments of equal lengths are marked.

The similarity measure for two fragments is defined as follows. First, weights for all amino acid substitutions are introduced. There exist many variants of the substitution weight matrix: those based on physico-chemical considerations (charge, hydrophobicity, bulkiness and so on), those based on substitution statistics in the course of evolution (the Dayhoff matrix [10]), and so on. The similarity score A of two fragments is defined as the normalized sum of the substitution weights of the corresponding residues:

(1) $A = \dfrac{S - ML}{\sqrt{DL}}$

where S is the sum of the substitution weights for pairs of corresponding residues, L is the length of the fragments, and M and D are respectively the mean and the variance of the substitution weights over all possible residue pairs. Graphically, a pair of fragments corresponds to a diagonal map segment of length L. The similarity measure of two fragments computed by the above formula is called the power of the corresponding diagonal segment. The power is measured in standard deviation (SD) units.

When the length is sufficiently large, the similarity score A of two random Bernoulli fragments is independent of L and follows the standard normal (Gaussian) distribution.
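Formula (1) can be illustrated by a minimal Python sketch. The function names and the toy identity matrix are hypothetical, and computing M and D with equal weight for every residue pair is an assumption: the text does not say whether residue frequencies enter these moments.

```python
import math

def matrix_moments(matrix):
    """Mean M and variance D of the weights over all residue pairs.

    Assumption: every ordered residue pair counts equally.
    """
    values = [w for row in matrix.values() for w in row.values()]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def segment_power(fragment_a, fragment_b, matrix):
    """Power A = (S - M*L) / sqrt(D*L) of two equal-length fragments."""
    if len(fragment_a) != len(fragment_b) or not fragment_a:
        raise ValueError("fragments must be non-empty and of equal length")
    mean, variance = matrix_moments(matrix)
    length = len(fragment_a)
    total = sum(matrix[x][y] for x, y in zip(fragment_a, fragment_b))
    return (total - mean * length) / math.sqrt(variance * length)

# Toy two-letter identity matrix (hypothetical, for illustration only).
TOY_MATRIX = {"A": {"A": 1.0, "B": 0.0}, "B": {"A": 0.0, "B": 1.0}}
```

With this toy matrix M = 0.5 and D = 0.25, so four matching residues give S = 4 and a power of (4 - 2) / 1 = 2 SD.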

For each shift of the sequences, the alignment procedure finds in the local similarity map a series of non-overlapping diagonal segments and computes its power. These segments correspond to similar sequence fragments. In our algorithm this is done by a technique similar to that of Altschul and Erickson [8]. On each diagonal the following system of segments is found: the most powerful segment; the most powerful segment among those not intersecting the first one; the most powerful segment among those not intersecting the first two; and so on. The process terminates when the power of the segments thus obtained falls below the given threshold. The system of segments on each diagonal is obtained independently of the other diagonals. The Altschul-Erickson procedure has a number of advantages over the standard Diagon method [7]. In particular, it delineates the boundaries of similar fragments more exactly. Our variant of the algorithm works faster than the Altschul-Erickson algorithm (a more detailed description is presented in [2]).
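The selection of segments on one diagonal can be sketched as follows. This is a brute-force illustration of the greedy rule stated above, not the faster variant described in [2]; the function name, the `min_len` parameter and the list-of-weights input are assumptions made for the example.

```python
import math
from itertools import accumulate

def diagonal_segments(weights, mean, variance, threshold, min_len=1):
    """Greedily pick non-overlapping segments of one diagonal by power.

    weights[k] is the substitution weight of the k-th residue pair on the
    diagonal. Returns (start, end, power) triples with end exclusive.
    """
    prefix = [0.0] + list(accumulate(weights))

    def power(start, end):
        length = end - start
        total = prefix[end] - prefix[start]
        return (total - mean * length) / math.sqrt(variance * length)

    def best_in(lo, hi):
        spans = [(power(s, e), s, e)
                 for s in range(lo, hi)
                 for e in range(s + min_len, hi + 1)]
        return max(spans) if spans else None

    free = [(0, len(weights))]          # stretches not yet covered
    found = []
    while free:
        candidates = [c for c in (best_in(lo, hi) for lo, hi in free)
                      if c is not None]
        if not candidates:
            break
        best_power, start, end = max(candidates)
        if best_power < threshold:      # remaining segments are too weak
            break
        found.append((start, end, best_power))
        next_free = []
        for lo, hi in free:
            if lo <= start and end <= hi:
                next_free.extend(((lo, start), (end, hi)))
            else:
                next_free.append((lo, hi))
        free = [(lo, hi) for lo, hi in next_free if hi - lo >= min_len]
    return found
```

For the weights `[1, 1, 1, 0, 0, 1, 1]` with M = 0.5, D = 0.25 and a threshold of 1.4 SD, the sketch first selects positions 0-2 and then positions 5-6, and stops because nothing above the threshold remains.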

As stated above, there exist various substitution weight matrices. For each matrix a local similarity map can be generated. When constructing subalignments and alignments it is useful to consider a map obtained by joining several such maps.

The same power value corresponds to different similarity scores if the fragment length varies. By increasing M in formula (1) we map segments with increasing similarity scores, so the maps generated for various values of M differ. This allows them to be used for successive refinement of the alignment results.

A very important problem is sequence alignment, that is, arranging the sequences one above the other so that similar fragments match. Two variants of the problem can be stated. The first is the construction of the global alignment, and the second is the construction of optimal subalignments. In the former problem the arrangement of entire sequences is considered, while in the latter it is sufficient to align some fragments, not necessarily of similar lengths. The latter problem is especially important if the similarity between the entire sequences is low because they include extended regions that are not similar at all, while some fragments match well. It is then required to delineate such matching regions and to align them. In our algorithm both problems are solved on the basis of the local similarity map, and the (sub)alignment result is a chain of diagonal segments of this map.

To find optimal subalignments we use the cluster method of joining segments [13]. At each step we join the pair of segments (or already constructed subalignments) whose combined power exceeds the powers of the initial segments and is maximal among all such pairs. The joining procedure terminates when either of these two conditions cannot be satisfied. (The subalignment power is defined analogously to formula (1) and has the same probabilistic sense: it corresponds to the probability of obtaining the same sum of substitution weights when a random alignment of the same fragments, subject to random shuffling, is considered.)

The global alignment is performed by a procedure of the Needleman-Wunsch type [9], but unlike the classic Needleman-Wunsch algorithm, at each step we consider not residue pairs but previously found diagonal segments or subalignments of the local similarity map. In this case the alignment results are much more stable with respect to the gap penalty.

If the considered sequences are long, construction of the local similarity map, generation of the set of optimal subalignments and construction of the global alignment are rather time-consuming procedures. The design of this module therefore allows the use of transputers.

2.2. Multiple sequence alignment. The main idea of the multiple alignment algorithm employed in GeneBee is the same as that of the pairwise alignment, namely, the use of regions of multiple local similarity (motifs) as alignment reference points [3]. These motifs are defined as batches of similar fragments from several (not necessarily all) sequences.

Accordingly, our multiple alignment procedure consists of the following steps:

— generation of multiple local similarities (motifs);

— construction of optimal subalignments consisting of motifs;

— construction of the global alignment as a chain of ordered subalignments.

Results of each step can be of independent interest.

Generating the complete set of motifs formally requires considering all possible shifts of all sequences relative to each other, which is prohibitively time-consuming. We therefore employ the following procedure. First, local similarity maps for all pairs of sequences are generated. The diagonal elements of these maps constitute, by definition, the set of motifs of thickness 2. Then each of these motifs is shifted relative to each sequence not used for its generation, and sequence fragments matching the motif or its parts are marked. The power of the match between the motif and the fragment (the chaining power, which is defined analogously to (1)) should exceed the threshold. Motifs of thickness 3 are constructed in this way. The process is then repeated, and motifs of thickness 4, 5, etc. are generated. Some of the motifs grow to the full thickness, while the growth of others terminates at some stage. We thus obtain a set of motifs of varying thickness that correspond to significant multiple local similarities of sequence fragments.

With the described technique of motif identification, one may overlook significant motifs none of whose pairwise motifs is significant. To avoid such losses, it is desirable to set low thresholds for the identification of motifs of low thickness.

The search for optimal subalignments and the global multiple alignment has been completely reorganized as compared to [3]. Currently it is done in a manner somewhat similar to the case of two sequences: optimal subalignments are generated by the cluster method, while the global alignment is performed by the modified Needleman-Wunsch procedure. However, when these algorithms are implemented, completely new fundamental problems arise. The most important one is how to join motifs of partial thickness that include at least one common sequence.

The package includes one more algorithm related to multiple alignment, namely, the algorithm for aligning two already aligned, non-intersecting sets of sequences. In this case each primary alignment is considered as a single sequence whose "letters" are arrays of the corresponding residues from the sequences of this aligned set. The procedure is analogous to that of the pairwise alignment and starts with the generation of the local similarity map. The "correspondence weight" of two arrays is the sum of the substitution weights of all pairs of residues, one of which belongs to the first array and the other to the second one.
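The "correspondence weight" of two alignment columns can be sketched as follows. The function names and the identity weight are hypothetical, and skipping gap characters is an assumption, since the text does not say how gaps are treated.

```python
def identity_weight(x, y):
    """Toy substitution weight: 1 for identical residues, 0 otherwise."""
    return 1.0 if x == y else 0.0

def correspondence_weight(column_a, column_b, weight=identity_weight, gap="-"):
    """Sum of substitution weights over all cross pairs of two columns.

    column_a and column_b hold the residues that one position of each
    aligned set contributes (one residue per sequence of the set).
    Assumption: gap characters are simply skipped.
    """
    return sum(weight(x, y)
               for x in column_a if x != gap
               for y in column_b if y != gap)
```

For the columns "AAB" and "AB" the six cross pairs contain three identical pairs, so the weight under the identity matrix is 3.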

2.3. Statistical analysis of alignments. The phenomenon of compensatory changes is well known in the analysis of RNA secondary structure. Similar effects can take place in the analysis of (functionally or evolutionarily) related proteins. Besides local regularities, the primary structure can contain correlations between distant positions or fragments, which can be considered as evidence of the structural dependence of these positions. On the other hand, the existence of conservative positions (fragments) usually indicates functionally important structure fragments. All this information should be used in the reconstruction of the three-dimensional structure and the prediction of protein function.

The statistical analysis module of the GeneBee package provides three characteristics of the constructed alignment:

— plot of positional conservativity (with regards to the substitution weight matrix);

— mapping of significantly conservative (or significantly variable) continuous alignment regions;

— partitioning of the set of all positions into clusters of mutually dependent ones. Such partitioning can take into account the equivalence of residues relative to some property (e.g. hydrophobicity). In addition, the set of positions that are almost conservative relative to this property is determined.

The most important procedure of this module is the computation of positional correlations, for which contingency tables are employed. For each pair of positions a table is constructed whose rows correspond to the (non-equivalent) residues of the first position and whose columns correspond to the residues of the second position. Each cell contains the number of sequences in which the two corresponding residues occur in the given positions. For such a table it is simple to compute the χ² value, which measures the deviation of the positional residue frequencies from independence and thus characterizes the "correlation score" of the two alignment positions.
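The contingency-table computation can be sketched in a few lines of Python. This is the textbook Pearson chi-square statistic; the function name and the optional `classes` mapping (residue to equivalence class) are assumptions made for the example.

```python
from collections import Counter

def positional_chi_square(column_i, column_j, classes=None):
    """Pearson chi-square statistic for two alignment columns.

    column_i[k] and column_j[k] are the residues of sequence k in the two
    positions. classes optionally maps each residue to an equivalence
    class (for instance hydrophobic or polar); unmapped residues are kept.
    """
    if len(column_i) != len(column_j) or not column_i:
        raise ValueError("columns must be non-empty and of equal length")
    if classes:
        column_i = [classes.get(r, r) for r in column_i]
        column_j = [classes.get(r, r) for r in column_j]
    n = len(column_i)
    observed = Counter(zip(column_i, column_j))
    row_totals = Counter(column_i)
    col_totals = Counter(column_j)
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected
    return chi2
```

Two perfectly coupled columns such as "AABB" and "CCDD" give a value of 4, while "AABB" and "CDCD", whose residues combine independently, give 0.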

From the user's point of view this procedure looks as follows. First the partition of all residues into classes is defined, and each class is assigned a specific color on the screen. Then the computation of the χ² values for each pair of positions starts. The resulting "distance matrix" of positions is used to construct the cluster tree. (The distance between two clusters is defined as the arithmetic mean of the χ² values of the pairs belonging to these clusters.)

If a threshold value is set, a natural partition of the set of positions into connected groups is generated. The mean χ² value within a group exceeds the threshold, while the value between groups is lower. Each such group of pairwise correlated positions can be of biological interest, since the correlation can reflect compensatory changes in the course of evolution and thus can mean that these positions are linked conformationally.

Identifying a group of clearly correlated positions can make it possible to define a search pattern which, in turn, can help to find sequences in the bank with a similar property that may be biologically important.

2.4. Scanning a sequence over the bank and mapping potentially important regions. The scanning procedure consists of the examination of all sequences from the bank with selection of those sequences that either can be aligned well with the given sequence, or have a significant subalignment with a fragment of it.

The procedure consists of two stages. First, for each sequence from the bank we find such shifts (relative to the given sequence), that can lead to matching of similar fragments. Second, for each shift similar fragments in the pair of sequences are found and the optimal subalignments together with the global alignment are constructed.

The shifts are determined by a modified Pearson-Lipman procedure [12]. It marks the shifts for which a sufficiently large number of matching "equivalent" patterns is observed. In most cases this method indeed discovers the shifts that lead to the matching of similar fragments. The main advantage of this approach is that the search is fast.

We use several variants of patterns. In the simplest case the pattern is a "window" of a given length, and the shift quality is defined as the number of windows in corresponding positions that match up to residue equivalence. A more subtle variant, which guarantees a higher sensitivity of the search for similar regions, considers the sum of the numbers of matching windows of lengths from 2 to some specified value. In this case the definition of matching takes into account only the terminal window positions, while (mis)matching in the intermediate positions is ignored.
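The simplest variant, counting matching windows of a fixed length for every shift, can be sketched as follows. This illustrates the general word-matching idea only, not the modified procedure used in the package; the function name and the optional `classes` mapping are assumptions.

```python
from collections import Counter, defaultdict

def shift_scores(query, target, window=2, classes=None):
    """Count matching windows for every shift of target against query.

    The shift is the target position minus the query position. classes
    optionally maps residues to equivalence classes, so that windows
    match "up to residue equivalence".
    """
    def encode(seq):
        return "".join(classes.get(c, c) for c in seq) if classes else seq

    query, target = encode(query), encode(target)
    positions = defaultdict(list)       # window text -> query positions
    for i in range(len(query) - window + 1):
        positions[query[i:i + window]].append(i)
    scores = Counter()
    for j in range(len(target) - window + 1):
        for i in positions.get(target[j:j + window], ()):
            scores[j - i] += 1
    return scores
```

For the query "ACDEF" and the target "XXACDEF", all four windows of length 2 match at shift 2, so `shift_scores(...).most_common(1)` points to that shift as the best candidate.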

After the best shifts have been determined, their grouping is analyzed. If these shifts form a compact set (or if one "distinguished" shift is present), then for each such shift a system of non-intersecting diagonal segments is determined (each segment is a pair of similar sequence fragments). This is done using a simplified procedure for constructing the local similarity map. From this system we select the segments whose similarity score exceeds the given significance threshold. The segments found are then joined into optimal subalignments and the global alignment by a procedure analogous to the pairwise alignment.

If the power of a subalignment or the similarity score of the global alignment exceeds the corresponding threshold, then both the sequence and the constructed subalignment (or the global alignment, or both) are retained. The result of the procedure consists of two lists: the list of best subalignments and the list of best global alignments. A good global alignment is evidence of the evolutionary relatedness of the sequences, while a significant subalignment indicates only a common motif that can be functionally important. Using the list of best subalignments we can discover functionally important regions in the analyzed sequence. To do that, we mark the regions in the analyzed sequence that are similar to many fragments of bank sequences. If all the corresponding sequences share a distinct functional property, then it is natural to assume that the analyzed sequence also has this property and that the marked region is responsible for it.

An analogous bank scanning procedure can be performed if instead of a single query sequence we consider a set of already aligned sequences. In this case the determination of shifts is performed based on the pattern of conservative and correlated positions of this alignment.

Since the protein data bank grows more slowly than the nucleic acid data bank, the package allows a protein sequence to be scanned against the nucleic acid data bank (on both the direct and the complementary strands).

2.5. Motif and pattern search in the bank. Pattern (site) mapping of a sequence. One of the main operations in the entire field of biopolymer structure and function analysis is the search for motifs and patterns in amino acid sequences. Usually a pattern is understood as a logically defined regular expression, while a motif is a set of aligned, relatively short sequence fragments that are reasonably conservative in a sufficiently large family of related proteins. It is natural to expect that such motifs (patterns) are associated with some functions, e.g. participate in the active centers of enzymes. In some cases this assumption has indeed been confirmed by experimental data.

The procedure for searching the bank for regular expressions, which is contained in the GeneBee package, considers patterns either extracted from specialized banks (e.g. the bank of protein patterns Prosite included in the EMBL Data Library) or created by other package modules. It is possible to scan for a set of patterns simultaneously. In this case the program finds all sequences containing these patterns and marks them in different colors. Analogously, it is possible to search for known patterns in a single sequence.
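A pattern search of this kind can be sketched with Python's `re` module. The converter below handles only the basic elements of the Prosite pattern notation (`x`, `[...]`, `{...}`, repetition counts and the `<` and `>` anchors written outside brackets); the function names and the label `NGLYC` are hypothetical, and the sketch is not the package's own search procedure.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic Prosite-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # any residue
        parts.append(prefix + element + suffix)
    return "".join(parts)

def scan(sequences, patterns):
    """Return (sequence name, pattern name, start, matched text) tuples.

    Only non-overlapping occurrences of each pattern are reported.
    """
    compiled = {name: re.compile(prosite_to_regex(p))
                for name, p in patterns.items()}
    hits = []
    for seq_name, seq in sequences.items():
        for pat_name, regex in compiled.items():
            for match in regex.finditer(seq):
                hits.append((seq_name, pat_name, match.start(), match.group()))
    return hits
```

For example, the N-glycosylation pattern `N-{P}-[ST]-{P}` becomes `N[^P][ST][^P]`, which matches `NGTA` at position 2 of `AANGTAA` and finds nothing in `NPTA`.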

Another program related to this topic is a procedure that implements a simple variant of the "profile" search algorithm. It is a modification of an older module described in [4]. This program allows sensitive scanning of the bank for concordance with a given motif, statistical estimation of the similarity between fragments of the extracted sequences and the query motif, and creation of the user's databases. Unlike the previous version, the current program makes it possible to mark relatively fuzzy but functionally important patterns in sequences.

We see two main related modes of application of the motif search program: (1) bank scanning aimed at determining groups of related proteins and (2) analysis of novel amino acid sequences, with mapping of structural motifs in them and subsequent attribution of these sequences to the existing groups. To solve these problems it is necessary to create a "library" of motifs and of the corresponding mean similarity scores obtained during the scanning of the bank. This is currently being done. It should be noted that the above approach can be used to detect proteins containing the most typical motif sequences and thus, in some sense, to reconstruct the "ancestor" site.

2.6. Prediction of the protein secondary structure. Knowledge of the protein secondary structure is one of the important prerequisites of function prediction. Since crystallographic structures are available only for a small number of proteins, theoretical predictions of the secondary structure, together with comparative analysis, form a necessary tool for drawing conclusions about the three-dimensional structure of a protein. Unfortunately, the existing methods of secondary structure prediction are not sufficiently reliable: the correct conformation is predicted for approximately 60-70% of the amino acid residues of the polypeptide chain.

The procedure employed in the GeneBee package is based on the relatively simple but possibly most reliable Garnier-Robson algorithm [14]. This algorithm belongs to the group of methods based on statistical analysis of crystallographic databases of three-dimensional protein structures. For each i-th residue of the polypeptide chain, its "disposition" to occur in the α-helix, β-sheet or β-turn conformation is computed. These "dispositions" are obtained by summing the information about conformation probabilities contributed by the (i - 8)-th through (i + 8)-th residues. The information values are taken from special tables based on the residue conformations observed in the database of three-dimensional protein structures. The predicted residue conformation is the conformation with the highest disposition.
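The window summation can be sketched as follows. The information values are supplied by a caller-provided function; the toy function `toy_info` is purely hypothetical and does not reproduce the published Garnier tables, and the state names are likewise chosen only for the example.

```python
def predict_conformations(sequence, info,
                          states=("helix", "sheet", "turn"), half_window=8):
    """Assign to each residue the state with the highest summed disposition.

    info(state, residue, offset) returns the information that a residue
    located offset positions away contributes to the given state of the
    central residue.
    """
    prediction = []
    for i in range(len(sequence)):
        lo = max(0, i - half_window)
        hi = min(len(sequence), i + half_window + 1)
        scores = {state: sum(info(state, sequence[j], j - i)
                             for j in range(lo, hi))
                  for state in states}
        prediction.append(max(states, key=scores.get))
    return prediction

def toy_info(state, residue, offset):
    """Hypothetical values: A and E favor helix, V and I favor sheet,
    with a contribution that decays with the distance from the center."""
    favored = {"helix": "AE", "sheet": "VI", "turn": ""}
    return 1.0 / (1 + abs(offset)) if residue in favored[state] else 0.0
```

With these toy values, `predict_conformations("AAAAVVVV", toy_info)` labels the first four residues as helix and the last four as sheet.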

We supplemented the standard Garnier algorithm with a "smoothing" procedure. This procedure amounts to identifying zones where a marked majority of residues prefer one of the conformation types. The lower the probability of such a preference, the higher the smoothed plots corresponding to the three conformations. The resulting prediction is a system of non-intersecting continuous segments. On some of these segments one of the three conformation types (α-helix, β-sheet or random coil) is significantly preferred, while the remaining segments display no marked preference.

Often the exact secondary structure of a protein is not known, but the proportions of the various conformation types have been determined experimentally. In this case the Garnier parameters can be optimized by the simplex method, which is also implemented in the present module. In addition, if the exact number of α-helices present in a protein molecule is known, it is possible to optimize the smoothing parameter and to force the predicted number of α-helices to conform to the experimental results.

2.7. Prediction of protein-coding regions in nucleotide sequences. The module Protmake is devoted to the analysis of gene expression. Nucleotide sequences are "translated" in order to generate all possible open reading frames (ORFs), i.e. sequence segments containing no termination codons. Translation is performed in all 6 phases, that is, 3 phases on the direct strand and 3 on the complementary strand. The 5'-proximal initiation (AUG) codons are highlighted, and the resulting amino acid sequences can be stored in the GeneBee database format. Each ORF can be analyzed by a program implementing several algorithms that predict the probability of protein coding for this ORF [15, 16].
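Six-phase translation can be sketched as follows. The sketch uses the standard genetic code and, following the definition above, treats every stretch without a termination codon as an ORF; the function names and the `min_length` parameter are assumptions, and this is not the Protmake module itself.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; "*" marks a termination codon.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def open_reading_frames(dna, min_length=1):
    """Return (strand, phase, peptide) for every stop-free stretch."""
    dna = dna.upper().replace("U", "T")
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for phase in range(3):
            for peptide in translate(seq[phase:]).split("*"):
                if len(peptide) >= min_length:
                    orfs.append((strand, phase, peptide))
    return orfs
```

For the sequence `ATGGCCTAA`, phase 0 of the direct strand translates to `MA*`, so the ORF `MA` is reported for that phase; the position of the first `M` in a peptide corresponds to the 5'-proximal initiation codon.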

2.8. Construction of phylogenetic trees. The construction of phylogenetic trees is, in a sense, the final goal of the analysis of biological sequences. In particular, these trees can be used to reconcile the information about sequence similarities with the generally accepted view of the evolution of the corresponding species.

The construction of probable phylogenetic trees is based on the matrix of pairwise distances between sequences. These distances are computed from the residue substitution weights in the previously aligned sequences by the following formula:

(2) $d(i,j) = 1 - \dfrac{S(i,j) - S_r(i,j)}{S_{\max}(i,j) - S_r(i,j)}$,

where $S(i,j)$ is the sum of residue substitution weights for the sequences i and j, $S_r(i,j)$ is the analogous sum for these sequences subject to random shuffling, and $S_{\max}(i,j)$ is the maximum possible sum (corresponding to coinciding sequences).
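Formula (2) can be sketched for two aligned sequences of equal length. Two points are assumptions rather than statements of the text: the shuffled sum is taken as its exact expectation under a random permutation of one sequence, and the maximum sum is taken as the mean of the two self-comparison scores. Gap handling is left to the weight function.

```python
def identity_weight(x, y):
    """Toy substitution weight: 1 for identical residues, 0 otherwise."""
    return 1.0 if x == y else 0.0

def alignment_score(a, b, weight):
    """Sum of substitution weights over the aligned positions."""
    return sum(weight(x, y) for x, y in zip(a, b))

def shuffled_score(a, b, weight):
    """Expected score when one sequence is randomly permuted."""
    return sum(weight(x, y) for x in a for y in b) / len(a)

def distance(a, b, weight=identity_weight):
    """d = 1 - (S - Sr) / (Smax - Sr) for two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    s = alignment_score(a, b, weight)
    s_random = shuffled_score(a, b, weight)
    # Assumption: Smax is the mean of the two self-comparison scores.
    s_max = (alignment_score(a, a, weight) + alignment_score(b, b, weight)) / 2
    if s_max == s_random:
        raise ValueError("distance undefined: Smax equals Sr")
    return 1.0 - (s - s_random) / (s_max - s_random)
```

Identical sequences give a distance of 0, and a pair that scores no better than its shuffled version, such as "AABB" and "ABAB" under the identity weight, gives a distance of 1.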

The tree-construction algorithms implemented in this module can be divided into topological and cluster ones. The main feature of the topological algorithms is that they first optimize the tree structure (i.e. the way the tree nodes are connected) without considering branch lengths, which are reconstructed once the topological structure has been established [5], [6]. In contrast, in the cluster algorithms the order of node connections is reconstructed together with the corresponding branch lengths [13]. The root is also determined in a natural way, as a point on one of the branches such that the distances from it to all hanging nodes (corresponding to sequences) are equal. This property of cluster trees makes it possible to introduce the distance from the root to each node and to draw the tree using this distance as the node abscissa.

The topological algorithms employed in the module are based on the so-called topological similarity principle [5]. This approach is tailored for the most precise representation of the internal structure of the analyzed distance matrix. In order to do that, the number of sequence quartets {i, j, k, l} is calculated, for which the inequality

$d(i,j) + d(k,l) < d(i,k) + d(j,l) < d(i,l) + d(j,k)$

holds in the matrix but does not hold in the tree, and vice versa; this number is the topological mismatch value. The algorithm aims to construct the tree for which this number is minimal, i.e. the maximum topological similarity tree. Although in general the algorithm does not guarantee that the global minimum of the topological deviation is reached, the constructed trees corresponding to local minima approximate the maximum topological similarity tree reasonably well.
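Counting mismatching quartets can be sketched as follows. As a simplification (an assumption, not the authors' exact criterion), each quartet is characterized only by which of the three pair sums is the smallest, and a quartet counts as a mismatch when the matrix and the tree disagree on it; ties are resolved in favor of the first pairing. Both inputs are symmetric distance matrices, the second holding the path lengths measured on the tree.

```python
from itertools import combinations

def quartet_pairing(dist, i, j, k, l):
    """Index of the smallest pair sum: 0 for ij|kl, 1 for ik|jl, 2 for il|jk."""
    sums = (dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k])
    return sums.index(min(sums))

def topological_mismatch(matrix_dist, tree_dist):
    """Number of quartets whose closest pairing differs in matrix and tree."""
    n = len(matrix_dist)
    return sum(1 for quartet in combinations(range(n), 4)
               if quartet_pairing(matrix_dist, *quartet)
               != quartet_pairing(tree_dist, *quartet))
```

A tree-search procedure of the kind described above would then try to change the tree so that this count decreases.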

The algorithm Topological-vertex uses a condition for connecting nodes that is derived from this criterion. Based on this condition, the tree is constructed by successively connecting hanging nodes. In the algorithm Topological-branch the orientation of inner branches is changed according to this criterion. Once the tree structure has been constructed by one of the algorithms, the branch lengths are determined by a least squares procedure so that the distances between nodes measured on the tree with the fixed structure are as close as possible to the corresponding elements of the distance matrix.

In cluster algorithms the notion of distance between groups of sequences is used for the setting of the branching order. In the algorithm Cluster-pair this distance is defined as the arithmetic mean of pairwise distances between elements of the two groups. In the algorithm Cluster-group the distance is defined using the alignment of the two groups.
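Clustering with the arithmetic mean of pairwise distances, as in Cluster-pair, corresponds to the standard average-linkage (UPGMA) scheme, which can be sketched as follows. The function names and the nested-tuple tree representation are assumptions, and placing each node at half the mean distance between the joined groups follows the equal-distance-to-root property stated earlier.

```python
from itertools import combinations

def mean_distance(dist, group_a, group_b):
    """Arithmetic mean of pairwise distances between two groups."""
    total = sum(dist[x][y] for x in group_a for y in group_b)
    return total / (len(group_a) * len(group_b))

def cluster_pair_tree(dist):
    """Build a rooted tree as nested (left, right, height) tuples.

    Leaves are sequence indices; height is the distance from the node
    to each of the leaves below it.
    """
    nodes = [((i,), i) for i in range(len(dist))]    # (members, subtree)
    while len(nodes) > 1:
        a, b = min(combinations(range(len(nodes)), 2),
                   key=lambda p: mean_distance(dist, nodes[p[0]][0],
                                               nodes[p[1]][0]))
        height = mean_distance(dist, nodes[a][0], nodes[b][0]) / 2
        merged = (nodes[a][0] + nodes[b][0],
                  (nodes[a][1], nodes[b][1], height))
        nodes = [n for idx, n in enumerate(nodes) if idx not in (a, b)]
        nodes.append(merged)
    return nodes[0][1]
```

For four sequences with distances 2 within the first pair, 4 within the second pair and 6 between the pairs, the sketch joins sequences 0 and 1 at height 1, sequences 2 and 3 at height 2, and places the root at height 3.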

2.9. Three-dimensional protein structure and work with it. The module Prot3D of the GeneBee package allows simultaneous work with the banks of amino acid sequences and the Brookhaven bank of three-dimensional structures. The latter can be visualized and even modified, so that the probable three-dimensional structure of a sequence close to the given one is constructed or the influence of point mutations is considered. The possibility of predicting the three-dimensional structure of an amino acid sequence from its similarity to the sequence of a protein with a known three-dimensional structure (construction of a chimeric protein) is based, first, on the analysis of the primary structures of related proteins; second, on estimates of the conformational stability of the discovered local similarities and on the alignment of known three-dimensional structures; and, third, on energy minimization procedures. An important problem that this module can solve to some extent is the study of the interaction of two three-dimensional structures (docking). The corresponding GeneBee program includes the following options:

— generate the skeletal model (C-alpha chain(s)) and the full atom model of a molecule, rotate and move them;

— zoom an arbitrary fragment of the skeletal model;

— "paint" various groups of atoms of the model in different colors;

— cut out an arbitrary group of atoms and perform with it all operations described above;

— rotate around an arbitrary bond one part of the molecule relative to the other;

— perform single amino acid substitutions;

— compute the energy of an arbitrary conformation;

— perform docking of two objects;

— create stick-and-ball models;

— create van der Waals model in the ray tracing mode (for VGA adapters).

When van der Waals models are visualized, the realistic look of the picture is of great importance. The system of realistic visualization is based on the physical modelling of the photography process (ray tracing). To obtain a photograph, the camera lens must be reached by rays of various intensity that are afterwards focused into different points of the film. Of course, it is impossible to model the entire set of rays, but an approximate representation can be obtained by modelling a relatively small subset, for instance, one with the number of rays equal to the number of screen pixels. Thus the screen is considered as a discrete film surface with fixed points, each of which receives a single light ray. The luminosity of a screen point corresponds to the brightness of the ray falling into this point.

During picture generation, the following operations are performed for each ray:

— the first (the nearest to the observer) intersection of the ray with one of the three-dimensional objects is found;

— if no such intersection exists, then the ray intensity equals the background luminosity;

— if the intersection exists, then the ray intensity is computed depending on the following four main factors:

  1. properties of the object material;
  2. obstructions between the intersection point and the light source ("shadow");
  3. intensities of the refracted and reflected rays;
  4. general illumination of the scene.
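The per-ray operations listed above can be sketched for a scene of spheres (atoms drawn with van der Waals radii). This is a simplified illustration, not the GeneBee renderer: the constants, the function names and the diffuse shading rule are assumptions, and factor 3 (refracted and reflected rays) is omitted for brevity.

```python
import math

BACKGROUND = 0.05  # background luminosity (assumed value)
AMBIENT = 0.2      # general illumination of the scene (assumed value)


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hit_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest hit, or None."""
    oc = sub(origin, center)
    b = dot(oc, direction)
    disc = b * b - (dot(oc, oc) - radius * radius)
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-6 else None


def trace(origin, direction, spheres, light):
    """Intensity of one ray; spheres are (center, radius, albedo) tuples."""
    nearest = None
    for center, radius, albedo in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and (nearest is None or t < nearest[0]):
            nearest = (t, center, albedo)
    if nearest is None:
        return BACKGROUND  # no intersection: background luminosity
    t, center, albedo = nearest
    point = tuple(o + t * d for o, d in zip(origin, direction))
    to_center = sub(point, center)
    r = math.sqrt(dot(to_center, to_center))
    normal = tuple(x / r for x in to_center)
    to_light = sub(light, point)
    light_dist = math.sqrt(dot(to_light, to_light))
    to_light = tuple(x / light_dist for x in to_light)
    # Factor 2: any object between the point and the light casts a shadow.
    shadowed = False
    for c, rad, _ in spheres:
        hit = hit_sphere(point, to_light, c, rad)
        if hit is not None and hit < light_dist:
            shadowed = True
            break
    diffuse = 0.0 if shadowed else max(0.0, dot(normal, to_light))
    # Factors 1 and 4: material property (albedo) and general illumination.
    return min(1.0, albedo * (AMBIENT + diffuse))


atom = ((0.0, 0.0, 5.0), 1.0, 0.5)
blocker = ((0.0, 0.0, -5.0), 1.0, 0.5)
eye, forward = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
lit = trace(eye, forward, [atom], light=(0.0, 0.0, 0.0))
shaded = trace(eye, forward, [atom, blocker], light=(0.0, 0.0, -10.0))
missed = trace(eye, (0.0, 1.0, 0.0), [atom], light=(0.0, 0.0, 0.0))
```

In a full renderer, `trace` would be called once per screen pixel, with the ray direction derived from the pixel position.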

2.10. Search for local similarities in three-dimensional structures. Based on the methodology of search for local similarities of primary structures, we developed a search procedure for conformationally similar protein fragments. Of course, elements of α-helices are similar to one another, as are elements of β-sheets. However, it is possible to find other similar details of three-dimensional protein structures.

Each amino acid corresponds to a pair of torsion angles determining the path of the Cα chain of the molecule. Thus the three-dimensional structure can be represented as a sequence of angle pairs. Delineation of local similarities is based on this representation and is performed by a method analogous to that used for the local similarity search in primary structures.
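As an illustration of this representation, the sketch below computes a torsion angle from four consecutive points with the standard atan2 formula and compares two angles with wrap-around at 360 degrees. The function names are hypothetical, and the actual similarity measure used by the package for angle pairs is not specified here.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


# Four points with the central bond along the z axis (made-up coordinates).
cis = torsion_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                    (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
trans = torsion_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                      (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
quarter = torsion_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                        (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

A structure encoded this way becomes a sequence of angle pairs, so two residues can be compared by applying `angular_difference` to each angle of the pair.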

Based on the local similarities obtained in this way, the search for three-dimensional subalignments and the global alignment of the three-dimensional structures are performed. The problem reduces to the optimal matching of already found spatially similar elements of the compared proteins.

2.11. Additional procedures. Besides the programs for analysis of proteins or both proteins and nucleic acids, the package includes some modules working solely with nucleotide sequences. Among them are:

— prediction of RNA secondary structure;

— construction of oligonucleotide primers;

— preparation of nucleotide sequences in the EMBL Data Library format.

III. Parallel Computations and the Use of Transputers

As mentioned above, the package allows some time-consuming operations to be parallelized with the use of transputers.

The T800 transputer produced by INMOS integrates a 32-bit microprocessor and an 80-bit floating point co-processor. Its clock frequency is 20 MHz, and its performance is 10 MIPS, 1.5 Mflops or 4548 Whetstone per second (cf. 1860 Whetstone per second for the Intel 80386 + 80387 system at 20 MHz). The main difference between transputers and other types of processors is that they can be linked very simply into a multiprocessor system consisting of up to several hundred transputers.

The use of a transputer board by GeneBee requires a transputer program per se (i.e. a set of programs loaded into each transputer module) and a server program resident in the PC. The transputer board and the server exchange data by a protocol specified beforehand. In the present version the server is a set of user subroutines written in the C language and linked together with other programs of the GeneBee package. Each application transputer program is stored in three specific DOS files:

— the transputer net loader;

— the program module for the root transputer, including the so-called controller and executor;

— the executor program module for non-root transputers of the net.

For each problem the transputer program is oriented to the linear topology of the transputer net, which is the most suitable for the B008 board and the most simply extendable one. The dynamic loader of the transputer net is identical for all transputer programs used by GeneBee and is described below.

Parallelizing the two main package procedures (search for similar regions and bank scanning) with transputers demonstrated sufficient efficiency. The computation rate with 3 transputers increased 10- to 15-fold relative to an IBM AT (16 MHz), and the degree of parallelism reaches 0.95 in some cases, while the hardware cost is more than 10-fold lower than the cost of a workstation of similar performance. The other advantage of this approach is the simplicity of increasing the computation power (one additional Mflops costs less than $800). The limit of the power increase is 20-30 Mflops.

In the parallel variant of the local similarity search problem, the independent processing of individual diagonals (i.e. relative shifts) is distributed over different transputers. This is the essence of the so-called farm-out technique. The controlling process first distributes the pair of sequences through the chain of the executor processes (farmers). The first executor whose buffer is empty intercepts the incoming task (the buffer capacity is one task). The executor does not return the results to the controller, but places them into the common output buffer, in which they are stored until an overflow occurs. In that case the highest-scoring similar fragments are selected by a fast sorting procedure, the remaining ones are deleted and the filling of the buffer resumes. After the termination of the process, the sorting is performed in all transputers, followed by the linkage of all results starting with the last transputer. The linkage process includes the joining.
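The buffering scheme can be illustrated by a sequential simulation. The sketch assumes one output buffer per executor, a round-robin distribution of diagonals instead of the "first free executor" rule, and a simple match count as the diagonal score; all names and parameters are hypothetical and do not reproduce the transputer code.

```python
import heapq


def diagonal_score(seq_a, seq_b, shift):
    """Hypothetical task: count identical letters on one diagonal."""
    return sum(
        1 for i, ch in enumerate(seq_a)
        if 0 <= i + shift < len(seq_b) and seq_b[i + shift] == ch
    )


class OutputBuffer:
    """Bounded buffer that keeps only the best results after an overflow."""

    def __init__(self, capacity, keep):
        self.capacity = capacity
        self.keep = keep
        self.items = []

    def put(self, item):
        self.items.append(item)
        if len(self.items) > self.capacity:
            # Overflow: select the highest-scoring entries, drop the rest.
            self.items = heapq.nlargest(self.keep, self.items)


def farm_out(seq_a, seq_b, n_executors=3, capacity=8, keep=4):
    """Simulate the distribution of diagonals over a chain of executors."""
    buffers = [OutputBuffer(capacity, keep) for _ in range(n_executors)]
    shifts = range(-(len(seq_a) - 1), len(seq_b))
    for k, shift in enumerate(shifts):
        score = diagonal_score(seq_a, seq_b, shift)
        buffers[k % n_executors].put((score, shift))
    merged = []
    for buf in reversed(buffers):  # linkage starts with the last executor
        merged.extend(buf.items)
    return heapq.nlargest(keep, merged)


best = farm_out("ABCD", "XXABCD")
pruned = farm_out("ABCD", "XXABCD", n_executors=3, capacity=2, keep=1)
```

Because every buffer retains its own `keep` best entries, pruning on overflow cannot discard a result that belongs to the overall top `keep`.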

The transputer module of the similarity search in the bank is also oriented to the most simply extendable linear topology of the transputer net. The transputer processing reduces to the creation of large circular buffers in each transputer, followed by the loading of new bank sequences into the freed space. The uneven timing of sequence loading from the external drive requires large buffers, which, in turn, can cause an imbalance of the computation load. To avoid that, the controller keeps the statistics of the "named" requests from the transputers and distributes the "named" tasks so as to keep the load balanced. Besides, the obtained results cannot and should not be stored in transputers, which means that the controller has to output the results as they are obtained.

The dynamic loader of the transputer net differs from the static loader provided by INMOS. The supplied loader can load a program onto a net with an arbitrary number of transputers and an arbitrary topology, which, however, must be specified in advance, whereas the dynamic loader is intended for the fixed linear topology, but the number of transputers need not be known before the program starts.

References

1. L. I. Brodsky, A. L. Drachev, R. L. Tatuzov, and K. M. Chumakov, The package of programs for sequence analysis: GeneBee, Biopolimery i Kletka 7 (1991), no. 1, 10-14. (Russian)

2. A. M. Leontovich, L. I. Brodsky, and A. E. Gorbalenya, Construction of full local similarity map of two biopolymers (the DotHelix module of the GeneBee package), Biopolimery i Kletka 6 (1990), no. 6, 14-22. (Russian)

3. L. I. Brodsky, A. L. Drachev, and A. M. Leontovich, A novel method of multiple alignment of biopolymer sequences (the H-Align module of the GeneBee package), Biopolimery i Kletka 7 (1991), no. 1, 14-22. (Russian)

4. E. V. Koonin, K. M. Chumakov, A. E. Gorbalenya, The method for search of structure motifs in amino acid sequences (the Site program of the GeneBee package), Biopolimery i Kletka 6 (1990), no. 6, 42-48. (Russian)

5. K. M. Chumakov and S. V. Yushmanov, The maximum topological similarity principle in molecular systematics, Mol. Genet. Microbiol. Virusol. 3 (1988), 3-9. (Russian)

6. S. V. Yushmanov and K. M. Chumakov, Algorithms of the maximum topological similarity phylogenetic trees construction, Mol. Genet. Microbiol. Virusol. 3 (1988), 9-15. (Russian)

7. R. Staden, An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences, Nucl. Acids Res. 10 (1982), 2951-2961.

8. S. F. Altschul and B. W. Erickson, A nonlinear measure of subalignment similarity and its significance levels, Bull. Math. Biol. 48 (1986), 617-632.

9. S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970), 443-453.

10. M. O. Dayhoff, W. C. Barker, and L. T. Hunt, Establishing homologies in protein sequences, Methods Enzymol 91 (1983), 524-545.

11. E. N. Trifonov, Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S-rRNA nucleotide sequences, J. Mol. Biol. 194 (1987), 643-652.

12. D. J. Lipman and W. R. Pearson, Rapid and sensitive protein similarity searches, Science 227 (1985), 1435-1441.

13. J. A. Hartigan, Clustering Algorithms, John Wiley and Sons, New York, 1975.

14. J.-F. Gibrat, J. Garnier, and B. Robson, Further developments of protein secondary structure prediction using information theory, J.Mol.Biol. 198 (1987), 425-443.

 

Belozerski Institute of Physical and Chemical Biology

at Moscow State University, Moscow, Russia