Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2005 Jul 26;102(30):10557-62.
doi: 10.1073/pnas.0409137102. Epub 2005 Jul 6.

An algorithm for progressive multiple alignment of sequences with insertions

Affiliations
Comparative Study

An algorithm for progressive multiple alignment of sequences with insertions

Ari Löytynoja et al. Proc Natl Acad Sci U S A. .

Abstract

Dynamic programming algorithms guarantee to find the optimal alignment between two sequences. For more than a few sequences, exact algorithms become computationally impractical, and progressive algorithms iterating pairwise alignments are widely used. These heuristic methods have a serious drawback because pairwise algorithms do not differentiate insertions from deletions and end up penalizing single insertion events multiple times. Such an unrealistically high penalty for insertions typically results in overmatching of sequences and an underestimation of the number of insertion events. We describe a modification of the traditional alignment algorithm that can distinguish insertion from deletion and avoid repeated penalization of insertions and illustrate this method with a pair hidden Markov model that uses an evolutionary scoring function. In comparison with a traditional progressive alignment method, our algorithm infers a greater number of insertion events and creates gaps that are phylogenetically consistent but spatially less concentrated. Our results suggest that some insertion/deletion "hot spots" may actually be artifacts of traditional alignment algorithms.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Alignment of sequences with insertions. (A)(Top) Example of four sequences undergoing insertion events (boxed). (Middle) Alignments at nodes x and y, with flags (•) for preexisting gaps (compared with ○, indicating nongapped positions), are shown on top and left of the matrix, respectively. Their alignment (x, y, and m for insertions in upper and lower subtrees, and a match, respectively) is depicted in gray, with steps penalized multiple times in light gray. The gaps skipped over without repeated penalty by our method are shown by an arrow. (Bottom) In practice, the dynamic programming is performed with a stack of matrices among which some moves, indicated by light gray arrows and squares, are without penalty. The matrices in front, middle, and back correspond to standard moves, moves skipping over gaps in sequence y (vertical), and moves skipping over gaps in sequence x (horizontal), respectively. A flagged two-base insertion in the subtree to the right of x is skipped over, and the alignment continues with an insertion in edge formula image leading from z to y. The example corresponds to a simple linear gap cost algorithm with two storage matrices, and only one vector of flags is shown because x and y do not have grandchild sequences. (B) Here, we illustrate a special case where independent insertions have occurred at the same site in both subtrees. (C) Four sequences are aligned by applying our gap-flagging approach. In the sequences for nodes a, b, and c, the height of each character represents its relative probability, and flags show gaps in the current sequence (bottom) and overlapping gaps in its existing child sequence (top). In a, a one-character gap is opened; in b, the gap is skipped over and extended by one character at each side, with the double flag at the fourth site denoting the overlapping insertion and deletion; and in c, only the one-character insertion in the grandchild is skipped over. If c were to be aligned to its sister sequence, only the one site in the middle could be skipped over, because the sites adjacent to it (denoted by gray circles) were inferred as deletions and are now hidden. The inferred alignment is shown at the bottom (gray box). See text for further notation details.
Scheme 1.
Scheme 1.
Alignment HMM.
Algorithm 1.
Algorithm 1.
Progressive alignment of sequences with insertions. A boolean variable formula image is true if site xi is within a gap, otherwise false; formula image is true if site xi and one of its child sites are within a gap, otherwise false. Their values are reversed by the! operator. Conditions in parentheses are true when the variables f they contain either all represent formula image or all represent formula image. We use ⋆ to indicate indices that range over all permitted values (X, Y, M for v; X, M for t; Y, M for u), and vE denotes the score of the full alignment (for clarity, the possibility of preexisting gaps at the end of sequences is ignored).
Fig. 2.
Fig. 2.
The guide tree used for the D-loop alignment (Left) and the central part of the alignment (Right) as inferred by clustalw (A), JC- (B), JC+ (C), HKY+ (D), and HKY+F (E). In A, the gaps are gathered at the same sites but are inconsistent with the phylogeny, and the outgroup sequence (top) is clearly overmatched. In comparison with B, the new correction in C preferentially infers insertions instead of multiple deletions. As shown by C and D, the matching of sites depends on the substitution model used. In E, the insertions-open-forever rule prevents the distant outgroup being matched to sites earlier inferred as insertions and creates indels that are strictly consistent with the phylogeny. Blue, green, orange, and red represent the bases A, C, G, and T, respectively, and gray is used for sites where the true base is unknown. Up (blue) and down (red) triangles above the alignments show the inferred deletion and insertion events, respectively; at sites marked with diamonds (green), the state of the root sequence is unknown. See text for further details.

Comment in

  • Mind the gaps: progress in progressive alignment.
    Higgins DG, Blackshields G, Wallace IM. Higgins DG, et al. Proc Natl Acad Sci U S A. 2005 Jul 26;102(30):10411-2. doi: 10.1073/pnas.0504801102. Epub 2005 Jul 18. Proc Natl Acad Sci U S A. 2005. PMID: 16027352 Free PMC article. No abstract available.

References

    1. Levenshtein, V. I. (1966) Sov. Phys. Dokl. 10, 707-710.
    1. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453. - PubMed
    1. Sankoff, D. (2000) Bioinformatics 16, 41-47. - PubMed
    1. Gotoh, O. (1982) J. Mol. Biol. 162, 705-708. - PubMed
    1. Hirschberg, D. S. (1975) Commun. Assoc. Comput. Mach. 18, 341-343.

Publication types