Alignment-Based Phylogeny

September 15, 2021

Alignment-based phylogeny attempts to build phylogeny trees using elements of a multiple-alignments to a region of interest as nodes, with edges corresponding to string distances between the nodes.

Fit Quality

$Parsimony$ score measures the phylogeny quality; lower is better. Parsimony score is the sum of every edge weight in the phylogeny tree. When using alignment based methods, one may set the weight with a string distance function such as Hamming distance.

$Parsimony(T) = \sum\limits_{(u, v) \in T} Weight((u, v))$

Note that the parsimony method assumes the tree structure $T$ as an input. To specialize the problem more, we assume an input “Character” matrix $C$ of dimension $|T| \times m$. $C$ maps every node (both leaf and internal) to a string in the alignment. Assume the alignment does not allow indels, and therefore that every string is of length $m$. Therefore,

$Parsimony(T, C) = \sum\limits_{(u, v) \in T} HammingDistance(C_u, C_v)$

Small Parsimony

Given a fixed-length alignment $m$ (e.g. $AGGGT$ when $m=5$), for each of $n$ leaf nodes, and a proposed phylogeny tree $T$; output a proposed alignment (of length $m$) for each internal node of $T$ such that the alignments minimize the parsimony score of $T$.

More formally, given:

$m \times n$ alignment (or “Character”) matrix $C$, with $n$ leaf nodes having a length-$m$ alignment
A tree $T = \lbrace (u, v) \text{ such that } u \text{ is a parent of } v \rbrace$
The first $n$ rows of the character matrix $C$, corresponding to the alignment of the leaves

Output:

$C^* = \argmin\limits_{C} Parsimony(T, C)$

Naively minimizing $C^*$ is expensive, as the search space is exponential with the number of cells in $C$.

The key insight to solve the minimization is that the problem has two natural decompositions. First, decouple the columns: each of $C^*$’s columns can minimize hamming distance independently. Second, decouple the rows: the recursive structure of the tree allows for dynamic programming. Each row’s minimum score depends only on its children score.

A more formal solution outline is as follows:

Define $c_i = i^{th} \text{ column of }C$ (And: $c_i^* = i^{th} \text{ column of }C^*$).
Define column-wise parsimony, $$Parsimony(T, C) = \sum_{i=0}^m Parsimony(T, c_i)$$
Define a diff function, $$ \alpha(a, b) = \begin{cases} 0 &\text{ if } a = b \cr 1 &\text{ otherwise} \end{cases} $$
Define an alphabet $A$. Each alignment string will be in $A^m$. (For DNA strings, one could us the alphabet: $\lbrace a, g, c, t \rbrace$; one could also use an amino acid alphabet 20 characters)
Express the parsimony score $s$ recursively. The score $s_k(v)$ for assigning a given label $k \in A$ to node $v$ depends on $v$:

In the base case, when $v$ is a leaf node, find the diff from its assigned label. $s_k(v) = \alpha(C_{v, i}, k)$.
In the recursive case, when $v$ is a parent node, its score will depend on its children. For a given child, the score is the child’s minimum score, plus any difference in label from child to parent ($k$). The minimum parent score is the minimum score over all children.

$$s_k(v) = \sum\limits_{(u, v) \in T}\min_{i \in A}\lbrace s_i(u) + \alpha(i, k)\rbrace$$

Assign labels: column minimization gets the symbol minimizing its score. $c^*{i}(v) = \argmin\limits{k \in A} s_k(v)$. Break ties by keeping the parent’s label. The minimum parsimony score is the score of the root.
Output $C^*$ by concatenating the columns $ c_i^ *$.

Here, we assumed that $T$ was rooted. If $T$ is unrooted, we can make a new tree $T_{rooted}$ by adding an arbitrary root node to $T$, solve $SmallParsimony(T_{rooted}, C)$, and then remove the root.

Large Parsimony

The large parsimony problem finds a tree that minimizes parsimony for a set of leaves. While the exact solution is computationally hard, one can attempt to use heuristic searches to find a “good” tree.

The overall algorithm is to:

Begin with some tree $T$, perhaps computed from UPGMA or Neighbor Joining and a string-based distance matrix.
Solve the small parsimony problem on $T_0$.
Iteratively improve on the tree $T_k$.

Generate all “nearest neighbors” of $T_k$. A nearest neighbor trades children between two internal nodes.
Solve the small parsimony problem for all of the nearest neighbors of $T_k$.
Select a nearest-neighbor successor, $T_{k+1}$, which has an improved parsimony score.

Output the final tree $T_n$, when no small tweaks (nearest neighbor trees) can improve the score.

Source

This information comes from week 3 of the Coursera Molecular Evolution course.