Distance-Based Phylogeny

September 15, 2021

Distance-based phylogeny tree construction attempts to fit a tree to a distance matrix $D$.

Section	Summary
Additive Distance Matrix Fitting	When the Four Points Condition is met, tree construction exactly matches the distance matrix.
UPGMA	Greedy, bottom-up clustering, using an ultrametric of age to weight the resulting tree.
Neighbor Joining	Greedy, bottom-up clustering based on a modified distance matrix.

Assumptions

The distance function $D$ must be symmetric, non-negative, and satisfy the triangle inequality ($D_{i,k} + D_{k, j} \ge D_{i, j}$).¹

We also assume we could, for a given tree $T$, find pairwise distances between all leaves.

Fit Quality: Discrepancy

To measure the quality of the tree fitting, one can find the $Discrepancy$ between an approximation tree $T$ and a distance matrix $D$ by computing all-pairs shortest-path lengths $d$ from $T$, and comparing to $D$.

Specifically, one can find $Discrepancy(T, D) = \sum_{i \lt j}{(d_{i, j}(T) - D_{i,j})^2}$. Smaller is better.

In general, finding a tree that minimizes the discrepancy is a computationally hard problem.²

Four Point Theorem: Exact Fit

Sometimes, a tree can fit a distance matrix exactly ($Discrepancy(T, D) = 0$). In this rare case, the distance matrix is termed “additive.”

The distance matrix is additive $\iff$ a tree can be created from the distance matrix; each entry in the distance matrix $D_{i, j}$ must correspond to the path in the tree connecting $i, j$.

Four Point Theorem: Distance matrix is additive, $\iff \forall (i, j, k, l)$, two sums are equal and one is less than or equal to the other two out of:

$D_{i, j} + D_{k, l}$
$D_{i, k} + D_{j, l}$
$D_{i, l} + D_{j, k}$

Proof: If the distance matrix is additive, then it corresponds to a tree. The three above equations find the total distance for different pairings of leaves within the quartet. If node $i$ belongs to the same subtree as its partner, the total paired distances never connect subtrees. But, when the two pairs connect different subtrees, the additional distance connecting them is symmetrical, since each pair is connected symetrically across the divide. (Of course, if all four nodes belong to the same subtree, then the distances among them are always equal).

Forward: Therefore, given an additive distance matrix and its corresponding tree, either (a) one sum will be less (by the distance to connect subtrees), and the two (symmetrical) sums will be larger and equal or (b) the sums will be equal.
Reverse: if the matrix is not additive, then no tree can be constructed. There must therefore exist some cycle within the graph corresponding to the distance matrix. Selecting four nodes in that cycle will violate the constraint: specifically, the symmetrical paired-distances will not be equal.

Tree Construction From Additive Distance Matrices

Given an additive distance matrix $D$, form a phylogeny tree by recursively removing a leaf node and its limb length from the distance matrix (until two leaves are left); then recursively add back the limbs and leaves, attaching where the limb length is zero.

A limb as the edge connecting a leaf to its parent. Even though the parent is latent when building phylogenies, with an additive distance matrix, one could find the distances from a leaf to a pair of two other leaves, then subtract the distance between the leaves (which may go through the parent), and divide by two (since the distance is travelled from each leaf). The minimum distance will be the limb length.

Tree Fitting Non-Additive Distance Matrices

The additivity constraint is quite strong, and few real-world biological problems will be strictly additive (e.g. noise in alignment, sequencing. Life is messy).

UltraMetric

Making two assumptions, we can create an “Ultra-Metric” of age in the phylogeny tree:

Each internal node comes from a speciation event, in which a single ancestor species was split into a new species.
Each internal node can be assigned an “age,” relative to some grounding year.

By (2), assume all present-day species have age of zero, the “root” ancestor has the maximum age, and all internal species to the phylogeny have some age in-between. The path between two nodes in the phylogeny tree represents the time difference between each node’s speciation event.

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

UPGMA performs greedy, bottom-up clustering on the distance matrix $D$. When creating the clusters, a tree is created with an UltraMetric of age.

The algorithm is roughly:

Given an initial distance matrix $D$, iteratively merge the two closest elements $(C_i, C_j)$ of $D$ into a new cluster $C_{new}$, which has $U(C_{new}) = Avg(U(C_i), U(C_j))$. Add a node to the tree $T$, via edges $(C_i, C_{new})$ and $(C_j, C_{new})$, with weights being the difference in the ultrametric. Return $T$ when there is only a single cluster (the root) remaining.

Neighbor-Joining

Neighbor-joining is also a greedy, bottom-up clustering algorithm. The joining uses a modified distance matrix $D^*$ to find pairs of leaves to join. The modified matrix depends on first computing total distances $d^{total}$ from one leaf the others:

$$d^{total}_ {i} = \sum_{j=0}^n D_{i,j}$$

Then, computing the leaf-joining distance matrix as follows:

$$D^*{i, j} = (n-2) \times D{i,j} - d^{total}_j - d^{total}_i$$

Breaking down each term in the minimization

The $(n-2) \times D_{i,j}$ term ensures that leaf pairs farther apart from eachother are not selected.
The $- d^{total}$ terms ensure that leaves with larger limb length are more likely to be selected.

The hamming distance between two alignment strings satisfies each property: symmetric (the differing positions are identical in each string), non-negative (minimum distance is 0, for matching strings), and satisfies the triangle inequality (intermediate string can only increase mismatches). ↩︎
In fact, it is NP-Complete. The tree-fitting problem for phylogenies is related to the Steiner Tree Problem; the proof is outlined Day, W. H. and D. Sankoff.“Computational complexity of inferring phylogenies from chromosome inversion data.” Journal of theoretical biology 124 2 (1987): 213-8 ., which reduces the vertex cover problem to the Steiner trees, and vice-versa. ↩︎