Alignment
It is helpful to compare genomes by aligning them. There are many biological applications of determining an alignment between two strings.
Application: Decoding NRPs
One major application of finding alignments is decoding how NRPs are encoded from amino acid strings: there are “control” positions that indicate a certain location in the amino acid string, and “data” positions that indicate which non-proteinogenic amino acid is used. Marahiel used this approach to determine the A-domains coding for the Asp and Orn amino acids.
Hamming Distance
Hamming Distance simply counts the number of positions where there is a mismatch.
$H(a, b) = \sum_{i=1}^{|a|} I(a_i \neq b_i)$
Hamming distance is defined only for equal-length strings, and it is not a good measure of alignment: the positionality is not robust to small shifts. For example, the Hamming distance between the following strings is 4 (100% mismatch):
AGCT
GCTA
❌❌❌❌
Yet, a single positional shift of the second string aligns three of the four characters, leaving only the overhanging ends unmatched:
AGCT-
-GCTA
❌✔️✔️✔️❌
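Despite its weakness for alignment, the distance itself is trivial to compute. A minimal Python sketch:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two equal-length strings mismatch."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("AGCT", "GCTA"))  # → 4
```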
Longest Common Subsequence
A subsequence of a string is any string that can be formed from its characters in the order they appear (the characters need not be contiguous). For example, given the string "biology", we can form the subsequences "", "big", "log", "ily", and "iology".
The longest common subsequence between two strings is exactly as it sounds: a subsequence shared between two strings, of maximum length. Crucially, the subsequence allows “insertions” and “deletions,” making it a better candidate for alignment. For example:
LCS("AGCT", "GCTA") = "GCT"
LCS("ab", "ccabc") = "ab"
LCS("abc", "axxxbxxxc") = "abc"
The Longest Common Subsequence problem is a natural fit to dynamic programming. Longest common subsequence is a special case of finding the longest path in a Directed Acyclic Graph (DAG). The [Needleman-Wunsch algorithm] can solve LCS with dynamic programming.
The simplest form of LCS will score matching indices with score 1, and a score of zero for all others.
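A minimal Python sketch of this simplest scheme, with backtracking over the DP table to recover one LCS:

```python
def lcs(a: str, b: str) -> str:
    """Longest common subsequence via dynamic programming."""
    n, m = len(a), len(b)
    # M[i][j] = length of the LCS of a[:i] and b[:j].
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1  # match scores 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])  # everything else 0
    # Backtrack from the sink to recover one LCS.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("AGCT", "GCTA"))  # → "GCT"
```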
However, one can also add an indel, or insertion-deletion, marker: an indel means the strings can still match if a character from one is deleted, or inserted, at the given position. Allowing indels should come at a penalty. Further, there should also be a mismatch penalty (applied when taking a character from both strings and they don’t match).
Further, when matching biological sequences, some mutations are more likely than others, and so we would want to penalize biologically “likely” mismatches less. Scoring matrices such as PAM250 and BLOSUM62 are often used for this.
Edit distance
Edit distance is the minimum number of changes needed to convert a string $a$ into a string $b$. It can be defined as the number of mismatches and indels in an optimal global alignment between the two strings.
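This definition can be computed with the same dynamic-programming table using unit costs. A minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum mismatches + indels to turn a into b."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Converting to/from the empty string costs one indel per character.
    for i in range(n + 1):
        M[i][0] = i
    for j in range(m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # mismatch costs 1
            M[i][j] = min(M[i - 1][j - 1] + cost,  # (mis)match
                          M[i - 1][j] + 1,         # deletion
                          M[i][j - 1] + 1)         # insertion
    return M[n][m]

print(edit_distance("AGCT", "GCTA"))  # → 2 (delete leading A, append A)
```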
Types
Alignments can be global (whole-string to whole-string), local (partial-string to partial-string), or fitting (one smaller string is globally matched against substrings of another, longer string).
Loosely, the table of alignment types is given below:
| | a Partial | a Full |
| --- | --- | --- |
| b Partial | Local | Fitting(a -> b) |
| b Full | Fitting(b -> a) | Global |
The global constraint can limit biological utility; one is not always sure which areas are relevant to compare among two strings.
- Local alignment removes the constraint of beginning at the start, and ending at the end. Local alignments help to find comparable regions, when strings are roughly matching (for example, the DNA encoding homeobox genes for mice and humans).
- Fitting alignment is helpful when attempting to match a smaller, known sequence into a closely-fitting region of a larger one. Effectively, fitting alignment only permits indels in the smaller string; only exact substring matches in the longer sequence are allowed.
Finally, overlap alignment allows matching the suffix of one string to the prefix of another, to determine how well two strings could fit together.
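As a concrete example of relaxing the global constraint, a local alignment score can be computed by adding a zero floor to the recurrence, letting an alignment restart anywhere. A Smith-Waterman-style sketch (the scoring values here are illustrative):

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1,
                          sigma: int = 2) -> int:
    """Best alignment score over all pairs of substrings of a and b."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option is a "free ride" from the source: the local
            # alignment may begin at any cell.
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] - sigma,
                          M[i][j - 1] - sigma)
            # The local alignment may also end at any cell.
            best = max(best, M[i][j])
    return best

print(local_alignment_score("AGCT", "GCTA"))  # → 3 (the shared "GCT")
```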
Matrix
The alignment matrix $M$ between two strings $a$ and $b$ is an $(|a| + 1) \times (|b| + 1)$ matrix where $M_{i, j}$ provides the maximum alignment score over all possible subsequences of the prefixes $a_{0..i-1}$ and $b_{0..j-1}$. Notationally, I am assuming OFFSET notation (zero-indexing: $a_0$ is the first char of $a$).
With an indel penalty of $\sigma$ and a char-comparison matrix $Score$[^1], the recurrence relation is typically:
$$ M_{i, j} = \max \begin{cases} Score(a_{i-1}, b_{j-1}) + M_{i - 1, j - 1} \cr M_{i - 1, j} - \sigma \cr M_{i, j - 1} - \sigma \cr \end{cases} $$
Some examples using the output:
- $M_{0, 0}$ = the alignment score from empty-string to empty-string (0).
- $M_{|a|, |b|}$ = the maximum alignment score from all possible subsequences of $a$ with all subsequences of $b$.
- $\max_{i\in{0..|a|}}{M_{i, |b|}}$ = the maximum alignment score from all subsequences of all prefixes of $a$ with all subsequences of $b$.
Initialization of the first row and column also determines which subsequences are under consideration:
- For global alignment, one can initialize the first row/column with accumulating indel penalties (thus requiring each char to be scored).
- To consider subsequences of suffixes, one could initialize the first row/column to zero, allowing a new sequence to begin at each position.
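The recurrence with global initialization can be sketched as follows (the match/mismatch/indel values are illustrative choices, not a standard scoring matrix):

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           sigma: int = 2) -> int:
    """Needleman-Wunsch-style global alignment score with linear indels."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Global alignment: the first row/column accumulate indel penalties,
    # forcing every character to be scored.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - sigma
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # (mis)match
                          M[i - 1][j] - sigma,   # indel in b
                          M[i][j - 1] - sigma)   # indel in a
    return M[n][m]

print(global_alignment_score("AGCT", "AGCT"))  # → 4
```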
Affine Penalty
DNA replication errors can insert or delete entire intervals of $k$ nucleotides at once. To make alignments more biologically relevant, one can apply an affine gap penalty to account for this case. The affine gap penalty is decomposed into an opening penalty $o$ and an extension penalty $e$, with the extension penalty typically being smaller than the opening penalty.
$AffineGapPenalty(k) = o + (k - 1) e$
Tracking the gap-opening penalty can be done by adding two more matrices, purely for indel (gap) tracking. Entering these new matrices from the original alignment matrix costs $o$, and extending a gap within them costs $e$. An extra edge from either gap matrix back to the original one costs zero.
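A sketch of this three-matrix scheme. `X` tracks gaps in $b$ (deletions from $a$) and `Y` tracks gaps in $a$; the penalty values are illustrative:

```python
def affine_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           o: int = 4, e: int = 1) -> float:
    """Global alignment score with affine gap penalty o + (k - 1) * e."""
    NEG = float("-inf")
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # main (match) layer
    X = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap running in b
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap running in a
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -o - (i - 1) * e  # open, then extend, a leading gap
    for j in range(1, m + 1):
        Y[0][j] = -o - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Entering a gap layer from M costs o; staying in it costs e.
            X[i][j] = max(M[i - 1][j] - o, X[i - 1][j] - e)
            Y[i][j] = max(M[i][j - 1] - o, Y[i][j - 1] - e)
            # Returning from a gap layer to M costs zero.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
    return max(M[n][m], X[n][m], Y[n][m])
```

For example, aligning "AAA" against "A" opens one gap of length $k = 2$, scoring $1 - (o + e) = -4$ with the defaults above.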
Space-Efficient Alignment
The alignment matrix costs $O(nm)$ space to store the back-pointers, but the score alone needs only $O(\max{\lbrace n, m\rbrace})$ space, since the recurrence only ever reads the previous row or column. At the cost of time complexity, we can compute the alignment in less space.
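For example, the LCS length can be computed keeping only two rows of the table at a time:

```python
def lcs_length_linear_space(a: str, b: str) -> int:
    """LCS length in O(|b|) space: only the previous row is retained."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr  # drop the older row; back-pointers are never stored
    return prev[-1]

print(lcs_length_linear_space("AGCT", "GCTA"))  # → 3
```

Note this yields only the score; recovering the path itself in linear space is what the divide-and-conquer approach below provides.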
A divide-and-conquer algorithm can recursively reconstruct the path in linear space by computing the middle node of the optimal alignment path. For a given column, the optimal score alone does not identify a unique node; however, there is a node $n_i$ in that column for which $OptimalLengthFromSource(i) + OptimalLengthToSink(i)$ equals the optimal total score, and that node is guaranteed to lie on an optimal path.
$OptimalLengthFromSource$ can be computed in linear space (sweeping an $n \times \frac m 2$ region); $OptimalLengthToSink$ is a simple mirror of $OptimalLengthFromSource$: one can be derived from the other by reversing the strings.
Once the middle node is determined, the problem is divided into finding best path to the left, and the best path to the right.
In the ideal case, with many (mis)matches along the diagonal, the area swept by subsequent subproblems halves each time, and the total work is about $2nm$. But in the worst case, when the path is over pure indels and the subproblems shrink by only one node per iteration (a single row or column), the runtime blows up to $n^2m^2$.
Multiple Sequence Alignment
Often, there are multiple sequences of interest to align. Some approaches for finding alignments among them are:
- Naive: create an N-d tensor for N sequences, with a size-N backtrack vector. This explodes in memory requirements quite fast, and is not practical.
- Greedy: align all the strings pair-wise, and take them in decreasing order of score.
- Median string: Align all the strings to a median string, based on a profile matrix of frequencies.
To score a given multiple-alignment, one could use a few different functions:
- LongestCommonSubsequence: column-level matches are rewarded (+1). Any mismatch or indel in the column has zero-score.
- Entropy: column-level sum over $entropy = -\sum p_x \log(p_x)$, where $x$ is each symbol present in the column.
- Sum-of-Pairs: sum the scores of each pairwise alignment $(i, j)$. Enables complex scoring functions/scoring matrices.
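The entropy score for a single column can be sketched as:

```python
from collections import Counter
from math import log2

def column_entropy(column: str) -> float:
    """Entropy of one alignment column: -sum p_x * log2(p_x) over symbols."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(column_entropy("AAAA"))  # 0.0 — perfectly conserved column
print(column_entropy("ACGT"))  # 2.0 — maximally mixed column
```

Lower total entropy across columns indicates a more conserved (better) multiple alignment.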
[^1]: Typically, $Score$ will penalize mismatches and reward matches.