Peptide Sequencing
A question that often comes up: given a peptide $P$, what is the sequence of amino acids that forms it?
Overall, this problem is hard and not well solved. According to Pevzner and Compeau in Bioinformatics Algorithms, the techniques in this post recover the biologically relevant sequence only about 30% of the time.
Mass Spectrometry
A mass spectrometer breaks molecules apart and weighs the fragments. We assume that the break points are uniformly distributed, so that every sub-peptide of $P$ is weighed with roughly equal probability. Given millions of identical copies of the peptide, the mass spectrometer will therefore report the mass of every sub-peptide.
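The theoretical spectrum, therefore, is simply the collection of masses of every sub-peptide of $P$. Which sub-peptides exist depends on whether the peptide is circular or linear: in a circular peptide, a sub-peptide may wrap around from the end of the string to its start.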
Amino Acid Mass Table
The table below shows the mass (in Daltons) of each amino acid, based on its composition.
amino acid | code | abbrev | composition | monoisotopic mass (Da) | average mass (Da) |
---|---|---|---|---|---|
glycine | G | GLY | $C_2H_3NO$ | 57.021463735 | 57.05132 |
alanine | A | ALA | $C_3H_5NO$ | 71.037113805 | 71.0779 |
serine | S | SER | $C_3H_5NO_2$ | 87.032028435 | 87.0773 |
proline | P | PRO | $C_5H_7NO$ | 97.052763875 | 97.11518 |
valine | V | VAL | $C_5H_9NO$ | 99.068413945 | 99.13106 |
threonine | T | THR | $C_4H_7NO_2$ | 101.047678505 | 101.10388 |
cysteine | C | CYS | $C_3H_5NOS$ | 103.009184505 | 103.1429 |
leucine | L | LEU | $C_6H_{11}NO$ | 113.084064015 | 113.15764 |
isoleucine | I | ILE | $C_6H_{11}NO$ | 113.084064015 | 113.15764 |
asparagine | N | ASN | $C_4H_6N_2O_2$ | 114.042927470 | 114.10264 |
aspartic acid | D | ASP | $C_4H_5NO_3$ | 115.026943065 | 115.0874 |
glutamine | Q | GLN | $C_5H_8N_2O_2$ | 128.058577540 | 128.12922 |
lysine | K | LYS | $C_6H_{12}N_2O$ | 128.094963050 | 128.17228 |
glutamic acid | E | GLU | $C_5H_7NO_3$ | 129.042593135 | 129.11398 |
methionine | M | MET | $C_5H_9NOS$ | 131.040484645 | 131.19606 |
histidine | H | HIS | $C_6H_7N_3O$ | 137.058911875 | 137.13928 |
phenylalanine | F | PHE | $C_9H_9NO$ | 147.068413945 | 147.17386 |
selenocysteine | U | SEC | $C_3H_5NOSe$ | 150.953633405 | 150.0379 |
arginine | R | ARG | $C_6H_{12}N_4O$ | 156.101111050 | 156.18568 |
tyrosine | Y | TYR | $C_9H_9NO_2$ | 163.063328575 | 163.17326 |
tryptophan | W | TRP | $C_{11}H_{10}N_2O$ | 186.079312980 | 186.2099 |
pyrrolysine | O | PYL | $C_{12}H_{19}N_3O_2$ | 237.147726925 | 237.29816 |
Table 1: from this site |
A simplified version, the integer mass table, is given here (in the form of a C++ map):
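```cpp
#include <map>

// Integer mass (in Daltons) of each amino acid, keyed by its one-letter code.
// Selenocysteine (U) and pyrrolysine (O) are not included.
static auto* kIntegerMassTable = new std::map<char, int>({
    {'G', 57},
    {'A', 71},
    {'S', 87},
    {'P', 97},
    {'V', 99},
    {'T', 101},
    {'C', 103},
    {'I', 113},
    {'L', 113},
    {'N', 114},
    {'D', 115},
    {'K', 128},
    {'Q', 128},
    {'E', 129},
    {'M', 131},
    {'H', 137},
    {'F', 147},
    {'R', 156},
    {'Y', 163},
    {'W', 186},
});
```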
Note that I and L share the integer mass 113, and K and Q share the integer mass 128, so integer masses alone cannot tell the members of either pair apart.
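To make the theoretical spectrum concrete, here is a minimal sketch that computes it from the integer masses. The names `MassTable`, `PrefixMasses`, and `TheoreticalSpectrum` are illustrative and not from any library. The sketch also assumes that the mass 0 of the empty sub-peptide is left out of the spectrum; some references include it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Integer masses copied from the table above.
const std::map<char, int>& MassTable() {
  static const std::map<char, int> table = {
      {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
      {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
      {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
      {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};
  return table;
}

// prefix[i] holds the total mass of the first i amino acids of `peptide`.
std::vector<int> PrefixMasses(const std::string& peptide) {
  std::vector<int> prefix(peptide.size() + 1, 0);
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    const auto it = MassTable().find(peptide[i]);
    assert(it != MassTable().end());  // every letter must be a known code
    prefix[i + 1] = prefix[i] + it->second;
  }
  return prefix;
}

// Sorted masses of every non-empty sub-peptide. When `circular` is true,
// sub-peptides that wrap around the end of the string are included too.
std::vector<int> TheoreticalSpectrum(const std::string& peptide,
                                     bool circular) {
  const std::vector<int> prefix = PrefixMasses(peptide);
  const int total = prefix.back();
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    for (std::size_t j = i + 1; j <= peptide.size(); ++j) {
      const int mass = prefix[j] - prefix[i];  // mass of peptide[i..j)
      spectrum.push_back(mass);
      // The complement of an interior stretch is a wrap-around sub-peptide.
      if (circular && i > 0 && j < peptide.size()) {
        spectrum.push_back(total - mass);
      }
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}
```

For the peptide NQEL, `TheoreticalSpectrum("NQEL", false)` gives 113, 114, 128, 129, 242, 242, 257, 370, 371, 484. With `circular` set to true, the wrap-around masses 227, 355, and 356 are added.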
Consistency
Two spectra $S_1$ and $S_2$ are consistent if one is a subset of the other. For example, the empty spectrum is consistent with all other spectra:
$S_1 = Spectrum(P = \epsilon) = \emptyset$ is a subset of all possible spectra.
More importantly, the spectrum of any sub-peptide of $P$ is consistent with $Spectrum(P)$:
$P_1 = Substring(P_2) \implies IsConsistent(Spectrum(P_1), Spectrum(P_2))$
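When growing a peptide sequence $P$ from length $L$ to $L + 1$ by appending amino acid $A$, every new sub-peptide ends in $A$. The spectrum therefore gains one new mass per suffix of $P$ (including the empty suffix): the mass of that suffix plus the mass of $A$.
Note that consistency need only be checked using the linear spectrum of a candidate, even when the target peptide is circular, because a partial candidate has not yet been closed into a ring.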
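The consistency test and the growth step can be sketched as follows. This is an illustration under stated assumptions: spectra are treated as sorted multisets, so a repeated mass must be matched by an equal number of repeats, and a peptide is represented by the list of its residue masses. The names `IsSubSpectrum`, `IsConsistent`, and `MassesGainedByAppending` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True when every mass in `part` also occurs in `whole`, counting repeats.
// Both inputs must be sorted in ascending order.
bool IsSubSpectrum(const std::vector<int>& part,
                   const std::vector<int>& whole) {
  assert(std::is_sorted(part.begin(), part.end()));
  assert(std::is_sorted(whole.begin(), whole.end()));
  return std::includes(whole.begin(), whole.end(), part.begin(), part.end());
}

// Two spectra are consistent if either one is contained in the other.
bool IsConsistent(const std::vector<int>& a, const std::vector<int>& b) {
  return IsSubSpectrum(a, b) || IsSubSpectrum(b, a);
}

// Masses gained by the linear spectrum when a residue of mass `added` is
// appended to a peptide whose residues have the masses in `residues`:
// one mass per suffix of the old peptide, starting with the empty suffix.
std::vector<int> MassesGainedByAppending(const std::vector<int>& residues,
                                         int added) {
  std::vector<int> gained = {added};
  int suffix_mass = 0;
  for (auto it = residues.rbegin(); it != residues.rend(); ++it) {
    suffix_mass += *it;
    gained.push_back(suffix_mass + added);
  }
  return gained;
}
```

Appending S (87) to GA (57, 71), for instance, adds the masses 87, 158, and 215, which belong to S, AS, and GAS.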
Algorithm
Consistency properties allow reverse-engineering candidate amino acid sequences for a spectrum $S$, by:
1. Initializing the candidate sequences with a single candidate, $P_0 = \epsilon$ (the empty string).
2. Extending every candidate sequence with every possible amino acid.
3. Pruning every candidate whose spectrum is not consistent with $S$.
4. Repeating steps 2 and 3 until either (a) one or more candidates $P$ have $Spectrum(P) = S$, or (b) all candidates are pruned.
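The steps above can be sketched as a short branch-and-bound loop. This version makes several assumptions that the text does not fix: the target is the linear spectrum of a linear peptide, it is given sorted, it omits the mass 0, and spectra are compared as multisets. For a circular peptide, the final equality check would use the cyclic spectrum instead. The names `kMass`, `LinearSpectrum`, and `SequencePeptide` are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Integer masses copied from the table above.
const std::map<char, int> kMass = {
    {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
    {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
    {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
    {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};

// Sorted masses of every non-empty contiguous sub-peptide.
std::vector<int> LinearSpectrum(const std::string& peptide) {
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    int mass = 0;
    for (std::size_t j = i; j < peptide.size(); ++j) {
      mass += kMass.at(peptide[j]);
      spectrum.push_back(mass);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}

// Returns every peptide whose linear spectrum equals `target` (sorted).
std::vector<std::string> SequencePeptide(const std::vector<int>& target) {
  assert(std::is_sorted(target.begin(), target.end()));
  if (target.empty()) return {std::string()};
  std::vector<std::string> candidates = {std::string()};  // step 1
  std::vector<std::string> matches;
  while (!candidates.empty() && matches.empty()) {  // step 4
    std::vector<std::string> survivors;
    for (const std::string& candidate : candidates) {
      for (const auto& entry : kMass) {  // step 2: try every amino acid
        const std::string extended = candidate + entry.first;
        const std::vector<int> spectrum = LinearSpectrum(extended);
        if (spectrum == target) {
          matches.push_back(extended);
        } else if (std::includes(target.begin(), target.end(),
                                 spectrum.begin(), spectrum.end())) {
          survivors.push_back(extended);  // step 3: keep consistent ones
        }
      }
    }
    candidates = std::move(survivors);
  }
  return matches;
}
```

Because a linear peptide and its reverse have the same spectrum, matches come in pairs: the target 57, 71, 87, 128, 158, 215 yields both GAS and SAG. Recomputing the whole spectrum for every extension keeps the sketch short; an incremental update would avoid the repeated work.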
Beam Search
Of course, the algorithm above assumes an error-free spectrum. In practice, the measured spectrum will not match the ideal one: some masses will be missing and some spurious masses will appear. Therefore, one can modify step 3 from requiring strict consistency to keeping only the most promising candidates. The score of a candidate is the number of masses in its generated spectrum that are also present in the target spectrum.
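A minimal sketch of the scoring and trimming step is shown below. The `Candidate` struct, the `Score` and `KeepBest` functions, and the `beam_width` parameter are all hypothetical names. Two choices here are assumptions rather than something the text prescribes: repeated masses are counted as a multiset, and candidates tied with the last kept score are retained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Number of masses shared by the two spectra, counting repeats.
int Score(std::vector<int> candidate_spectrum,
          std::vector<int> target_spectrum) {
  std::sort(candidate_spectrum.begin(), candidate_spectrum.end());
  std::sort(target_spectrum.begin(), target_spectrum.end());
  std::vector<int> shared;
  std::set_intersection(candidate_spectrum.begin(), candidate_spectrum.end(),
                        target_spectrum.begin(), target_spectrum.end(),
                        std::back_inserter(shared));
  return static_cast<int>(shared.size());
}

struct Candidate {
  std::string peptide;
  int score;
};

// Keeps the `beam_width` highest-scoring candidates, plus any candidates
// that tie with the lowest score that made the cut.
std::vector<Candidate> KeepBest(std::vector<Candidate> candidates,
                                std::size_t beam_width) {
  assert(beam_width > 0);
  std::stable_sort(candidates.begin(), candidates.end(),
                   [](const Candidate& a, const Candidate& b) {
                     return a.score > b.score;
                   });
  if (candidates.size() <= beam_width) return candidates;
  const int cutoff = candidates[beam_width - 1].score;
  std::size_t keep = beam_width;
  while (keep < candidates.size() && candidates[keep].score == cutoff) {
    ++keep;
  }
  candidates.resize(keep);
  return candidates;
}
```

In the loop from the previous section, `KeepBest` would replace the strict consistency test: every extended candidate is scored against the measured spectrum, and only the best ones are carried into the next round.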
Missing Masses: Spectral Convolution
To find missing masses, compute the positive differences between all pairs of elements in the spectrum and look for the most frequent (modal) values. The most common differences are likely to be the masses of individual amino acids (or other sub-components) of the molecule.
The masses with the top-$k$ counts in the range 57..200 can correlate with the amino acids present in the sequence.
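As a rough illustration, the convolution can be computed by counting every pairwise difference. The function name `TopConvolutionMasses` and its parameters are assumptions made for this sketch. Ties in count are broken by preferring the smaller mass, which is an arbitrary choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Returns up to `k` (mass, count) pairs: the most frequent positive pairwise
// differences of `spectrum` that lie in [min_mass, max_mass], most frequent
// first. Ties in count are ordered by ascending mass.
std::vector<std::pair<int, int>> TopConvolutionMasses(
    const std::vector<int>& spectrum, std::size_t k,
    int min_mass = 57, int max_mass = 200) {
  assert(min_mass <= max_mass);
  std::map<int, int> counts;  // difference -> number of pairs producing it
  for (std::size_t i = 0; i < spectrum.size(); ++i) {
    for (std::size_t j = i + 1; j < spectrum.size(); ++j) {
      const int diff = spectrum[i] > spectrum[j] ? spectrum[i] - spectrum[j]
                                                 : spectrum[j] - spectrum[i];
      if (diff >= min_mass && diff <= max_mass) ++counts[diff];
    }
  }
  // The map iterates in ascending mass order, so a stable sort by count
  // leaves ties ordered by mass.
  std::vector<std::pair<int, int>> ranked(counts.begin(), counts.end());
  std::stable_sort(ranked.begin(), ranked.end(),
                   [](const std::pair<int, int>& a,
                      const std::pair<int, int>& b) {
                     return a.second > b.second;
                   });
  if (ranked.size() > k) ranked.resize(k);
  return ranked;
}
```

For the spectrum 57, 71, 87, 128, 158, 215 (the linear spectrum of GAS), the three most frequent differences in range are 57, 71, and 87, each occurring twice. These are exactly the masses of G, A, and S.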
Statistical Significance
It can be helpful to compare any matches discovered against a randomly generated baseline.
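One possible baseline, sketched here as an assumption rather than as an established method, is to score random peptides of the same length against the spectrum and report the fraction that score at least as well as the match. The name `FractionAtLeast`, the uniform choice of amino acids, and the use of the linear spectrum are all choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <string>
#include <vector>

// Integer masses copied from the table above.
const std::map<char, int> kMass = {
    {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
    {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
    {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
    {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};

// Sorted masses of every non-empty contiguous sub-peptide.
std::vector<int> LinearSpectrum(const std::string& peptide) {
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    int mass = 0;
    for (std::size_t j = i; j < peptide.size(); ++j) {
      mass += kMass.at(peptide[j]);
      spectrum.push_back(mass);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}

// Number of masses of the peptide's spectrum found in the sorted target.
int Score(const std::string& peptide, const std::vector<int>& sorted_target) {
  const std::vector<int> spectrum = LinearSpectrum(peptide);
  std::vector<int> shared;
  std::set_intersection(spectrum.begin(), spectrum.end(),
                        sorted_target.begin(), sorted_target.end(),
                        std::back_inserter(shared));
  return static_cast<int>(shared.size());
}

// Fraction of `trials` uniformly random peptides of the given length whose
// score against `target` is at least `observed_score`.
double FractionAtLeast(int observed_score, std::size_t length,
                       std::vector<int> target, int trials, unsigned seed) {
  assert(trials > 0);
  std::sort(target.begin(), target.end());
  std::string alphabet;
  for (const auto& entry : kMass) alphabet.push_back(entry.first);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
  int at_least = 0;
  for (int t = 0; t < trials; ++t) {
    std::string random_peptide;
    for (std::size_t i = 0; i < length; ++i) {
      random_peptide.push_back(alphabet[pick(rng)]);
    }
    if (Score(random_peptide, target) >= observed_score) ++at_least;
  }
  return static_cast<double>(at_least) / trials;
}
```

A small fraction suggests the match is unlikely to have arisen by chance. A fraction near 1 means random peptides do just as well, so the match carries little information.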