Peptide Sequencing

A question that often comes up is: given a peptide $P$, what is the sequence of amino acids that forms it?

Overall, this problem is hard and not well solved. According to Pevzner and Compeau in Bioinformatics Algorithms, the techniques in this post recover the biologically relevant sequence about 30% of the time.

Mass Spectrometry

A mass spectrometer breaks molecules apart and weighs the fragments. We assume that the break points selected by the mass spectrometer are uniformly distributed, so that every sub-peptide of $P$ is weighed with roughly equal probability. Given millions of identical peptides, the mass spectrum will therefore contain the mass of every sub-peptide.

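The theoretical spectrum, therefore, is simply the collection of masses of every sub-peptide of $P$, where a sub-peptide is a contiguous run of amino acids. Depending on whether the peptide is circular or linear, the set of sub-peptides differs: in a circular peptide, a sub-peptide may wrap around from the end of the string back to its start.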
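As a rough illustration, the theoretical spectrum can be computed from prefix masses. This is a minimal sketch, not a definitive implementation: the names `IntegerMass` and `TheoreticalSpectrum` are made up for this example, it uses integer masses, and it omits the mass 0 of the empty sub-peptide (some texts include it).

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Integer masses (Daltons) of the 20 standard amino acids (hypothetical helper).
const std::map<char, int>& IntegerMass() {
  static const std::map<char, int> table = {
      {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
      {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
      {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
      {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};
  return table;
}

// Returns the sorted masses of every non-empty sub-peptide of `peptide`.
// If `cyclic` is true, sub-peptides that wrap around the end are included.
std::vector<int> TheoreticalSpectrum(const std::string& peptide, bool cyclic) {
  const int n = static_cast<int>(peptide.size());
  // prefix[i] holds the mass of the first i amino acids.
  std::vector<int> prefix(n + 1, 0);
  for (int i = 0; i < n; ++i) {
    prefix[i + 1] = prefix[i] + IntegerMass().at(peptide[i]);
  }
  const int total = prefix[n];
  std::vector<int> spectrum;
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j <= n; ++j) {
      const int mass = prefix[j] - prefix[i];  // sub-peptide [i, j)
      spectrum.push_back(mass);
      // The complement of an interior sub-peptide is a wrap-around
      // sub-peptide, which only exists when the peptide is circular.
      if (cyclic && i > 0 && j < n) {
        spectrum.push_back(total - mass);
      }
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}
```

For the peptide NQEL, `TheoreticalSpectrum("NQEL", false)` yields 113, 114, 128, 129, 242, 242, 257, 370, 371, 484; with `cyclic` set to `true` the wrap-around masses 227, 355 and 356 are added.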

Amino Acid Mass Table

The table below shows the mass (in Daltons) of each amino acid, based on its composition.

| amino acid | code | abbrev | composition | monoisotopic mass (Da) | average mass (Da) |
|---|---|---|---|---|---|
| glycine | G | GLY | $C_2H_3NO$ | 57.021463735 | 57.05132 |
| alanine | A | ALA | $C_3H_5NO$ | 71.037113805 | 71.0779 |
| serine | S | SER | $C_3H_5NO_2$ | 87.032028435 | 87.0773 |
| proline | P | PRO | $C_5H_7NO$ | 97.052763875 | 97.11518 |
| valine | V | VAL | $C_5H_9NO$ | 99.068413945 | 99.13106 |
| threonine | T | THR | $C_4H_7NO_2$ | 101.047678505 | 101.10388 |
| cysteine | C | CYS | $C_3H_5NOS$ | 103.009184505 | 103.1429 |
| leucine | L | LEU | $C_6H_{11}NO$ | 113.084064015 | 113.15764 |
| isoleucine | I | ILE | $C_6H_{11}NO$ | 113.084064015 | 113.15764 |
| asparagine | N | ASN | $C_4H_6N_2O_2$ | 114.042927470 | 114.10264 |
| aspartic acid | D | ASP | $C_4H_5NO_3$ | 115.026943065 | 115.0874 |
| glutamine | Q | GLN | $C_5H_8N_2O_2$ | 128.058577540 | 128.12922 |
| lysine | K | LYS | $C_6H_{12}N_2O$ | 128.094963050 | 128.17228 |
| glutamic acid | E | GLU | $C_5H_7NO_3$ | 129.042593135 | 129.11398 |
| methionine | M | MET | $C_5H_9NOS$ | 131.040484645 | 131.19606 |
| histidine | H | HIS | $C_6H_7N_3O$ | 137.058911875 | 137.13928 |
| phenylalanine | F | PHE | $C_9H_9NO$ | 147.068413945 | 147.17386 |
| selenocysteine | U | SEC | $C_3H_5NOSe$ | 150.953633405 | 150.3079 |
| arginine | R | ARG | $C_6H_{12}N_4O$ | 156.101111050 | 156.18568 |
| tyrosine | Y | TYR | $C_9H_9NO_2$ | 163.063328575 | 163.17326 |
| tryptophan | W | TRP | $C_{11}H_{10}N_2O$ | 186.079312980 | 186.2099 |
| pyrrolysine | O | PYL | $C_{12}H_{19}N_3O_2$ | 237.147726925 | 237.29816 |
Table 1: from this site

A simplified version, the integer mass table, is given here (in the form of a C++ map):

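  #include <map>

  // Integer mass (in Daltons) of each of the 20 standard amino acids,
  // keyed by one-letter code. The map is heap-allocated and intentionally
  // never freed, so it is never destroyed at program exit.
  static auto* kIntegerMassTable = new std::map<char, int>({
    {'G', 57},
    {'A', 71},
    {'S', 87},
    {'P', 97},
    {'V', 99},
    {'T', 101},
    {'C', 103},
    {'I', 113},
    {'L', 113},
    {'N', 114},
    {'D', 115},
    {'K', 128},
    {'Q', 128},
    {'E', 129},
    {'M', 131},
    {'H', 137},
    {'F', 147},
    {'R', 156},
    {'Y', 163},
    {'W', 186},
  });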

Note that I and L share the same integer mass (113), as do K and Q (128), so integer masses alone cannot tell the members of each pair apart.

Consistency

Two spectra $S_1$ and $S_2$ are consistent if one is a subset of the other. For example, the empty spectrum is consistent with all other spectra:

$S_1 = Spectrum(P = \epsilon) = \emptyset$ is a subset of all possible spectra.

More importantly, the spectrum of any sub-peptide of $P$ is consistent with $Spectrum(P)$:

$P_1 = Substring(P_2) \implies IsConsistent(Spectrum(P_1), Spectrum(P_2))$

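When growing a linear peptide $P$ from length $L$ to $L + 1$ by appending amino acid $A$, the only new sub-peptides are those that end in $A$. The spectrum therefore keeps every mass it already had and gains the mass of $A$ added to the mass of each suffix of $P$ (including the empty suffix).

Note that consistency need only be checked for the linear spectrum of a candidate, since a partial candidate is a linear fragment even when the target peptide is circular.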
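The consistency check can be sketched as a multiset inclusion test. This is only an illustration under stated assumptions: spectra are stored as vectors of integer masses, repeated masses count separately, and the names `IsSubSpectrum` and `IsConsistent` are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// True if every mass in `small` also occurs in `large`, counting repeats:
// a mass that appears twice in `small` must appear at least twice in `large`.
bool IsSubSpectrum(std::vector<int> small, std::vector<int> large) {
  std::sort(small.begin(), small.end());
  std::sort(large.begin(), large.end());
  // std::includes works on sorted ranges and respects multiplicity.
  return std::includes(large.begin(), large.end(),
                       small.begin(), small.end());
}

// Two spectra are consistent if one is a subset of the other.
bool IsConsistent(const std::vector<int>& s1, const std::vector<int>& s2) {
  return IsSubSpectrum(s1, s2) || IsSubSpectrum(s2, s1);
}
```

With this definition the empty spectrum is consistent with everything, while `{99}` and `{113}` are not consistent with each other because neither contains the other.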

Algorithm

These consistency properties allow us to reverse-engineer candidate amino acid sequences for a spectrum $S$ by:

  1. Initializing the candidate sequences with a single candidate, $P_0 = \epsilon$ (the empty string),
  2. Extending every candidate sequence with every possible amino acid,
  3. Pruning every candidate whose spectrum is not consistent with $S$, and
  4. Repeating steps 2 and 3 until either (a) one or more candidates $P$ have $Spectrum(P) = S$, or (b) all candidates are pruned.
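The four steps above could be sketched roughly as follows. This is a minimal sketch under several assumptions: the target spectrum is error-free, integer masses are used, the mass 0 is not part of a spectrum, and the names `kMass`, `Spectrum` and `SequencePeptide` are invented for the example.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Integer masses (Daltons) of the 20 standard amino acids.
const std::map<char, int> kMass = {
    {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
    {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
    {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
    {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};

// Sorted masses of all non-empty sub-peptides (wrap-arounds if cyclic).
std::vector<int> Spectrum(const std::string& p, bool cyclic) {
  const int n = static_cast<int>(p.size());
  std::vector<int> prefix(n + 1, 0);
  for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + kMass.at(p[i]);
  std::vector<int> out;
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j <= n; ++j) {
      out.push_back(prefix[j] - prefix[i]);
      if (cyclic && i > 0 && j < n) {
        out.push_back(prefix[n] - (prefix[j] - prefix[i]));
      }
    }
  }
  std::sort(out.begin(), out.end());
  return out;
}

// Returns every peptide whose spectrum equals `target` exactly.
std::vector<std::string> SequencePeptide(std::vector<int> target, bool cyclic) {
  std::sort(target.begin(), target.end());
  std::vector<std::string> candidates = {""};  // step 1
  std::vector<std::string> matches;
  while (!candidates.empty() && matches.empty()) {  // step 4
    std::vector<std::string> next;
    for (const std::string& candidate : candidates) {
      for (const auto& entry : kMass) {  // step 2: extend
        const std::string grown = candidate + entry.first;
        const std::vector<int> linear = Spectrum(grown, false);
        // Step 3: prune unless the linear spectrum is contained in target.
        if (!std::includes(target.begin(), target.end(),
                           linear.begin(), linear.end())) {
          continue;
        }
        if (Spectrum(grown, cyclic) == target) {
          matches.push_back(grown);
        } else {
          next.push_back(grown);
        }
      }
    }
    candidates = std::move(next);
  }
  return matches;
}
```

Because I/L and K/Q share integer masses, and a linear peptide and its reverse have the same spectrum, the result usually contains several equivalent strings. For the linear spectrum of NQEL, for instance, this sketch returns eight peptides, including NQEL, NKEI and LEQN.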

Of course, the algorithm above assumes an error-free spectrum. In practice, the measured spectrum will not match the ideal one. Therefore, one can relax step 3 from requiring strict consistency to keeping only the most promising candidates. The score of a candidate is the number of masses in its generated spectrum that are also present in the target spectrum.
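A scoring function and a "keep the most promising candidates" step might look like the sketch below. The names `Score`, `Scored` and `KeepBest` are hypothetical, and keeping candidates that tie with the last retained score is one possible design choice rather than something the text above prescribes.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Number of masses shared by the two spectra, counting repeats: a mass that
// occurs m times in one and n times in the other contributes min(m, n).
int Score(std::vector<int> candidate, std::vector<int> target) {
  std::sort(candidate.begin(), candidate.end());
  std::sort(target.begin(), target.end());
  std::vector<int> shared;
  std::set_intersection(candidate.begin(), candidate.end(),
                        target.begin(), target.end(),
                        std::back_inserter(shared));
  return static_cast<int>(shared.size());
}

struct Scored {
  std::string peptide;
  int score;
};

// Keeps the n highest-scoring candidates, plus any that tie with the n-th.
std::vector<Scored> KeepBest(std::vector<Scored> board, std::size_t n) {
  if (n == 0) return {};
  std::stable_sort(board.begin(), board.end(),
                   [](const Scored& a, const Scored& b) {
                     return a.score > b.score;
                   });
  if (board.size() <= n) return board;
  const int cutoff = board[n - 1].score;
  // Everything from the first score below the cutoff onward is dropped.
  auto first_dropped = std::find_if(
      board.begin() + static_cast<std::ptrdiff_t>(n), board.end(),
      [cutoff](const Scored& s) { return s.score < cutoff; });
  board.erase(first_dropped, board.end());
  return board;
}
```

For example, `Score({113, 114, 242, 242}, {113, 242, 300})` is 2, because only 113 and one copy of 242 are shared.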

Missing Masses: Spectral Convolution

To find missing masses, compute the positive differences between all pairs of masses in the spectrum and look for the most frequent (modal) values. The most common differences are likely to be individual amino acids (or other sub-components) of the molecule.

The $k$ most frequent differences in the range 57..200 tend to correspond to the amino acids present in the sequence.
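A minimal sketch of this convolution step follows. The function name `TopConvolutionMasses`, its default range arguments, and the tie-breaking rule (smaller mass first) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Returns up to k masses in [lo, hi] that occur most often among the
// positive pairwise differences of `spectrum`, most frequent first.
std::vector<int> TopConvolutionMasses(const std::vector<int>& spectrum,
                                      std::size_t k, int lo = 57,
                                      int hi = 200) {
  // Count how often each in-range difference occurs.
  std::map<int, int> counts;
  for (int a : spectrum) {
    for (int b : spectrum) {
      const int diff = a - b;
      if (diff >= lo && diff <= hi) ++counts[diff];
    }
  }
  // Rank by count, descending. The map iterates in ascending mass order and
  // the sort is stable, so ties are broken in favour of the smaller mass.
  std::vector<std::pair<int, int>> ranked(counts.begin(), counts.end());
  std::stable_sort(ranked.begin(), ranked.end(),
                   [](const std::pair<int, int>& x,
                      const std::pair<int, int>& y) {
                     return x.second > y.second;
                   });
  std::vector<int> top;
  for (std::size_t i = 0; i < ranked.size() && i < k; ++i) {
    top.push_back(ranked[i].first);
  }
  return top;
}
```

For the spectrum `{0, 137, 186, 323}`, the differences 137 and 186 each occur twice, so `TopConvolutionMasses({0, 137, 186, 323}, 2)` returns 137 and 186, the integer masses of H and W.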

Statistical Significance

It can be helpful to compare any matches discovered against a randomly generated baseline.