Peptide Sequencing

March 20, 2021

A question that often comes up: given a peptide $P$, what is the sequence of amino acids that forms it?

Overall, this problem is hard and not well solved. According to Pevzner and Compeau in Bioinformatics Algorithms, the techniques in this post recover the biologically relevant sequence about 30% of the time.

Mass Spectrometry

A mass spectrometer breaks molecules apart and weighs the fragments. We assume that the break points are uniformly distributed, so that every sub-peptide of $P$ is weighed with roughly equal probability. Given millions of identical copies of the peptide, the mass spectrum will then contain the mass of every sub-peptide.

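The theoretical spectrum, therefore, is simply the collection of masses of every sub-peptide of $P$. A sub-peptide is a contiguous run of amino acids in $P$. Which runs count depends on whether the peptide is circular or not: in a circular peptide, a sub-peptide may also wrap around from the end of the sequence to the start.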

Amino Acid Mass Table

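The table below shows the monoisotopic and average mass (in Daltons) of each amino acid residue, based on its composition. These are residue masses: the free amino acid minus the water molecule lost when a peptide bond forms.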

amino acid	code	abbrev	composition	mono mass	avg mass
glycine	G	GLY	$C_2H_3NO$	57.021463735	57.05132
alanine	A	ALA	$C_3H_5NO$	71.037113805	71.0779
serine	S	SER	$C_3H_5NO_2$	87.032028435	87.0773
proline	P	PRO	$C_5H_7NO$	97.052763875	97.11518
valine	V	VAL	$C_5H_9NO$	99.068413945	99.13106
threonine	T	THR	$C_4H_7NO_2$	101.047678505	101.10388
cysteine	C	CYS	$C_3H_5NOS$	103.009184505	103.1429
leucine	L	LEU	$C_6H_{11}NO$	113.084064015	113.15764
isoleucine	I	ILE	$C_6H_{11}NO$	113.084064015	113.15764
asparagine	N	ASN	$C_4H_6N_2O_2$	114.042927470	114.10264
aspartic acid	D	ASP	$C_4H_5NO_3$	115.026943065	115.0874
glutamine	Q	GLN	$C_5H_8N_2O_2$	128.058577540	128.12922
lysine	K	LYS	$C_6H_{12}N_2O$	128.094963050	128.17228
glutamic acid	E	GLU	$C_5H_7NO_3$	129.042593135	129.11398
methionine	M	MET	$C_5H_9NOS$	131.040484645	131.19606
histidine	H	HIS	$C_6H_7N_3O$	137.058911875	137.13928
phenylalanine	F	PHE	$C_9H_9NO$	147.068413945	147.17386
selenocysteine	U	SEC	$C_3H_5NOSe$	150.953633405	150.0379
arginine	R	ARG	$C_6H_{12}N_4O$	156.101111050	156.18568
tyrosine	Y	TYR	$C_9H_9NO_2$	163.063328575	163.17326
tryptophan	W	TRP	$C_{11}H_{10}N_2O$	186.079312980	186.2099
pyrrolysine	O	PYL	$C_{12}H_{19}N_3O_2$	237.147726925	237.29816
Table 1: from this site

A simplified version, the integer mass table, is given here (in the form of a C++ map):

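  // Integer residue masses in Daltons, keyed by one-letter amino acid code.
  // Needs #include <map>.
  static auto* kIntegerMassTable = new std::map<char, int>({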
    {'G', 57},
    {'A', 71},
    {'S', 87},
    {'P', 97},
    {'V', 99},
    {'T', 101},
    {'C', 103},
    {'I', 113},
    {'L', 113},
    {'N', 114},
    {'D', 115},
    {'K', 128},
    {'Q', 128},
    {'E', 129},
    {'M', 131},
    {'H', 137},
    {'F', 147},
    {'R', 156},
    {'Y', 163},
    {'W', 186},
  });

Note that I and L share the same integer mass (113), as do K and Q (128), so integer masses alone cannot tell the members of each pair apart.
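With the integer masses in hand, the theoretical spectrum from the Mass Spectrometry section can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the function names `IntegerMasses`, `PrefixMasses`, `LinearSpectrum` and `CyclicSpectrum` are my own, and I assume the spectrum lists only non-empty sub-peptides (no 0 mass), so that the empty peptide has an empty spectrum.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Integer residue masses, copied from the table above.
const std::map<char, int>& IntegerMasses() {
  static const std::map<char, int> table = {
      {'G', 57},  {'A', 71},  {'S', 87},  {'P', 97},  {'V', 99},
      {'T', 101}, {'C', 103}, {'I', 113}, {'L', 113}, {'N', 114},
      {'D', 115}, {'K', 128}, {'Q', 128}, {'E', 129}, {'M', 131},
      {'H', 137}, {'F', 147}, {'R', 156}, {'Y', 163}, {'W', 186}};
  return table;
}

// prefix[i] is the total mass of the first i residues of `peptide`.
// std::map::at throws std::out_of_range for an unknown residue code.
std::vector<int> PrefixMasses(const std::string& peptide) {
  std::vector<int> prefix(peptide.size() + 1, 0);
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    prefix[i + 1] = prefix[i] + IntegerMasses().at(peptide[i]);
  }
  return prefix;
}

// Sorted masses of every non-empty sub-peptide of a linear peptide.
std::vector<int> LinearSpectrum(const std::string& peptide) {
  const std::vector<int> prefix = PrefixMasses(peptide);
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    for (std::size_t j = i + 1; j <= peptide.size(); ++j) {
      spectrum.push_back(prefix[j] - prefix[i]);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}

// Sorted masses of every non-empty sub-peptide of a circular peptide.
std::vector<int> CyclicSpectrum(const std::string& peptide) {
  const std::vector<int> prefix = PrefixMasses(peptide);
  const std::size_t n = peptide.size();
  const int total = prefix[n];
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j <= n; ++j) {
      const int mass = prefix[j] - prefix[i];
      spectrum.push_back(mass);
      // A stretch that touches neither end of the string leaves a complement
      // that wraps around the end; that complement is a sub-peptide too.
      if (i > 0 && j < n) spectrum.push_back(total - mass);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}
```

For the peptide NQEL, `LinearSpectrum` yields the 10 masses 113, 114, 128, 129, 242, 242, 257, 370, 371, 484. `CyclicSpectrum` adds the three wrap-around masses 227, 355 and 356.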

Consistency

Two spectra $S_1$ and $S_2$ are consistent if one is a subset of the other. For example, the empty spectrum is consistent with all other spectra:

$S_1 = Spectrum(P = \epsilon) = \emptyset$ is a subset of all possible spectra.

More importantly, the spectrum of any sub-peptide of $P$ is consistent with $Spectrum(P)$:

$P_1 = Substring(P_2) \implies IsConsistent(Spectrum(P_1), Spectrum(P_2))$
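The consistency test itself is small. Here is a minimal sketch, assuming a spectrum is stored as a `std::vector<int>` and that repeated masses count separately (a multiset), which the text above does not spell out. The names `IsSubSpectrum` and `IsConsistent` are illustrative.

```cpp
#include <algorithm>
#include <vector>

// True if every mass in `part` can be matched to a distinct mass in `whole`.
// Both arguments are taken by value so they can be sorted locally.
bool IsSubSpectrum(std::vector<int> part, std::vector<int> whole) {
  std::sort(part.begin(), part.end());
  std::sort(whole.begin(), whole.end());
  // std::includes treats sorted ranges as multisets, so duplicates matter.
  return std::includes(whole.begin(), whole.end(), part.begin(), part.end());
}

// Two spectra are consistent if one is a subset of the other.
bool IsConsistent(const std::vector<int>& s1, const std::vector<int>& s2) {
  return IsSubSpectrum(s1, s2) || IsSubSpectrum(s2, s1);
}
```

Under this multiset reading, a candidate that produces the mass 242 twice is not consistent with a spectrum that contains 242 only once.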

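When a candidate peptide $P$ grows from length $L$ to $L + 1$ by appending amino acid $A$, the only new sub-peptides are those that end in $A$: $A$ itself, and each suffix of $P$ followed by $A$. Their masses are the suffix masses of $P$ plus the mass of $A$, so the linear spectrum can be updated incrementally instead of being recomputed.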
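One way to read this step: appending a residue only creates sub-peptides that end in the new residue, so only those masses need to be added to the candidate's linear spectrum. A minimal sketch, assuming a peptide is stored as a vector of integer residue masses (`NewMassesAfterAppend` is a hypothetical helper):

```cpp
#include <vector>

// Masses of the sub-peptides created by appending a residue of mass `added`
// to a linear peptide with residue masses `peptide`, shortest first.
std::vector<int> NewMassesAfterAppend(const std::vector<int>& peptide,
                                      int added) {
  std::vector<int> fresh;
  int running = added;
  fresh.push_back(running);  // the new residue on its own
  for (auto it = peptide.rbegin(); it != peptide.rend(); ++it) {
    running += *it;  // grow the suffix one residue to the left
    fresh.push_back(running);
  }
  return fresh;
}
```

For example, appending L (113) to NQE (114, 128, 129) adds the masses 113, 242, 370 and 484 to the spectrum.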

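Note that consistency only needs to be checked for the linear spectrum of a candidate: a partially built candidate is an open chain, even if the final peptide is circular.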

Algorithm

These consistency properties allow us to reverse-engineer candidate amino acid sequences for a spectrum $S$:

1. Initialize the candidate list with a single candidate, $P_0 = \epsilon$ (the empty string).
2. Extend every candidate with every possible amino acid.
3. Prune every candidate whose spectrum is not consistent with $S$.
4. Repeat steps 2 and 3 until either (a) one or more candidates $P$ have $Spectrum(P) = S$, or (b) all candidates are pruned.
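The steps above can be sketched as a short branch-and-bound loop. This is a sketch under several assumptions of mine: $S$ is the error-free linear spectrum of a linear peptide, peptides are represented as vectors of integer residue masses (so I/L and K/Q are not distinguished), and the names `Peptide`, `kResidueMasses`, `LinearSpectrumOf` and `SequenceLinearPeptide` are illustrative. For a circular peptide, the equality test would compare against the cyclic spectrum instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Peptide = std::vector<int>;  // residue masses, in sequence order

// The 18 distinct integer residue masses from the table above.
const std::vector<int> kResidueMasses = {57,  71,  87,  97,  99,  101,
                                         103, 113, 114, 115, 128, 129,
                                         131, 137, 147, 156, 163, 186};

// Sorted masses of every non-empty sub-peptide of a linear peptide.
std::vector<int> LinearSpectrumOf(const Peptide& peptide) {
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    int mass = 0;
    for (std::size_t j = i; j < peptide.size(); ++j) {
      mass += peptide[j];
      spectrum.push_back(mass);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}

// Returns every peptide whose linear spectrum equals `target`.
std::vector<Peptide> SequenceLinearPeptide(std::vector<int> target) {
  std::sort(target.begin(), target.end());
  std::vector<Peptide> candidates = {Peptide{}};  // step 1: empty peptide
  std::vector<Peptide> matches;
  while (!candidates.empty()) {  // step 4: stop once everything is pruned
    std::vector<Peptide> survivors;
    for (const Peptide& candidate : candidates) {
      for (int residue : kResidueMasses) {  // step 2: extend
        Peptide extended = candidate;
        extended.push_back(residue);
        const std::vector<int> spectrum = LinearSpectrumOf(extended);
        if (spectrum == target) {
          matches.push_back(extended);  // step 4(a): exact match
        } else if (std::includes(target.begin(), target.end(),
                                 spectrum.begin(), spectrum.end())) {
          survivors.push_back(extended);  // step 3: keep if consistent
        }
      }
    }
    candidates = std::move(survivors);
  }
  return matches;
}
```

Because every residue adds at least 57 Da, a candidate eventually outgrows the largest mass in $S$ and is pruned, so the loop terminates. A linear peptide and its reversal have the same linear spectrum, so matches come in mirrored pairs.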

Beam Search

Of course, the algorithm above assumes an error-free spectrum. In practice, the measured spectrum will not match the ideal one. We can therefore relax step (3): instead of requiring strict consistency, we keep only the most promising candidates. A candidate's score is the number of masses in its generated spectrum that are also present in the target spectrum.
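The scoring and trimming steps might look like the sketch below. The struct `ScoredCandidate`, the functions `Score` and `KeepBest`, and the `beam_width` parameter are all hypothetical names. I also assume that a repeated mass only scores as many times as it appears in both spectra, and that ties at the cutoff are broken by input order.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

struct ScoredCandidate {
  std::vector<int> residue_masses;  // the candidate peptide
  int score;                        // masses shared with the target spectrum
};

// Number of masses the two spectra have in common, counting duplicates.
int Score(std::vector<int> candidate_spectrum,
          std::vector<int> target_spectrum) {
  std::sort(candidate_spectrum.begin(), candidate_spectrum.end());
  std::sort(target_spectrum.begin(), target_spectrum.end());
  std::vector<int> shared;
  std::set_intersection(candidate_spectrum.begin(), candidate_spectrum.end(),
                        target_spectrum.begin(), target_spectrum.end(),
                        std::back_inserter(shared));
  return static_cast<int>(shared.size());
}

// Keeps the `beam_width` highest-scoring candidates, best first.
std::vector<ScoredCandidate> KeepBest(std::vector<ScoredCandidate> candidates,
                                      std::size_t beam_width) {
  std::stable_sort(candidates.begin(), candidates.end(),
                   [](const ScoredCandidate& a, const ScoredCandidate& b) {
                     return a.score > b.score;
                   });
  if (candidates.size() > beam_width) {
    candidates.erase(
        candidates.begin() + static_cast<std::ptrdiff_t>(beam_width),
        candidates.end());
  }
  return candidates;
}
```

A wider beam is less likely to discard the correct peptide early, at the cost of scoring more candidates in every round.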

Missing Masses: Spectral Convolution

To find missing masses, compute every positive difference between pairs of masses in the spectrum and look for the most frequent (modal) values. The most common differences are likely to be individual amino acids (or other sub-components) of the molecule.

The $k$ most frequent differences in the range 57 to 200 tend to correspond to the amino acids present in the sequence.
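A minimal sketch of the convolution and the top-k selection follows. The function names and the `min_mass`/`max_mass` parameters are illustrative, and breaking ties in favour of the smaller mass is an arbitrary choice of mine, not something the method prescribes.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Counts how often each positive pairwise difference occurs in the spectrum.
std::map<int, int> SpectralConvolution(const std::vector<int>& spectrum) {
  std::map<int, int> counts;
  for (int a : spectrum) {
    for (int b : spectrum) {
      if (a > b) ++counts[a - b];
    }
  }
  return counts;
}

// The k most frequent differences within [min_mass, max_mass].
std::vector<int> MostFrequentDifferences(const std::vector<int>& spectrum,
                                         std::size_t k, int min_mass = 57,
                                         int max_mass = 200) {
  const std::map<int, int> counts = SpectralConvolution(spectrum);
  std::vector<std::pair<int, int>> ranked;  // (count, difference)
  for (const auto& entry : counts) {
    if (entry.first >= min_mass && entry.first <= max_mass) {
      ranked.emplace_back(entry.second, entry.first);
    }
  }
  // Highest count first; on equal counts, the smaller mass first.
  std::sort(ranked.begin(), ranked.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
              if (a.first != b.first) return a.first > b.first;
              return a.second < b.second;
            });
  std::vector<int> top;
  for (std::size_t i = 0; i < ranked.size() && i < k; ++i) {
    top.push_back(ranked[i].second);
  }
  return top;
}
```

The masses returned here can then replace the full amino acid alphabet in the extension step, which keeps the search small and lets it pick up residues that never appear as a single mass in the measured spectrum.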

Statistical Significance

It can be helpful to compare any matches discovered against a randomly generated baseline.
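One possible baseline, sketched below, is to score many random peptides of the same length against the same spectrum and report the fraction that score at least as well as the match. Everything here is an assumption on my part rather than an established procedure: the decoys draw residues uniformly from the 18 integer masses, the score is the shared-mass count on linear spectra, and the names (`kBaselineMasses`, `SubPeptideMasses`, `SharedMasses`, `FractionOfRandomPeptidesAsGood`) are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// The 18 distinct integer residue masses from the table above.
const std::vector<int> kBaselineMasses = {57,  71,  87,  97,  99,  101,
                                          103, 113, 114, 115, 128, 129,
                                          131, 137, 147, 156, 163, 186};

// Sorted linear spectrum of a peptide given as residue masses.
std::vector<int> SubPeptideMasses(const std::vector<int>& peptide) {
  std::vector<int> spectrum;
  for (std::size_t i = 0; i < peptide.size(); ++i) {
    int mass = 0;
    for (std::size_t j = i; j < peptide.size(); ++j) {
      mass += peptide[j];
      spectrum.push_back(mass);
    }
  }
  std::sort(spectrum.begin(), spectrum.end());
  return spectrum;
}

// Number of masses shared by two sorted spectra, counting duplicates.
int SharedMasses(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> shared;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(shared));
  return static_cast<int>(shared.size());
}

// Fraction of random same-length peptides scoring at least as well as `match`.
double FractionOfRandomPeptidesAsGood(const std::vector<int>& match,
                                      std::vector<int> target, int trials,
                                      unsigned seed) {
  std::sort(target.begin(), target.end());
  const int observed = SharedMasses(SubPeptideMasses(match), target);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0,
                                                  kBaselineMasses.size() - 1);
  int as_good = 0;
  for (int t = 0; t < trials; ++t) {
    std::vector<int> decoy(match.size());
    for (int& residue : decoy) residue = kBaselineMasses[pick(rng)];
    if (SharedMasses(SubPeptideMasses(decoy), target) >= observed) ++as_good;
  }
  return trials > 0 ? static_cast<double>(as_good) / trials : 0.0;
}
```

A fraction close to 0 suggests that the match is unlikely to be a chance hit, while a large fraction means that random peptides explain the spectrum about as well as the match does.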