Sequencing Quality
In practice, a genome usually cannot be assembled in one piece, so assemblers instead produce “contigs,” contiguous stretches of assembled sequence. Following the Coursera course, in de novo assembly contigs are typically taken to be the longest paths without splits (maximal non-branching paths) in the de Bruijn graph. Contigs are non-overlapping.
An ordered set of contigs, arranged as they appear in the genome, is referred to as a scaffold.
We would like to evaluate the quality of scaffolds in order to understand how well the assembler performed.
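The "longest path without splits" idea can be sketched in a few lines. This is a minimal illustration, not a production assembler: the function name `contigs_from_kmers` is hypothetical, the input is assumed to be an error-free list of k-mers of equal length, and isolated cycles in the graph are ignored for brevity.

```python
from collections import defaultdict


def contigs_from_kmers(kmers):
    """Spell the maximal non-branching paths of the de Bruijn graph of `kmers`.

    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix.
    Assumption: isolated cycles (every node 1-in-1-out) are not reported.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        out_edges[prefix].append(suffix)
        in_degree[suffix] += 1

    def is_one_in_one_out(node):
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    contigs = []
    # Iterate over a snapshot of the keys, since lookups may add empty entries.
    for node in list(out_edges):
        if is_one_in_one_out(node):
            continue  # paths only start at branching, source, or sink nodes
        for nxt in out_edges[node]:
            path = node + nxt[-1]
            # Extend the path while the next node has no split or merge.
            while is_one_in_one_out(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)
```

For the k-mers `ATG, ATG, TGT, TGG, CAT, GGA, GAT, AGA`, this returns `AGA, ATG, ATG, CAT, GAT, TGGA, TGT`. Note that contigs produced this way can share a (k-1)-mer at their ends, because paths stop at the branching node itself.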
Percentile metrics
All else being equal, longer contigs are better than shorter contigs. Many short contigs could indicate a problem in the assembly, whereas a few long contigs suggest the assembly worked well. We can therefore define a few metrics that attempt to capture this idea:
$N50$: the maximal contig length such that the contigs of length $N50$ or greater together account for at least 50% of the sum of all contig lengths. In the best case, where a single contig makes up the whole assembly, $N50$ equals the length of that contig; in the worst case, where every contig is a single nucleotide, $N50$ equals 1.
$NG50$: when the genome length $L_g$ is known, the maximal contig length such that the contigs of length $NG50$ or greater together account for at least 50% of $L_g$. Because it is normalized by the genome length rather than the assembly length, $NG50$ can be compared across different assemblies of the same genome.
$NGA50$: given a reference genome, break the contigs in the scaffold apart at errors (misassemblies), then compute $NG50$ on the resulting shorter contigs.
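The three definitions above differ only in which lengths are summed and what total they are compared against, which a short sketch makes concrete. The function names below are illustrative rather than taken from any library, the threshold uses the "at least half" convention, and `nga50` assumes the contigs have already been split at misassemblies by some alignment step that is not shown.

```python
def length_at_fraction(lengths, total, fraction=0.5):
    """Largest length x such that pieces of length >= x sum to >= fraction * total.

    Returns 0 if the pieces never reach the target (possible for NG50/NGA50
    when the assembly covers less than half of the genome).
    """
    target = fraction * total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(contig_lengths):
    # Compare against the total assembly length.
    return length_at_fraction(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_length):
    # Compare against the known (or estimated) genome length L_g.
    return length_at_fraction(contig_lengths, genome_length)


def nga50(aligned_block_lengths, genome_length):
    # Assumption: the input lengths are those of contigs already broken
    # at misassemblies relative to a reference.
    return ng50(aligned_block_lengths, genome_length)
```

For contig lengths 80, 70, 50, 40, 30, 20 (total 290), `n50` gives 70, since 80 + 70 = 150 is the first running sum to reach 145. With an assumed genome length of 400, `ng50` gives 50, since 80 + 70 + 50 = 200.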
QUAST (Quality Assessment Tool for Genome Assemblies)
Some of these metrics can be computed with the QUAST tool. Given a FASTA file of assembled contigs, QUAST reports percentile quality metrics such as $N50$; metrics that depend on a reference, such as $NGA50$, additionally require the reference genome.
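As a rough sketch of how QUAST might be driven from a script, the helper below builds and runs a `quast.py` command. It assumes QUAST is installed and `quast.py` is on the `PATH`; the helper names and the file paths passed to them are hypothetical. The `-o` (output directory) and `-r` (reference genome) options are QUAST's own.

```python
import shutil
import subprocess


def build_quast_command(assembly_fasta, output_dir, reference_fasta=None):
    """Return the argument list for a basic QUAST run (nothing is executed)."""
    command = ["quast.py", assembly_fasta, "-o", output_dir]
    if reference_fasta is not None:
        # A reference enables reference-based metrics such as NGA50.
        command += ["-r", reference_fasta]
    return command


def run_quast(assembly_fasta, output_dir, reference_fasta=None):
    """Run QUAST, raising if the executable is missing or the run fails."""
    if shutil.which("quast.py") is None:
        raise RuntimeError("quast.py was not found on PATH")
    command = build_quast_command(assembly_fasta, output_dir, reference_fasta)
    subprocess.run(command, check=True)
```

After a successful run, QUAST writes a summary to the output directory, including `report.txt` and `report.tsv`.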