Sequencing Quality

In practice, a genome usually cannot be assembled in one piece; instead, assemblers produce “contigs.” According to Coursera, in de novo assembly a contig is typically taken to be the longest path without splits (a non-branching path) in the de Bruijn graph. Contigs are non-overlapping.

An ordered set of contigs, arranged as they appear in the genome, is referred to as a scaffold.

We would like to evaluate the quality of scaffolds to understand how well the assembler performed.

Percentile metrics

All things being equal, longer contigs are better than shorter ones. Too many short contigs could indicate a problem in the assembly, whereas many long contigs suggest the assembly worked well. We can therefore define a few metrics that attempt to capture this idea.
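The notes do not list the metrics themselves. One widely used example of a percentile-style metric is N50: the length L such that contigs of length at least L account for at least 50% of all assembled bases (N75 is defined the same way with 75%). The sketch below assumes that this family of metrics is what is meant; the function name `nxx` and the sample lengths are purely illustrative.

```python
from typing import Sequence


def nxx(lengths: Sequence[int], fraction: float = 0.5) -> int:
    """Return the Nxx value for a collection of contig lengths.

    Assumption: Nxx is the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches `fraction` of the
    total assembled length (fraction=0.5 gives N50).
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    threshold = sum(lengths) * fraction
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    # Only reachable through floating-point rounding when fraction is near 1.
    return min(lengths)


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 80, 60, 40, 20]
print("N50:", nxx(example_lengths, 0.50))
print("N75:", nxx(example_lengths, 0.75))
```

With these made-up lengths the total is 300 bases. The two longest contigs (100 + 80 = 180) already cover half of it, so N50 is 80. A larger value means that more of the assembly sits in long contigs.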

QUAST (Quality Assessment Tool for Genome Assemblies)

One can evaluate some of these metrics using the QUAST tool. Given a FASTA file of assembled contigs, QUAST reports several of these percentile quality metrics.
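QUAST itself is typically run from the command line (for example `quast.py contigs.fasta -o quast_output`, where both the file name and the output directory are placeholders). To illustrate the kind of input such a tool works from, here is a minimal sketch that reads contig lengths from FASTA-formatted text. It is not QUAST's implementation; the function name `contig_lengths` and the sample records are hypothetical.

```python
import io
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name to its sequence length.

    Assumes plain FASTA: a header line starting with '>' followed by one or
    more sequence lines. If a name repeats, the later record replaces it.
    """
    lengths: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else f"unnamed_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            lengths[name] += len(line)
    return lengths


# Tiny in-memory FASTA, for illustration only.
example = io.StringIO(">contig_1 demo\nACGTACGT\nACGT\n>contig_2\nACG\n")
lengths = contig_lengths(example)
print(lengths)
print("total length:", sum(lengths.values()))
print("largest contig:", max(lengths.values()))
```

For a real file, the same function can be given an open file handle, for example `with open(path) as handle: contig_lengths(handle)`, since it only needs an iterable of lines.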