Sequencing

January 30, 2021

This page is a summary of DNA sequencing.

The main idea of sequencing is to take some DNA as input and determine the order of its bases. Since the DNA -> RNA -> protein pathway (the “central dogma of biology”) underlies how living cells work, sequencing DNA is a key basis for understanding biology.

With the ability to isolate specific DNA, biologists can build libraries of DNA that encode specific proteins, and then synthesize these proteins cheaply.

Current Technology

Reads

The genome is cut into pieces of a certain size (k-mers). Each piece is then stepped through a tube at a fixed rate under a light source, and each base gives off a different color in response. A machine measures the light to establish the identity of each base and assigns it a quality score.

Contigs

Shorter reads are assembled into contigs (contiguous stretches of sequence) using de Bruijn graphs. An ordered collection of contigs is called a scaffold.
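
To make the de Bruijn idea concrete, here is a minimal sketch, not how a production assembler works. Each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a contig is read off by following nodes that have exactly one successor. The function names and the toy reads are made up for illustration, and the walk ignores complications like sequencing errors, reverse complements, and nodes with several incoming edges.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # add each distinct k-mer only once
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop instead of looping around a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig

# Toy overlapping reads (hypothetical data, not from a real genome)
reads = ["ATGGC", "GGCGT", "CGTGC", "TGCAA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

With these four reads and k=4, the walk stitches the overlaps back into a single contig. As soon as a node has two successors (a repeat in the genome), this simple walk stops, which is one reason real assemblies end up as many contigs rather than one sequence.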

One could order the contigs using a reference genome from the same organism or a similar organism, or slowly piece together known genome sequences using tools like restriction digests to cut the genome into pieces.

Caveats

Cutting the DNA into k-mers makes reconstruction difficult, especially for short k. Imagine setting k=1 - there would be a lot of As, Gs, Cs, and Ts - and it would be impossible to reconstruct the original sequence. In practice, it seems that a k value of 50-85 can reconstruct prokaryotic genomes pretty well. There are also paired-read techniques that artificially lengthen the “k” value beyond what the hardware would otherwise support.
Is it really true that the DNA sequence is the same in all cells of the same person? DNA sequencing is typically done on cheek cells, so how could the cheek cell DNA be relevant for other parts of the body? On the whole, mutation rates are shockingly low - see thought experiment on cellular mutation.
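
The k=1 thought experiment in the first caveat can be checked in a few lines. This is a small sketch with two made-up sequences: at k=1 they have identical base counts, so nothing distinguishes them, while at k=3 their k-mer sets differ. The helper name `kmer_counts` is just for illustration.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two different toy sequences (hypothetical) with the same base composition
a = "ACGTTGCA"
b = "TTGCAACG"

same_at_k1 = kmer_counts(a, 1) == kmer_counts(b, 1)
same_at_k3 = kmer_counts(a, 3) == kmer_counts(b, 3)
print(same_at_k1, same_at_k3)
```

Longer k-mers carry more context, so fewer distinct sequences are consistent with the same set of k-mers. Repeats longer than k still cause ambiguity, which is where paired reads and long reads help.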

Nanopore Sequencing

Nanopore sequencing attempts to achieve longer reads by pulling a DNA strand through a small pore and reading one base at a time. It can produce much longer reads than the approach above, but there are still some accuracy issues to resolve (as far as I can tell, as of 2021).

File Formats

Genome sequencing is a complex process that leads to an alphabet soup of file formats.

My sense is the underlying explosion of file formats comes from a few different places:

Different use cases (the fundamental complexity coming from the different stages of next-generation sequencing, or NGS). Raw read data, read quality scores, reads aligned to a reference, and variant calls all come to mind.
Text or binary: initial research formats used text for easy iteration, but at scale binary is significantly more efficient. So, for example, SAM files can also be expressed as BAM or CRAM files - same information, just stored in binary and more efficiently.
Index files: since one would often like to search for a specific region of interest, most of the sequence and alignment files can also have corresponding index files for efficiency. FAI indexes FASTA, CRAI indexes CRAM, BAI indexes BAM, and so on.
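
To illustrate why index files help, here is a rough sketch of a FASTA index in the spirit of FAI. For each record it stores the sequence length, the byte offset of the first base, the bases per line, and the bytes per line, which is enough to seek straight to a region instead of scanning the whole file. The functions `build_index` and `fetch` are hypothetical helpers written for this post, not the samtools implementation, and the sketch assumes every sequence line in a record (except the last) has the same width.

```python
import io

def build_index(handle):
    """Return {name: [length, offset_of_first_base, bases_per_line, bytes_per_line]}."""
    index = {}
    name = None
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = [0, offset + len(line), 0, 0]
        elif name is not None:
            entry = index[name]
            stripped = line.rstrip(b"\r\n")
            if entry[2] == 0:  # the first sequence line defines the line layout
                entry[2], entry[3] = len(stripped), len(line)
            entry[0] += len(stripped)
        offset += len(line)
    return index

def fetch(handle, index, name, start, end):
    """Read bases [start, end) of one record by seeking rather than scanning."""
    length, first, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)

    def byte_pos(i):
        # Whole lines before base i, plus the position within its line
        return first + (i // bases_per_line) * bytes_per_line + i % bases_per_line

    handle.seek(byte_pos(start))
    raw = handle.read(byte_pos(end) - byte_pos(start))
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

# Toy two-record FASTA held in memory (hypothetical data)
data = b">chr1 toy\nACGTACGT\nTTGGCCAA\nAC\n>chr2\nGGGGCCCC\nTT\n"
handle = io.BytesIO(data)
index = build_index(handle)
region = fetch(handle, index, "chr2", 4, 10)
print(index, region)
```

The same trade-off applies to BAI and CRAI: a small side file of offsets is built once, and later lookups of a region of interest only touch the bytes they need.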

Sources:

UCSC provides a good overview here.
GA4GH provides official specifications here.
The Broad Institute also provides a good index.

Brief summaries of these 3 sources are in this table:

File type	Extension	Use Case	Summary
FASTA	.fasta	Raw unaligned sequence	Simple text-based file for sequences. Can contain DNA, protein, or any format. This format is the output of sequencing. A single FASTA file can contain multiple related sequences, such as each chain of a multimer protein, or different genes within a chromosome.
FASTQ	.fastq	Unaligned sequence + Quality	Sequence + quality information; each base's quality is stored as an integer (Phred score) encoded as an ASCII character. This format is a detailed output of sequencing.
BED (link)	.bed	Sequence annotations	Sequence annotation information, often for a set of exons. This file format is typically metadata and not other sequence information. It is quite generic.
Big BED (link)	.bb	Binary BED	Efficient format for BED sequence annotations, based on AutoSQL. Since BED is generic, this storage format can be used to back other “bigXXX” types of files.
AutoSQL (link)	.as	Data format - likely deprecated	Data format used to load arbitrary records to/from the genome browser. This format is an implementation detail.
SAM (Sequence Alignment Map and Overview)	.sam	Alignments	Alignment information (such as to a reference genome), which can store how a sequence aligns to a reference one. Text-based.
BAM (Binary Alignment Map)	.bam	Alignments	Same as SAM but binary/compact.
BAM Index	.bai	Indexed Alignments	Allows for efficient searching of a BAM file.
CRAM	.cram	Alignments	A better BAM format for storing sequence information, which is fully compatible with BAM and supports further optimization. Officially endorsed format.
CRAI (CRAM Index)	.crai	Indexed Alignments	Index for efficient searching of a CRAM file.
VCF (Variant Call Format)	.vcf	Variants and Mutations	Variant calling file stores gene sequence variations with respect to a reference. Since variation can be small with respect to the entire genome, the VCF format is relatively compact (see also: wikipedia).
BCF (Binary Call Format)	.bcf	Variants and Mutations	Binary version of VCF.
WIG (Wiggle)	.wig	Sequence annotations	Another format similar to BED for storing sequence annotations. More tailored to continuous values.
HAL	.hal	Alignments of multiple sequences	Hierarchical alignment of many genomes. Can store ~1000s of genomes.
Newick Standard Tree (link)	.nh	Tree format	Stores a tree, such as a phylogeny.
PSL (link)	.psl	Alignments (from search)	Stores the output from BLAT comparing query and target sequences.
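
As a small example of the FASTQ row in the table, the sketch below parses a four-line FASTQ record and decodes the quality string. It assumes the common Phred+33 encoding (quality = ASCII code minus 33), which is an assumption about the file and not something every FASTQ guarantees, and uses the standard relation P(error) = 10^(-Q/10). The record and function names are made up.

```python
def parse_fastq(text):
    """Yield (name, sequence, qualities) from simple 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        name = lines[i][1:]                          # drop the leading '@'
        seq = lines[i + 1]
        quals = [ord(c) - 33 for c in lines[i + 3]]  # assumes Phred+33 encoding
        yield name, seq, quals

def error_probability(q):
    """Convert a Phred score to the estimated probability the base call is wrong."""
    return 10 ** (-q / 10)

# One toy record (hypothetical data)
fastq_text = "@read1\nACGT\n+\nI5+!\n"
records = list(parse_fastq(fastq_text))
name, seq, quals = records[0]
print(name, seq, quals, [error_probability(q) for q in quals])
```

Here the four quality characters decode to Phred scores of 40, 20, 10, and 0, so the first base is called with high confidence and the last one is essentially a guess.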

Sources

This post is based on a summary paper, Computational analysis of next generation sequencing data and its applications in clinical oncology, and on the Bioinformatics Algorithms book.