Sequencing

This page is a summary of DNA sequencing.

The main idea of sequencing is to take some DNA as input and determine the order of its bases. Since the DNA -> RNA -> protein pathway (the “central dogma of biology”) underlies all of life, sequencing DNA is a key basis for understanding biology.

With the ability to isolate specific DNA, biologists can build libraries of DNA that encode specific proteins, and then synthesize these proteins cheaply.

Current Technology

Reads

The genome is cut into pieces of a certain size (k-mers); the sequenced pieces are called reads. Each piece is stepped through a tube at a fixed rate under a light source, and each base fluoresces in a different color. A machine measures the light to identify each base and assigns it a quality score.

Contigs

Shorter reads are assembled into longer contiguous sequences (contigs) using de Bruijn graphs. An ordered collection of contigs is called a scaffold.
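To make the de Bruijn idea concrete, here is a minimal sketch. It assumes error-free reads from a single strand, and the helper names (`build_de_bruijn`, `walk_unambiguous_path`) are made up for illustration. Real assemblers also handle sequencing errors, reverse complements, repeats, and in-degree checks, all of which this ignores.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of size k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle: stop rather than loop forever
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (hypothetical data).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

The overlap between reads never has to be computed pairwise: reads that share k-mers simply add the same edges to the graph, and a contig falls out as a path with no branching.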

One could order the contigs using a reference genome from the same organism or a similar organism, or slowly piece together known genome sequences using tools like restriction digests to cut the genome into pieces.

Caveats

Nanopore Sequencing

Nanopore sequencing attempts to achieve longer reads by pulling the DNA strand through a small pore and reading one base at a time. It can produce much longer reads, but there are still some accuracy issues to resolve (as far as I can tell, as of 2021).

File Formats

Genome sequencing is a complex process that leads to an alphabet soup of file formats.

My sense is that the explosion of file formats comes from a few different places:

  1. Different use cases (the fundamental complexity coming from the different stages of next-generation sequencing, NGS). Raw read data, read quality scores, reads aligned to a reference, and variant calls all come to mind.
  2. Text or binary: early research formats used text because it is easy to iterate on, but at scale binary is significantly more efficient. For example, SAM files can also be expressed as BAM or CRAM files: the same information, just stored in binary and more compactly.
  3. Index files: since one often wants to look up a specific region of interest, most of the sequence and alignment files can also have corresponding index files for efficiency. FAI indexes FASTA, CRAI indexes CRAM, BAI indexes BAM, and so on.
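As an illustration of why index files help, here is a rough sketch of a FASTA index in the spirit of `.fai`, which records the name, length, byte offset of the first base, bases per line, and bytes per line for each sequence. The functions `build_fasta_index` and `fetch` are hypothetical helpers, not the samtools implementation, and the sketch assumes every sequence uses a uniform line length (except possibly its last line).

```python
import io


def build_fasta_index(handle):
    """Return (name, length, offset, line_bases, line_width) for each record."""
    records = []
    name = None
    length = offset = line_bases = line_width = 0
    position = 0  # byte position of the start of the current line
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                records.append((name, length, offset, line_bases, line_width))
            name = raw[1:].split()[0].decode("ascii")
            length = line_bases = line_width = 0
            offset = position + len(raw)  # first base starts after the header
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if line_width == 0:
                line_bases, line_width = bases, len(raw)
            length += bases
        position += len(raw)
    if name is not None:
        records.append((name, length, offset, line_bases, line_width))
    return records


def fetch(handle, record, start, end):
    """Read bases in the 0-based half-open interval [start, end) via seek."""
    _, length, offset, line_bases, line_width = record
    end = min(end, length)
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + (end // line_bases) * line_width + end % line_bases
    handle.seek(first)
    chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Toy in-memory FASTA file (hypothetical data).
fasta = io.BytesIO(b">chr1 test\nACGTACGT\nACGT\n>chr2\nGGGCCC\n")
index = build_fasta_index(fasta)
region = fetch(fasta, index[0], 6, 10)
print(index)
print(region)  # GTAC
```

The point of the index is that `fetch` jumps straight to the right byte with `seek` instead of scanning the whole file, which matters a great deal when the file holds a multi-gigabyte genome.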

Sources:

  1. UCSC provides a good overview here.
  2. GA4GH provides official specifications here.
  3. The Broad Institute also provides a good index.

The table below briefly summarizes the formats described by these three sources:

| File type | Extension | Use case | Summary |
|---|---|---|---|
| FASTA | .fasta | Raw unaligned sequence | Simple text-based file for sequences. Can contain DNA, protein, or any other kind of sequence. This format is the output of sequencing. A single FASTA file can contain multiple related sequences, such as each chain of a multimer protein, or different genes within a chromosome. |
| FASTQ | .fastq | Unaligned sequence + quality | Sequence plus quality information; each quality score is an integer (Phred score) encoded as an ASCII character. This format is a detailed output of sequencing. |
| BED | .bed | Sequence annotations | Sequence annotation information, often for a set of exons. This file format typically holds metadata rather than sequence data. It is quite generic. |
| Big BED | .bb | Binary BED | Efficient binary format for BED sequence annotations, based on AutoSQL. Since BED is generic, this storage format can be used to back other “bigXXX” types of files. |
| AutoSQL | .as | Data format (likely deprecated) | Data format used to load arbitrary records to/from the genome browser. This format is an implementation detail. |
| SAM (Sequence Alignment Map) | .sam | Alignments | Alignment information, storing how a sequence aligns to a reference such as a reference genome. Text-based. |
| BAM (Binary Alignment Map) | .bam | Alignments | Same as SAM but binary and compact. |
| BAM Index | .bai | Indexed alignments | Allows for efficient searching of a BAM file. |
| CRAM | .cram | Alignments | A more compact alternative to BAM for storing alignments, which is fully compatible with BAM and supports further optimization. Officially endorsed format. |
| CRAI (CRAM Index) | .crai | Indexed alignments | Index for efficient searching of a CRAM file. |
| VCF (Variant Call Format) | .vcf | Variants and mutations | Stores sequence variations with respect to a reference. Since the variation can be small relative to the entire genome, the VCF format is relatively compact (see also: Wikipedia). |
| BCF (Binary Call Format) | .bcf | Variants and mutations | Binary version of VCF. |
| WIG (Wiggle) | .wig | Sequence annotations | Another format similar to BED for storing sequence annotations. More tailored to continuous values. |
| HAL | .hal | Alignments of multiple sequences | Hierarchical alignment of many genomes. Can store thousands of genomes. |
| Newick Standard Tree | .nh | Tree format | Stores a tree, such as a phylogeny. |
| PSL | .psl | Alignments (from search) | Stores the output from BLAT comparing query and target sequences. |
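To show how the FASTQ quality encoding works in practice, here is a small sketch that parses four-line FASTQ records and decodes the quality string. It assumes the Phred+33 offset (older Illumina files used +64), and `parse_fastq` and `error_probability` are illustrative names rather than functions from any library.

```python
def parse_fastq(lines):
    """Yield (name, sequence, scores) from FASTQ text, four lines per record."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding, so '!' is 0 and 'I' is 40.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


def error_probability(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# One toy record (hypothetical data).
text = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(text.splitlines()))
name, sequence, scores = records[0]
print(name, sequence, scores)  # read1 ACGT [40, 40, 20, 0]
print(error_probability(20))
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, and 40 to 1 in 10,000, which is why downstream tools often trim or filter reads below some quality threshold.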

Sources

This post is based on a summary paper, "Computational analysis of next generation sequencing data and its applications in clinical oncology", and on the Bioinformatics Algorithms book.