Sequencing
This page is a summary of DNA sequencing.
The main idea of sequencing is to take some DNA as input and determine the order of its bases. Since the DNA -> RNA -> protein pathway (the “central dogma of biology”) underlies all life, sequencing DNA is a key basis for understanding biology.
With the ability to isolate specific DNA, biologists can build libraries of DNA that encode specific proteins, and then synthesize these proteins cheaply.
Current Technology
Reads
The genome is cut into pieces of a certain size (k-mers). Each piece is then stepped through a tube under light at a fixed rate, and each base shows a different color under the light. A machine measures the color to establish the identity of each base and assigns it a quality score.
Contigs
Short reads are assembled into longer contiguous sequences (contigs), typically using de Bruijn graphs. An ordered collection of contigs is called a scaffold.
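To make the de Bruijn idea concrete, here is a minimal sketch. The function names (`build_de_bruijn`, `walk_unambiguous`) and the toy reads are my own illustration, not taken from any assembler. It assumes error-free reads from a single strand, and it only checks out-degree when walking, so real assemblers (which also handle in-degree, reverse complements, and sequencing errors) do considerably more.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                # Overlapping reads repeat k-mers; keep each edge once.
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from `start` while exactly one edge leads onward."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            # Stop on a cycle instead of looping forever.
            break
        visited.add(node)
        contig += node[-1]
    return contig

# Hypothetical overlapping reads, all from the same strand.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

The walk stops as soon as a node has zero or several outgoing edges, which is roughly where a contig would end: the graph alone no longer says how to continue.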
One could order the contigs using a reference genome from the same or a similar organism, or slowly piece together known genome sequences using tools like restriction digests to cut the genome into pieces.
Caveats
Cutting the DNA into k-mers makes reconstruction difficult, especially for short k. Imagine setting k=1: there would be a lot of As, Gs, Cs, and Ts, and it would be impossible to reconstruct the original sequence. In practice, it seems that a k value of 50-85 can reconstruct prokaryotic genomes pretty well. There are also paired-read techniques that artificially lengthen the effective “k” beyond what the hardware would otherwise support.
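The small-k problem can be shown with a toy example. The helper `kmer_spectrum` and the two sequences are made up for illustration. The sketch counts the k-mers of two different sequences and shows that they are indistinguishable at k=1 and k=2, and only become distinguishable at k=3.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Count every k-mer in seq (the multiset a sequencer would hand back)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two different (hypothetical) sequences built from the same repeated pieces.
seq_a = "TACAGAT"
seq_b = "TAGACAT"

for k in (1, 2, 3):
    same = kmer_spectrum(seq_a, k) == kmer_spectrum(seq_b, k)
    print(f"k={k}: identical k-mer spectrum? {same}")
```

For k=1 and k=2 the k-mers alone cannot tell which of the two sequences was the input. Longer k-mers span the repeated "A" junctions and resolve the ambiguity.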
Is it really true that the DNA sequence is the same in all cells of the same person? DNA sequencing is typically done on cheek cells, so how could cheek cell DNA be relevant for other parts of the body? On the whole, mutation rates are shockingly low (see the thought experiment on cellular mutation).
Nanopore Sequencing
Nanopore sequencing attempts to achieve longer reads by pulling a DNA strand through a small pore and reading one base at a time. Nanopore sequencing can produce longer reads, but there are still some accuracy issues to resolve (as far as I can tell, as of 2021).
File Formats
Genome sequencing is a complex process that leads to an alphabet soup of file formats.
My sense is that the explosion of file formats comes from a few different places:
- Different use cases (the fundamental complexity coming from the different steps of NGS). Raw read data, read quality scores, reads aligned to a reference, and variant calls all come to mind.
- Text or binary: early research formats used text for quick iteration, but at scale binary is significantly more efficient. For example, SAM files can also be expressed as BAM or CRAM files: same information, just stored in binary and more efficiently.
- Index files: since one would often like to search for a specific region of interest, most of the sequence and alignment files can also have corresponding index files for efficiency. FAI indexes FASTA, CRAI indexes CRAM, BAI indexes BAM, and so on.
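To show why an index helps, here is a rough sketch of the arithmetic behind a FASTA index. The five fields (name, length, offset, bases per line, bytes per line) mirror the columns of a `.fai` file, but the functions `build_fai` and `fetch` are my own illustration, and the in-memory byte string stands in for a file one would normally seek into. It assumes every line of a record except the last has the same width.

```python
def build_fai(fasta_bytes):
    """Return (name, length, offset, linebases, linewidth) for each record."""
    entries = []
    pos = 0          # byte offset of the current line
    current = None
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            name = line[1:].split()[0].decode()
            # Sequence data starts right after the header line.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                current[3] = bases       # bases per line
                current[4] = len(line)   # bytes per line, newline included
            current[1] += bases
        pos += len(line)
    if current is not None:
        entries.append(tuple(current))
    return entries

def fetch(fasta_bytes, entry, start, end):
    """Return bases [start, end) (0-based) using only index arithmetic."""
    _name, _length, offset, linebases, linewidth = entry

    def byte_pos(i):
        return offset + (i // linebases) * linewidth + (i % linebases)

    raw = fasta_bytes[byte_pos(start):byte_pos(end)]
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

# Hypothetical two-record FASTA, wrapped at 4 bases per line for chr1.
fasta = b">chr1 demo\nACGT\nACGT\nAC\n>chr2\nGGGTTT\n"
index = build_fai(fasta)
print(index)
print(fetch(fasta, index[0], 2, 9))
```

The point is that `fetch` never scans the sequence: it jumps straight to the right bytes, which is what makes region queries on a multi-gigabyte genome cheap.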
Sources:
- UCSC provides a good overview here.
- GA4GH provides official specifications here.
- The Broad Institute also provides a good index.
Brief summaries drawn from these three sources are in this table:
File type | Extension | Use Case | Summary |
---|---|---|---|
FASTA | .fasta | Raw unaligned sequence | Simple text-based file for sequences. Can contain DNA, protein, or any other kind of sequence. This format is the output of sequencing. A single FASTA file can contain multiple related sequences, such as each chain of a multimer protein, or different genes within a chromosome. |
FASTQ | .fastq | Unaligned sequence + Quality | Sequence + quality information; quality is stored as an integer encoded as an ASCII character (Phred score). This format is a detailed output of sequencing. |
BED (link) | .bed | Sequence annotations | Sequence annotation information, often for a set of exons. This file format is typically metadata and not other sequence information. It is quite generic. |
Big BED (link) | .bb | Binary BED | Efficient format for BED sequence annotations, based on AutoSQL. Since BED is generic, this storage format can be used to back other “bigXXX” types of files. |
AutoSQL (link) | .as | Data format - likely deprecated | Data format used to load arbitrary records to/from the genome browser. This format is an implementation detail. |
SAM (Sequence Alignment Map and Overview) | .sam | Alignments | Alignment information (such as to a reference genome), which can store how a sequence aligns to a reference one. Text-based. |
BAM (Binary Alignment Map) | .bam | Alignments | Same as SAM but binary/compact. |
BAM Index | .bai | Indexed Alignments | Allows for efficient searching of a BAM file. |
CRAM | .cram | Alignments | A more compact alternative to BAM for storing aligned sequence data, which is fully compatible with BAM and supports further optimization. Officially endorsed format. |
CRAI (CRAM Index) | .crai | Indexed Alignments | Index for efficient searching of a CRAM file. |
VCF (Variant Call Format) | .vcf | Variants and Mutations | Stores gene sequence variations with respect to a reference. Since variation can be small with respect to the entire genome, the VCF format is relatively compact (see also: Wikipedia). |
BCF (Binary Call Format) | .bcf | Variants and Mutations | Binary version of VCF. |
WIG (Wiggle) | .wig | Sequence annotations | Another format similar to BED for storing sequence annotations. More tailored to continuous values. |
HAL | .hal | Alignments of multiple sequences | Hierarchical alignment of many genomes. Can store ~1000s of genomes. |
Newick Standard Tree (link) | .nh | Tree format | Stores a tree, such as a phylogeny. |
PSL (link) | .psl | Alignments (from search) | Stores the output from BLAT comparing query and target sequences. |
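As an illustration of the FASTQ row above, here is a minimal sketch of reading a record and decoding its quality string. It assumes the common Phred+33 encoding (each quality character's ASCII code minus 33) and simple four-line records; the function names and the sample read are hypothetical.

```python
def parse_fastq(text):
    """Parse 4-line FASTQ records into (name, sequence, phred_scores)."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Assumption: Phred+33, so '!' is quality 0 and 'I' is quality 40.
        scores = [ord(c) - 33 for c in qual]
        records.append((header[1:], seq, scores))
    return records

def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Hypothetical record: one read of four bases.
sample = "@read1\nACGT\n+\nII5!\n"
for name, seq, scores in parse_fastq(sample):
    for base, q in zip(seq, scores):
        print(name, base, q, error_probability(q))
```

A score of 20 therefore means a 1-in-100 chance that the base call is wrong, and a score of 40 means 1 in 10,000.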
Sources
This post is based on a summary paper, Computational analysis of next generation sequencing data and its applications in clinical oncology, and on the Bioinformatics Algorithms book.