Zoonomia

The Zoonomia project compares genomes across many mammals across the animal kingdom. Paper is here with citation:

Zoonomia Consortium. A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). https://doi.org/10.1038/s41586-020-2876-6

The website is also a cool resource.

Approach

The key contributions of the Zoonomia project are to sequence many mammal genomes, and align them to find orthologous genes across species.

The authors used DISCOVAR used to assemble sequences de-novo, and used the Cactus pipeline (paper, code) to assemble large-scale aligned datasets. Tools for reading/modifying format are provided in the hal toolkit. In my opinion, this work is an impressive accomplishment of computational scaling and algorithmic innovation to push the limits of number of genomes processed, and makes smart use of evolutionary trees to make the many-to-many alignment problem more tractable.

The genome10k project seems to be a continuation and future direction for the Zoonomia project.

Browsing Output

(For more information on file formats for NGS, see note on Sequencing File Formats).

The data investigated here is called 241-mammalian-2020v2.hal (browser link). It is 806GB and contains the genomes for 241 mammals.

A naive implementation of alignment would require pairwise alignments, or roughly 30k alignments to do. The Cactus alignment algorithm provides significant optimization as it only requires about 500 alignments, by building an alignment tree - rather than aligning every single genome to all the others, inferred ancestor genomes can be aligned to other ancestor genomes, which allows computing pairwise alignments on-demand in a relatively quick manner. The HAL toolkit provides mechanisms for modifying this tree, as well as simple operations traversing the tree such as liftover to convert gene annotations from one genome to other genomes by traversing the alignment graph (BED, WIG).

The overall tree can be found on the the Zoonomia website. I made a copy with the Tree of Life project here for searching/browsing different branches in this post.

Analysis with HAL and BLAST

The following section attempts to perform some simple analysis on a subset of the Zoonomia dataset.

TP53

The Zoonomia paper hints that large mammals may contain extra copies of the tp53 gene, which could be part of the solution to Peto’s Paradox that cancer rates do not necessarily scale to number of cells in an organism.

The p53 protein is the “guardian of the genome” and it helps to slow reproduction and destroy damaged cells. The gene that codes p53 in humans is TP53 which is given here. P53 is one of the most studied genes, which is why we look at it here in the Zoonomia dataset.

Limiting analysis to Seals using halExtract

Limiting analysis to two similar seals can give us a sense for using HAL tools, and also test a hypothesis that larger sea mammals may have multiple copies of this gene.

Using halExtract we pull out a sub-tree with just two seals, Mirounga angustirostris (Northern Elephant Seal, genome assembly link) and Leptonychotes weddellii (Weddell Seal, arctic, genome assembly link):

$ halExtract --root fullTreeAnc216 241-mammalian-2020v2.hal seals.hal

$ halStats seals.hal confirms we extracted the data correctly.

GenomeNameNumChildrenLengthNumSequencesNumTopSegmentsNumBottomSegments
fullTreeAnc21622269286127106924402077225826238
Leptonychotes_weddellii0315688615916710251646440
Mirounga_angustirostris02407319723321256265212330

Finding tp53 in Large Mammal genomes BLAST

Whole-genome analysis can be quite slow overall, so ideally we want to look at some subset of the genome of interest. The “long” way to do this is to do a manual BLAST search for a particular gene (in this case, tp53). We will use BLAST to find the gene within the Homo sapiens Zoonomia genome, then use HAL to compare across species. The main species of interest are the two seal genomes mentioned above, as well as the savanah elephant which is known to have multiple copies of this gene.

To find this gene (or a possibly orthologous one), we start by downloading 25 isoform genes for TP53 from here) and put them into the file tp53.fasta for querying.

To do the actual querying, we create a BLAST databases for each of the genomes of interest. Then we’ll go ahead and search the databases for the 25 isoform genes of tp53 downloaded above.

function make_blast_db() {
    # Pull out a genome into a FASTA file and index it into a BLAST DB.
    GENOME_NAME=$1
    echo "Extracting $GENOME_NAME FASTA File"
    hal2fasta 241-mammalian-2020v2.hal "$GENOME_NAME" > "${GENOME_NAME}.fasta"
    mkdir -p "blast/${GENOME_NAME}"
    makeblastdb -dbtype nucl \
        -in "${GENOME_NAME}.fasta" \
        -out "blast/${GENOME_NAME}/db"
}

function search_blast_db() {
    # Search an existing blast DB for sequences in a given FASTA file.
    GENOME_NAME=$1
    QUERY_FASTA_FILE=$2
    QUERY_NAME=$(basename $QUERY_FASTA_FILE | cut -d. -f1)
    echo "Searching $GENOME_NAME genome for $QUERY_FASTA_FILE"
    blastn \
        -db "blast/${GENOME_NAME}/db" \
        -query "$QUERY_FASTA_FILE" \
        -out "${GENOME_NAME}_${QUERY_NAME}.out"
}

# Index and search for tp53.
$ make_blast_db Mirounga_angustirostris
$ make_blast_db Leptonychotes_weddellii
$ search_blast_db Mirounga_angustirostris tp53.fasta
$ search_blast_db Leptonychotes_weddellii tp53.fasta

# Check matches in human and elephant genomes to confirm positive controls.
$ make_blast_db Homo_sapiens
$ search_blast_db Homo_sapiens tp53.fasta
$ make_blast_db Loxodonta_africana
$ search_blast_db Loxodonta_africana tp53.fasta

The results for searching for isoform g (NM_001276761.3) are:

Finding tp53 in Seal genomes with halLiftover

The HAL format is optimized for finding similar sequences across genomes.

From the TP53 site we can create a BED file to annotate the Homo_sapiens for the gene in question. (Note that the BED file is expected to be tab-separated). For negative-strand matches returned in BLAST in the form (end, start) we return (start, end) and keep the strand as - per blast_to_bed.py.

# Put the location we found using BLAST into BED format. Use Python to print tabs.
python -c 'print(\
    "\t".join(["chr17", "7668420", "7669699", "tp53", "1000", "-"]))' \
    > tp53_homo_sapiens.bed

# Run Liftover from Homo_sapiens to Leptonychotes_weddellii.
halLiftover --bedType 6 \
241-mammalian-2020v2.hal \
Homo_sapiens tp53_homo_sapiens.bed \
Leptonychotes_weddellii Leptonychotes_weddellii_tp53_liftover.bed 

The result is that halLiftover found a much longer alignment along, 336374 - 335437, but allowed for more gaps. It’s likely that halLiftover was more permissive than with BLAST.

Retrying for positive control with the African elephant, Loxodonta_africana, gives large overlap spanning about 1kbp, but I think due to the way HAL handles the duplicated genomes it only finds one single chromosome rather than the 16 that BLAST found.

halLiftover --bedType 6 \
241-mammalian-2020v2.hal \
Homo_sapiens tp53_homo_sapiens.bed \
Loxodonta_africana Loxodonta_africana_tp53_liftover.bed

Looking at Seal Mutation Stats

Let’s look at the mutations across the species with halSummarizeMutations seals.hal:

GenomeNameParentNameBranchLengthGenomeLengthParentLengthSubtitutionsTransitionsTransversionsMatchesGapInsertionsGapInsertedBasesGapDeletionsGapDeletedBasesInsertionsInsertionBasesDeletionsDeletionBasesInversionsInvertedBasesDuplicationsDuplicatedBasesTranspositionsTranspositionBasesOther
Leptonychotes_weddelliifullTreeAnc2160.01315688615922692861272159122286620953832333217296018472327220727701406051119842001344331674833293897829577053251059036528881513585619183560549174259
Mirounga_angustirostrisfullTreeAnc2160.0021824073197232269286127160912841076339147490672239260145105759832129151648709775936421356961815883484053791930299461796236031259627713073633030429965
Total0.01218556420588245385722543768250619425486858140044122203291780870528568530547601974356434800222929921287383674963562415208322889194109862749919890853204224
Average0.0060927821029412269286127188412539712743429070022061101648904352642842152738098717821740011146496064369133748173127604161444592054931324954945426102113

To be honest, I’m not really sure what to make of these stats, but we can just keep them here for future reference.

Finding tp53 mutations in all of Zoonomia using halSnps

The analysis now concludes with a search for tp53 across all of Zoonomia mammals.

First, we need to confirm that the alignments we got from BLAST are matching the tp53 gene. To perform a quick sanity check, we export the sequence using hal2fasta and confirm that the resulting FASTA file’s reverse-complement (per bioinformatics.org) is matching the expected region for tp53 within the BLAST output.

hal2fasta --sequence chr17 --start 7668420 --length 1279 --upper \
--outFaPath Homo_sapiens_tp53_in_hal_file.fasta \
241-mammalian-2020v2.hal Homo_sapiens

Comparing the beginning of the match, we see:

BLAST:

Query  1230     CTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCA  1289
                ||| || |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  7669699  CTC-CTACAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCA  7669641

RC FAST:        CTC CTACAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCA

The end of the match is the same:

BLAST:

Query  2490     ACAATAAAACTTTGCTGCCA  2509
                ||||||||||||||||||||
Sbjct  7668440  ACAATAAAACTTTGCTGCCA  7668421

RC FASTA:       ACAATAAAACTTTGCTGCCA

With more confidence that the extracted region is what we’ve intended it to be, we can run halSnps on our 4 previously studied genomes. Assuming halSnps handles strandedness as BLAST does, it should be fine to search for the reverse complement (seems like it does in the code).

halSnps will report the number of nucleotide matches (totalSnps) and cross-species mismatches (totalCleanOrthologousPairs) for orthologous genes to the target.

We run the command and check the output for positive control:

halSnps --refSequence chr17 --start 7668420 --length 1279 \
241-mammalian-2020v2.hal Homo_sapiens \
Homo_sapiens,Loxodonta_africana,Mirounga_angustirostris,Leptonychotes_weddellii

Predictably, Homo sapiens has a perfect match (so it has zero SNPs with the entire sequence length matching). The other species do contain more matches, which was not something discovered using BLAST. It seems like the orthologous genes may be “forced” in the alignment.

targetGenometotalSnpstotalCleanOrthologousPairs
Homo_sapiens01279
Loxodonta_africana212872
Mirounga_angustirostris221842
Leptonychotes_weddellii216844

Now that we’ve confirmed reasonable output, we can run halSnps on all the zoonomia mammals. We test that at least 10% of the species (24) must have a mutation at the position for it to be labeled as a SNP, for smoothing. The 10% threshold was selected somewhat arbitrarily but should account for some noise in the sequencing.

function all_genome_names() {
    hal_file=$1
    echo $(halStats $hal_file | grep -v -e "fullTreeAnc\|hal v\|^$\|GenomeName" | cut -d "," -f1 | sort | uniq | paste -s -d, -)
}

halSnps --refSequence chr17 --start 7668420 --length 1279 --minSpeciesForSnp 24 \
241-mammalian-2020v2.hal Homo_sapiens \
"$(all_genome_names 241-mammalian-2020v2.hal)" | sed "s/ /,/g"

Certain primates share nearly identical copies of this gene. Future directions may plot similarity of this gene and others with expected lifespan. It could be interesting to cluster genes detected here, as there seems to be a modal peak with ~800bp in common with the human TP53 gene.

The full output for all 241 mammals is:

targetGenometotalSnpstotalCleanOrthologousPairs
Myotis_lucifugus23
Homo_sapiens01279
Myotis_myotis33207
Acinonyx_jubatus226770
Myrmecophaga_tridactyla43217
Acomys_cahirinus179497
Nannospalax_galili35120
Ailuropoda_melanoleuca228833
Nasalis_larvatus651263
Ailurus_fulgens248834
Neomonachus_schauinslandi223844
Allactaga_bullata231822
Neophocaena_asiaeorientalis201885
Alouatta_palliata216
Noctilio_leporinus417
Ammotragus_lervia256879
Nomascus_leucogenys501262
Anoura_caudifer25143
Nycticebus_coucang2497
Antilocapra_americana244877
Ochotona_princeps145420
Aotus_nancymaae210
Octodon_degus232827
Aplodontia_rufa00
Odobenus_rosmarus219829
Artibeus_jamaicensis27145
Odocoileus_virginianus163536
Ateles_geoffroyi03
Okapia_johnstoni253878
Balaenoptera_acutorostrata194885
Ondatra_zibethicus195678
Balaenoptera_bonaerensis193885
Onychomys_torridus70209
Beatragus_hunteri256879
Orcinus_orca205885
Bison_bison250875
Orycteropus_afer144640
Bos_indicus253875
Oryctolagus_cuniculus733
Otolemur_garnettii2999
Bos_mutus249875
Ovis_aries257879
Bos_taurus252875
Ovis_canadensis258879
Bubalus_bubalis175625
Callicebus_donacophilus16
Pan_paniscus101269
Callithrix_jacchus216
Panthera_onca245861
Camelus_bactrianus218884
Panthera_pardus245861
Camelus_dromedarius219884
Panthera_tigris227758
Camelus_ferus219884
Pantholops_hodgsonii258877
Canis_lupus205796
Pan_troglodytes81279
Canis_lupus_familiaris205796
Papio_anubis571263
Capra_aegagrus257879
Paradoxurus_hermaphroditus23121
Capra_hircus257879
Perognathus_longimembris00
Capromys_pilorides233825
Peromyscus_maniculatus3878
Carollia_perspicillata39191
Petromus_typicus46168
Castor_canadensis1649
Phocoena_phocoena202885
Catagonus_wagneri144477
Piliocolobus_tephrosceles581264
Cavia_aperea237806
Pipistrellus_pipistrellus754
Cavia_porcellus161496
Pithecia_pithecia17
Cavia_tschudii161496
Platanista_gangetica195885
Cebus_albifrons211
Pongo_abelii281277
Cebus_capucinus211
Procavia_capensis00
Ceratotherium_simum220884
Propithecus_coquereli53257
Ceratotherium_simum_cottoni220884
Psammomys_obesus117354
Cercocebus_atys581263
Pteronotus_parnellii37172
Cercopithecus_neglectus591263
Pteronura_brasiliensis251834
Chaetophractus_vellerosus174780
Pteropus_alecto191814
Pteropus_vampyrus185814
Cheirogaleus_medius35140
Puma_concolor121482
Chinchilla_lanigera210819
Pygathrix_nemaeus641263
Chlorocebus_sabaeus581263
Rangifer_tarandus163536
Choloepus_didactylus18129
Rattus_norvegicus57229
Choloepus_hoffmanni19129
Rhinolophus_sinicus193798
Chrysochloris_asiatica153593
Colobus_angolensis41930
Rhinopithecus_bieti651263
Condylura_cristata1554
Rhinopithecus_roxellana641263
Craseonycteris_thonglongyai30105
Rousettus_aegyptiacus197809
Cricetomys_gambianus24
Saguinus_imperator421
Cricetulus_griseus60195
Saiga_tatarica00
Crocidura_indochinensis2566
Saimiri_boliviensis212
Cryptoprocta_ferox242847
Scalopus_aquaticus1250
Ctenodactylus_gundi249855
Semnopithecus_entellus611263
Ctenomys_sociabilis233830
Sigmodon_hispidus86248
Cuniculus_paca221837
Solenodon_paradoxus98388
Dasyprocta_punctata216837
Sorex_araneus237800
Dasypus_novemcinctus195786
Spermophilus_dauricus224
Daubentonia_madagascariensis40203
Spilogale_gracilis238842
Suricata_suricatta235820
Delphinapterus_leucas200885
Desmodus_rotundus27144
Sus_scrofa144494
Tadarida_brasiliensis18137
Dicerorhinus_sumatrensis208881
Tamandua_tetradactyla38184
Diceros_bicornis216884
Tapirus_indicus191885
Dinomys_branickii219817
Tapirus_terrestris191885
Dipodomys_ordii212
Thryonomys_swinderianus01
Dipodomys_stephensi622
Tolypeutes_matacus70307
Dolichotis_patagonum242814
Tonatia_saurophila33151
Echinops_telfairi223689
Tragulus_javanicus260858
Eidolon_helvum200815
Trichechus_manatus198854
Elaphurus_davidianus250879
Tupaia_chinensis155608
Elephantulus_edwardii815
Tupaia_tana145558
Ellobius_lutescens23
Ellobius_talpinus79264
Tursiops_truncatus206885
Enhydra_lutris250835
Uropsilus_gracilis948
Eptesicus_fuscus32224
Ursus_maritimus231833
Equus_asinus204887
Vicugna_pacos222884
Equus_caballus205878
Vulpes_lagopus209796
Equus_przewalskii205878
Xerus_inauris171788
Erinaceus_europaeus257806
Zalophus_californianus224838
Erythrocebus_patas551261
Zapus_hudsonius33113
Eschrichtius_robustus191885
Ziphius_cavirostris195882
Eubalaena_japonica195885
Eulemur_flavifrons29122
Eulemur_fulvus31124
Felis_catus237824
Felis_nigripes00
Fukomys_damarensis210825
Galeopterus_variegatus1372
Giraffa_tippelskirchi249876
Glis_glis195817
Gorilla_gorilla81274
Graphiurus_murinus207828
Helogale_parvula238825
Hemitragus_hylocrius254879
Heterocephalus_glaber201818
Heterohyrax_brucei00
Hippopotamus_amphibius210862
Hipposideros_armiger186799
Hipposideros_galeritus200818
Hyaena_hyaena236837
Hydrochoerus_hydrochaeris232822
Hystrix_cristata214823
Ictidomys_tridecemlineatus1870
Indri_indri32135
Inia_geoffrensis41220
Jaculus_jaculus1749
Kogia_breviceps206885
Lasiurus_borealis221797
Lemur_catta27113
Leptonychotes_weddellii216844
Lepus_americanus512
Lipotes_vexillifer201885
Loxodonta_africana212872
Lycaon_pictus205796
Macaca_fascicularis551257
Macaca_mulatta521175
Macaca_nemestrina541013
Macroglossus_sobrinus193803
Mandrillus_leucophaeus541262
Manis_javanica201830
Manis_pentadactyla203829
Marmota_marmota1364
Megaderma_lyra22110
Mellivora_capensis251836
Meriones_unguiculatus114350
Mesocricetus_auratus65202
Mesoplodon_bidens195883
Microcebus_murinus38175
Microgale_talazaci189610
Micronycteris_hirsuta25136
Microtus_ochrogaster195680
Miniopterus_natalensis30198
Miniopterus_schreibersii30198
Mirounga_angustirostris221842
Mirza_coquereli58246
Monodon_monoceros200885
Mormoops_blainvillei23129
Moschus_moschiferus231770
Mungos_mungo236822
Murina_feae53265
Muscardinus_avellanarius59277
Mus_caroli75242
Mus_musculus2877
Mus_pahari71246
Mus_spretus77252
Mustela_putorius248822
Myocastor_coypus236806
Myotis_brandtii32200
Myotis_davidii14103

Summary

Overall, it seems like HAL and Zoonomia are interesting places to start with comparative genomics. Simply having all the genomes aligned and in one file is quite powerful and combined with AlphaFold (for certain proteins) might yield some results to better understand how mutations may affect function.

Some interesting directions might be to try to take the Fauna Bio approach to compare how hibernating mammals and reptiles can survive in sub-freezing temperatures, and try to understand using halLiftover if there are any genes that these reptiles and mammals share in common. For example, according to this paper on Reptile Freeze Tolerance, reptiles express genes related to iron binding, antioxidant defense, and serine protease inhibitors in their heart and liver after being exposed to freezing temperatures.