Schema for Phylogeny: Public - Phylogenetic Tree and Nucleotide Substitution Mutations in Sequences in Public Databases
  Database: wuhCor1    Primary Table: sarsCov2PhyloPubAllMinAf01
VCF File Download: /gbdb/wuhCor1/sarsCov2PhyloPub/public.all.minAf.01.vcf.gz
Format description: The fields of a Variant Call Format data line
field | description
chrom | An identifier from the reference genome
pos | The reference position, with the 1st base having position 1
id | Semicolon-separated list of unique identifiers where available
ref | Reference base(s)
alt | Comma-separated list of alternate non-reference alleles called on at least one of the samples
qual | Phred-scaled quality score for the assertion made in ALT, i.e. -10 log_10 prob(call in ALT is wrong)
filter | PASS if this position has passed all filters. Otherwise, a semicolon-separated list of codes for filters that fail
info | Additional information encoded as a semicolon-separated series of short keys with optional comma-separated values
format | If genotype columns are specified in header, a colon-separated list of short keys starting with GT
genotypes | If genotype columns are specified in header, a tab-separated set of genotype column values; each value is a colon-separated list of values corresponding to keys in the format column
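
The field layout above can be illustrated with a small parser. This is a minimal sketch, not the Genome Browser's own code: the names `VcfRecord` and `parse_vcf_line` are hypothetical, and it assumes a tab-separated data line that includes the format column and at least the nine fixed columns.

```python
from typing import Dict, List, NamedTuple


class VcfRecord(NamedTuple):
    """One VCF data line, split according to the field table (hypothetical type)."""
    chrom: str
    pos: int                        # 1-based reference position
    ids: List[str]                  # semicolon-separated in the file
    ref: str
    alts: List[str]                 # comma-separated in the file
    qual: str                       # kept as text; "." means missing
    filters: List[str]              # semicolon-separated; empty if "."
    info: Dict[str, List[str]]      # key -> comma-separated values
    format_keys: List[str]          # colon-separated keys, e.g. ["GT"]
    genotypes: List[Dict[str, str]]  # one dict per sample column


def _split(value: str, sep: str) -> List[str]:
    """Split a list-valued field, treating "." as missing."""
    return [] if value in ("", ".") else value.split(sep)


def parse_vcf_line(line: str) -> VcfRecord:
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ids, ref, alt, qual, flt, info, fmt = cols[:9]
    info_dict: Dict[str, List[str]] = {}
    for item in _split(info, ";"):
        key, _, value = item.partition("=")
        info_dict[key] = value.split(",") if value else []
    format_keys = fmt.split(":")
    # Each genotype column is a colon-separated list matching format_keys.
    genotypes = [dict(zip(format_keys, g.split(":"))) for g in cols[9:]]
    return VcfRecord(chrom, int(pos), _split(ids, ";"), ref, _split(alt, ","),
                     qual, _split(flt, ";"), info_dict, format_keys, genotypes)
```

Keeping `qual` as text avoids guessing how a missing value (".") should be represented numerically.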

Sample Rows
 
chrom | pos | id | ref | alt | qual | filter | info | format | genotypes
NC_045512v2 | 66 | C66T | C | T | . | . | AC=4973;AN=227157 | GT | 0000000000...
NC_045512v2 | 204 | G204T | G | T | . | . | AC=46439;AN=227157 | GT | 0000000000...
NC_045512v2 | 241 | C241T | C | T | . | . | AC=213269;AN=227157 | GT | 0000000000...
NC_045512v2 | 313 | C313T | C | T | . | . | AC=2780;AN=227157 | GT | 0000000000...
NC_045512v2 | 445 | T445C | T | C | . | . | AC=70916;AN=227157 | GT | 0000000000...
NC_045512v2 | 913 | C913T | C | T | . | . | AC=14563;AN=227157 | GT | 0000000000...
NC_045512v2 | 1059 | C1059T | C | T | . | . | AC=17817;AN=227157 | GT | 0000000000...
NC_045512v2 | 1163 | A1163T | A | T | . | . | AC=16458;AN=227157 | GT | 0000000000...
NC_045512v2 | 1210 | G1210T | G | T | . | . | AC=2341;AN=227157 | GT | 0000000000...
NC_045512v2 | 1513 | C1513T | C | T | . | . | AC=2496;AN=227157 | GT | 0000000000...
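
The info column of the sample rows carries AC and AN values. Assuming these have their standard VCF meanings (AC is the alternate allele count, AN the total number of called alleles), the alternate allele frequency behind a "minAf.01" style filter could be computed roughly as follows. The function names and the 1% default are illustrative assumptions, not part of the track's tooling.

```python
def allele_frequency(info: str) -> float:
    """Return AC/AN for the first alternate allele of a VCF INFO string."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    ac = int(fields["AC"].split(",")[0])  # AC may list one count per ALT allele
    an = int(fields["AN"])
    return ac / an


def passes_min_af(info: str, min_af: float = 0.01) -> bool:
    """True if the alternate allele is found in at least min_af of alleles."""
    return allele_frequency(info) >= min_af


# Example using the first sample row (position 66, C66T).
af = allele_frequency("AC=4973;AN=227157")
print(f"C66T alternate allele frequency: {af:.4f}")
```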

Phylogeny: Public (sarsCov2PhyloPub) Track Description
 

Description

This track displays a phylogenetic tree relating public SARS-CoV-2 genome sequences available from NCBI Virus / GenBank, COG-UK and the China National Center for Bioinformation, contributed by laboratories around the world, and mutations found in those sequences. By default, only very common mutations (alternate allele found in at least 1% of samples) are displayed, but other subtracks may be made visible in order to see rarer mutations.

The phylogenetic tree is inferred by the sarscov2phylo pipeline (Lanfear). For display in the narrow space to the left of the main genome browser image, nodes in the tree are collapsed unless a mutation is associated with a node; i.e. the only branching points displayed are those at which mutations occurred.
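
The collapsing rule described above can be sketched on a toy tree. This is only an illustration of the idea (internal nodes without an associated mutation are removed and their children attached to the parent); the `Node` type and `collapse_tree` function are hypothetical and do not reflect how the Genome Browser or UShER implements it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    mutations: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def _collapse(node: Node) -> List[Node]:
    """Return the node(s) that should replace `node` under its parent."""
    kept: List[Node] = []
    for child in node.children:
        kept.extend(_collapse(child))
    node.children = kept
    # An internal node with no mutation is dropped; its children move up.
    if node.children and not node.mutations:
        return node.children
    return [node]  # leaves (samples) and mutation-bearing nodes are kept


def collapse_tree(root: Node) -> Node:
    """Collapse mutation-free internal nodes below the root, in place."""
    kept: List[Node] = []
    for child in root.children:
        kept.extend(_collapse(child))
    root.children = kept
    return root
```

In this sketch the root is always kept, so the result is still a single tree.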

The tree is colored by Pangolin lineage (Rambaut et al.). The coloring scheme is adapted from Figure 1 of Alm et al., which presents a unified view of a simplified phylogenetic tree, Pangolin lineages, Nextstrain clades and GISAID clades.

color | Pangolin lineage(s) | Nextstrain clade | GISAID clade
(swatch) | A | 19B | S
(swatch) | B.n (n > 1) | 19A | L
(swatch) | n/a (color not used when coloring by lineage; overlaps on tree with B.4 - B.7) | n/a (overlaps on tree with 19A) | O
(swatch) | n/a (color not used when coloring by lineage; overlaps on tree with B.2) | n/a (overlaps on tree with 19A) | V
(swatch) | B.1.5, B.1.6, B.1.8, other B.1.n that overlap GISAID clade G | 20A (partial) | G
(swatch) | B.1.9, B.1.13, B.1.22, B.1.36, B.1.37 | 20A (partial) | GH (partial)
(swatch) | B.1.3, B.1.12, B.1.26, other B.1.n that overlap GISAID clade GH | 20C | GH (partial)
(swatch) | B.1.1 | 20B | GR

Display Conventions

In "dense" mode, a vertical line is drawn at each position where there is a mutation. In "squish" and "pack" modes, the display shows a plot of all samples' mutations, with samples ordered using the phylogenetic tree in order to highlight patterns of linkage. "Full" display mode shows each mutation on its own row, ordered by position instead of lineage.

Each sample is placed in a horizontal row of pixels; when the number of samples exceeds the number of vertical pixels for the track, multiple samples fall in the same pixel row and pixels are averaged across samples.
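
The row averaging described above can be sketched as follows, assuming each sample contributes 0 (reference allele) or 1 (non-reference allele) at a given position and that samples are already in tree order. The function name and the even split of samples across rows are assumptions for illustration, not the browser's actual drawing code.

```python
from typing import List


def average_into_rows(sample_values: List[float], n_rows: int) -> List[float]:
    """Average per-sample values into at most n_rows pixel rows, in order."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    n = len(sample_values)
    rows = min(n_rows, n)  # with fewer samples than rows, one sample per row
    sums = [0.0] * rows
    counts = [0] * rows
    for i, value in enumerate(sample_values):
        row = i * rows // n  # consecutive samples share a row
        sums[row] += value
        counts[row] += 1
    return [s / c for s, c in zip(sums, counts)]
```

A row value between 0 and 1 then corresponds to a pixel whose intensity mixes reference and non-reference samples.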

Each mutation is a vertical bar at its position in the SARS-CoV-2 genome with white (invisible) representing the reference allele; the non-reference allele is shown in red if it changes the protein sequence of a gene, green if it falls within a gene but does not change the protein, and black if it does not fall within a gene. Tick marks are drawn at the top and bottom of each mutation's vertical bar to make the bar more visible when most alleles are reference alleles. Only single-nucleotide substitutions are displayed, not insertions or deletions.
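
The coloring rule in this paragraph amounts to a small decision function. The sketch below only restates that rule; the function name and boolean inputs are hypothetical, and determining whether a substitution changes the protein is outside its scope.

```python
def mutation_color(is_reference: bool, in_gene: bool, changes_protein: bool) -> str:
    """Return the display color for one sample's allele at one position."""
    if is_reference:
        return "white"   # reference allele: drawn as white (invisible)
    if not in_gene:
        return "black"   # non-reference allele outside any gene
    # Non-reference allele within a gene.
    return "red" if changes_protein else "green"
```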

The phylogenetic tree showing inferred relationships between the samples is depicted in the left column of the display. Mousing over this will show the sample identifiers. At the default track height, about 100 samples are averaged into each row of pixels. The track height can be adjusted in the track controls, which can be reached by clicking on the gray button to the left of the tree or by right-clicking on the image.

Methods

Rob Lanfear regularly runs the sarscov2phylo pipeline on all complete, high-coverage sequences available from GISAID EpiCoV™. The pipeline aligns all sequences to the same reference genome used by the Genome Browser (RefSeq NC_045512.2, GenBank MN908947.3, GISAID sample hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31) using MAFFT (Katoh et al.). It masks sites identified as problematic by the ProblematicSites_SARS-CoV2 repository (De Maio et al.), as well as sites that are Ns or gaps in more than 50% of samples. FastTree (Price et al.) is used to infer the phylogenetic tree; sequences on very long branches are removed using TreeShrink (Mai et al.). The tree is re-rooted to hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05.
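
The second masking criterion (sites that are N or gap in more than half of the samples) can be illustrated on a toy alignment. This is a minimal sketch and not the sarscov2phylo code: the function name is hypothetical, and replacing masked columns with "N" rather than removing them is an assumption.

```python
from typing import List


def mask_sparse_sites(alignment: List[str],
                      max_missing_fraction: float = 0.5,
                      mask_char: str = "N") -> List[str]:
    """Mask alignment columns where N or gap exceeds max_missing_fraction."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    masked = set()
    for col in range(length):
        missing = sum(1 for seq in alignment if seq[col] in "Nn-")
        if missing / n > max_missing_fraction:  # strictly more than the cutoff
            masked.add(col)
    return ["".join(mask_char if i in masked else base
                    for i, base in enumerate(seq))
            for seq in alignment]
```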

For full details, see the sarscov2phylo documentation.

UCSC makes a reduced version of the tree that contains only samples from fully public databases (GenBank, COG-UK direct release, CNCB) that do not prohibit UCSC from offering sequence mutations for download (see Data Access). UCSC also makes several adjustments to the phylogenetic tree for compact display:

  • We shorten "2019" and "2020" in dates to "19" and "20".
  • We change the root of the tree to the reference genome used by the Genome Browser (NC_045512.2, Wuhan/Hu-1).
  • Nodes that do not have an associated mutation are collapsed using UShER (Turakhia et al.).
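
The date shortening in the first adjustment could look something like the sketch below. It assumes that only the year at the start of an ISO-style date (YYYY-MM-DD) is shortened and that other occurrences of "2019" or "2020" in a sample name are left alone; the function name and that scoping are assumptions, not UCSC's actual rule.

```python
import re

# Matches "2019" or "2020" only when followed by "-MM-DD".
_YEAR_IN_DATE = re.compile(r"\b20(19|20)(?=-\d{2}-\d{2}\b)")


def shorten_year(name: str) -> str:
    """Shorten 2019/2020 to 19/20 inside ISO-style dates in a sample name."""
    return _YEAR_IN_DATE.sub(r"\1", name)


print(shorten_year("hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05"))
```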

Data Access

Files are available from our Download Server.

The VCF data can be explored interactively with the Table Browser or the Data Integrator, and accessed from scripts through our API.
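
For scripted access, a request URL for the Genome Browser REST API might be built as in the sketch below. The `/getData/track` endpoint and its `genome`, `track`, `chrom`, `start` and `end` parameters are taken from the public API documentation as I understand it; whether this particular VCF track is served through that endpoint is an assumption that should be checked against the API before relying on it. The helper name is hypothetical.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(genome: str, track: str, chrom: str,
                   start: int, end: int) -> str:
    """Build a getData/track request URL for a region (no request is sent)."""
    query = urlencode({
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return f"{API_BASE}/getData/track?{query}"


# Database and primary table names as given in the schema header.
url = track_data_url("wuhCor1", "sarsCov2PhyloPubAllMinAf01",
                     "NC_045512v2", 0, 1000)
print(url)
```

Building the URL separately from fetching it makes it easy to inspect the request before sending it with any HTTP client.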

The sarscov2phylo repository includes all releases of the full phylogenetic tree.

Credits

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge the authors and the originating laboratories where the clinical specimen or virus isolate was first obtained and the submitting laboratories, where sequence data have been generated and submitted to public databases, on which this research is based.

Special thanks to Rob Lanfear for developing, running and sharing the sarscov2phylo pipeline and results.

Data usage policy

The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors if you intend to carry out further research using their data. Authors and/or institutions that provided the sequences are listed in acknowledgements.tsv.gz.

References

Lanfear, R. A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883. 2020.

Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. PMID: 32669681

Alm E, Broberg EK, Connor T, Hodcroft EB, Komissarov AB, Maurer-Stroh S, Melidou A, Neher RA, O'Toole Á, Pereyaslov D et al. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020. Euro Surveill. 2020 Aug;25(32). PMID: 32794443; PMC: PMC7427299

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. PMID: 23329690; PMC: PMC3603318

De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Masking strategies for SARS-CoV-2 alignments. virological.org. 2020 May 13.

De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14.

Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, Haussler D, Corbett-Detig R. Ultrafast Sample Placement on Existing Trees (UShER) Empowers Real-Time Phylogenetics for the SARS-CoV-2 Pandemic. bioRxiv. 2020 September 28.

Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010 Mar 10;5(3):e9490. PMID: 20224823; PMC: PMC2835736

Mai U, Mirarab S. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics. 2018 May 8;19(Suppl 5):272. PMID: 29745847; PMC: PMC5998883