Schema for Problematic Sites - Problematic sites where masking or caution are recommended for analysis
Database: wuhCor1
Primary Table: problematicSitesMask
Data last updated: 2021-11-01
Big Bed File Download: /gbdb/wuhCor1/problematicSites/problematicSitesMask.bb
Item Count: 114
The data is stored in the binary BigBed format.

Format description: SARS-CoV-2 locations with attributes that make them problematic for analysis
field | example | description
chrom | NC_045512v2 | Reference sequence ID
chromStart | 20055 | Start position in reference
chromEnd | 20056 | End position in reference
name | ambiguous,homoplasic,single_src | Reason site is problematic
ref | G | Reference allele
alt | K,A | Alternate allele(s)
submitter | Russell Corbett-Detig | Name of submitter(s) of this site to ProblematicSites_SARS-CoV2 repository
lab | Public Health England (UK) | Source laboratory(ies) of samples with the variant
gene | orf1ab | Gene in which site falls, if applicable
aaPos | 6597 | Position of amino acid residue within gene
refAa | E | Reference amino acid residue
altAa | X,K | List of alternative amino acid residues (IUPAC ambiguity code)
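The twelve fields above map naturally onto a small record type. The following is a minimal sketch, not part of the track itself: it assumes the bigBed file has first been converted to tab-separated text (for example with a tool such as bigBedToBed), and the names `ProblematicSite`, `split_list`, and `parse_site` are hypothetical. Coordinates follow the usual BED convention (0-based start, exclusive end), which matches the one-base example 20055-20056.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProblematicSite:
    """One row of the table; class and attribute names are illustrative."""
    chrom: str
    chrom_start: int          # 0-based start (BED convention)
    chrom_end: int            # exclusive end
    reasons: List[str]        # the comma-separated 'name' field
    ref: str
    alt: List[str]
    submitter: str
    lab: str                  # may be empty
    gene: str                 # may be empty if the site is outside any gene
    aa_pos: Optional[int]     # None when the field is empty
    ref_aa: str
    alt_aa: List[str]


def split_list(value: str) -> List[str]:
    """Split a comma-separated field, dropping empty items."""
    return [item for item in value.split(",") if item]


def parse_site(line: str) -> ProblematicSite:
    """Parse one tab-separated row (assumed layout: the 12 fields above)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return ProblematicSite(
        chrom=fields[0],
        chrom_start=int(fields[1]),
        chrom_end=int(fields[2]),
        reasons=split_list(fields[3]),
        ref=fields[4],
        alt=split_list(fields[5]),
        submitter=fields[6],
        lab=fields[7],
        gene=fields[8],
        aa_pos=int(fields[9]) if fields[9] else None,
        ref_aa=fields[10],
        alt_aa=split_list(fields[11]),
    )
```

Note that the submitter and lab fields can themselves contain commas, so they are kept as plain strings rather than split into lists.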

Sample Rows

chrom | chromStart | chromEnd | name | ref | alt | submitter | lab | gene | aaPos | refAa | altAa
NC_045512v2 | 20055 | 20056 | ambiguous,homoplasic,single_src | G | K,A | Russell Corbett-Detig | Public Health England (UK) | orf1ab | 6597 | E | X,K
NC_045512v2 | 20122 | 20123 | ambiguous,homoplasic,single_src | T | C,Y | Russell Corbett-Detig | Public Health England (UK) | orf1ab | 6620 | I | L,L
NC_045512v2 | 20464 | 20465 | highly_homoplasic,single_src | A | R,W,G | Russell Corbett-Detig | Center for Global Health, University of New Mexico Health Sciences Center (USA) | orf1ab | 6734 | D | X,X,V
NC_045512v2 | 21148 | 21149 | single_src,highly_homoplasic | G | A | Landen Gozashti | University of Washington (USA) | orf1ab | 6962 | G | K
NC_045512v2 | 21150 | 21151 | single_src,highly_homoplasic | G | D,T | Landen Gozashti | University of Washington (USA) | orf1ab | 6962 | G | X,D
NC_045512v2 | 21208 | 21209 | single_src,highly_homoplasic,neighbour_linked | T | C,H | Landen Gozashti | University of Washington (USA) | orf1ab | 6982 | M | R,X
NC_045512v2 | 21211 | 21212 | single_src,highly_homoplasic,neighbour_linked | G | A | Landen Gozashti | University of Washington (USA) | orf1ab | 6983 | G | N
NC_045512v2 | 21549 | 21550 | ambiguous,homoplasic,narrow_src | A | C,M | Nicola De Maio, Russell Corbett-Detig | The Council of Scientific and Industrial Research (India) | orf1ab | 7095 | N | T,T
NC_045512v2 | 21550 | 21551 | ambiguous,homoplasic,narrow_src | A | R,T,W | Nicola De Maio, Russell Corbett-Detig | The Council of Scientific and Industrial Research (India) | orf1ab | 7096 | N | X,S,X
NC_045512v2 | 21574 | 21575 | highly_homoplasic | C | T,Y | Nicola De Maio, Russell Corbett-Detig |  | S | 5 | L | F,X

Problematic Sites (problematicSites) Track Description
 

Description

Attempts to infer phylogenetic relationships, sites under selection, or evidence of recombination from SARS-CoV-2 genome sequences can be led astray by sequencing errors, contamination, and hypermutable sites. To make reliable inferences, it is important to identify probable errors and susceptible sites in the genome sequences, consider how they might affect the specific analysis at hand, and, where appropriate, exclude problematic sites from the analysis.

This track shows locations in the SARS-CoV-2 genome that have been identified as problematic for analysis for various reasons. They have been collected in the GitHub repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. Locations are separated into two subtracks and colored according to severity:

  • Mask: Problems are expected to affect most types of analysis, so it is recommended to mask out these sites before analysis.
  • Caution: Some types of analysis may be affected while other types may not; caution is recommended.
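As a rough illustration of what masking a site means in practice, the sketch below replaces the bases at a set of intervals with `N`. It is a simplification under stated assumptions: the sequence is taken to be already aligned to the reference coordinates (no insertions or deletions), intervals are 0-based and end-exclusive as in chromStart/chromEnd, and the function name `mask_sites` is hypothetical.

```python
def mask_sites(sequence, sites, mask_char="N"):
    """Return a copy of `sequence` with every interval in `sites` masked.

    `sites` is an iterable of (start, end) pairs, 0-based and end-exclusive.
    Assumes the sequence shares the reference coordinate system.
    """
    bases = list(sequence)
    for start, end in sites:
        if start < 0 or end > len(bases) or start >= end:
            raise ValueError(f"interval out of range: {start}-{end}")
        for i in range(start, end):
            bases[i] = mask_char
    return "".join(bases)
```

For example, `mask_sites("ACGTACGT", [(2, 3), (5, 7)])` returns `"ACNTANNT"`. Real pipelines usually mask columns of a multiple alignment rather than single sequences, but the coordinate handling is the same.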

Locations are labeled with the following terms to indicate the type of potential problem:

  • ambiguous: Sites which show an excess of ambiguous basecalls relative to the number of alternative alleles, often emerging from a single country or sequencing laboratory
  • amended: Previous sequencing errors which now appear to have been fixed in the latest versions of the GISAID sequences, at least in sequences from some of the sequencing laboratories
  • highly_ambiguous: Sites with a very high proportion of ambiguous characters, relative to the number of alternative alleles
  • highly_homoplasic: Positions which are extremely homoplasic; it is not always clear whether these are hypermutable sites or sequencing artefacts
  • homoplasic: Homoplasic sites, with many mutation events needed to explain a relatively small alternative allele count
  • interspecific_contamination: Cases (only one instance as of July 2020) in which the known sequencing issue is due to contamination from genetic material that is not of SARS-CoV-2 origin
  • nanopore_adapter: Cases in which the known sequencing issue is due to the adapter sequences in nanopore reads
  • narrow_src: Mutations which are found in sequences from only a few sequencing labs (usually two or three), possibly as a consequence of the same artefact reproduced independently
  • neighbour_linked: Proximal mutations displaying near perfect linkage
  • seq_end: Alignment ends are affected by low coverage and high error rates (masking recommended, but might be more stringent than necessary)
  • single_src: Only observed in samples from a single laboratory
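Because a site can carry several of these terms at once (the name field is a comma-separated list), it can be useful to split the field and flag anything unexpected. The following is a small sketch; the helper names `parse_reasons` and `count_terms` are hypothetical, and the term set is simply the list above, which may not match later versions of the repository.

```python
from collections import Counter

# Terms as listed in this track description; later data releases may add more.
KNOWN_TERMS = frozenset({
    "ambiguous", "amended", "highly_ambiguous", "highly_homoplasic",
    "homoplasic", "interspecific_contamination", "nanopore_adapter",
    "narrow_src", "neighbour_linked", "seq_end", "single_src",
})


def parse_reasons(name):
    """Split a 'name' field into (terms, unknown_terms)."""
    terms = [t.strip() for t in name.split(",") if t.strip()]
    unknown = [t for t in terms if t not in KNOWN_TERMS]
    return terms, unknown


def count_terms(names):
    """Count how often each term occurs across many 'name' fields."""
    counts = Counter()
    for name in names:
        terms, _ = parse_reasons(name)
        counts.update(terms)
    return counts
```

Calling `parse_reasons("single_src,highly_homoplasic,neighbour_linked")` yields the three terms and an empty list of unknown terms.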

Methods

Multiple groups applied various methods (De Maio, Walker et al.; De Maio, Gozashti et al.; Turakhia et al.) to identify sites that were homoplasic, likely contaminated, likely sequencing errors, and/or observed in multiple virus lineages by only one or a few laboratories. They contributed their observations and recommendations to the GitHub repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. UCSC downloaded the collection, split the sites into Mask and Caution subsets according to the recommended action, and reformatted the data for display in the Genome Browser.
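The split into Mask and Caution subsets can be sketched as follows. This is not UCSC's actual pipeline, only an illustration under an explicit assumption: that the recommended action is recorded in the VCF FILTER column as `mask` or `caution`. The function name `split_by_action` is hypothetical. The standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and the 1-based POS are part of the VCF format; the conversion to 0-based, end-exclusive intervals matches the chromStart/chromEnd fields of this track.

```python
def split_by_action(vcf_lines):
    """Group single-base VCF records into 'mask' and 'caution' intervals.

    Assumption (not confirmed by the track description): the FILTER column
    holds the recommended action. Returns a dict mapping each action to a
    list of (chrom, start, end) tuples, 0-based and end-exclusive.
    """
    subsets = {"mask": [], "caution": []}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # too short to contain a FILTER column
        action = fields[6].strip().lower()
        if action in subsets:
            pos = int(fields[1])  # VCF positions are 1-based
            subsets[action].append((fields[0], pos - 1, pos))
    return subsets
```

Records whose FILTER value is neither `mask` nor `caution` are ignored in this sketch; a stricter version might raise an error instead.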

Data Access

The original data file was downloaded from GitHub: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf. You can download the bigBed files underlying this track (problematicSites*.bb) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API.
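For scripted access, a request to the API might look like the sketch below. The endpoint path (`/getData/track`) and the semicolon-separated parameters reflect the UCSC REST API as commonly documented, but treat them as assumptions to check against the current API help page; the track name is the primary table named above, and the helper names `track_data_url` and `fetch_track` are hypothetical. The layout of the returned JSON is not assumed here.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"  # assumed base URL of the UCSC REST API


def track_data_url(genome="wuhCor1", track="problematicSitesMask",
                   chrom=None, start=None, end=None):
    """Build a getData/track URL; the optional arguments narrow the region."""
    params = [("genome", genome), ("track", track)]
    if chrom is not None:
        params.append(("chrom", chrom))
    if start is not None and end is not None:
        params.append(("start", start))
        params.append(("end", end))
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{API_BASE}/getData/track?{query}"


def fetch_track(url, timeout=30):
    """Download the URL and return the decoded JSON document as a dict."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A call such as `fetch_track(track_data_url(chrom="NC_045512v2", start=20000, end=22000))` would request only the sites in that window, which keeps responses small when scanning a single gene region.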

References

De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Issues with SARS-CoV-2 sequencing data. virological.org. 2020 May 5.

De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14.

Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, Corbett-Detig R. Stability of SARS-CoV-2 Phylogenies. bioRxiv. 2020 June 9.