Problematic Regions Tracks
 
Problematic/special genomic regions for sequencing or very variable regions tracks

Highly Reproducible Regions  Highly Reproducible genomic regions for sequencing  
Problematic Regions  Problematic/special genomic regions for sequencing or very variable regions  

Description

This container track calls out sections of the genome that often cause problems or confusion when working with the genome. There are currently three subtracks: Anshul Kundaje's ENCODE Blacklist, the GRC (Genome Reference Consortium) Exclusions, and the UCSC Unusual Regions track.

The hg19 genome has a track with the same name but with many more subtracks, because the GeT-RM and Genome-in-a-Bottle artifact variants do not yet exist for hg38, to our knowledge. If you are missing a track here that you know from hg19 and have an idea how to add it to hg38, do not hesitate to contact us.

The Problematic Regions track contains the following subtracks:

  • The UCSC Unusual Regions subtrack contains annotations collected at UCSC, put together from other tracks, our own experience, and support email list requests over the years. For example, it contains the most well-known gene clusters (IGH, IGL, PAR1/2, TCRA, TCRB, etc.) and annotations for the GRC fixed sequences, alternate haplotypes, unplaced contigs, pseudo-autosomal regions, and mitochondria. These loci can yield alignments with low-quality mapping scores and discordant read pairs, especially for short-read sequencing data. This data set was manually curated, based on the Genome Browser's assembly description, the FAQs about assembly, and the NCBI RefSeq "other" annotations track data.
  • The ENCODE Blacklist subtrack contains a comprehensive set of regions which are troublesome for high-throughput Next-Generation Sequencing (NGS) aligners. These regions tend to have a very high ratio of multi-mapping to unique mapping reads and high variance in mappability due to repetitive elements such as satellite, centromeric and telomeric repeats.
  • The GRC Exclusions subtrack contains a set of regions that the GRC has flagged as containing false duplications or contamination sequences. The GRC has now removed these sequences from the files that it uses to generate the reference assembly; however, removing the sequences from the GRCh38/hg38 assembly itself would trigger the next major release of the human assembly. To help users recognize these regions and avoid them in their analyses, the GRC has produced a masking file to be used as a companion to GRCh38, and the BED file is available from the GenBank FTP site.
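One common way to avoid such regions in an analysis is to drop any feature or call that overlaps them. The following Python sketch illustrates that idea only; it is not UCSC's or the GRC's own tooling. The function names and the example coordinates are hypothetical, and the code assumes BED-style zero-based, half-open intervals that have already been read into memory.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome, sorted and merged."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = [ivs[0]]
        for start, end in ivs[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlapping or adjacent: extend
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def overlaps(index, chrom, start, end):
    """Return True if [start, end) overlaps any indexed region on chrom."""
    ivs = index.get(chrom)
    if not ivs:
        return False
    # Locate the last merged region that starts before `end`.
    i = bisect_right(ivs, (end - 1, float("inf"))) - 1
    return i >= 0 and ivs[i][1] > start


# Hypothetical exclusion regions and calls, for illustration only.
excluded = build_index([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)])
calls = [("chr1", 90, 100), ("chr1", 299, 310), ("chr2", 50, 60), ("chr3", 5, 6)]
kept = [c for c in calls if not overlaps(excluded, *c)]
# kept == [("chr1", 90, 100), ("chr2", 50, 60), ("chr3", 5, 6)]
```

Merging the regions first keeps each lookup to a single binary search, which matters when many calls are checked against the same exclusion set.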

The Highly Reproducible Regions track highlights regions and variants from eight samples that can be used to assess variant detection pipelines. The "Highly Reproducible Regions" subtrack comprises the intersection of the reproducible regions across all eight samples, while the "Variants" subtracks contain the reproducible variants from each assayed sample. Both tracks contain data from the following samples:

  • a Chinese Quartet, samples CQ-5, CQ-6, CQ-7, CQ-8
  • a HapMap Trio, samples NA10385, NA12248, NA12249
  • a Genome in a Bottle sample, NA12878
Please refer to the Pan et al. reference for more information on how these regions were defined.
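To make the notion of an intersection across samples concrete, the sketch below intersects per-sample interval lists on a single chromosome. It illustrates the general operation only and is not the procedure used by Pan et al.; the function names and coordinates are hypothetical, and each input list is assumed to be sorted, non-overlapping, and half-open.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(per_sample):
    """Intersect the interval lists of several samples (one chromosome)."""
    if not per_sample:
        return []
    return reduce(intersect_two, per_sample)


# Three hypothetical samples; only positions present in all of them remain.
samples = [
    [(0, 100), (200, 300)],
    [(50, 250)],
    [(60, 80), (90, 220)],
]
shared = intersect_all(samples)
# shared == [(60, 80), (90, 100), (200, 220)]
```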

Display Conventions and Configuration

Each track contains a set of regions of varying length with no special configuration options. The UCSC Unusual Regions track has a mouse-over description; all other tracks have at most a name field, which can be shown in pack mode. The tracks are usually kept in dense mode.

The Hide empty subtracks control hides subtracks with no data in the browser window. Changing the browser window by zooming or scrolling may result in the display of a different selection of tracks.

Data access

The raw data can be explored interactively with the Table Browser or the Data Integrator.

For automated download and analysis, the genome annotation is stored in bigBed files that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only the features within a given range, for example:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb -chrom=chr21 -start=0 -end=100000000 stdout
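
For scripted use, the same command can be wrapped in a short program. The Python sketch below is one possible way to do this, assuming bigBedToBed is installed and on the PATH; the helper names (build_command, parse_bed, fetch_regions) are hypothetical and are not part of any UCSC library. It uses only the options shown in the command above.

```python
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb"


def build_command(url, chrom, start, end):
    """Assemble the bigBedToBed argument list for one genomic range."""
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",
    ]


def parse_bed(text):
    """Parse BED text into (chrom, start, end, remaining_fields) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        records.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))
    return records


def fetch_regions(url, chrom, start, end):
    """Run bigBedToBed (assumed to be on PATH) and return parsed records."""
    result = subprocess.run(
        build_command(url, chrom, start, end),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return parse_bed(result.stdout)
```

Calling fetch_regions(BIGBED_URL, "chr21", 0, 100000000) would request the same range as the command line above; the call requires network access and the installed binary, so it is not run here.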

Methods

Files were downloaded from the respective databases and converted to bigBed format. The procedure is documented in our hg38 makeDoc file.

Credits

Thanks to Anna Benet-Pagès, Max Haeussler, Angie Hinrichs, Daniel Schmelter, and Jairo Navarro at the UCSC Genome Browser for planning, building, and testing these tracks. The underlying data comes from the ENCODE Blacklist, and some parts were copied manually from the HGNC and NCBI RefSeq tracks.

References

Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019 Jun 27;9(1):9354. PMID: 31249361; PMC: PMC6597582

Pan B, Ren L, Onuchic V, Guan M, Kusko R, Bruinsma S, Trigg L, Scherer A, Ning B, Zhang C et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biol. 2022 Jan 3;23(1):2. PMID: 34980216; PMC: PMC8722114