Schema for Tandem Dups - Paired identical sequences
  Database: mm39    Primary Table: tandemDups    Row Count: 858,983   Data last updated: 2020-07-30
Format description: Browser extensible data
On download server: MariaDB table dump directory
field        example               SQL type              description
bin          608                   smallint(5) unsigned  Indexing field to speed chromosome range queries.
chrom        chr1                  varchar(255)          Reference sequence chromosome or scaffold
chromStart   3069940               int(10) unsigned      Start position in chromosome
chromEnd     3087936               int(10) unsigned      End position in chromosome
name         chr1:3069941-3087936  varchar(255)          Name of item
score        32                    int(10) unsigned      Optional score, nominal range 0-1000
strand       +                     char(1)               + or -
thickStart   3069940               int(10) unsigned      Start of where display should be thick (start codon)
thickEnd     3087936               int(10) unsigned      End of where display should be thick (stop codon)
reserved     0                     int(10) unsigned      Used as itemRgb as of 2004-11-22
blockCount   2                     int(10) unsigned      Number of blocks
blockSizes   32,32                 longblob              Comma separated list of block sizes
chromStarts  0,17964               longblob              Start positions relative to chromStart
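
The bin value in the example can be reproduced with the standard UCSC binning scheme. The following is a minimal sketch, assuming this table uses the standard (not the extended) scheme with a smallest bin size of 128 kb; the function name bin_from_range is chosen here for illustration only.

```python
# Sketch of the standard UCSC binning scheme (assumed to apply to this table).
# Bins come in five levels; the smallest covers 128 kb (1 << 17 bases),
# and each higher level is 8 times larger (shift by 3 more bits).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17
BIN_NEXT_SHIFT = 3


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the 0-based range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range does not fit the standard scheme (512 Mb maximum)")


# chromStart and chromEnd from the example column above
print(bin_from_range(3069940, 3087936))  # 608
```

Restricting a range query to the bins that can overlap the range lets the database skip most rows before comparing chromStart and chromEnd.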

Sample Rows
 
bin  chrom  chromStart  chromEnd  name                  score  strand  thickStart  thickEnd  reserved  blockCount  blockSizes  chromStarts
608  chr1   3069940     3087936   chr1:3069941-3087936  32     +       3069940     3087936   0         2           32,32       0,17964
608  chr1   3070516     3088497   chr1:3070517-3088497  36     +       3070516     3088497   0         2           36,36       0,17945
608  chr1   3071680     3089674   chr1:3071681-3089674  38     +       3071680     3089674   0         2           38,38       0,17956
608  chr1   3082164     3084245   chr1:3082165-3084245  30     +       3082164     3084245   0         2           30,30       0,2051
608  chr1   3086965     3103121   chr1:3086966-3103121  36     +       3086965     3103121   0         2           36,36       0,16120
608  chr1   3087373     3103490   chr1:3087374-3103490  30     +       3087373     3103490   0         2           30,30       0,16087
608  chr1   3087590     3103707   chr1:3087591-3103707  30     +       3087590     3103707   0         2           30,30       0,16087
608  chr1   3089260     3105366   chr1:3089261-3105366  33     +       3089260     3105366   0         2           33,33       0,16073
608  chr1   3092614     3109860   chr1:3092615-3109860  34     +       3092614     3109860   0         2           34,34       0,17212
608  chr1   3092726     3109971   chr1:3092727-3109971  33     +       3092726     3109971   0         2           33,33       0,17212

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
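
As an illustration of the coordinate convention, the sketch below decodes one row into the positions of its two identical copies and the separation between them. It assumes the row is tab-separated in the field order of the schema above; the function name decode_row is hypothetical.

```python
def decode_row(line):
    """Decode one tab-separated tandemDups row (field order assumed from the schema)."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, chrom_end = f[1], int(f[2]), int(f[3])
    # List fields may carry a trailing comma in table dumps, so strip it first.
    sizes = [int(x) for x in f[11].rstrip(",").split(",")]
    starts = [int(x) for x in f[12].rstrip(",").split(",")]
    # Each block is a 0-based, half-open interval on the chromosome.
    copies = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    # Bases between the end of the first copy and the start of the second.
    gap = copies[1][0] - copies[0][1]
    # The 1-based position string adds 1 to the start and leaves the end unchanged.
    position = f"{chrom}:{chrom_start + 1}-{chrom_end}"
    return copies, gap, position


# First sample row from the table above
row = "\t".join(["608", "chr1", "3069940", "3087936", "chr1:3069941-3087936",
                 "32", "+", "3069940", "3087936", "0", "2", "32,32", "0,17964"])
copies, gap, position = decode_row(row)
print(copies)    # [(3069940, 3069972), (3087904, 3087936)]
print(gap)       # 17932
print(position)  # chr1:3069941-3087936
```

In the sample rows, the name field matches this 1-based position string and the score matches the size of each copy.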

Tandem Dups (tanDups) Track Description
 

Description

There are two tracks in this composite collection:

  • Gap Overlaps - Paired exactly identical sequence on each side of a gap
  • Tandem Dups - Paired exactly identical sequence survey over entire genome assembly
The Gap Overlaps track is thus a subset of the full Tandem Dups track.

This investigation began when an unusual number of paired sequences around gaps was noticed during the mouse strain sequencing project. This naturally raised two questions: how common is this feature, and in what types of assemblies can it be found?

The Gap Overlaps track indicates any pair of exactly identical sequences on each side of a gap, where a gap is any run of N's, including a single N. The end of the upstream sequence before the gap is duplicated exactly at the beginning of the downstream sequence following the gap in the assembly.
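
To make the definition concrete, the sketch below finds the longest suffix of an upstream sequence that is identical to a prefix of the downstream sequence. It is a direct string comparison for illustration only, not the alignment method used to build the track, and the function name and example sequences are hypothetical.

```python
def gap_overlap(upstream, downstream):
    """Length of the longest suffix of upstream that equals a prefix of downstream."""
    limit = min(len(upstream), len(downstream))
    # Try the longest candidate first so the first hit is the maximum.
    for size in range(limit, 0, -1):
        if upstream[-size:] == downstream[:size]:
            return size
    return 0


# Made-up sequences: "CGTA" ends the upstream side and starts the downstream side.
print(gap_overlap("TTGACCGTA", "CGTAGGCAT"))  # 4
```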

The Tandem Dups track is a similar survey over the entire genome assembly. The separation gap between these paired sequences can range from 1 base up to 20,000 bases.

Methods

The Gap Overlap duplicate sequences were found by extracting 1,000 bases before and after each gap and aligning them to each other with the blat command:

  blat -q=dna -minIdentity=95 -repMatch=10 upstreamContig.fa downstreamContig.fa
The PSL output was then filtered for perfect matches: no mismatches, and therefore matching sequence of equal size on both sides, where the alignment ends exactly at the end of the upstream sequence and begins exactly at the start of the downstream sequence.
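
A filter of this kind could look like the following sketch. It assumes the standard 21 column PSL layout, and that the upstream sequence is the PSL target and the downstream sequence is the query, which follows from blat taking the database file first and the query file second. The function name is hypothetical, and the actual filter used for the track may differ in detail.

```python
def is_gap_overlap(psl_line):
    """True if a PSL data line is a perfect end-to-start overlap (sketch only)."""
    f = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches, n_count = (int(x) for x in f[0:4])
    q_num_insert, t_num_insert = int(f[4]), int(f[6])
    strand = f[8]
    q_start, q_end = int(f[11]), int(f[12])
    t_size, t_start, t_end = int(f[14]), int(f[15]), int(f[16])
    return (
        strand == "+"
        and mismatches == 0
        and n_count == 0
        and q_num_insert == 0
        and t_num_insert == 0
        # every aligned base is a match, so both sides have equal size
        and matches + rep_matches == q_end - q_start == t_end - t_start
        # alignment ends exactly at the end of the upstream (target) sequence
        and t_end == t_size
        # alignment begins exactly at the start of the downstream (query) sequence
        and q_start == 0
    )


# Hypothetical PSL line: the last 40 upstream bases equal the first 40 downstream bases.
example = "\t".join(["40", "0", "0", "0", "0", "0", "0", "0", "+",
                     "downstream", "1000", "0", "40",
                     "upstream", "1000", "960", "1000",
                     "1", "40,", "0,", "960,"])
print(is_gap_overlap(example))  # True
```

Header lines in PSL output would need to be skipped before calling the function.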

The Tandem Dups paired sequences were found with the following procedure:

  • Generate 29 base kmers for the entire genome, allow only kmers with bases: A C T G, no N's allowed.
  • Pair up identical kmers with at least one base separation and up to 20,000 bases separation.
  • Collapse overlapping kmer pairs when they are the same size of sequence and the same spacing between the pairs. This procedure preserves the definition of duplicated identical pairs.
  • The resulting pairs can now be longer sequences with smaller separation than the constituent pairs.
  • Final result selects sizes of 30 bases or more for the size of the paired sequence, and at least one base remaining as a separation gap.
  • Collapsed pairs that close the gap are discarded. They appear to indicate simple repeat sequences when this happens. These discarded results would be interesting to examine, but they are not available at this time.
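
The steps above can be sketched in a few lines of Python. This is a simplified, in-memory illustration of one reading of the procedure, not the pipeline used to build the track: it treats kmer pairs with the same spacing and consecutive start positions as the overlapping pairs to collapse, and the function name and parameters are hypothetical.

```python
from collections import defaultdict


def tandem_dups(seq, k=29, min_size=30, max_sep=20000):
    """Return (start, size, gap) for each collapsed pair of identical sequences."""
    seq = seq.upper()
    # Index every kmer that contains only A, C, G, T (no N's).
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            positions[kmer].append(i)
    # Pair identical kmers separated by 1 to max_sep bases,
    # grouping the first-copy starts by the spacing between the two copies.
    by_offset = defaultdict(list)
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                sep = starts[b] - (starts[a] + k)
                if sep > max_sep:
                    break  # starts are ascending, later ones are even farther
                if sep >= 1:
                    by_offset[starts[b] - starts[a]].append(starts[a])
    # Collapse runs of consecutive starts that share the same spacing.
    results = []
    for offset, firsts in by_offset.items():
        firsts.sort()
        run_start = prev = firsts[0]
        for p in firsts[1:] + [None]:  # None flushes the final run
            if p is not None and p == prev + 1:
                prev = p
                continue
            size = prev - run_start + k
            gap = offset - size
            # Keep long enough pairs that still have a separation gap.
            if size >= min_size and gap >= 1:
                results.append((run_start, size, gap))
            if p is not None:
                run_start = prev = p
    return sorted(results)


# Toy example with small parameters: a 14 base unit, 3 N's, then the unit again.
unit = "ACGTTGCAAGGCTA"
print(tandem_dups(unit + "NNN" + unit, k=4, min_size=5))  # [(0, 14, 3)]
```

In the toy example, eleven 4 base kmer pairs with the same spacing collapse into a single pair of size 14 with a separation gap of 3 bases. A genome-scale run would need a far more memory-efficient kmer index than the dictionary used here.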

The procedure starts with 29 base kmer pairs and then selects results of at least 30 bases because this yields a reasonable number of pairs of size 30. If the procedure starts with 30 base kmers instead, it produces far too many 30 base kmer pairs for a reasonable count.

See Also

Interactive tables of all results:

Credits

Thank you to Joel Armstrong and Benedict Paten of the Computational Genomics Lab at the U.C. Santa Cruz Genomics Institute for identifying this characteristic of genome assemblies.

The data and presentation of this track were prepared by Hiram Clawson, U.C. Santa Cruz Genomics Institute.