Schema for CHM13 alignments - CHM13 (GCA_009914755.4) v1


| field | example | description |
| --- | --- | --- |
| chrom | chr1 | Reference sequence chromosome or scaffold |
| chromStart | 161547113 | Start position in chromosome |
| chromEnd | 196727625 | End position in chromosome |
| name | | Name or ID of item, ideally both human readable and unique |
| score | 1000 | Score (0-1000) |
| strand | | + or - for strand |
| tSize | 249250621 | size of target sequence |
| qName | CP068277.2 | name of query sequence |
| qSize | 248387328 | size of query sequence |
| qStart | 160921704 | start of alignment on query sequence |
| qEnd | 196104909 | end of alignment on query sequence |
| chainScore | 35124313 | score from chain |

| chrom | chromStart | chromEnd | name | score | strand | tSize | qName | qSize | qStart | qEnd | chainScore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chr1 | 161547113 | 196727625 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 160921704 | 196104909 | 35124313 |
| chr1 | 196812309 | 205922707 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 196104909 | 205218266 | 9096975 |
| chr1 | 206072707 | 206332221 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 42854502 | 43114183 | 259397 |
| chr1 | 206482221 | 207716082 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 205573083 | 206808078 | 1233338 |
| chr1 | 207734639 | 223725637 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 206808078 | 222790986 | 15948836 |
| chr1 | 223797866 | 228765531 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 222799709 | 227766292 | 4961373 |
| chr1 | 228765531 | 228780304 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 227811016 | 227825782 | 14737 |
| chr1 | 228780304 | 228782271 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 227924274 | 227926241 | 1967 |
| chr1 | 228782271 | 235192060 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 228024754 | 234447361 | 6404382 |
| chr1 | 235242227 | 236878326 | | 1000 | | 249250621 | CP068277.2 | 248387328 | 234470636 | 236116214 | 1631310 |
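
As a minimal sketch of how the twelve fields fit together, the following Python snippet parses one row into a named record and compares the aligned span on the target with the span on the query. The `ChainRow` class and `parse_row` helper are hypothetical names used only for illustration. The row is assumed to be tab-separated, with the name and strand fields left empty as they appear in the sample rows here.

```python
from dataclasses import dataclass


@dataclass
class ChainRow:
    """One row of the table, using the field names from the schema."""
    chrom: str
    chromStart: int
    chromEnd: int
    name: str
    score: int
    strand: str
    tSize: int
    qName: str
    qSize: int
    qStart: int
    qEnd: int
    chainScore: int


def parse_row(line: str) -> ChainRow:
    """Parse a tab-separated row (assumed layout) into a ChainRow."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return ChainRow(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        int(f[6]), f[7], int(f[8]), int(f[9]), int(f[10]), int(f[11]),
    )


# First sample row; name and strand are empty strings here (assumption).
row = parse_row(
    "chr1\t161547113\t196727625\t\t1000\t\t249250621\t"
    "CP068277.2\t248387328\t160921704\t196104909\t35124313"
)
target_span = row.chromEnd - row.chromStart  # bases covered on the target
query_span = row.qEnd - row.qStart           # bases covered on the query
print(target_span, query_span)
```

The two spans differ slightly because a chain can contain gaps of different sizes on each assembly.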

Description

These tracks show the one-to-one v1_nfLO alignments of the GRCh37/hg19 assembly to the T2T-CHM13 v2.0 assembly.

Display Conventions

The track displays boxes joined by either single or double lines. Boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in GRCh37/hg19. Double lines represent more complex gaps that involve substantial sequence in both assemblies.
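
The distinction between single and double lines can be sketched as a small classifier over the unaligned lengths on each assembly. This is an illustration only: the function name `gap_style` and the rule "double line when both sides have unaligned bases" are assumptions, and the browser's actual drawing logic may apply different thresholds.

```python
def gap_style(target_gap: int, query_gap: int) -> str:
    """Classify the gap between two aligned boxes.

    target_gap / query_gap are the numbers of unaligned bases between
    the boxes on each assembly (illustrative rule, not the browser's code).
    """
    if target_gap < 0 or query_gap < 0:
        raise ValueError("gap sizes must be non-negative")
    if target_gap == 0 and query_gap == 0:
        return "none"    # boxes abut on both assemblies
    if target_gap > 0 and query_gap > 0:
        return "double"  # unaligned sequence on both assemblies
    return "single"      # unaligned sequence on one assembly only


# Made-up gap sizes, for illustration only.
print(gap_style(84684, 0))     # one-sided gap
print(gap_style(72229, 8723))  # sequence present on both sides
```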

Methods

Alignment and Chain Creation

For the minimap2-based pipeline, the initial chain file was generated using nf-LO v1.5.1 with minimap2 v2.24 alignments. These chains were then split at all locations that contained unaligned segments greater than 1 kbp or gaps greater than 10 kbp. Split chain files were then converted to PAF format with extended CIGAR strings using chaintools (v0.1), and alignments between nonhomologous chromosomes were removed. The trim-paf operation of rustybam (v0.1.29) was next used to remove overlapping alignments in the query sequence, and then the target sequence, to create 1:1 alignments. PAF alignments were converted back to the chain format with paf2chain commit f68eeca, and finally, chaintools was used to generate the inverted chain file.

Full commands with parameters used were:


    nextflow run main.nf --source GRCh37.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
    python chaintools/src/split.py -c input.chain -o input-split.chain
    python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
    awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
    paf2chain -i out.paf > out.chain
    python chaintools/src/invert.py -c out.chain -o out_inverted.chain
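
The `awk '$1==$6'` step in the commands above keeps only PAF records whose query sequence name (column 1) equals the target sequence name (column 6), which is how alignments between nonhomologous chromosomes are dropped. A rough Python equivalent might look like the sketch below. The function name `same_name_records` and the sample records are made up for illustration, and unlike the awk one-liner this version skips lines with fewer than the 12 mandatory PAF columns.

```python
from typing import Iterable, Iterator


def same_name_records(paf_lines: Iterable[str]) -> Iterator[str]:
    """Yield PAF lines whose query name equals the target name."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record; skipped in this sketch
        if fields[0] == fields[5]:  # column 1 == column 6, as in the awk filter
            yield line


# Made-up records: query name, length, start, end, strand,
# target name, length, start, end, matches, block length, mapq.
records = [
    "chr1\t1000\t0\t500\t+\tchr1\t1200\t10\t510\t480\t500\t60",
    "chr1\t1000\t500\t900\t+\tchr2\t1500\t0\t400\t390\t400\t60",
]
kept = list(same_name_records(records))
print(len(kept))
```

This only works when homologous chromosomes carry identical names in both assemblies.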

The above process does not add chain IDs or scores. The UCSC utilities chainMergeSort and chainScore are used to update the chains:


    chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg19.2bit chm13v2.0-hg19.chain
    chainMergeSort out_inverted.chain | chainScore stdin hg19.2bit chm13v2.0.2bit hg19-chm13v2.0.chain

Rustybam trim-paf uses dynamic programming and the CIGAR string to find an optimal splitting point between overlapping alignments in the query sequence. It starts its trimming with the largest overlap and then recursively trims smaller overlaps.
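
To make the idea of an optimal splitting point concrete, here is a deliberately simplified sketch. It is not rustybam's implementation. It assumes each of two overlapping alignments has already been reduced to a per-position score over the shared query interval (for example +1 for a match and -1 for a mismatch or indel, derived from the CIGAR string), and it picks the cut that maximizes the combined score. The name `best_split` and the scoring scheme are assumptions.

```python
from itertools import accumulate
from typing import List, Tuple


def best_split(left_scores: List[int], right_scores: List[int]) -> Tuple[int, int]:
    """Return (k, total): the left alignment keeps overlap positions [0, k),
    the right alignment keeps [k, n), and total is the combined score."""
    n = len(left_scores)
    if n != len(right_scores):
        raise ValueError("score lists must cover the same overlap")
    left_prefix = [0, *accumulate(left_scores)]    # score of left[0:k]
    right_prefix = [0, *accumulate(right_scores)]  # score of right[0:k]
    right_total = right_prefix[-1]
    best_k, best_total = 0, right_total
    for k in range(1, n + 1):
        total = left_prefix[k] + (right_total - right_prefix[k])
        if total > best_total:
            best_k, best_total = k, total
    return best_k, best_total


# Left alignment is good at the start of the overlap, right one at the end.
print(best_split([1, 1, -1, -1], [-1, -1, 1, 1]))
```

Prefix sums make every candidate cut an O(1) lookup, so the whole overlap is scanned in linear time.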

Results were validated by using chaintools to confirm that there were no overlapping sequences with respect to both CHM13v2.0 and GRCh37 in the released chain file. In addition, trimmed alignments were visually inspected with SafFire to confirm their quality.
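
A check of this kind can be sketched in a few lines of Python. This is not the chaintools code: the function name `find_overlaps` is hypothetical, and coordinates are assumed to be half-open (start inclusive, end exclusive), so intervals that merely abut do not count as overlapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]


def find_overlaps(intervals: Iterable[Tuple[str, int, int]]) -> List[Tuple[str, Span, Span]]:
    """Return (sequence, earlier_span, later_span) for overlapping spans.

    At least one pair is reported for every sequence that contains an overlap.
    """
    by_seq: Dict[str, List[Span]] = defaultdict(list)
    for name, start, end in intervals:
        by_seq[name].append((start, end))

    overlaps = []
    for name, spans in by_seq.items():
        spans.sort()
        furthest = None  # span with the largest end seen so far
        for span in spans:
            if furthest is not None and span[0] < furthest[1]:
                overlaps.append((name, furthest, span))
            if furthest is None or span[1] > furthest[1]:
                furthest = span
    return overlaps


# Target coordinates from four of the sample rows on this page.
target_spans = [
    ("chr1", 161547113, 196727625),
    ("chr1", 196812309, 205922707),
    ("chr1", 228765531, 228780304),
    ("chr1", 228780304, 228782271),
]
print(find_overlaps(target_spans))
```

Running the same check once on target coordinates and once on query coordinates covers both assemblies.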

Chains were swapped to make GRCh37/hg19 the target.

Credits

The v1_nflo chains were generated by Nae-Chyun Chen <naechyun.chen@gmail.com> and Mitchell Vollger <mvollger@uw.edu>.

References

Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.