Schema for CHM13 alignments - CHM13 (GCA_009914755.4) v1_nfLO liftOver alignments
  Database: hg19    Primary Table: chm13LiftOver    Data last updated: 2022-03-30
Big Bed File Download: /gbdb/hg19/bbi/chm13LiftOver/hg19-chm13v2.ncbi-qnames.over.chain.bb
Item Count: 690
The data is stored in the binary BigBed format.

Format description: bigChain pairwise alignment
field        example     description
chrom        chr1        Reference sequence chromosome or scaffold
chromStart   161547113   Start position in chromosome
chromEnd     196727625   End position in chromosome
name         36          Name or ID of item, ideally both human readable and unique
score        1000        Score (0-1000)
strand       +           + or - for strand
tSize        249250621   size of target sequence
qName        CP068277.2  name of query sequence
qSize        248387328   size of query sequence
qStart       160921704   start of alignment on query sequence
qEnd         196104909   end of alignment on query sequence
chainScore   35124313    score from chain
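
As a reading aid, the twelve fields above can be loaded into a small typed record. This is a minimal sketch, not part of the UCSC tooling: it assumes the rows are available as tab-separated text (for example after converting the BigBed file to BED text), and the names BigChainRow and parse_bigchain_row are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigChainRow:
    """One bigChain row; attribute order follows the field table (hypothetical helper type)."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    t_size: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    chain_score: int


def parse_bigchain_row(line: str) -> BigChainRow:
    """Parse one tab-separated row (assumed layout: the 12 fields, in table order)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    (chrom, start, end, name, score, strand,
     t_size, q_name, q_size, q_start, q_end, chain_score) = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return BigChainRow(chrom, int(start), int(end), name, int(score), strand,
                       int(t_size), q_name, int(q_size), int(q_start),
                       int(q_end), int(chain_score))


# The example values from the field table, joined with tabs.
example = "\t".join([
    "chr1", "161547113", "196727625", "36", "1000", "+",
    "249250621", "CP068277.2", "248387328", "160921704", "196104909", "35124313",
])
row = parse_bigchain_row(example)
print(row.q_name, row.chrom_end - row.chrom_start)
```

The name field is kept as a string because the schema describes it as a name or ID, even though the examples here are numeric.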

Sample Rows
 
chrom  chromStart  chromEnd   name  score  strand  tSize      qName       qSize      qStart     qEnd       chainScore
chr1   161547113   196727625  36    1000   +       249250621  CP068277.2  248387328  160921704  196104909  35124313
chr1   196812309   205922707  37    1000   +       249250621  CP068277.2  248387328  196104909  205218266  9096975
chr1   206072707   206332221  65    1000   -       249250621  CP068277.2  248387328  42854502   43114183   259397
chr1   206482221   207716082  38    1000   +       249250621  CP068277.2  248387328  205573083  206808078  1233338
chr1   207734639   223725637  39    1000   +       249250621  CP068277.2  248387328  206808078  222790986  15948836
chr1   223797866   228765531  40    1000   +       249250621  CP068277.2  248387328  222799709  227766292  4961373
chr1   228765531   228780304  41    1000   +       249250621  CP068277.2  248387328  227811016  227825782  14737
chr1   228780304   228782271  42    1000   +       249250621  CP068277.2  248387328  227924274  227926241  1967
chr1   228782271   235192060  43    1000   +       249250621  CP068277.2  248387328  228024754  234447361  6404382
chr1   235242227   236878326  44    1000   +       249250621  CP068277.2  248387328  234470636  236116214  1631310
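
The one minus-strand row above (name 65) has query coordinates far from its neighbours. The sketch below assumes the usual chain-format convention, which this page does not state explicitly: on the minus strand, qStart and qEnd are measured on the reverse-complemented query. The function name is hypothetical.

```python
def query_forward_interval(q_size: int, q_start: int, q_end: int, strand: str):
    """Return the query interval in forward-strand coordinates.

    Assumption (chain-format convention, not stated on this page):
    for strand '-', q_start/q_end count from the end of the query sequence.
    """
    if strand == "+":
        return q_start, q_end
    if strand == "-":
        return q_size - q_end, q_size - q_start
    raise ValueError(f"unexpected strand: {strand!r}")


# Sample row with name 65 (strand '-').
fwd_start, fwd_end = query_forward_interval(248387328, 42854502, 43114183, "-")
print(fwd_start, fwd_end)
```

Under that assumption the row maps to 205273145-205532826 on CP068277.2, which falls between the query end of the preceding row (205218266) and the query start of the following row (205573083).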

CHM13 alignments (chm13LiftOver) Track Description
 

Description

This track shows the one-to-one v1_nfLO alignments of the GRCh37/hg19 assembly to the T2T-CHM13 v2.0 assembly.

Display Conventions

The track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in the GRCh37/hg19 assembly. Double lines represent more complex gaps that involve substantial sequence in both assemblies.
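
The distinction between single and double lines can be related to the gap sizes stored between aligned blocks in a chain file, where each block is followed by a gap on the target (dt) and a gap on the query (dq). The sketch below is only an illustration: the page does not give the thresholds behind "largely" and "substantial", so a cutoff of zero is assumed, and classify_gap is a hypothetical name, not a browser function.

```python
def classify_gap(dt: int, dq: int) -> str:
    """Classify the gap between two consecutive aligned blocks.

    dt: unaligned bases on the target (GRCh37/hg19 in this track)
    dq: unaligned bases on the query (CHM13 v2.0 in this track)
    Assumption: any nonzero value counts; the browser may use other cutoffs.
    """
    if dt < 0 or dq < 0:
        raise ValueError("gap sizes must be non-negative")
    if dt == 0:
        # No space between the boxes in reference coordinates, so no line to draw.
        return "adjacent"
    if dq == 0:
        # Sequence present only in the reference: deletion in CHM13 / insertion in hg19.
        return "single"
    # Unaligned sequence on both assemblies.
    return "double"


for dt, dq in [(0, 0), (0, 250), (1200, 0), (800, 950)]:
    print(dt, dq, classify_gap(dt, dq))
```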

Methods

Alignment and Chain Creation

For the minimap2-based pipeline, the initial chain file was generated using nf-LO v1.5.1 with minimap2 v2.24 alignments. These chains were then split at all locations that contained unaligned segments greater than 1 kbp or gaps greater than 10 kbp. Split chain files were then converted to PAF format with extended CIGAR strings using chaintools (v0.1), and alignments between nonhomologous chromosomes were removed. The trim-paf operation of rustybam (v0.1.29) was next used to remove overlapping alignments in the query sequence, and then the target sequence, to create 1:1 alignments. PAF alignments were converted back to the chain format with paf2chain commit f68eeca, and finally, chaintools was used to generate the inverted chain file.

Full commands with parameters used were:


    nextflow run main.nf --source GRCh37.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
    python chaintools/src/split.py -c input.chain -o input-split.chain
    python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
    awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
    paf2chain -i out.paf > out.chain
    python chaintools/src/invert.py -c out.chain -o out_inverted.chain
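
The awk filter in the fourth command keeps only PAF records whose first column (query sequence name) equals the sixth column (target sequence name), which is how alignments between nonhomologous chromosomes are dropped. A rough Python equivalent is sketched below for readers less familiar with awk; the function name and the two records are invented for illustration, and the filter only makes sense when both assemblies use the same sequence names in the PAF file.

```python
from typing import Iterable, Iterator


def same_name_records(paf_lines: Iterable[str]) -> Iterator[str]:
    """Yield PAF lines whose query name (column 1) equals the target name (column 6)."""
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        # PAF has at least 12 tab-separated columns; skip anything too short to compare.
        if len(cols) >= 6 and cols[0] == cols[5]:
            yield line


# Two made-up records: the first aligns chr1 to chr1, the second chr1 to chr2.
records = [
    "chr1\t1000\t0\t500\t+\tchr1\t2000\t100\t600\t500\t500\t60\n",
    "chr1\t1000\t0\t500\t+\tchr2\t2000\t100\t600\t500\t500\t60\n",
]
kept = list(same_name_records(records))
print(len(kept))
```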

The above process does not add chain IDs or scores. The UCSC utilities chainMergeSort and chainScore were used to add them:


    chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg19.2bit chm13v2.0-hg19.chain
    chainMergeSort out_inverted.chain | chainScore stdin hg19.2bit chm13v2.0.2bit hg19-chm13v2.0.chain

Rustybam trim-paf uses dynamic programming and the CIGAR string to find an optimal splitting point between overlapping alignments in the query sequence. It starts its trimming with the largest overlap and then recursively trims smaller overlaps.

Results were validated by using chaintools to confirm that there were no overlapping sequences with respect to both CHM13v2.0 and GRCh37 in the released chain file. In addition, trimmed alignments were visually inspected with SafFire to confirm their quality.
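
The kind of check described above can be sketched as follows. This is not the chaintools implementation: it works on whole intervals given as (sequence name, start, end) with half-open coordinates, whereas a precise check would look at the individual aligned blocks inside each chain. The function name is hypothetical.

```python
from collections import defaultdict


def find_overlaps(intervals):
    """Return pairs of overlapping intervals, grouped by sequence name.

    intervals: iterable of (seq_name, start, end), half-open [start, end).
    """
    by_seq = defaultdict(list)
    for name, start, end in intervals:
        by_seq[name].append((start, end))

    overlaps = []
    for name, spans in by_seq.items():
        spans.sort()
        prev = spans[0]
        for cur in spans[1:]:
            if cur[0] < prev[1]:
                overlaps.append((name, prev, cur))
            # Keep the interval that reaches furthest right, so nested overlaps are caught.
            if cur[1] > prev[1]:
                prev = cur
    return overlaps


# Target intervals of the first four sample rows: no overlaps expected.
sample = [
    ("chr1", 161547113, 196727625),
    ("chr1", 196812309, 205922707),
    ("chr1", 206072707, 206332221),
    ("chr1", 206482221, 207716082),
]
print(find_overlaps(sample))
```

The same function can be run a second time on the query-side intervals (after converting minus-strand coordinates to the forward strand) to cover both assemblies.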

Chains were swapped to make GRCh37/hg19 the target.
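
To illustrate what swapping does to a chain header, here is a minimal sketch. It is not the code used to build this track and it handles the header only; a full swap also has to exchange the target and query gap columns of every block and, for minus-strand chains, reverse the block order. It assumes the chain-format convention that the target strand is always "+" and that minus-strand query coordinates are measured on the reverse complement. ChainHeader and swap_header are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainHeader:
    """Fields of a chain header line (hypothetical helper type)."""
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def swap_header(h: ChainHeader) -> ChainHeader:
    """Exchange target and query, keeping the new target on the '+' strand."""
    if h.t_strand != "+":
        raise ValueError("sketch assumes the target strand is '+'")
    if h.q_strand == "+":
        return ChainHeader(h.score,
                           h.q_name, h.q_size, "+", h.q_start, h.q_end,
                           h.t_name, h.t_size, "+", h.t_start, h.t_end,
                           h.chain_id)
    # Minus strand: reverse-complement both sides so the new target is forward.
    return ChainHeader(h.score,
                       h.q_name, h.q_size, "+", h.q_size - h.q_end, h.q_size - h.q_start,
                       h.t_name, h.t_size, "-", h.t_size - h.t_end, h.t_size - h.t_start,
                       h.chain_id)


# Made-up header with small numbers so the arithmetic is easy to follow.
original = ChainHeader(500, "chrA", 1000, "+", 100, 200, "chrB", 2000, "-", 300, 400, "1")
swapped = swap_header(original)
print(swapped)
```

Swapping twice returns the original header, which is a quick sanity check on the coordinate arithmetic.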

Credits

The v1_nfLO chains were generated by Nae-Chyun Chen <naechyun.chen@gmail.com> and Mitchell Vollger <mvollger@uw.edu>.

References

Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.