An Assembly Data Hub is a set of Internet-accessible data files that define the reference sequence to be used for a browser instance, as well as all the data files that define the annotation for that sequence. Assembly Data Hubs allow researchers to use the UCSC Genome Browser to view their own sequences with associated annotation, without the requirement that UCSC support a browser on that sequence.
Note: if you are working with a genome that has already been submitted to the NCBI Assembly system, it may already be available in the UCSC Genome Browser. Please check the GenArk Assembly Hub collection to see if your genome of interest is already available. If it is not listed there, you can use the UCSC Assembly Request page to request that the genome assembly be added.
To display a novel genome sequence in the UCSC Genome Browser, a web server hosted by the institution (or a free service such as Cyverse) can be used. For environments operating behind a firewall, hub files can also be loaded locally through GBiB to provide access to the UCSC Genome Browser. Hosting hub files over HTTP is strongly recommended, as it is significantly more efficient than FTP. A hierarchical directory structure must then be established to organize the files associated with the genome sequence. For example:
myHub/ - directory to organize your files on this hub
    hub.txt - primary reference text file to define the hub, refers to:
    genomes.txt - definitions for each genome assembly on this hub
        newOrg1/ - directory of files for this specific genome assembly
            newOrg1.2bit - '2bit' file constructed from your fasta sequence
            description.html - information about this assembly for users
            trackDb.txt - definitions for tracks on this genome assembly
            groups.txt - definitions for track groups on this assembly
            bigWig and bigBed files - data for tracks on this assembly
            external track hub data tracks
The hub can be referenced by a URL such as: http://yourLab.yourInstitution.edu/myHub/hub.txt
The initial file, hub.txt, is the primary URL reference for the assembly hub.
Format of the file:
hub hubName
shortLabel genome
longLabel Comment describing this hub contents
genomesFile genomes.txt
email contactEmail@institution.edu
descriptionUrl aboutHub.html
shortLabel is the name that will appear in the genome pull-down menu at the UCSC gateway page.
genomesFile is a reference to the next definition file in this chain that will describe the assemblies and tracks available at this hub. Typically, genomes.txt is at the same directory level as this hub.txt; however, it can also be a relative path reference to a different directory level.
email provides users with a contact point for questions related to this assembly hub.
descriptionUrl specifies a relative path or URL link to a webpage describing the hub.
You can view a working example at hub.txt
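The settings described above can be checked mechanically before a hub is published. The following Python sketch is not part of the UCSC tools: parse_stanza, missing_hub_keys, and REQUIRED_HUB_KEYS are hypothetical names, and treating these five settings as required is an assumption based on the format example above.

```python
# Assumed list of settings every hub.txt should carry (illustrative only).
REQUIRED_HUB_KEYS = ("hub", "shortLabel", "longLabel", "genomesFile", "email")


def parse_stanza(text):
    """Split 'setting value' lines into a dict; the first word is the setting."""
    settings = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def missing_hub_keys(text):
    """Return the assumed-required settings that are absent or empty."""
    settings = parse_stanza(text)
    return [key for key in REQUIRED_HUB_KEYS if not settings.get(key)]


HUB_TXT = """\
hub hubName
shortLabel genome
longLabel Comment describing this hub contents
genomesFile genomes.txt
email contactEmail@institution.edu
descriptionUrl aboutHub.html
"""

print(missing_hub_keys(HUB_TXT))
```

An empty list means every setting in the assumed list is present.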
The genomes.txt file provides references to the genome assemblies and tracks available in the assembly hub.
genome ricCom1
trackDb ricCom1/trackDb.txt
groups ricCom1/groups.txt
description July 2011 Castor bean
twoBitPath ricCom1/ricCom1.2bit
organism Ricinus communis
defaultPos E09R7372:1000000-2000000
orderKey 4800
scientificName Ricinus communis
htmlPath ricCom1/description.html
transBlat yourLab.yourInstitution.edu 17777
blat yourLab.yourInstitution.edu 17779
isPcr yourLab.yourInstitution.edu 17779
Multiple assembly definitions can be included in a single file, separated by blank lines. The file references are relative paths. In this example, the subdirectory ricCom1 contains the files for this specific assembly.
Note: it is strongly recommended that each genome stanza include defaultPos, scientificName, organism, and description, so that the hub loads with meaningful defaults and can be found more easily from the Gateway page.
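As a rough illustration of the recommendation above, the Python sketch below splits a genomes.txt file into blank-line-separated stanzas and reports which recommended settings each stanza lacks. The function names are hypothetical, and the second stanza (newOrg1) is an invented, deliberately incomplete example.

```python
import re

# Settings the note above recommends for every genome stanza.
RECOMMENDED = ("defaultPos", "scientificName", "organism", "description")


def split_stanzas(text):
    """Return one {setting: value} dict per blank-line-separated stanza."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        settings = {}
        for line in block.splitlines():
            parts = line.strip().split(None, 1)
            if parts and not parts[0].startswith("#"):
                settings[parts[0]] = parts[1] if len(parts) > 1 else ""
        if settings:
            stanzas.append(settings)
    return stanzas


def missing_recommended(text):
    """Map each genome name to the recommended settings its stanza lacks."""
    return {
        stanza.get("genome", "?"): [k for k in RECOMMENDED if not stanza.get(k)]
        for stanza in split_stanzas(text)
    }


GENOMES_TXT = """\
genome ricCom1
trackDb ricCom1/trackDb.txt
description July 2011 Castor bean
twoBitPath ricCom1/ricCom1.2bit
organism Ricinus communis
defaultPos E09R7372:1000000-2000000
scientificName Ricinus communis

genome newOrg1
trackDb newOrg1/trackDb.txt
twoBitPath newOrg1/newOrg1.2bit
organism New organism
"""

print(missing_recommended(GENOMES_TXT))
```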
The .2bit file is constructed from the FASTA sequence for the assembly using the faToTwoBit kent program (available from the downloads page).
Example:
faToTwoBit ricCom1.fa ricCom1.2bit
Use twoBitInfo to verify sequences and create a chrom.sizes file, which is not used in the hub itself but is helpful for constructing big* files:
twoBitInfo ricCom1.2bit stdout | sort -k2rn > ricCom1.chrom.sizes
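The .2bit file can also be read from a URL. The -udcDir option names a local directory for cached data (here, the current directory), and the URL is given as the input file:
twoBitInfo -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes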
To extract sequences from a .2bit file:
twoBitToFa -seq=chrCp -udcDir=. https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubPlants/cshl2013/ricCom1/ricCom1.2bit stdout > ricCom1.chrCp.fa
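The chrom.sizes file written by twoBitInfo is a plain two-column listing of sequence name and length, so it is easy to inspect from a script. The Python sketch below is illustrative only: read_chrom_sizes is a hypothetical helper, and the scaffold names and sizes in the sample are invented.

```python
def read_chrom_sizes(text):
    """Parse 'name size' lines into (name, size) pairs, largest sequence first."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        pairs.append((fields[0], int(fields[1])))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented sample content; real files come from twoBitInfo.
SAMPLE = "chrCp\t1500\nscaffold1\t42000\nscaffold2\t9000\n"

sizes = read_chrom_sizes(SAMPLE)
print(sizes[0])                           # largest sequence
print(sum(size for _, size in sizes))     # total assembly length
```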
The chromAlias setting enables the Genome Browser to automatically convert chromosome
names in submitted custom track data from alternate naming schemes to the names used in the
assembly. The chromAlias setting uses a chromAlias.txt file. This
functionality applies to both custom track data and assembly hub data.
chromAlias.txt Format
The first line of the chromAlias.txt file begins with a pound symbol (#)
followed by a blank space. Each subsequent word on this line, separated by tab characters,
specifies the source authority for the sequence names in that column. The first column contains the
sequence names used in the Genome Browser assembly, while the subsequent columns provide alternate
naming schemes.
All lines following the header line consist of columns of sequence names separated by a tab character. If no equivalent name exists in a particular naming scheme, the column remains empty, resulting in two adjacent tab characters.
Example:
# ucsc	assembly	genbank	ncbi	refseq	ensembl
chr1	1	CM000663.2	1	NC_000001.11	1
chr10	10	CM000672.2	10	NC_000010.11	10
chrM	MT	J01415.2	MT	NC_012920.1	MT
chrX	X	CM000685.2	X	NC_000023.11	X
In this example, the columns represent:
chrN names
chr2acc file in the assembly_structure/ hierarchy
Assembly Hub Usage
To use the chromAlias.txt file in an assembly hub, add the following line to the
genome stanza of the hub.txt file:
chromAlias thisGenome.chromAlias.txt
This is a relative path reference from the hub.txt file.
Example genome stanza:
genome GCF_000001405.39
taxId 9606
groups groups.txt
description human
twoBitPath GCF_000001405.39.2bit
twoBitBptUrl GCF_000001405.39.2bit.bpt
chromSizes GCF_000001405.39.chrom.sizes.txt
chromAlias GCF_000001405.39.chromAlias.txt
organism human
defaultPos chr1:82985474-82995474
scientificName Homo sapiens
htmlPath html/GCF_000001405.39_GRCh38.p13.description.html
Best Performance
For improved performance, the chromAlias.txt file can be converted to a bigBed format.
This enables efficient searching for sequence names without requiring the entire text file to be
read, which is particularly important for assemblies with large numbers of sequences.
The Perl script
aliasTextToBed.pl converts the chromAlias.txt file into the
corresponding bed and bigBed files:
aliasTextToBed.pl -chromSizes=asmId.chrom.sizes -aliasText=asmId.chromAlias.txt \
    -aliasBed=asmId.chromAlias.bed -aliasAs=asmId.chromAlias.as -aliasBigBed=asmId.chromAlias.bb
Inputs:
chrom.sizes file
chromAlias.txt file
Outputs:
chromAlias.bed
chromAlias.as
chromAlias.bb
Replace the chromAlias setting with the chromAliasBb setting, and specify
the .bb file in the genome stanza of the hub definition:
chromAliasBb GCF_000001405.39.chromAlias.bb
This replaces the chromAlias.txt specification.
Default Naming Scheme
A default naming scheme may be set in the hub.txt file using the
chromAuthority setting:
chromAuthority ucsc
In this example, the value ucsc corresponds to the column header from the
chromAlias.txt file. This setting ensures that names in the specified column are
displayed by default in the Genome Browser.
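To make the chromAlias.txt layout concrete, the Python sketch below reads the tab-separated format described in this section and builds a lookup from any known name to the name in one chosen column, loosely mirroring what chromAuthority selects. It only illustrates the file format and is not the Genome Browser's own implementation; load_chrom_alias is a hypothetical name.

```python
def load_chrom_alias(text, authority="ucsc"):
    """Return {any_known_name: name_in_authority_column} from chromAlias.txt content."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0]
    if not header.startswith("# "):
        raise ValueError("header line must start with '# '")
    columns = header[2:].split("\t")
    target = columns.index(authority)  # raises ValueError for an unknown authority
    lookup = {}
    for line in lines[1:]:
        names = line.split("\t")
        if len(names) != len(columns):
            raise ValueError(f"expected {len(columns)} columns: {line!r}")
        if not names[target]:
            continue  # this sequence has no name in the chosen scheme
        for name in names:
            if name:  # an empty field means no equivalent name in that scheme
                lookup[name] = names[target]
    return lookup


ALIAS_TXT = "\n".join([
    "# " + "\t".join(["ucsc", "assembly", "genbank", "ncbi", "refseq", "ensembl"]),
    "\t".join(["chr1", "1", "CM000663.2", "1", "NC_000001.11", "1"]),
    "\t".join(["chrM", "MT", "J01415.2", "MT", "NC_012920.1", "MT"]),
    "\t".join(["chrX", "X", "CM000685.2", "X", "NC_000023.11", "X"]),
])

print(load_chrom_alias(ALIAS_TXT)["NC_000001.11"])
print(load_chrom_alias(ALIAS_TXT, authority="refseq")["chrX"])
```

Because the column count is checked on every row, a missing tab (rather than an intentionally empty field) is reported instead of silently shifting names into the wrong scheme.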
The groups.txt file defines the grouping of track controls under the Genome Browser graphic display.
Example:
name map
label Mapping
priority 2
defaultIsClosed 0
Refer to the Adding Groups to a Track hub section of the Track Hubs help page for more details.
Traditionally, an assembly hub required multiple configuration files (hub.txt,
genomes.txt, trackDb.txt, and optionally groups.txt), along
with a .2bit file for the sequence. The useOneFile on option simplifies
this by consolidating everything into a single configuration file. Note: The single-file
format supports one genome assembly per file. For multiple assemblies, use the traditional
multi-file setup.
Example configuration:
hub mySingleFileHub
shortLabel My Single-File Hub
longLabel An example of a single-file UCSC track hub
useOneFile on
email myEmail@example.com

genome hg19

track exampleBigWig
shortLabel BigWig Coverage
longLabel Coverage data over hg19
type bigWig
visibility full
bigDataUrl http://myServer.com/data/example.bigWig

track exampleVCF
shortLabel VCF Variants
longLabel Variant calls over hg19 region
type vcfTabix
visibility pack
bigDataUrl http://myServer.com/data/example.vcf.gz
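In this file, the hub stanza takes the place of hub.txt, the genome stanza takes the place of genomes.txt, and the track stanzas take the place of trackDb.txt.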
If your hub requires a reference genome sequence, you can still provide a .2bit file
with twoBitPath. Grouping (previously in groups.txt) can also be integrated here if needed.
Once hosted on a server, the single configuration file (and associated data files such as 
.bigWig, .vcf.gz, .2bit) can be loaded into the UCSC Genome
Browser via the My Hubs page.
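When many tracks are involved, the single configuration file can be generated rather than written by hand. The Python sketch below assembles the stanzas with blank lines between them; stanza and one_file_hub are hypothetical helper names, and the values are the placeholder ones from the example configuration above.

```python
def stanza(settings):
    """Render an ordered list of (setting, value) pairs as one stanza."""
    return "\n".join(f"{key} {value}" for key, value in settings)


def one_file_hub(hub, genome, tracks):
    """Join the hub, genome, and track stanzas, separated by blank lines."""
    return "\n\n".join(stanza(s) for s in [hub, genome, *tracks]) + "\n"


hub = [
    ("hub", "mySingleFileHub"),
    ("shortLabel", "My Single-File Hub"),
    ("longLabel", "An example of a single-file UCSC track hub"),
    ("useOneFile", "on"),
    ("email", "myEmail@example.com"),
]
genome = [("genome", "hg19")]
tracks = [
    [
        ("track", "exampleBigWig"),
        ("shortLabel", "BigWig Coverage"),
        ("longLabel", "Coverage data over hg19"),
        ("type", "bigWig"),
        ("visibility", "full"),
        ("bigDataUrl", "http://myServer.com/data/example.bigWig"),
    ],
    [
        ("track", "exampleVCF"),
        ("shortLabel", "VCF Variants"),
        ("longLabel", "Variant calls over hg19 region"),
        ("type", "vcfTabix"),
        ("visibility", "pack"),
        ("bigDataUrl", "http://myServer.com/data/example.vcf.gz"),
    ],
]

text = one_file_hub(hub, genome, tracks)
print(text)
```

Lists of pairs are used instead of dictionaries only to make the setting order explicit in the output.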
Tracks are defined in the trackDb.txt file, where each stanza specifies how tracks are displayed (shortLabel, longLabel, color, visibility), along with other information such as the group the track belongs to (referencing groups.txt) and whether additional HTML should be displayed when a user clicks into the track or a track item:
track gap_
longLabel Gap
shortLabel Gap
priority 11
visibility dense
color 0,0,0
bigDataUrl bbi/ricCom1.gap.bb
type bigBed 4
group map
html ../trackDescriptions/gap
For more information about the syntax of the trackDb.txt file, refer to the Track Database Definition page.
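Because each track's group setting refers to a name defined in groups.txt, a mismatch between the two files is easy to introduce. The Python sketch below lists group values used in trackDb.txt that have no matching name in groups.txt. The helper names are hypothetical, and the second track (genes_, in a group called genes) is invented to show a failing case.

```python
def stanza_values(text, key):
    """Collect the value of every 'key value' line in the text."""
    values = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] == key:
            values.append(parts[1].strip())
    return values


def undefined_groups(track_db_text, groups_text):
    """Return group names used by tracks but never defined in groups.txt."""
    defined = set(stanza_values(groups_text, "name"))
    used = set(stanza_values(track_db_text, "group"))
    return sorted(used - defined)


GROUPS_TXT = "name map\nlabel Mapping\npriority 2\ndefaultIsClosed 0\n"

TRACK_DB_TXT = """\
track gap_
shortLabel Gap
longLabel Gap
type bigBed 4
bigDataUrl bbi/ricCom1.gap.bb
group map

track genes_
shortLabel Genes
longLabel Invented gene track
type bigBed 12
bigDataUrl bbi/ricCom1.genes.bb
group genes
"""

print(undefined_groups(TRACK_DB_TXT, GROUPS_TXT))
```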
Processing genomes to construct tracks often requires a cluster or supercomputer. Small genomes can be processed on single computers with multiple cores. The process for each track is unique. For details, refer to the Browser Track Construction page, which discusses constructing tracks for assembly hubs.
Assembly hubs can include a Cytoband track, which allows quicker navigation of chromosomes and displays banding pattern information, if known.
A simple version of the track can be built using the existing chrom.sizes file for your assembly.
Banding options include gneg, gpos25, gpos50, gpos75, gpos100, acen, gvar, and stalk.
Example:
cat araTha1.chrom.sizes | sort -k1,1 -k2,2n | awk '{print $1,0,$2,$1,"gneg"}' > cytoBandIdeo.bed
The resulting BED file can be converted into a bigBed file with an accompanying .as
definition file (see example), which informs the browser that this is not a standard BED:
bedToBigBed -type=bed4+1 cytoBandIdeo.bed -as=cytoBand.as araTha1.chrom.sizes cytoBandIdeo.bigBed
In trackDb.txt, if the track is named cytoBandIdeo (e.g., track cytoBandIdeo), it will automatically load into the assembly hub.
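The sort and awk step shown earlier in this section can also be written in Python, which makes it easier to validate the band type before writing the file. This is a rough equivalent rather than a replacement: cytoband_lines is a hypothetical name, the sample sizes are invented, the output is tab-separated, and plain string ordering is used for the sequence names.

```python
# Band types listed in the text above.
BAND_TYPES = ("gneg", "gpos25", "gpos50", "gpos75", "gpos100", "acen", "gvar", "stalk")


def cytoband_lines(chrom_sizes_text, stain="gneg"):
    """Build one whole-chromosome band per sequence: chrom, 0, size, name, stain."""
    if stain not in BAND_TYPES:
        raise ValueError(f"unknown band type: {stain}")
    rows = []
    for line in chrom_sizes_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((fields[0], int(fields[1])))
    rows.sort()  # by name, then size
    return [f"{name}\t0\t{size}\t{name}\t{stain}" for name, size in rows]


# Invented sizes for illustration.
SIZES = "chr2\t800\nchr1\t1000\n"

for bed_line in cytoband_lines(SIZES):
    print(bed_line)
```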
Direct links to the genome(s) within the assembly hub can then be constructed.
Resources for automatically building assembly hubs include G-OnRamp and MakeHub.
G-OnRamp is a Galaxy workflow that turns a genome assembly and RNA-Seq data into a Genome Browser with multiple evidence tracks. Since G-OnRamp is based on the Galaxy platform, becoming familiar with Galaxy concepts and functionalities is recommended. See their instruction page for an overview.
MakeHub is a command-line tool for fully automatic generation of track data hubs for visualizing genomes with the UCSC Genome Browser. More information is available on their GitHub page.
There is a collection of example NCBI assembly hubs that can be used directly or copied as templates. A large collection of script-generated assembly hubs can be browsed on the development server, with links defaulting to the genome-test site. To load these hubs on the public UCSC site, copy the hub.txt link and replace the test server domain with the public domain.
The following table provides links to launch various assembly hubs grouped by species subsets. By scrolling down each page, you can access rows for individual assemblies (or groups of assemblies, e.g., bacteria). Clicking the "common name" hyperlink (e.g., "African bush elephant" on the Vertebrate Mammalian page) loads the selected hub.
These assemblies use NCBI accession naming patterns. Prototype gene tracks from NCBI gene predictions are available for a few assemblies. No BLAT servers are provided. Users can copy the skeleton structure of a hub to run their own BLAT server locally. Brief instructions are available on each assembly gateway page under "Download files for this assembly hub."
Here are some quick steps to load an example hub from this collection, along with an explanation of how to view the files behind the hub.
https://genome-test.gi.ucsc.edu/...
to
https://genome.ucsc.edu/...
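The domain swap described above can be scripted when many links need converting. The Python sketch below replaces only the host part of a URL and leaves everything else untouched; to_public_site is a hypothetical name, and the sample link is a placeholder built from example values used elsewhere on this page.

```python
from urllib.parse import urlsplit, urlunsplit

TEST_HOST = "genome-test.gi.ucsc.edu"
PUBLIC_HOST = "genome.ucsc.edu"


def to_public_site(url):
    """Point a genome-test link at the public site; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.netloc != TEST_HOST:
        return url
    return urlunsplit(parts._replace(netloc=PUBLIC_HOST))


# Placeholder link for illustration.
link = ("https://genome-test.gi.ucsc.edu/cgi-bin/hgBlat"
        "?hubUrl=http://yourServer.yourInstitution.edu/myHub/hub.txt")
print(to_public_site(link))
```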
To better understand how the hub works, you can review the associated files:
The genomes.txt file defines each assembly in the hub. It points to the genome's .2bit file (twoBitPath) and specifies the trackDb file that contains the track definitions. (In the case of this large hub with 204 assemblies, the main genomes.txt file is one directory up, and this stanza is included there.)
The trackDb.txt file defines the tracks displayed in the hub. It contains bigDataUrl lines that tell the Browser where to retrieve data for each track, along with optional settings.
BLAT servers (gfServer) can be configured as either dedicated or dynamic.
When running a local BLAT server, assembly hubs can be configured to support BLAT searches by adding entries to the genomes.txt file.
Installation and configuration details for gfServer are provided in the Running your own gfServer page.
In the genomes.txt stanza for the target assembly, include the following lines (note
the capital B in transBlat):
transBlat yourServer.yourInstitution.edu 17777
blat yourServer.yourInstitution.edu 17779
isPcr yourServer.yourInstitution.edu 17779
With this configuration, BLAT and PCR searches become available for the assembly. For example:
http://genome.ucsc.edu/cgi-bin/hgBlat?hubUrl=http://yourServer.yourInstitution.edu/myHub/hub.txt
This URL opens the BLAT interface, where the assembly will appear in the Genome drop-down menu.
The isPcr line enables the use of a different gfServer instance for PCR queries if
desired.
Firewall note: Some institutions block repeated BLAT server queries. In such cases, administrators must whitelist the following IP ranges:
128.114.119.* (U.S. site: genome.ucsc.edu)
129.70.40.120 (European mirror: genome-euro.ucsc.edu)
Further details on gfServer options are available from the Source Downloads page (pre-compiled binaries are located in the blat/ directory) and the blat documentation.
gfServers may also be set up within GBiB for local operation; see the GBiB assembly BLAT setup guide for detailed instructions.
To terminate a gfServer instance, run:
gfServer stop localhost 17860
Errors may occur if translatedBlat and nucleotideBlat port numbers are reversed. A typical message in this case is:
Expecting 6 words from server got 2
If a gfServer instance is started from the same directory as the .2bit file, for example:
gfServer start localhost 17779 -stepSize=5 contigsRenamed.2bit &
an attempt to run a DNA sequence query through the web-based BLAT tool may return:
Error in TCP non-blocking connect() 111 - Connection refused
Operation now in progress
Sorry, the BLAT/iPCR server seems to be down. Please try again later.
Check that gfServer is running:
ps aux | grep gfServer
In genomes.txt, the twoBitPath/filename must match the .2bit file used when starting gfServer. The location of the gfServer instance can be verified by changing into the directory where gfServer was launched and running the appropriate hostname command:
hostname -i
This will return an IP address, for example:
132.249.245.79
Test the connection with telnet:
telnet yourIP yourPort
For example:
telnet 132.249.245.79 17777
A successful connection shows:
Connected to 132.249.245.79
If Connection refused appears, gfServer may not be running, or the IP/port configuration is incorrect.
The genomes.txt file should also be checked to confirm that the BLAT line matches the correct IP and port. For example:
blat 132.249.245.79 17777
Instead of:
blat localhost 17777
Check the server status with gfServer:
gfServer status yourLocation yourPort
For example:
gfServer status 132.249.245.79 17777
Sample output might look like:
version 36x2
type nucleotide
host localhost
port 17777
tileSize 11
stepSize 5
minMatch 2
pcr requests 0
blat requests 0
bases 0
misses 0
noSig 1
trimmed 0
warnings 0
Test the connection with gfClient. If gfClient successfully connects to gfServer, the IP/port configuration is correct. Running gfClient directly verifies connectivity independently of the browser interface. From the directory containing the hub's .2bit file, the command can be executed as follows:
gfClient yourLocation yourPort pathTo2bitFile yourFastaQuery.fa output.psl
For example:
gfClient localhost 17777 . query.fa gfOutput.psl
Note the . after the port, which tells gfClient to use the .2bit file in the current directory. Check gfOutput.psl for BLAT results.
DNA test
gfClient yourServer.yourInstitution.edu 17779 `pwd` test.fa dnaTestOut.psl
Protein test (sent to the translated server port)
gfClient -t=dnaX -q=prot yourServer.yourInstitution.edu 17777 `pwd` proteinSequence.fa proteinOut.psl
Ensure that the yourAssembly.2bit file is present on the test machine.
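The telnet test described above only establishes whether a TCP connection to the gfServer port can be opened, and the same check is easy to script. The Python sketch below uses the standard socket module; port_is_open is a hypothetical name, and the host and port in the usage lines are the example values from this section.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Example values from the text; replace with the server being tested.
    print(port_is_open("132.249.245.79", 17777))
```

A True result shows only that something is listening on the port. The gfServer status and gfClient checks are still needed to confirm that the listener is the expected gfServer instance.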
A dynamic BLAT server is specified with the "dynamic" argument to the
blat, transBlat, and isPcr definitions in the hub
genomes.txt file, followed by the gfServer root-relative path of the
directory containing the .2bit and .gfidx files.
For example:
blat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
transBlat yourServer.yourInstitution.edu 4096 dynamic yourAssembly
isPcr yourServer.yourInstitution.edu 4096 dynamic yourAssembly
The genome and gfServer indexes would be:
$rootdir/yourAssembly/yourAssembly.2bit
$rootdir/yourAssembly/yourAssembly.untrans.gfidx
$rootdir/yourAssembly/yourAssembly.trans.gfidx
Refer to the Building gfServer indexes section for detailed instructions on building the index.
For large hubs, it is possible to have more deeply nested directories. For instance, the following NCBI convention:
blat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
transBlat yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
isPcr yourServer.yourInstitution.edu 4096 dynamic GCF/000/181/335/GCF_000181335.3
This will reference these genome files and indexes:
$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.2bit
$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.untrans.gfidx
$rootdir/GCF/000/181/335/GCF_000181335.3/GCF_000181335.3.trans.gfidx
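The file names a dynamic server looks for follow directly from the path given after "dynamic", so they can be computed and checked before the hub is published. The Python sketch below is illustrative: both function names are hypothetical, /data/gfServer stands in for $rootdir, and the accession-to-directory rule is inferred from the single NCBI example above rather than taken from a specification.

```python
import posixpath


def dynamic_index_paths(root_dir, genome_dir):
    """List the .2bit and index files expected for one dynamic genome path."""
    name = posixpath.basename(genome_dir.rstrip("/"))
    base = posixpath.join(root_dir, genome_dir, name)
    return [base + ".2bit", base + ".untrans.gfidx", base + ".trans.gfidx"]


def ncbi_genome_dir(accession):
    """Split an accession such as GCF_000181335.3 into nested directories."""
    prefix, rest = accession.split("_", 1)
    digits = rest.split(".", 1)[0]  # nine digits before the version suffix
    return "/".join([prefix, digits[0:3], digits[3:6], digits[6:9], accession])


genome_dir = ncbi_genome_dir("GCF_000181335.3")
print(genome_dir)
for path in dynamic_index_paths("/data/gfServer", genome_dir):
    print(path)
```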
A status query that does not specify a genome acts as an "I am alive" check:
% gfServer status myserver 4040
version 37x1
serverType dynamic
Specifying -genome checks that the genome is valid and reports how its index was built:
% gfServer -genome=mm10 -genomeDataDir=test/mm10 status myserver 4040
version 37x1
serverType dynamic
type nucleotide
tileSize 11
stepSize 5
minMatch 2
Using -trans checks the translated index:
% gfServer -genome=mm10 -genomeDataDir=test/mm10 -trans status myserver 4040
version 37x1
serverType dynamic
type translated
tileSize 4
stepSize 4
minMatch 3
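Since swapped blat and transBlat ports are a common source of errors, it can help to compare the type reported by gfServer status with the setting a port is assigned to. The Python sketch below parses captured status output. The function names are hypothetical, and the expected types are an assumption read off the sample outputs on this page (nucleotide for blat, translated for transBlat).

```python
# Assumed mapping from genomes.txt setting to the index type it should reach.
EXPECTED_TYPE = {"blat": "nucleotide", "transBlat": "translated"}


def parse_status(output):
    """Turn 'name value' lines into a dict; the last word on each line is the value."""
    status = {}
    for line in output.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1]
    return status


def index_type_matches(output, setting):
    """Check that the reported index type suits the given genomes.txt setting."""
    return parse_status(output).get("type") == EXPECTED_TYPE[setting]


NUCLEOTIDE_STATUS = """\
version 37x1
serverType dynamic
type nucleotide
tileSize 11
stepSize 5
minMatch 2
"""

print(index_type_matches(NUCLEOTIDE_STATUS, "blat"))
print(index_type_matches(NUCLEOTIDE_STATUS, "transBlat"))
```

Splitting on the last space keeps multi-word names such as "pcr requests" intact.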