Schema for PhyloCSF Genes - PhyloCSF Genes - Curated conserved genes
  Database: wuhCor1    Primary Table: PhyloCSFgenes Data last updated: 2020-07-23
Big Bed File Download: /gbdb/wuhCor1/bbi/phyloGenes/
Item Count: 13
The data is stored in the binary BigBed format.

Format description: bigGenePred gene models
field         example      description
chrom         NC_045512v2  Reference sequence chromosome or scaffold
chromStart    265          Start position in chromosome
chromEnd      21555        End position in chromosome
name          ORF1ab       Name or ID of item, ideally both human-readable and unique
score         0            Score (0-1000)
strand        +            + or - for strand
thickStart    265          Start of where display should be thick (start codon)
thickEnd      21555        End of where display should be thick (stop codon)
reserved      11,0,101     RGB value (use R,G,B string in input file)
blockCount    2            Number of blocks
blockSizes    13203,8088   Comma separated list of block sizes
chromStarts   0,13202      Start positions relative to chromStart
name2         ORF1ab       Alternative/human readable name
cdsStartStat  cmpl         Status of CDS start annotation (none, unknown, incomplete, or complete)
cdsEndStat    cmpl         Status of CDS end annotation (none, unknown, incomplete, or complete)
exonFrames    0,0          Exon frame {0,1,2}, or -1 if no frame for exon
type          N.a.         Transcript type
geneName      ORF1ab       Primary identifier for gene
geneName2     ORF1ab       Alternative/human-readable gene name
geneType      N.a.         Gene type
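
The block fields are easier to read once they are resolved into chromosome coordinates. The following is a minimal sketch, not part of the track documentation: it assumes a row arrives as tab-separated text with the twenty fields in the order listed above, and the names GeneModel, parse_row, and block_intervals are hypothetical helpers introduced for illustration. The example row is assembled from the example values in the field list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    strand: str
    block_sizes: List[int]
    block_starts: List[int]


def parse_int_list(text: str) -> List[int]:
    """Parse a comma-separated integer list, tolerating a trailing comma."""
    return [int(item) for item in text.split(",") if item]


def parse_row(line: str) -> GeneModel:
    """Parse one tab-separated row (assumed field order: as in the schema)."""
    f = line.rstrip("\n").split("\t")
    return GeneModel(
        chrom=f[0],
        chrom_start=int(f[1]),
        chrom_end=int(f[2]),
        name=f[3],
        strand=f[5],
        block_sizes=parse_int_list(f[10]),   # blockSizes
        block_starts=parse_int_list(f[11]),  # chromStarts, relative to chromStart
    )


def block_intervals(model: GeneModel) -> List[Tuple[int, int]]:
    """Return (start, end) per block: chromStart + relative start, plus size."""
    return [
        (model.chrom_start + rel, model.chrom_start + rel + size)
        for rel, size in zip(model.block_starts, model.block_sizes)
    ]


# Example row built from the example column of the schema.
EXAMPLE_ROW = "\t".join([
    "NC_045512v2", "265", "21555", "ORF1ab", "0", "+", "265", "21555",
    "11,0,101", "2", "13203,8088", "0,13202", "ORF1ab", "cmpl", "cmpl",
    "0,0", "N.a.", "ORF1ab", "ORF1ab", "N.a.",
])

model = parse_row(EXAMPLE_ROW)
for start, end in block_intervals(model):
    print(model.name, start, end)
```

For the ORF1ab example the second block begins one base before the first block ends, and the last block ends at chromEnd.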

PhyloCSF Genes (phyloGenes) Track Description


These tracks show curated SARS-CoV-2 protein-coding genes conserved within the Sarbecovirus subgenus as determined using PhyloCSF [1], FRESCo [2], and other comparative genomics methods, consistent with experimental evidence in SARS-CoV-2. Ambiguous gene names were resolved according to the recommendations in [3]. For a complete description of the evidence, see [4].

  • The PhyloCSF Genes track shows the conserved protein-coding genes, namely ORF1a, ORF1ab, S, ORF3a, ORF3c, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF9b.


    • ORF3c is a 41 codon ORF overlapping ORF3a in a different frame with coordinates 25457-25582; it has also been referred to as ORF3h, ORF3a*, and 3a.iORF1.
    • ORF9b is a 97 codon ORF overlapping N in a different frame with coordinates 28284-28577; it has also been referred to as ORF9a.
  • The PhyloCSF Rejected Genes track shows other gene candidates that have been proposed that do not show the signature of conserved protein-coding genes or persuasive experimental evidence of function [4], and are thus unlikely to be actual protein-coding genes, namely ORF2b, ORF3d, ORF3d-2, ORF3b, ORF9c, and ORF10.


    • ORF2b is a 39 codon ORF with coordinates 21744-21860 overlapping the spike protein in a different frame; it has also been referred to as S.iORF1.
    • ORF3d is a 57 codon ORF with coordinates 25524-25697 overlapping ORF3a in a different frame; it has also been referred to as ORF3b.
    • ORF3d-2 is a 33 codon ORF with coordinates 25596-25697 that is a subset of ORF3d starting at a downstream in-frame AUG codon; it has also been referred to as 3a.iORF2.
    • ORF3b is the 22 codon ortholog of the 5' end of SARS-CoV ORF3b with coordinates 25814-25882, ending at an in-frame stop codon that is not present in SARS-CoV.
    • ORF9c is a 73 codon ORF overlapping N in a different frame with coordinates 28734-28955; it has also been referred to as ORF9b and ORF14.
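
The coordinate ranges in the lists above can be checked against the stated codon counts with a little arithmetic. The sketch below assumes the coordinates are 1-based and inclusive, which the text does not state explicitly; span_triplets is a hypothetical helper name. For the two conserved overlapping ORFs, the listed interval contains one more triplet than the stated codon count, which would be consistent with the interval including the stop codon, though that is an inference rather than something the description says.

```python
def span_triplets(start: int, end: int) -> int:
    """Number of whole triplets in a 1-based inclusive interval (assumed convention)."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"interval length {length} is not a multiple of 3")
    return length // 3


# Coordinates copied from the track description above.
CONSERVED_OVERLAPPING_ORFS = {
    "ORF3c": (25457, 25582),
    "ORF9b": (28284, 28577),
}

for name, (start, end) in CONSERVED_OVERLAPPING_ORFS.items():
    print(name, end - start + 1, "nt,", span_triplets(start, end), "triplets")
```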

Data Access

The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server.

Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example:

bigBedToBed -chrom=NC_045512v2 -start=0 -end=29902 stdout
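
For scripted use, the same range query can be assembled programmatically. This is a minimal sketch under stated assumptions: bigbed_range_command is a hypothetical helper, and PATH_OR_URL_TO_BIGBED is a placeholder for the bigBed file location, which is not given here and must be supplied by the reader. The sketch only builds the argument list; it does not run the tool.

```python
import shlex
from typing import List


def bigbed_range_command(source: str, chrom: str, start: int, end: int,
                         output: str = "stdout") -> List[str]:
    """Build a bigBedToBed argument list restricted to one chromosome range."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return [
        "bigBedToBed",
        source,                # placeholder: local path or URL of the bigBed file
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        output,                # "stdout" writes the BED text to standard output
    ]


cmd = bigbed_range_command("PATH_OR_URL_TO_BIGBED", "NC_045512v2", 0, 29902)
print(shlex.join(cmd))
```

Once a real file location is substituted and bigBedToBed is installed, the list can be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the BED lines.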

Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.


See [4]. Note that the data was updated in June 2021: ORF14 was renamed to ORF9c, and ORF2b and ORF3d-2 were added.


Questions should be directed to Irwin Jungreis.

If you use the SARS-CoV-2 PhyloCSF Genes Track Hub, please cite Jungreis et al. 2021 [4].


[1] Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).

[2] Sealfon RS, Lin MF, Jungreis I, Wolf MY, Kellis M, Sabeti PC (2015). FRESCo: finding regions of excess synonymous constraint in diverse viruses. Genome Biol. doi: 10.1186/s13059-015-0603-7.

[3] Jungreis I, Nelson CW, Ardern Z, Finkel Y, Krogan NJ, Sato K, ... Kellis M (2021). Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: A homology-based resolution. Virology 558:145-151.

[4] Jungreis I, Sealfon R, Kellis M (2021). SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications 12(1), 1-20. doi:10.1038/s41467-021-22905-7