Schema for UNC/BSU ProtGeno - Proteogenomics Hg19 Mapping from ENCODE/Univ. North Carolina/Boise State Univ.
  Database: hg19    Primary Table: wgEncodeUncBsuProtGm12878NucleusSig    Row Count: 58,404   Data last updated: 2011-06-06
Format description: Format for genomic mappings of mass spec proteogenomic hits
On download server: MariaDB table dump directory
field               example            SQL type              info    description
bin                 591                smallint(5) unsigned  range   Indexing field to speed chromosome range queries.
chrom               chr1               varchar(255)          values  Reference sequence chromosome or scaffold
chromStart          871331             int(10) unsigned      range   Start position in chromosome
chromEnd            871382             int(10) unsigned      range   End position in chromosome
name                WRLHSAPAALAGSQRGR  varchar(255)          values  Peptide sequence of the hit
score               321                int(10) unsigned      range   Log e-value scaled to a score of 0 (worst) to 1000 (best)
strand              +                  char(1)               values  + or -
rawScore            85.7783            float                 range   Raw score for this hit, as estimated through HMM analysis
spectrumId          611.97937          varchar(255)          values  Non-unique identifier for the spectrum file
peptideRank         2                  int(10) unsigned      range   Rank of this hit, for peptides with multiple genomic hits
peptideRepeatCount  1                  int(10) unsigned      range   Indicates how many times this same hit was observed
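
The bin field can be recomputed from chromStart and chromEnd. The following is a minimal sketch, assuming this table uses the standard (non-extended) UCSC binning scheme with 128 kb smallest bins; the function name bin_from_range is chosen here for illustration.

```python
# Constants of the standard UCSC binning scheme (assumed to apply to this table).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # smallest bins span 2**17 bases (128 kb)
BIN_NEXT_SHIFT = 3    # each coarser level is 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval extends past 512 Mb; the standard scheme does not cover it")


print(bin_from_range(871331, 871382))  # 591
print(bin_from_range(949465, 949492))  # 592
```

Both results agree with the bin values stored for the corresponding rows of this table, which is what makes the bin column usable as an index for range queries.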

Sample Rows
 
bin  chrom  chromStart  chromEnd  name                 score  strand  rawScore  spectrumId   peptideRank  peptideRepeatCount
591  chr1   871331      871382    WRLHSAPAALAGSQRGR    321    +       85.7783   611.97937    2            1
591  chr1   880898      880955    DLFDLNSSEEDDTEGFSER  426    -       138.185   1102.955353  1            1
591  chr1   881600      881642    EEGTPLTLYYSHWR       215    -       135.323   876.4276425  1            1
591  chr1   881600      881642    EEGTPLTLYYSHWR       425    -       159.536   876.4260575  1            1
591  chr1   889425      889461    MLQPSSSPLWGK         203    -       147.823   665.8483575  1            1
591  chr1   892577      892607    LKDRDPEFYK           380    -       236.608   655.8444525  1            2
591  chr1   892607      892637    ASEHKDQLSR           149    -       264.017   585.7992875  1            2
592  chr1   949465      949492    IGVHAFQQR            541    +       227.057   528.2908025  1            1
592  chr1   949492      949531    LAVHPSGVALQDR        141    +       153.015   454.9229734  1            1
592  chr1   949492      949531    LAVHPSGVALQDR        169    +       184.136   454.9231267  1            1
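
For readers working from the table dump, the sketch below parses one row into a typed record. It assumes the dump is tab-separated text with one row per line and the eleven columns in schema order; the PeptideHit class and parse_hit function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    bin: int
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str            # peptide sequence
    score: int
    strand: str
    raw_score: float
    spectrum_id: str     # kept as text, matching the varchar(255) column
    peptide_rank: int
    peptide_repeat_count: int


def parse_hit(line: str) -> PeptideHit:
    """Parse one tab-separated row laid out in schema column order."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError(f"expected 11 fields, got {len(f)}")
    return PeptideHit(int(f[0]), f[1], int(f[2]), int(f[3]), f[4], int(f[5]),
                      f[6], float(f[7]), f[8], int(f[9]), int(f[10]))


# First sample row, written with explicit tab separators.
row = "591\tchr1\t871331\t871382\tWRLHSAPAALAGSQRGR\t321\t+\t85.7783\t611.97937\t2\t1"
hit = parse_hit(row)
print(hit.name, hit.score, hit.peptide_rank)
```

Keeping spectrumId as a string avoids losing digits from an identifier that only looks numeric.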

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
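
As a small illustration of the coordinate convention, the sketch below converts a stored interval to a 1-based position string and checks that each span covers exactly three bases per residue. It assumes the usual half-open convention (0-based start, end not included); both function names are hypothetical.

```python
def to_one_based(chrom: str, chrom_start: int, chrom_end: int) -> str:
    """Format a 0-based, half-open interval as a 1-based, inclusive position."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def span_matches_peptide(chrom_start: int, chrom_end: int, peptide: str) -> bool:
    """True if the interval length equals three bases per peptide residue."""
    return chrom_end - chrom_start == 3 * len(peptide)


print(to_one_based("chr1", 871331, 871382))
print(span_matches_peptide(871331, 871382, "WRLHSAPAALAGSQRGR"))
print(span_matches_peptide(880898, 880955, "DLFDLNSSEEDDTEGFSER"))
```

With a 0-based start, the length of an item is simply chromEnd minus chromStart. For the first sample row that is 51 bases, matching the 17 residues of the peptide.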

UNC/BSU ProtGeno (wgEncodeUncBsuProt) Track Description
 

Description

The ENCODE project has revealed the functional elements of segments of the human genome in unprecedented detail. However, the ability to clearly distinguish transcripts designated for translation into protein from those that serve purely regulatory roles remains elusive. The standard means for doing this is to measure the proteins, if any, that are produced by transcripts via mass spectrometry-based proteogenomic mapping. In this process, chromatographically fractionated peptides are fed into a tandem mass spectrometer (MS/MS). The series of fragment masses produced in MS/MS creates a signature that can then be used to identify the peptide from a protein or DNA sequence database. For proteogenomic mapping, this identifying spectrum is mapped directly back to its most likely encoding locus on a genome sequence (Giddings, et al. 2003). This allows the direct verification of protein-encoding transcripts.

The proteogenomic track displays mass spectrometry data that have been matched to the genomic sequence for selected cell lines, using a workflow and software specifically designed for this purpose.

The proteogenomic tracks can be used to identify which parts of the genome are translated into proteins, to verify which transcripts discovered by ENCODE are protein-encoding, and to reveal new genes and/or splice variants of genes. Of particular interest may be their ability to reveal the translation of small open reading frames (ORFs), antisense transcripts, or sites annotated as introns that encode proteins.

Display Conventions and Configuration

The display for this track shows peptide mappings as contiguous, rectangular items. These items are rendered in grayscale according to the score, with darker items representing higher-confidence peptide mappings. The name of each item is the amino acid sequence of the peptide. If a period (.) appears at the end of a name, it signifies a stop codon.

In addition to the displayed genomic coordinates, several additional fields are available for each track item.

  • The Raw Score reflects the strength of the peptide mapping, in contrast to the Score field, which reflects the confidence of the mapping. The Score field is computed as -100×log10(E-Value) for the peptide mapping. Scores of 200 or greater have an estimated 5% false discovery rate (FDR), while scores of 230 or greater have an estimated 1% FDR. The Raw Score offers an additional level of confidence: raw scores of 300 or greater have an estimated 5% FDR. Note that Raw Score is not normalized for the length of the peptide mapping, while Score is. Consequently, short mappings might have a strong Raw Score but a weaker Score.
  • The Spectrum ID is a semi-unique identifier of the spectrum associated with the peptide mapping, and can be used to track the origins of the mapping.
  • The Peptide Rank indicates the rank of each peptide/spectrum mapping. A spectrum can be chimeric, containing more than one peptide, and the spectrum can be mapped with confidence to two or more distinct peptides. Peptides with ranks greater than 3 are deleted from the track.
  • The Peptide Repeat Count indicates the number of places in the genome that match the peptide sequence. This reflects the uniqueness of the peptide mapping in the genome. Mappings to highly duplicated regions have a high Peptide Repeat Count. Peptides that were repeated more than 10 times in the genome were deleted from the track.
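
The Score formula and the filtering rules in the list above can be sketched as follows. Rounding to an integer and clamping to the 0 to 1000 range are assumptions made here to match the integer score column; the source gives only the formula and the range. All function names are hypothetical.

```python
import math


def score_from_evalue(e_value: float) -> int:
    """Score = -100 * log10(E-Value), rounded and clamped to 0..1000 (assumed)."""
    if e_value <= 0:
        return 1000  # assumption: treat a zero E-value as the best possible score
    return max(0, min(1000, round(-100 * math.log10(e_value))))


def fdr_tier(score: int) -> str:
    """Map a Score to the estimated FDR tiers quoted in the track description."""
    if score >= 230:
        return "estimated 1% FDR"
    if score >= 200:
        return "estimated 5% FDR"
    return "above 5% FDR"


def kept_in_track(peptide_rank: int, peptide_repeat_count: int) -> bool:
    """Apply the stated cutoffs: rank at most 3, repeat count at most 10."""
    return peptide_rank <= 3 and peptide_repeat_count <= 10


print(score_from_evalue(0.01), fdr_tier(score_from_evalue(0.01)))
print(score_from_evalue(0.05), fdr_tier(score_from_evalue(0.05)))
print(kept_in_track(2, 1), kept_in_track(4, 1))
```

Under this formula an E-Value of 0.01 corresponds to a Score of 200, which lines up with the 5% FDR cutoff stated for both quantities.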

Methods

ENCODE cell lines K562 and GM12878 were used for large-scale proteomic analysis. Cell lines were cultured according to standard ENCODE cell culture protocols and in-gel digestion was completed according to the standard protocol (Shevchenko, et al. 2006).

The proteolytic enzyme trypsin was used to digest the proteins in order to produce short peptides that can be analyzed by MS/MS. Trypsin is a common protease that typically cleaves proteins after Arginine or Lysine. The metadata parameter enzyme specifies the protease used for digestion. Because the enzyme is not perfectly efficient, it does not always cleave at Arginine or Lysine, so there may be peptides that include an uncleaved Arg/Lys site. The number of such missed cleavages allowed in the search is described by the metadata parameter miscleavages. Tandem mass spectrometry (RPLC-MS/MS) analysis was then performed on an Eksigent Ultra-LTQ Orbitrap system.
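
The effect of the miscleavages parameter can be illustrated with a simple in-silico digestion. This is a minimal sketch that cleaves after every Arginine (R) or Lysine (K) and ignores any other cleavage rules; it is not the search software's own implementation. The input sequence joins two adjacent peptides from the sample rows purely for illustration, and the function name is hypothetical.

```python
from typing import List


def tryptic_peptides(protein: str, missed_cleavages: int = 2) -> List[str]:
    """List peptides from cleaving after K or R, allowing skipped cleavage sites."""
    if not protein:
        return []
    # Cut positions: both termini plus the position after each internal K or R.
    cuts = [0] + [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
    cuts.append(len(protein))
    peptides = []
    for i in range(len(cuts) - 1):
        # j - i - 1 is the number of cleavage sites left uncut inside the peptide.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(cuts))):
            peptides.append(protein[cuts[i]:cuts[j]])
    return peptides


sequence = "ASEHKDQLSRLKDRDPEFYK"  # illustrative input, not a verified protein
print(tryptic_peptides(sequence, missed_cleavages=0))
print("LKDRDPEFYK" in tryptic_peptides(sequence, missed_cleavages=2))
print("LKDRDPEFYK" in tryptic_peptides(sequence, missed_cleavages=1))
```

The sample peptide LKDRDPEFYK contains two internal Arg/Lys sites, so it can only be produced when two missed cleavages are allowed.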

We performed proteogenomic mapping (Jaffe, et al. 2004) with two missed cleavages allowed against the whole human genomic sequence (UCSC hg19), using the genome fingerprint scanning (GFS) program (Giddings, et al. 2003) and the newly developed Peppy (http://www.peppyresearch.com/). We used HMM_Score (Khatun, et al. 2008) to accurately match MS/MS spectra to their corresponding genome sequences. An E-value is calculated for each match; it estimates the number of results at the given score level that would be expected by random chance. We then empirically derived the false discovery rate for a given E-Value using a decoy database search, and only those matches falling within the specified 5% FDR (E-Value <0.01) are included in the track. The results with 10% FDR (E-Value <0.05) are available under the Downloads page as Raw Signal.
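
A decoy-based FDR estimate of the kind described above is commonly computed as the ratio of decoy matches to target matches passing a threshold. The sketch below shows that common estimate only; it is not necessarily the exact procedure used for this track, the function name is hypothetical, and the E-values are made-up numbers for illustration.

```python
from typing import Iterable


def empirical_fdr(target_evalues: Iterable[float],
                  decoy_evalues: Iterable[float],
                  threshold: float) -> float:
    """Estimate FDR as (decoy matches) / (target matches) with E-value below threshold."""
    targets = sum(1 for e in target_evalues if e < threshold)
    decoys = sum(1 for e in decoy_evalues if e < threshold)
    return decoys / targets if targets else 0.0


# Made-up E-values, for illustration only.
target = [0.0001, 0.002, 0.005, 0.008, 0.02, 0.04]
decoy = [0.009, 0.03, 0.2]

print(empirical_fdr(target, decoy, threshold=0.01))
print(empirical_fdr(target, decoy, threshold=0.05))
```

Scanning thresholds in this way is how an E-Value cutoff can be paired with a target FDR, as with the E-Value <0.01 and E-Value <0.05 cutoffs quoted above.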

Release Notes

This is Release 2 (July 2012). It contains a total of seven Proteogenomics experiments with the addition of one experiment available by download only. Unlike other ENCODE data, these data are not archived at GEO but at Proteome Commons. The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.

Credits

Proteogenomic mapping: Dr. Jainab Khatun, Brian Risk, Mustaque Ahamed, Christopher Maier, Dr. John Wrobel and Dennis Crenshaw (Giddings Lab).

Proteomic analysis: Drs. Yanbao Yu and Ling Xie (Chen Lab).

Main Contact: Jainab Khatun

References

Giddings MC, Shah AA, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.

Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004 Jan;4(1):59-77.

Khatun J, Hamlett E, Giddings MC. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics. 2008 Mar 1;24(5):674-81.

Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M. In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nat Protoc. 2006;1(6):2856-60.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.