Schema for UNC/BSU ProtGeno - Proteogenomics Hg19 Mapping from ENCODE/Univ. North Carolina/Boise State Univ.

Database: hg19
Primary Table: wgEncodeUncBsuProtGm12878NucleusSig
Row Count: 58,404
Data last updated: 2011-06-06
Format description: Format for genomic mappings of mass spec proteogenomic hits
On download server: MariaDB table dump directory

field	example	SQL type	info	description
`bin`	591	`smallint(5) unsigned`	range	Indexing field to speed chromosome range queries.
`chrom`	chr1	`varchar(255)`	values	Reference sequence chromosome or scaffold
`chromStart`	871331	`int(10) unsigned`	range	Start position in chromosome
`chromEnd`	871382	`int(10) unsigned`	range	End position in chromosome
`name`	WRLHSAPAALAGSQRGR	`varchar(255)`	values	Peptide sequence of the hit
`score`	321	`int(10) unsigned`	range	Log e-value scaled to a score of 0 (worst) to 1000 (best)
`strand`	+	`char(1)`	values	+ or -
`rawScore`	85.7783	`float`	range	Raw score for this hit, as estimated through HMM analysis
`spectrumId`	611.97937	`varchar(255)`	values	Non-unique identifier for the spectrum file
`peptideRank`	2	`int(10) unsigned`	range	Rank of this hit, for peptides with multiple genomic hits
`peptideRepeatCount`	1	`int(10) unsigned`	range	Indicates how many times this same hit was observed

Sample Rows

bin	chrom	chromStart	chromEnd	name	score	strand	rawScore	spectrumId	peptideRank	peptideRepeatCount
591	chr1	871331	871382	WRLHSAPAALAGSQRGR	321	+	85.7783	611.97937	2	1
591	chr1	880898	880955	DLFDLNSSEEDDTEGFSER	426	-	138.185	1102.955353	1	1
591	chr1	881600	881642	EEGTPLTLYYSHWR	215	-	135.323	876.4276425	1	1
591	chr1	881600	881642	EEGTPLTLYYSHWR	425	-	159.536	876.4260575	1	1
591	chr1	889425	889461	MLQPSSSPLWGK	203	-	147.823	665.8483575	1	1
591	chr1	892577	892607	LKDRDPEFYK	380	-	236.608	655.8444525	1	2
591	chr1	892607	892637	ASEHKDQLSR	149	-	264.017	585.7992875	1	2
592	chr1	949465	949492	IGVHAFQQR	541	+	227.057	528.2908025	1	1
592	chr1	949492	949531	LAVHPSGVALQDR	141	+	153.015	454.9229734	1	1
592	chr1	949492	949531	LAVHPSGVALQDR	169	+	184.136	454.9231267	1	1
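
As an illustration of the schema, the following minimal Python sketch parses tab-separated rows in the column order listed above and converts the score back to an E-value, assuming the relation Score = -100×log₁₀(E-Value) given in the track description. The names PeptideHit, parse_row and score_to_evalue are hypothetical helpers introduced here; they are not part of any UCSC tool. The 200 cutoff is the score that the track description associates with an estimated 5% FDR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """One row of the table; field names follow the schema (hypothetical class)."""
    bin: int
    chrom: str
    chromStart: int
    chromEnd: int
    name: str
    score: int
    strand: str
    rawScore: float
    spectrumId: str  # varchar in the schema, so kept as text
    peptideRank: int
    peptideRepeatCount: int


def parse_row(line: str) -> PeptideHit:
    """Parse one tab-separated row in schema column order."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 11:
        raise ValueError(f"expected 11 columns, got {len(cols)}")
    return PeptideHit(
        int(cols[0]), cols[1], int(cols[2]), int(cols[3]), cols[4],
        int(cols[5]), cols[6], float(cols[7]), cols[8],
        int(cols[9]), int(cols[10]),
    )


def score_to_evalue(score: int) -> float:
    """Invert Score = -100 * log10(E-Value); assumes no clamping was applied."""
    return 10.0 ** (-score / 100.0)


# Three of the sample rows shown above.
SAMPLE = (
    "591\tchr1\t871331\t871382\tWRLHSAPAALAGSQRGR\t321\t+\t85.7783\t611.97937\t2\t1\n"
    "591\tchr1\t889425\t889461\tMLQPSSSPLWGK\t203\t-\t147.823\t665.8483575\t1\t1\n"
    "592\tchr1\t949492\t949531\tLAVHPSGVALQDR\t141\t+\t153.015\t454.9229734\t1\t1\n"
)

hits = [parse_row(line) for line in SAMPLE.splitlines()]
for hit in hits:
    # Flag rows at or above the score the description ties to ~5% FDR.
    print(hit.name, hit.score, f"{score_to_evalue(hit.score):.2e}", hit.score >= 200)
```

Because the schema describes the score as scaled to the range 0 to 1000, very small E-values may have been capped, so the inverse conversion should be treated as approximate.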

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
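
A minimal Python sketch of what the 0-based convention means in practice, assuming the usual Genome Browser table convention that chromStart is 0-based and chromEnd is exclusive. The function names are hypothetical. The sketch also checks that each span is three bases per residue of the peptide name, which holds for the sample rows shown above; whether it holds for every row in the table (for example, names ending in a period) is not stated in this document.

```python
def to_one_based(chrom_start: int, chrom_end: int) -> tuple:
    """Convert a 0-based, half-open [start, end) span to 1-based, closed [start, end]."""
    return chrom_start + 1, chrom_end


def span_length(chrom_start: int, chrom_end: int) -> int:
    """Number of bases covered by a 0-based, half-open span."""
    return chrom_end - chrom_start


# (chromStart, chromEnd, name) taken from the sample rows above.
ROWS = [
    (871331, 871382, "WRLHSAPAALAGSQRGR"),
    (880898, 880955, "DLFDLNSSEEDDTEGFSER"),
    (949465, 949492, "IGVHAFQQR"),
]

for start, end, peptide in ROWS:
    one_based = to_one_based(start, end)
    # Three nucleotides encode one amino acid.
    codon_consistent = span_length(start, end) == 3 * len(peptide)
    print(one_based, span_length(start, end), codon_consistent)
```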

UNC/BSU ProtGeno (wgEncodeUncBsuProt) Track Description


Description

The ENCODE project has revealed the functional elements of segments of the human genome in unprecedented detail. However, the ability to clearly distinguish transcripts designated for translation into protein from those that serve purely regulatory roles remains elusive. The standard means of doing this is to measure the proteins, if any, that are produced by transcripts via mass spectrometry-based proteogenomic mapping. In this process, chromatographically fractionated peptides are fed into a tandem mass spectrometer (MS/MS). The series of fragment masses produced in MS/MS creates a signature that can then be used to identify the peptide from a protein or DNA sequence database. For proteogenomic mapping, this identifying spectrum is mapped directly back to its most likely encoding locus on a genome sequence (Giddings, et al. 2003). This allows the direct verification of protein-encoding transcripts.

The proteogenomic track displays mass spectrometry data that have been matched to the genomic sequence for selected cell lines, using a workflow and software specifically designed for this purpose. The proteogenomic tracks can be used to identify which parts of the genome are translated into proteins, to verify which transcripts discovered by ENCODE are protein-encoding, and to reveal new genes and/or splice variants of genes. Of particular interest may be their ability to reveal the translation of small open reading frames (ORFs), antisense transcripts, or sites annotated as introns that encode proteins.

Display Conventions and Configuration

The display for this track shows peptide mappings as contiguous, rectangular items. These items are rendered in grayscale according to the score, with darker items representing higher-confidence peptide mappings. The name of each item is the amino acid sequence of the peptide. If a period (.) appears at the end of a name, it signifies a stop codon.
In addition to the displayed genomic coordinates, several additional fields are available for each track item.

The Raw Score reflects the strength of the peptide mapping, in contrast to the Score field, which reflects the confidence of the mapping. The Score field is computed as -100×log₁₀(E-Value) for the peptide mapping. Scores of 200 or greater have an estimated 5% false discovery rate (FDR), while scores of 230 or greater have an estimated 1% FDR. The Raw Score offers an additional level of confidence: raw scores of 300 or greater have an estimated 5% FDR. Note that Raw Score is not normalized for the length of the peptide mapping, while Score is. Consequently, short mappings might have a strong Raw Score but a weaker Score.

The Spectrum ID is a semi-unique identifier of the spectrum associated with the peptide mapping, and can be used to track the origins of the mapping.

The Peptide Rank indicates the rank of each peptide/spectrum mapping. A spectrum can be chimeric, containing more than one peptide, so a single spectrum can be mapped with confidence to two or more distinct peptides. Peptides with ranks greater than 3 were removed from the track.

The Peptide Repeat Count indicates the number of places in the genome that match the peptide sequence. This reflects the uniqueness of the peptide mapping in the genome. Any mapping to a highly duplicated region will have a high Peptide Repeat Count. Peptides that were repeated more than 10 times in the genome were removed from the track.

Methods

ENCODE cell lines K562 and GM12878 were used for large-scale proteomic analysis. Cell lines were cultured according to standard ENCODE cell culture protocols, and in-gel digestion was completed according to the standard protocol (Shevchenko, et al. 2006). The proteolytic enzyme trypsin was used to digest the proteins in order to produce short peptides suitable for MS/MS analysis. Trypsin is a common protease that typically cleaves proteins after Arginine or Lysine.
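
The cleavage rule can be illustrated with a minimal in-silico digestion sketch in Python. This is a simplification, not the pipeline used for the track: it cuts after every K or R and ignores known exceptions to the rule. The function tryptic_peptides and its missed_cleavages argument are hypothetical names; the argument mirrors the number of missed cleavages allowed in the search, which the Methods state was two.

```python
def tryptic_peptides(protein: str, missed_cleavages: int = 0) -> list:
    """Return peptides from cutting after every K or R (simplified rule).

    With missed_cleavages = n, up to n adjacent cleavage sites may be left
    uncut, so each peptide spans at most n + 1 fully cleaved fragments.
    """
    # Step 1: fully cleaved fragments.
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder

    # Step 2: join runs of adjacent fragments to model missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# LKDRDPEFYK is one of the peptide names in the sample rows; it contains
# two internal K/R sites, so it only appears when two misses are allowed.
print(tryptic_peptides("LKDRDPEFYK", missed_cleavages=0))
print(tryptic_peptides("LKDRDPEFYK", missed_cleavages=2))
```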
The metadata parameter enzyme specifies the proteolytic enzyme used for digestion. Due to enzyme inefficiency, trypsin does not always cleave at Arginine or Lysine, so there may be peptides that include an uncleaved Arg/Lys site. The number of such missed cleavages allowed in the search is described by the metadata parameter miscleavages. Tandem mass spectrometry (RPLC-MS/MS) analysis was then performed on an Eksigent Ultra-LTQ Orbitrap system.

We performed proteogenomic mapping (Jaffe, et al. 2004) with two missed cleavages allowed, using the whole human genomic sequence (UCSC hg19), via the genome fingerprint scanning (GFS) program (Giddings, et al. 2003) and the newly developed Peppy (http://www.peppyresearch.com/). We used HMM_Score (Khatun, et al. 2008) to accurately match MS/MS spectra to their corresponding genome sequences. E-values were calculated; these estimate the number of results at the given score level that would be expected by random chance. We then empirically derived the false discovery rate for a given E-Value using a decoy database search, and only those matches falling within the specified 5% FDR (E-Value <0.01) are included in the track. The results with 10% FDR (E-Value <0.05) are available under the Downloads page as Raw Signal.

Release Notes

This is Release 2 (July 2012). It contains a total of seven Proteogenomics experiments, with the addition of one experiment available by download only. Unlike other ENCODE data, these data are not archived at GEO but at Proteome Commons. The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.

Credits

Proteogenomic mapping: Dr. Jainab Khatun, Brian Risk, Mustaque Ahamed, Christopher Maier, Dr. John Wrobel and Dennis Crenshaw (Giddings Lab).
Proteomic analysis: Drs. Yanbao Yu and Ling Xie (Chen Lab).
Main Contact: Jainab Khatun

References

Giddings MC, Shah AA, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.

Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004 Jan;4(1):59-77.

Khatun J, Hamlett E, Giddings MC. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics. 2008 Mar 1;24(5):674-81.

Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M. In-gel digestion for mass spectrometric characterization of proteins and proteomes. Nat Protoc. 2006;1(6):2856-60.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.
