Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell-lines
Colour Online: See the article online to view Figs. 1–3 in colour.
Abstract
Protein quantification at proteome-wide scale is an important aim, enabling insights into fundamental cellular biology and serving to constrain experiments and theoretical models. While proteome-wide quantification is not yet fully routine, many datasets approaching proteome-wide coverage are becoming available through biophysical and MS techniques. Data of this type can be accessed via a variety of sources, including publication supplements and online data repositories. However, access to the data is still fragmentary, and comparisons across experiments and organisms are not straightforward. Here, we describe recent updates to our database resource “PaxDb” (Protein Abundances Across Organisms). PaxDb focuses on protein abundance information at proteome-wide scope, irrespective of the underlying measurement technique. Quantification data is reprocessed, unified, and quality-scored, and then integrated to build a meta-resource. PaxDb also allows evolutionary comparisons through precomputed gene orthology relations. Recently, we have expanded the scope of the database to include cell-line samples, and more systematically scan the literature for suitable datasets. We report that a significant fraction of published experiments cannot readily be accessed and/or parsed for quantitative information, requiring additional steps and efforts. The current update brings PaxDb to 414 datasets in 53 organisms, with (semi-) quantitative abundance information covering more than 300 000 proteins.
Abbreviations
- FDR: false discovery rate
- PaxDb: Protein Abundances Across Organisms
1 Introduction
Data processing and data reuse in proteomics remain challenging, more so than in other fields such as transcriptomics or genomics 1, 2. On the one hand, this is due to the sheer complexity of the proteome, in which cellular proteins are expressed in a large diversity of isoforms and modifications, over a huge dynamic range, and in a variety of cellular localizations and biochemical contexts 3, 4. On the other hand, technical and conceptual advances in proteomics currently happen so fast that it remains a challenge to unify and critically appraise all of the data as they arrive 5-7. Nevertheless, achieving deep quantitative coverage of the complete proteome is an essential milestone in the characterization of any model organism or tissue of interest, providing an important baseline for subsequent studies.
A growing number of online resources are dedicated to the processing and dissemination of proteomics data; they operate at various levels of postprocessing and data integration. Of these, the largest repositories of primary, raw data are those organized in the ProteomeXchange consortium 8: PRIDE 9, PeptideAtlas 10, MassIVE 11, and PASSEL 10. A number of additional resources build on these raw data collections, as well as on additional curation, submission, and/or reprocessing. These typically offer higher levels of integration and standardization, but are sometimes also more specialized in terms of scope and coverage. They include GPMDB 12, MOPED 13, ProteomicsDB 14, MaxQB 15, and Human Proteome Map 16. In addition, other databases whose primary focus is not exclusively on proteomics may also contain information on protein abundances, notably UniProt 17 and NextProt 18.
- PaxDb's primary focus is on consistency and comparability, both between datasets and between organisms. This is achieved by remapping all abundance information onto the same reference space of protein sequences and genome annotations, and by providing precomputed orthology relationships that allow comparisons between organisms, at the protein family level, across the entire tree of life.
- PaxDb is “locus-centric”: information on alternative protein isoforms or PTMs is collapsed down to the level of the single, protein-coding gene locus. This is a conscious decision, aiming to facilitate data interpretation and user interaction, and it should be useful in all scenarios where “proteoform” resolution 3 is not required.
- PaxDb introduces a unique quality estimate, which applies at the level of entire datasets, as opposed to individual peptides or proteins. This metric aims to describe how well the observed spread of abundance values in a given dataset covers and delineates known functional groupings of proteins (e.g. protein complexes). The metric is called the “interaction consistency score” 19, and allows comparisons between datasets irrespective of the data source or measurement technique.
- When populating PaxDb, datasets are chosen and filtered manually, so as to reflect largely unperturbed, “wild-type” cells, tissues, and organisms.
- PaxDb is purely a meta-resource; it does not currently accept user submissions. All its data are imported from primary proteomics databases or from publication supplements; the original search parameters, false discovery rates, and other technical settings are left unchanged.
- For each organism or tissue that is already covered by multiple available experiments/datasets, PaxDb performs a weighted averaging to produce an integrated “best-estimate” dataset, guided by the above quality estimates 19.
- Lastly, PaxDb presents its information in an intuitive and simple web interface, which is enriched with accessory information regarding the annotation, structure, and interaction partners of the various proteins.
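The weighted averaging mentioned in the list above can be illustrated with a minimal Python sketch. The exact weighting scheme is described in 19 and is not reproduced here; the function name `integrate_datasets`, the use of one non-negative weight per dataset, and the handling of proteins that are missing from some datasets are all assumptions made for illustration only.

```python
from collections import defaultdict


def integrate_datasets(datasets, weights):
    """Weighted mean abundance per protein, rescaled to parts per million.

    datasets: list of dicts mapping protein ID -> abundance (ppm).
    weights:  one non-negative weight per dataset (hypothetical: for
              example, derived from each dataset's quality score).
    Assumption: a protein missing from a dataset is skipped for that
    dataset rather than counted as zero.
    """
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per dataset")

    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for data, weight in zip(datasets, weights):
        for protein, abundance in data.items():
            weighted_sum[protein] += weight * abundance
            weight_total[protein] += weight

    merged = {
        protein: weighted_sum[protein] / weight_total[protein]
        for protein in weighted_sum
        if weight_total[protein] > 0
    }

    # Rescale so that the integrated dataset again sums to one million.
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("no positive abundances to integrate")
    return {protein: value * 1e6 / total for protein, value in merged.items()}
```

In this sketch, a dataset with a higher weight pulls the integrated value towards its own measurements, which is the general behaviour one would expect from quality-guided averaging.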
2 Data updates
The update process of PaxDb is partly manual and partly automatic, and it takes place roughly once or twice a year. The growth in the number of datasets and organisms so far is tabulated in Fig. 2A. Care is taken not to exclude nonstandard datasets such as those based on biophysical or single-cell measurements; however, datasets are generally included only if they represent a mostly unperturbed, “normal,” physiological state of cells. Tissues and cell-lines are annotated with controlled vocabularies; in the case of tissues we use the Uberon ontology 20, which natively allows cross-species comparisons of homologous tissues/organs.
For the current update to version 4.0, we started with a manual search for publications describing possible datasets of interest. This included keyword searches and forward citation analysis of landmark papers, but we also systematically scanned all publications in three pertinent journals (MCP, J Prot Res, Proteomics), as well as the entire publication output of six major labs operating in high-throughput proteomics. The initial results were filtered down to 37 candidate publications, based on the following criteria: (i) studies should be published after August 2012 and not yet be contained in PaxDb, (ii) coverage should be at least 20% of the predicted proteome, or at least 20 000 peptide-spectrum matches in the case of MS data, (iii) abundance values must cover at least three orders of magnitude, (iv) there should be no biased subfractionation (e.g. restricted to organelles, compartments, or specific modifications), (v) datasets must be parsable for absolute protein quantification data; this excludes purely relative quantifications, and (vi) datasets must address mainly unperturbed samples in a normal, physiological state; this excludes mutated, stressed, or diseased samples.
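The six inclusion criteria lend themselves to a simple checklist. The following Python sketch encodes them as a predicate. The `Candidate` record and its field names are hypothetical; in practice several of the criteria (for example, whether a sample is unperturbed) were judged manually, so they appear here only as boolean flags.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one candidate dataset."""
    proteins_quantified: int
    predicted_proteome_size: int
    psm_count: int                 # peptide-spectrum matches; 0 for non-MS data
    min_abundance: float           # smallest non-zero abundance value
    max_abundance: float           # largest abundance value
    published_after_cutoff: bool   # criterion (i): published after August 2012
    already_in_paxdb: bool         # criterion (i): not yet contained in PaxDb
    biased_subfractionation: bool  # criterion (iv)
    absolute_quantification: bool  # criterion (v)
    unperturbed: bool              # criterion (vi)


def passes_criteria(c: Candidate) -> bool:
    # (ii) at least 20% proteome coverage, or at least 20 000 PSMs for MS data
    coverage_ok = (
        c.proteins_quantified >= 0.20 * c.predicted_proteome_size
        or c.psm_count >= 20_000
    )
    # (iii) abundance values span at least three orders of magnitude
    dynamic_range_ok = (
        c.min_abundance > 0 and c.max_abundance >= 1000 * c.min_abundance
    )
    return (
        c.published_after_cutoff
        and not c.already_in_paxdb
        and coverage_ok
        and dynamic_range_ok
        and not c.biased_subfractionation
        and c.absolute_quantification
        and c.unperturbed
    )
```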
The same set of criteria was then applied to filter recent datasets stored in three large proteomics data repositories: PRIDE 9, GPMDB 12, and PeptideAtlas 10. In the case of PRIDE, datasets were accessed via the PRIDE BioMart if “complete” submissions were available, and via the pepXML format in the case of selected “partial” submissions. In the case of GPMDB, we used a recently added feature of the website that allows aggregated access to peptide information for each taxon/organism. From PeptideAtlas, data were imported via the so-called “builds.” GPMDB and PeptideAtlas are convenient data sources, but since the peptides cannot easily be traced back to the original experiments/publications, these data collections have to be taken as they are. Thus they may include some nonphysiological, subfractionated, or mutated samples, although in the case of PeptideAtlas, builds whose annotation already hinted at this were blocked entirely.
Our data import procedure encountered many proteomics experiments multiple times (see Fig. 2B). Overall, redundancy was avoided by importing a given experiment via the most convenient access route (the two recent, large mapping efforts of the human proteome, for example, were imported via PRIDE).
Finally, a new development with release 4.0 of PaxDb is the inclusion of protein abundance information from cell-line samples. Cell-lines cannot be considered to be fully physiological and unperturbed, so their inclusion in PaxDb is somewhat of an exception to our normal import rules. However, cell-lines represent a significant fraction of the available data and their proteome expression status is of great interest for everyday lab work. Hence, datasets for 35 different cell-lines (including human induced pluripotent cells, as well as human and mouse embryonic stem cells) were included in this update; their biological origin is annotated using the “Cellosaurus” controlled vocabulary. Cell-lines are available for browsing and searching, but typically are not selected to contribute to the “best-estimate” integrated datasets in PaxDb.
3 Rescaling and quality scoring
As introduced and described previously 19, datasets in PaxDb are rescaled to a common abundance metric (“parts per million”), and also ranked via a universally applicable, albeit somewhat indirect quality score. For the rescaling, the datasets are first parsed or processed such that the data reflect proportional abundances of whole protein molecules (i.e. proportionality to counts of complete, individual protein molecules, not to molecular weights, protein volumes, or digested peptides). In the case of spectral counting data, this is done via an in-house pipeline that takes into account protein sizes and estimated relative detectabilities of peptides 19, 21. For other datasets, the procedures depend on the type of data and the type of quantitative information that is provided (datasets that cannot be converted to proportional abundances of entire protein molecules are discarded). Then, the proportional abundances are rescaled linearly to add up to one million; this means the abundance of each protein of interest is finally expressed in “parts per million,” relative to all other proteins in a sample. While this metric cannot be directly converted to “molecules per cell,” it has the advantage of being comparable/meaningful across cells of different volumes, or across tissues of different cellular and extracellular compositions.
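The final, linear rescaling step is simple enough to state in a few lines of Python. This is a minimal sketch that assumes the input values are already proportional to counts of whole protein molecules; the function name `rescale_to_ppm` is hypothetical, and the upstream conversion of spectral counts is not shown.

```python
def rescale_to_ppm(abundances):
    """Linearly rescale proportional abundances so that they sum to 1e6.

    abundances: dict mapping protein ID -> value proportional to the
    number of whole protein molecules (assumed to be prepared upstream).
    """
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return {protein: value * 1_000_000 / total
            for protein, value in abundances.items()}
```

Because every value is divided by the same total, ratios between proteins within a sample are preserved; only the unit changes.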
For the quality scoring, we identified a test that can be applied to any organism and to any abundance dataset, albeit at the cost of providing an indirect quality estimate only 19. This test relies on the assumption that proteins which are interacting physically in a protein complex, or functionally in a pathway or metabolic process, should have a tendency to be expressed at similar abundance levels. This is merely a tendency, of course, and numerous exceptions to this assumption exist. However, globally, the abundance ratios of functionally interacting proteins are clearly closer to one than those of randomly chosen protein pairs 19, and this signal can be used to provide a relative ranking between datasets, given a constant and externally provided network of functional links between proteins. To compute this score, we first import protein–protein interaction information from the STRING database 22, separately for each organism in PaxDb. For a given protein abundance dataset, we then compute the absolute log abundance ratios of all pairs of proteins annotated to be functionally linked. The median of these absolute log abundance ratios represents an indirect quality metric: the closer it is to zero, the better (i.e. the more there is consistency between abundance values and functional annotations such as protein complexes or pathways). We then compute a background expectation for this metric, by permuting the abundance values in a given dataset randomly, and recomputing the median log abundance ratios. The permutation is repeated several times, yielding a distribution of medians. The actually observed median is then expressed as a Z-score distance to the random distribution of medians—this distance is termed the “interaction consistency score.”
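The scoring procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the production implementation: the function names, the number of permutations, the base-10 logarithm, and the sign convention (higher score means better consistency) are choices made here. The list `links` stands in for the functional links imported from STRING.

```python
import math
import random
import statistics


def median_abs_log_ratio(abundance, links):
    """Median |log10(a/b)| over linked protein pairs with positive abundance."""
    ratios = [
        abs(math.log10(abundance[a] / abundance[b]))
        for a, b in links
        if abundance.get(a, 0) > 0 and abundance.get(b, 0) > 0
    ]
    return statistics.median(ratios)


def interaction_consistency_score(abundance, links, n_permutations=200, seed=0):
    """Z-score distance of the observed median to a permutation background."""
    rng = random.Random(seed)
    # Keep only proteins with a positive abundance value.
    positive = {p: a for p, a in abundance.items() if a > 0}
    observed = median_abs_log_ratio(positive, links)

    proteins = list(positive)
    values = [positive[p] for p in proteins]
    null_medians = []
    for _ in range(n_permutations):
        shuffled = values[:]
        rng.shuffle(shuffled)  # randomly reassign abundances to proteins
        permuted = dict(zip(proteins, shuffled))
        null_medians.append(median_abs_log_ratio(permuted, links))

    null_mean = statistics.mean(null_medians)
    null_sd = statistics.stdev(null_medians)
    if null_sd == 0:
        raise ValueError("permutation background has zero variance")
    # Assumed sign convention: positive when the observed median is
    # smaller (better) than the random expectation.
    return (null_mean - observed) / null_sd
```

Keeping the network of links fixed while permuting only the abundance values means that the score reflects the dataset rather than the density or topology of the interaction network.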
4 False discovery rates
Since PaxDb does not reprocess raw MS data, and since it does not reexecute peptide-spectrum matching searches, the search parameter settings and false discovery rates (FDR) of the original submitters are always retained. However, the extent to which false discoveries represent a problem is controversial, especially as they propagate through larger integrated data collections 23. To reestimate FDRs in an independent way, Ezkurdia et al. proposed to focus on a set of proteins that should not be expected to be observed in the vast majority of human tissues, namely human olfactory receptors 23. Because several hundred of these receptors are encoded in the human genome, they represent a broad and universal test set of likely “false-positive” protein identifications (except, of course, for samples originating in nasal tissues and perhaps some other, inherently “leaky” tissues 24). We have implemented this test on all human abundance datasets in PaxDb 4.0, and indeed observe variable levels of inferred FDR across datasets (Supporting Information Table 1). This led us to block a small number of datasets from further inclusion in PaxDb; the remaining datasets usually exhibit estimated FDRs of 5% or better, many even 1% or better. Despite reasonably low FDRs throughout, false-positive identifications remain a pressing problem, since they disproportionately affect the abundance estimates of lowly expressed proteins.
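One way to turn the olfactory receptor test into a number is sketched below in Python. The extrapolation used here is an assumption made for illustration and is not necessarily the estimator of Ezkurdia et al. 23 or the one applied in PaxDb: it supposes that false identifications hit olfactory receptors at the same per-protein rate as the rest of the proteome. The function name and parameters are hypothetical.

```python
def estimate_fdr_from_unexpected(identified, unexpected, proteome_size):
    """Rough protein-level FDR estimate from 'should-not-be-seen' proteins.

    identified:    set of protein IDs reported in the dataset
    unexpected:    set of protein IDs assumed to be absent from the sample
                   (e.g. olfactory receptors in non-nasal tissue)
    proteome_size: total number of proteins in the reference proteome

    Assumption: every hit in `unexpected` is a false positive, and false
    positives fall on any protein of the proteome with equal probability.
    """
    if not identified or not unexpected:
        raise ValueError("both protein sets must be non-empty")
    hits = len(identified & unexpected)
    per_protein_false_rate = hits / len(unexpected)
    expected_false = per_protein_false_rate * proteome_size
    # An FDR cannot exceed 1, so cap the extrapolated value.
    return min(1.0, expected_false / len(identified))
```

Under these assumptions, even a handful of olfactory receptor hits translates into a substantial number of expected false positives, because the receptors make up only a small fraction of the proteome.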
5 Stoichiometries and abundances on the tree of life
One of the unique features of PaxDb is its seamless comparability of data across organisms. This allows insights into protein abundance evolution, such as abundance conservation in the eukaryotic core proteome 21, 25, 26, cost-diversity tradeoffs during evolution 27, or the fate of paralogs during evolutionary network rewiring 28. Orthology relationships in PaxDb are precomputed through the eggNOG mechanism 29, and can be browsed at various levels of phylogenetic depth (e.g. “mammals,” “animals,” or “eukaryotes”). To illustrate the usefulness of these comparisons, Fig. 3 shows abundance comparisons at various levels of organismal relatedness. At one end of the spectrum, closely related organisms such as mouse and human show a relatively high level of abundance conservation, with more than 3400 orthologous proteins observed in both, at an overall abundance correlation of 0.7 (Fig. 3A). At the other end of the spectrum, distant comparisons across the root of the tree of life (e.g. Bacteria versus Eukaryotes) reveal a universal core of the proteome, mostly involved in information processing, that still shows an abundance correlation approaching 0.5 overall.
When combined with information on protein complexes or pathways, the view across multiple organisms might reveal general pathway stoichiometries and scaling laws in proteome composition. For individual proteins and pathways, the measurement noise is likely still too large to allow many meaningful conclusions 30, but integration across datasets and organisms may make it possible to constrain global stoichiometries and min/max levels of regulation. This is explored in Fig. 3B, which shows the relative abundance ratios between functionally connected protein complexes (or processes), such as between the two subunits of the ribosome, between the ribosome and the proteasome, or between translation and transcription. Since multiple organisms, datasets, and proteins contribute data points to this plot, the ratios are statistically meaningful and reveal the expected differences of scale. Strikingly, the final abundance ratio of the two subunits of the ribosome comes down to 1:1.03 (Fig. 3B), which is very close to the theoretical expectation of 1:1 and illustrates the quantitative power of data aggregation. Another notable observation concerns the stoichiometry between the core machineries of translation and transcription. Overall, this ratio is about 10:1, but it is significantly higher in eukaryotes than in prokaryotes (p < 2e-05; Fig. 3B), perhaps owing to larger cell sizes and slower growth.
These and similar studies represent some of the use cases that PaxDb was designed for, but many other usage scenarios will undoubtedly surface with each new data release.
The author has declared no conflict of interest.