Tail-Robust Quantile Normalization

High-throughput biological data, such as mass spectrometry-based proteomics data, suffer from systematic non-biological variance introduced by systematic errors such as batch effects. This hinders the estimation of ‘real’ biological signals and thus decreases the power of statistical tests and biases the identification of differential expression between sample classes. To remove such unintended variation while retaining the biological signal of interest, analysis workflows for mass spectrometry-based quantification typically comprise normalization steps prior to the statistical analysis of the data. Several normalization methods, such as quantile normalization, were originally developed for microarray data. However, unlike microarray data, proteomics data may contain features, in the form of protein intensities, that are consistently highly abundant across experimental conditions and hence are found in the tails of the protein intensity distribution. If such proteins are present, statistical inference on the intensity profiles of the normalized features is impeded: the variance of the data is estimated with bias, which increases the number of false positive findings. We therefore developed a novel, freely available approach: ‘tail-robust quantile normalization’. It extends traditional quantile normalization to preserve the biological signals of features in the tails of the distribution across experimental conditions and to account for sample-dependent missing values.

High-throughput omics data such as mass spectrometry-based proteomics data are subject to systematic non-biological errors such as batch effects, which bias expression level analyses, e.g. in the context of establishing diagnostic and prognostic signatures [1]. Normalization helps to reduce such variation, so that the biological factor of interest remains the main source of variability in the data. It is often the first data processing step and is critical in quantitative experiments [2], where the choice of the normalization method can influence the downstream analysis results [3] and lead to normalization bias. A multitude of normalization techniques can be applied during the pre-processing of omics data, and of proteomics data in particular. One of these is quantile normalization (QN) [4,5], which was initially applied to microarrays and later adopted for proteomics data. It is based on the assumption that, at the global level of the whole proteome, the distribution of the protein abundances is similar for all samples. Thus, all quantiles of the measured intensities for each sample are set to the average quantiles over all samples.
QN includes the following steps (assuming it is applied to a matrix in which rows correspond to proteins, columns to samples, and cell values to intensities): (1) sorting the intensities of each sample (column) separately; (2) calculating the mean across each quantile (row of the sorted matrix) and assigning this mean to every element in the row; (3) rearranging the values in each column back to their original order. Yet, as already noted by Bolstad et al., QN proves to be problematic if individual gene expression features occur mainly in the tails of the intensity distributions [5]. Since proteomics data typically comprise features that lie in the tails of the distributions, this constitutes a serious and practically relevant limitation. Such features show little or no variation in rank across samples. Depending on the degree of rank invariance, they are termed 'nearly rank-invariant' (NRI) or 'rank-invariant' (RI). A high proportion of rank invariance impedes statistical inference because the biological variance of the normalized features is biased; RI features even acquire the same value in all experimental samples. In addition, it has been found that QN introduces extra patterns into the data at high intensities [6].
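The three QN steps can be illustrated with a minimal Python sketch. It assumes a complete matrix without missing values and breaks ties by sort order; the function name and the toy intensities are hypothetical and chosen only for illustration.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a complete matrix (rows = proteins, columns = samples)."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    # Step 1: sort each sample (column) separately, keeping track of which
    # protein (row index) sits at each position of the sorted column.
    orders = [sorted(range(n_rows), key=lambda r, c=c: matrix[r][c])
              for c in range(n_cols)]

    # Step 2: mean across each quantile, i.e. across the k-th smallest
    # value of every sample.
    quantile_means = [
        sum(matrix[orders[c][k]][c] for c in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]

    # Step 3: write the quantile means back in the original row order.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(orders[c]):
            result[r][c] = quantile_means[k]
    return result


# Hypothetical log-intensities: protein 0 is the most abundant protein in
# every sample, i.e. it is rank invariant.
data = [
    [30.0, 31.0, 29.0],
    [22.0, 20.0, 21.0],
    [20.0, 23.0, 19.0],
    [18.0, 19.0, 22.0],
]
normalized = quantile_normalize(data)
print(normalized[0])  # protein 0 receives the same value in every sample
```

In this toy example the rank-invariant protein 0 always receives the mean of the top quantile, so its variation between samples is removed entirely. This is the flattening of RI features described above.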
Moreover, the abundance of missing values due to technical or biological reasons is a prominent characteristic of proteomics data and can bias any normalization procedure. For example, missing values can violate the distributional assumptions on which QN is founded. If their occurrence is correlated with the feature intensities, they may also bias the estimation of the offsets in the tail-robust QN (TRQN) presented here. When the low intensities of a protein are missing, its high values are over-represented. This leads to a biased estimate of the offset and may, in turn, increase the differences between the sample distributions.
In the present work, we demonstrate the prevalence of features in the tails of the intensity distribution in 173 label-free experimental protein datasets that were submitted to the PRIDE data archive between 01/2013 and 12/2018 and had been processed with the software MaxQuant [7,8]. Datasets were included if they contain LFQ intensities and, to ensure a representative number of samples over which the degree of RI is calculated, comprise at least 10 samples. Of these datasets, only proteins with at least 10 data points were included in the analysis. This corresponds to a confidence interval of less than 0.2775 in terms of the percentage of rank invariance.
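The degree of rank invariance of a protein can be sketched as follows. The definition used here is an assumption made for illustration, not necessarily the exact one used in the analysis: the share of samples in which a protein occupies its most frequent rank. Ranks are counted from the most intense protein downwards, and missing values (encoded as `None`) are skipped. The function name is hypothetical.

```python
from collections import Counter


def rank_invariance(matrix):
    """Per protein: fraction of samples in which it holds its most frequent rank.

    Assumed definition for illustration. Rows are proteins, columns are
    samples, missing values are None. Returns None for a protein without
    any observed value.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    ranks = [[] for _ in range(n_rows)]
    for c in range(n_cols):
        # Rank the observed proteins of this sample, 1 = most intense.
        observed = [r for r in range(n_rows) if matrix[r][c] is not None]
        observed.sort(key=lambda r: matrix[r][c], reverse=True)
        for rank, r in enumerate(observed, start=1):
            ranks[r].append(rank)
    return [
        Counter(rk).most_common(1)[0][1] / len(rk) if rk else None
        for rk in ranks
    ]


# Hypothetical log-intensities: protein 0 is always the most abundant.
data = [
    [30.0, 31.0, 29.0],
    [22.0, 20.0, 21.0],
    [20.0, 23.0, 19.0],
    [18.0, 19.0, 22.0],
]
ri = rank_invariance(data)
print(ri)  # protein 0 has the highest degree of rank invariance
```

Under this definition, a value of 1 marks an RI feature and values close to 1 mark NRI features. A cut-off such as 50% would then select the proteins that hold the same rank in at least half of the samples.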
To counteract the normalization bias caused by QN, we present a novel modification of classical QN: TRQN. It is implemented in the R package 'MBQN', which is freely available at www.bioconductor.org and github.com/arianeschad/MBQN under the GPL-3 license. The package also offers a separate function to examine the presence of RI features in omics data.
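The idea of a tail-robust modification can be illustrated with a short Python sketch. It rests on an assumption not spelled out in this section: that the offset of each protein is its mean intensity across samples, which is subtracted before QN and added back afterwards. The sketch is not the MBQN implementation; the function names and toy data are hypothetical, and a complete matrix without missing values is assumed.

```python
def quantile_normalize(matrix):
    """Classical QN on a complete matrix (rows = proteins, columns = samples)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Row indices of each column, ordered by increasing intensity.
    orders = [sorted(range(n_rows), key=lambda r, c=c: matrix[r][c])
              for c in range(n_cols)]
    # Mean of the k-th smallest value over all samples.
    means = [sum(matrix[orders[c][k]][c] for c in range(n_cols)) / n_cols
             for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(orders[c]):
            result[r][c] = means[k]
    return result


def balanced_quantile_normalize(matrix):
    """QN after removing a per-protein offset (assumed: the row mean)."""
    # Assumed offset: mean intensity of each protein across samples.
    offsets = [sum(row) / len(row) for row in matrix]
    centered = [[v - o for v in row] for row, o in zip(matrix, offsets)]
    normalized = quantile_normalize(centered)
    # Restore the offsets so that each protein keeps its overall level.
    return [[v + o for v in row] for row, o in zip(normalized, offsets)]


# Hypothetical log-intensities: protein 0 is rank invariant at the top.
data = [
    [30.0, 31.0, 29.0],
    [22.0, 20.0, 21.0],
    [20.0, 23.0, 19.0],
    [18.0, 19.0, 22.0],
]
classical = quantile_normalize(data)
balanced = balanced_quantile_normalize(data)
print(classical[0])  # identical values: the variation between samples is lost
print(balanced[0])   # values still differ between samples
```

Because the centered intensities of protein 0 no longer occupy the same rank in every sample, QN does not collapse them to a single value. The sketch also shows why the offset matters: if it were estimated from a biased subset of values, for instance because low intensities are missing, the restored intensities would be shifted accordingly.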
Finally, 97 (56%) of the 173 investigated PRIDE datasets contained at least one protein with a rank invariance of at least 50%; these datasets would likely benefit from preprocessing with TRQN. A similar picture is likely to emerge from metabolomics data, such as the datasets of the MetaboLights database [15]. While our analysis was conducted at the protein level, TRQN can be applied at the peptide level in the same fashion. Moreover, the advantages of TRQN are not restricted to MS-based proteomics data: the algorithm may also be useful in other fields of application in which array-structured data are distorted in scale and location.