Model selection techniques for sparse weight‐based principal component analysis

Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so‐called multiblock data (e.g., multiomics studies). A popular analysis framework is sparse principal component analysis (PCA) of the concatenated data. The sparseness in the component weights of these models is usually induced by penalties. A crucial factor in the use of such penalized methods is a proper tuning of the regularization parameters used to give more or less weight to the penalties. In this paper, we examine several model selection procedures to tune these regularization parameters for sparse PCA. The model selection procedures include cross‐validation, Bayesian information criterion (BIC), index of sparseness, and the convex hull procedure. Furthermore, to account for the multiblock structure, we present a sparse PCA algorithm with a group least absolute shrinkage and selection operator (LASSO) penalty added to it, to either select or cancel out blocks of data in an automated way. Also, the tuning of the group LASSO parameter is studied for the proposed model selection procedures. We conclude that when the component weights are to be interpreted, cross‐validation with the one standard error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, BIC, and index of sparseness are all suitable.


| INTRODUCTION
Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data. 1 Examples include multiomics studies in which the same set of samples is profiled using different molecular assays such as mRNA expression, DNA methylation, DNA copy number, and somatic mutation data; see Wang et al. 2 for a multiomics study of breast cancer and Reinke et al. 3 for a joint analysis of six different data blocks collected from 22 individuals from an asthma cohort. Another example is multimodal studies that use different magnetic resonance imaging (MRI) modalities (e.g., anatomical, diffusion, and resting state functional magnetic resonance), for example, to study the same group of Alzheimer patients. 4 Each of the data blocks gives a partial view of the complex system under study. A full understanding of how the system works requires to understand both the drivers of the system that operate independently and those that operate only by concerted action. At the level of the data, this means that insight is needed in the relations between variables both within and between the data blocks: components of the system that work independently will show up as variation that is determined by the variables of a single block only whereas those components that work by concerted action will show up as variation that is determined jointly by variables linked throughout the blocks. A particular challenge in studying the jointly and individually determined variation is the need to automatically select variables that are of interest; not only to ease interpretation but also because data are often collected in an untargeted way and one of the primary aims of the data analysis is to hint at variables that may be key players in the process under study. 5 This is of particular relevance when using high-throughput approaches resulting in thousands of measured variables.
Following the strong rise of multiblock data in many disciplines, several integrative methods for exploratory data analysis have been put forward including clustering and dimension reduction techniques and combinations thereof; see, for example, the review by Ment et al. 6 Among the dimension reduction techniques, a number of methods that model joint and individual variation, also called common and distinctive latent variables or components, have been proposed. 7,8 Some of these methods perform variable selection 9-11 by adding a least absolute shrinkage and selection operator (LASSO) penalty to the objective function. 12 This penalty has the property to shrink the estimates to zero, some exactly with the implication that that variable does not contribute (e.g., a zero regression weight means that the predictor does not contribute to the prediction, and a zero component weight does mean that the variable does not contribute to the component). The use of such penalties that introduce zeros in the estimates is the current state of the art in variable selection. The main reasons for the popularity of penalties over subset selection methods such as best subset selection are better stability of the penalized regression model 12 and their computational efficiency (e.g., compared with calculating the solutions for all possible subsets of variables 13 ). A popular framework for the analysis of multiblock data is sparse principal component analysis (PCA) (in the multiblock case also known as sparse simultaneous component analysis [SCA] 14 ); this framework will be the focus of the current paper.
A crucial factor in the use of penalized methods is the tuning of the regularization parameters used to give more or less weight to the penalties. In practice, the amount of sparseness in the data and the number of common and distinct components are not known beforehand. Hence, to make good use of penalized PCA approaches, model selection tools are needed to determine the strength of the LASSO and group LASSO penalties. In the context of sparse PCA, a few methods have been put forward to address this issue: these include popular solutions such as crossvalidation (CV) 15 and the Bayesian information criterion (BIC) 16,17 but also less known alternatives such as the index of sparseness (IS) 18,19 and the convex hull (CHull) procedure. 20,21 A comparison of these methods in the context of sparse PCA misses.
In this paper, we will discuss and evaluate several existing model selection procedures to select proper values of the tuning parameters used in sparse PCA. Furthermore, we will extend sparse PCA with a group LASSO penalty 22 to model the common and distinct variation, by selecting at the level of the data blocks. The main focus of the paper will be on comparing several model selection procedures with respect to finding those values of the tuning parameters that yield the correct structure of the data, that is, selecting the right set of variables both in the single block setting and in the multiblock setting with common and distinct variations. The following model selection procedures, and adaptations thereof, will be discussed: CV with the eigenvector method, 15 BIC, 16,17 CHull, 21 and the IS. 18,19 We will examine these model selection procedures because they are readily available from the existing literature and can be used to estimate metaparameters for the weight-based PCA model with little to no modification of the original propositions. For sparse PCA, we will examine these model selection procedures in a simulation study with a single block of data (the most common case where all variables are assumed to represent one unit of interest). For sparse SCA, we will examine the procedures by making use of multiblock data (several sets of variables are available for the same cases with variables within one set representing a unit of interest). In the multiblock case, we will assess whether the model selection procedures produce a final model that correctly identifies the joint and individual structure of the components. In order to inform the analysis of the block structure of the variables, we implemented the group LASSO penalty in a blockwise fashion, to either select or cancel out blocks of data in an automated way.
The remainder of the paper is structured as follows: first, we will introduce sparse PCA with the LASSO penalty and its extension to the multiblock setting including a group LASSO penalty. Second, we will discuss several existing or adapted model selection procedures for tuning the LASSO and group LASSO penalty in sparse PCA. Third, we will examine these model selection procedures in the case of single and multiblock data in a simulation study. Lastly, we conclude with a discussion.

| SPARSE PCA FOR SINGLE AND MULTI-BLOCK DATA
In this section, we will introduce the notation and give a brief introduction to sparse PCA. Then we will discuss the extension to the multiblock setting and introduce the group LASSO penalty to account for common and distinct variation.
We will make use of the standardized notation proposed by Kiers 23 : bold lowercase and uppercases will denote vectors and matrices, respectively; the superscript "T" denotes the transpose of a vector or matrix, and a running index will range from 1 to its uppercase letter (e.g., there is a total of I cases where i runs from i = 1, … , I).
Given is a data matrix X that contains the scores for I observations on J variables; we follow the convention to present the J variable scores of observation i in row i and thus X has size I × J. PCA decomposes the data into Q components, as follows: where W is a J × Q component weight matrix, P is a J × Q loading matrix, and E is a I × J residual matrix. Often, the model is presented using the notation T for the component score matrix that results from the linear combinations shown explicitly in XW. In this type of representation of the PCA model, interpretation is usually based on the loadings. Yet an attractive property of the PCA formulation in (1) with kWk 1 = P j,r jw jq j the LASSO penalty (tuned by λ L ≥ 0) and kWk 2 2 = P j,r w 2 jq the ridge penalty, also known as Tikhonov regularization (tuned by λ R ≥ 0). The objective function in Equation (2) has been popularized by Zou et al. 24 As pointed out there, the inclusion of a ridge penalty is needed in the high dimensional setting, and this has J > I; the combination of LASSO and ridge is known as the elastic net.
The decomposition in (1) can be extended to the case of multiblock data by taking X = [X 1 , … , X K ]; this is concatenating the K data blocks composed of different sets of variables for the same observation units. The decomposition of X has the same block structured decomposition with W = ½W T 1 , …, W T K T and P = ½P T 1 ,…, P T K T . This multiblock formulation of PCA is known as SCA. 25 Also in the multiblock case, W can be penalized to obtain sparse weights, and we will call this variant sparse SCA. When analyzing multiblock data with sparse SCA, we can search for blockwise structures in the component weights that tell us whether a component is uniquely determined by variables from one single data block (distinctive component), or whether it is a component that is determined by variables from multiple data blocks (common component). In other words, a distinctive component is a linear combination of variables of a particular data block only, whereas a common component is a linear combination of variables of multiple data blocks. An example of common and distinctive components in the situation with two data blocks is given below. The first two components are distinctive components, and the third component is a common component: In total, there are ð2 K − 1Þ + Q− 1 Q possible combinations of common and distinctive components. There are 2 K − 1 states for each component (minus one to exclude components with only zero weights), and each of these specific states can be assigned to each of the components where the ordering does not matter. Therefore, counting all possible common and distinct configurations for Q components takes on the form of unordered sampling with replacement.
In the work of de Schipper and Van Deun, 11 the challenge of finding the right sparse block structure for the component weight matrix was handled by an exhaustive approach, examining all possible common and distinctive structures. If the number of components and blocks is not too large, calculating all possible models is feasible. However, if the number of blocks and components is large, it is not and can be expected to yield highly variable results (as is the case with the best subset selection method for variable selection). Another option to perform selection at the level of the blocks is to add a group LASSO penalty to the PCA objective; see Jenatton et al., 26 Deun et al., 14 and Erichson et al. 27 for similar proposals. Let w ðkÞ q denote the component weights of the variables of block k in component q. To have selection at both the level of the blocks and within blocks, the following penalized least squares criterion can be used: Hence, the group LASSO is tuned by λ G ≥ 0 with sufficiently large values resulting in components that are based on a linear combination of variables of just one or a few data blocks. To find estimates that minimize Equation (3) under the constraint of orthonormal component loading vectors, we rely on an alternating procedure that yields a nonincreasing sequence of loss values thus converging-in practice-to a fixed point. The details of this numerical routine are discussed in Appendix (A0.1). Importantly, the numerical procedure only optimizes with respect to the component weights and loadings and thus needs fixed values for the number of components and the tuning parameters of the penalties. How to obtain suitable values for λ L , λ R , and λ G is the main topic of this paper.

| MODEL SELECTION PROCEDURES FOR SPARSE PCA
In this paper, we will discuss several model selection techniques for the selection of the penalty tuning parameters. These methods are CV with the eigenvector method, 15 the BIC, 16,17 CHull 21 , and the IS. 18,19 These model selection techniques have been previously proposed in the context of PCA, some also in the context of sparse PCA as defined here; this is with penalties on the weights. The application of these methods to sparse SCA with a group LASSO penalty is novel. A thorough comparison of these methods-for both sparse PCA and SCA-is lacking.

| CV with the eigenvector method
In the context of PCA, CV can be applied in several ways; a discussion and comparison with respect to selecting the number of components for the X = TP T model can be found in Bro et al. 15 In that comparison, the best performing method was CV with the eigenvector method; de Schipper and Van Deun 11 discussed the method in the context of sparse SCA to determine the value of the LASSO and ridge tuning parameters. Let (− j) denote that (the coefficients of) variable j are removed. Following Bro et al. and de Schipper and Van Deun, given a number of components Q, to determine the value of a tuning parameter λ, the method then works as follows 1 : 1. Divide the sample into K folds each of size I k . 2. Leave out the kth fold and calculateŴ andP on the remainder given a set of tuning parameters λ.
3. For the left-out samples in the kth fold i = 1, … , I k , for variables j = 1, … , J, (a.) Estimate the score as t 1 Note that here K is used to denote the number of folds used in the CV procedure and not the number of data blocks (c.) Find the prediction error of the element x ij by taking e ij = x ij −x ij .

Calculate the mean squared error of the kth fold, d
MSEðλÞ k = 1 5. Repeat 2 and 3 for each fold and calculate the overall mean squared error, The standard error of Equation (4) is obtained by taking the sample standard deviation of d (see, e.g., Gordon et al. 28 ). Typically, the data are split into K = 10 folds of (approximately) equal size, which we will also do in the current paper. The attractive features of CV with the eigenvector method are that it is relatively fast to perform and that the estimated datax ij are obtained independent of the data used to construct the model. For more detailed information, we refer the reader to Bro et al. 15 The model with the lowest MSE is chosen as the best model. CV tends to select models that are too complex; therefore, the one standard error rule was developed. 29 The one standard error rule selects the set of tuning parameters that lead to the least complex model, still within one standard error of the best model. In this paper, we will examine the models chosen according to the best (this has the lowest MSE) and the one standard error rule.

| The BIC
Let RV be the residual variance resulting from the PCA decomposition with Q components, Likewise, let g RV denotes the residual variance for a given a model with a specific λ and Q. Following Guo et al. 16 and Croux et al., 17 the BIC for a set of tuning parameters λ and given the number of components Q is then given by with df(λ) the number of nonzero weights inŴ. The optimal set of λ values is then defined as the set of λ's that results in the model with the lowest BIC.

| CHull: A convex hull-based model selection method
CHull, 21 also known as L-curve (see, e.g., Hansen and O'Leary 30 ), is a generic model selection procedure that aims at striking an optimal balance between the goodness-of-fit/misfit and model complexity. As stated by the authors: "The CHull procedure consists of (1) determining the convex hull of the fit-measure-by-complexity-measure plot of the models under consideration and (2) identifying the model on the boundary of the convex hull for which it is true that increasing the complexity (i.e., adding more parameters) has only a small effect on the fit measure, whereas lowering complexity (e.g., dropping parameters from the model) changes the goodness of fit (or, respectively, the misfit) substantially." 21, p. 2 In this application of CHull, we will use the variance accounted for (VAF) as a goodness-of-fit measure: This is the goodness-of-fit measure that the authors originally used in their application of CHull as a method to determine the number of components in PCA. In our example, we will also make use of the d MSEðλÞ as a goodness-offit measure; that is, the MSE values we obtain from the CV procedure as described before; see Equation (4). Our motivation is that VAF λ is subject to overfitting and gives a downward biased estimate of the error. For the complexity measure, we use the number of nonzero weights inŴ. The models are selected using the multichull package. 31

| Index of sparseness
According to Gajjar et al. 18 and Trendafilov et al., 19 the IS given by where df(λ) is defined as previously; the VAF pca is given by Equation (7) withŴ andP resulting from the PCA decomposition with Q components and all λ = 0; and VAF λ is also given by Equation (7) but withŴ andP resulting from PCA with Q components and a set of regularization parameters λ ≥ 0. The IS increases with goodness of fit and the sparseness of the solution. The (combination of) value(s) of the tuning parameter(s) λ that result(s) in the model with the largest IS is picked as the optimal value(s).

| SIMULATION STUDIES
The model selection techniques are assessed under different conditions by means of a simulation study. First, we will discuss the case of a single block of data with an unstructured sparsity pattern, and then we will discuss the case of multiblock data with structured sparsity resulting in common and distinct variation.

| Single-block data
In the simulation study, we kept the number of variables fixed to J = 50 and the number of components to Q = 3. The study included the following design factors: • The number of observation units I: 25 , 50, and 100.
• The level of sparseness (percentage of the-in total JQ = 150 weights-that are equal to zero): 30% and 80%.
The design is fully crossed, resulting in 3 × 2 × 2 = 12 design cells. For each design cell, 50 data sets were simulated. The generation of the data is detailed in Appendix (A1.0.2). The resulting data were analyzed using an implementation of Algorithm (1) (see the Appendix) in the R software for statistical computing. 32 Algorithm (1) is freely available in R 32 and downloadable from github.com/trbKnl. Each data set was analyzed using a 50 × 10 grid of LASSO and ridge penalty tuning parameters. For the ridge, a sequence of 10 values equally spaced on the interval ln0 to ln500 was used and for the LASSO 50 equally spaced values on the same interval. Note that the values were back-transformed to the range 0-500. For each obtained (sparse) PCA model, the model selection statistics were calculated, and a best model was obtained for each of the six model selection methods. The chosen models according to the model selection criterion were then evaluated by looking at the following performance measures: • The similarity between the true model matrix W and the estimatedŴ. We use Tucker congruence between the vectorized version of W andŴ to measure the similarity. The Tucker congruence (also known as cosine similarity) is defined as the cosine of the angle between two vectors. If the two vectors share no similarity, they are orthogonal, and the Tucker congruence will be 0. If the vectors are linearly dependent, that is, perfect similarity, the angle between these two vectors is 0 and the Tucker congruence will be 1.
• The percentage of correctly identified zero weights, calculated as the percentage of zero weights in the true matrix that are recovered as a zero weight in the estimated matrix. • The percentage of correctly identified nonzero weights, calculated as the percentage of nonzero weights in the true matrix that are recovered as a nonzero weight in the estimated matrix.

| Results
The results of the simulation study for the single block data are summarized in Figures 1 and 2. Figure 1 shows the Tucker congruence coefficient for the different model selection methods. Usually, a threshold of 0.85 is recommended. 33 In the condition where the sparsity is 80%, only 10-fold CV, 10-fold CV with the one standard error rule, and CHull with the MSE often attain Tucker congruence values above the threshold value of acceptable similarity. Interestingly, CHull with MSE performs well, whereas this is not the case for the CHull badness-of-fit measure previously used in the literature. In the conditions were the sparsity is 30%, only 10-fold CV and 10-fold CV with the one standard error rule attain Tucker congruence values above 0.85. This means that the BIC, IS, and the CHull with VAF procedures result in models where the estimated component weights are too dissimilar from the true component weights. When the true underlying models are very sparse (the conditions with 80% of sparsity), the procedures in general perform better. Because the Tucker congruence coefficient is relatively insensitive to whether the correct status of the weights (i.e., zero or nonzero status) is estimated back, we also inspect whether the model selection procedures result in models that select the right subset of variables. The results are summarized in Figure 2. Three patterns can be discerned. First, CV finds almost 100% of the nonzero weights yet recovers very few of the zero weights; this confirms that CV is known to yield too complex models. Second, the IS, BIC, and CHull with VAF show the opposite behavior and favor very sparse models, which results in good recovery of the zero weights at the expense of recovering very few of the nonzero weights. Third, CV with the one standard error rule yields a high percentage of recovery for both the zero and nonzero weights. It may seem surprising that most of these model selection techniques perform badly while having showed good performance in the literature with sparse loadings (e.g., Gu et al. 34 ). This can be explained by the fact that-for the reconstruction of the data-the component scores and the loadings matter while the component weights play an indirect role. The component weights enter in the construction of the scores:T = XŴ. As long as the scores are recovered well, the data are reconstructed well. This is the case for the data here: Tucker congruence betweenT and T is larger than 0.85 for the bulk of the selected models with each of the model selection procedures; see Figure 3. This in fact means that the component scores themselves can be retrieved rather well without the need of having to estimate that many nonzero weights. Hence, model selection procedures that balance fit with the number of nonzero coefficients result in very sparse models. This implies that few weights actually need to be estimated in order for the model to attain a good fit.

| Multiblock data
In this simulation study, we assess the performance of the model selection criteria for the case of multiblock data that have structured sparsity; that is, we assume the component weights to have a common and distinctive structure. Here, particular interest will be in evaluating whether the model selection methods recover the common and distinctive structure.

| Simulation study design
The data that will be analyzed in this simulation study consist of two data blocks (X = [X 1 X 2 ]) each with 25 variables. The structure imposed on the data is Q = 3 components with two distinctive components and one common component. The study includes the following design factors: • The number of samples I: 25 and 100.
• The level of sparseness in the nonzero blocks in the columns of W: 30% and 80%.
The design was fully crossed, resulting in 2 × 2 × 2 = 8 design cells. For each design cell, 50 data sets were generated. The details of the data generation scheme used can be found in Appendix A1.0.2. The data were analyzed with Algorithm 1 for a grid of LASSO, group LASSO, and ridge tuning parameters. The sequences of the ridge, LASSO, and group LASSO parameters are given by a sequence from 0 to 500 of length 10 on the natural log scale for each of the three tuning parameters. The chosen model according to the model selection criteria is then evaluated by looking at the following performance measures: • Tucker congruence coefficient.
• Whether the two distinctive components are estimated back (i.e., all weights in the zero segments are estimated as zero), • Whether the common component is estimated back (i.e., at least one nonzero weight in each data block).

F I G U R E 3
The Tucker congruence coefficient between T andT for the various model selection procedures. The dashed line indicates a threshold value of 0.85 used as a cut-off for fair similarity. In each condition, 50 replicate data sets were used. The boxplots display the median and upper and lower quartiles

| Results
The results of the multiblock simulation study are summarized in Figure 4 and Tables 1 and 2. In Figure 4, the Tucker congruence coefficients are displayed; these mainly show low congruence; this is below the threshold of 0.85, except for the two CV procedures. Also here, as was the case in the single block simulation, the low Tucker congruence coefficients are caused by most model selection procedures having put too many weights to zero, compared with the actual number of zero weights. Compared with the first simulation, Tucker congruence is a bit higher because the distinctive components induce higher levels of overall sparsity, meaning that the true model is more sparse and thus supportive of selection methods that favor higher levels of sparsity.
We now turn to the question whether the model selection methods recover the common and distinct components. Table 1 summarizes whether the common component is identified for the different model selection procedures, that is, whether at least one nonzero component weight within each block was retained. Table 2 summarizes whether the distinctive components are identified by the different model selection procedures (i.e., whether all weights of the block not making up the component are set to zero). Together, these tables show the same patterns previously observed for the single block simulation study: CV favors complex models, which results in defining most components as common and not finding the distinctive components; the BIC, CHull-VAF, CHull-MSE, and the IS estimate models that are too sparse and hence declare most components to be distinctive at the expense of the common components; again, only 10-fold CV with the one standard error rule accurately estimates the sparsity, both within and between blocks.
To decide on which method is best on the basis of combining the identification rates for the common and distinct components, we used sum of ranking differences (SRD) scores and summarized these in a plot. SRD scores are a consensus decision making tool for situations with multiple optimality criteria 35 (for further reading, also see Héberger 36 ). Here, the scores are based on rankings of the model selection procedures on the basis of the identification rates for common (see Table 1) and distinctive (see Table 2) components. For further details on how to obtain the SRD score, we refer to Lourenco and Lebensztajn. 35 The SRD scores are summarized in Figure 5 with lower scores indicating better overall performance of the method. The gray solid curve denotes the cumulative distribution of SRD scores on the basis of a random ranking of the methods on the different optimality criteria (we relied on an approximate distribution). In the plot, the score corresponding to the 0.05 smallest SRD scores for the randomly ranked methods is indicated: this is F I G U R E 4 The Tucker congruence coefficient between W andŴ for the various model selection procedures in the case of multiblock data. The dashed line indicates a threshold value of 0.85, which indicates fair similarity. In each condition, 50 replicated data sets were used. The boxplots display the median and upper and lower quartiles our chosen cut-off for significance with methods having higher scores being considered to not perform consistently better on each of the optimality criteria than based on a random ranking. It can be observed that only 10-fold CV with the one standard error rule falls (just) below the cut-off, indicating that it is all-round better than the other methods. The other model selection procedures do not consistently perform better.
For the interested reader, we will provide an example of the analysis of multiblock data making use of Equation (3) in the next section.

| EMPIRICAL EXAMPLE: HERRING DATA
We will now provide an illustrative example where we analyze a data set on salted herring samples using sparse weight-based SCA. The data set on salted herring consists of two blocks, each containing a specific set of variables on 21 herring samples with the samples corresponding to different ripening conditions; see Bro et al. 37 and Nielsen et al. 38 for more information about the data. The first block contains chemical and physical measurements, whereas the second block consists of sensory variables. For an overview of the variable names, see Table 3. The analysis of multiblock data follows three steps:

T A B L E 1 Common components identified in percentages
• Preprocessing of the data. • Tuning the model; selecting the metaparameters.
• Analyzing the final model; interpreting the component weights.
We will discuss each of these steps here below.

| Preprocessing of the data
Preprocessing has a large impact on the final results of the analysis and should be done according to the needs of the researchers; see Deun et al. 25 for an overview. For the herring data here, we first centered and scaled (to unit variance) the variables as we are not interested in scale differences. As the two blocks have the same number of variables, no further block scaling is needed.

| Tuning the model
Multiple metaparameters need to be tuned in order to get to a satisfactory final model. For the sparse PCA method that we use here, these are the number of components and the regularization parameters λ L , λ G , and λ R . Also for the selection of the number of components, CV has been recommended. 15 Hence, two strategies can be considered, namely, tuning all parameters together or following a sequential strategy. Because of the computational burden of the simultaneous strategy, we opt for the sequential approach: first, we select the number of components, and then, given the selected number of components, we tune the LASSO and group LASSO parameters. To determine the number of components, we used 10-fold CV with the one standard error rule on each block. This resulted twice in three components; hence, we analyzed the concatenated data with the maximum number of components possible, that is, six distinctive components. Also, the regularization parameters were tuned using 10-fold CV with the one standard error rule. More specifically, we tuned the LASSO, ridge, and group LASSO regularization parameters on a three-dimensional grid with 25, equally spaced values between 0 and 500 on the log scale; for the data here, this covers solutions ranging from no sparseness at all to all coefficients being zero. We chose a log scale because it tends to do well in practice and has been recommended elsewhere; see Friedman et al. 39, p. 10 Note that the upper bound depends on the scale of the data.

| Analysis of the final mode and interpretation of the results
The component weights resulting from the final analysis (i.e., using six components and with the values of the regularization parameters set at those selected under the CV scheme) are summarized in Table 3. Note that in this case there are two distinctive components (Components 2 and 5) and four common components (Components 1, 3, 4, and 6). The component weights directly relate the components to the observed variable as t iq = P j w jq x ij . For comparison, we included results from a nonspare PCA of the concatenated data in Table 4 where the weights/loadings 2 are estimated using the singular value decomposition and subsequently rotated to a simple structure using varimax rotation. 40 Strikingly, there is no structuring of the components into common and distinctive components. Furthermore, components in PCA are a linear combination of all variables, and interpreting these is much more difficult than sparse SCA. Take, for example, the fifth component; in the case of PCA with varimax rotation, the component is a linear combination of mainly the variable Spice, but other variables are weighted as well with nonnegligible loadings such as TCAIndexB, Malt, and Stockfish smell. In the case of sparse SCA, the fifth component is just the variable Salty, making it the unequivocal Salty component. Importantly, the gain in interpretation obtained by imposing sparseness comes at barely any cost in terms of the variation accounted for.
As for the meaning of the components, we examine the first and most important component in terms of explained variance (42.4% variance explained) from the sparse SCA. We observe that ProteinB, TCAIndexM, TCAM, and TCAB, from the first block, together with Ripened, Malt, Sweetness, Softness, Toughness, and Watery from the second block, make up the first component. Nielsen et al. 38 note that softness (the most important quality indicator used in the  2 Note that the loadings and the weights are the same in PCA when P T P = I herring industry) correlates with TCAM/B TCAIndexM and ProteinB, which to them makes sense because: "A correlation between these parameters and softness may be expected as muscle proteins are broken down during the ripening thus explaining the increase in low molecular nitrogenous compounds and at the same time softening of the tissue is encountered. ProteinB is mainly salt soluble muscle protein diffusing into brine from the muscle. The solubilisation of muscle proteins will therefore also probably affect the texture." 38, p. 23 Furthermore, they note that Softness, Toughness and Watery measure the same characteristics and that TCAIndexM, TCAM and TCAB measure the same characteristics. This corresponds to the reported weights for the first component, except for the small weight of ProteinM. We could view component one as a "quality of herring" component. The first component obtained with PCA followed by varimax rotation is also the most important component (31.9% variance explained), and we may expect this to also represent quality of herring. The weights for this component in Table 4 show a somewhat similar pattern, yet there are some deviations, and the interpretation is much harder because all variables make up the first component. Furthermore, this component explains less variance compared with the first component of sparse SCA.

| CONCLUSION
The current paper examined several model selection procedures to select the penalty tuning parameters of sparse weight-based PCA for the unstructured case of a single block of data and of sparse weights SCA for the multiblock case having structured sparsity. Most model selection procedures that have been proposed in the sparse PCA literature did not perform well in terms of finding back the correct component weights. When analyzing single block data, the The first block, corresponding to the first 10 variables (rows), consists of physical and chemical analyses of the herring samples measured either in brine (B) or fish muscle (M). The second block contains sensory data on the herring samples. These are the results from the loadings obtained from the singular value decomposition rotated according to the varimax criterion. Abbreviations: MM, majorization-minimization; SCA, simultaneous component analysis; VAF, variance accounted for. procedures led to either too complex or too sparse models. When analyzing multiblock data, it led to either identifying most components as common components and not as distinctive or not identifying common components as such. The only model selection procedure that seems to strike a good balance between model complexity and goodness of fit in both the single and multiblock cases was 10-fold CV with the eigenvector method employing the one standard error rule. It has to be noted that we did not tune the number of components together with the tuning parameters; this could be addressed in further research.
As discussed in the paper, although the weights are recovered badly, this barely affects the recovery of the component scores nor the reconstruction of the data and hints at the fact that the estimation of the weights is an illconditioned problem. Importantly, this means that if the goal is to obtain good estimates of the component scores, loadings, or data yet with no interest in the estimates of the component weights, proper tuning of the penalties on the weights is not needed. In this situation, an economical decision may be to select a very sparse model (e.g., as resulting from the IS, BIC, or CHull) as good estimates of the component scores can be obtained with few variables. Yet when insight in the processes at play in the data is needed, our advice is to use CV with the one standard error rule.
It has to be noted that a good solution for the component weights is in the eyes of the beholder; a situation where a very sparse solution might be desirable is when the component scores themselves are of interest and when observing new data are expensive. For newly observed cases, only the variables with nonzero component weights have to be observed to compute component scores.

Description of algorithm
In order to obtain the component weights, we need to optimize the following objective function with respect to W c and P c : where W c = ½ðW ð1Þ Þ T , …, ðW ðKÞ Þ T T , and w ðkÞ q denotes the qth column from the submatrix W (k) . In order to get a minimum for (A1), we alternate between the estimation of W c and P c . Given W c , we can estimate P c by using procrustes rotation.24,41 Given P c , we can find estimates for W c by using the majorization-minimization algorithm; for a short review, see Hunter and Lange.42To majorize (A1), we can majorize all individual terms separately. First, we majorize ‖W c ‖ 1 . For the ease of simplicity, let j = 1,…, P k J k be index the rows of W c and let q = 1, … , Q be index the columns of W c , and then we can majorize ‖W c ‖ 1 as follows: wherew jq is the current estimate of w jq , vec() denotes the vectorized version of a matrix, D 1 a diagonal matrix of jw − 1 jq j, and c contains the terms that do not depend on w jq and thus can be neglected in solving the optimization problem with respect to the elements of W. Next, we consider a majorizing function for the QK group LASSO terms, Lastly, we will majorize the QK elitist LASSO penalty terms, with is a quadratic function that can be easily minimized by taking the partial derivatives with respect to the elements to vec(W c ) and setting them to zero, also see Deun et al.,14 doing this gives us the following estimates for vec(W c ): with I being a Q × Q identity matrix. Estimates for vecðŴ c Þ can be found relatively efficiently by making use of the block diagonality of ðD sup + IX T c X c Þ, meaning that the weights can be estimated per component separately, withŵ q denoting the estimates of the qth component, D ðqÞ sup denoting the part of D sup corresponding to the qth component, and a q denoting the qth column of A = X T c X c P c . The computation of the inverse in Equation (A8) can be costly if P J k is large. To make the algorithm well suited to handle a large number of variables, we can implement a coordinate descent procedure to solve for W c in Q(W c ,P c ). For the ease of notation, we will drop subscript c and let j = 1,…, Q P J k . Then, the update for an element of vec(W c ) is given by where subscript j denotes the jth element of a vector or the jth column of a matrix and −j denotes the object minus the jth element or column. Making use of the orthogonality of P and with j = 1,…, P J k , this simplifies tô With these derivations, the estimation of W c can be summarized in Algorithm 1.
Although different regularizers are implemented in Algorithm 1, it is not advised to combine them all together. For example, it is not advised to combine the group LASSO and the elitist LASSO as they have opposing goals. A use case for the elitist LASSO is when common components have to be extracted; this is imposing zeros on each block in such a way that for each block segment also nonzero component weights remain.

Data generation Single block
The data for the simulation study were generated from the following model: where W is J × J, W T W = I, and W = P. W is manipulated such that it contains a given level of sparsity. To achieve this, we make use of an iterative procedure that proceeds as follows. First, a random W matrix is generated with zero weights in the desired places. After this step, orthogonality of the columns is attempted by applying the Gram-Schmidt orthogonalization procedure only on the intersection of the nonzero weights between two columns of W. When W only has sets of columns that contain nonoverlapping sparsity patterns, this immediately results into orthogonal columns, but when the columns in W have overlapping sparsity patterns, the procedure will not always lead to W T W = I on the first pass. In such cases, multiple passes are needed in order to achieve orthogonality (additional coefficients might need to be put to zero). Some sparsity patterns are not possible, for example, an initialization where W does not have full column rank, or an initial set that degenerates to a linearly dependent set after multiple passes. In those cases, the algorithm fails to converge. After a suitable W has been obtained, Σ can be constructed by taking Σ=WΛW T . Here, Λ is a diagonal matrix with eigenvalues of the J components underlying the full decomposition. We specify these eigenvalues such that the first Q components account for a set amount of structural variance and the remaining eigenvalues for a set amount of noise variance. The data matrices X having a desired underlying sparse structure and noise level can then be obtained by sampling from the multivariate normal distribution using Σ and a zero mean vector.

Multiblock
The data generation for the multiblock simulation study is the same as the data generation in the single set simulation study, except that the data have been generated with two distinctive components and one common component. We define a distinctive component as being a linear combination of variables from a particular data block and a common component as a linear combination of all data blocks. In order to achieve the desired common and distinctive structure, full block segments of zeros are inserted in the W matrix.