Comparison of variable selection methods in partial least squares regression

Through the remarkable progress in technology, it is becoming ever easier to generate vast numbers of variables from a given sample. Variable selection is therefore imperative, both for data reduction and for understanding the modeled relationship. Partial least squares (PLS) regression is among the modeling approaches that address high-throughput data, and a considerable number of variable selection methods have been introduced for PLS, most of which were surveyed in a recent review. Motivated by this, we have conducted a comparison of available methods for variable selection within PLS. The main focus of this study is to reveal patterns of dependency between variable selection method and data properties, which can guide the choice of method in practical data analysis. To this aim, a simulation study was conducted with data sets of diverse properties: the number of variables, the number of samples, model complexity level, and information content. The results indicate that these factors, together with the variant of PLS method, and their mutual higher-order interactions, all significantly affect the prediction capability of the model and the choice of variable selection strategy.

made in previous studies. 6,7 PLS has proven to be a very versatile method for multivariate data analysis. It is a supervised method purposely established to address the problem of making good predictions in multivariate problems, see Martens and Naes. 6 Although PLS can model high-dimensional data, a very large p combined with a small n can still spoil the PLSR results. For example, in some cases, the PLS estimator with a univariate response is not consistent, 1 and a large number of irrelevant variables may cause a large variation in test set prediction. 2 These examples motivate variable selection in PLS to improve model performance. 8,9 Further, variable selection is important for improved interpretation and understanding of the modeled phenomena. These two motivations are somewhat contradictory: increased model performance is normally achieved with a large number of variables, which calls for a compromise between model performance and the number of selected variables. 10 Thus, variable selection is needed to obtain an interpretable 3 relationship between the explanatory variables and the response, together with acceptable statistical model performance.
PLS in its original form has no direct implementation of variable selection, but a wide range of methods addressing variable selection have been proposed for PLS. Recently, a large set of variable selection methods in PLS has been reviewed, 10 where the methods are categorized on the basis of their methodological construction into three main categories: filter methods, wrapper methods, and embedded methods. Motivated by this study, we have conducted a comparison of a large selection of available methods for variable selection within PLS.
The main aim of this paper is to unravel patterns of dependencies connecting variable selection methods and data properties. A simulation study is performed where data sets with diverse characteristics, such as the number of variables, the number of samples, model complexity level, and information content, have been created. A meta-analysis provides an overview of the performance of the PLS methods. This ultimately provides a suggestion for the selection of appropriate methods for the data set at hand.

PLSR ALGORITHM
There is a variety of PLSR algorithms; we start with the pioneering algorithm, called orthogonal scores PLSR, 5 and later we will include more recent algorithms as well. We assume a set of explanatory variables X (n, p) is linked to a response y (n, 1) through the linear relationship y = β_0 + Xβ + ε with unknown regression parameters β_0 and β and error term ε. For simplicity, we have considered only the single-response case, but the methods can be generalized to multiple responses. Initially, the variables are centered (and optionally scaled) into X_0 = X − 1x̄′ and y_0 = y − 1ȳ. Let A be the number of components to be extracted. Then for a = 1, 2, …, A, the algorithm goes as follows:
1. Compute the loading weights (LWs) by w_a = X′_{a−1} y_{a−1}. The weights define the direction in the space spanned by X_{a−1} of maximum covariance with y_{a−1}. Normalize the LWs to unit length by w_a ← w_a/||w_a||.
2. Compute the score vector t_a by t_a = X_{a−1} w_a.
3. Compute the X-loadings p_a by regressing the variables in X_{a−1} on the score vector: p_a = X′_{a−1} t_a/(t′_a t_a). Similarly, compute the Y-loading q_a by q_a = y′_{a−1} t_a/(t′_a t_a).
4. Deflate X_{a−1} and y_{a−1} by subtracting the contribution of t_a: X_a = X_{a−1} − t_a p′_a and y_a = y_{a−1} − t_a q_a.
5. If a < A, return to 1.
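As a concrete illustration, steps 1 to 5 can be sketched in a few lines of numpy. The final reconstruction of the regression coefficients, β = W(P′W)⁻¹q, is the standard orthogonal-scores identity and is added here beyond the listed steps:

```python
import numpy as np

def pls1(X, y, A):
    """Orthogonal-scores PLS1 for a single response (sketch of steps 1-5)."""
    X = X - X.mean(axis=0)              # center predictors (X_0)
    y = y - y.mean()                    # center response (y_0)
    n, p = X.shape
    W = np.zeros((p, A)); P = np.zeros((p, A))
    q = np.zeros(A); T = np.zeros((n, A))
    for a in range(A):
        w = X.T @ y                     # step 1: direction of max covariance
        w /= np.linalg.norm(w)          #         normalize to unit length
        t = X @ w                       # step 2: scores
        tt = t @ t
        pl = X.T @ t / tt               # step 3: X-loadings
        ql = y @ t / tt                 #         y-loading
        X = X - np.outer(t, pl)         # step 4: deflate X
        y = y - t * ql                  #         deflate y
        W[:, a], P[:, a], q[a], T[:, a] = w, pl, ql, t
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients in centered X
    return W, T, P, q, beta
```

With A equal to the rank of X and a noiseless linear response, the fit reproduces the least-squares solution exactly, which is a convenient sanity check.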

VARIABLE SELECTION METHODS IN PLS
Variable selection methods in PLS are classified, based on how the variable selection is made in PLS, into three categories: filter, wrapper, and embedded methods; for details, see Mehmood et al. 10 A short description is given below.

Filter methods
This is a two-step procedure. At first, a PLS model is built. Then, the output from the PLSR-algorithm is solely used to identify a subset of important variables. Examples are selections based on LWs 11 or RCs, 9 Jack-knife testing (JT), 12 variable importance in projection (VIP), 13 selectivity ratio (SR), 14 and significance multivariate correlation (sMC). 15

LWs from PLS
From the fitted PLS model for some number of components which are possibly optimized through cross-validation, the LW (w a ) can be used as a measure of importance to select variables. 8,11 For each component, the variables with a LW above a certain threshold in absolute value may be selected; specifically, the threshold is defined as w * = median(w)∕InterquartileRange(w). This is also known as hard thresholding and was suggested by Shao et al. 16
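A minimal sketch of this hard-thresholding rule follows. Applying the median/interquartile-range ratio to the absolute values of the weights is our reading of the rule, not spelled out in the text:

```python
import numpy as np

def lw_select(w):
    """Hard thresholding of loading weights: keep variables whose |w_j|
    exceeds w* = median(|w|) / IQR(|w|). Using absolute weights for the
    threshold computation is an assumption of this sketch."""
    aw = np.abs(w)
    q1, q3 = np.percentile(aw, [25, 75])
    thr = np.median(aw) / (q3 - q1)
    return np.where(aw > thr)[0], thr
```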

RCs from PLS
RCs (β) are single measures of association between each variable and the response. Again, variables whose β_j is small relative to β* = median(β)/InterquartileRange(β) can be eliminated. 9 Also in this case, thresholding may be based on significance considerations from jackknifing or bootstrapping, 12 which has been adopted in a wide range of studies (eg, previous works 17,18 ).

Variable importance in PLS projection
A third filter measure is the variable importance in PLS projections (VIP), introduced in previous work 13 as "variable influence on projection" and now known as "VIP" as termed by Eriksson et al. 19 The v_j weights measure the contribution of each variable according to the variance explained by each PLS component, where (w_aj/||w_a||)^2 represents the importance of the jth variable in component a. Since the variance explained by component a can be computed by the expression q_a^2 t′_a t_a, 19 the v_j can be expressed as v_j = sqrt(p Σ_a [q_a^2 t′_a t_a (w_aj/||w_a||)^2] / Σ_a [q_a^2 t′_a t_a]). A variable can be eliminated if v_j < u for some user-defined threshold u ∈ [0, ∞). It is generally accepted that a variable should be selected if v_j > 1. 19-21
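The VIP expression translates directly into code. By construction, the mean of the squared scores is one, which is why v_j > 1 is a natural selection rule:

```python
import numpy as np

def vip(W, T, q):
    """VIP scores from PLS loading weights W (p x A), scores T (n x A), and
    y-loadings q (length A):
    v_j = sqrt(p * sum_a SS_a (w_aj/||w_a||)^2 / sum_a SS_a),
    with SS_a = q_a^2 t_a' t_a, the variance explained by component a."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)        # SS_a per component
    Wn2 = (W / np.linalg.norm(W, axis=0)) ** 2  # squared normalized weights
    return np.sqrt(p * (Wn2 @ ss) / ss.sum())
```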

SR from PLS
The SR is based on the target projection (TP) approach. 14 Target projection is a postprojection of the predictor variables onto the fitted response vector from the estimated model. This results in a decomposition of the original predictor matrix into a latent TP-component and a residual component as follows: X = t_TP p′_TP + E_TP, where t_TP = X w_TP, w_TP = β̂_PLS/||β̂_PLS||, and p_TP = X′ t_TP/(t′_TP t_TP). The loadings from this model can be used as measures of how much each predictor variable contributed to the fitted response from the PLSR-model, and on this basis the SR, r_j, is introduced. 22 For each variable j, the r_j can be computed as r_j = V_exp,j/V_res,j, where V_exp,j is the explained variance and V_res,j is the residual variance of variable j according to the TP-model. The most influential variables will score highest on r_j. In order to set a threshold for r_j, Kvalheim et al present an F-test under the null hypothesis that "explained and residual variance are equal," 23 and for each variable the hypothesis can be tested by rejecting the null hypothesis if r_j > F_α,n−2,n−3. This provides a probabilistic measure for the selection of variables. 24
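The TP decomposition and the ratio r_j can be sketched as below. The function name and the convention of passing a precentered X and the PLS coefficient vector are ours; the F-test threshold is omitted:

```python
import numpy as np

def selectivity_ratio(X, beta):
    """Selectivity ratio via target projection. X is the centered predictor
    matrix (n x p) and beta the PLS regression coefficient vector (p,)."""
    w_tp = beta / np.linalg.norm(beta)   # target-projection weights
    t_tp = X @ w_tp                      # target-projected scores
    p_tp = X.T @ t_tp / (t_tp @ t_tp)    # TP loadings
    X_tp = np.outer(t_tp, p_tp)          # explained part of X
    E = X - X_tp                         # residual part of X
    v_exp = np.sum(X_tp ** 2, axis=0)
    v_res = np.sum(E ** 2, axis=0)
    return v_exp / v_res                 # r_j per variable
```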

Significance multivariate correlation with PLS
In sMC, 15 an F-test is also used to assess which variables are statistically significant with respect to their relationship (regression) to y, but for the sMC_i test value, an F-distribution with 1 numerator and n − 2 denominator degrees of freedom is used, F_(1−α,1,n−2), where α is the chosen significance level and sMC_i is defined as the ratio of the mean squared regression to the mean squared residuals of variable i, sMC_i = (SS_regression,i/1)/(SS_residual,i/(n − 2)).

Minimum redundancy maximum relevance in PLS
The minimum redundancy maximum relevance (mRMR) is a variable selection algorithm that tends to select variables having a high correlation with the response variable (relevance) and low correlation among the selected variables (redundancy). 25 In PLS, the variables with minimum redundancy and maximum relevance can be found by starting with the variable having maximal mutual information with the response and then greedily adding the variable that maximizes relevance, measured through the PLS LWs, penalized by its redundancy with the set S of already selected variables, where j denotes a variable in S. Variables are ranked by the mRMR criterion, and the top m variables are marked as influential, where m can be determined through validation.
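A sketch of the greedy step follows. The difference form of the criterion (relevance minus the mean absolute correlation with the selected set) is one common mRMR variant and is an assumption here, as is passing the relevance scores (eg, |PLS LWs|) in directly:

```python
import numpy as np

def mrmr_pls(X, relevance, m):
    """Greedy mRMR ranking sketch. `relevance` holds per-variable relevance
    scores (eg, absolute PLS loading weights); redundancy is the mean
    absolute correlation with the already-selected variables."""
    p = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))       # pairwise |correlation|
    selected = [int(np.argmax(relevance))]         # start at max relevance
    while len(selected) < m:
        rest = [j for j in range(p) if j not in selected]
        scores = [relevance[j] - C[j, selected].mean() for j in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```

Note how a highly relevant variable is skipped if it is nearly collinear with one already chosen.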

Wrapper methods
The variables identified by a filter method are used to refit the PLS model, and the refitted PLS model is then used again with the filter method. This procedure is repeated a certain number of times. The wrapper methods are mainly distinguished by the choice of underlying filter method and by how the "wrapping" is executed. Wrapper methods can further be categorized 3 based on the search algorithm: deterministic or randomized. Randomized search algorithms utilize some kind of randomness in the selection of subsets, while deterministic ones do not. An example of a randomized wrapper method is Monte-Carlo variable elimination with PLS (MVE), 26 while examples of deterministic wrapper methods are sub-window permutation analysis with PLS (SPA), 27 backward variable elimination in PLS (BVE-PLS), 8 and the regularized elimination procedure in PLS (REP). 3

Genetic algorithm combined with PLS regression
Hasegawa et al combine the genetic algorithm (GA) with PLS in the GA method. 28 Genetic algorithms involve the following steps:
1. Build an initial population of variable sets by setting bits for each variable randomly, where bit "1" represents selection of the corresponding variable while "0" represents nonselection. The approximate size of the variable sets must be set in advance.
2. Fit a PLSR-model to each variable set and compute the performance by, for instance, a leave-one-out cross-validation procedure.
3. A collection of variable sets with higher performance is selected to survive until the next "generation."
4. Crossover and mutation: New variable sets are formed (a) by the crossover of selected variables between the surviving variable sets and (b) by changing (mutating) the bit value of each variable with a small probability, the mutation ratio R_0.
5. The surviving and modified variable sets from the population serve as input to step 2.
The steps 2 to 5 are repeated a preset number of times. Upon completion of the GA-algorithm, the best variable set (or a combination of a collection of the best sets) in terms of performance is selected.
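The loop above can be sketched as follows. Here cv_error stands in for the PLSR leave-one-out evaluation of step 2, and the population size, number of generations, and mutation probability are illustrative defaults:

```python
import numpy as np

def ga_pls(X, y, cv_error, pop_size=20, gens=15, mut=0.05, seed=0):
    """GA variable selection sketch (steps 1-5): boolean bit vectors encode
    variable sets; the fitter half survives, and crossover plus mutation
    build the next generation. cv_error(X_subset, y) evaluates a subset."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    popu = rng.random((pop_size, p)) < 0.5                # step 1: random bits
    for _ in range(gens):
        fit = np.array([cv_error(X[:, m], y) if m.any() else np.inf
                        for m in popu])                   # step 2: evaluate
        survivors = popu[np.argsort(fit)][: pop_size // 2]  # step 3: survive
        children = []
        for _ in range(pop_size - len(survivors)):        # step 4
            a, b = survivors[rng.integers(len(survivors), size=2)]
            cut = rng.integers(1, p)
            child = np.concatenate([a[:cut], b[cut:]])    # crossover
            child = child ^ (rng.random(p) < mut)         # mutation
            children.append(child)
        popu = np.vstack([survivors, children])           # step 5: next gen
    fit = np.array([cv_error(X[:, m], y) if m.any() else np.inf for m in popu])
    return popu[int(np.argmin(fit))]
```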

Uninformative variable elimination in PLS
Centner et al 29 introduced uninformative variable elimination in PLS (UVE), where artificial noise variables are added to the predictor set before the PLSR model is fitted. All the original variables having lower "importance" than the artificial noise variables are eliminated, and the procedure is repeated until a stop criterion is reached. The steps involved in UVE are as follows:
1. Generate a noise matrix N of the same dimension as X, with entries randomly drawn from the uniform distribution on the interval 0.0 to 1.0.
2. Combine the N and X matrices into a new matrix of variables Z = [X, N].
3. Fit the PLSR model to the combined matrix Z and validate by means of leave-one-out cross-validation.
4. Use the cross-validation results to compute a test statistic for each variable as c_j = mean(β̂_j)/sd(β̂_j), for j = 1, 2, …, 2p.
5. Set the threshold cmax as the maximum of |c_j| among the noise variables. Original variables with |c_j| smaller than cmax are assumed to be noise variables and are eliminated.
Steps 2 to 5 are repeated until the performance of the model starts decreasing. The artificially added random variables could influence the model if they are not properly generated. 30

Sub-window permutation analysis coupled with PLS
Sub-window permutation analysis coupled with PLS (SPA) 27 assesses the influence of each variable without considering the influence of the rest of the variables. The use of subsets of variables makes SPA efficient and fast for large data sets.
The steps involved in SPA are as follows:
1. Sample sub-datasets in both the sample and variable space, giving N training and test sets.
2. For each randomly sampled training set, build a PLS model and measure a normal prediction error, NPE, on the corresponding test set. This NPE is associated with all the predictors in the given training set.
3. Measure the prediction performance again for each sampled training set by iteratively permuting each predictor variable to obtain a permuted prediction error, PPE, associated with each predictor.
4. Upon completion of the sub-dataset sampling, assess the importance of each variable by comparing the distributions of normal prediction errors (NPEs) and PPEs of the given variable. Hence, for variable j, the statistical assessment of importance is computed as D_j = mean(PPE_j) − mean(NPE_j).
5. All variables for which D_j > 0 are considered informative, in the sense that they will with large probability improve the prediction performance of a model if they are included.
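These steps can be sketched as follows. Two simplifications are assumed: the splits sample only in the sample space, and the permutation is applied to the held-out copy of each variable rather than to the training copy; fit_predict stands in for the PLS model fit:

```python
import numpy as np

def spa_importance(X, y, fit_predict, n_sub=50, rng=None):
    """SPA sketch: repeated random splits give a normal prediction error
    (NPE) per split; permuting one variable at a time gives a permuted
    prediction error (PPE) per variable; D_j = mean(PPE_j) - mean(NPE)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    npe, ppe = [], [[] for _ in range(p)]
    for _ in range(n_sub):
        idx = rng.permutation(n)
        tr, te = idx[: 2 * n // 3], idx[2 * n // 3:]
        predict = fit_predict(X[tr], y[tr])
        npe.append(np.mean((predict(X[te]) - y[te]) ** 2))
        for j in range(p):
            Xp = X[te].copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break variable j's link to y
            ppe[j].append(np.mean((predict(Xp) - y[te]) ** 2))
    return np.array([np.mean(v) for v in ppe]) - np.mean(npe)
```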

Iterative predictor weighting PLS
Forina et al 31 introduced an iterative procedure for variable elimination, called Iterative predictor weighting PLS (IPW). This is an iterative elimination procedure where a measure of predictor importance is computed after fitting a PLSR model (with complexity chosen based on predictive performance). The importance measure is used both to rescale the original X-variables and to eliminate the least important variables before subsequent model refitting.

Backward variable elimination PLS
Backward variable elimination procedures have also been developed for the elimination of non-informative variables, first introduced by Frank et al 8 and later by Fernandez et al. In general, the variables are first sorted by some importance measure, usually one of the filter measures described above. Second, a threshold is used to eliminate a subset of the least informative variables. Then, a model is fitted again to the remaining variables, and the performance is measured. The procedure is repeated until maximum model performance is achieved.
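The elimination loop can be sketched as below. The fraction dropped per pass, and the use of generic filter_scores and cv_error callbacks in place of the PLS filter measure and cross-validation, are assumptions of this sketch:

```python
import numpy as np

def bve(X, y, filter_scores, cv_error, frac=0.25):
    """Backward variable elimination sketch: repeatedly drop the fraction
    `frac` of lowest-scoring variables, refit, and return the subset with
    the smallest cross-validated error seen along the way."""
    active = np.arange(X.shape[1])
    best_err, best_set = np.inf, active.copy()
    while len(active) > 1:
        err = cv_error(X[:, active], y)
        if err < best_err:
            best_err, best_set = err, active.copy()
        s = filter_scores(X[:, active], y)
        k = max(1, int(frac * len(active)))
        active = active[np.argsort(s)[k:]]   # drop the k least informative
    return best_set, best_err
```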

Regularized elimination procedure in PLS
Mehmood et al 3 introduced a regularized variable elimination procedure (REP) for parsimonious variable selection, where a stepwise elimination is carried out. A stability-based variable selection procedure is adopted, where the samples are split randomly into a predefined number of training and test sets. For each split g, the following stepwise procedure is adopted to select the variables: Let Z_0 = X and s_j (eg, w_j, β_j, or r_j) be one of the filter criteria for variable j.
1. For iteration g, run y and Z_g through cross-validated PLS. The matrix Z_g has p_g columns, and we get the same number of criterion values, sorted in ascending order as s_(1), …, s_(p_g).
2. Assume there are M criterion values below some predefined cutoff u. If M = 0, terminate the algorithm here.
3. Else, let N = ⌈αM⌉ for some fraction α ∈ (0, 1]. Eliminate the variables corresponding to the N most unfavorable criterion values.
4. If there is still more than one variable left, let Z_(g+1) contain these variables, and return to (1).
The fraction α determines the "step length" of the elimination algorithm, where an α close to 0 will only eliminate a few variables in every iteration. The fraction α and the cutoff u can be determined through cross-validation.
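One split of the procedure can be sketched as follows, with score_fn standing in for the cross-validated PLS filter criterion:

```python
import numpy as np

def rep_eliminate(X, y, score_fn, u, alpha=0.5):
    """REP sketch for one split: of the M variables whose criterion s_j
    falls below the cutoff u, eliminate the N = ceil(alpha * M) worst per
    pass; stop when M = 0 or only one variable remains."""
    active = np.arange(X.shape[1])
    while len(active) > 1:
        s = score_fn(X[:, active], y)      # filter criterion, eg |beta| or r_j
        below = np.where(s < u)[0]
        if len(below) == 0:                # M = 0: terminate
            break
        N = int(np.ceil(alpha * len(below)))
        worst = below[np.argsort(s[below])[:N]]
        active = np.delete(active, worst)  # eliminate N most unfavorable
    return active
```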

Hotelling T2-based variable selection in PLS (T2)
The PLS LW matrix W of the validated optimal model can be used to classify the informative variables through the Hotelling T2 statistic. 32 In short, a confidence region is computed for the LWs; variables falling outside the region (outliers) are taken as informative, while the inliers are removed from the model as non-informative.

Embedded methods
Here, the variable selection is an integrated part of a modified PLS algorithm. Examples include soft-threshold PLS (ST), 33 the sparse-PLS (SPLS), 34 and distribution-based truncation for variable selection in PLS (TR). 35

ST PLS
Sæbø et al 33 introduced a soft-thresholding step in the PLS algorithm (ST) based on ideas from the nearest shrunken centroid method. 36 The ST approach is more or less identical to the Sparse-PLS presented independently by Lê Cao et al. 34 At each step of the sequential ST algorithm, the LWs are modified as follows:
1. Scaling: w_k ← w_k/max_j |w_k,j|, for j = 1, …, p.
2. Soft thresholding: w_k,j ← sign(w_k,j)(|w_k,j| − δ)_+, where (·)_+ denotes the positive part.
3. Normalizing: w_k ← w_k/||w_k||.
The shrinkage δ ∈ [0, 1) sets the degree of thresholding, ie, a larger δ gives a smaller selected set of variables. Cross-validation is used to define this threshold. 33
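The three weight modifications amount to a few lines of numpy (a sketch of the per-component update only, not the full ST algorithm):

```python
import numpy as np

def soft_threshold_weights(w, delta):
    """ST-PLS weight update: scale by max |w_j|, soft-threshold by delta in
    [0, 1), then renormalize; variables with zeroed weights are deselected."""
    w = w / np.abs(w).max()                               # 1. scaling
    w = np.sign(w) * np.maximum(np.abs(w) - delta, 0.0)   # 2. soft threshold
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w                    # 3. normalizing
```

A larger delta zeroes more weights, hence a smaller selected variable set, as stated above.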

Distribution-based truncation for variable selection in PLS (TRUNC)
For the selection of variables, the magnitude of the PLS LWs (the columns of W) is an established criterion. Liland et al 35 suggested a distribution-based truncation for the variable selection. At each step of the sequential PLS algorithm, the weights are modified as follows:
1. Sort the LWs w as w_s.
2. Compute a confidence interval around the median of w_s, based on a threshold pr.
3. Classify the weights outside the interval as outliers and those inside as inliers.
4. Treat outliers as real, informative contributions and inliers as noise.
5. Truncate the inliers.

Weighted variable contribution in PLS
Weighted variable contribution with PLS (PLS-WVC) is based on the first singular value of the covariance matrix for each PLS component. 37 In each PLS iteration, the first singular value of the covariance of X_a and y_a can be written as s_a = w′_a X′_a y_a q_a, with ||w_a||^2 = w_a,1^2 + w_a,2^2 + … + w_a,p^2 = 1. This means the contribution of the ith variable to the first singular value of the correlation matrix between X_a and y_a is w_a,i^2 s_a, from which the weighted contribution WVC(i) of the ith variable is defined. The expression given seems to be a hybrid between two of the four possible criteria in the original article and will not work due to the dimensions not fitting each other in the denominator. In each iteration, w_a,i → 0 if WVC(i) < η, followed by the LW normalization, ie, w_a ← w_a/||w_a||. The optimal threshold η can be determined through validation.

Simulation for the comparison
We are interested in understanding the strengths and weaknesses of different variable selection methods in PLS. This goal can be achieved by constructing an experimental design for running simulation experiments, with different data properties that may influence the methods' performances. The chosen factors are the number of variables (p), the number of training samples (n), the number of relevant predictors (q), the information content (R2), a parameter γ (> 0) defining the decline of the eigenvalues of the predictor covariance matrix (Σx), the number of relevant latent components (m), and their position pos in the set of indices of declining eigenvalues of Σx. As an example, pos = (1, 5) means that the eigenvectors associated with the largest and the fifth largest eigenvalues of Σx define the relevant components for the prediction of the response y. The eigenvalues of Σx are assumed to decline exponentially as exp(−γ(j − 1)) for j = 1, …, p. That is, a large γ implies rapid decline and high multi-collinearity. These factors represent the data properties to be varied in our simulation study. We have considered different levels of these factors, as listed in Table 1. The simulations were conducted using the simrel-package 38 and the plsVarSel-package 39 in R, 40 and further details regarding the simulation model may be found therein.

Assessment of performance
We have simulated independent test data for the assessment of the prediction performance of the selected variables from the different methods. In particular, the root mean square error of prediction (RMSEP) and the root mean square error of prediction relative to the minimum achievable error (RMSEP minA = RMSEP/√(1 − R2)) were measured. The latter can be computed since the lower bound of the prediction error is known from the simulation design. Further, the accuracy of the selected variables was computed for each method.
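The relative error is a one-line computation; the design implicitly scales the response so that the minimum achievable error is √(1 − R2):

```python
import math

def rmsep_min_a(rmsep, r2):
    """Prediction error relative to the minimum achievable error,
    RMSEP_minA = RMSEP / sqrt(1 - R^2); a value of 1 means the method
    attains the theoretical lower bound of the simulation design."""
    return rmsep / math.sqrt(1.0 - r2)
```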

Optimization and parameter tuning
In the process of model fitting, the complexity parameters of the models must be set; for many methods, recommended values of these parameters are given, whereas for others, the parameters must be tuned according to some criterion. We have applied the most commonly used criterion in practical modeling and variable selection with PLS, namely minimizing the cross-validated prediction error measured as the root mean squared error of prediction (RMSEP). All PLS methods have at least one common complexity parameter, the number of components. In addition, there are some method-specific parameters for variable selection that need to be set for each method. To set the values of these parameters, cross-validation is again a frequently adopted procedure. Hence, double cross-validation is recommended, where the training data is further divided into training and test sets for model tuning. Double cross-validation is recommended for filter and embedded selection methods, while for wrapper methods, tuning both the number of PLS components and the other additional parameters at the same time is not recommended. In this situation, triple cross-validation (TCV) 3,41 provides a solution. Here, the training data are divided into training and test data for tuning the variable selection parameters, and then a further split into training and test data is used for selecting the number of components.

Meta analysis using mixed-effects ANOVA
The design factors with their chosen levels, as listed in Table 1, lead to 2^6 = 64 design levels. For each design level, we simulated five data sets, which were analyzed by all 18 variable selection methods. From the parameters defining the simulation models, the theoretical minimum prediction error was known for each simulation design. In order to compare the prediction performance of the various models across different designs, it is convenient to consider the observed prediction error relative to the minimum achievable; we referred to this variable above as RMSEP minA. A mixed-effects model was fitted to this response with the design factors p, n, pos, gamma, and method as main effects. In addition, all second- and third-order interactions between these were included. Further, a random effect of data set, with 64 × 5 = 320 levels, was included in the model to account for the inherent random "prediction difficulty" associated with each simulated training data set. The design factor R2 was not included in the analysis since the minimum achievable error has a 1:1 relation to this factor, and hence, the effect of R2 is removed when we study the relative prediction error. After an initial model fit in R and subsequent backward elimination of nonsignificant effects (5% test level), a reduced model was found, and the results are given below.

RESULTS AND DISCUSSIONS
This study provides the comparison of 17 variable selection methods in PLS. The optimized values of the tuning parameters for the various methods, as described in the methods section, and as set by the optimization criterion (ie RMSEP, see Section 4.3) are presented in Table 2.
For real-life data, the prediction error is the only way to assess the performance of a method, while in a simulation study, where we know which variables are actually important, we have other possibilities. The core of the study is the differences between methods in prediction error and variable selection accuracy. In Figure 1, we have displayed the average accuracy across all simulation sets for each method versus the average prediction error. The methods that on average perform best with regard to selection accuracy are SR and sMC, which reach an accuracy of 0.9, but SR has the lower prediction error. A large group of methods performs more or less equally with regard to both the accuracy of variable selection and prediction error: Trunc, JT, RC, BVE, ST, and REP. The lowest accuracy is observed for SPA and LW. With reference to prediction capability, JT and mRMR both perform worst, while the remaining methods have similar, lower prediction errors. Since prediction capability is the main focus of the article, we have discarded mRMR and JT from further analysis.
Variable selection methods are classified based on their construction into three groups: filter, wrapper, and embedded. With reference to computational time, the groups can be ranked from fastest to slowest as filter, embedded, and wrapper. Since wrapper and embedded methods pay more attention to the prediction error, the groups can be ranked from high to low performance as embedded, wrapper, and filter. This is also supported by the results, because with reference to better accuracy and minimum prediction error, the methods can be ranked as T2, BVE, and SR. T2 considers the multivariate structure of the LWs, which results in better performance. Similarly, among the wrapper methods, BVE selects the optimal model from stepwise elimination, in contrast to REP, which compromises performance for a smaller number of variables. Moreover, among the filter methods, SR attaches statistical significance, based on target projections, to the variable selection. (In addition to the tuned parameters in Table 2, some parameters were kept fixed; for instance, we used IPW with 10 iterations, GA with linear kernels and a population of size 1/p, SPA with 10 variables to be sampled in each run, BVE with threshold 0.05, and REP with its two tuning parameters fixed at 0.05 and 1.) As can be seen from the ANOVA table, there are several second-order interactions between the design factors and the variable selection methods that appear important. Further, all main effects are significant, and their effects were more or less as expected. The distribution of the prediction error relative to the minimum achievable (RMSEP minA) over the levels of the significant second-order interaction effects from the ANOVA is presented in Figure 2. To explore the conditions under which the methods differ, we start with the interaction plots for the significant second-order interactions.
The prediction error generally decreases when the number of observations increases and when the relevant information for the response is located in components with large variances (eigenvalues). The latter observation confirms the findings of Helland and Almøy (1994). 42 The results also indicate an increase in prediction error when the number of variables increases and when the relevant information for the response is located in components with large variances (eigenvalues), but for larger components, a reversed trend is observed, with an increase of the prediction error. The prediction error decreases as the fraction of relevant variables decreases. Moreover, there is a decrease in prediction error with an increasing number of variables, p, but this is likely an artifact of the simulation design, where the number of samples n was set as a function of p, so the largest values of p were associated with the largest values of n. A closer inspection of the prediction errors for values of p with equal n showed that the prediction error was quite constant with increasing p, indicating that PLS regression is capable of ignoring irrelevant predictors even without strict variable selection. This is the essence of the PLS algorithm, achieved through dimension reduction by response-guided latent variables. We find that the prediction error decreases when the number of observations increases and when the degree of decline, gamma, of the eigenvalues of Σx increases. The interaction plot of the number of variables and the degree of decline gamma indicates that with a low degree of decline, gamma = 0.01, the prediction error increases with the number of variables. However, a reversed trend is seen when the degree of decline increases to gamma = 0.95.
Moreover, the prediction error increases when the number of relevant variables increases and when the degree of decline, gamma, of the eigenvalues of Σx decreases.
With regard to revealing patterns between selection method and data properties, the following are the most interesting results concerning the prediction performances of the methods. In Figure 2, the interaction between the degree of decline, gamma, of the eigenvalues of Σx and the PLS variable selection method is displayed. On the vertical axis, a value of one means a prediction performance equal to the minimum achievable. As we move from left to right in the figure, the relevant information in the predictor matrix becomes more hidden in principal directions with steadily smaller variances. To the left, the variances of the components are more similar, with a less steep decline in eigenvalues; the information is harder to locate, and the prediction error is larger.
With regard to the performance of the methods, we can make the following observations:
• The LW method, which is purely based on selection by LWs, performs badly for all data types. 6
• Some methods struggle more than others when gamma = 0.01, that is, when there is a relatively low degree of collinearity in the data. This applies especially to RC, SR, IPW, BVE, T2, TRUNC, and WVC. These methods rely less on borrowing information between variables than the others and perform better when the collinearity is larger and it is easier to locate the informative directions in the predictor space. 43
• Several methods are quite similar in their performance across all data designs, but among these, SR appears to be the better. 10
• The commonly used RC performs well in all cases. 44
When choosing a method for variable selection, the stability of the method is also an issue to take into consideration. In Figure 3, the standard deviations of all prediction errors are displayed for each method. The variation is largest for JT and mRMR, followed by LW. All other methods have similar and lower variability in the prediction errors for the test data. Also included are the standard deviations of the selection accuracies. Here the pattern is somewhat different, with IPW and MCUVE as the most unstable, together with LW. T2 seems to be the best compromise with regard to overall stability.

CONCLUSIONS
In this work, we have compared a large range of PLS variable selection methods. Some of these are tailored specially for PLS, while others have more general application. The simulation framework applied allows for the creation of data that mimics several important aspects of real data, particularly with regard to the eigenvalue structure of the predictive variables.
From the results and discussion, it is evident that variable selection is no guarantee of improved predictions, though many methods improve on the pure PLS model in situations with low variable correlation. Some methods can, in general, be left aside due to low predictive performance (JT, mRMR) or bad selection accuracy (SPA, LW). Among the remaining methods, the structure of the data and the emphasis on either good selection, eg, for biomarkers, or good prediction will be important for the final choice of method. Among the best overall performers, we would like to point out BVE, T2, SR, and WVC, with BVE as a marginal winner on average RMSEP minA and T2 as the most accurate and stable selector among the best performers.