Dr. Frank Westad, CAMO Software
The latent variable methods PCA, PCR and PLSR are linear, or more correctly bilinear as they are linear in both scores and loadings. Many real-world processes are inherently non-linear in one way or another. The non-linearity can be in terms of the overall relationship between two sets of variables (X, Y) or on a more individual basis for some specific variables based on first-principle or deterministic models. Both curvature and more or less known underlying phenomena in the system might introduce non-linearity in the data. The non-linear effects can be handled in different ways:
1. Pre-processing with the purpose of removing non-linearity
2. Transform variables based on a priori knowledge
3. Include interaction and square terms (and optionally higher-order terms) as a rough approximation
4. Use non-linear methods
Non-linear methods have been developed for most linear and bilinear regression methods. In PLS regression, two sets of scores are computed for every factor a: the x-scores, t_a, and the y-scores, u_a. These are the basis for the so-called inner relation, which is found as the least squares solution. An alternative is to model the inner relation non-linearly (polynomial functions, spline functions). Artificial neural networks (ANN) have been investigated thoroughly over the past years as a non-linear family of methods. More recently, Support Vector Machines for classification and regression have shown good figures of merit. Other methods aim at making local linear models, e.g. Locally Weighted Regression.
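As a sketch of what the inner relation looks like for factor a, the linear case and one possible polynomial alternative can be written as follows. The symbols b_a, c_0a, c_1a, c_2a and the residual vector h_a are notation assumed here for illustration and do not come from the text above; the quadratic form is only one example of a non-linear inner relation.

```latex
% Linear inner relation for factor a: y-scores regressed on x-scores,
% with the slope given by the least squares solution
\mathbf{u}_a = b_a\,\mathbf{t}_a + \mathbf{h}_a,
\qquad
b_a = \frac{\mathbf{u}_a^{\top}\mathbf{t}_a}{\mathbf{t}_a^{\top}\mathbf{t}_a}

% One possible non-linear alternative: a quadratic polynomial in the x-scores
% (\odot denotes the element-wise product, \mathbf{1} a vector of ones)
\mathbf{u}_a = c_{0a}\,\mathbf{1} + c_{1a}\,\mathbf{t}_a
             + c_{2a}\,(\mathbf{t}_a \odot \mathbf{t}_a) + \mathbf{h}_a
```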
Below is a suggested general procedure for how to include terms that express non-linearity within the family of linear and bilinear methods:
1. Make an initial model with the linear terms and relevant variable transformations based on a priori knowledge, e.g. Viscosity = f(Temperature²).
2. Decide on significant variables from a suitable test of significance, e.g. cross-validation at the correct scientific level (see below).
3. Recalculate with significant variables and include interaction terms and square terms (I & S). Remove non-significant variables.
4. Recalculate with the significant variables only.
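The expansion of the data table in step 3 can be sketched as below. This is a minimal illustration in plain Python, not tied to any particular software package; the function name `add_interaction_and_square_terms` and the example variable names are hypothetical.

```python
from itertools import combinations


def add_interaction_and_square_terms(names, rows):
    """Append pairwise interaction (I) and square (S) terms to a data table.

    names: list of variable names, e.g. ["Temp", "pH"]
    rows:  list of objects, each a list of floats in the same order as names
    Returns the extended list of names and the extended rows.
    """
    # All unordered pairs of distinct variables give the interaction terms
    pairs = list(combinations(range(len(names)), 2))
    new_names = (
        list(names)
        + [f"{names[i]}*{names[j]}" for i, j in pairs]
        + [f"{n}^2" for n in names]
    )
    new_rows = []
    for row in rows:
        interactions = [row[i] * row[j] for i, j in pairs]
        squares = [x * x for x in row]
        new_rows.append(list(row) + interactions + squares)
    return new_names, new_rows


# Hypothetical example: two variables, two objects
names, table = add_interaction_and_square_terms(
    ["Temp", "pH"], [[1.0, 2.0], [3.0, 4.0]]
)
print(names)  # ['Temp', 'pH', 'Temp*pH', 'Temp^2', 'pH^2']
print(table)  # [[1.0, 2.0, 2.0, 1.0, 4.0], [3.0, 4.0, 12.0, 9.0, 16.0]]
```

In practice the variables are often mean-centred before the products are formed, which reduces the correlation between the new terms and the linear terms. That step is left out here to keep the sketch short.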
The term “significant” might be replaced by the looser term “relevant” when aspects other than statistical significance are also part of the data analysis.
Validation is even more essential when performing non-linear modelling, as random non-linear tendencies and outlying objects might be modelled as “true” non-linear relations. The objects in a data table can often be stratified into groups based on background information about the origin of the objects. Such groups are a consequence of the experimental set-up of the study. Typical stratifications are:
– Across instrumental replicates (repeatability)
– Reproducibility (analyst, instrument, reagent…)
– Sampling site and time
– Across treatment/origin (year, raw material batch, lot ID…)
Cross-validation (CV) performed at the various grouping levels gives important information about the stability of the model and about which sources of variation need special attention. Thus, even if a test set has been defined as the proper way of validating the model (or the process or system in a wider context), the calibration set must still be validated with CV at the appropriate level. If not, the model dimensionality may not be conservative enough, and the test set is then predicted with a suboptimal number of variables or factors.
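One way to cross-validate at a given grouping level is to keep all objects from the same group out together, so that, for example, replicates of one batch never appear in both the calibration and the validation segment. The sketch below only builds the segments; the function name `leave_one_group_out` and the batch labels are hypothetical, and fitting the model on each calibration segment is left to whichever software is used.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, calibration_indices, validation_indices).

    groups: one label per object, e.g. the raw material batch of each sample.
    All objects sharing a label are held out in the same segment.
    """
    # dict.fromkeys keeps the first occurrence of each label, in order
    for held_out in dict.fromkeys(groups):
        cal = [i for i, g in enumerate(groups) if g != held_out]
        val = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, cal, val


# Hypothetical example: six objects from three raw material batches
batch = ["b1", "b1", "b2", "b2", "b3", "b3"]
segments = list(leave_one_group_out(batch))
for held_out, cal, val in segments:
    print(held_out, cal, val)
# b1 [2, 3, 4, 5] [0, 1]
# b2 [0, 1, 4, 5] [2, 3]
# b3 [0, 1, 2, 3] [4, 5]
```

Repeating the validation with labels for a different stratification (instrument replicate, analyst, sampling site) and comparing the resulting validation errors indicates which source of variation the model is least stable against.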
Some additional comments regarding item 3 of the procedure above: It is rare that a variable is not significant as a linear term when the “true” relation is somewhat non-linear, except for a pure “bell-shaped” relation. Such cases will be revealed when interpreting the available model plots, such as correlation loadings, so the danger of losing vital information is rather low. The score plot may change its appearance when I & S terms are included, and the model can have lower explained validation variance than the first linear model. The model performance will usually improve when only the significant variables are used in the recalculation. Validation becomes even more important for non-linear methods and when non-linear terms for individual variables are added, as the danger of overfitting increases and outliers in a linear model may be modelled as a non-linear trend.
Although PCR and PLSR are linear methods, they have been shown to be suitable also for data with inherent non-linearity. This sometimes comes at the cost of using one or two more factors than a direct non-linear method.