Do you have a question related to multivariate data analysis? Each month our expert panel will select a handful of the most popular or unique questions to answer so you can get expert advice on choosing the right tools and scientific methods for your data analysis needs…all for free!

**Previous Questions and Answers**

**Question:** What prevents a matrix from being classified as spectra? When I right-click on a data matrix that I imported from Excel (with or without transposing it afterwards), the spectra option is grayed out and cannot be selected. I am importing spectral data, so it would be really helpful to set it as such, but I can’t figure out what the problem is.

**Answer:** To use the option to designate data as spectral, a column set needs to be defined (even if all the columns in the data are spectra). Once this is done, highlight the column set in the project navigator, right-click, and toggle on “Spectra”. With this setting, the loadings plots in the PCA overview will by default be shown as line plots, and for PLS regression, the regression coefficients will be displayed in the PLS overview as a line plot.

…………………………………………………………………………………………………………………………………………………

**Question:** What is the difference between LDA and the “discriminant analysis” offered in other software? (Damien E)

**Answer:** Discriminant analysis is a generic term for models that aim to classify data into groups or classes, and where the class information is used as input to the model. In Unscrambler we have the classical method Linear Discriminant Analysis (LDA), which aims to maximize the spread between classes while minimizing the within-class sample distances using a Bayesian approach. We also have non-linear extensions of LDA such as Quadratic Discriminant Analysis (QDA), as well as Support Vector Machine Classification (SVMC) for non-linear class separation. A very flexible discriminant analysis method is to use regular Partial Least Squares Regression (PLSR) with binary class information (0/1) in the response matrix. You can then assign samples to classes based on the predicted responses, and use the available validation options to establish a conservative estimate of future classification performance.
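As a rough illustration of the PLSR approach, the sketch below builds a 0/1 response matrix from class labels and assigns each sample to the class with the largest predicted response. The function names, the example numbers and the "largest predicted value wins" rule are assumptions made for illustration; the regression step itself is left out, and the predicted values are simply typed in.

```python
def dummy_encode(labels, class_names):
    """Build a 0/1 response matrix: one column per class, one row per sample."""
    return [[1 if label == name else 0 for name in class_names] for label in labels]


def assign_classes(predicted, class_names):
    """Assign each sample to the class whose predicted response is largest.

    This decision rule is an assumption of the sketch, not a fixed rule.
    """
    assigned = []
    for row in predicted:
        if len(row) != len(class_names):
            raise ValueError("each row needs one predicted value per class")
        best = max(range(len(row)), key=lambda j: row[j])
        assigned.append(class_names[best])
    return assigned


classes = ["A", "B"]
y = dummy_encode(["A", "B", "A"], classes)
print(y)  # [[1, 0], [0, 1], [1, 0]]

# Hypothetical predicted responses from a regression model fitted to y.
y_pred = [[0.92, 0.05], [0.30, 0.71], [0.55, 0.48]]
print(assign_classes(y_pred, classes))  # ['A', 'B', 'A']
```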

…………………………………………………………………………………………………………………………………………………

**Question:** How do I interpret the Deviation in PLSR prediction? (Tom P)

**Answer:** The Deviation can be used to assess the uncertainty of a predicted value, and it is particularly useful when the reference value is not known. It is estimated from the prediction error variance from the validation, the sample leverage and the sample residual X-variance. A large deviation indicates that the sample is not similar to the samples used to make the calibration model. It can be interpreted as a “standard error of prediction”, and a confidence interval can be generated using the t-distribution function. This implies that for any moderately sized data set, an approximate 95% confidence interval is given by Ypred ± 2 × Deviation.
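The rule of thumb can be written out in a few lines. This is a minimal sketch: the function name and the example numbers are made up, and the multiplier of 2 is the approximation mentioned in the answer rather than an exact t-quantile.

```python
def prediction_interval(y_pred, deviation, k=2.0):
    """Return (lower, upper) as y_pred -/+ k * deviation.

    k = 2 approximates the 95% t-quantile for moderately sized data sets.
    """
    if deviation < 0:
        raise ValueError("deviation must be non-negative")
    return y_pred - k * deviation, y_pred + k * deviation


# Hypothetical predicted value and Deviation for one new sample.
lower, upper = prediction_interval(12.5, 0.4)
print(f"approx. 95% interval: {lower:.2f} to {upper:.2f}")
```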

…………………………………………………………………………………………………………………………………………………

**Question:** How are the uncertainty limits calculated for PLSR regression coefficients? (Matti M)

**Answer:** The uncertainty limits are calculated based on the magnitude of the regression coefficients (B) and their stability during cross-validation. The null hypothesis for a variable is that B=0, implying that variables with regression coefficients significantly different from zero are important. The uncertainty test is similar to a t-test, where the calibration model B (Bcal) is used in the numerator. Because cross-validation segments are not independent, the denominator variance is calculated as the sum of squared differences between Bm and Bcal, multiplied by (M-1)/M. Here, Bm denotes B for a cross-validation segment, and M is the number of segments. Note that you choose the level of validation when you set up the cross-validation (e.g. it is important to leave out all replicates of a sample in a single segment to achieve conservative estimates).
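The calculation described above can be sketched as follows. The function name and the example coefficients are hypothetical, and taking the square root of the variance to form a t-like ratio is an assumption of this sketch; it is not a reproduction of the software's internal code.

```python
import math


def coefficient_t_values(b_cal, b_segments):
    """t-like ratio per variable: Bcal / sqrt(sum_m (Bm - Bcal)^2 * (M - 1) / M).

    b_cal      -- list of K calibration-model coefficients
    b_segments -- list of M lists, each holding the K coefficients of one segment
    """
    m = len(b_segments)
    if m < 2:
        raise ValueError("need at least two cross-validation segments")
    t_values = []
    for k, b in enumerate(b_cal):
        sum_sq = sum((segment[k] - b) ** 2 for segment in b_segments)
        variance = sum_sq * (m - 1) / m
        if variance == 0:
            # Perfectly stable coefficient: the ratio is undefined, so flag it.
            t_values.append(math.nan)
        else:
            t_values.append(b / math.sqrt(variance))
    return t_values


# Hypothetical coefficients for two variables and four segments.
b_cal = [0.50, 0.02]
b_segments = [[0.48, 0.10], [0.52, -0.08], [0.51, 0.05], [0.49, -0.03]]
for name, t in zip(["x1", "x2"], coefficient_t_values(b_cal, b_segments)):
    print(name, round(t, 2))
```

In this made-up example the first coefficient is large and stable across segments, so its ratio is large; the second one changes sign between segments and gets a ratio close to zero.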

…………………………………………………………………………………………………………………………………………………

**Question:** Hello, I have some data, but I am unsure of the best test to use for analysis. My data set consists of two groups of samples. There are 5 samples in each group. Each of the five samples was tested multiple times by different people. The response is a rating scale. Which test should be used in order to determine if there are differences between group A and group B? Thanks! (Erin B)

**Answer:** First you must check whether there are level or range differences between the different assessors. Use descriptive statistics for the response variables and use sample grouping on the quantiles plot to see if there are significant differences in how different people use the scale. This requires that each assessor has tested both groups of samples, otherwise there is no way to tell whether differences are due to the sample groups or due to individual preferences. Significant differences between assessors can be corrected for by subtracting the mean and dividing by the standard deviation for each assessor (taken across samples).
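The per-assessor correction is a plain standardization. A minimal sketch, with made-up assessor names and ratings, and assuming the sample standard deviation (n - 1 in the denominator):

```python
from statistics import mean, stdev


def standardize_by_assessor(ratings):
    """Subtract each assessor's mean and divide by that assessor's std. deviation.

    ratings -- dict mapping assessor name to a list of ratings across samples
    """
    standardized = {}
    for assessor, values in ratings.items():
        mu = mean(values)
        sd = stdev(values)  # sample standard deviation, an assumption here
        if sd == 0:
            raise ValueError(f"assessor {assessor!r} gave identical ratings")
        standardized[assessor] = [(v - mu) / sd for v in values]
    return standardized


# Hypothetical ratings: Ann uses the top of the scale, Ben the whole range.
ratings = {"Ann": [7, 8, 8, 9, 7], "Ben": [2, 5, 6, 9, 3]}
for assessor, values in standardize_by_assessor(ratings).items():
    print(assessor, [round(v, 2) for v in values])
```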

If there is only a single measured response, you can use a Student’s t-test to compare the means of the two groups. For multiple responses you should look into PCA classification (SIMCA) or Linear Discriminant Analysis (LDA).
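For the single-response case, the pooled-variance t statistic can be computed by hand. The sketch below assumes one value per sample (for example the mean rating across assessors) and equal variances in the two groups; the numbers are invented.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic with pooled variance, plus its d.o.f."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, dof


# Hypothetical mean ratings for the five samples in each group.
group_a = [6.2, 5.8, 6.5, 6.0, 6.4]
group_b = [5.1, 5.6, 4.9, 5.3, 5.0]
t, dof = pooled_t_statistic(group_a, group_b)
# With 8 degrees of freedom, the two-sided 5% critical value is 2.306.
print(f"t = {t:.2f}, dof = {dof}, significant at 5%: {abs(t) > 2.306}")
```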

…………………………………………………………………………………………………………………………………………………

**Question:** Dear Camosoft team, in the Regression Overview of PLSR, correlation coefficient and R-square were shown. My understanding is that the square of correlation coefficient is R-square. Is it right? I got the result of PLSR in which the value of R-square was higher than the value of the square of correlation coefficient? Thank you. (Naomi O)

**Answer:** R-square is calculated from the Explained Variance for the given number of components, as given in the Explained Variance plot. The Correlation is taken directly between the predicted response and the reference vector, as plotted in the Predicted vs. Reference plot. The squared Correlation is therefore not the same as R-square, though they are similar in most cases. To avoid confusion we have added R² (Pearson) as a separate entry in The Unscrambler X v.10.2, which is the squared Correlation value.

Note that these values are given both for the calibration model and for the validation, and validation settings have to be chosen carefully to give relevant error estimates.
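A small numerical sketch shows that the two quantities need not coincide. It assumes one common textbook definition of explained variance, 1 - SSE/SST; the exact formula used by the software is not reproduced here, so the direction and size of the gap in a real model may differ. The function names and data are made up.

```python
from statistics import mean


def r2_explained(reference, predicted):
    """1 - SSE/SST: share of the reference variance explained by the predictions."""
    y_bar = mean(reference)
    sse = sum((y - p) ** 2 for y, p in zip(reference, predicted))
    sst = sum((y - y_bar) ** 2 for y in reference)
    return 1 - sse / sst


def r2_pearson(reference, predicted):
    """Squared Pearson correlation between reference and predicted values."""
    y_bar, p_bar = mean(reference), mean(predicted)
    cov = sum((y - y_bar) * (p - p_bar) for y, p in zip(reference, predicted))
    ss_y = sum((y - y_bar) ** 2 for y in reference)
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov ** 2 / (ss_y * ss_p)


# Hypothetical predictions that follow the reference perfectly in rank and
# shape, but with an offset and a slope different from 1.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.4, 3.3, 4.2, 5.1]
print(round(r2_explained(reference, predicted), 3))
print(round(r2_pearson(reference, predicted), 3))
```

The correlation only measures how well the points follow a straight line, while the explained variance also penalizes offset and slope errors in the predictions.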

…………………………………………………………………………………………………………………………………………………