ANOVA with Multiple Comparison Tests

Purpose 

To provide an analysis of variance (ANOVA) test per selected attribute.

ANOVA examines sources of variation in the data: in sensory science it is often used to investigate whether variation in attributes is due to products, assessors and replicates (amongst other variables), and in consumer science to see if products differ in (say) liking.

Where there is an effect of product or assessor, the analyst needs to know where the differences lie: between which products or assessors. A multiple comparison test of mean scores is performed to achieve this. Several multiple comparison tests are available in the EyeOpenR ANOVA module.

Data Format

  1. profiling.xlsx

Note: for EyeOpenR to read your data, the first five columns must include the following in the specified order: Assessor, Product, Session, Replica and Sequence. Sensory attributes or consumer variables start from column six (Column F). If there is no session, replica or sequence information available, the user should input a value of “1” in each cell in the column that contains no collected information. See the example dataset for more information. 

Background 

ANOVA is one of the most commonly used statistical tests in consumer and sensory science. It allows variability within the data to be partitioned into several specified effects, for example, a Product effect, an Assessor effect, a Replicate effect, etc., as well as random error. The effects (aka terms) to include in an ANOVA model are chosen by the user from a shortlist in EyeOpenR, constrained by whether there are replicates, sessions or sequences in the data.

Common ANOVA models – sensory science: One of the most common models in sensory science is to model each sensory variable (y) by a Product effect, an Assessor effect and the interaction between the two (Product by Assessor), as well as random error. This allows the user to understand whether there are significant differences between products, between assessors, and whether there is a significant interaction term, which in this case would indicate that assessors rate the products differently on the chosen attribute. In other words, the interaction provides information on whether assessors disagree on product intensities for that attribute. Ideally, we would have agreement in the panel and thereby a non-significant interaction effect. In EyeOpenR, the Product effect is determined by comparing the variation between products to the level of disagreement amongst the panel/assessors. If there is variation between products and low disagreement amongst assessors, then we can be confident that there are differences between the products on that attribute. ANOVA can therefore be thought of as a signal-to-noise ratio, whereby the product differences are the signal and the level of disagreement is the noise.
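
As an illustration, the sketch below fits this model with base R's aov() and tests the Product effect against the Product-by-Assessor interaction. The data frame sensory and the attribute Sweetness are hypothetical, and EyeOpenR's internal implementation may differ:

    # Minimal sketch of the signal-to-noise test described above.
    # Assumed columns: Assessor, Product (factors), Sweetness (numeric score).
    fit <- aov(Sweetness ~ Product * Assessor, data = sensory)
    tab <- summary(fit)[[1]]
    rownames(tab) <- trimws(rownames(tab))   # summary() pads row names with spaces

    # Test Product against the Product:Assessor interaction, not the residual
    F_product <- tab["Product", "Mean Sq"] / tab["Product:Assessor", "Mean Sq"]
    p_product <- pf(F_product,
                    df1 = tab["Product", "Df"],
                    df2 = tab["Product:Assessor", "Df"],
                    lower.tail = FALSE)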

In sensory science it is also quite common to add session and/or replicate effects to the ANOVA model, depending on the data and research question.

Common ANOVA models – consumer science: As in sensory science, ANOVA is very popular in consumer science, where again, variability in a quantitative variable (e.g., liking, acceptance, willingness to pay) is partitioned into (say) a Product effect, a Consumer effect and random error. This allows the analyst to examine if there are differences between products in liking/acceptance (as measured by the product effect), and to see if there are differences between consumers (as measured by the consumer effect). Typically consumer tests have a much larger number of participants than a trained sensory panel: it is not surprising to see a significant consumer effect for this reason. However, including a consumer effect term in our ANOVA model helps to reduce the variability partitioned into random error. This is important as the denominator in the calculation of the product effect in consumer science ANOVA models is typically the random error (also known as the residual error). In other words, including a consumer effect helps to explain the variability in the data that is not due to random noise; as a result, we are more likely to see product differences due to the reduced noise.

Unlike sensory science, a product-by-assessor interaction effect is rarely used in consumer tests, as it is acknowledged that consumers can disagree on what products they may like/dislike. Commonly a consumer panel is not trained to perceive attribute intensities but to provide hedonic information, which can of course vary from individual to individual. The product-by-assessor interaction effect is therefore quite redundant in this context, as disagreement is expected and accepted (for example, differences in consumer liking can be the basis for subsequent cluster analysis). Rather, the noise measure in a consumer test is more likely to be the variation that is unaccounted for by the product and consumer effects in the model (i.e., the residual error of the model).

Importantly, if the user has requested an ANOVA model that does not include a particular variation source, then the user must not conclude this variation does not exist (Kemp, Hollowood & Hort, 2009). Rather the effect will be part of the residual error. For example, if there are replicates in the data and a replicate effect has not been chosen as a model term, the effect of replicate will be placed in the residual error term.

The user is required to input their level of significance prior to analysis: either 1% (0.01), 5% (0.05) or 10% (0.1). These numbers refer to the value of alpha: the risk of rejecting the null hypothesis when it is in fact true (also known as Type I error). The level of confidence is 1 - alpha: choosing a 5% significance level thus provides results with a 95% level of confidence. A ‘significant’ effect is indicated by a p-value in the EyeOpenR output lower than the alpha value the user set prior to analysis (e.g., p = 0.03 when alpha is 0.05 would be termed “significant”).

If there is a significant effect of, say, Product, then the user needs to understand which products differ (e.g., are Products A and B different? Products A and C? Products A and D? Etc.). This is the role of a multiple comparison test, which statistically compares the means of each product and outputs results in either a group or pairwise format. Several of the most commonly used multiple comparison tests are available in EyeOpenR.

Multiple comparison tests vary in their computation and conservatism: a ‘conservative’ test is less likely to report a significant difference between two products, whereas a more ‘liberal’ test is more inclined to report a difference. Conservative tests lower the risk of Type I error: that is, they lower the risk of incorrectly rejecting the null hypothesis of no difference. This means that conservative tests report fewer differences between products. Lowering the risk of Type I error, however, comes at the expense of increasing the risk of Type II error: more conservative tests are less likely to report a “significant difference” when there actually is a difference in reality. This is also known as being less powerful. More liberal tests, which are more inclined to report a significant difference between products, lower the risk of Type II error but increase the risk of Type I.

Because different multiple comparison tests can provide different results, the analyst may well ask ‘which multiple comparison test should I use?’. The answer should be decided prior to the experiment, depending on company policy, the number of products in the test, whether there is a reference product and whether particular pairwise comparisons are of interest versus all pairwise comparisons.

Many general statistics textbooks (e.g., Ott & Longnecker, 2015) and sensory-specific textbooks (e.g., O’Mahony, 1986) elaborate on the construction of multiple comparison tests and the differences between them. Fisher’s LSD is widely regarded as the most liberal (the most likely to report differences between products), whilst the others are more conservative to varying degrees, controlling for Type I errors. Being more conservative is often important when many products are compared at the pairwise level, as the risk of a Type I error increases quickly with each comparison (the familywise error rate = 1 - (1 - alpha)^k, where k is the number of pairwise tests). So, with six products and alpha initially set to 0.05, there are 15 pairwise tests, which equates to an error rate of 1 - (1 - 0.05)^15 = 0.54. Such an error rate is likely to be considered too high, and the analyst may therefore seek protection against it by using an alternative multiple comparison test.
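
This familywise error rate is quick to verify in R:

    # Familywise Type I error rate for all pairwise comparisons of 6 products
    alpha <- 0.05
    k <- choose(6, 2)    # 15 pairwise tests
    1 - (1 - alpha)^k    # approximately 0.54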

The reader is recommended to study statistical textbooks to further their understanding on the computation and conservatism of different tests (e.g., Ott & Longnecker, 2015).

Options

General options tab

  1. Treat Sessions/Replicates separately: If sessions or replicates are part of the design, the user can choose to specify these effects in the ANOVA model, in which case check the ‘No’ option (i.e., do not treat either as separate). On the other hand, if one would like to treat (e.g.) three replicates of a product as three distinct individual products, check the ‘Replicate’ option.
  2. Type of Mean: ‘Adjusted’ accounts for missing data or imbalance in the design, based on the model chosen via the parameters in the ‘Model’ tab; ‘Arithmetic’ calculates the raw mean of the data and is recommended for balanced data. ‘Arithmetic’ is the default in the case of no missing data.
  3. Number of Decimals for Values: Choose preferred number of decimal places in subsequent output (default=2).
  4. Number of Decimals for P-Values: Choose preferred number of decimal places in subsequent output (default=3).

Model tab

  1. Assessor Effect: ‘Yes’ refers to including an assessor effect in the ANOVA model. This is typically the case with sensory and consumer data. ‘No’ excludes the assessor effect.
  2. Type of assessor effect: If an assessor effect is included, the user must select ‘Random’ or ‘Fixed’. The latter, ‘Fixed’, focuses on the specific assessors participating, and the user should only draw conclusions about the performance of those specific assessors, not that of the wider population. As most sensory and consumer scientists want to say something about the wider population, ‘Random’ is typically chosen for the assessor effect. For example, in consumer science the researcher is not particularly interested in the specific consumers who participated in the study, but in the wider population that the recruited sample represents, and thereby selects ‘Random’. EyeOpenR can therefore calculate mixed-effects models, which include at least one fixed and one random effect.
  3. Session effect: ‘Yes’ to include a session effect in the ANOVA model; ‘No’ otherwise.
  4. Replicate effect: ‘Yes’ to include a replicate effect in the ANOVA model; ‘No’ otherwise.
  5. Sequence effect: ‘Yes’ to include a sequence effect (i.e., effect of presentation order) in the ANOVA model; ‘No’ otherwise.
  6. Interaction: If an assessor effect is included, this option refers to the inclusion/omission of the product by assessor interaction in the ANOVA model: ‘Yes’ to include the interaction term; ‘No’ to omit it. If the product by assessor interaction is included, a mixed-model ANOVA is performed, meaning the calculation of the F-statistic uses the interaction term, not the model residuals, as the denominator for the product and assessor main effects (i.e., to test for a significant product effect, the product mean square is tested against the product by assessor interaction mean square, not the model residual mean square). This is preferable because, especially with sensory data, the user wishes to test the product variability against the level of agreement/disagreement within the panel of assessors.
If a replicate effect is included (see above), then this option also adds a Product by Replicate interaction to the model. This could be of value when one wishes to evaluate whether certain products show differences across replicates.
  7. Include all interactions: ‘Yes’ to include any additional two-way interactions based on the effects selected as ‘Yes’ in the preceding model options. For example, if ‘Assessor’ and ‘Replicate’ effects are chosen in the model, then all interactions would also include the ‘Assessor by Replicate’ interaction (note: the ‘Product by Assessor’ and ‘Product by Replicate’ interactions are already included when ‘Interaction’ is set to ‘Yes’ in the preceding option). A user cannot have an interaction effect included if the main effect is excluded (e.g., it is not possible to include the ‘Product by Replicate’ interaction term if ‘No’ is selected for the parameter ‘Replicate Effect’).
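
To make these options concrete, the sketch below shows how they might translate into R model formulas of the kind fitted with lm (see Technical Information). This is purely illustrative: EyeOpenR constructs its models internally, and the data frame sensory and attribute Sweetness are hypothetical.

    # 'Assessor Effect' = Yes, 'Interaction' = Yes:
    m1 <- lm(Sweetness ~ Product + Assessor + Product:Assessor, data = sensory)

    # Adding 'Replicate Effect' = Yes with 'Interaction' = Yes also adds Product:Replicate:
    m2 <- lm(Sweetness ~ Product + Assessor + Replicate +
               Product:Assessor + Product:Replicate, data = sensory)

    # 'Include all interactions' = Yes additionally adds Assessor:Replicate:
    m3 <- lm(Sweetness ~ Product + Assessor + Replicate +
               Product:Assessor + Product:Replicate + Assessor:Replicate,
             data = sensory)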

Comparison of Means tab

  1. Choose Multiple Comparison test: The user can choose one of five multiple comparison tests: Tukey’s HSD, Fisher’s LSD, Newman-Keuls (SNK), Duncan and Dunnett. The first four are the most commonly used in sensory and consumer science. They differ in their risks of making Type I errors (false positives: rejecting the null hypothesis when it is true) and Type II errors (false negatives: failing to reject the null hypothesis when it is false); see the Background section above for discussion of their relative conservatism. Dunnett’s test allows the user to select a reference product, with the remaining products tested specifically against the reference. (An illustrative R sketch of these tests appears after this list.)

  2. Reference: Only applicable if Dunnett’s multiple comparison test has been chosen (see above).
  3. Type of test: Only applicable if Dunnett’s multiple comparison test has been chosen (see above). Type of test refers to whether the comparison of each product against the chosen reference is based on a one- or two-tailed hypothesis: ‘Two Sided’ reflects a two-tailed hypothesis (that there is a difference); ‘Greater’ tests whether products are greater than the chosen reference (one-tailed); ‘Less’ tests whether products are less than the reference (one-tailed).
  4. Display of Multiple Comparison test result: User can select ‘Pairwise’ or ‘Group’. This will be reflected in the subsequent ANOVA table that displays significant differences between products, per attribute. ‘Pairwise’ summarises the significance level associated with each paired comparison, presented in a table. Use this option if you wish to read pairwise comparisons between products. ‘Group’ will assign each product per attribute to a particular group based on significance testing: products not sharing the same group are significantly different at the chosen level of significance. See the Results section for more information.
  5. Significance level: Only applicable if display of multiple comparison test is at the group level: the user can select 1%, 5% or 10%. The percentages refer to the alpha level (risk of Type I error).
  6. Levels of signif. (pairwise): Only applicable if display of multiple comparison test is at the pairwise level: user can choose varying levels of significance which are presented in a summary table in the output.
  7. Y-axis scale: Only applicable if Dunnett’s multiple comparison test has been chosen (the y-axis applies to the graph of the chosen reference against each product). The user can set the y-axis to manual minimum and maximum values or use the default automatic option.
  8. Y-axis min value: Only applicable if Dunnett’s multiple comparison test has been chosen.
  9. Y-axis max value: Only applicable if Dunnett’s multiple comparison test has been chosen.
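
For users who wish to reproduce comparable comparisons outside EyeOpenR, the agricolae package (which EyeOpenR uses for SNK; see Technical Information) offers similar tests. The sketch below uses hypothetical data, and EyeOpenR’s own implementation may differ:

    # Illustrative group-style comparisons with the agricolae package
    library(agricolae)

    fit <- aov(Sweetness ~ Product + Assessor, data = sensory)  # hypothetical data

    lsd  <- LSD.test(fit, "Product", alpha = 0.05)     # Fisher's LSD (most liberal)
    hsd  <- HSD.test(fit, "Product", alpha = 0.05)     # Tukey's HSD (more conservative)
    snk  <- SNK.test(fit, "Product", alpha = 0.05)     # Newman-Keuls
    dunc <- duncan.test(fit, "Product", alpha = 0.05)  # Duncan's multiple range test

    lsd$groups   # products sharing a letter are not significantly different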

Results and Interpretation

ANOVA p-values tab

The ANOVA table shows the p-values associated with the specified ANOVA model. It is colour-coded to aid interpretation.

A sensory researcher may well wish to understand whether there are differences between products per attribute: this is reflected in the ‘Product’ column. In general, the researcher would like to see significant differences per attribute in this column, as these indicate that the sensory panel can discriminate between products on the attribute.

Likewise, the ‘Assessor’ and ‘Replica’ columns provide information on whether there are differences between assessors and replicates per attribute.

P-values for the requested interaction terms are then presented: small p-values are generally not wanted here. For example, a significant p-value (say p < .05) for the Product:Assessor interaction indicates significant disagreement among the sensory panel concerning that attribute. Product:Replica and Assessor:Replica can be interpreted in similar fashion if they are included in the model.

ANOVA Tables tab

Provides Sum of Squares (Sum Sq), Degrees of Freedom (DF), F-value and p-value for each term in the chosen ANOVA model, per attribute.

Sum Sq quantifies the variation in the data attributed to each term. DF informs the user of the number of levels free to vary within each model term. Mean Squares can be calculated by dividing the Sum Sq by the respective DF. The F value, also known as the F ratio, reflects the variability captured by each effect (the Mean Square; i.e., the signal) divided by the Mean Square of the error (i.e., the noise). The term used as the Mean Square of the error differs depending on whether a Product-by-Assessor interaction is included: if it is, the Mean Square of the error for the Product effect is the Product-by-Assessor Mean Square. In layman’s terms, the differences between products on a given attribute are compared to the level of disagreement in the (sensory) panel. If assessors are in disagreement, the Product-by-Assessor interaction will have a relatively high Sum Sq and thus a high Mean Square. As this is the denominator in the F-value calculation for the Product effect, a high level of disagreement will weaken the Product F value, lessening the ability to see significant differences between products. This is one reason why training a panel can improve its power or discriminability: lowering the level of disagreement amongst the panel lowers the noise, which makes the signal of product differences more salient.
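
In symbols, the calculation just described is:

    Mean Square = Sum Sq / DF
    F(Product) = MS(Product) / MS(Product-by-Assessor)   when the interaction is included
    F(Product) = MS(Product) / MS(Residual)              otherwise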

The last column gives the p-value: the probability of observing product differences at least as large as those in the data if, in reality, there were no difference (equivalently, the risk of a Type I error if the null hypothesis were rejected). Thus, if a Product p-value is p < 0.001 on a given attribute, there is less than a 0.1% chance of seeing such differences between products when in reality there is no difference. As this chance is very small, the (null) hypothesis that there are no differences between products on the given attribute is rejected, and the Product effect is deemed ‘significant’: there is a difference in intensity (somewhere) between products on this attribute. Typical thresholds used to determine significance in sensory and consumer science are 1%, 5% and 10%.

(Multiple Comparison) tab

Depending on the multiple comparison test chosen and whether the display of the test result is requested as ‘Group’ or ‘Pairwise’, the naming of the next set of tabs varies:

  1. For Fisher’s LSD: per variable with Group display, the LSD (Least Significant Difference) tab provides the adjusted or arithmetic means (depending on user choice) of each product, with significance groupings represented by letters. Groupings should only be interpreted when the Product effect in the ANOVA p-values table is significant (typically p < 0.05).
In the LSD tab, products sharing the same letter are not significantly different. The same information is provided in the LSD Letters tab, now with one group per column. LSD values are given per variable in the LSD values tab: these reflect the smallest (least) difference between product means that is deemed ‘significant’. A difference between two product means greater than the LSD value is reported as significant. To see both means and group allocation, see the LSD (Group) tab.
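
For a balanced design, the LSD value has a simple closed form. A minimal sketch in R, assuming the residual mean square (MSE) and its degrees of freedom come from the ANOVA table, with n observations per product mean (all values hypothetical):

    # Least Significant Difference for a balanced design (illustrative)
    lsd_value <- function(mse, df_error, n, alpha = 0.05) {
      qt(1 - alpha / 2, df_error) * sqrt(2 * mse / n)
    }
    lsd_value(mse = 1.8, df_error = 81, n = 30)   # hypothetical values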

The LSD Differences tab provides the difference in means for every two-product combination. Confidence intervals are then provided, alongside the p-value. To aid interpretation, the column ‘Sig.’ shows no, one, two or three asterisks, according to whether the difference is significant at the significance levels chosen by the user.

Finally, model information, such as whether arithmetic or adjusted means were used, is available in the Information tab. All results can be exported to Excel via the icon.

For Fisher’s LSD with the Pairwise option, the output is largely similar, except for the LSD (Pairwise) table found under the LSD (Pairwise) tab: this is a table of means with each product coded as a letter (e.g., Product1 = A). Each product appears as a separate column and each variable as a row: a cell contains the product’s mean score followed by the codes of the products whose means are significantly lower. So, if “67.03 B” appears in the cell for Product A on Attribute1, then Product A has a mean of 67.03 on Attribute1, which is significantly higher than that of the product coded B.
  2. Tukey’s HSD: The same interpretation as Fisher’s LSD, except for the calculation of the comparison threshold. Generally, one will see fewer significant differences between products with Tukey’s HSD than with Fisher’s LSD.
  3. Newman-Keuls (SNK): Like Tukey’s HSD and Fisher’s LSD, Newman-Keuls (SNK) provides various tabs of ANOVA output. The SNK test is designed to be more powerful than the Tukey test, meaning it will likely report more differences between products, and it is less conservative, meaning an increased risk of reporting differences that are not present.
One difference in output between SNK and both Tukey and Fisher is that confidence intervals for the difference between two product means cannot be calculated for the SNK test. Therefore no “Differences” tab appears in the SNK output.
  4. Duncan: Duncan’s New Multiple Range Test (Duncan) is another multiple comparison test offered. The literature disagrees on the validity of the test for consumer and sensory science (e.g., see Lawless & Heymann, 2010, and Lea, Næs & Rødbotten, 1997, for opposing perspectives).

    For the Duncan test, EyeOpenR calculates confidence intervals as the product differences plus or minus the relevant critical value. The critical values are scaled quantiles of the studentized range distribution, with degrees of freedom taken from the residual degrees of freedom of the ANOVA model, and with the range’s sample size equal to the number of means spanned (inclusive) by the two means being compared. The probability inverted is the confidence level raised to the power of the number of spanned means minus one.
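
A minimal sketch of that critical-value calculation in R (qtukey inverts the studentized range distribution; all values hypothetical):

    # Duncan critical range for two means spanning p means (inclusive)
    duncan_critical <- function(mse, df_error, n, p, conf = 0.95) {
      q <- qtukey(conf^(p - 1), nmeans = p, df = df_error)
      q * sqrt(mse / n)
    }
    duncan_critical(mse = 1.8, df_error = 81, n = 30, p = 3)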

  5. Dunnett: This multiple comparison test is used when the user wishes to compare products to a reference. The reference is selected in the options phase of the ANOVA analysis in EyeOpenR. Besides the standard ANOVA p-values, ANOVA tables and grouping/pairwise results, this test provides a graph depicting the chosen reference against each product, per variable (Intensity vs. Ref tab).

    The ‘Dunnett All’ tab provides equivalent output to the ‘Differences’ tab seen with the other multiple comparison tests: for each product, the difference in mean score per attribute is given (the reference mean subtracted from the test product mean), alongside the output of a statistical test comparing that product to the reference. To aid interpretation of each p-value, the column ‘Significant’ outputs either Yes or No at the requested significance level (the p-value reflects whether the user selected a two-tailed test, greater than reference, or less than reference in the options screen).

    The final tab of the output, ‘Dunnett Summary’, provides a table of each attribute by product, with a Yes/No per cell indicating whether that product is significantly different from the reference. Dunnett’s test is commonly used in sensory analysis when benchmarking in descriptive analysis, or with a Different From Control (degree of difference) test in discrimination testing.
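
For reference, comparisons of this kind can be sketched with the multcomp package (used by EyeOpenR for Dunnett; see Technical Information). The data frame sensory, attribute Sweetness and reference level "Control" below are hypothetical:

    library(multcomp)

    # Make the chosen reference the baseline level of Product
    sensory$Product <- relevel(sensory$Product, ref = "Control")
    fit <- aov(Sweetness ~ Product + Assessor, data = sensory)

    # 'alternative' mirrors the Type of test option: "two.sided", "greater" or "less"
    dun <- glht(fit, linfct = mcp(Product = "Dunnett"), alternative = "two.sided")
    summary(dun)   # adjusted p-value for each product vs. the reference
    confint(dun)   # simultaneous confidence intervals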

Technical Information

  1. R packages: car, SensoMineR, psych (for Fisher’s LSD and Tukey’s HSD), agricolae (for SNK), multcomp (for Dunnett) and stats (lm is used to fit the linear model).

References 

  1. Kemp, S.E., Hollowood, T. & Hort, J. (2009) Sensory Evaluation: A Practical Handbook. Wiley.
  2. Lawless, H. T., & Heymann, H. (2010). Sensory Evaluation of Food: Principles and Practices (2nd ed.). Springer-Verlag.
  3. Lea, P., Næs, T. & Rødbotten, M. (1997). Analysis of Variance for Sensory Data. Wiley. 
  4. O’Mahony, M. (1986). Sensory Evaluation of Food: Statistical Methods and Procedures. Marcel Dekker.
  5. Ott, R., & Longnecker, M. (2015). An Introduction to Statistical Methods and Data Analysis (7th ed.). Brooks Cole.