Best Worst Scaling

Purpose 

To provide analysis of data from a Best Worst Scaling (aka Max-Diff) study. Best Worst Scaling (BWS) presents a subset of items from a longer list to a consumer, who must then select the best and worst item amongst the subset. The process can be repeated so that a consumer evaluates different subsets of items, according to an experimental design. The experimental design can be constructed in EyeQuestion and the analysis of such data can be performed in EyeOpenR.

Data Format

  1. Best_Worst_Scaling.xlsx
Note that for EyeOpenR to read your dataset the first five columns must be: Assessor, Product, Session, Replica and Sequence. The consumer BWS response should be in column six (F). 

The Assessor column denotes the consumer. The Product column refers to the product or item presented: this column codes the variable that you are interested in (whether it be products, claims, packaging designs, etc.). The Session and Replica columns are often redundant for BWS: enter the numeric value of “1” in each cell unless there are sessions or replica assessments in your data. The Sequence column refers to the question number, or using BWS terminology, the set number. So, if four items (P01, P02, P03, P04) were presented to consumer one (J01) as the first BWS question, then the Sequence value for the items would be “1”, as this is the question number. 

Column F provides the consumer response to the BWS question: “1” indicates the best/most preferred option, and only one of the presented items can have this value per question (Sequence); “-1” indicates the worst/least preferred item, and again only one item can have this value per question; “0” indicates neither most nor least preferred (i.e., a mid-pack option). All items that are not best or worst are given this value.
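
The coding rules above can be checked before import. The following is a minimal sketch in Python, not part of EyeOpenR: the rows and the function name `check_bws_rows` are hypothetical, and the tuple order simply mirrors the six columns described above.

```python
from collections import defaultdict

# Hypothetical rows in the column order described above:
# (Assessor, Product, Session, Replica, Sequence, Response)
rows = [
    ("J01", "P01", 1, 1, 1, 1),
    ("J01", "P02", 1, 1, 1, 0),
    ("J01", "P03", 1, 1, 1, -1),
    ("J01", "P04", 1, 1, 1, 0),
    ("J01", "P02", 1, 1, 2, 0),
    ("J01", "P05", 1, 1, 2, 1),
    ("J01", "P06", 1, 1, 2, 0),
    ("J01", "P07", 1, 1, 2, -1),
]

def check_bws_rows(rows):
    """Return a list of problems; an empty list means the coding looks valid."""
    # Group the responses by question: one set per assessor, session, replica, sequence.
    sets = defaultdict(list)
    for assessor, product, session, replica, sequence, response in rows:
        sets[(assessor, session, replica, sequence)].append(response)

    problems = []
    for key, responses in sorted(sets.items()):
        if any(r not in (-1, 0, 1) for r in responses):
            problems.append((key, "response outside -1/0/1"))
        if responses.count(1) != 1:
            problems.append((key, "needs exactly one best (1)"))
        if responses.count(-1) != 1:
            problems.append((key, "needs exactly one worst (-1)"))
    return problems

print(check_bws_rows(rows))
```

An empty list means every question has exactly one best and one worst item; any entry names the question (assessor, session, replica, sequence) that needs correcting.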

See the example dataset for a specific illustration of data format for BWS.

Background 

The analysis provides a Best Worst Scaling (BWS) analysis, also known as Max-Diff analysis (Maximum Difference Scaling). This is a popular consumer science technique that has been used since the early 1990s. BWS quantifies the preference of a number of attributes by repeatedly presenting a subset of attributes from a longer list and asking the consumer to select the best and worst attribute from those presented. In this respect, BWS differs from popular methods that ask consumers either to rate each option on an ordinal scale (e.g., a 5pt scale) or to rank options from most to least preferred. From this perspective, it is argued that BWS better reflects the decision making that a consumer typically performs in everyday scenarios. To further describe the method and outline some advantages, we will take an example scenario.

Imagine you’re working on a new packaging design for a snack product that will feature a single claim on the front-of-pack. The claim must appeal to your consumer market. A problem faced is that you don’t know which claim to go with: there are many potential claims that you could use (“new improved recipe”, “high in fibre”, “low in salt”, “no artificial ingredients”, “organic”, etc.). Say you have 15 potential claims and wish to identify the best one with a data-driven approach and quantify how much better the winning claim is. Traditional approaches are problematic: perhaps you will not find many significant differences with a rating task, which is also time consuming with 15 attributes, whilst the ranking task seems both difficult and rather artificial when there are so many options (e.g., do consumers ever rank 15 items in real life?). BWS offers a viable solution to the problem. It is commonly used when the number of potential attributes is greater than those that can be presented at once.

Using BWS, the consumer is simply tasked to select their best and worst option out of a subset of the whole list of attributes. So, using our example as outlined above, a consumer is presented with a set of four of 15 claims (items) and asked to select the best and worst. The process is repeated with different sets of four items. The total number of sets seen per consumer is established by way of experimental design. The optimal design for BWS should reflect three key features: (i) that each item occurs an equal number of times; (ii) that each item is paired with each other item an equal number of times; and (iii) that each item occurs in each position within sets an equal number of times. EyeQuestion is one of several software packages that can be used to generate BWS designs. It is very important that an experimental design is in place prior to data collection.
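
The three balance features can be checked by simple counting. The sketch below is illustrative only: the design (7 items in 7 sets of 3) and the function name `balance_report` are hypothetical and are not an EyeQuestion export.

```python
from collections import Counter
from itertools import combinations

# Hypothetical design: 7 items (A-G) shown in 7 sets of 3.
design = [
    ("A", "B", "D"),
    ("B", "C", "E"),
    ("C", "D", "F"),
    ("D", "E", "G"),
    ("E", "F", "A"),
    ("F", "G", "B"),
    ("G", "A", "C"),
]

def balance_report(design):
    """Check features (i) to (iii) for a design given as a list of equally sized sets."""
    set_size = len(design[0])
    # (i) how often each item occurs
    item_counts = Counter(item for s in design for item in s)
    # (ii) how often each pair of items occurs together in a set
    pair_counts = Counter(frozenset(p) for s in design for p in combinations(s, 2))
    # (iii) how often each item occurs in each position within a set
    position_counts = Counter((pos, item) for s in design for pos, item in enumerate(s))

    items = sorted(item_counts)
    all_pairs = [frozenset(p) for p in combinations(items, 2)]
    return {
        "items_balanced": len(set(item_counts.values())) == 1,
        # Pairs that never occur together count as 0 (Counter default).
        "pairs_balanced": len({pair_counts[p] for p in all_pairs}) == 1,
        "positions_balanced": len(
            {position_counts[(pos, i)] for pos in range(set_size) for i in items}
        ) == 1,
    }

print(balance_report(design))
```

In practice a design may only be approximately balanced; in that case the raw counts are more informative than the true/false flags returned here.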

BWS offers several advantages: the task is one consumers are familiar with (selecting their best and least preferred option); it is more efficient than running numerous paired preference tests; from a statistical perspective, BWS may be more discriminative (powerful) than other approaches (i.e., it tends to find differences between items where other methods report non-significant differences – e.g., see Jaeger et al., 2008).

As stated in the ‘Data Format’ section, the consumer response to a BWS question is coded in the data file as follows: ‘1’ signifies best, ‘-1’ signifies worst and ‘0’ is used otherwise (i.e., the item was chosen as neither best nor worst).
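
This coding makes it straightforward to tally how often each item was shown, chosen as best, chosen as worst or not chosen, which are the counts reported in the LogReg table. A minimal sketch, with hypothetical data and a hypothetical function name `tally_counts`:

```python
# Hypothetical (Product, Response) pairs taken from the Product and response columns.
responses = [
    ("P01", 1), ("P02", 0), ("P03", -1), ("P04", 0),   # first set
    ("P01", 0), ("P02", 1), ("P03", 0), ("P04", -1),   # second set
]

def tally_counts(responses):
    """Count shown / best / worst / not chosen per item from the 1, -1, 0 coding."""
    counts = {}
    for product, response in responses:
        row = counts.setdefault(
            product, {"shown": 0, "best": 0, "worst": 0, "not_chosen": 0}
        )
        row["shown"] += 1
        if response == 1:
            row["best"] += 1
        elif response == -1:
            row["worst"] += 1
        else:
            row["not_chosen"] += 1
    return counts

for product, row in tally_counts(responses).items():
    print(product, row)
```

For every item, the best, worst and not-chosen counts add up to the number of times the item was shown.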

Options

  1. Reference: In order to perform the statistical analysis (multinomial logistic regression), BWS requires the user to select a reference product, even if there is no reference in the actual set of items you’re testing. The selected reference will be given a coefficient value, also known as utility, of 0. Coefficients of other items are relative to this reference. See Results and Interpretation for more information.
  2. Number of Decimals for Values: Choose preferred number of decimal places in subsequent output (default=2).
  3. Number of Decimals for P-Values: Choose preferred number of decimal places in subsequent output (default=3).

Results and Interpretation

LogReg tab (multinomial logistic regression) 

BWS analysis in EyeOpenR uses a multinomial conditional logistic regression model. Put simply, this approach provides an overall probability of each item being chosen as best, as if all items were presented to the consumer at once. So, the analysis allows one to compare the probability of an item being selected as best versus the competition. This can also be seen as a quantified overall rating of all items, from best to worst. In addition to being able to compare an item with all others, one is also able to use the EyeOpenR output to quantify the probability of an item being chosen over a specific set of competitors. All this information is contained in the LogReg table:

  1. Coefficient: The statistical analysis returns the chosen Reference item to have a coefficient (aka utility) value of 0. Items with positive coefficients (above 0) are more preferred and items with negative coefficients are less preferred, with respect to the chosen reference.
  2. Exp(Coeff): The values in this column are e, the base of natural logarithms (approx. 2.718), raised to the power of the coefficient value shown in the first column. These values allow the analyst to work out the probability of one item being chosen over specific other items. To compare two items, divide the chosen item’s Exp(Coeff) value by the sum of the Exp(Coeff) values of the two items of interest. This gives the probability of that item being chosen in a pairwise test. The calculation can be extended to model choices from sets of three or more by increasing the number of items summed in the denominator. Finally, note that Exp(0) = 1, so a value of 1 is used in any comparison involving the reference item.
  3. Lower limit (95%): The lower confidence limit (95%) of the Exp(Coeff) value in the second column. Note that the confidence interval itself is non-symmetric since it has been computed by exponentiating a standard 95% confidence interval of the coefficient value in the first column.
  4. Upper limit (95%): The upper confidence limit (95%) of the Exp(Coeff) value in the second column (see note above for lower limit).
  5. Std. Error: The standard error of the coefficient value in the first column. Generally these values will be similar across items if the design is well balanced (see Background).
  6. z-Score: The coefficient value divided by its standard error, or the distance of each item from the reference in standard error units. If the design is approximately balanced, where each item is shown equally often, then this column provides similar information to the Coefficient column.
  7. P value: Provides the p-value from the z-test, which assesses the difference in coefficients between an item and the reference. P-values below a chosen threshold indicate a significant difference. In most instances, the threshold is set at 0.05.
  8. Choice Prob.: Choice probability. Provides the probability of an item being chosen as best as if all items were presented simultaneously. The probability per item is between 0 and 1. The sum of the probabilities will be 1. The higher the value, the more likely the item would be chosen as best. The value in this column is calculated by dividing the respective Exp(Coeff) by the sum of Exp(Coeff). Note that values in the Choice Prob. column will be the same regardless of the reference product.
    If the user is interested in only comparing (say) two items, it is also possible to quickly calculate the choice probability manually. Simply divide the Exp(Coeff) value given in the table by the sum of the respective Exp(Coeff) values of interest. For two products, this probability will be the chance of one being chosen as if in a pairwise test. This calculation can be extended beyond two items, following the same logic.
  9. Times Shown: The total number of times an item is presented.
  10. Times Best: The total number of times an item is selected as best.
  11. Times Worst: The total number of times an item is selected as worst.
  12. Times Not Chosen: The total number of times an item is selected as neither best nor worst. Note that the summed counts in the Times Best, Times Worst and Times Not Chosen columns should equal Times Shown.
  13. Choice of reference: Once the user has assessed this result table for the first time, it may be beneficial to return to the options screen, select a different reference item and recalculate the model. A good choice of reference is often the most disliked item, the one with the largest negative coefficient in the initial analysis. In other situations, there may be an item that is a ‘control’ product, in which case setting this as the reference will yield statistical tests of each item against its control.
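
The relationships between the columns of the LogReg table can be reproduced by hand. The sketch below uses hypothetical coefficients and standard errors (not EyeOpenR output), and it assumes the usual normal-approximation interval of the coefficient plus or minus 1.96 standard errors and a two-sided z-test; the function names are illustrative.

```python
import math

# Hypothetical coefficients (utilities); P01 is the reference, so its value is 0.
coefs = {"P01": 0.0, "P02": 0.8, "P03": -0.4, "P04": 1.2}

def choice_probabilities(coefs, items=None):
    """Exp(Coeff) of each item divided by the sum of Exp(Coeff) over the items compared."""
    items = list(coefs) if items is None else list(items)
    exp_coefs = {i: math.exp(coefs[i]) for i in items}
    total = sum(exp_coefs.values())
    return {i: exp_coefs[i] / total for i in items}

def wald_summary(coef, std_error):
    """z-score, two-sided p-value and 95% limits of Exp(Coeff) (assumed +/- 1.96 SE)."""
    z = coef / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    lower = math.exp(coef - 1.96 * std_error)
    upper = math.exp(coef + 1.96 * std_error)
    return z, p_value, lower, upper

# Choice Prob. column: all items presented at once.
print(choice_probabilities(coefs))

# Pairwise comparison: only the two items of interest in the denominator.
print(choice_probabilities(coefs, items=["P02", "P04"]))

# Changing the reference shifts every coefficient by the same constant,
# which leaves the choice probabilities unchanged.
shifted = {item: value - coefs["P03"] for item, value in coefs.items()}
print(choice_probabilities(shifted))

# z-score, p-value and confidence limits for one item (hypothetical standard error).
print(wald_summary(coefs["P02"], 0.20))
```

Because the limits are obtained by exponentiating a symmetric interval for the coefficient, the resulting interval is not symmetric around Exp(Coeff).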

Likelihood tab

There are several statistical methods to test for the degree of fit of the Best Worst Scaling model. These include the pseudo-R squared value, the Likelihood ratio test, the Wald test and the Score test. Often with larger sample sizes the last three of these options, which all provide significance tests, will yield similar results. The p-values from the three tests are all approximations of the true p-value, and thus are reported to only two decimal places (see Agresti, 2007, for more information). The Wald and Score tests are similar and both test the null hypothesis that the item effects are all zero, and so provide information about the significance of the BWS model, but they differ in how the test statistic is calculated and in the background assumptions made. The Likelihood ratio test takes a different approach and compares the models with and without the item effect, highlighting which one is better. Since the Likelihood ratio test involves computing both models, it should be considered superior to the Wald and Score tests, which are based on distributional assumptions and the results of just one model. Certainly, if there is disagreement among the three significance tests, then the Likelihood ratio test should be preferred.
  1. R2: McFadden’s pseudo-R squared value, the amount of variation in the response that is explained by the model fitted. This R2 value may seem small, especially if the user’s experience is mainly of R2 statistics from standard linear regression models. It is therefore recommended that for Best Worst Scaling one of the three fit statistics below, all of which offer significance tests, is used instead.
  2. Likelihood ratio test: This tests whether the inclusion of the item effect in the regression model explains more variation in the data (improves the model fit with the extra effects added) than not being included, so it is comparing the two different models. The significance of the test is captured by the p-value. This is similar to the idea of a product effect in ANOVA. A low p-value (e.g., < 0.1 or 0.05) suggests that the effect of item is significant. In other words, there are overall differences in preference between items.
  3. Wald test: This test focuses on individual effects within a single model and tests their significance in terms of whether they are impacting the preference choices, evaluating whether the item effect is 0. The test statistic is calculated by comparing the estimated effect value to the value under the null hypothesis and dividing the difference by the standard error of the parameter estimate. A low p-value provides evidence against the null hypothesis.
  4. Score (logrank) test: This is similar to the Wald test, in that it tests the null hypothesis of no item effect, but it is evaluated at the null hypothesis values rather than at the fitted estimates. The test is based on the score function (the slope of the log-likelihood): the test statistic measures how far the score is from zero when the item effects are set to 0, scaled by the variance of the score. A low p-value provides evidence against the null hypothesis.
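
To illustrate why a small pseudo-R squared can coexist with a clearly significant item effect, the sketch below computes McFadden’s pseudo-R squared and the likelihood ratio statistic from two log-likelihoods. The log-likelihood values and the function names are hypothetical, and the formulas are the standard textbook definitions rather than a description of EyeOpenR’s internal code.

```python
def likelihood_ratio_statistic(loglik_null, loglik_model):
    """Twice the gain in log-likelihood from adding the item effect."""
    return 2.0 * (loglik_model - loglik_null)

def mcfadden_r2(loglik_null, loglik_model):
    """McFadden's pseudo-R squared: 1 - logLik(model) / logLik(null)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical log-likelihoods for the model without and with the item effect.
loglik_null = -1200.0
loglik_model = -1100.0

print("LR statistic:", likelihood_ratio_statistic(loglik_null, loglik_model))
print("McFadden R2:", round(mcfadden_r2(loglik_null, loglik_model), 3))
```

The likelihood ratio statistic is compared with a chi-square distribution whose degrees of freedom equal the number of estimated item coefficients (the number of items minus the reference). With these hypothetical values the statistic is large even though the pseudo-R squared is below 0.1.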

Technical Information

  1. R packages: survival (for conditional logistic regression)

References 

  1. Agresti, A. (2007). Introduction to Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons.
  2. Jaeger, S. R., Jørgensen, A. S., Aaslyng, M. D., & Bredie, W. L. P. (2008). Best–worst scaling: An introduction and initial comparison with monadic rating for preference elicitation with food products. Food Quality and Preference, 19(6), 579–588.
