This option provides analysis of data from a Best Worst Scaling (also known as Max-Diff) study. Best Worst Scaling (BWS) presents a subset of items from a longer list to a consumer, who must then select the best and worst item amongst the subset. The process can be repeated so that a consumer evaluates different subsets of items, according to an experimental design. The experimental design can be constructed in EyeQuestion and the analysis of such data can be performed in EyeOpenR.
- Best_Worst_Scaling.xlsx
Note that for EyeOpenR to read your dataset the first five columns must be: Assessor, Product, Session, Replica and Sequence. The consumer BWS response should be in column six (F).
The Assessor column denotes the consumer. The Product column refers to the product or item presented: this column codes the variable that you are interested in (whether it be products, claims, packaging designs, etc.). The Session and Replica columns are often redundant for BWS: enter the numeric value of “1” in each cell unless there are sessions or replica assessments in your data. The Sequence column refers to the question number, or using BWS terminology, the set number. So, if four items (P01, P02, P03, P04) were presented to consumer one (J01) as the first BWS question, then the Sequence value for the items would be “1”, as this is the question number.
Column F provides the consumer response to the BWS question: “1” indicates the best/most preferred option and only one of the presented items can have this value per question (Sequence); “-1” indicates the worst/least preferred item, and again only one item can have this value per question; “0” indicates neither most nor least preferred (i.e., a mid-pack option). All items that are not best or worst are given this value.
See the example dataset for a specific illustration of data format for BWS.
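The layout and coding rules above can also be checked programmatically. The sketch below (plain Python; the assessor and item codes are invented for illustration, not taken from the example dataset) builds one question's worth of rows in the required column order and verifies the one-best/one-worst rule:

```python
from collections import Counter

# Illustrative BWS responses for one consumer ("J01") answering one question
# (Sequence 1) in which four items were shown. Tuple order mirrors the
# required EyeOpenR layout: Assessor, Product, Session, Replica, Sequence,
# then the response in column six (1 = best, -1 = worst, 0 = neither).
rows = [
    ("J01", "P01", 1, 1, 1,  1),   # selected as best
    ("J01", "P02", 1, 1, 1,  0),
    ("J01", "P03", 1, 1, 1, -1),   # selected as worst
    ("J01", "P04", 1, 1, 1,  0),
]

# Sanity check: each question (Assessor x Sequence combination) has exactly
# one best and exactly one worst, with the remaining items coded 0.
for assessor, sequence in {(r[0], r[4]) for r in rows}:
    counts = Counter(r[5] for r in rows if r[0] == assessor and r[4] == sequence)
    assert counts[1] == 1 and counts[-1] == 1
```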
Background
The analysis provides a Best Worst Scaling (BWS) analysis,
also known as Max-Diff analysis (Maximum Difference Scaling). This is a popular
consumer science technique that has been used since the early 1990s. BWS
quantifies the preference of a number of attributes by repeatedly presenting a
subset of attributes from a longer list and asking the consumer to select the
best and worst attribute from those presented. In this respect, because the
consumer must select both the best and worst attribute, BWS differs from
popular methods that ask consumers either to rate each option on an ordinal
scale (e.g., a 5-point scale) or to rank options from most to least preferred. From this perspective,
it is argued that BWS better reflects the decision making that a consumer
typically performs in everyday scenarios. To further describe the method and
outline some advantages, we will take an example scenario.
Imagine you’re working on a new packaging design for a snack
product that will feature a single claim on the front-of-pack. The claim must appeal
to your consumer market. A problem faced is that you don’t know which claim to
go with: there are many potential claims that you could use (“new improved
recipe”, “high in fibre”, “low in salt”, “no artificial ingredients”, “organic”,
etc.). Say you have 15 potential claims and wish to identify the best one with
a data-driven approach and quantify how much better the winning claim is. Traditional
approaches are problematic: perhaps you will not find many significant
differences with a rating task, which is also time consuming with 15 attributes,
whilst the ranking task seems both difficult and rather artificial when there
are so many options (e.g., do consumers ever rank 15 items in real life?). BWS
offers a viable solution to the problem. It is commonly used when the number of
potential attributes is greater than those that can be presented at once.
Using BWS, the consumer is simply tasked to select their
best and worst option out of a subset of the whole list of attributes. So,
using our example as outlined above, a consumer is presented with a set of four
of 15 claims (items) and asked to select the best and worst. The process is
repeated with different sets of four items. The total number of sets seen per
consumer is established by way of experimental design. The optimal design for
BWS should reflect three key features: (i) that each item occurs an equal
number of times; (ii) that each item is paired with each other item an equal
number of times; and (iii) that each item occurs in each position within sets
an equal number of times. EyeQuestion is one of several software packages that can
be used to generate BWS designs. It is very important that an experimental design
is in place prior to data collection.
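The three balance properties can be checked for any candidate design. In the sketch below (Python), the tiny four-item design is invented for illustration: it satisfies properties (i) and (ii), though positional balance (iii) is harder to achieve in such a small design. A real design would come from software such as EyeQuestion.

```python
from collections import Counter
from itertools import combinations

def balance_report(design):
    """Summarise the three balance properties of a BWS design.
    `design` is a list of sets (questions); each entry is a tuple of
    item codes in presentation order."""
    occurrences = Counter(item for s in design for item in s)
    pairs = Counter(frozenset(p) for s in design for p in combinations(s, 2))
    positions = Counter((item, pos) for s in design for pos, item in enumerate(s))
    return occurrences, pairs, positions

# Hypothetical design: four items, four sets of three. Every item occurs
# three times and every pair of items co-occurs twice.
design = [
    ("P01", "P02", "P03"),
    ("P01", "P02", "P04"),
    ("P01", "P03", "P04"),
    ("P02", "P03", "P04"),
]
occ, pairs, pos = balance_report(design)
assert len(set(occ.values())) == 1     # (i) each item occurs equally often
assert len(set(pairs.values())) == 1   # (ii) each pair co-occurs equally often
```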
BWS offers several advantages: the task is one consumers are
familiar with (selecting their most and least preferred options); it is more
efficient than running numerous paired preference tests; from a statistical
perspective, BWS may be more discriminative (powerful) than other approaches
(i.e., it tends to find differences between items where other methods report
non-significant differences – e.g., see Jaeger et al., 2008).
As stated in the ‘Data Format’ section, in the data file the
respective consumer response to a BWS question is as follows: ‘1’ signifies best;
‘-1’ signifies worst; and ‘0’ otherwise (i.e., not chosen as either best or worst).
Options
- Reference: In order to perform
the statistical analysis (multinomial logistic regression), BWS requires the
user to select a reference product, even if there is no reference in the actual
set of items you’re testing. The selected reference will be given a coefficient
value, also known as utility, of 0. Coefficients of other items are relative to
this reference. See Results and Interpretation for more information.
- Number of Decimals for Values: Choose preferred number of decimal places in subsequent output (default=2).
- Number of Decimals for P-Values: Choose preferred number of decimal places in subsequent output (default=3).
Results and Interpretation
LogReg tab (multinomial logistic regression)
BWS analysis in EyeOpenR uses a multinomial conditional logistic
regression model. Put simply, this approach provides an overall probability of each
item being chosen as best, as if all items were presented to the consumer at
once. So, the analysis allows one to compare the probability of an item being
selected as best versus the competition. This can also be seen as a quantified overall
rating of all items, from best to worst. In addition to being able to compare
an item with all others, one is also able to use the EyeOpenR output to
quantify the probability of an item being chosen over a specific set of
competitors. All this information is contained in the LogReg table:
- Coefficient:
The statistical analysis returns the chosen Reference item to have a
coefficient (aka utility) value of 0. Items with positive coefficients (above
0) are more preferred and items with negative coefficients are less preferred,
with respect to the chosen reference.
- Exp(Coeff):
The values in this column are e, the base of natural logarithms (approx.
2.718), raised to the power of the coefficient value shown in the first column.
This allows the analyst to work out probabilities of choosing one item over specific
other items. To do so, the sum of the exponentiated coefficients is
required (see below). If the user is interested in comparing two items, it is
possible to quickly calculate the selection probability of each item. Simply
divide the chosen item’s exp(Coeff) value given in the table by the sum of the
two exp(Coeff) values for the two items of interest. This will provide the
probability of one item being chosen in a pairwise test. The calculation can be
extended to model choices from sets of three or more by increasing the number
of items summed in the denominator. Finally, note that Exp(0) = 1, so a value
of 1 is used in any comparison involving the reference item.
- Lower
limit (95%): The lower confidence limit (95%) of the Exp(Coeff) value in
the second column. Note that the confidence interval itself is non-symmetric
since it has been computed by exponentiating a standard 95% confidence interval
of the coefficient value in the first column.
- Upper
limit (95%): The upper confidence limit (95%) of the Exp(Coeff) value in
the second column (see note above for lower limit).
- Std.
Error: The standard error of the coefficient value in the 1st
column. Generally these values will be similar across items if the design is
well balanced (see Background).
- z-Score:
The coefficient value divided by its standard error, or the distance of
each item from the reference in standard deviation units. If the design is
approximately balanced, where each item is shown equally often, then this
column provides similar information to the Coefficient column.
- P value: provides the p-value from the z-
test which assesses the difference in size of coefficients between an item and
the reference. P-values below a chosen threshold indicate significant
differences; in most instances, the threshold is set at 0.05.
- Choice Prob.: Choice probability. Provides the probability of an item being chosen as best as if all items were presented simultaneously. The probability per item is between 0 and 1. The sum of the probabilities will be 1. The higher the value, the more likely the item would be chosen as best. The value in this column is calculated by dividing the respective Exp(Coeff) by the sum of Exp(Coeff). Note that values in the Choice Prob. column will be the same regardless of the reference product.
If the user is interested in comparing only (say) two items, the choice probability can also be calculated manually: divide the Exp(Coeff) value given in the table by the sum of the Exp(Coeff) values for the items of interest. For two products, this is the probability of one being chosen, as if in a pairwise test. The calculation extends beyond two items following the same logic.
- Times Shown: The total number of times an item is presented.
- Times Best: The total number of times an item is selected as best.
- Times Worst: The total number of times an item is selected as worst.
- Times Not Chosen: The total number of times an item is selected as neither best nor worst. Note, the summed counts in the Times Best, Times Worst and Times Not Chosen columns should equal Times Shown.
- Choice of reference: Once the user has
assessed this result table for the first time, it may be beneficial to return
to the options screen, select a different reference item and recalculate the model.
A good choice of reference is often the most disliked item, the one with the
largest negative coefficient in the initial analysis. In other situations,
there may be an item that is a ‘control’ product, in which case setting this as
the reference will yield statistical tests of each item against its control.
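The probability calculations described for the Exp(Coeff) and Choice Prob. columns can be sketched as follows (Python; the claim names and coefficient values are hypothetical, not output from EyeOpenR):

```python
import math

# Hypothetical coefficients (utilities) from the LogReg tab, with "Claim D"
# chosen as the reference (its coefficient is fixed at 0).
coeff = {"Claim A": 1.2, "Claim B": 0.4, "Claim C": -0.3, "Claim D": 0.0}

exp_coeff = {item: math.exp(b) for item, b in coeff.items()}

# Choice probability as if all items were shown at once: each Exp(Coeff)
# divided by the sum of all Exp(Coeff) values. These probabilities sum to 1.
total = sum(exp_coeff.values())
choice_prob = {item: v / total for item, v in exp_coeff.items()}

# Pairwise probability of choosing Claim A over Claim C: restrict the
# denominator to the two items of interest.
p_a_over_c = exp_coeff["Claim A"] / (exp_coeff["Claim A"] + exp_coeff["Claim C"])
```

The same pattern extends to sets of three or more items by adding their Exp(Coeff) values to the denominator; a comparison involving the reference uses Exp(0) = 1.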
Likelihood tab
There are several statistical methods to test for the degree of fit of
the Best Worst Scaling model. These include the pseudo-R squared value, the
Likelihood ratio test, the Wald test and the Score test. Often with larger
sample sizes the last three of these options, which all provide significance
tests, will yield similar results. The p-values from the three tests are all approximations
of the true p-value, and thus are reported to only two decimal places (see
Agresti, 2007, for more information). The Wald and Score tests are similar:
both test the null hypothesis that the item effects are all
zero, and so provide information about the significance of the BWS model, but
they differ in terms of how the test statistic is calculated and the background
assumptions made. The Likelihood ratio test takes a different approach and
compares models with the effect in and out, highlighting which one is better.
Since the Likelihood ratio test involves computing both models, it should be
considered superior to the Wald and Score tests which are based on
distributional assumptions and the results of just one model. Certainly, if there
is disagreement among the three significance tests, then the Likelihood ratio
test should be preferred.
- R2:
McFadden’s pseudo-R squared value, the amount of variation in the response
that is explained by the model fitted. This R2 value may seem small,
especially if the user’s experience is mainly of R2 statistics from
standard linear regression models. It is therefore recommended that for Best
Worst Scaling one of the three fit statistics below, all of which offer
significance tests, are used instead.
- Likelihood
ratio test: This tests whether the inclusion of the item effect in the regression
model explains more variation in the data (improves the model fit with the
extra effects added) than not being included, so it is comparing the two
different models. The significance of the test is captured by the p-value. This
is similar to the idea of a product effect in ANOVA. A low p-value (e.g., < 0.1
or 0.05) suggests that the effect of item is significant. In other words, there
are overall differences in preference between items.
- Wald
test: This test focuses on individual effects within a single model and
tests their significance in terms of whether they are impacting the preference
choices, evaluating whether the item effect is 0. The test statistic is
calculated by comparing the estimated effect value to the value under the null
hypothesis and dividing it by the standard error of the parameter estimate. A low
p-value provides evidence against the null hypothesis.
- Score
(logrank) test: This is similar to the Wald test, in that it is focusing on
the individual effects and testing their significance, giving a measure of the
statistical evidence against the null hypothesis of no effect. The test is
based on the score function, with the test statistic being the squared
difference between the estimated effect value and the specified value under the
null hypothesis, divided by the variance of the effect estimate. A low p-value
provides evidence against the null hypothesis.
- R packages: survival (for
conditional logistic regression)
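As a rough illustration of the fit statistics above, the sketch below computes McFadden's pseudo-R squared and the Likelihood ratio statistic from two hypothetical log-likelihoods. The values are invented, and the closed-form tail probability shown applies only to the single-parameter (df = 1) case; EyeOpenR computes these quantities from the fitted model.

```python
import math

# Hypothetical log-likelihoods: `ll_null` for the model with no item effects,
# `ll_full` for the fitted BWS model. Values are illustrative only.
ll_null = -250.0
ll_full = -230.0

# McFadden's pseudo R-squared: 1 - logL(full)/logL(null).
# Values look small compared with R-squared from linear regression.
r2 = 1 - ll_full / ll_null

# Likelihood ratio statistic: twice the improvement in log-likelihood. It is
# compared against a chi-squared distribution with degrees of freedom equal
# to the number of item parameters (items minus one, as one is the reference).
lr_stat = 2 * (ll_full - ll_null)

# For a single parameter (df = 1) the chi-squared upper-tail probability has
# a closed form via the normal distribution: P(chi2_1 > x) = erfc(sqrt(x/2)).
p_df1 = math.erfc(math.sqrt(lr_stat / 2))
```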
References
- Agresti, A. (2007). An Introduction to Categorical Data Analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.
- Jaeger, S. R., Jørgensen, A. S., Aaslyng, M. D., & Bredie, W. L. P. (2008). Best–worst scaling: An introduction and initial comparison with monadic rating for preference elicitation with food products. Food Quality and Preference, 19(6), 579–588.