Purpose
Cluster analysis (of which hierarchical clustering is a specific type) is used to assign consumers to groups (clusters) based on their patterns of product liking. This matters when consumers hold strong but differing preferences, because those differences can be missed unless the consumers are first split into groups (clusters).
The module creates clusters and reports the ANOVA overall and by cluster, including pairwise comparisons between products. These enable you to see how different the patterns of product liking are between the clusters.
- Example dataset: apples clusters val.xlsx
For EyeOpenR to read your data, the first five columns must be in the following order: assessor (consumer), product, session, replicate and order. The consumer liking rating should be in column six (column F).
If there is no session, replicate or order information then these columns should contain the value ‘1’ in each cell.
See the example spreadsheet for an illustration of the data format. In the example there is a Data sheet and three other sheets: Attributes, Assessors and Products. Hierarchical clustering will run with only the Data sheet, but including the others enables you to label assessors (consumers) and products, and to set the attributes of the liking variable.
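As a sketch, the Data sheet has the same shape as the long-format data frame below. The column names and values are illustrative; EyeOpenR reads the columns by position, not by name.

```r
# Illustrative long-format data in the required column order:
# assessor, product, session, replicate, order, then liking in column six.
liking_data <- data.frame(
  assessor  = rep(1:3, each = 4),                 # consumer identifier
  product   = rep(c("A", "B", "C", "D"), times = 3),
  session   = 1,                                  # '1' when there is no session info
  replicate = 1,                                  # '1' when there is no replicate info
  order     = 1,                                  # '1' when there is no order info
  liking    = c(7, 4, 6, 2, 3, 8, 5, 6, 6, 5, 7, 3)
)
```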
Background
Hierarchical clustering assesses the similarity in patterns of product liking between consumers and groups consumers into clusters. It is called hierarchical because clusters are formed iteratively, each step building on the previous one, so once two consumers are placed in the same cluster they remain in the same cluster at every later step.
The process starts with every consumer in their own separate cluster. In step one the two consumers who are closest together merge into a new cluster, reducing the number of clusters by one. At each subsequent step the two closest clusters merge, again reducing the total by one, until a single cluster contains all consumers. This gives a cluster solution for every possible number of clusters (1 to n, where n is the number of consumers), so you either specify the number of clusters required or let the procedure select it automatically, based on the point with the highest relative loss of inertia (a measure of the difference between clusters).
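A minimal sketch of this process in R, using hclust from the stats package (the function listed under 'R packages used' below); the data and the cut at three clusters are illustrative:

```r
# Consumers in rows, products in columns (liking scores); illustrative data.
set.seed(1)
X <- matrix(sample(1:9, 30, replace = TRUE), nrow = 10,
            dimnames = list(paste0("consumer", 1:10), paste0("product", 1:3)))

d  <- dist(X, method = "euclidean")  # pairwise distances between consumers
hc <- hclust(d, method = "ward.D")   # each step merges the two closest clusters
plot(hc)                             # dendrogram: merge heights on the vertical axis

clusters <- cutree(hc, k = 3)        # cut the tree at, say, three clusters
table(clusters)                      # cluster sizes
```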
There are different ways to describe the distance / proximity of clusters and different methods for deciding which clusters are closest. These are controlled through the options and are described below.
You can choose to take the result of the hierarchical clustering and use the cluster means as seeds of a k-means cluster analysis by selecting the consolidation option (see below). This often reduces the within cluster variation, because it allows consumers who were joined into a cluster at an early stage to separate.
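Continuing the sketch above, consolidation amounts to seeding kmeans (stats) with the hierarchical cluster means:

```r
# Cluster means from the hierarchical solution, used as k-means seeds.
seeds <- rowsum(X, group = clusters) / as.vector(table(clusters))
km    <- kmeans(X, centers = seeds)

# Consumers may move between clusters during consolidation, so the final
# assignment need not match the dendrogram.
table(hierarchical = clusters, consolidated = km$cluster)
```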
Options
- Center by assessor: Selecting ‘Yes’ means each consumer’s product liking pattern is expressed relative to their average liking across all products. For example, a consumer who tends to give generous scores but ranks the products in the same way as a less generous consumer will look the same as that consumer.
- Scale by assessor: Selecting ‘Yes’ means that consumers who use the whole range of the liking scale look the same as consumers who use a narrower range but rank the products in the same way.
- Missing data: The choices here are ‘Remove’ or ‘Impute’. Selecting ‘Remove’ removes all data for any consumer with missing values. Selecting ‘Impute’ uses a regularised iterative PCA algorithm to impute the missing values.
- Aggregation method: The following options are available:
- Ward: This method aggregates the two clusters that lead to the smallest increase in the within-group variance.
- Single linkage: This method aggregates the two clusters with the smallest minimum distance between pairs of consumers, one drawn from each cluster.
- Average linkage: This method aggregates the two clusters with the smallest average distance between pairs of consumers, one drawn from each cluster.
- Complete linkage: This method aggregates the two clusters with the smallest maximum distance between pairs of consumers, one drawn from each cluster.
- Distance type: This describes the way that the distance between consumers is calculated. For the Ward aggregation method the only distance type possible is ‘Euclidean’, and it is selected automatically. For the other three aggregation methods you can choose ‘Euclidean’ or ‘Manhattan’ distance. Distances between consumers are usually calculated over many dimensions (one for each product), but they are easiest to describe in two dimensions. If points represent consumers A and B in a two-dimensional space, the Euclidean distance is the length of the straight line between them. For the Manhattan distance you can only travel left and right or up and down (as if walking the street grid of Manhattan), so the distance is the horizontal move plus the vertical move. A product whose liking is very different from the others will have more influence under Euclidean distance than under Manhattan distance. That could be what you want, or it could be what you want to avoid – the choice is yours. The distance types and aggregation methods, together with the centring and scaling options above, are illustrated in the sketch after this list.
- Consolidation of clusters (Yes/No): Consolidating clusters means that the module takes the cluster means from the hierarchical clustering and uses them as the starting points (seeds) of a k-means clustering. The aim is to relax the hierarchical constraint and potentially reduce the within-cluster variation. A message will be printed in the Information tab of the results to say that the dendrogram does not match the cluster solution.
- Define clusters: Choosing ‘Automatically’ lets the module select the number of clusters based on the point in the dendrogram with the highest relative loss of inertia (a measure of the difference between clusters). Choosing ‘Manually’ lets you specify the number of clusters.
- Number of clusters: If you have chosen ‘Manually’ then enter the number of clusters here.
- Choose multiple comparison test: After finding a cluster solution the module carries out ANOVA by cluster and reports pairwise comparisons of product liking within clusters. Two tests are available: Fisher’s LSD and Tukey’s HSD. Fisher’s LSD will detect more differences because it does not adjust for making many comparisons simultaneously, so the chance of a Type 1 error (a false positive: indicating a difference where there isn’t one) is greater with Fisher’s LSD than with Tukey’s HSD. Tukey’s HSD adjusts for the number of comparisons that could be made, including comparisons you may never make, and can therefore be conservative.
- Number of decimals for values: Specify the number of decimal places shown for values in the results.
- Number of decimals for p-values: Specify the number of decimal places shown for p-values in the results.
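A minimal sketch of how the preprocessing and clustering options above translate into R, assuming the consumers-by-products matrix X from the Background sketch. missMDA is one package that provides regularised iterative PCA imputation; the documentation does not state which implementation the module uses.

```r
# Centre by assessor (remove each consumer's mean), and optionally also
# scale by assessor (remove differences in range of scale use).
X_centred <- t(scale(t(X), center = TRUE, scale = FALSE))
X_scaled  <- t(scale(t(X), center = TRUE, scale = TRUE))

# Missing data, 'Impute' option: regularised iterative PCA, e.g. via missMDA
# (assumption: the module's exact implementation is not documented here).
# library(missMDA)
# X_complete <- imputePCA(as.data.frame(X), ncp = 2)$completeObs

# Distance type: Euclidean or Manhattan distances between consumers.
d_euc <- dist(X_centred, method = "euclidean")
d_man <- dist(X_centred, method = "manhattan")

# Aggregation method: Ward requires Euclidean distance (and the module uses
# 'ward.D'; see the function settings below). The other linkages take either.
hc_ward     <- hclust(d_euc, method = "ward.D")
hc_single   <- hclust(d_man, method = "single")
hc_average  <- hclust(d_man, method = "average")
hc_complete <- hclust(d_man, method = "complete")
```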
Results and Interpretation
- Summary tab: This shows the clusters and reports the number of consumers in each cluster.
- Cluster: This shows the assignment of each consumer to a cluster.
- Means by cluster: This shows the product means of each cluster. They are shown in a chart and in a table. A strong cluster solution will have a different pattern of product liking for each cluster.
- Assessor dendrogram: The dendrogram is a visual representation of how the clusters form. At the lowest point on the vertical axis all the consumers are in separate clusters. The vertical axis measures the distance between clusters as they join. It is useful to confirm whether the cluster solution makes sense and whether the number of clusters makes sense. If the consolidation option has been selected it will not represent the resulting k-means cluster solution.
- ANOVA overall: This reports the ANOVA including product and cluster effects and the product-by-cluster interaction. For the cluster solution to be meaningful the product-by-cluster interaction should be significant.
- ANOVA by cluster: These show the ANOVA for each cluster with product and assessor (consumer) effects, and the pairwise comparison test for product differences within clusters. If the cluster solution is useful the product effect should be significant within each cluster; a significant assessor (consumer) effect may reflect different uses of the scale. The pairwise comparison table is ordered from most to least liked product for the cluster in question. The letters shown in the table indicate the results of the multiple comparison test: where products share a letter there is not enough evidence to say that their liking differs for consumers in that cluster. (A sketch of this step is given after this list.)
- Information: This reports additional information, such as the treatment of missing values, and flags any parts of the output that should be interpreted with caution.
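As a sketch of the ANOVA and pairwise-comparison steps, assuming a long-format data frame dat with liking, product and assessor columns plus a cluster column added from the clustering step (all names illustrative); LSD.test and HSD.test are the agricolae functions listed below:

```r
library(agricolae)

dat$product  <- factor(dat$product)
dat$cluster  <- factor(dat$cluster)
dat$assessor <- factor(dat$assessor)

# Overall ANOVA: product, cluster, and the product-by-cluster interaction.
print(anova(lm(liking ~ product * cluster, data = dat)))

# ANOVA and pairwise comparisons within each cluster.
for (cl in levels(dat$cluster)) {
  sub <- droplevels(subset(dat, cluster == cl))
  fit <- aov(liking ~ product + assessor, data = sub)
  print(anova(fit))                 # product and assessor (consumer) effects

  cmp <- LSD.test(fit, "product")   # or HSD.test(fit, "product") for Tukey
  # Means with letters: products sharing a letter show no evidence of a
  # difference in liking within this cluster.
  print(cmp$groups)
}
```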
R packages used:
- hclust (stats) - hierarchical clustering
- kmeans (stats) - k-means clustering
- lm (stats) - linear models
- anova (stats) - analysis of variance
- LSD.test (agricolae) - Fisher’s LSD
- HSD.test (agricolae) - Tukey’s HSD
R function settings that are not otherwise visible to the user:
- If the Ward aggregation method is selected, the module uses ‘ward.D’ and not ‘ward.D2’.
References
- Everitt, B. (1974). Cluster Analysis. London: Heinemann Educational Books.
- Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. doi:10.2307/2346830.
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd. ISBN 0-05-002170-2.
- Tukey, J. W. (1949). Comparing individual means in the analysis of variance. Biometrics, 5(2), 99–114. JSTOR 3001913.