Purpose
Cluster analysis (of which K-means clustering is a specific type) is used to assign consumers to different groups (clusters) based on their patterns of product liking. This is important when preferences differ strongly between consumers, because those differences can be missed unless the consumers are first split into groups (clusters).
The module creates clusters and reports the ANOVA overall and by cluster, including pairwise comparisons between products. These enable you to see how different the patterns of product liking are between the clusters.
- Example dataset: apples clusters val.xlsx
For EyeOpenR to read your data, the first five columns must be in the following order: assessor (consumer), product, session, replicate and order. The consumer liking rating should be in column six (column F).
If there is no session, replicate or order information then these columns should contain the value ‘1’ in each cell.
See the example spreadsheet for an illustration of the data format. In the example there is a Data sheet and three other sheets: Attributes, Assessors and Products. K-means clustering will run with only the Data sheet, but including the others enables you to label assessors (consumers) and products, and to set the attributes of the liking variable.
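As an illustration only, the snippet below sketches how a file in this layout might be inspected in R. EyeOpenR performs the import itself, so the use of the readxl package and the column names assigned here are assumptions made for this sketch; the file and sheet names follow the example dataset above.

```r
# Sketch only: EyeOpenR reads the file itself; this just illustrates the expected layout.
library(readxl)

dat <- read_excel("apples clusters val.xlsx", sheet = "Data")

# First five columns: assessor, product, session, replicate, order; liking in column six.
names(dat)[1:6] <- c("assessor", "product", "session", "replicate", "order", "liking")

# If there is no session, replicate or order information, those columns should all contain 1.
head(dat)
```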
Background
K-means clustering assesses the similarity in patterns of product liking between consumers and groups consumers into clusters.
As a starting point you must tell the procedure how many clusters you want to find (these are the ‘k’ clusters). The procedure then selects a random set of ‘k’ observations from the data to act as the starting points (seeds) for the cluster means. These points are the cluster centres (sometimes called centroids). The procedure calculates the Euclidean distance between each consumer and these centroids and assigns each consumer to their nearest centre. New cluster means are calculated from this assignment and the process is repeated until the cluster assignments no longer change.
The procedure uses the Hartigan and Wong (1979) method which minimises the within cluster variance (sum of squared distances). It is sensitive to the starting point, therefore the procedure tries 20 different random samples of ‘k’ consumers as starting points and picks the one with the minimum within cluster variance.
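As a rough sketch of what the procedure does, the call below runs base R’s kmeans function on a hypothetical consumer-by-product liking matrix; the data and object names are invented for illustration, and nstart = 20 matches the setting described under ‘R function settings’ below.

```r
# Hypothetical consumer-by-product liking matrix: 60 consumers rating 8 products on a 9-point scale.
set.seed(1)
liking <- matrix(sample(1:9, 60 * 8, replace = TRUE), nrow = 60,
                 dimnames = list(paste0("consumer", 1:60), paste0("product", 1:8)))

k <- 3  # the number of clusters ('k') requested by the user
fit <- kmeans(liking, centers = k, nstart = 20, algorithm = "Hartigan-Wong")

fit$cluster    # cluster assignment for each consumer
fit$size       # number of consumers in each cluster
fit$withinss   # within-cluster sum of squares, per cluster
fit$betweenss  # between-cluster sum of squares
```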
Options
- Center by assessor: Selecting ‘Yes’ will mean that each consumer’s product liking pattern is expressed relative to their average liking of all products. For example, a consumer who tends to give generally higher scores but ranks the products in the same way as a consumer who gives lower scores will look the same after centering (see the pre-processing sketch after this list).
- Scale by assessor: Selecting ‘Yes’ will mean that consumers who use the whole range of the liking scale will look like those who use a narrower range but still rank products in the same way.
- Missing data: The choices here are to ‘Remove’ missing data or ‘Impute’. Selecting ‘Remove’ will remove all data for any consumer with missing values. Selecting ‘Impute’ will use a regularised iterative PCA algorithm to impute the missing values.
- Number of clusters: Specify the number of clusters here.
- Choose multiple comparison test: After finding a cluster solution the module carries out ANOVA by cluster and reports pairwise comparisons of product liking within clusters. Two tests are available: Fisher’s LSD and Tukey’s HSD. Fisher’s LSD will detect more differences because it does not adjust for the fact that many comparisons are made simultaneously, so the chance of a Type I error (a false positive: indicating a difference when there isn’t one) is greater with Fisher’s LSD than with Tukey’s HSD. Tukey’s HSD adjusts for every comparison that could be made, including comparisons you may never look at, and can therefore be conservative.
- Number of decimals for values: Specify the number of decimal places shown for values in the results.
- Number of decimals for p-values: Specify the number of decimal places shown for p-values in the results.
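A minimal sketch of how the pre-processing options could be applied to the liking matrix from the Background sketch is shown below. The use of sweep, and of the missMDA package for the regularised iterative PCA imputation, are assumptions made for illustration; the module does not name the implementation it uses.

```r
# Centre by assessor: express each consumer's liking relative to their own mean.
centred <- sweep(liking, 1, rowMeans(liking), "-")

# Scale by assessor: additionally divide by each consumer's standard deviation
# (consumers who give the same score to every product would need special handling).
scaled <- sweep(centred, 1, apply(liking, 1, sd), "/")

# Missing data, 'Impute' option: regularised iterative PCA. The missMDA package offers
# one such implementation (an assumption; the module does not name the package it uses).
# library(missMDA)
# completed <- imputePCA(as.data.frame(liking), ncp = 2)$completeObs
```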
Results and Interpretation
- Summary: This shows the clusters and reports the number of consumers in each cluster.
- Cluster: This shows the assignment of each consumer to a cluster.
- K-Means: The K-means procedure minimises the within-cluster sum of squares (a measure of within-cluster variability or spread). This tab shows the within-cluster and between-cluster sums of squares, and how the within-cluster sum of squares is apportioned between clusters. Clusters that are more compact have smaller sums of squares, but this measure grows with cluster size, so larger clusters will naturally have higher within-cluster sums of squares. The better the cluster solution, the smaller the within-cluster sum of squares relative to the between-cluster sum of squares.
- Means by cluster: This shows the product means of each cluster. They are shown in a chart and in a table. A strong cluster solution will have a different pattern of product liking for each cluster.
- ANOVA overall: This reports the overall ANOVA including product, cluster and product-by-cluster interaction effects. For the cluster solution to be meaningful the product-by-cluster interaction should be significant.
- ANOVA by cluster: These show the ANOVA for each cluster with product and assessor (consumer) effects, and the pairwise comparison test for product differences within the cluster (see the sketch after this list). If the cluster solution is useful then the product effect should be significant within each cluster. A significant assessor (consumer) effect may reflect differences in how consumers use the scale. The pairwise comparison table is ordered from most to least liked product for the cluster in question. The letters shown in the table indicate the results of the multiple comparison test: where products share a letter there is not enough evidence to say that their liking differs for consumers in that cluster.
- Information: This will report additional information about the treatment of missing values, or the interpretation of the output if caution is required.
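The sketch below shows one way the overall and per-cluster ANOVAs and the pairwise comparisons could be reproduced with the packages listed below. It assumes a long-format data frame dat with factor columns assessor and product and a numeric liking column, plus the fit and liking objects from the Background sketch; it is illustrative rather than the module’s own implementation.

```r
library(agricolae)

# Attach the cluster assignment (from the Background sketch) to the long-format data.
dat$cluster <- factor(fit$cluster[match(dat$assessor, rownames(liking))])

# ANOVA overall: product, cluster and the product-by-cluster interaction.
anova(lm(liking ~ product * cluster, data = dat))

# ANOVA by cluster, with pairwise comparisons of products within each cluster.
for (cl in levels(dat$cluster)) {
  sub <- droplevels(subset(dat, cluster == cl))
  model <- lm(liking ~ product + assessor, data = sub)
  print(anova(model))                        # product and assessor effects within the cluster
  print(LSD.test(model, "product")$groups)   # Fisher's LSD letters; use HSD.test for Tukey's HSD
}
```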
R packages used:
- kmeans (stats) - k-means clustering
- lm (stats) - linear models
- anova (stats) - analysis of variance
- LSD.test (agricolae) - Fisher’s LSD
- HSD.test (agricolae) - Tukey’s HSD
R function settings that are not otherwise visible to the user:
- In the kmeans function the value of nstart is set to 20. This means that the procedure tries 20 different random samples of ‘k’ consumers as starting points and picks the one with the minimum within cluster variance.
- Euclidean distance is used as the measure of distance between consumers and cluster centres.
- The method minimises the total within-cluster variance.
References
- Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. doi:10.2307/2346830.
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh). ISBN 0-05-002170-2.
- Tukey, J. W. (1949). Comparing Individual Means in the Analysis of Variance. Biometrics, 5(2), 99–114. JSTOR 3001913.