Principal Component Analysis (PCA)

Purpose 

To provide a Principal Components Analysis (PCA) of the imported data.

PCA is a popular dimensionality reduction method in sensory and consumer science. In non-technical terms, the analysis reduces an initial set of variables to a smaller set of latent variables, known as principal components (PCs), that best describe the variation in the original data. By ‘best describe the variation’, it is meant that PCA finds the directions with the most variance in a data set (Næs, Brockhoff & Tomic, 2010). The PCs, smaller in number than the initial set of variables, are easier to interpret than trying to evaluate and comprehend results over many variables. In this respect, PCA simplifies a complex data set.

As an introductory example, one may have a sensory data set with 20 attributes (variables) used to measure 10 products. PCA will reduce the dimensionality from 20 to a small number of PCs (say two or three) that may offer a good approximation (high explained variance) of the variation in the original 20 variables. Interpreting the two or three PCs is an easier task for the analyst (and audience). A map of the observations (Scores Plot), a map of the variables (Loadings Plot) and a map of both (Bi-plot) are produced, allowing the analyst to see:

  1. Relationships between variables
  2. How products are similar or dissimilar to each other
  3. How variables characterise products

Data Format

  1. Profiling.xlsx (format 1)
  2. Consumer.xlsx (format 1)
  3. Apples_means.xlsx (format 2)

For PCA, the data for EyeOpenR must be in one of two formats.

Format 1: For EyeOpenR to read your data the first five columns must include the following in the specified order: Assessor, Product, Session, Replica and Sequence.

Assessor and Product columns can be of character (e.g., “Assessor 01”, “Product A”) or numeric format (e.g., “1”). Session, Replica and Sequence must contain only numeric values. If there is no session, replica or sequence information available, the user should input a value of “1” in each cell of the column that contains no collected information. Sensory or consumer variables should start from column six (column F). As examples of Format 1, please see the datasets ‘Profiling.xlsx’ (sensory science) or ‘Consumer.xlsx’ (consumer science).

Format 2: Table of means. Like Format 1, the first five columns must be labelled Assessor, Product, Session, Replica and Sequence. The same format rules also apply: Assessor and Product can be character or numeric, whilst the latter three columns should only contain numeric values. The difference from Format 1 is that Format 2 is a table of means (averaged over the sensory/consumer panel and, optionally, replications). As a result there is no individual assessor information. Because the Assessor column must still contain a value, each cell in it should be labelled “mean” or something similar. If there is no information or variation in the Replica column, enter a value of “1” in each cell of this column. The same applies to the Session and Sequence columns. Sensory or consumer attribute measures are to be placed from the sixth column (column F). Please see the demo data ‘Apples_means.xlsx’ as an example. A minimal sketch of this format follows.
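
For illustration, below is a minimal R sketch that builds a Format 2 data frame. The column names and ordering follow the requirements above; the products, attributes and values are hypothetical, and the commented-out write step assumes the openxlsx package, which EyeOpenR itself does not require.

    # hypothetical 'Format 2' table of means
    means_table <- data.frame(
      Assessor = "mean",                  # a table of means has no individual assessors
      Product  = c("Apple A", "Apple B", "Apple C"),
      Session  = 1,                       # "1" when no session information exists
      Replica  = 1,                       # likewise for replicates...
      Sequence = 1,                       # ...and serving sequence
      Sweet    = c(6.2, 4.8, 5.5),        # attributes start at column six (column F)
      Crisp    = c(7.1, 5.9, 6.4)
    )
    # write to Excel for import into EyeOpenR (openxlsx is one option):
    # openxlsx::write.xlsx(means_table, "Apples_means.xlsx")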

Background 

General

Imagine running a sensory descriptive profiling study with a trained sensory panel, with 10 products (observations) and 20 sensory attributes (variables). One could graph each of the 20 attributes and examine differences between the 10 products (averaged over the panel), iteratively over each attribute. However, it is very difficult to get a comprehensive overview of how the 20 attributes vary over the products. One could plot two, possibly three attributes at once, but the situation becomes near impossible with four, five or more. Further, presenting a plethora of graphs to an audience is not recommended.

One approach to deal with the situation presented above is PCA. It is one of the most popular and powerful methods in sensory and consumer science. In contrast to the univariate analysis described above, one of the major advantages of PCA is that it deals with high-dimensional data, that is, data with many variables. A primary goal of PCA is to find the most important ‘directions’ of variability in a given data set and then present these in a way that can be easily yet comprehensively understood by way of a few plots and statistics. As a result, the initial (large) data set is ‘reduced’ in dimensionality and the key insights regarding what drives variation between observations are magnified. Specifically, PCA reduces the number of variables in the data to a smaller set of latent variables, known as Principal Components (PCs), that best capture the variation in the data.

Here is a brief, non-mathematical description of each step in PCA (a minimal code sketch follows the list):

  1. Assume a data set with observations in rows and variables in columns. We’ll continue with the example of 10 products as the observations and 20 sensory attributes as the variables; our data set is therefore 10 rows by 20 columns. Each value of a particular sensory attribute for a given product is the average over a trained sensory panel. Our input data is ‘20-dimensional’, meaning that to fully describe a product we need to know the values of the 20 attributes.

  2. Next, calculate the average of each sensory attribute (each variable).

  3. The averages are then subtracted from each corresponding observational value so that each variable has a mean of 0 across the observations. The mean of all attributes (i.e., the average point) is now at the origin. The next step depends on whether the type of PCA is based on the covariance or correlation matrix. For the covariance type, the reader can skip to the next step, as the covariance matrix utilises the variation that is inherent within the raw data. If the correlation type is selected, then normalisation of each variable is performed: each value in a variable is divided by the standard deviation of that variable. As a result, variables entering a correlation-type PCA have a mean of 0 and a standard deviation of 1. Thus, each variable has the same standard deviation and therefore the same weight. This process is commonly called standardising, autoscaling or ‘centring and reducing’. For information on how standardisation of variables impacts PCA interpretation, please see the later Covariance vs. Correlation section.

  4. The PCA algorithm finds the direction in the variable space (here 20-dimensional) that has the largest variance. The direction is constrained in that it must pass through the average point (i.e., the origin) and be a straight line. It best approximates the data in a least-squares sense: the sum of squared distances from each observation to PC1 is the smallest attainable given the data. This direction is PC1.

  5. Each observation (here 10) is projected onto PC1. This gives each observation a coordinate, or Score, on PC1. These values will differ between observations.

  6. The variables in the initial data contribute to the construction of PC1. This contribution is the ‘Loading’ value for PC1. High positive or high negative loading values indicate that the variable in question is important for that PC. Each sensory attribute will be given a ‘Loading’ value.

  7. Each observation now has a ‘Score’ value and each initial variable has a ‘Loading’ value.

  8. PC2 is then calculated, under the constraint that it is orthogonal (correlation = 0) to PC1. PC1 and PC2 are thereby uncorrelated. Orthogonality is a very useful feature that mitigates the multicollinearity issues that surround, for example, multiple linear regression. Further, it enables subsequent analyses such as Principal Component Regression (PCR) and Preference Mapping, in which the original data is compressed to a handful of PCs and only the PCs enter further regression analyses. Here, we will focus our attention solely on the practical use and interpretation of PCA.

  9. The PCA procedure continues with the extraction of PC3, PC4, etc., with each PC orthogonal to the ones prior.

  10. The sum of the variances of the PCs is equal to the sum of the variances of the initial variables (Næs et al., 2010). This makes explained variance a meaningful measure: we seek a PCA model that explains a high degree of variance. Because of the sequential way PCs are calculated, the variance explained per PC decreases with each additional PC (i.e., PC1 explains the most variance, followed by PC2, etc.). In keeping with our 20-variable example, PC1 may explain 50% of the variation in the data, PC2 25% and PC3 10%. This would mean 85% of the variation can be captured in 3 PCs, which generally would be considered useful if the original sensory space consisted of 20 attributes. Typically, in sensory science two to five PCs are chosen (reducing the number of dimensions significantly), as trying to interpret more than that is deemed to be interpreting noise.
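
To make the steps above concrete, here is a minimal sketch in base R on hypothetical data. It uses prcomp() purely to illustrate the procedure; it is not EyeOpenR’s own routine.

    # 10 products (rows) x 20 attributes (columns) of hypothetical panel means
    set.seed(1)
    X <- matrix(rnorm(10 * 20, mean = 5, sd = 1.5), nrow = 10,
                dimnames = list(paste0("Product", 1:10), paste0("Attr", 1:20)))

    # Steps 2-3: centre each variable (mean 0); for a correlation-type PCA,
    # set scale = TRUE as well, so each variable also has standard deviation 1
    Xc <- scale(X, center = TRUE, scale = FALSE)

    # Steps 4-9: extract the PCs
    pca <- prcomp(Xc, center = FALSE, scale. = FALSE)
    scores   <- pca$x         # Scores: one coordinate per product per PC
    loadings <- pca$rotation  # Loadings: contribution of each attribute to each PC

    # Step 8: PCs are orthogonal, so the scores are uncorrelated
    round(cor(scores[, 1], scores[, 2]), 10)    # effectively 0

    # Step 10: explained variance per PC decreases from PC1 onwards
    explained <- pca$sdev^2 / sum(pca$sdev^2)
    round(100 * cumsum(explained), 1)           # cumulative % explained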

It is important to note that the analyst must choose the number of PCs to keep in the PCA model when presenting results to an audience. Additionally, the analyst must consider the variation that each PC explains when interpreting PCA maps, as well as the overall percentage of variance explained in the PCA model. If a 2-PC model explaining 40% of the variation is chosen, then the analyst should be aware that 60% of the total variation in the data is unaccounted for. This PCA model would therefore be a relatively poor representation of the data. Another example of caution would be if one PC explains much more variation than another (say 75% vs. 10%): interpretation should be primarily focused on the PC that explains 7.5 times the variation of the other. For guidance on interpreting PCA output in EyeOpenR, see the Results and Interpretation section below. For a more comprehensive understanding of PCA and its computation, see Næs et al. (2010).

Advantages of PCA

As a multivariate method, PCA provides information regarding:
  1. How variables in the dataset are related to each other and which variables are the most important in discriminating the observations. For example, the analyst can see patterns of association between variables based on their positions in the Loadings plot (variables chart) or Bi-plot. Attributes that are highly positively correlated with each other will tend to be positioned close to each other and far from the origin. Attributes that have a strong negative association will be opposite each other. Attributes close to the origin are not well explained on the dimensions shown and should not be interpreted. Note the Loadings plot can be seen as a visualisation of the covariance or correlation matrix, depending on which type is chosen (see below).

  2. Which observations (often products) are similar and dissimilar: PCA provides a Scores plot which maps the observations. For sensory science data, this will often be of products whereas for consumer science this may be either products or consumers, depending on the interest of the analyst. Regardless, observations near each other show similarities on the components plotted; observations opposite each other show dissimilarities on the components plotted.

  3. How variables can characterise products: Based on positions in the Scores (observations) and Loadings (variables) plots, the analyst can understand how variables characterise particular products.

  4. How good a representation is the PCA model? In other words, how good is the PCA in describing the total amount of variation?

Covariance vs. Correlation

One frequent question from users of EyeOpenR is whether the ‘type of PCA’ should be correlation or covariance. As noted above, the difference is whether an optional normalisation of variables occurs: if the user selects correlation, then the normalisation step ensues, with each variable having the same mean (0) and standard deviation (1); if covariance is selected then no normalisation takes place. In non-technical terms, the analyst must decide whether to keep the use of scale in the original data (covariance) or to remove the effect of scale and thereby give equal weight to each variable (correlation). Let’s take two examples to clarify the issue. Firstly, in sensory descriptive analysis a trained panel provides intensity scores on (say) a 10-point scale. The analyst often wants to keep the use of scale in the analysis, on the grounds that the panel has been trained and that the attributes with the biggest variation are the most important in discriminating between products. Therefore the analyst selects ‘covariance’ as the type.

Now consider data from a decathlon sporting event, comprising 10 sports (variables) that vary in terms of the unit measured: sprinting events in seconds, jumping and throwing events measured in metres and longer-distance running events in minutes. The variance we see depends on the event: in the 100m sprint, there is only around a second between the first and last athlete and so the variation is very small, whereas in the javelin there could be a difference of more than 30m between the best and worst thrower, with higher variation. If we choose ‘covariance’ we keep the original units and therefore the variation in events like the javelin will dominate our first components. Hopefully the reader can agree that each sporting event should be given equal weight: the variation in the 100m should be given the same importance as the variation in the javelin. This is what the ‘correlation’ type PCA does. By removing the effect of scale, the analyst is able to perform PCA on variables that (typically) were measured using different units. By using the correlation PCA, each attribute has a mean of 0 across observations and a variance of 1 across all observations (hence an equal weight). Correlation PCAs are often used in consumer science and when a sensory panel is undertaking training. A short code sketch contrasting the two types follows.
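
The contrast can be reproduced with the decathlon data shipped with the FactoMineR package (listed under Technical Information below). This is an illustrative sketch rather than EyeOpenR’s internal call; scale.unit = TRUE gives the correlation type and scale.unit = FALSE the covariance type.

    library(FactoMineR)
    data(decathlon)            # 41 athletes; columns 1-10 are the ten events

    # correlation-type PCA: every event is given equal weight
    res_cor <- PCA(decathlon[, 1:10], scale.unit = TRUE,  graph = FALSE)

    # covariance-type PCA: raw units kept, so high-variance events
    # (e.g., the javelin, in metres) dominate the first components
    res_cov <- PCA(decathlon[, 1:10], scale.unit = FALSE, graph = FALSE)

    # compare the percentage of variance carried by PC1 under each choice
    res_cor$eig[1, "percentage of variance"]
    res_cov$eig[1, "percentage of variance"]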

Options

Analysis options tab

  1. Treat Sessions/Replicates separately: If sessions or replicates are part of the design, the user can choose whether to treat them as distinct. Check the ‘No’ option to treat neither as separate. On the other hand, if one would like to treat (e.g.) three replicates of a product as three distinct individual products, check the ‘Replicate’ option.
  2. Dimension: choice of components to plot (either PC1 vs. PC2; PC1 vs. PC3; PC2 vs. PC3).
  3. Ellipses: choice of ‘Yes’ or ‘No’ as to whether confidence ellipses are drawn. 
  4. Type of PCA: choice of ‘Correlation’ or ‘Covariance’. Correlation will give each variable equal weight, with each variable mean = 0 and standard deviation = 1. This is the recommended option if variables are measured on different scales. Covariance keeps the data in their initial format and is often used with trained sensory panels, where each variable is measured on the same scale.
  5. Data to Analyse: ‘Compute Table of Means’ or ‘Run on Imported Data’. ‘Compute Table of Means’ will calculate the average across replicates and assessors if the data is in raw format, as stipulated under the Data Format section. The result is a table of means with each observation in rows and variables in columns; in sensory science this will frequently be products in rows and attributes in columns. ‘Run on Imported Data’ refers to data that is already a table of means (‘Format 2’ as detailed in the Data Format section above).

  6. Type of Mean: ‘Arithmetic’ is recommended if data is complete and balanced with no missing data. When missing data occurs or the design is not balanced, the user should select ‘Adjusted’. 
  7. Adjusted Means Model: If adjusted means has been chosen in the preceding option, the ‘Adjusted Means Model’ option allows the user to choose a one-, two- or three-way ANOVA model when calculating the adjusted means. In this context, ‘one way’ refers to an ANOVA model with solely the ‘Product’ column as the factor [see Data Format section above]; ‘two way’ accounts for the ‘Product’ and ‘Assessor’ columns; ‘three way’ includes ‘Product’, ‘Assessor’ and ‘Replicate’ (see the sketch after this list).

  8. Clustering: Cluster analysis can be performed to group the products; this is done using agglomerative hierarchical clustering with an automatic cluster-selection routine.
  9. Define clusters: Either proceed with the ‘automatic’ calculation to allow the software to algorithmically choose the best number of clusters or, if desired, the user can specify the number of clusters, generally after reviewing the dendrogram from a prior automatic run.
  10. Number of clusters: Enter the number of clusters if proceeding with a manual number.

  11. Number of Decimals for Values: as desired.
  12. Number of Decimals for P-Values: as desired.
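
As a hedged illustration of the adjusted-means options above, the sketch below fits a two-way ANOVA model and extracts least-squares means with the emmeans package. The data frame raw_data and the attribute Sweet are hypothetical, and EyeOpenR’s exact internal model specification is not documented here.

    library(emmeans)
    # raw_data: hypothetical Format 1 data with factor columns Product and
    # Assessor and a numeric attribute Sweet
    fit <- lm(Sweet ~ Product + Assessor, data = raw_data)  # 'two-way' model
    emmeans(fit, ~ Product)   # adjusted (least-squares) product means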

Supplementary options tab

  1. Include supplementary products: Y/N. The user has the option to select one or more products as supplementary. Supplementary products do not affect the construction of the PCs but can be shown on subsequent graphical output. This has the advantage that the orientation of the existing map will not change with the addition of supplementary products or variables (see the sketch after this list).
  2. Select supplementary products: User chooses product(s).
  3. Include supplementary attributes: Y/N. The same as for supplementary products, now with the option to choose additional variables. For example, the analyst may include consumer liking as a supplementary variable to a sensory descriptive analysis PCA.
  4. Select supplementary attributes: User chooses attributes for supplementary.
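
A minimal sketch of supplementary projection, assuming FactoMineR’s PCA() function (one of the packages EyeOpenR builds on, per Technical Information below). The table means_table, the supplementary row index and the Liking column are hypothetical.

    library(FactoMineR)
    # means_table: hypothetical products-by-attributes table of means with a
    # consumer 'Liking' column in the final position
    res <- PCA(means_table,
               ind.sup    = 6,                  # row 6 projected as a supplementary product
               quanti.sup = ncol(means_table),  # Liking as a supplementary attribute
               scale.unit = TRUE, graph = FALSE)
    res$ind.sup$coord      # coordinates of the supplementary product
    res$quanti.sup$coord   # coordinates of the supplementary attribute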

Results and Interpretation

  1. Means tab: Provides a table of means with products in rows and attributes in columns, according to whether the user has chosen arithmetic or adjusted means.
  2. Eigenvalues tab: Provides the information that is captured in each PC. The more information, the higher the eigenvalue. Note that the sum of the eigenvalues is equal to the total variance of the products over the attributes. The ‘Percentage of variance’ column provides the percentage of variance that is explained by that PC. This is useful in determining how many PCs to retain when interpreting the model. The user must assess how many PCs to include: for example, if a 3-PC model explains 80% of the variance and the 4th PC only explains an additional 2%, it is probably not worth interpreting the 4th PC, as it only explains a small percentage of the variation. If however the 4th PC explained 15%, then it would be worthwhile to interpret.
  3. Products: Contains four tabs: ‘Coord’, ‘Cos2’, ‘Contrib’ and ‘Graph’:
    1. Coord: The PCA’s representation of the products is founded on the coordinates. This tab provides the coordinates of the products (observations) on each PC. These are the values that are plotted on the ‘Graph’ tab. Also known as Scores.
    2. Cos2: Also known as the squared cosine. These refer to the quality of the representation of the product. If the value is low, the position of the product should not be interpreted. Each product (i.e., each row) will sum to 1 over the PCs. The higher the value the better the representation; products positioned near to the origin will have a lower cos2. 
    3. Contrib: Provides the contribution of the products to the formation of the PCs. The total contribution for each PC is 100. Therefore, the larger the number the more that product contributes to that PC.
    4. Graph: This is the product plot, also known as the Scores Plot. Products near to each other and far from the origin share a similar profile over the attributes. Products opposite each other through the origin show dissimilar sensory profiles. Products near to the origin on the plotted PCs cannot be interpreted, as they are not well explained.
  4. Supplementary Products: If supplementary products have been chosen then this tab outputs the coordinates (Coord) and squared cosines (Cos2) values. 
  5. Attributes: tabs Coord, Cor, Cos2, Contrib and Graph:
    1. Coord: The coordinates of the attributes. These are plotted on the subsequent ‘Graph’ tab. Also known as the Loadings Plot.
    2. Cor: Provides the correlation between the variable and each PC. High positive values denote a strong positive correlation between the PC and increasing attribute intensity values; likewise negative values indicate that a negative relationship exists between attribute intensities and the respective PC. 
    3. Cos2: Also known as squared cosine. The value represents the quality of the representation of the variable on the PCs. The analyst seeks high values to indicate a good quality representation. Note that when correlation-type PCA is used, the Cos2 values are calculated as the Coord * Coord.
    4. Contrib: the contribution of each variable to each PC. Each PC sums to a Contrib total of 100. Therefore, the higher the Contrib value the more that attribute contributes to the construction of the PC. 
    5. Graph: This is a plot of the coordinates (Coord) and is also known as the Loadings Plot. Variables located towards the periphery of the map are well explained. Those that are close together have a high positive correlation, whilst those opposite have a strong negative correlation. 
  6. Supplementary Attributes: This table provides the coordinates (Coord), correlation to each PC (Cor) and squared cosine per attribute.
  7. Biplot: this is a combination of the observation (Scores) and attribute (Loadings) plots. It is based on the logic that products to the left of the Scores Plot are characterised (i.e., have higher intensities of) variables to the left of Loadings Plot. Likewise, products to the right of the Scores plot are characterised by variables to the right of the Loadings Plot, etc. The Biplot attempts to combine both the Scores and Loadings plot into one. 
  8. 3D Biplot: This is an interactive 3D graph of PCs 1-3 that is useful for exploration and to facilitate understanding of how the product and attributes are positioned. 
  9. Clustering Info: Assigns each product to one cluster.
  10. Clustering Desc: A description of each cluster in table format. Each table contains the products assigned to that cluster and the attributes that are significantly related to that cluster vs. the average. The mean value of those attributes for that cluster is shown in the ‘Mean in category’ column, whilst the mean across all products is in the ‘Overall mean’ column. The standard deviation (SD) for the category (cluster) and the overall SD are also given. A v.test column provides a form of significance testing: a high positive v-test statistic indicates that the cluster has a significantly higher mean for that attribute than the overall average, whilst a negative v-test score indicates that the attribute is significantly lower in the category than overall. The significance is indicated in the respective ‘p.value’ column. In effect, the clustering description provides a quick summary of the dominant attributes that define each cluster relative to all products tested.
  11. Product Dendrogram: Provides the dendrogram, cut according to the number of clusters chosen manually or cut automatically. To adjust an ‘automatic’ clustering to a set number of clusters, draw a horizontal line on the dendrogram and count how many vertical lines it crosses: this gives the number of clusters, which can then be entered into the manual input box in the options. A sketch showing how these tabs map onto the underlying R output follows this list.
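
Assuming a FactoMineR PCA result res (as in the sketches above), the values behind these tabs can be located as follows, with HCPC() performing the hierarchical clustering. The mapping is illustrative; EyeOpenR’s own wrappers are not documented here.

    res$eig           # Eigenvalues tab: eigenvalue, % of variance, cumulative %
    res$ind$coord     # Products > Coord (Scores)
    res$ind$cos2      # Products > Cos2
    res$ind$contrib   # Products > Contrib
    res$var$coord     # Attributes > Coord (Loadings)
    res$var$cor       # Attributes > Cor
    res$var$cos2      # Attributes > Cos2
    res$var$contrib   # Attributes > Contrib

    # agglomerative hierarchical clustering on the PCA result; nb.clust = -1
    # requests the automatic cut, a positive integer sets it manually
    hc <- FactoMineR::HCPC(res, nb.clust = -1, graph = FALSE)
    hc$data.clust     # Clustering Info: cluster assigned to each product
    hc$desc.var       # Clustering Desc: v.test, Mean in category, Overall mean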

Technical Information

  1. R packages: FactoMineR, SensoMineR, flashClust
  2. R function settings that are not otherwise visible to the user

References 

  1. Altman, D. G. (1978). Plotting Probability Ellipses. Journal of the Royal Statistical Society: Series C (Applied Statistics), 27(3), 347–349. 
  2. Næs, T., Brockhoff, P. B., & Tomic, O. (2010). Statistics for Sensory and Consumer Science. Chichester: John Wiley & Sons.
  3. Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1), 1–18. doi:10.18637/jss.v025.i01
  4. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58, 236-244. doi: 10.1080/01621459.1963.10500845

    • Related Articles

    • Napping Analysis
    • Penalty Analysis
    • Multiple Factor Analysis (MFA)
    • Canonical Variates Analysis (CVA)
    • Correspondence Analysis (CATA and categorical data)