
R | Performing Principal Component Analysis (PCA)

As noted in [1], PCA is essentially a rotation of the data to a new set of axes, the directions of which are represented by vectors that are the eponymous "principal components".

Before starting, make sure that the data is fully numerical (i.e. it's only ints or doubles), and it has no missing values. This means that PCA should come after steps such as encoding and missing data handling.

Additionally, if this is being done on data that will be used to train a predictive model, the target variable (what you're trying to predict) should NOT be included—only the predictors should be transformed. This is because the same transformation will need to be done to the live data, and the target variable will not be available there.
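As a minimal sketch of this preparation step, using the built-in iris data with its Species column standing in for a target variable (substitute your own target column name):

```r
# Separate the target from the predictors before PCA.
# `Species` stands in for the target here; only the predictors get transformed.
target <- iris$Species
predictors <- iris[, setdiff(names(iris), "Species")]

# PCA is then run on the predictors only
pca_info <- prcomp(predictors, center = TRUE, scale. = TRUE)
```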

Performing PCA:

Performing PCA itself in R is very straightforward. Suppose you have some untransformed_data:

untransformed_data <- iris[, -5]  # drop the non-numeric Species column

# perform PCA and get information; `center` and `scale.` should normally be set to TRUE
data_pca_info <- prcomp(untransformed_data, center = TRUE, scale. = TRUE)

# get the principal components; each principal component (PC1, PC2, etc.) is a
# linear combination of the original column variables, and this matrix contains
# the coefficients of those linear combinations
pca_components <- data_pca_info$rotation  # also called the "rotation" matrix

# get the post-PCA-transformation data
pca_transformed_data <- data_pca_info$x

Deciding on Columns to Drop:

One of the goals of PCA is dimensionality reduction, which can be accomplished by dropping one or more of the components. Each component "holds" a certain amount of variance in the data [2], and we want to maximize the remaining variance, so typically we look to drop the components holding the smallest variances.

# examine the amount of variance each PC holds
summary(data_pca_info)

You'll get this kind of printout:

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The important line here is "Cumulative Proportion", which tells you the total proportion of variance captured by all components up to and including that one. In the example above, keeping PC1 and PC2 still retains about 95.8% of the total variance, so we may look to drop the other two components (PC3 and PC4).

Note that there is no hard-and-fast rule for what proportion of variance to keep; this could depend on the data, the model you're feeding the data into, and potentially other factors. Anecdotally, it seems that in most cases you probably want to keep at least 80% of the variance, but this is not a rigorous determination.
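Dropping components is just a matter of subsetting the columns of the transformed data. A minimal sketch, assuming we've decided to keep PC1 and PC2 based on the cumulative-proportion printout above:

```r
# run PCA on the numeric columns of iris (self-contained recap of the earlier steps)
data_pca_info <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
pca_transformed_data <- data_pca_info$x

# keep only the first two components; `drop = FALSE` preserves the matrix shape
# even if only one component were kept
n_keep <- 2  # assumption: chosen from the Cumulative Proportion line
reduced_data <- pca_transformed_data[, 1:n_keep, drop = FALSE]
```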

Applying Same PCA Transformation to New Data:

Typically, when using a model that was trained on PCA-transformed data, it will be necessary to apply the same transformation to any predictor data that is fed into that model after it's been trained. This is also easy to do in R:

# `new_untransformed_data` should have the same columns as `untransformed_data`,
# with the same preparation done
new_transformed_data <- predict(data_pca_info, newdata = new_untransformed_data)
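Under the hood, predict() here just re-applies the centering and scaling learned from the training data and then projects onto the rotation matrix. The following sketch (using the first few iris rows as stand-in "live" data) checks that equivalence:

```r
data_pca_info <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
new_untransformed_data <- head(iris[, -5])  # stand-in for "live" predictor data

via_predict <- predict(data_pca_info, newdata = new_untransformed_data)

# equivalent manual transformation: center/scale with the *training* parameters,
# then project onto the rotation matrix
manual <- scale(as.matrix(new_untransformed_data),
                center = data_pca_info$center,
                scale  = data_pca_info$scale) %*% data_pca_info$rotation

all.equal(via_predict, manual, check.attributes = FALSE)  # TRUE
```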

Resources/Footnotes:

  • [1] Data | Principal Component Analysis (PCA)

  • [2] variance is a statistical quantity for a dataset that measures the spread of the data. For PCA, the proportion of variance each PC "holds" can be used as a proxy for the amount of information it contains relative to the whole set.
