Data | Principal Component Analysis (PCA)

Principal component analysis (PCA) [3] is a technique that re-expresses a set of numerical observations in a new coordinate system, so that the number of dimensions [1] of the data can be reduced while losing as little information as possible.

Explanation:

A full technical explanation of PCA is outside this card's scope (see [4] for a more technical, though still simplified, treatment). In essence, PCA rotates the data so that it is easier to "flatten" along one or more of the axes, which is what we want in order to reduce the number of dimensions.
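The rotation idea can be sketched for the two-dimensional case using only the Python standard library. This is an illustrative sketch, not a general PCA implementation: the function names `principal_angle` and `rotate_to_principal_axes` and the synthetic `points` data are assumptions made up for this example, and the closed-form angle formula applies only to 2-D data.

```python
import math
import random


def principal_angle(points):
    """Return (theta, mean) for 2-D points.

    theta is the angle, in radians, of the direction of greatest variance.
    Uses the closed form for a 2x2 covariance matrix (2-D data only).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return theta, (mean_x, mean_y)


def rotate_to_principal_axes(points):
    """Center the points, then rotate them so the first coordinate lies
    along the direction of greatest variance."""
    theta, (mean_x, mean_y) = principal_angle(points)
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        rotated.append((dx * c + dy * s, -dx * s + dy * c))
    return rotated


# Hypothetical data: a noisy diagonal cloud, lower left to upper right.
random.seed(1)
points = []
for _ in range(200):
    t = random.gauss(0, 2)
    points.append((t + random.gauss(0, 0.3), t + random.gauss(0, 0.3)))

rotated = rotate_to_principal_axes(points)
```

After the rotation, the two new coordinates are uncorrelated and the first one carries most of the spread, so dropping the second coordinate is the "flattening" step described above.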

[Figure: GaussianScatterPCA.svg]

Source: https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.svg#/media/File:GaussianScatterPCA.svg

In this simplified example, the data largely follow a diagonal pattern from lower left to upper right, with some spread around that diagonal. If we wanted to reduce the data to a single dimension, it would not be a good idea to "squish" (i.e., project) the data onto either of the original axes, because that would remove more of the variation in the data (its variance) than is strictly necessary.

Instead, we can find a new set of axes that point along the two main directions of the data, as shown by the arrows in the diagram. Projecting the data onto the axis that points in the "long" direction loses only the variance that was present along the "short" direction, which is much less than we would lose by projecting onto either of the original axes.
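To make the comparison concrete, the following sketch measures how much of the total variance survives a projection onto each original axis versus the "long" direction. It uses only the Python standard library. The synthetic data, the helper names `variance` and `covariance`, and the `retained` dictionary are assumptions for illustration; the angle formula is the closed form for two dimensions and does not generalize to more.

```python
import math
import random


def variance(values):
    """Sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: wide spread along the diagonal, narrow spread across it.
random.seed(0)
xs, ys = [], []
for _ in range(500):
    along = random.gauss(0, 3)     # "long" direction
    across = random.gauss(0, 0.5)  # "short" direction
    xs.append((along - across) / math.sqrt(2))
    ys.append((along + across) / math.sqrt(2))

sxx, syy, sxy = variance(xs), variance(ys), covariance(xs, ys)
total = sxx + syy

# Direction of greatest variance (closed form, valid for 2-D data only).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)

# Coordinates of each point along that direction.
pc1 = [x * c + y * s for x, y in zip(xs, ys)]

# Fraction of total variance kept by each possible one-dimensional projection.
retained = {
    "x axis": sxx / total,
    "y axis": syy / total,
    "first principal axis": variance(pc1) / total,
}

for name, fraction in retained.items():
    print(f"{name}: {fraction:.1%} of the variance retained")
```

For data shaped like this, each original axis keeps roughly half of the variance, while the first principal axis keeps nearly all of it.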

Performing PCA:

  • In R: see [2]

Resources:
