
Dimensionality Reduction

When preparing numerical data for analysis, it is common to reduce its number of dimensions, i.e. the number of columns the data has, while preserving as much of the information contained in the data as possible. Intuitively, this means condensing the data down to the essential information it carries while throwing away the parts that contain either no information or only minor noise.

It may be somewhat counter-intuitive to simply throw away data, but there are some significant benefits to doing so:

  • It makes the data more computationally tractable. A larger number of dimensions means every operation on the data requires more computation.

    • Additionally, if there are many dimensions that don't contain much information, computations end up being performed on sparse matrices [1] (matrices with many zeros), which can cause problems for some algorithms.

  • It can improve the robustness of prediction models trained on the data, because it forces the training process to focus on the major features of the data, which are more likely to be preserved in the real data the models will be applied to after training. This prevents the models from becoming dependent on minor features of the training data, which are more likely to be different in the real data (see the sketch after this list).

    • Having a model be too dependent on features that are specific to the training data and don't generalize is called overfitting, and it is a major reason predictive models often perform worse in production than in development.

    • Non-technical example to illustrate: suppose we are trying to teach someone who has never seen an apple before what an apple is, so we show them a set of (colored) photos of apples. However, it just so happens that all the apples in our sample are red, so (assuming the person is not colorblind) they might erroneously conclude that all apples are red, and when they later see a green apple, they won't think it is an apple. In this case, we might have done better to give them black-and-white photos of the apples, which would have prevented them from learning that erroneous association.

      • Real people would likely not make this mistake, but machines are more literal-minded than humans and will make these kinds of errors if allowed to.

      • The whole problem could also have been avoided by making sure the training data (the photos) included both green and red apples, which is important and should have been done. However, it is not always possible to account for every possibility: for example, if someone painted an apple blue, we would probably still want the person to recognize it as an apple, yet it is reasonable to think the photos we show them would not include an apple painted that way.
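
To make the overfitting point above concrete, here is a minimal sketch of the pattern it implies: train the same classifier once on all columns and once on a reduced version of the data, fitting the reduction on the training split only so nothing leaks into the test split. The use of scikit-learn, the synthetic dataset, and all parameter choices below are illustrative assumptions, not part of the original example.

```python
# Sketch: comparing a classifier trained on all features vs. on a reduced set.
# Assumes scikit-learn is installed; dataset and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: 200 columns, but only 10 carry real signal; the rest are noise.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: fit on all 200 columns.
full_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Reduced: keep 10 components, then fit the same classifier on them.
reduced_model = make_pipeline(PCA(n_components=10),
                              LogisticRegression(max_iter=1000)).fit(X_train, y_train)

for name, model in [("all features", full_model), ("10 components", reduced_model)]:
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```

A training score that is much higher than the test score for the full-feature model, with a smaller gap for the reduced model, is the signature of overfitting described above.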

Dimensionality reduction can be as simple as realizing that some columns of the data are unlikely to have any predictive value for the target variable and discarding them. More often, though, it involves applying a mathematical transformation to the data. A basic and frequently used technique is Principal Component Analysis (PCA) [4], which transforms the data into uncorrelated components that can be dropped individually, typically starting with the ones that explain the least variance.
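
As a rough sketch of what PCA looks like in practice (assuming NumPy and scikit-learn are available; the small made-up dataset below is only for illustration), one can fit PCA, inspect how much of the data's variance each component explains, and keep only enough components to cover most of it:

```python
# Sketch: using PCA to see how much variance each component carries.
# Assumes scikit-learn is installed; the random data is purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 rows, 5 columns, but the last two columns are near-copies of the first
# two, so the data really only has about 3 independent directions of variation.
base = rng.normal(size=(100, 3))
X = np.column_stack([base,
                     base[:, 0] + 0.01 * rng.normal(size=100),
                     base[:, 1] + 0.01 * rng.normal(size=100)])

pca = PCA()
X_transformed = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of variance per component

# Keep only enough components to cover ~95% of the variance, drop the rest.
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_, "components retained out of", X.shape[1])
```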

Resources:
