
Machine Learning | Stochastic (Mini-Batch) Gradient Descent (SGD)

Gradient descent [1] is a technique to minimize the loss [2] in machine learning models by "descending" the loss function in the direction opposite to the gradient, which is a kind of higher-dimensional equivalent of the slope of a 2-D curve.
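As a concrete illustration, here is a minimal sketch of that update rule applied to a toy linear-regression loss. The data, learning rate, and step count are hypothetical choices made for the example, not part of any particular library:

```python
import numpy as np

# Toy data: 1,000 observations, 3 features, for a linear model y ~ X @ w
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)          # model parameters
learning_rate = 0.1

for step in range(200):
    # Gradient of the mean-squared-error loss over the FULL dataset
    grad = 2 * X.T @ (X @ w - y) / len(X)
    # "Descend": move the parameters a small step against the gradient
    w -= learning_rate * grad
```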

Calculating the true gradient requires finding the model's loss on the full training dataset. However, for real-world applications this requires far too much computation to be feasible, especially since this computation has to be done thousands of times or more.

Therefore, as a shortcut, instead of using the full training set, we can use a random subset of it. As long as it's not the same subset at each descent step, this produces an estimate of the gradient that is "close enough" [3]: it still points in approximately the same direction of descent, and over many steps it still leads to the minimum. This random sampling of the dataset is the "stochastic" part of stochastic gradient descent, and this technique (along with its variant, mini-batch gradient descent, see below) is the bedrock of the training process of a great many machine learning algorithms.
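Here is a sketch of that idea on the same toy problem as above; the batch size and sampling scheme are illustrative assumptions. The only change from full-batch gradient descent is that each step estimates the gradient from a small random sample:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
learning_rate = 0.1
batch_size = 32          # a small random subset, not the full dataset

for step in range(2000):
    # Draw a different random mini-batch at each step ("stochastic")
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient estimated on the mini-batch only: cheap but noisy
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= learning_rate * grad
```

Each step is much cheaper than a full-dataset pass, so many more steps can be taken in the same amount of computation, which is why the noisy estimate still wins out in practice.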

Terminology Note:

Several related terms are used for the gradient descent variants in machine learning, which can get confusing. Here is what is commonly meant when these terms are referenced:

  • batch gradient descent: this is "regular" gradient descent, using the full training dataset to calculate the gradient

  • stochastic gradient descent: this is when the sample used to estimate the gradient has only a single observation in it

  • mini-batch gradient descent: this is when the sample used to estimate the gradient has more than one observation, but still not the full dataset (usually it's a small percentage of the full dataset)

The distinction between the stochastic and mini-batch variants seems to be mostly historical rather than practical, and oftentimes people will use stochastic to refer to either case. The Guru cards will largely use stochastic gradient descent in this vein, to refer to using a random sample of any size so long as it's not the full dataset, as the sketch below illustrates.
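In code, the three variants differ only in how many observations go into each gradient estimate. A hypothetical helper like the one below makes that explicit (the function name and defaults are illustrative, not from any library):

```python
import numpy as np

def gradient_step(w, X, y, rng, learning_rate=0.1, batch_size=None):
    """One descent step; batch_size determines which variant this is.

    batch_size = len(X)      -> batch gradient descent (full dataset)
    batch_size = 1           -> stochastic gradient descent (one observation)
    1 < batch_size < len(X)  -> mini-batch gradient descent
    """
    if batch_size is None or batch_size >= len(X):
        Xb, yb = X, y                              # full dataset
    else:
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]                    # random sample
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)      # mean-squared-error gradient
    return w - learning_rate * grad
```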

See [4] for a simple demonstration of SGD in action.

Footnotes/Resources:
