Machine Learning | Gradient Descent
Gradient descent is a mathematical technique that is often used in machine learning [8] to tune the parameters of a model [1]. Although the mechanical details of how it works can get quite technical, its core concept in the context of machine learning is relatively simple, and that is what will be outlined in this card.
Loss Minimization Example
As noted in [2], one important step in developing a machine learning model is to minimize its loss on the dataset it is trained on, and typically this is done by tuning the parameters of the model until the loss reaches a minimum. Note that the loss, by design, is always a non-negative number.
The example shown is a plot of a hypothetical loss function against a parameter m of some model. In this toy model, the loss function is a convex parabola with a minimum at m = 5, so here we could simply set m to that value and the model would be tuned.
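As a small illustration (the card doesn't give the actual formula, so a parabola like (m - 5)^2 is assumed here purely as an example), this toy loss can be written down and minimized directly:

    # A hypothetical convex loss with its minimum at m = 5.
    # The exact formula is not given on this card; this parabola is just an assumed stand-in.
    def loss(m):
        return (m - 5) ** 2

    # For this toy case we can read the best value of m straight off the formula:
    best_m = 5
    print(loss(best_m))   # prints 0, the smallest possible loss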
However, in practice it isn't possible to figure this out so easily. For one thing, real models would usually have more than a single parameter (neural nets can have thousands, for example), and even for simpler models, the loss function may not behave as neatly as it did here. More robust numerical techniques are needed to find the minimum, and gradient descent is one such technique.
Gradient Descent Concept
Please note that [7] provides a good introduction to what gradient descent is in (slightly) more technical depth; this card will describe the core concept without dealing much with the underlying mathematics.
The essential idea behind gradient descent is this: if you keep going down a valley, eventually you will get to the bottom of that valley [3], even if, starting off, you can't see where the bottom is. In the case of loss minimization, the "valley" is the curve representing the value of the loss against the parameters, such as the parabola example shown previously.
The key here is that you don't actually need to know exactly where the bottom of the valley is in order to get there. Even without any knowledge of its actual location, if you keep going in the "down" direction you'll eventually get to the bottom, and you'll know that you're at the bottom because there isn't a way to go further down in either direction [4].
So, suppose someone is blindfolded and then dropped at some random point in a valley; an informal algorithm for descending to the bottom might be:
1. Check the slope of where I am right now. Is there a direction where the slope is pointing downwards?
2. If not, then I'm at the bottom, so stop moving.
3. If there is a downward-sloping direction, then take a step in that direction, and go back to step 1.
And, of course, once they get to the bottom of the "valley", they can report back their location coordinates, which is equivalent to finding the model parameter value(s) where the loss is the smallest.
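As a rough sketch in code, here is what that informal algorithm looks like for a single parameter. The loss, its slope, the starting point, and the step size are all assumed for illustration (the same hypothetical parabola as above), not something specified on this card:

    # Gradient descent on a single parameter m, mirroring the informal algorithm above.
    def loss(m):
        return (m - 5) ** 2        # assumed toy loss

    def slope(m):
        return 2 * (m - 5)         # derivative of the assumed loss

    m = 0.0                        # "dropped at a random point on the valley"
    step_size = 0.1                # how big each step is

    for _ in range(1000):
        s = slope(m)
        if abs(s) < 1e-8:          # no downward direction left: we're at the bottom
            break
        m = m - step_size * s      # step in the downward direction

    print(m)                       # close to 5, the bottom of the valley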
What is the "Gradient"?
As noted above, the only thing needed to descend to the bottom of the curve is to know which direction is "sloping down". And, in the 2-d case, that direction can be determined by looking at the slope of the curve at any given point: if the slope is negative, then going right is the downward direction; if the slope is positive, then going left is the downward direction; and if the slope is 0, then we have reached the minimum point.
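For example, with the assumed parabola loss (m - 5)^2 used in the sketches above, the slope at m = 2 is 2 * (2 - 5) = -6, which is negative, so "down" is to the right (toward m = 5); at m = 8 the slope is +6, so "down" is to the left.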
In the 2-d case, this descent direction can be determined by the slope, which is just a single number for each location on the x-axis. When there are more than 2 dimensions (that is, when the model has more than one parameter), however, the "slope" is no longer called by that name, and is instead called the gradient [6]; it collects one slope-like number for each parameter into a vector.
Despite the different name, the gradient works the same way as the slope for the purpose of finding the minimum of the loss function—it allows one to determine which direction to go in order to descend on the curve. And like the slope, once the gradient is 0, that is when we know that we have reached the bottom.
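To make this concrete, here is a small sketch that estimates the gradient of a two-parameter loss numerically, as a vector of one slope per parameter. The loss function and the finite-difference approach are assumed purely for illustration, not taken from this card:

    # Estimate the gradient of an assumed two-parameter loss numerically.
    def loss(params):
        a, b = params
        return (a - 1) ** 2 + (b + 3) ** 2    # assumed toy loss with minimum at a = 1, b = -3

    def numerical_gradient(f, params, eps=1e-6):
        grad = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            # Slope of f along parameter i (a finite-difference estimate of the partial derivative).
            grad.append((f(bumped) - f(params)) / eps)
        return grad

    print(numerical_gradient(loss, [0.0, 0.0]))   # roughly [-2.0, 6.0]

Descending then just means stepping each parameter against the sign of its entry in this vector, exactly as in the one-parameter case.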
Calculating the Gradient
Performing a full by-the-book calculation of the gradient over an entire dataset is a rather computationally intensive process, and so in practice various shortcuts are used to make it tractable. A couple of the most fundamental such techniques used in machine learning are making the calculation stochastic [10] (estimating the gradient from a random sample of the training data rather than the full dataset) and using the backpropagation algorithm [11] (an application of the chain rule from calculus that computes the gradient efficiently).
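As a rough sketch of the stochastic shortcut (the tiny dataset, the one-parameter model, and the squared-error loss here are all invented for illustration): instead of computing the gradient over the entire training set at every step, estimate it from a small random sample.

    import random

    # Toy dataset generated from y = 3 * x; fitting y = m * x should drive m toward 3.
    data = [(x, 3 * x) for x in range(1, 101)]

    def gradient_on_sample(m, sample):
        # Gradient of the squared error (m*x - y)^2 with respect to m is 2*x*(m*x - y);
        # average it over the sampled points instead of the whole dataset.
        return sum(2 * x * (m * x - y) for x, y in sample) / len(sample)

    m = 0.0
    step_size = 0.0001

    for _ in range(2000):
        sample = random.sample(data, 10)              # a small random batch, not all 100 points
        m = m - step_size * gradient_on_sample(m, sample)

    print(m)   # close to 3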
Footnotes/Resources
[3] Assuming that it actually has a bottom; but (under mild conditions) we can assume this to be true in the case of loss minimization, because losses will always be at least 0 so we know that the "valley" doesn't just keep going down forever.
[4] Clearly there are some complications here—for example, what happens if there's another valley beyond this one whose bottom is even lower? Or what if the steps are too small and it takes too long to get to the bottom, and you only have a limited amount of time to do so? Or they're too big and you accidentally step over the bottom? Dealing with these complications is an important part of designing gradient descent algorithms in real life.
[6] The gradient is also often characterized as the "direction of greatest rate of increase", or the "vector of partial derivatives along the basis directions". These are more mathematically precise formulations, but the author finds "multidimensional equivalent of the slope" to be more intuitive.
[8] Do note that gradient descent (and its common implementation, stochastic gradient descent) is not specific to machine learning; it is a general numerical optimization technique that just happens to be extremely useful and prevalent in the machine learning context. For those who have some experience with calculus, [9] provides a more in-depth (but still intuitive) explanation of it.
[9] https://betterexplained.com/articles/vector-calculus-understanding-the-gradient/
[11] [backpropagation card]