A Guide To Gradient Descent
Gradient Descent is a technique used to train neural networks: the gradients computed during backpropagation are used to tune the weights so that the loss is minimized. There are three widely used kinds of gradient descent algorithms, and I will be explaining the differences between them in this article. The core update rule for gradient descent is as follows: each weight is nudged in the opposite direction of the gradient of the loss with respect to that weight, scaled by a learning rate.
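Here is a minimal sketch of that update rule in plain Python (the quadratic toy loss, the starting weight, and the learning rate of 0.1 are my own illustration, not part of the original article):

```python
# Minimal sketch of the gradient descent update rule:
#   new_weight = old_weight - learning_rate * dLoss/dWeight
# Toy loss L(w) = (w - 3)^2, whose minimum sits at w = 3.

def loss_gradient(w):
    # dL/dw for L(w) = (w - 3)^2
    return 2 * (w - 3)

w = 0.0             # starting weight
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * loss_gradient(w)

print(w)  # ends up very close to the minimum at w = 3
```

In a real network, w is the full set of weights and the gradient comes from backpropagation, but the update step itself looks just like this.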
The three gradient descent algorithms (Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent) differ only in the subset of the data that the gradients are calculated on.
Batch Gradient Descent
In Batch Gradient Descent, or BGD, gradients are calculated on the entire dataset. This provides a good representation of the data and guarantees that your neural network will settle into a local minimum. On paper this sounds fine; however, there are many problems with Batch Gradient Descent. For one, it is insanely slow to iterate through the entire dataset on every single training step. Batch Gradient Descent will also never reach the global minimum if the loss curve contains a local minimum sitting between the starting weights and the global minimum.
Batch Gradient Descent will always get stuck at the local minimum in a case like this. Batch Gradient Descent is the worst of the Gradient Descent algorithms.
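To make "gradients calculated on the entire dataset" concrete, here is a rough sketch of BGD on a toy linear regression problem (the data, the single weight w, and the learning rate are invented for illustration):

```python
import numpy as np

# Toy data for linear regression y ~ w * x (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
learning_rate = 0.1

# Batch Gradient Descent: the gradient of the mean squared error is
# averaged over the ENTIRE dataset before the weight is updated.
for step in range(100):
    predictions = w * X
    gradient = np.mean(2 * (predictions - y) * X)  # uses all 1000 samples
    w -= learning_rate * gradient

print(w)  # approaches the true slope of 2.0
```

Every one of the 1,000 samples is touched before the weight moves even once, which is exactly why BGD gets so slow on large datasets.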
Stochastic Gradient Descent
Stochastic Gradient Descent, or SGD for short, takes a single sample of data from the dataset at every training step and calculates gradients based on that. This, just like Batch Gradient Descent, is a flawed approach. A single sample tends not to represent the data very well, so the network takes longer to converge; however, each step is incredibly fast and SGD will most likely reach the global minimum. This is because, in the loss curve of doom described earlier, the randomly selected data points cause the weights to bounce around and get pushed out of local minima. This means that SGD is most likely going to converge, but it will take many training steps. The main problem with this algorithm is its random nature. Stochastic, if you didn’t know, is simply a fancy word for random.
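Here is the same toy problem updated one random sample at a time, a rough sketch of SGD (the data, learning rate, and step count are again my own illustration):

```python
import numpy as np

# Same toy linear regression data as before (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
learning_rate = 0.01

# Stochastic Gradient Descent: a SINGLE random sample drives each update,
# which is why the path to the minimum is so noisy.
for step in range(5000):
    i = rng.integers(len(X))                 # pick one random data point
    gradient = 2 * (w * X[i] - y[i]) * X[i]
    w -= learning_rate * gradient

print(w)  # noisy, but hovers around the true slope of 2.0
```

Each update is cheap but noisy, which is both why SGD can bounce out of local minima and why it needs so many more steps to settle down.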
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent, or MBGD, combines the best of SGD and BGD. MBGD takes a small random subset of data points and calculates the gradients based on that. This has many advantages: MBGD provides a pretty good representation of the data, is nearly as fast per step as SGD thanks to parallel processing on a GPU, and can bounce out of local minima just like SGD. At the same time, its path to the minimum is a lot less bouncy than SGD’s.
This makes it by far the best gradient descent algorithm and the ideal choice in most cases. The ideal number of data points in the mini-batch (the batch size) will vary from case to case, but for most cases any small power of 2, such as 32 or 64, is a good choice, as in the sketch below.
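A rough sketch of MBGD on the same toy problem, using a batch size of 32 (one of those small powers of 2; the rest of the setup is again invented for illustration):

```python
import numpy as np

# Same toy linear regression data as before (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
learning_rate = 0.1
batch_size = 32      # a small power of 2, as suggested above

# Mini-Batch Gradient Descent: a random subset of the data is averaged,
# giving a less noisy gradient than SGD at a fraction of the cost of a
# full pass over the dataset.
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    predictions = w * X[idx]
    gradient = np.mean(2 * (predictions - y[idx]) * X[idx])
    w -= learning_rate * gradient

print(w)  # converges to roughly 2.0
```

Averaging 32 samples smooths out most of SGD’s noise while still only touching a small fraction of the dataset per step.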
Hopefully, you now understand the different kinds of gradient descent and their individual properties after reading this article. Other similar articles can be found on my Medium page, as long as 9th grade doesn’t consume every last second of my time.