Stochastic Gradient Descent Intuition
This is part 2 of my blog series on optimizers. Previously I explained Gradient Descent, which is the basic building block for the other optimizers; if you want a refresher, check out my previous article. The link is at the bottom of this article.

SGD :- It is a similar type of optimizer to Gradient Descent, but the main difference is that it works on only one data point at a time. Let me explain a little more. When we use Gradient Descent, we take all the data points, iterate for the number of epochs we assigned, and compute our loss function over the whole dataset. But in SGD we take only one single random point and compute the loss function for it at every update. The main advantage of SGD over Gradient Descent is that each update is much cheaper and faster to compute. Because SGD uses only one data point, some noise occurs while moving towards the global minimum. To reduce this type of noise we use the concept of Momentum. A small code sketch of the single-sample update is given at the end of this section.

SGD with Momentum :- As I have told above
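To make the single-sample update described above concrete, here is a minimal sketch in Python/NumPy, assuming a simple linear model trained with squared error. The function name `sgd` and the parameters `lr` and `epochs` are just illustrative choices for this sketch, not something from the original post.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=1000):
    """Minimal sketch of plain SGD for a linear model with squared-error loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Pick ONE random data point per update -- this is what makes it "stochastic".
        i = np.random.randint(n_samples)
        xi, yi = X[i], y[i]
        error = xi @ w + b - yi
        # Gradient of the squared error for this single sample only.
        w -= lr * 2 * error * xi
        b -= lr * 2 * error
    return w, b

# Toy usage example with synthetic data (purely illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=200)
    w, b = sgd(X, y, lr=0.05, epochs=2000)
    print("learned w:", w, "learned b:", b)
```

Because each step looks at a single random point, the loss bounces around instead of decreasing smoothly; that is exactly the noise mentioned above that Momentum is meant to smooth out.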