Stochastic Gradient Descent Intuition

 This is part 2 of my blog series on optimizers. Previously I explained Gradient Descent, which is the basic building block for the other optimizers. If you want a refresher, check out my previous article; the link is at the bottom of this one.






SGD :- 


It is similar to Gradient Descent, but the main difference is that it works on only one data point at a time. Let me explain a little more.

When we use Gradient Descent, we take all the data points, iterate for the number of epochs we assigned, and compute our loss function. In SGD, we instead take a single random point and compute the loss function for it at every step of every epoch.
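
To make this concrete, here is a minimal sketch of plain SGD in Python for a simple linear model. The names X, y, lr and n_epochs, and the squared-error loss, are my own illustrative choices, not something from the article.

```python
import numpy as np

# Minimal sketch of plain SGD on a linear model with squared-error loss.
# Each update uses ONE randomly chosen data point.
def sgd(X, y, lr=0.01, n_epochs=50):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for epoch in range(n_epochs):
        for _ in range(n_samples):
            i = np.random.randint(n_samples)   # pick a single random point
            error = X[i] @ w + b - y[i]        # prediction error for that one point
            w -= lr * error * X[i]             # gradient step using only this point
            b -= lr * error
    return w, b
```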






The main advantage of SGD over Gradient Descent is that each update is faster and computationally cheaper.

Because SGD uses only one data point per update, the path towards the global minimum becomes noisy. To reduce this noise, we use the concept of Momentum.






SGD with Momentum :- 

As I mentioned above, to reduce this noise we use the concept of momentum. The main job of Momentum is to smooth the path towards the global minimum point. This smoothing is also known as an Exponentially Weighted Average.

In this article we will not go deep into the mathematical part, but here is a little of it if you want:





In the above image we have the value at each time step, and 'V' is the variable that accumulates the momentum, i.e. the running average, over time.
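
For readers who cannot see the image, the exponentially weighted average is usually written as

V_t = β · V_{t−1} + (1 − β) · θ_t

where θ_t is the new value at time t (for us, the gradient) and β controls how much of the past history 'V' remembers.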



In this figure I have explained how the weights are updated.
The momentum term is combined with the learning rate and the derivative (gradient) value.
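
Here is a small sketch of how an SGD-with-momentum update often looks in code; the variable names and the beta = 0.9 value are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Minimal sketch of SGD with momentum on a linear model (squared-error loss).
# v plays the role of 'V' above: an exponentially weighted average of gradients.
def sgd_momentum(X, y, lr=0.01, beta=0.9, n_epochs=50):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    v = np.zeros(n_features)
    for epoch in range(n_epochs):
        for _ in range(n_samples):
            i = np.random.randint(n_samples)
            grad = (X[i] @ w - y[i]) * X[i]     # gradient from one random point
            v = beta * v + (1 - beta) * grad    # smooth the gradient (exponentially weighted average)
            w -= lr * v                         # update weights with the smoothed gradient
    return w
```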



Mini-Batch SGD :-

In Mini-Batch SGD, instead of taking one data point (as in SGD), we take a batch of data points to compute the loss function.

First we decide the batch size, i.e. how many data points to use in each update. Then we apply the optimizer just like SGD and compute the cost function on that batch. The noise of Mini-Batch SGD is much lower than that of SGD.
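
Below is a minimal sketch of Mini-Batch SGD, again on a simple linear model; batch_size, lr and n_epochs are illustrative values I chose for the example.

```python
import numpy as np

# Minimal sketch of mini-batch SGD: each update averages the gradient
# over a small random batch of points instead of using just one.
def minibatch_sgd(X, y, lr=0.01, batch_size=32, n_epochs=50):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for epoch in range(n_epochs):
        order = np.random.permutation(n_samples)   # shuffle the data each epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)  # average gradient over the batch
            w -= lr * grad
    return w
```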




In deep learning we usually use other optimizers, but those optimizers build on the concept of Mini-Batch SGD.
  
Here is an image of all three gradient descent variants and how they converge to the global minimum point.



To summarise: the main advantage of using SGD is that each update is faster and computationally cheaper.

When we compare SGD and Mini-Batch SGD, the noise is much lower in Mini-Batch.
The reason is that in SGD we take only one data point at a time to compute the loss function, so the noise keeps adding up until we reach the global minimum, while in mini-batch we average over several points, so it is easier to converge to the global minimum point.

That's all about the intuition of the SGD optimizer.

Hope you guys like it.

Here is the link to the article on the Gradient Descent optimizer - Gradient Descent

THANK YOU! 🌝🌝
