Exploring Stochastic Gradient Descent with Restarts (SGDR)
This is my first deep learning blog post. I started my deep learning journey around January of 2017 after I heard about fast.ai from a presentation at a ChiPy meetup I attended (the Chicago Python user group). It was my first introduction to the topic, and the only thing I knew about neural networks beforehand was simple multilayer perceptron (MLP) models that acted as decent non-linear approximators for various prediction problems. Once I started my first couple of lessons, I was completely hooked by the amount of knowledge exploding out of this subfield, especially because I had recently received my master's degree in analytics and was trying to stay up to date on becoming the best data science practitioner I could be. I am currently going through the third fast.ai course, which is part1_v2, using a framework built on top of PyTorch. One of the very important skills I have been picking up from these courses is the ability to break down research papers coming out in the field and understand and implement them. One of the interesting concepts I have seen come out is the idea of Stochastic Gradient Descent with Restarts, or SGDR.
Before we get too far into SGDR, let's briefly cover the idea behind normal stochastic gradient descent and why we use it.
In deep learning, you may have an architecture with millions of parameters, and those millions of parameters are connected via functions that take some input and try to predict the output. When training a model, we know what the output should be since we pre-label our data. This leads to the idea of a “loss function”: essentially a function that attempts to measure how far our prediction deviates from the actual value we know is correct. Some common examples of loss functions include Mean Squared Error and Log Loss.
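To make that concrete, here is a minimal sketch of those two loss functions using toy numbers I made up for illustration (not taken from any particular library):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average squared deviation of the predictions from the known labels
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def log_loss(y_true, y_pred, eps=1e-15):
    # Penalizes confident wrong predictions heavily; clip to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mean_squared_error([3.0, 2.0], [2.5, 2.5]))  # 0.25
print(log_loss([1, 0], [0.9, 0.2]))                # ~0.164
```

The smaller the number these functions return, the closer our predictions are to the labels, which is exactly what we want to drive down during training.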
You can think of the entire idea behind deep learning as setting up some loss function and trying to update the many parameters the model has in some non-random way to minimize this loss function.
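That "non-random way" is gradient descent: compute the gradient of the loss with respect to each parameter and nudge the parameter in the opposite direction, scaled by a learning rate. Below is a minimal sketch on a toy two-parameter linear model of my own making (strictly speaking it uses the full dataset each step; the "stochastic" part comes from doing this on small mini-batches instead):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(100)
y = 3.0 * x + 1.0 + 0.05 * np.random.randn(100)  # data generated with w=3, b=1

w, b = 0.0, 0.0  # the "millions of parameters", reduced to just two
lr = 0.1         # learning rate

for step in range(500):
    y_pred = w * x + b
    # Gradients of the Mean Squared Error loss with respect to w and b
    grad_w = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))
    # Move each parameter against its gradient to reduce the loss
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should end up close to 3 and 1
```

Every deep learning framework is doing some variant of this same loop, just with far more parameters and with the gradients computed automatically.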
Visualized, you can see in the example to the left that with 2 parameters we have some non-linear…