Understanding accumulated gradients in PyTorch

You are not actually accumulating gradients. Just leaving out optimizer.zero_grad() has no effect if you have a single .backward() call, since the gradients are already zero to begin with (technically None, but they will be automatically initialised to zero). The only difference between your two versions is how you calculate the final loss. The for … Read more
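As a quick illustration of the distinction, here is a minimal gradient-accumulation loop (the model, data, and accum_steps are placeholders, not taken from the question): gradients from several micro-batches add up in .grad, and only then does the optimizer step and reset.

```python
import torch

# Hypothetical tiny model and data, purely for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

accum_steps = 4  # number of micro-batches whose gradients we accumulate
optimizer.zero_grad()
for step, (x, y) in enumerate([(torch.randn(8, 10), torch.randn(8, 1))] * 8):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches a big-batch mean
    loss.backward()                            # gradients add into .grad across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # update with the accumulated gradients
        optimizer.zero_grad()  # reset before the next accumulation window
```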

How to calculate optimal batch size?

From the recent Deep Learning book by Goodfellow et al., chapter 8: Minibatch sizes are generally driven by the following factors: Larger batches provide a more accurate estimate of the gradient, but with less than linear returns. Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below … Read more
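To make the "less than linear returns" point concrete, here is a small synthetic experiment (the per-example gradient samples are made up): the standard error of a mean-over-m-samples gradient estimate shrinks roughly like 1/sqrt(m), so quadrupling the batch size only halves the noise.

```python
import numpy as np

# Synthetic per-example gradients; only the 1/sqrt(m) scaling matters here.
rng = np.random.default_rng(0)
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

for m in (8, 32, 128, 512):
    usable = (len(per_example_grads) // m) * m
    estimates = per_example_grads[:usable].reshape(-1, m).mean(axis=1)
    print(f"batch size {m:4d}: std of gradient estimate {estimates.std():.3f} "
          f"(theory {2.0 / np.sqrt(m):.3f})")
```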

What is `lr_policy` in Caffe?

It is a common practice to decrease the learning rate (lr) as the optimization/learning process progresses. However, it is not clear how exactly the learning rate should be decreased as a function of the iteration number. If you use DIGITS as an interface to Caffe, you will be able to visually see how the different … Read more
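For reference, here is a rough Python transcription of a few common lr_policy rules as I understand them from Caffe's solver source (fixed, step, exp, inv, poly); the parameter names mirror the solver prototxt fields, but double-check against your Caffe version before relying on the exact formulas.

```python
def caffe_lr(policy, it, base_lr=0.01, gamma=0.1, power=0.75,
             stepsize=10_000, max_iter=100_000):
    """Approximate reimplementation of Caffe's lr_policy schedules."""
    if policy == "fixed":
        return base_lr
    if policy == "step":   # drop by gamma every `stepsize` iterations
        return base_lr * gamma ** (it // stepsize)
    if policy == "exp":    # exponential decay every iteration
        return base_lr * gamma ** it
    if policy == "inv":    # smooth hyperbolic decay
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "poly":   # polynomial decay down to zero at max_iter
        return base_lr * (1 - it / max_iter) ** power
    raise ValueError(f"unknown policy {policy!r}")

for it in (0, 10_000, 50_000):
    print(it, caffe_lr("step", it), round(caffe_lr("inv", it), 6))
```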

Sklearn SGDClassifier partial fit

I have finally found the answer. You need to shuffle the training data between each iteration, as setting shuffle=True when instantiating the model will NOT shuffle the data when using partial_fit (it only applies to fit). Note: it would have been helpful to find this information on the sklearn.linear_model.SGDClassifier page. The amended code reads as … Read more
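A minimal sketch of the fix, with synthetic data and hypothetical hyperparameters: reshuffle X and y yourself between epochs, since shuffle=True is honoured only by fit(), not partial_fit().

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle

# Synthetic, linearly separable-ish data purely for illustration.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)

for epoch in range(10):
    X, y = shuffle(X, y, random_state=epoch)   # reshuffle between epochs ourselves
    clf.partial_fit(X, y, classes=classes)     # classes are required on the first call

print("training accuracy:", clf.score(X, y))
```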

pytorch how to set .requires_grad False

requires_grad=False
If you want to freeze part of your model and train the rest, you can set requires_grad of the parameters you want to freeze to False. For example, if you only want to keep the convolutional part of VGG16 fixed:

model = torchvision.models.vgg16(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False

By switching the requires_grad … Read more
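A short sketch of the full pattern under the same assumptions (pretrained VGG16, training only the classifier head): freeze the feature extractor and hand the optimizer only the parameters that still require gradients. The optimizer and its settings here are placeholders.

```python
import torch
import torchvision

# pretrained=True follows the excerpt; newer torchvision versions prefer the
# weights= argument instead.
model = torchvision.models.vgg16(pretrained=True)

for param in model.features.parameters():
    param.requires_grad = False   # frozen: no gradients computed or applied

# Pass only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```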

What is the difference between SGD and back-propagation?

Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks. This is not a learning method, but rather a nice computational trick which is often used in learning methods. It is actually a simple implementation of the chain rule of derivatives, which simply gives you the ability to compute … Read more
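A tiny PyTorch illustration of that separation (the values are arbitrary): autograd performs the backpropagation, i.e. the chain rule, that produces the gradient, and the SGD update that consumes it is a separate one-line step.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x - 1.0) ** 2     # forward pass
loss.backward()               # backpropagation: d(loss)/dw = 2 * (w*x - 1) * x = 30
print(w.grad)                 # tensor(30.)

lr = 0.01
with torch.no_grad():
    w -= lr * w.grad          # the SGD step, separate from gradient computation
```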

Why do we need to explicitly call zero_grad()? [duplicate]

We explicitly need to call zero_grad() because, after loss.backward() (when gradients are computed), we need to call optimizer.step() to perform the gradient descent update. More specifically, the gradients are not automatically zeroed because these two operations, loss.backward() and optimizer.step(), are separate, and optimizer.step() requires the just-computed gradients. In addition, sometimes we need to accumulate gradients among … Read more
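A minimal training loop showing the usual placement (model and data are placeholders): zero_grad() clears the gradients left over from the previous step before backward() writes fresh ones and step() applies them; leaving zero_grad() out would make each step use the running sum of all past gradients.

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for _ in range(3):
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute fresh gradients
    optimizer.step()               # apply them
```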

What is the difference between Gradient Descent and Newton’s Gradient Descent?

At a local minimum (or maximum) x, the derivative of the target function f vanishes: f'(x) = 0 (assuming sufficient smoothness of f). Gradient descent tries to find such a minimum x by using information from the first derivative of f: It simply follows the steepest descent from the current point. This is like rolling … Read more
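A one-dimensional sketch of the difference on an arbitrary quartic f(x) = x^4 - 3x^2 + 2: gradient descent uses only f'(x) together with a hand-picked step size, while Newton's method also divides by the curvature f''(x) and needs no learning rate, typically converging in far fewer steps near a well-behaved minimum.

```python
def f_prime(x):   # f'(x) for f(x) = x**4 - 3*x**2 + 2
    return 4 * x**3 - 6 * x

def f_second(x):  # f''(x)
    return 12 * x**2 - 6

x_gd, x_newton, lr = 2.0, 2.0, 0.05
for _ in range(10):
    x_gd     -= lr * f_prime(x_gd)                        # first-order step
    x_newton -= f_prime(x_newton) / f_second(x_newton)    # second-order step

print(x_gd, x_newton)  # both approach the minimum at sqrt(1.5) ~ 1.2247
```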

why gradient descent when we can solve linear regression analytically

When you use the normal equations to minimise the cost function analytically you have to compute θ = (XᵀX)⁻¹Xᵀy, where X is your matrix of input observations and y your output vector. The problem with this operation is the time complexity of inverting an n×n matrix, which is O(n^3), and as n increases it can … Read more
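A small numpy sketch on synthetic data contrasting the two routes: the closed-form normal-equation solve versus iterative batch gradient descent. The data sizes, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 3))]   # bias column plus 3 features
theta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)

# Closed form θ = (XᵀX)⁻¹Xᵀy: the solve/inversion costs O(n^3) in the feature count n.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: O(m·n) per iteration, preferable when n is large.
theta_gd, lr = np.zeros(4), 0.1
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= lr * grad

print(np.round(theta_closed, 3), np.round(theta_gd, 3))  # both recover theta_true
```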