Understanding accumulated gradients in PyTorch
You are not actually accumulating gradients. Just leaving off optimizer.zero_grad() has no effect if you have a single .backward() call, as the gradients are already zero to begin with (technically None but they will be automatically initialised to zero). The only difference between your two versions, is how you calculate the final loss. The for … Read more