Adam optimizer goes haywire after 200k batches, training loss grows

Yes, this is a known problem with Adam.

The update equations for Adam are:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
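
For concreteness, here is a minimal NumPy sketch of one Adam step following the equations above (the function and variable names are purely illustrative, not tied to any framework):

import numpy as np

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update, mirroring the equations above.
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g          # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # EMA of the squared gradient
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t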

where m is an exponential moving average of the gradient (the first moment) and v is an exponential moving average of the squared gradients (the second moment). The problem is that after you have been training for a long time and are close to the optimum, v can become very small. If the gradients then suddenly start increasing again, m is divided by a very small number and the update explodes.

By default beta1 = 0.9 and beta2 = 0.999, so m adapts much faster than v. When the gradients spike, m can become large again while v is still small and cannot catch up; the numeric example below shows how large the resulting step gets.
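
You can see how violent the blow-up is with made-up numbers (purely illustrative):

m, v = 0.1, 1e-12              # recovering gradient average, stale tiny v
print(m / (v**0.5 + 1e-8))     # ~9.9e4: the step is ~100,000x the learning rate
print(m / (v**0.5 + 1e-2))     # ~10: a larger epsilon caps the blow-up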

To remedy this problem you can increase epsilon, which is 1e-8 by default, so the denominator can never get close to 0.
Depending on your network, an epsilon of 0.1, 0.01, or 0.001 might work well.
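
If you are using tf.keras, for example, epsilon can be passed straight to the optimizer constructor (PyTorch's torch.optim.Adam exposes the same parameter as eps):

import tensorflow as tf

# A larger epsilon damps m / (sqrt(v) + epsilon) when v is tiny.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=0.1)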
