Adam optimizer goes haywire after 200k batches, training loss grows
Yes. This is a known problem with Adam. The update equations for Adam are:

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
```
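For reference, here is a minimal NumPy sketch of a single Adam step following the equations above. The function name `adam_update`, its signature, and the way state is passed around are illustrative, not TensorFlow's actual API; the default hyperparameters roughly match TensorFlow's defaults.

```python
import numpy as np

def adam_update(variable, g, m, v, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step per the pseudocode above (illustrative sketch only)."""
    t += 1
    # bias-corrected step size
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    # exponential moving averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # parameter update; epsilon guards against division by a tiny sqrt(v)
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t
```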