Lessons
1. Why vanilla gradient descent struggles
2. Exponential moving averages
3. SGD with momentum
4. Nesterov momentum: looking ahead
5. AdaGrad: per-parameter learning rates
6. RMSprop: fixing AdaGrad
7. Adam: the complete derivation
8. Learning rate schedules: warmup and decay
9. Optimizer cookbook: when to use what