Making deep networks trainable
The activation distribution problem
Batch normalization: the algorithm
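The training-time transform under this heading can be sketched in a few lines: normalize each feature over the mini-batch, then apply a learned scale and shift. A minimal NumPy sketch (function and argument names are illustrative, not from the source):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass at training time.

    x: (batch, features); gamma, beta: learned per-feature scale and shift.
    Statistics are computed over the batch axis, one pair per feature.
    """
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta, mu, var     # batch stats are also returned for
                                             # the running-average update
```

With `gamma = 1` and `beta = 0`, each output feature has (approximately) zero mean and unit variance over the batch; the learned parameters let the layer undo the normalization if that helps training.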
BatchNorm at inference time
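At inference there may be no batch to compute statistics over, so the standard recipe is to replace batch statistics with running averages accumulated during training. A minimal sketch of both pieces, assuming the usual exponential-moving-average update (names are illustrative):

```python
import numpy as np

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: fixed population estimates replace batch
    statistics, so a single example yields a deterministic output."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

def update_running(running, batch_stat, momentum=0.9):
    """Exponential moving average of a batch statistic, maintained during
    training and frozen at inference."""
    return momentum * running + (1.0 - momentum) * batch_stat
```

Because the normalization uses fixed constants at inference, the whole transform collapses into a single affine map that can be folded into the preceding layer's weights.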
Layer normalization
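Layer norm differs from batch norm only in the axis it normalizes over: statistics are taken across the features of each individual example, so it is independent of batch size and identical at training and inference. A minimal sketch (names illustrative):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Layer norm: per-example statistics over the last (feature) axis.

    x: (..., features); gamma, beta: per-feature scale and shift.
    """
    mu = x.mean(axis=-1, keepdims=True)      # one mean per example
    var = x.var(axis=-1, keepdims=True)      # one variance per example
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Because no cross-example statistics are involved, layer norm works with batch size 1 and with variable-length sequences, which is why it is the default in transformer architectures.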
Instance, group, and weight normalization
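These variants again differ mainly in which axes the statistics run over. Group norm, which contains instance norm (`groups == channels`) and layer norm (`groups == 1`) as special cases, can be sketched as follows for image-shaped activations (a NumPy sketch with illustrative names, assuming NCHW layout):

```python
import numpy as np

def groupnorm(x, gamma, beta, groups, eps=1e-5):
    """Group norm: per-example statistics within each group of channels.

    x: (N, C, H, W); gamma, beta: (1, C, 1, 1); C must be divisible by groups.
    """
    n, c, h, w = x.shape
    xg = x.reshape(n, groups, c // groups, h, w)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)   # one mean per (example, group)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    return gamma * xg.reshape(n, c, h, w) + beta
```

Like layer norm, this uses no cross-example statistics, so it behaves identically at training and inference and is insensitive to batch size.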
Why weight initialization matters
Xavier and He initialization: the math
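The variance formulas behind these schemes translate directly into code: Xavier targets Var(W) = 2/(fan_in + fan_out) to keep signal variance stable through tanh-like units, while He targets Var(W) = 2/fan_in to compensate for ReLU zeroing half its inputs. A minimal NumPy sketch of the Gaussian variants (function names are illustrative):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot init: Var(W) = 2 / (fan_in + fan_out),
    suited to tanh/sigmoid or linear units."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    """He/Kaiming init: Var(W) = 2 / fan_in,
    the extra factor of 2 compensating for ReLU's zeroed half."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

Both schemes also have uniform variants with the same target variance; only the sampling distribution changes, not the math.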