# Reading list

## Core papers

- Scalable optimization in the modular norm
- Modular duality in deep learning

## Optimization

- Preconditioned spectral descent for deep learning
- The duality structure gradient descent algorithm: analysis and applications to neural networks
- On the distance between two neural networks and the stability of learning
- Automatic gradient descent: Deep learning without hyperparameters
- A spectral condition for feature learning
- Universal majorization-minimization algorithms
- An isometric stochastic optimizer
- Old optimizer, new norm: An anthology
- Muon: An optimizer for hidden layers in neural networks

## Generalization

- Spectrally-normalized margin bounds for neural networks
- A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks
- Investigating generalization by controlling normalized margin

## New developments

- Preconditioning and normalization in optimizing deep neural networks
- Improving SOAP using iterative whitening and Muon
- On the concurrence of layer-wise preconditioning methods and provable feature learning
- A note on the convergence of Muon and further
- Training deep learning models with norm-constrained LMOs
- Muon is scalable for LLM training