# Reading list

## Core papers

- Scalable optimization in the modular norm
- Modular duality in deep learning

## Optimization

- Preconditioned spectral descent for deep learning
- The duality structure gradient descent algorithm: analysis and applications to neural networks
- On the distance between two neural networks and the stability of learning
- Automatic gradient descent: Deep learning without hyperparameters
- A spectral condition for feature learning
- Universal majorization-minimization algorithms
- An isometric stochastic optimizer
- Old optimizer, new norm: An anthology
- Muon: An optimizer for hidden layers in neural networks

## Generalization

- Spectrally-normalized margin bounds for neural networks
- A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks
- Investigating generalization by controlling normalized margin

## New developments

- Preconditioning and normalization in optimizing deep neural networks
- Improving SOAP using iterative whitening and Muon
- On the concurrence of layer-wise preconditioning methods and provable feature learning
- A note on the convergence of Muon and further
- Training deep learning models with norm-constrained LMOs
- Muon is scalable for LLM training