Reading list

There are two main academic papers for understanding Modula. The first is called “Scalable optimization in the modular norm”. In this paper, we construct a recursive procedure for assigning a norm to the weight space of a general neural architecture. Measured in this norm, neural networks are automatically Lipschitz, and (when possible) Lipschitz smooth, as functions of their weights. The construction also provides a means of tracking input-output Lipschitz properties. The paper is available here:

Scalable optimization in the modular norm
Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola & Jeremy Bernstein
NeurIPS 2024
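To make the recursive construction concrete, here is a minimal sketch in Python. It is not the Modula API: the class names are made up, atomic linear layers are measured in the spectral norm, and child norms are combined with a plain max, whereas the paper's actual rule weights each child by per-module coefficients. The point is only the shape of the recursion: atoms define their own norm, and composites reduce over their children.

```python
import numpy as np


class LinearModule:
    """Atomic module: a single weight matrix, measured in the spectral norm."""

    n_tensors = 1  # number of weight tensors this module owns

    def __init__(self, fan_out, fan_in):
        self.shape = (fan_out, fan_in)

    def init_weights(self, rng):
        return [rng.standard_normal(self.shape) / np.sqrt(self.shape[1])]

    def norm(self, weights):
        (w,) = weights
        return np.linalg.norm(w, ord=2)  # largest singular value


class CompositeModule:
    """Composite module: recursively combines the norms of its children."""

    def __init__(self, *children):
        self.children = children
        self.n_tensors = sum(child.n_tensors for child in children)

    def init_weights(self, rng):
        return [w for child in self.children for w in child.init_weights(rng)]

    def norm(self, weights):
        child_norms, i = [], 0
        for child in self.children:
            child_norms.append(child.norm(weights[i:i + child.n_tensors]))
            i += child.n_tensors
        return max(child_norms)  # simplified combination rule


rng = np.random.default_rng(0)
net = CompositeModule(LinearModule(64, 784), LinearModule(10, 64))
weights = net.init_weights(rng)
print("norm of the weight list:", net.norm(weights))
```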

The second paper builds on the first and is called “Modular duality in deep learning”. In this paper, we take the modular norm and use it to derive optimizers via a procedure called “modular dualization”. Modular dualization chooses a weight update \(\Delta w\) to minimize the linearization of the loss \(\mathcal{L}(w)\) subject to a constraint on the modular norm \(\|\Delta w\|_{M}\) of the weight update. In symbols, we solve:

\[\Delta w \;=\; \operatorname*{arg\,min}_{\|\Delta w\|_{M} \,\leq\, \eta} \;\langle \Delta w, \nabla \mathcal{L}(w) \rangle,\]

where \(\eta\) sets the learning rate. Because the modular norm is itself built recursively from the architecture, this dualization problem can also be solved recursively, module by module. The result is a family of modular optimization algorithms in which different layer types receive different update rules, determined by the norm assigned to each layer; a single-layer example is sketched after the reference below. The paper is available here:

Modular duality in deep learning
Jeremy Bernstein & Laker Newhouse
arXiv 2024
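To see what this minimization produces in the simplest case, here is a hedged sketch for one linear layer whose assigned norm is the spectral norm. Since the nuclear norm is dual to the spectral norm, the solution has a closed form: if the gradient has reduced SVD \(U \Sigma V^\top\), the optimal update is \(\Delta w = -\eta\, U V^\top\). The function below is illustrative rather than the library's implementation, and it uses an exact SVD where a practical optimizer would use a cheaper approximation to the same map.

```python
import numpy as np


def dualize_spectral(grad, eta):
    """Update minimizing <dW, grad> subject to the spectral-norm bound ||dW||_2 <= eta."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -eta * (u @ vt)


rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 784))   # stand-in for a layer's gradient
eta = 0.1

delta_w = dualize_spectral(grad, eta)

# The update saturates the constraint: its spectral norm equals eta.
print(np.linalg.norm(delta_w, ord=2))

# And it attains the dual-norm bound <dW, grad> = -eta * (nuclear norm of grad).
print(np.sum(delta_w * grad))
print(-eta * np.linalg.svd(grad, compute_uv=False).sum())
```

Roughly speaking, modular dualization applies a map like this to each layer, with the per-layer step size set by the coefficients appearing in the modular norm.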

There are many other papers, by me and by other authors, that I feel contain important ideas on this topic. Here are some of them:

Optimization

Generalization