Frequently asked questions¶
Feel free to reach out or start a GitHub issue if you have any questions about Modula. We’ll post answers to any useful or common questions on this page.
Conceptual questions¶
The gradient is a vector: how can a vector have a spectral norm?
An important mental jump in Modula is to think of the weights of our neural network as a list of tensors \((\mathbf{W}_1, \dots \mathbf{W}_L)\) where \(\mathbf{W}_k\) is the weight tensor of layer \(k\). It then makes sense to think of the gradient of the loss \(\mathcal{L}\) with respect to the \(k\text{th}\) weight tensor \(\mathbf{W}_k\) as itself being a tensor \(\nabla_{\mathbf{W}_k}\mathcal{L}\) with the same shape as \(\mathbf{W}_k\). We can then meaningfully ask: what is the operator norm of this gradient tensor?
This contrasts with a common approach to optimization theory where the whole weight space is “flattened” into one big weight vector \(\mathbf{w}\) with a corresponding gradient vector \(\nabla_\mathbf{w} \mathcal{L}\), thus “losing” the operator structure.
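To make this concrete, here is a minimal PyTorch sketch (not the Modula API; the two-layer network, loss and dimensions are just for illustration). The gradient with respect to each weight matrix is a matrix of the same shape, so we can ask for its spectral norm directly.

```python
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 32, 64, 16
W1 = torch.randn(d_hidden, d_in, requires_grad=True)   # layer 1 weight tensor
W2 = torch.randn(d_out, d_hidden, requires_grad=True)  # layer 2 weight tensor

x = torch.randn(d_in)
loss = (W2 @ torch.relu(W1 @ x)).square().sum()        # a toy loss
loss.backward()

# Each gradient is a tensor with the same shape as its weight tensor,
# so its operator (spectral) norm is well defined.
for name, W in [("W1", W1), ("W2", W2)]:
    print(name, tuple(W.grad.shape), "spectral norm of gradient:",
          torch.linalg.matrix_norm(W.grad, ord=2).item())
```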
Why does Adam beat SGD on transformers, and how does normalization fix SGD?
While some researchers have challenged the use of Adam in deep learning, Adam is certainly the optimizer of choice for training large language models, performing much better than SGD in practice. Still, it is not widely known why Adam is better than SGD. Here we aim to provide a mechanistic explanation of one of the main reasons. The basic idea is that there is no reason the raw gradients should have good relative sizes across layers. And a major thing that Adam does is to “rebalance” the update sizes across layers.
Let’s give a concrete example to see what we mean. Consider a machine learning model with a list of weight tensors \(\mathbf{w} = (\mathbf{W}_1, \dots \mathbf{W}_L)\) and a loss function \(\mathcal{L}\). Then a vanilla gradient update is given by \((\mathbf{W}_1, \dots \mathbf{W}_L) - \eta \times (\nabla_{\mathbf{W}_1}\mathcal{L}, \dots \nabla_{\mathbf{W}_L}\mathcal{L})\) where \(\eta\) is the global learning rate. Now, suppose that our neural network is a toy residual network with \(L\) layers:
\[f(\mathbf{x}) = \mathbf{W}_L \left(\mathbf{I} + \tfrac{1}{L}\mathbf{W}_{L-1}\right) \cdots \left(\mathbf{I} + \tfrac{1}{L}\mathbf{W}_2\right) \mathbf{W}_1 \mathbf{x}.\]
This toy network consists of “read-in” and “read-out” matrices \(\mathbf{W}_1\) and \(\mathbf{W}_L\) along with \(L-2\) “residual” matrices each depressed by a factor of \(1/L\). These depression factors are included to give the model a better large depth limit—in Modula we advocate for \(1/L\) depression factors, while the inclusion of \(1/\sqrt{L}\) depression factors is standard in large language models. We do not include a nonlinearity in this toy model for simplicity.
The point is that the depression factors—be they \(1/L\) or \(1/\sqrt{L}\)—also depress the gradients to the residual blocks by the same factor. So if one takes the depth \(L\) large and uses vanilla gradient descent or SGD to train a transformer, one is essentially applying the update:
\[(\mathbf{W}_1, \dots \mathbf{W}_L) - \eta \times \left(\nabla_{\mathbf{W}_1}\mathcal{L},\ \tfrac{1}{L}\,\mathbf{G}_2,\ \dots,\ \tfrac{1}{L}\,\mathbf{G}_{L-1},\ \nabla_{\mathbf{W}_L}\mathcal{L}\right),\]
where the \(\mathbf{G}_k\) are gradient contributions to the residual blocks whose size does not shrink with the depth \(L\).
In words: the inclusion of the depression factors kills the size of the updates to the residual blocks in comparison to the read-in and read-out layers in deep networks. If you use SGD to train such a model, then depending on how you set the learning rate \(\eta\), you are stuck between severely under-training the middle layers and severely over-training the input and output layers. Adam largely fixes this issue by normalizing each update tensor individually and thus removing the effect of the depression factors. So, Adam is a form of gradient normalization! Modular normalization also automatically fixes this issue by rebalancing the size of the updates for any base optimizer.
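To see the imbalance numerically, here is a minimal PyTorch sketch of the toy residual network above (not the Modula API; the width, depth and quadratic loss are illustrative choices). The gradients to the residual blocks come out roughly a factor of \(1/L\) smaller than those to the read-in and read-out layers.

```python
import torch

torch.manual_seed(0)
d, L = 64, 32                                    # width and depth (illustrative)
Ws = [(torch.randn(d, d) / d ** 0.5).requires_grad_() for _ in range(L)]

x = torch.randn(d)
h = Ws[0] @ x                                    # read-in layer
for W in Ws[1:-1]:
    h = h + (W @ h) / L                          # residual block, depressed by 1/L
y = Ws[-1] @ h                                   # read-out layer

loss = 0.5 * y.square().sum()
loss.backward()

spec = lambda G: torch.linalg.matrix_norm(G, ord=2).item()
print("read-in  gradient norm:", spec(Ws[0].grad))
print("residual gradient norm:", spec(Ws[L // 2].grad))   # roughly 1/L smaller
print("read-out gradient norm:", spec(Ws[-1].grad))
```

Dividing each gradient tensor by its own norm before applying the learning rate removes this imbalance, which is the rebalancing effect attributed to Adam and to modular normalization above.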
Why does modular normalization lead to learning rate transfer across scale?
By the definition of a “well-normed module” \(\mathsf{M}\), when weight updates \(\Delta \mathbf{w}\) are normalized in the modular norm \(\|\cdot\|_\mathsf{M}\), the corresponding updates \(\Delta \mathbf{y}\) to the module output are well-behaved in the output norm \(\|\cdot\|_\mathcal{Y}\). We set up our actual architectures, including complicated models like GPT, to be well-normed independent of the scale of the architecture. A little more formally:
well-normed modules are one-Lipschitz in the modular norm, meaning \(\|\Delta \mathbf{y}\|_\mathcal{Y} \leq \|\Delta \mathbf{w}\|_\mathsf{M}\);
this inequality holds tightly when tensors in the network “align” during training, meaning that we may approximate \(\|\Delta \mathbf{y}\|_\mathcal{Y} \approx \|\Delta \mathbf{w}\|_\mathsf{M}\) in a fully aligned network;
therefore normalizing updates in the modular norm provides control on the change in outputs;
these statements are all independent of the size of the architecture.
Since modular normalization works by recursively normalizing the weight updates to each submodule, these desirable properties extend to all submodules as well as the overall compound.
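As a rough illustration of why this gives learning rate transfer, here is a hedged PyTorch sketch (not the Modula API) for the simplest possible module, a single linear layer \(\mathbf{y} = \mathbf{W}\mathbf{x}\). Normalizing the update to unit spectral norm (the factor \(\sqrt{\text{fan-out}/\text{fan-in}}\) equals one for a square layer) keeps the RMS change in the output of order one at every width, whereas a raw update does not.

```python
import torch

torch.manual_seed(0)
rms = lambda v: (v.norm() / v.numel() ** 0.5).item()   # RMS norm: ||v||_2 / sqrt(dim)

for d in [128, 512, 2048]:                             # widths (illustrative values)
    x = torch.randn(d)                                 # input with RMS norm ~ 1
    G = torch.randn(d, d)                              # a raw gradient-like update
    dW = G / torch.linalg.matrix_norm(G, ord=2)        # normalized to unit spectral norm
    print(f"width {d:5d}   raw update effect: {rms(G @ x):8.2f}   "
          f"normalized update effect: {rms(dW @ x):.2f}")
# The raw update's effect on the output grows with width, so the optimal learning
# rate drifts with scale; the normalized update's effect stays O(1), so it transfers.
```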
What do we mean by “tensor alignment” in Modula?
In the guts of a neural network there can be found lots and lots of tensors. And sometimes these tensors like to multiply each other. For example, there are:
vector-vector products \(\mathbf{u}^\top\mathbf{v}\)
matrix-vector products \(\mathbf{A}\mathbf{v}\)
matrix-matrix products \(\mathbf{A}\mathbf{B}\)
and so on…
An important question is “how big are such tensor products inside a neural network?” In other words, if we know the size of the inputs to the product, can we predict the size of the product itself?
Let’s start with the simplest example of the vector-vector product, otherwise known as a friendly “dot product”. Suppose we have two \(n\)-dimensional vectors \(\mathbf{u}\) and \(\mathbf{v}\) of known sizes \(\|\mathbf{u}\|_2\) and \(\|\mathbf{v}\|_2\). Here the symbol \(\|\mathbf{\cdot}\|_2\) denotes the “Euclidean length” or “\(\ell_2\) norm” of the vectors. How large can the dot product be? Well, by the Cauchy-Schwarz inequality, we have that:
\[|\mathbf{u}^\top \mathbf{v}| \leq \|\mathbf{u}\|_2 \times \|\mathbf{v}\|_2.\]
In words: the size of the dot product is limited by the size of its two inputs. What’s more, the Cauchy-Schwarz inequality is “tight”, meaning that \(|\mathbf{u}^\top \mathbf{v}| = \|\mathbf{u}\|_2 \times \|\mathbf{v}\|_2\), when the two vectors \(\mathbf{u}\) and \(\mathbf{v}\) point in the same (or opposite) directions—when the two vectors “align”.
This idea of having an inequality that limits the size of a tensor product, which is tight under certain configurations of the input tensors, generalizes to higher-order forms of tensor product. For example, for the matrix-vector product \(\mathbf{A}\mathbf{v}\) the relevant inequality is given by:
\[\|\mathbf{A}\mathbf{v}\|_2 \leq \|\mathbf{A}\|_* \times \|\mathbf{v}\|_2,\]
where \(\|\cdot\|_*\) is the matrix spectral norm. This inequality is tight when the vector \(\mathbf{v}\) lies in the top singular subspace of the matrix \(\mathbf{A}\)—when the matrix and vector “align”.
And for matrix-matrix products, we have the “sub-multiplicativity of the spectral norm”:
\[\|\mathbf{A}\mathbf{B}\|_* \leq \|\mathbf{A}\|_* \times \|\mathbf{B}\|_*.\]
We will say that this inequality is tight when the two matrices “align”—you get the idea!
Why does any of this matter? Well for a neural network at initialization, some of these inequalities may be quite slack because tensors in the network are randomly oriented with respect to each other. But it is a central tenet of the Modula framework that after training has sufficiently “warmed up”, the network will fall into a fully aligned state where all inequalities of the type mentioned in this section hold reasonably tightly, and may therefore be used to predict the size and scaling of various quantities in the network.
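Here is a quick PyTorch check of this idea (the dimensions are arbitrary and this is not the Modula API): for random inputs the inequalities above are slack, while choosing the vector to lie in the top singular subspace of the matrix makes the matrix-vector bound tight.

```python
import torch

torch.manual_seed(0)
n = 1024

# Dot product: two random vectors are nearly orthogonal, so Cauchy-Schwarz is very slack.
u, v = torch.randn(n), torch.randn(n)
print("dot product:", u.dot(v).abs().item(), "vs bound", (u.norm() * v.norm()).item())

# Matrix-vector product: aligning v with the top singular subspace saturates the bound.
A = torch.randn(n, n)
sigma_max = torch.linalg.matrix_norm(A, ord=2)
v_rand = torch.randn(n)
v_top = torch.linalg.svd(A).Vh[0]          # top right singular vector of A
print("random v :", (A @ v_rand).norm().item(), "vs bound", (sigma_max * v_rand.norm()).item())
print("aligned v:", (A @ v_top).norm().item(),  "vs bound", (sigma_max * v_top.norm()).item())
```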
Other notions of alignment
We have outlined a notion of alignment which captures whether or not a certain inequality governing a tensor product is tight. This is different to the notion of alignment measured in Scaling Exponents Across Parameterizations and Optimizers which turns out to be coupled to the matrix stable rank. Essentially, the findings on alignment in that paper don’t have an obvious bearing on the notion of alignment used in Modula. Large-scale empirical tests of alignment as we have described it are certainly a valuable direction for future work.
Is there a unique and optimal way to parameterize an architecture?
The short answer is no: if you’re careful, there is some freedom in how you can parameterize your architecture. With that said, there are some constraints that you can’t really avoid if you want things to work well. And there are some “natural choices” which I think we may as well agree on, at least to ease communication between researchers.
A LoRA layer provides a really good setting to think about these points. Given an \(n \times r\) matrix \(B\) and an \(r \times n\) matrix \(A\), a LoRA layer is just the matrix product \(B A\). Now if you’re a spectral-μP aficionado, you’d know that the “right way” to scale these matrices is so that their initialization and updates have spectral norm proportional to \(\sqrt{\text{fan-out/fan-in}}\). Written out in full:
matrix \(B\) and update \(\Delta B\) have spectral norms \(\|B\|_*, \|\Delta B\|_* \propto \sqrt{n / r}\),
matrix \(A\) and update \(\Delta A\) have spectral norms \(\|A\|_*, \|\Delta A\|_* \propto \sqrt{r / n}\).
However, these conditions are more restrictive than necessary. Because matrices are homogeneous linear maps, in the product \(BA\) we are free to scale up the matrix \(B\) by any factor so long as we divide the matrix \(A\) by the same factor. Nothing changes if we do this. In particular, if we scale \(B\) by a factor of \(\sqrt{r/n}\) and divide \(A\) by the same factor, we obtain new conditions:
matrix \(B\) and update \(\Delta B\) have spectral norms \(\|B\|_*, \|\Delta B\|_* \propto 1\),
matrix \(A\) and update \(\Delta A\) have spectral norms \(\|A\|_*, \|\Delta A\|_* \propto 1\).
Using these new spectral scaling conditions yields exactly the same training dynamics, as the quick check below illustrates.
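This PyTorch check of the scale symmetry is not taken from the spectral-μP paper or from Modula; the dimensions and initialization scales are illustrative. Rescaling \(B\) by \(\sqrt{r/n}\) and \(A\) by \(\sqrt{n/r}\) changes their individual spectral norms but leaves the product \(BA\), and hence the function computed by the layer, unchanged.

```python
import torch

torch.manual_seed(0)
n, r = 1024, 16
c = (r / n) ** 0.5                          # the rescaling factor

B = torch.randn(n, r) / r ** 0.5            # an illustrative init for B
A = torch.randn(r, n) / n ** 0.5            # an illustrative init for A
B2, A2 = c * B, A / c                       # rescaled pair

spec = lambda M: torch.linalg.matrix_norm(M, ord=2).item()
print("spectral norms before:", spec(B), spec(A))
print("spectral norms after :", spec(B2), spec(A2))                    # norms change...
print("max |BA - B2 A2|     :", (B @ A - B2 @ A2).abs().max().item())  # ...but the product
                                                                       # is unchanged up to
                                                                       # float rounding
```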
Matters of precision
When it comes to representing the weight entries in floating point, however, a difference may emerge between these two schemes. In particular, one scheme may lead to weight entries that are more easily representable in a low-precision floating point number format. Charlie Blake et al. consider exploiting this type of “scale symmetry” in u-μP: The Unit-Scaled Maximal Update Parametrization.
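For a rough sense of why the scale of the weight entries matters at low precision, here is a generic PyTorch illustration. It is not the u-μP method itself, and it uses float16 because that format's narrow exponent range makes the effect easy to see: entries of order one survive a round trip with small relative error, while very small entries lose precision or underflow to zero.

```python
import torch

torch.manual_seed(0)
w = torch.randn(1000)                       # reference weights in float32

for scale in [1e-20, 1e-6, 1.0, 1e3]:       # illustrative weight scales
    original = scale * w
    roundtrip = original.to(torch.float16).to(torch.float32)
    rel_err = ((roundtrip - original).norm() / original.norm()).item()
    print(f"scale {scale:8.0e}   relative round-trip error: {rel_err:.1e}")
# At scale 1e-20 everything underflows to zero; at 1e-6 the entries fall into the
# subnormal range and lose precision; at order one (and at 1e3) the error is small.
```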
In summary, I hope that this section demonstrates that:
the conditions in the spectral-μP paper provide a sensible default way of scaling matrices which should work well in generic situations;
however, the conditions are not unique, and in specific cases you can modify the rules—so long as you know what you’re doing;
you may want to take advantage of scale symmetries if you are interested in designing low-precision training algorithms.
Modula package¶
The modular norm involves a max—why do I not see any maxes in the package?
Computing the modular norm involves evaluating lots of expressions of the form:
\[\|(\mathbf{w}_1, \mathbf{w}_2)\|_\mathsf{M} = \max\left(a \cdot \|\mathbf{w}_1\|_{\mathsf{M}_1},\ b \cdot \|\mathbf{w}_2\|_{\mathsf{M}_2}\right),\]
where \(a\) and \(b\) are scalar coefficients and \(\mathbf{w}_1\) and \(\mathbf{w}_2\) are the weights of two submodules. So you might be surprised not to see lots of maxes in the package. This is because to normalize a vector \((\mathbf{w}_1, \mathbf{w}_2)\) we do not just compute \((\mathbf{w}_1, \mathbf{w}_2) / \|(\mathbf{w}_1, \mathbf{w}_2)\|_\mathsf{M}\). Instead, we separately normalize both sub-vectors in order to “saturate” the max. That is, we send:
\[(\mathbf{w}_1, \mathbf{w}_2) \mapsto \left(\frac{\mathbf{w}_1}{a \cdot \|\mathbf{w}_1\|_{\mathsf{M}_1}},\ \frac{\mathbf{w}_2}{b \cdot \|\mathbf{w}_2\|_{\mathsf{M}_2}}\right),\]
so that each argument of the max equals one.
In other words, we maximize the size of each subvector under the constraint that the full vector has unit modular norm.
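Here is a toy PyTorch sketch of the difference (not the Modula API; the scalar coefficients, Euclidean sub-norms and dimensions are all just for illustration). Dividing the whole pair by its max-norm leaves the smaller sub-vector tiny, whereas saturating the max pushes every sub-vector to its limit while the pair still has unit norm.

```python
import torch

torch.manual_seed(0)
a, b = 1.0, 1.0                                     # illustrative scalar coefficients
w1, w2 = torch.randn(10), 0.01 * torch.randn(10)    # w2 is much smaller than w1

pair_norm = lambda u, v: max(a * u.norm(), b * v.norm())   # toy max-style norm

# Option 1: divide the whole pair by its norm; the smaller sub-vector stays tiny.
m = pair_norm(w1, w2)
option1 = (w1 / m, w2 / m)

# Option 2: saturate the max; each sub-vector is separately normalized.
option2 = (w1 / (a * w1.norm()), w2 / (b * w2.norm()))

for name, (u, v) in [("divide by norm", option1), ("saturate max  ", option2)]:
    print(name, "| pair norm:", pair_norm(u, v).item(),
          "| sub-norms:", round(u.norm().item(), 3), round(v.norm().item(), 3))
```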
Is it necessary to use orthogonal initialization in Modula?
No. You could re-write the atomic modules to use Gaussian initialization if you wanted. The reason we choose to use orthogonal initialization is that it makes it much easier to get scaling right. This is because the spectral norm of any \(m \times n\) random orthogonal matrix is always one. In contrast, the spectral norm of an \(m \times n\) random Gaussian matrix depends on the dimensions \(m\) and \(n\) and also the entry-wise variance \(\sigma^2\), making it more difficult to properly set the initialization scale. In addition, orthogonal matrices have the benign property that all singular values are one. In Gaussian matrices, on the other hand, the average singular value and the max singular value are different, meaning that Gaussian matrices have more subtle numerical properties.
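A quick PyTorch check of this claim (not the Modula API; the shapes are arbitrary): the spectral norm of a random semi-orthogonal matrix is exactly one at every shape, while the spectral norm of a Gaussian matrix drifts with the dimensions for a fixed entry-wise variance.

```python
import torch

torch.manual_seed(0)
spec = lambda M: torch.linalg.matrix_norm(M, ord=2).item()

for m, n in [(256, 256), (1024, 256), (4096, 256)]:
    W_orth = torch.empty(m, n)
    torch.nn.init.orthogonal_(W_orth)            # random (semi-)orthogonal matrix
    W_gauss = torch.randn(m, n) / n ** 0.5       # Gaussian with entry-wise variance 1/n
    print(f"{m:4d} x {n}   orthogonal: {spec(W_orth):.3f}   gaussian: {spec(W_gauss):.3f}")
# The orthogonal spectral norms are all exactly 1; the Gaussian ones depend on the shape.
```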
Does Modula support weight sharing?
Not at present, although it would be possible to support this.
Research philosophy¶
Do I need to be a mathematical savant to contribute to research of this kind?
I don’t think so. There are a lot of very technical people working in this field bringing with them some quite advanced tools from math and theoretical physics, and this is great. But in my experience it’s usually the simpler and more elementary ideas that actually work in practice. I strongly believe that deep learning theory is still at the stage of model building. And I resonate with both Rahimi and Recht’s call for simple theorems and simple experiments and George Dahl’s call for a healthy dose of skepticism when evaluating claims in the literature.