Orthogonal manifold

On this page, we will work out an algorithm for performing gradient descent on the manifold of orthogonal matrices while taking steps that are steepest under the spectral norm. The algorithm will solve for the matrix of unit spectral norm that maximizes the linearized improvement in loss while lying tangent to the manifold. The “retraction map”, which sends the update from the tangent space back to the manifold, can often be performed by a simple scalar multiplication.

Steepest descent on the orthogonal manifold

Consider a square weight matrix \(W\in\mathbb{R}^{n \times n}\) that is orthogonal, meaning that \(W^\top W = I_n\). Suppose that the “gradient matrix” \(G\in\mathbb{R}^{n\times n}\) is the derivative of some loss function evaluated at \(W\). Given step size \(\eta > 0\), we claim that the following weight update is steepest under the spectral norm while staying on the orthogonal manifold. First, we compute the matrix \(X = [W^\top G - G^\top W]^\sharp\), and then we make the update:

\[\begin{split}W \mapsto \begin{cases}\frac{W (I_n - \eta X)}{\sqrt{1+\eta^2}} &\text{if $W^\top G - G^\top W$ is full-rank;}\\ [W (I_n - \eta X)]^\sharp &\text{otherwise.}\end{cases}\end{split}\]

In these expressions, \(M^\sharp\) denotes the sharp-operator applied to the matrix \(M\), which returns the matrix with the same singular vectors as \(M\) but with every positive singular value set to one. The sharp-operator can be computed by running a “Newton-Schulz” iteration, such as the following cubic iteration:

\[M_0 = \frac{M}{\|M\|_F}; \qquad M_{t+1}=\frac{3}{2}M_t-\frac{1}{2}M_tM_t^\top M_t; \qquad t\to\infty,\]

which goes back to Kovarik (1970) and Björck & Bowie (1971).
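As a concrete illustration, here is a minimal NumPy sketch of this cubic iteration. The function name sharp and the fixed iteration count are my own choices, not prescribed above; in practice one might tune the number of steps or use a different odd polynomial.

```python
import numpy as np

def sharp(M: np.ndarray, steps: int = 40) -> np.ndarray:
    """Approximate the sharp-operator M^# with the cubic Newton-Schulz iteration.

    Dividing by the Frobenius norm puts every singular value in (0, 1], so each
    positive singular value is pushed towards one as the iteration proceeds.
    """
    X = M / np.linalg.norm(M)            # np.linalg.norm defaults to Frobenius for matrices
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # M_{t+1} = 3/2 M_t - 1/2 M_t M_t^T M_t
    return X

# Quick check: the sharp of a (generically full-rank) random matrix is orthogonal.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = sharp(M)
print(np.allclose(Q.T @ Q, np.eye(4), atol=1e-6))
```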

Non-Riemannian manifold methods

One reason this algorithm is interesting is that it is an example of a manifold optimization algorithm that is non-Riemannian. A Riemannian manifold is a manifold equipped with a structure called a Riemannian metric: an inner product defined at each point on the manifold. The inner product provides a way to measure distances and to construct geometry-aware optimization algorithms. There has been a lot of research into Riemannian optimization methods, including many in a machine learning context.

However, there has seemingly been much less research into optimization algorithms on manifolds equipped with non-Riemannian structures. For instance, a matrix manifold equipped with the spectral norm at every point is non-Riemannian since the spectral norm does not emerge from an inner product. But we believe these kinds of non-Riemannian geometries are very important in deep learning.

The structure of the tangent space

We would like to make a weight update so that the updated weights stay on the orthogonal manifold. First we need to figure out the structure of the “tangent space” at a point on the manifold. Roughly speaking, the tangent space is the set of possible velocities a particle could have as it passes through that particular point. So we need to consider all curves passing through the point on the manifold.

If we consider a curve \(W(t)\) on the manifold parameterized by time \(t \in \mathbb{R}\), then this curve must satisfy \(W(t)^\top W(t) = I_n\). Differentiating with respect to \(t\), we find that the velocity must satisfy:

\[\frac{\partial W(t)}{\partial t}^\top W(t) + W(t)^\top \frac{\partial W(t)}{\partial t} = 0.\]

So to be in the tangent space of a point \(W\) on the manifold, a matrix \(A\) must satisfy \(A^\top W + W^\top A = 0\). Conversely, if a matrix \(A\) satisfies \(A^\top W + W^\top A=0\), then it is the velocity of a curve on the manifold that passes through \(W\), as evidenced by the curve \(W(t) = W \exp(tW^\top A)\). Therefore, the tangent space at \(W\) is completely characterized by the set:

\[\{A\in \mathbb{R}^{n\times n}:A^\top W + W^\top A = 0\}.\]

Finally, if we use the orthogonal matrix \(W\) to make the change of variables \(A = W X\), then we see that \(A\) belongs to the tangent space at \(W\) if and only if \(X\) is skew-symmetric: \(X^\top + X = 0\). So the tangent space to the orthogonal manifold can be parameterized by skew-symmetric matrices.
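To make this concrete, here is a small NumPy check (using scipy.linalg.expm for the matrix exponential) that a matrix of the form \(A = WX\) with skew-symmetric \(X\) satisfies the tangent-space condition, and that the curve \(W \exp(t W^\top A)\) indeed stays on the manifold. This is just a numerical illustration of the statements above.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 5

# Random orthogonal W (via QR) and random skew-symmetric X.
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = rng.standard_normal((n, n))
X = B - B.T                      # skew-symmetric: X^T + X = 0

A = W @ X                        # candidate tangent vector at W

# Tangent-space condition: A^T W + W^T A = 0.
print(np.allclose(A.T @ W + W.T @ A, 0))

# The curve W(t) = W exp(t W^T A) stays orthogonal (its velocity at t = 0 is A).
t = 0.3
Wt = W @ expm(t * (W.T @ A))
print(np.allclose(Wt.T @ Wt, np.eye(n)))
```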

Steepest direction in the tangent space

We will solve for the matrix \(A\) that belongs to the tangent space to the orthogonal manifold at matrix \(W\) and maximizes the linearized improvement in loss \(\operatorname{trace}(G^\top A)\) under the constraint that \(A\) has unit spectral norm. Formally, we wish to solve:

\[\operatorname{arg max}_{A\in \mathbb{R}^{n\times n}: \|A\|_*\leq 1 \text{ and } A^\top W + W^\top A = 0}\; \operatorname{trace}(G^\top A).\]

To simplify, we make the change of variables \(A = W X\) so that we now only need to maximize over skew-symmetric matrices \(X\) of unit spectral norm:

\[\operatorname{arg max}_{X\in \mathbb{R}^{n\times n}:\|X\|_*\leq 1 \text{ and } X^\top + X= 0}\; \operatorname{trace}([W^\top G]^\top X).\]

Next, we decompose \(W^\top G = \frac{1}{2}[W^\top G + G^\top W] + \frac{1}{2}[W^\top G - G^\top W]\) into its symmetric and skew-symmetric components and realize that, because \(X\) is skew-symmetric, the contribution to the trace from the symmetric part of \(W^\top G\) vanishes. So the problem becomes:

\[\operatorname{arg max}_{X\in \mathbb{R}^{n\times n}:\|X\|_*\leq 1 \text{ and } X^\top + X= 0}\; \operatorname{trace}\left(\left[\frac{W^\top G - G^\top W}{2}\right]^\top X\right).\]
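Here we have used that the trace pairing between any symmetric matrix \(S\) and any skew-symmetric matrix \(X\) vanishes:

\[\operatorname{trace}(S^\top X) = \operatorname{trace}\big((S^\top X)^\top\big) = \operatorname{trace}(X^\top S) = -\operatorname{trace}(X S) = -\operatorname{trace}(S X) = -\operatorname{trace}(S^\top X),\]

which forces \(\operatorname{trace}(S^\top X) = 0\), so only the skew-symmetric part of \(W^\top G\) contributes to the objective.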

If we simply ignore the skew-symmetric constraint, the solution for \(X\) is given by \(X = [W^\top G - G^\top W]^\sharp\). But this solution for \(X\) actually satisfies the skew-symmetric constraint! This is because the sharp-operator preserves skew-symmetry. An easy way to see this is that \([W^\top G - G^\top W]^\sharp\) can be computed by running an odd polynomial iteration (“Newton-Schulz”) on \(W^\top G - G^\top W\), and odd polynomials preserve skew-symmetry. [1]

Undoing the change of variables, our tangent vector is given by \(A = W X = W [W^\top G - G^\top W]^\sharp\).
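As a numerical sanity check, here is a short NumPy sketch of my own that computes the sharp-operator directly from an SVD (rather than by Newton-Schulz) and verifies that \(X\) is skew-symmetric, that \(A = WX\) lies in the tangent space, and that \(A\) scores at least as well as a handful of random tangent directions of unit spectral norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6   # even, so the skew part is generically full rank

W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a point on the orthogonal manifold
G = rng.standard_normal((n, n))                    # gradient evaluated at W

# Sharp-operator via SVD: keep the singular vectors, set the singular values to one
# (this matches the definition above when the argument is full rank).
U, _, Vt = np.linalg.svd(W.T @ G - G.T @ W)
X = U @ Vt
A = W @ X

print(np.allclose(X + X.T, 0, atol=1e-8))            # the sharp preserved skew-symmetry
print(np.allclose(A.T @ W + W.T @ A, 0, atol=1e-8))  # A lies in the tangent space at W

# A should score at least as well as random tangent directions of unit spectral norm.
for _ in range(5):
    S = rng.standard_normal((n, n))
    S = (S - S.T) / np.linalg.norm(S - S.T, 2)        # random skew matrix, spectral norm 1
    assert np.trace(G.T @ A) >= np.trace(G.T @ (W @ S)) - 1e-8
```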

Finding the retraction map

The previous section suggests making the weight update \(W \mapsto W - \eta W X = W (I_n - \eta X)\). This update takes a step in the tangent space, so the new weights will drift slightly off the orthogonal manifold. We can fix this by applying the sharp-operator, i.e. \(W \mapsto [W (I_n - \eta X)]^\sharp\), to project the weights back onto the manifold. But there is actually a shortcut: if \(W^\top G - G^\top W\) is full rank, then \(X\) is an orthogonal matrix and \([W (I_n - \eta X)]^\top [W (I_n - \eta X)] = (1 + \eta^2) I_n\). Therefore, in this case, we can project back onto the manifold simply by dividing through by the scalar \(\sqrt{1+\eta^2}\). It’s worth noting that if \(n\) is odd, then \(W^\top G - G^\top W\), being skew-symmetric and of odd dimension, must have at least one zero singular value, so it cannot be full rank. This issue can be avoided simply by making \(n\) even!
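Putting the pieces together, here is a minimal sketch of one full update step under these assumptions (square orthogonal \(W\) and full-rank \(W^\top G - G^\top W\)). The function name orthogonal_manifold_step is my own, and the sharp-operator is again computed via an SVD for simplicity; the Newton-Schulz sketch above is a drop-in alternative.

```python
import numpy as np

def orthogonal_manifold_step(W: np.ndarray, G: np.ndarray, eta: float) -> np.ndarray:
    """One steepest-descent step on the orthogonal manifold under the spectral norm.

    Assumes W is square orthogonal and W^T G - G^T W is full rank, so that the
    retraction reduces to division by sqrt(1 + eta^2).
    """
    U, _, Vt = np.linalg.svd(W.T @ G - G.T @ W)   # sharp-operator via SVD
    X = U @ Vt                                     # skew-symmetric and orthogonal here
    return W @ (np.eye(W.shape[0]) - eta * X) / np.sqrt(1 + eta ** 2)

# The updated matrix should remain orthogonal up to numerical precision.
rng = np.random.default_rng(2)
n = 4   # even dimension, so the skew part is generically full rank
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = rng.standard_normal((n, n))
W_new = orthogonal_manifold_step(W, G, eta=0.1)
print(np.allclose(W_new.T @ W_new, np.eye(n), atol=1e-10))
```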

Open problem: extending to the Stiefel manifold

I initially thought that this solution easily extended to the Stiefel manifold, i.e. the set of \(m \times n\) semi-orthogonal matrices. But this turns out not to be the case: the algorithm we derived is generally not optimal if \(W\) is rectangular. To see this, let’s consider an \(m \times n\) matrix \(W\) with \(m > n\), and suppose that it belongs to the Stiefel manifold \(W^\top W = I_n\). The problem with our derivation is that the change of variables \(A = W X\) no longer parameterizes the full set of \(m \times n\) matrices. Instead, we need to make the change of variables \(A = WX + \overline{W}Y\), where the columns of \(\overline{W}\) are the “missing” columns of \(W\). In other words, the combined matrix \([W | \overline{W}]\) is a square orthogonal matrix. For this parameterization, the tangent space to the Stiefel manifold is obtained by requiring that \(X\in\mathbb{R}^{n\times n}\) is skew-symmetric while \(Y\in\mathbb{R}^{(m-n)\times n}\) is completely unconstrained. I do not know how to analytically solve the resulting maximization problem in this parameterization.
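To make the parameterization concrete, here is a small NumPy sketch of my own construction: it completes \(W\) to a square orthogonal matrix via a full QR decomposition and checks that \(A = WX + \overline{W}Y\) satisfies the Stiefel tangent-space condition \(A^\top W + W^\top A = 0\) for skew-symmetric \(X\) and arbitrary \(Y\).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 7, 4

# A point on the Stiefel manifold: W^T W = I_n.
W, _ = np.linalg.qr(rng.standard_normal((m, n)))

# Complete W to a square orthogonal matrix [W | W_bar] via a full QR decomposition:
# the last m - n columns of Q span the orthogonal complement of the columns of W.
Q, _ = np.linalg.qr(np.hstack([W, rng.standard_normal((m, m - n))]))
W_bar = Q[:, n:]

# Tangent vector A = W X + W_bar Y with X skew-symmetric and Y unconstrained.
B = rng.standard_normal((n, n))
X = B - B.T
Y = rng.standard_normal((m - n, n))
A = W @ X + W_bar @ Y

print(np.allclose(A.T @ W + W.T @ A, 0))
```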