Hypersphere
On this page, we will work out an algorithm for performing steepest descent under the Euclidean norm on the hypersphere. While this algorithm may seem obvious, it is a good warmup and the technical scaffolding will help in more complicated examples.
Steepest descent on the hypersphere
Consider a weight vector \(w\in\mathbb{R}^{n}\) on the unit hypersphere, meaning that the squared Euclidean norm \(\|w\|_2^2 = \sum_{i=1}^n w_i^2 = 1\). Suppose that the “gradient vector” \(g\in\mathbb{R}^{n}\) is the derivative of some loss function evaluated at \(w\). Given step size \(\eta > 0\), we claim that the following weight update is steepest under the Euclidean norm while staying on the unit hypersphere:

\[
w \;\mapsto\; \frac{1}{\sqrt{1+\eta^2}}\left(w - \eta\cdot\frac{g - (g^\top w)\,w}{\|g - (g^\top w)\,w\|_2}\right).
\]
So we simply project the gradient onto the subspace orthogonal to the weight vector and normalize to obtain a unit vector. We offload the problem of setting the size of the update to the choice of the step size parameter \(\eta\). Dividing through by \(\sqrt{1 + \eta^2}\) projects the updated weights back to the hypersphere.
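As a concrete sketch, the whole update fits in a few lines of NumPy. The function and variable names here are our own, not from any particular library:

```python
import numpy as np

def sphere_step(w, g, eta):
    """One steepest-descent step on the unit hypersphere under the Euclidean norm.

    w   -- weight vector with ||w||_2 = 1
    g   -- gradient of the loss evaluated at w
    eta -- step size
    """
    # Project the gradient onto the subspace orthogonal to w.
    g_tangent = g - (g @ w) * w
    # Normalize to a unit vector; the update size is handled entirely by eta.
    a = g_tangent / np.linalg.norm(g_tangent)
    # Step in the tangent direction, then divide by sqrt(1 + eta^2)
    # to project back onto the unit hypersphere.
    return (w - eta * a) / np.sqrt(1 + eta**2)

# Example: the update keeps the weights on the unit hypersphere.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
w /= np.linalg.norm(w)
g = rng.standard_normal(5)
w_new = sphere_step(w, g, eta=0.1)
print(np.linalg.norm(w_new))  # ≈ 1.0
```

Note that the sketch assumes the projected gradient is nonzero; if \(g\) is exactly parallel to \(w\), there is no tangent direction to normalize.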
The structure of the tangent space
The tangent space to the unit hypersphere at vector \(w\) is simply the set of vectors orthogonal to \(w\):

\[
\{ a \in \mathbb{R}^n : w^\top a = 0 \}.
\]
While it’s probably overkill, let’s show this formally. The tangent space at \(w\) is the set of possible velocities of curves passing through \(w\). For a real-valued parameter \(t\), consider a curve \(w(t)\) on the unit hypersphere. If we differentiate the condition \(w(t)^\top w(t) = 1\), we find that \(\frac{\partial w(t)}{\partial t}^\top w(t) = 0\). This means that a tangent vector at \(w\) must be orthogonal to \(w\). Conversely, if a unit vector \(a\) satisfies \(a^\top w = 0\), then \(a\) is a tangent vector to the manifold at \(w\), as can be seen by studying the curve \(w(t) = w\cdot \cos(t) + a\cdot \sin(t)\) at \(t = 0\); a general vector orthogonal to \(w\) is then a tangent vector by rescaling. So the tangent space really is \(\{ a \in \mathbb{R}^n : w^\top a = 0 \}\).
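The converse direction is easy to confirm numerically. This quick NumPy sanity check (our own, not part of the original argument) verifies that for unit vectors \(w \perp a\), the curve \(w\cos(t) + a\sin(t)\) stays on the unit sphere and has velocity \(a\) at \(t = 0\):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(4)
w /= np.linalg.norm(w)
a = rng.standard_normal(4)
a -= (a @ w) * w          # make a orthogonal to w
a /= np.linalg.norm(a)    # make a a unit vector

def curve(t):
    # Great-circle curve through w with direction a.
    return w * np.cos(t) + a * np.sin(t)

# The curve stays on the unit hypersphere...
print(np.linalg.norm(curve(0.7)))  # ≈ 1.0

# ...and its velocity at t = 0 is a (central finite difference).
eps = 1e-6
velocity = (curve(eps) - curve(-eps)) / (2 * eps)
print(np.linalg.norm(velocity - a))  # ≈ 0.0
```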
Steepest direction in the tangent space
To find the steepest direction in the tangent space under the Euclidean norm, we must solve:

\[
\operatorname*{arg\,max}_{a \in \mathbb{R}^n} \; g^\top a \quad \text{subject to} \quad a^\top a = 1 \;\text{ and }\; w^\top a = 0.
\]
We can solve this problem using the method of Lagrange multipliers. We write down the Lagrangian:

\[
\mathcal{L}(a, \lambda, \mu) = g^\top a - \frac{\lambda}{2}\left(a^\top a - 1\right) - \mu\, w^\top a.
\]
Taking the derivative with respect to \(a\) and setting to zero, we find that \(a = (g - \mu w) / \lambda\). We can solve for \(\lambda\) and \(\mu\) by substituting in the constraints that \(a^\top a = 1\) and \(a^\top w = 0\): the second constraint gives \(\mu = g^\top w\), and the first then gives \(\lambda = \|g - (g^\top w)\,w\|_2\), where we take \(\lambda > 0\) to select the ascent direction. Finally, we obtain:

\[
a = \frac{g - (g^\top w)\, w}{\|g - (g^\top w)\, w\|_2} = \frac{(I - ww^\top)\,g}{\|(I - ww^\top)\,g\|_2}.
\]
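As a sanity check (our own sketch, not part of the derivation): projecting \(g\) orthogonal to \(w\) and normalizing does give a unit vector in the tangent space, and no randomly sampled unit tangent vector achieves a larger inner product with \(g\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
w = rng.standard_normal(n)
w /= np.linalg.norm(w)
g = rng.standard_normal(n)

# Candidate steepest direction: project g orthogonal to w, then normalize.
a = g - (g @ w) * w
a /= np.linalg.norm(a)

assert abs(a @ w) < 1e-12                 # lies in the tangent space
assert abs(np.linalg.norm(a) - 1) < 1e-12 # unit Euclidean norm

# No random unit tangent vector has a larger inner product with g.
for _ in range(1000):
    b = rng.standard_normal(n)
    b -= (b @ w) * w
    b /= np.linalg.norm(b)
    assert g @ b <= g @ a + 1e-12
```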
Finding the retraction map
Making a weight update \(w \mapsto w - \eta\cdot a\) along the steepest direction in the tangent space that we calculated in the previous section will leave the hypersphere. In fact, since \(a\) is a unit vector orthogonal to \(w\), Pythagoras’ theorem gives \(\|w - \eta\cdot a\|_2 = \sqrt{1 + \eta^2}\). So to project the update back to the manifold, we can simply divide through by this scalar:

\[
w \;\mapsto\; \frac{w - \eta\cdot a}{\sqrt{1 + \eta^2}}.
\]
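The Pythagoras step is straightforward to verify numerically. This small check (our own, assuming only that \(a\) is a unit vector orthogonal to \(w\)) confirms the norm of the raw update and that dividing by \(\sqrt{1+\eta^2}\) lands back on the unit hypersphere:

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(5)
w /= np.linalg.norm(w)
a = rng.standard_normal(5)
a -= (a @ w) * w          # a orthogonal to w
a /= np.linalg.norm(a)    # a a unit vector

for eta in (0.01, 0.1, 1.0):
    # Since w ⟂ a: ||w - eta*a||^2 = ||w||^2 + eta^2 * ||a||^2 = 1 + eta^2.
    assert np.isclose(np.linalg.norm(w - eta * a), np.sqrt(1 + eta**2))
    # Dividing by sqrt(1 + eta^2) retracts the update onto the hypersphere.
    assert np.isclose(np.linalg.norm((w - eta * a) / np.sqrt(1 + eta**2)), 1.0)
```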