Differential geometry of ML

by Kyuhyeon Choi & fal.ai
kyuhyeon@fal.ai

Machine learning has achieved remarkable advancements largely due to the success of gradient descent algorithms. To gain deeper mathematical insight into these algorithms, it is essential to adopt an accurate geometric perspective. In this article, we introduce the fundamental notion of a manifold as a mathematical abstraction of continuous spaces. By providing a clear geometric interpretation of gradient descent within this manifold framework, we aim to help readers develop a precise understanding of gradient descent algorithms.


Put simply, a manifold is the most concise mathematical model for understanding “continuous spaces.” But what does “continuous” mean?

Let’s consider this through an example. Suppose there is a point in space, and I have $n$ degrees of freedom to move this point, so that I can move the point continuously. Then every point in the neighborhood of that point can be uniquely obtained by appropriately adjusting the weights of those $n$ degrees of freedom (the neighborhood can be parameterized). To express this concisely in mathematical terms:

“The space around a point somehow looks like $\mathbb{R}^n$.”

2D surface with a point and two red directions

Figure 1: 2D sphere with a point $P$. We can locally perturb $P$ in two independent directions $\vec{v}_1$, $\vec{v}_2$.

With this philosophy, we can define a manifold as follows:

“When the space around every point $p$ in a space $M$ looks like Euclidean space, we call this space a manifold.”

In this case, if the space around every point $p$ can be continuously obtained by $n$ degrees of freedom, we define the dimension of this manifold to be $n$.

Here are some concrete examples of manifolds.

Example 1 (The sphere $S^n$) The sphere $S^n \subset \mathbb{R}^{n+1}$ is a manifold, because we can “smoothly” flatten it to a plane, locally.

Example 2 (regular value) From the previous example, we can see that the sphere $S^n$ is a manifold. In fact, the sphere can be defined by the equation $x_1^2 + x_2^2 + \cdots + x_{n+1}^2 = 1$. Equivalently, if we define $f: \mathbb{R}^{n+1} \to \mathbb{R}$ by $f(x) = x_1^2 + x_2^2 + \cdots + x_{n+1}^2$, then the sphere is $f^{-1}(1)$.

In general, for any smooth function $f: \mathbb{R}^m \to \mathbb{R}$, a point $c \in \mathbb{R}^m$ is called a regular point of $f$ if the gradient $\nabla f(c) \neq 0$. A value $c \in \mathbb{R}$ is called a regular value of $f$ if all points in $f^{-1}(c)$ are regular points. The regular level set theorem then says that for any regular value $c$, the level set $f^{-1}(c)$ is a manifold [6].

Example 3 (Torus) For $a > b > 0$, consider the torus contained in $\mathbb{R}^3$ parametrized by

$$(x,y,z) = \left( (a+b\cos u)\cos v,\ (a+b\cos u)\sin v,\ b\sin u \right)$$

for $u, v \in [0, 2\pi]$. This is a smooth manifold because, intuitively, we can cover the torus with smooth charts. Alternatively, it can be proved by taking the function $f: \mathbb{R}^3 \to \mathbb{R}$ defined by $f(x,y,z) = (\sqrt{x^2 + y^2} - a)^2 + z^2$ and checking that the level set $f^{-1}(b^2)$ is a smooth manifold.
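As a quick numerical sanity check of Example 3 (the grid and the values of $a$, $b$ are our own choices, not from the text), we can sample the parametrization on a grid and confirm that every sampled point satisfies $f(x,y,z) = b^2$:

```python
import numpy as np

a, b = 2.0, 0.5  # any a > b > 0

# sample (u, v) on a grid and map through the torus parametrization
u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 50), np.linspace(0, 2 * np.pi, 50))
x = (a + b * np.cos(u)) * np.cos(v)
y = (a + b * np.cos(u)) * np.sin(v)
z = b * np.sin(u)

# f(x, y, z) = (sqrt(x^2 + y^2) - a)^2 + z^2 should be constantly b^2
f = (np.sqrt(x**2 + y**2) - a) ** 2 + z**2
assert np.allclose(f, b**2)  # the parametrized torus lies in the level set f^{-1}(b^2)
```

Since $a > b > 0$, we have $\sqrt{x^2+y^2} = a + b\cos u$, so $f = (b\cos u)^2 + (b\sin u)^2 = b^2$ identically, which the check confirms.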

Goal of this post

If we have a space called a manifold, we can consider various mathematical objects on that space. The mathematical objects of a manifold extend objects in Euclidean space, which is the local model (by ‘local’ we mean information about how the space around every point looks). The basic geometric objects in Euclidean space are:

  • The Euclidean space $\mathbb{R}^n$.
  • A vector field $X$ on $\mathbb{R}^n$.
  • A covector field $\alpha$ on $\mathbb{R}^n$.
  • The Euclidean metric $|\cdot|$ on $\mathbb{R}^n$.

Our goal is to understand these objects and generalize them by replacing the space $\mathbb{R}^n$ with an arbitrary smooth manifold $M$.

Tangent vectors, Tangent space

For Euclidean space $\mathbb{R}^n$, the commonly understood concepts of tangent vectors and tangent spaces are as follows:

  • A tangent vector at a single point is a vector attached at that point.
  • The tangent space at each point is a copy of $\mathbb{R}^n$ attached to that point. (This is the space where tangent vectors live.)
  • A vector field is a collection of tangent vectors, one attached to each point, or equivalently, a map that assigns a tangent vector to each point.

We must convert these notions to a general manifold $M$. To do so, we need to understand the notion of a tangent vector in a new way.

Philosophy 1

A tangent vector at $p$ = an infinitesimal movement at $p$ = a differential operator at $p$

Correspondence 1 (tangent vector ↔ infinitesimal movement) If a vector is attached to a point, we can move the point for a very short moment in the direction of that vector. Conversely, if a point moves slightly for a very short moment, the derivative in the direction of that movement becomes a vector. This is the correspondence between tangent vectors and infinitesimal movements.

Correspondence 2 (infinitesimal movement ↔ differential operator) Local movement can be understood as a differential operator. A differential operator is something that, given a function $f$, differentiates $f$ appropriately. In other words, to understand a local movement as a differential operator, we need to make the local movement act on any function $f$. If we move locally from a point $p$ to $p'$, we can observe the amount by which the value of $f$ changes due to this movement, and this becomes a first-order differential operator acting on $f$. Conversely, the fact that any such differential operator actually comes from a local movement is beyond the scope of this discussion, so we omit the proof.

In this perspective, we can generalize the notion of tangent vector to an arbitrary smooth manifold.

  • A tangent vector at a single point is a differential operator at that point.
  • A tangent space at each point is a vector space of differential operators at that point.
  • A vector field is a collection of differential operators, attached to each point, or equivalently, a map that assigns a differential operator to each point.

Example 4 Consider the Euclidean space $\mathbb{R}^n$. Suppose we have a vector field $x \mapsto (a^1(x), \cdots, a^n(x))$, $x \in \mathbb{R}^n$. We now understand this as a differential operator $a^i \partial_i$, which acts on a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ as $$a^i \partial_i f = \sum_{i=1}^n a^i \frac{\partial f}{\partial x^i}.$$
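The operator $a^i \partial_i$ from Example 4 can be realized numerically. A minimal sketch (the helper name and test function are ours, for illustration only), using central finite differences on $\mathbb{R}^2$:

```python
import numpy as np

def directional_derivative(f, a, x, h=1e-6):
    """Apply the differential operator a^i ∂_i to f at the point x."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for i, coeff in enumerate(a):
        e = np.zeros_like(x)
        e[i] = h
        # central difference approximation of ∂f/∂x^i at x
        total += coeff * (f(x + e) - f(x - e)) / (2 * h)
    return total

f = lambda x: x[0] ** 2 + 3 * x[1]  # f(x, y) = x^2 + 3y
val = directional_derivative(f, a=[1.0, 2.0], x=[1.0, 0.0])
# analytically: 1 * ∂f/∂x + 2 * ∂f/∂y = 2*1 + 2*3 = 8
```

The numerical value agrees with the hand computation $a^1 \partial_1 f + a^2 \partial_2 f = 8$ at $(1, 0)$.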

Now, we introduce a notion that might be unfamiliar to some readers, which is the notion of a cotangent vector. A cotangent vector is basically a linear measurement of tangent vectors. Mathematically, it can be written as:

Definition (Cotangent vector) Let $M$ be a manifold, let $p \in M$ be a point, and let $T_pM$ be the tangent space at $p$. A cotangent vector at $p$ is a linear map $T_pM \to \mathbb{R}$. We denote the space of cotangent vectors at $p$ by $T_p^*M$. (By definition, it is the dual space of $T_pM$.)

A cotangent vector is a linear measurement of tangent vectors.
Figure 2: A cotangent vector is a "linear measurement" of tangent vectors.

It might seem unnatural to define a cotangent space at first glance, but it turns out to be a very natural notion, since it comes equipped with a canonical object: the differential of a smooth function.

Example 5 (Differential of a smooth function) Let $M$ be a manifold, let $p \in M$ be a point, and let $f: M \to \mathbb{R}$ be a smooth function. The differential of $f$ at $p$ is the linear map $df_p: T_pM \to \mathbb{R}$ defined by $$df_p(X) = X(f)(p).$$ We can understand $df_p$ as follows: it is an object that contains the first-derivative information of $f$ at $p$, and if we put a direction into $df_p$, it returns the derivative of $f$ in that direction. That is, any function on a manifold has the ability to linearly measure tangent vectors, and therefore gives a cotangent vector.

As in the tangent case, we can define the notions of cotangent space, cotangent vector, and cotangent field on an arbitrary smooth manifold.

  • The cotangent space at each point is the dual vector space of the tangent space at that point.
  • A cotangent vector at a single point is an element of the cotangent space at that point.
  • A cotangent vector field is a collection of cotangent vectors, one attached to each point, or equivalently, a map that assigns a cotangent vector to each point.

The remaining object is the metric; we need to understand how we can define a metric on a manifold. To do so, we first need to understand what a canonical bundle over a smooth manifold is.

What is a tensor bundle?

In the previous section, we defined the notion of tangent space and cotangent space. Using these objects, we can define a tangent bundle and a cotangent bundle over a smooth manifold.

Definition (Tangent bundle, cotangent bundle) Let $M$ be a smooth manifold. The tangent bundle of $M$ is the union of all tangent spaces, i.e. $$TM = \bigcup_{p\in M} T_pM.$$ The cotangent bundle of $M$ is the union of all cotangent spaces, i.e. $$T^*M = \bigcup_{p\in M} T_p^*M.$$

The easiest way to understand bundles is to think of a bundle as an object formed by collecting and gluing together Euclidean spaces. For example, in the case of a 2-dimensional surface, the manifold itself has 2 degrees of freedom, and since there is a 2-dimensional tangent space at each point, attaching an additional 2-dimensional space at each point gives a 4-dimensional space: the tangent bundle of the surface.

visualization of tangent bundle of a 2-sphere
Figure 3: Visualization of tangent bundle of a 2-sphere.

In addition to being a manifold itself, the tangent bundle has one more nice property. Before explaining it, let us first define what a fiber is.

Definition (fiber) For two manifolds $M$, $N$, consider a continuous function $f: M \to N$. For a point $p$ in $N$, we call $f^{-1}(p)$ the fiber over $p$.

Now let us examine one more nice property of the tangent (resp. cotangent) bundle.

Property Let $M$ be an $n$-dimensional manifold, and let $TM$ ($T^*M$) be the tangent bundle (cotangent bundle, respectively). Then the projection $$TM\ (T^*M) \to M, \quad (p, v) \mapsto p$$ has an $n$-dimensional vector space structure on each fiber.

We call a manifold $E$ with a map $E \to M$ such that each fiber has a vector space structure a vector bundle over $M$. From this perspective, $TM$ and $T^*M$ are vector bundles over $M$. The following are simple properties of vector bundles.

Proposition Let $E$, $F$ be vector bundles over $M$. Then:

  1. $E$ is a smooth manifold.
  2. We can define a dual bundle $E^{\vee}$ of $E$.
  3. We can define a tensor product bundle $E \otimes F$ of $E$ and $F$.

For those who are not familiar with dual spaces and tensor products, in the next section we will briefly review these concepts.

Dual space, Tensor product

For a vector space $V$, consider the collection $V^*$ of all linear functions from $V$ to the real numbers $\mathbb{R}$. If we fix $f, g \in V^*$, then for a real number $c$, $f + cg$ is also a linear function from $V$ to the real numbers, and therefore $V^*$ has a natural vector space structure. We call this space $V^*$ the dual space of $V$.

Example 6 Consider the vector space $V = \mathbb{R}^n$. What does the dual space of this space look like? First, the easiest basis to think of is $e_i = (0, \ldots, 1_{(i\text{th})}, \ldots, 0)$. Now let us define $f^i: V \to \mathbb{R}$ as follows:

$$f^i\Big(\sum_j a_j e_j\Big) = a_i.$$

That is, $f^i$ is the projection onto the $i$-th coordinate. We can easily see that linear combinations of the $f^i$’s constitute $V^*$, and therefore

$$V^* = \operatorname{span} \{f^i : 1 \le i \le n\}.$$

When there are two vector spaces $V$ and $W$, let the basis of $V$ be $v_1, \ldots, v_n$, and the basis of $W$ be $w_1, \ldots, w_m$. We want to define a vector space that combines $V$ and $W$ and has the $mn$ pairs $(v_i, w_j)$ as a basis. In mathematical notation, we write the basis elements as $v_i \otimes w_j$, and write the space spanned by these basis elements as $V \otimes W$.

Using dual spaces and tensor products, we can understand matrices mathematically. If $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$, then we understand linear functions from $V$ to $W$ as $m \times n$ matrices. Then what does the element in the $i$-th row and $j$-th column of this matrix mean? Consider the matrix $A$ that has a 1 in the $i$-th row, $j$-th column position and 0 everywhere else. If we examine how this matrix acts on an element $v$ of $V$, it first projects onto the $j$-th coordinate of $v$, and then outputs that value to the $i$-th coordinate of $W$. From the tensor product perspective, we can therefore think of $A = e_i \otimes f^j$. Since these matrices form a basis for the space of all linear functions from $V$ to $W$, we have $$\{\text{linear maps } V \to W\} = W \otimes V^*.$$

Figure 4: Visualization of a canonical basis of matrix space. This visually shows the equation $A = e_i \otimes f^j$.
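The identity $A = e_i \otimes f^j$ can be checked concretely (the dimensions and indices below are our own illustrative choices): the outer product of the $i$-th standard basis vector of $W = \mathbb{R}^m$ with the $j$-th dual basis vector of $V = \mathbb{R}^n$ is exactly the matrix with a single 1 in row $i$, column $j$.

```python
import numpy as np

m, n, i, j = 3, 4, 1, 2
e_i = np.eye(m)[i]      # basis vector e_i of W = R^3
f_j = np.eye(n)[j]      # dual basis vector f^j of V = R^4 (as coefficients)
A = np.outer(e_i, f_j)  # e_i ⊗ f^j, an m x n matrix with a single 1 at (i, j)

v = np.array([5.0, 6.0, 7.0, 8.0])
# A first projects v onto its j-th coordinate (f^j(v)), then writes that
# value into the i-th coordinate of W (scaling e_i):
assert np.allclose(A @ v, (f_j @ v) * e_i)
```

Any matrix is a linear combination of such $e_i \otimes f^j$, which is the content of $\{\text{linear maps } V \to W\} = W \otimes V^*$.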

Return to Bundle

Returning to the discussion, we can now understand duals of vector bundles and tensor products between vector bundles. We can perform these operations on the vector spaces corresponding to the fibers at each point of the manifold, then collect and glue together all the resulting vector spaces to create new vector bundles. What we obtain in this way are the $E^{\vee}$ and $E \otimes F$ defined above.

Recall that $TM$ and $T^*M$ are objects that are defined as soon as we define $M$. From the perspective of the above proposition, $$\underbrace{TM \otimes \cdots \otimes TM}_{k} \otimes \underbrace{T^*M \otimes \cdots \otimes T^*M}_{l} \to M$$ is therefore an object that is immediately defined over $M$. We call such a canonical vector bundle a tensor bundle.

Now, we can understand tangent fields and cotangent fields as objects defined from sections of bundles.

Definition (section of a vector bundle) Let $E \to M$ be a vector bundle. A section of $E$ is a map $s: M \to E$ such that $s(p) \in E_p$ for all $p \in M$.

In other words, a section of a vector bundle is a map that assigns to each point of the manifold a vector in the fiber over that point. Therefore, tangent and cotangent fields can be redefined as follows.

Definition (tangent field, cotangent field) Let $M$ be a manifold. A tangent field on $M$ is a section of $TM$, and a cotangent field on $M$ is a section of $T^*M$.

Canonical operations in differential geometry

Given a map $f: M \to N$, how do objects like tangent vectors and cotangent vectors move canonically under $f$?

Philosophy A map between two manifolds “pushes” tangent vectors and “pulls” cotangent vectors.

To explain, we check explicitly how $f$ pushes and pulls such vectors. We start with the pushforward of a tangent vector.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the pushforward of a tangent vector $X_p \in T_pM$ is the tangent vector $f_*X_p \in T_{f(p)}N$, which is the differential operator acting on a smooth function $g: N \to \mathbb{R}$ by $$f_*X_p(g) = X_p(g \circ f).$$

Example 7 Suppose we have a smooth map $f: \mathbb{R}^m \to \mathbb{R}^n$. For a tangent vector $X_p = a^i \partial_i \in T_p\mathbb{R}^m$, we calculate the pushforward $f_*X_p$. Take a function $g: \mathbb{R}^n \to \mathbb{R}$; we have $$f_*X_p(g) = X_p(g \circ f) = a^i \partial_i (g \circ f) = a^i \left( \frac{\partial g}{\partial x^j} \circ f \right) \frac{\partial f^j}{\partial x^i}.$$ Therefore, $f_*X_p = a^i \frac{\partial f^j}{\partial x^i} \partial_j$.

Example 8 When we have a manifold $M$ and a point $p$ on it, intuitively, a differential operator at $p$ measures the rate of change of a function along a local path passing through $p$. Let’s write this in a mathematically rigorous way. A path through $p$ is expressed as a parametrized path $\gamma: (-\epsilon, \epsilon) \to M$ with $\gamma(0) = p$. The differential operator we intuitively described above is actually $\gamma_*\left(\frac{d}{dt}\right)$. Verifying this mathematically, for a smooth function $f: M \to \mathbb{R}$ we have

$$\gamma_*\left(\frac{d}{dt}\right) f = \frac{d}{dt}(f \circ \gamma).$$

That is, $\gamma_*\left(\frac{d}{dt}\right)$ is an operator that gives the derivative of $f$ along the path $\gamma$.

A line segment embedded in a manifold; a tangent vector of the segment is pushed forward to a tangent vector of the manifold.
Figure 5: A line segment is embedded in a manifold. A tangent vector of the segment is pushed forward to a tangent vector of the manifold.

Since a smooth map pushes vectors, from the perspective of tangent spaces, we can see that it defines a linear map that pushes between tangent spaces.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the differential of $f$ at $p$ is the linear map $df_p: T_pM \to T_{f(p)}N$ which maps a tangent vector $X_p \in T_pM$ to the tangent vector $df_p(X_p) \in T_{f(p)}N$ defined by $$df_p(X_p) = f_*(X_p).$$

Example 9 Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be a smooth map. From the previous example, we have $df_p(a^i \partial_i) = a^i \frac{\partial f^j}{\partial x^i} \partial_j$. Therefore, $df_p$ is the linear map $T_p\mathbb{R}^m \to T_{f(p)}\mathbb{R}^n$ which maps $$a^i \partial_i \mapsto a^i \frac{\partial f^j}{\partial x^i} \partial_j.$$ Here, we see that $\frac{\partial f^j}{\partial x^i}$ is the matrix that represents the map $df_p$. This matrix is called the Jacobian matrix of $f$ at $p$.
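The formula $f_*X_p = a^i \frac{\partial f^j}{\partial x^i} \partial_j$ says the pushforward is just the Jacobian matrix acting on the coefficient vector $a$. A sketch (the helper and the test map are our own illustrative choices), using a numerical Jacobian:

```python
import numpy as np

def jacobian(f, p, h=1e-6):
    """Numerical Jacobian of f: R^m -> R^n at p, shape (n, m)."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        cols.append((f(p + e) - f(p - e)) / (2 * h))  # column ∂f/∂x^i
    return np.stack(cols, axis=1)

# f: R^2 -> R^2, f(x, y) = (x*y, x + y)
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
p = np.array([2.0, 3.0])
a = np.array([1.0, -1.0])          # coefficients of the tangent vector a^i ∂_i
pushforward = jacobian(f, p) @ a   # coefficients of f_* X_p in the basis ∂_j
# analytically: J = [[y, x], [1, 1]] = [[3, 2], [1, 1]], so J a = (1, 0)
```

The result matches the hand computation $J a = (1, 0)$ at $p = (2, 3)$.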

Now we see how a smooth map acts on cotangent vectors. Intuitively, a cotangent space is the dual space of a tangent space, so the natural direction of the action should be reversed. Therefore, we may infer that a smooth map will pull cotangent vectors.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the pullback of a cotangent vector $\alpha_{f(p)} \in T^*_{f(p)}N$ is the cotangent vector $f^*\alpha_{f(p)} \in T^*_pM$, which is the linear map $T_pM \to \mathbb{R}$ defined by $$f^*(\alpha_{f(p)}) = \alpha_{f(p)} \circ df_p.$$

Example 10 In the previous example, we defined the differential of a smooth real-valued function on a manifold, and we saw that this object is a cotangent vector field. In fact, it can be understood as a pullback. First, define the canonical cotangent vector field $dt$ on the real line satisfying $$dt\left(\frac{d}{dt}\right) = 1.$$ Now let’s prove that the pullback of $dt$ by $f$ is $df$. Fix $p \in M$ and take a tangent vector $X_p$ at $p$. Then $$f^*(dt)(X_p) = dt(df_p(X_p)) = dt(f_* X_p).$$ We claim that $f_* X_p = X_p(f)\, \frac{d}{dt}\big|_{f(p)}$. To see this, take any smooth function $g: \mathbb{R} \to \mathbb{R}$; by the chain rule, $$f_* X_p(g) = X_p(g \circ f) = g'(f(p))\, X_p f = \left(X_p f\, \frac{d}{dt}\Big|_{f(p)}\right) g.$$ Therefore, going back to the above equation, $$f^*(dt)(X_p) = dt(f_* X_p) = dt\left(X_p f\, \frac{d}{dt}\Big|_{f(p)}\right) = X_p f = df_p(X_p).$$

So far, we have examined how $f: M \to N$ acts on tangent vectors and cotangent vectors. Now let’s describe this at the level of tangent bundles and cotangent bundles. Before that, let’s first define a morphism between vector bundles.

Definition (morphism of vector bundles) Let $E \to M$ and $F \to N$ be vector bundles. A smooth map $\tilde{f}: E \to F$ together with a map $f: M \to N$ is called a morphism of vector bundles if

  • $\tilde{f}$ maps each fiber of $E$ to a fiber of $F$. Precisely, if $p \in M$ is a point and $E_p$ is the fiber of $E$ at $p$, then $\tilde{f}(E_p) \subset F_{f(p)}$.
  • $\tilde{f}: E_p \to F_{f(p)}$ is a linear map.

Example 11 Let $f: M \to N$ be a smooth map. Then the pushforward $f_*$ induces a morphism of vector bundles, defined as follows: $$df: TM \to TN, \quad df((p, X_p)) = (f(p), f_*(X_p)).$$

Unfortunately, the pullback $f^*$ does not induce a morphism of vector bundles, since there is no map from $N$ to $M$. However, we can still construct a canonical object.

Definition (Global section of a vector bundle) Let $E \to M$ be a vector bundle over $M$. A global section of $E$ is a map $s: M \to E$ such that $s(p) \in E_p$ for all $p \in M$. The collection of all smooth global sections of $E$ is denoted by $\Gamma(E)$.

Example 12 Let $M$ be a smooth manifold. The collection of all smooth vector fields on $M$ is $\Gamma(TM)$.

Example 13 Let $M$ be a smooth manifold and let $f: M \to \mathbb{R}$ be a smooth function. Then the differential $df$ is an element of $\Gamma(T^*M)$.

Definition (Pullback of a global section) Let $f: M \to N$ be a smooth map. Take $\alpha \in \Gamma(T^*N)$. The pullback of $\alpha$ is the global section of $T^*M$ defined by $$f^*\alpha: M \to T^*M, \quad (f^*\alpha)|_p = f^*(\alpha_{f(p)}).$$ Therefore, $f^*$ induces a map $f^*: \Gamma(T^*N) \to \Gamma(T^*M)$.

Therefore, given a smooth map $f: M \to N$, we have canonical maps

  • $df: TM \to TN$,
  • $f^*: \Gamma(T^*N) \to \Gamma(T^*M)$.

It can be shown that these extend to tensor bundles: for a positive integer $k$, we have canonical maps

  • $df: T^k M \to T^k N$,
  • $f^*: \Gamma(T^{*k}N) \to \Gamma(T^{*k}M)$.

What is a metric?

In the previous section, we studied various geometric objects on a smooth manifold, including bundles and global sections. In this section, we study a metric on a smooth manifold and understand it as a global section of a tensor bundle. To do so, we first look at the easiest example of a metric: the Euclidean distance arising from an inner product on a vector space.

Definition (Inner product) Let $V$ be a vector space. An inner product on $V$ is a map $g: V \times V \to \mathbb{R}$ that satisfies the following properties:

  • Bilinearity: $g(av + bw, z) = a\, g(v,z) + b\, g(w,z)$ for all $a, b \in \mathbb{R}$ and $v, w, z \in V$.
  • Symmetry: $g(v,w) = g(w,v)$ for all $v, w \in V$.
  • Positive-definiteness: $g(v,v) \geq 0$ for all $v \in V$, and $g(v,v) = 0$ if and only if $v = 0$.

Then we can equip the vector space with a norm $|v| = \sqrt{g(v,v)}$, and this norm induces a metric on $V$.

What is important here are two things:

  • An inner product is symmetric, bilinear, and positive-definite.
  • An inner product takes two vectors and returns a real number.

When a vector space $V$ is given, since an inner product is an operator that takes two vectors and outputs a single real value, it can be understood as an element of $V^* \otimes V^*$. Thinking from the perspective of manifolds and tangent bundles, we can then intuitively understand a metric on a manifold as a 2-tensor field: an object that, at each point, takes two tangent vectors and returns a real number, i.e. a pair of cotangent directions tensored together at each point of the manifold. More rigorously, this can be written as follows.

Definition (Metric) [7] Let $M$ be a smooth manifold. A metric $g$ on $M$ is a global section of $T^*M \otimes T^*M$ which is bilinear, symmetric, and positive-definite.

A metric g on a sphere, taking two tangent vectors at a point p and returning a real value

Figure 6: A metric $g$ on a sphere. At a point $p$, $g$ takes two tangent vectors and gives one real value.

Example 14 The standard metric on $\mathbb{R}^n$ is given by $$g = \sum_{i=1}^n dx^i \otimes dx^i,$$ where $dx^i$ is the differential of the $i$-th coordinate function $x^i: \mathbb{R}^n \to \mathbb{R}$. Fix $p \in \mathbb{R}^n$ and suppose we have a tangent vector $X_p = a^i \partial_i \in T_p\mathbb{R}^n$. Then we have $$g_p(X_p, X_p) = g_p(a^i \partial_i, a^j \partial_j) = a^i a^j g_p(\partial_i, \partial_j) = \sum_{i=1}^n (a^i)^2.$$ Since a tangent vector represents an infinitesimal direction of change, this value can be understood as the square of the infinitesimal change in the tangent direction.
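As a quick numerical illustration of Example 14 (the dimension and coefficients are our own choices): in the basis $\{\partial_i\}$, the standard metric at any point is the identity matrix, so $g_p(X_p, X_p) = \sum_i (a^i)^2$.

```python
import numpy as np

g_p = np.eye(3)                  # matrix of g_p in the basis {∂_1, ∂_2, ∂_3}
a = np.array([1.0, 2.0, 2.0])    # coefficients of X_p = a^i ∂_i

# g_p(X_p, X_p) as a quadratic form equals the sum of squared coefficients
assert np.isclose(a @ g_p @ a, np.sum(a**2))
```

With a non-identity (but still symmetric positive-definite) matrix in place of `np.eye(3)`, the same quadratic form computes $g_p(X_p, X_p)$ for a general metric.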

From this definition, we can calculate the length of a curve explicitly.

Example 15 (Length of a curve) Let $M$ be a smooth manifold with a metric $g$, and let $\gamma: [0,1] \to M$ be a curve. The length of $\gamma$ is given by $$L(\gamma) = \int_0^1 \sqrt{g(\gamma'(t), \gamma'(t))}\, dt.$$ Here, $\gamma'(t)$ is the tangent vector at the point $\gamma(t)$, tangent to the curve $\gamma$, defined by $$\gamma'(t) = \gamma_*\left(\frac{d}{dt}\Big|_t\right).$$
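A minimal numerical check of the length formula (the curve and discretization are our own illustrative setup): the unit circle $\gamma(t) = (\cos 2\pi t, \sin 2\pi t)$, $t \in [0,1]$, should have length $2\pi$ in the standard metric on $\mathbb{R}^2$.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
# γ'(t) for γ(t) = (cos 2πt, sin 2πt)
gamma_dot = np.stack([-2 * np.pi * np.sin(2 * np.pi * t),
                      2 * np.pi * np.cos(2 * np.pi * t)], axis=1)
# sqrt(g(γ'(t), γ'(t))) with the standard metric is the Euclidean speed
speed = np.sqrt(np.sum(gamma_dot**2, axis=1))
# trapezoid-rule approximation of L(γ) = ∫ sqrt(g(γ', γ')) dt
length = np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(t))
```

The computed `length` agrees with $2\pi$ to numerical precision, since the speed is constant here.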

One common misconception is thinking that when there is a manifold and a function $f$ on it, there naturally exists an object called the gradient. This is incorrect; a gradient only exists when there is a metric on the manifold. In fact, as we saw in the previous section, when there is a function on a manifold, the naturally existing object is the differential of the function, which belongs to $\Gamma(T^*M)$. Since a gradient is a vector field, we can only consider a gradient when there exists a means of transferring objects in $\Gamma(T^*M)$ to $\Gamma(TM)$.

When there is a vector space $V$ with an inner product $\langle \cdot, \cdot \rangle$, we can define an isomorphism between $V$ and its dual space $V^*$ as follows: $$V \to V^*, \quad v \mapsto \langle v, \cdot \rangle.$$ Through this, we can view elements of the dual space $V^*$ as elements of $V$.

Using this, when there is a metric $g$, we can define a bundle isomorphism $\sharp: T^*M \to TM$ as follows: $$(p, \alpha|_p) \mapsto (p, \alpha|_p^\sharp), \quad \text{where} \quad g_p(\alpha|_p^\sharp, \cdot) = \alpha|_p.$$

Definition (Gradient) Let $M$ be a smooth manifold and let $f: M \to \mathbb{R}$ be a smooth function. The gradient of $f$ is the vector field $\nabla f$ on $M$ defined by $$\nabla f = \sharp\, df.$$

Example 16 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$, and a smooth function $f: \mathbb{R}^n \to \mathbb{R}$. The differential of $f$ is given by $$df = \sum_{i=1}^n \frac{\partial f}{\partial x^i} dx^i.$$ Therefore, the gradient of $f$ is given by $$\nabla f = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \partial_i.$$
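In coordinates, the musical isomorphism $\sharp$ is easy to compute: if the metric at a point is represented by a positive-definite matrix $G$, then $g(\nabla f, \cdot) = df$ reads $G\,(\text{grad coefficients}) = (\text{coefficients of } df)$, so $\nabla f = G^{-1}\, df$. A sketch with an illustrative non-Euclidean metric of our own choosing:

```python
import numpy as np

G = np.diag([1.0, 4.0])        # a non-Euclidean diagonal metric on R^2
df = np.array([2.0, 8.0])      # coefficients of df in the basis dx^i at a point

# ∇f = ♯ df: solve G @ grad = df, i.e. grad = G^{-1} df
grad = np.linalg.solve(G, df)
# with the Euclidean metric G = I the gradient would equal df;
# here the second component is rescaled by 1/4
```

This is the coordinate form of the statement that a gradient exists only once a metric is chosen: the same $df$ yields different vector fields $\nabla f$ under different $G$.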

We end this section by understanding the gradient from another perspective.

Definition Let $M$ be a smooth manifold and let $g$ be a metric on $M$. The metric $g$ induces a tensor $g^\sharp \in \Gamma(TM \otimes TM)$ by $$g^\sharp(\alpha, \beta) = g(\alpha^\sharp, \beta^\sharp).$$

Example 17 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$. Then the metric $g$ induces the tensor $g^\sharp \in \Gamma(T\mathbb{R}^n \otimes T\mathbb{R}^n)$ given by $$g^\sharp = \sum_{i=1}^n \partial_i \otimes \partial_i.$$

Definition (gradient using $g^\sharp$) Let $M$ be a smooth manifold with a metric $g$. The gradient of a smooth function $f: M \to \mathbb{R}$ using $g^\sharp$ is given by $$\nabla_g f = g^\sharp(df).$$ Here, $g^\sharp$ is a 2-tensor, so applying it to a 1-covector gives a 1-tensor, which is a tangent vector.

Example 18 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$. Then the gradient of a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ using $g^\sharp$ is given by $$g^\sharp(df) = \left(\sum_{i=1}^n \partial_i \otimes \partial_i\right)(df) = \sum_{i=1}^n \partial_i\, \langle \partial_i, df \rangle = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \partial_i.$$ This is identical to the gradient of $f$ in the previous example.

Differential geometric setting for DNN

In this section we define the mathematical objects that appear in deep learning and understand them geometrically.

  • Let $X_0$ be an input space.
  • Let $X_1$ be an output space.
  • Let $\mathcal{F}$ be a space of functions from $X_0$ to $X_1$.
  • Let $\mathcal{P}$ be a manifold of parameters. We use $\theta$ to denote a point in $\mathcal{P}$.
  • Let $F$ be a model, which is a map $F: \mathcal{P} \to \mathcal{F}$.
  • Let $L: \mathcal{F} \to \mathbb{R}$ be a loss function.

Since DNNs use gradient descent to optimize the parameters, we understand that the parameter space $\mathcal{P}$ must be equipped with a metric.

Remark The spaces $X_0$, $X_1$, and $\mathcal{P}$ are usually chosen to be Euclidean spaces. Therefore, $\mathcal{F}$ is also a vector space (a vector space of functions). Since the tangent space of a vector space can be identified with the space itself, all of their tangent spaces are canonically identified with the spaces themselves.

Remark Different metrics assigned to $\mathcal{P}$ induce different optimization algorithms. For example, when $\mathcal{P}$ is equipped with the Euclidean metric, the resulting algorithm is standard gradient descent. However, by assigning a spectral norm (on matrix space), we obtain different optimization algorithms such as Muon or Shampoo [3], [4], [5].

Diagram showing a model F pushing the 2-tensor g^sharp to the NTK over function space F

Figure 7: A model $F$ pushes the 2-tensor $g^\sharp$ to the NTK $F_*(g^\sharp)$ living over the function space $\mathcal{F}$.

Neural Tangent Kernel

The strength of this framework is that we can understand the NTK directly. The NTK is fundamentally an approach that understands parameter changes as function changes [1], [2].

Definition (Neural Tangent Kernel) Let $F: \mathcal{P} \to \mathcal{F}$ be a model and let $g$ be a metric on $\mathcal{P}$. The NTK $\Theta$ at $\theta \in \mathcal{P}$ is $$\Theta(\theta) = F_*(g^\sharp|_{\theta}).$$

Recall the previous definition: we can compute the gradient using the 2-tensor $g^\sharp$. In the DNN setting, there is a gradient flow on $\mathcal{P}$, and this gradient is that of the composition $L \circ F$. That is, we can understand this gradient flow by taking the differential of $L \circ F$ and contracting it with $g^\sharp$. If we want to view this flow not on $\mathcal{P}$ but on $\mathcal{F}$, we push the whole situation forward to $\mathcal{F}$, obtaining a flow on function space by contracting the differential of $L$ with the pushforward of $g^\sharp$. In other words, the NTK is what allows us to view the gradient flow in the sense of function space.

We also verify that this definition is equivalent to the one in the literature by an explicit calculation.

Example 19 Equipping $\mathcal{P}$ with the Euclidean metric, we have $$F_*(g^\sharp) = F_*\left(\sum_{p=1}^{P} \partial_{\theta_p} \otimes \partial_{\theta_p}\right) = \sum_{p=1}^{P} \partial_{\theta_p}F \otimes \partial_{\theta_p}F.$$ From [2], we have $$\Theta(\theta) = \sum_{p=1}^{P} \partial_{\theta_p}F \otimes \partial_{\theta_p}F.$$ Therefore, they are identical.
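The formula $\Theta = \sum_p \partial_{\theta_p}F \otimes \partial_{\theta_p}F$ can be computed concretely for a toy model (everything below — the linear model, the features, and the sample points — is our own illustrative setup, not from the references). For a linear model $F(\theta)(x) = \sum_p \theta_p\, \phi_p(x)$ evaluated on finitely many inputs, $\partial_{\theta_p}F(x) = \phi_p(x)$, so the empirical NTK with the Euclidean metric on parameters is a Gram matrix of the feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 sample inputs in R^3

# features φ_p(x) of a hypothetical linear model F(θ)(x) = Σ_p θ_p φ_p(x)
phi = np.stack([X[:, 0], X[:, 1] ** 2, np.sin(X[:, 2])], axis=1)

# ∂_{θ_p} F(x) = φ_p(x), independent of θ for a linear model
jac = phi                    # shape (num_inputs, num_params)
ntk = jac @ jac.T            # Θ(x, x') = Σ_p ∂F(x) ∂F(x'), a 5 x 5 kernel

# the NTK is symmetric positive semi-definite, as a pushed-forward g^♯ should be
assert np.allclose(ntk, ntk.T)
assert np.linalg.eigvalsh(ntk).min() > -1e-10
```

For a nonlinear network the Jacobian `jac` would depend on $\theta$ (e.g. via autodiff), but the contraction $\Theta = J J^\top$ is the same.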

References

[1] Arora, Sanjeev et al. On Exact Computation with an Infinitely Wide Neural Net. NeurIPS 2019.

[2] Arthur Jacot, Franck Gabriel, Clément Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, arXiv:1806.07572

[3] Gupta, Vineet, et al. “Shampoo: Preconditioned stochastic tensor optimization” (2018)

[4] Jeremy Bernstein and Laker Newhouse. “Old optimizer, new norm: An anthology.” arXiv preprint arXiv:2409.20325 (2024).

[5] Keller Jordan, Muon: A Matrix Norm Optimizer for Deep Learning, https://kellerjordan.github.io/posts/muon/

[6] Lee, John M. Introduction to Smooth Manifolds, (2002)

[7] Lee, John M. Introduction to Riemannian Manifolds, (2018)