Differential geometry of ML

by Kyuhyeon Choi & fal.ai
kyuhyeon@fal.ai

Machine learning has achieved remarkable advancements largely due to the success of gradient descent algorithms. To gain deeper mathematical insight into these algorithms, it is essential to adopt an accurate geometric perspective. In this article, we introduce the fundamental notion of a manifold as a mathematical abstraction of continuous spaces. By providing a clear geometric interpretation of gradient descent within this manifold framework, we aim to help readers develop a precise understanding of gradient descent algorithms.


Put simply, a manifold is the most concise mathematical model for understanding “continuous spaces.” But what does “continuous” mean?

Let’s consider this through an example. Suppose there is a point in space, and I have $n$ degrees of freedom to move this point, so that I can move the point continuously. Then every point in the neighborhood of that point can be uniquely obtained by appropriately adjusting the weights of those $n$ degrees of freedom (the neighborhood can be parameterized). To express this concisely in mathematical terms:

“The space around a point somehow looks like $\mathbb{R}^n$.”

2D surface with a point and two red directions

Figure 1: 2D sphere with a point $P$. We can locally perturb $P$ in two independent directions $\vec{v}_1$, $\vec{v}_2$.

With this philosophy, we can define a manifold as follows:

“When the space around every point $p$ in a space $M$ looks like Euclidean space, we call this space a manifold.”

In this case, if the space around every point $p$ can be continuously obtained by $n$ degrees of freedom, we define the dimension of this manifold to be $n$.

Here are some concrete examples of manifolds.

Example 1 (The sphere $S^n$) The sphere $S^n \subset \mathbb{R}^{n+1}$ is a manifold, because we can “smoothly” flatten it to a plane, locally.

Example 2 (regular value) From the previous example, we can see that the sphere $S^n$ is a manifold. In fact, the sphere can be defined by the equation $x_1^2 + x_2^2 + \cdots + x_{n+1}^2 = 1$. Equivalently, if we define $f: \mathbb{R}^{n+1} \to \mathbb{R}$ by $f(x) = x_1^2 + x_2^2 + \cdots + x_{n+1}^2$, then the sphere is $f^{-1}(1)$.

In general, for any smooth function $f: \mathbb{R}^m \to \mathbb{R}$, a point $c \in \mathbb{R}^m$ is called a regular point of $f$ if the gradient $\nabla f(c) \neq 0$. A value $c \in \mathbb{R}$ is called a regular value of $f$ if all points in $f^{-1}(c)$ are regular points. The regular level set theorem then says that for any regular value $c$, the level set $f^{-1}(c)$ is a manifold [6].

Example 3 (Torus) For $a > b > 0$, consider the torus contained in $\mathbb{R}^3$ parametrized by

$$(x,y,z) = \left( (a+b\cos u)\cos v,\ (a+b\cos u)\sin v,\ b\sin u \right)$$

for $u, v \in [0, 2\pi]$. This is a smooth manifold because, intuitively, we can cover the torus with smooth charts. Alternatively, it can be proved by taking the function $f: \mathbb{R}^3 \to \mathbb{R}$ defined by $f(x,y,z) = (\sqrt{x^2 + y^2} - a)^2 + z^2$ and checking that the level set $f^{-1}(b^2)$ is a smooth manifold.
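As a quick numerical sanity check of Example 3 (the grid and the values of $a$, $b$ are our own choices, not from the text), we can sample the parametrization on a grid and confirm that every sampled point satisfies $f(x,y,z) = b^2$:

```python
import numpy as np

a, b = 2.0, 0.5  # any a > b > 0

# sample (u, v) on a grid and map through the torus parametrization
u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 50), np.linspace(0, 2 * np.pi, 50))
x = (a + b * np.cos(u)) * np.cos(v)
y = (a + b * np.cos(u)) * np.sin(v)
z = b * np.sin(u)

# f(x, y, z) = (sqrt(x^2 + y^2) - a)^2 + z^2 should be constantly b^2
f = (np.sqrt(x**2 + y**2) - a) ** 2 + z**2
assert np.allclose(f, b**2)  # the parametrized torus lies in the level set f^{-1}(b^2)
```

Since $a > b > 0$, we have $\sqrt{x^2+y^2} = a + b\cos u$, so $f = (b\cos u)^2 + (b\sin u)^2 = b^2$ identically, which the check confirms.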

Goal of this post

If we have a space called a manifold, we can consider various mathematical objects on that space. The mathematical objects of a manifold extend objects in Euclidean space, which is the local model (by ‘local’ we mean information about how the space around every point looks). The basic geometric objects in Euclidean space are:

  • The Euclidean space $\mathbb{R}^n$.
  • A vector field $X$ on $\mathbb{R}^n$.
  • A covector field $\alpha$ on $\mathbb{R}^n$.
  • The Euclidean metric $|\cdot|$ on $\mathbb{R}^n$.

Our goal is to understand these objects and generalize them by replacing the space $\mathbb{R}^n$ with an arbitrary smooth manifold $M$.

Tangent vectors, Tangent space

For Euclidean space $\mathbb{R}^n$, the commonly understood concepts of tangent vectors and tangent spaces are as follows:

  • A tangent vector at a single point is a vector attached at that point.
  • The tangent space at each point is a copy of $\mathbb{R}^n$ attached to that point. (This is the space where tangent vectors live.)
  • A vector field is a collection of tangent vectors, one attached to each point, or equivalently, a map that assigns a tangent vector to each point.

We must convert these notions to a general manifold $M$. To do so, we need to understand the notion of a tangent vector in a new way.

Philosophy 1

A tangent vector at $p$ = an infinitesimal movement at $p$ = a differential operator at $p$

Correspondence 1 (tangent vector ↔ infinitesimal movement) If a vector is attached to a point, we can move the point for a very short moment in the direction of that vector. Conversely, if a point moves slightly for a very short moment, the derivative in the direction of that movement becomes a vector. This is the correspondence between tangent vectors and infinitesimal movements.

Correspondence 2 (infinitesimal movement ↔ differential operator) Local movement can be understood as a differential operator. A differential operator is something that, given a function $f$, differentiates $f$ appropriately. In other words, to understand a local movement as a differential operator, we need to make the local movement act on any function $f$. If we move locally from a point $p$ to $p'$, we can observe the amount by which the value of $f$ changes due to this movement, and this becomes a first-order differential operator acting on $f$. Conversely, the fact that any such differential operator actually comes from a local movement is beyond the scope of this discussion, so we omit the proof.

In this perspective, we can generalize the notion of tangent vector to an arbitrary smooth manifold.

  • A tangent vector at a single point is a differential operator at that point.
  • A tangent space at each point is a vector space of differential operators at that point.
  • A vector field is a collection of differential operators, attached to each point, or equivalently, a map that assigns a differential operator to each point.

Example 4 Consider the Euclidean space $\mathbb{R}^n$. Suppose we have a vector field $x \mapsto (a^1(x), \cdots, a^n(x))$, $x \in \mathbb{R}^n$. We now understand this as a differential operator $a^i \partial_i$, which acts on a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ as $$a^i \partial_i f = \sum_{i=1}^n a^i \frac{\partial f}{\partial x^i}.$$
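The operator $a^i \partial_i$ from Example 4 can be realized numerically. A minimal sketch (the helper name and test function are ours, for illustration only), using central finite differences on $\mathbb{R}^2$:

```python
import numpy as np

def directional_derivative(f, a, x, h=1e-6):
    """Apply the differential operator a^i ∂_i to f at the point x."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for i, coeff in enumerate(a):
        e = np.zeros_like(x)
        e[i] = h
        # central difference approximation of ∂f/∂x^i at x
        total += coeff * (f(x + e) - f(x - e)) / (2 * h)
    return total

f = lambda x: x[0] ** 2 + 3 * x[1]  # f(x, y) = x^2 + 3y
val = directional_derivative(f, a=[1.0, 2.0], x=[1.0, 0.0])
# analytically: 1 * ∂f/∂x + 2 * ∂f/∂y = 2*1 + 2*3 = 8
```

The numerical value agrees with the hand computation $a^1 \partial_1 f + a^2 \partial_2 f = 8$ at $(1, 0)$.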

Now, we introduce a notion that might be unfamiliar to some readers, which is the notion of a cotangent vector. A cotangent vector is basically a linear measurement of tangent vectors. Mathematically, it can be written as:

Definition (Cotangent vector) Let $M$ be a manifold, let $p \in M$ be a point, and let $T_pM$ be the tangent space at $p$. A cotangent vector at $p$ is a linear map $T_pM \to \mathbb{R}$. We denote the space of cotangent vectors at $p$ by $T_p^*M$. (By definition, it is the dual space of $T_pM$.)

A cotangent vector is a linear measurement of tangent vectors.
Figure 2: A cotangent vector is a "linear measurement" of tangent vectors.

It might seem unnatural to define a cotangent space at first glance, but it turns out to be a very natural notion, since it comes equipped with a canonical object: the differential of a smooth function.

Example 5 (Differential of a smooth function) Let $M$ be a manifold, let $p \in M$ be a point, and let $f: M \to \mathbb{R}$ be a smooth function. The differential of $f$ at $p$ is the linear map $df_p: T_pM \to \mathbb{R}$ defined by $$df_p(X) = X(f)(p).$$ We can understand $df_p$ as follows: it is an object that contains the first-derivative information of $f$ at $p$, and if we put a direction into $df_p$, it returns the derivative of $f$ in that direction. That is, any function on a manifold has the ability to linearly measure tangent vectors, and therefore gives a cotangent vector.

As in the tangent case, we can define the notions of cotangent space, cotangent vector, and cotangent field on an arbitrary smooth manifold.

  • The cotangent space at each point is the dual vector space of the tangent space at that point.
  • A cotangent vector at a single point is an element of the cotangent space at that point.
  • A cotangent vector field is a collection of cotangent vectors, one attached to each point, or equivalently, a map that assigns a cotangent vector to each point.

The remaining object is the metric; we need to understand how we can define a metric on a manifold. To do so, we first need to understand what a canonical bundle over a smooth manifold is.

What is a tensor bundle?

In the previous section, we defined the notion of tangent space and cotangent space. Using these objects, we can define a tangent bundle and a cotangent bundle over a smooth manifold.

Definition (Tangent bundle, cotangent bundle) Let $M$ be a smooth manifold. The tangent bundle of $M$ is the union of all tangent spaces, i.e. $$TM = \bigcup_{p\in M} T_pM.$$ The cotangent bundle of $M$ is the union of all cotangent spaces, i.e. $$T^*M = \bigcup_{p\in M} T_p^*M.$$

The easiest way to understand bundles is to think of a bundle as an object formed by collecting and gluing together Euclidean spaces. For example, in the case of a 2-dimensional surface, the manifold itself has 2 degrees of freedom, and since there is a 2-dimensional tangent space at each point, attaching an additional 2-dimensional space at each point gives a 4-dimensional space: the tangent bundle of the surface.

visualization of tangent bundle of a 2-sphere
Figure 3: Visualization of tangent bundle of a 2-sphere.

In addition to being a manifold itself, the tangent bundle has one more nice property. Before explaining it, let us first define what a fiber is.

Definition (fiber) For two manifolds $M$, $N$, consider a continuous function $f: M \to N$. For a point $p$ in $N$, we call $f^{-1}(p)$ the fiber over $p$.

Now let us examine one more nice property of the tangent (resp. cotangent) bundle.

Property Let $M$ be an $n$-dimensional manifold, and let $TM$ ($T^*M$) be the tangent bundle (cotangent bundle, respectively). Then the projection $$TM\ (T^*M) \to M, \quad (p, v) \mapsto p$$ has an $n$-dimensional vector space structure on each fiber.

We call a manifold $E$ with a map $E \to M$ such that each fiber has a vector space structure a vector bundle over $M$. From this perspective, $TM$ and $T^*M$ are vector bundles over $M$. The following are simple properties of vector bundles.

Proposition Let $E$, $F$ be vector bundles over $M$. Then:

  1. $E$ is a smooth manifold.
  2. We can define a dual bundle $E^{\vee}$ of $E$.
  3. We can define a tensor product bundle $E \otimes F$ of $E$ and $F$.

For those who are not familiar with dual spaces and tensor products, in the next section we will briefly review these concepts.

Dual space, Tensor product

For a vector space $V$, consider the collection $V^*$ of all linear functions from $V$ to the real numbers $\mathbb{R}$. If we fix $f, g \in V^*$, then for a real number $c$, $f + cg$ is also a linear function from $V$ to the real numbers, and therefore $V^*$ has a natural vector space structure. We call this space $V^*$ the dual space of $V$.

Example 6 Consider the vector space $V = \mathbb{R}^n$. What does the dual space of this space look like? First, the easiest basis to think of is $e_i = (0, \ldots, 1_{(i\text{th})}, \ldots, 0)$. Now let us define $f^i: V \to \mathbb{R}$ as follows:

$$f^i\Big(\sum_j a_j e_j\Big) = a_i.$$

That is, $f^i$ is the projection onto the $i$-th coordinate. We can easily see that linear combinations of the $f^i$’s constitute $V^*$, and therefore

$$V^* = \operatorname{span} \{f^i : 1 \le i \le n\}.$$

When there are two vector spaces $V$ and $W$, let the basis of $V$ be $v_1, \ldots, v_n$, and the basis of $W$ be $w_1, \ldots, w_m$. We want to define a vector space that combines $V$ and $W$ and has the $mn$ pairs $(v_i, w_j)$ as a basis. In mathematical notation, we write the basis elements as $v_i \otimes w_j$, and write the space spanned by these basis elements as $V \otimes W$.

Using dual spaces and tensor products, we can understand matrices mathematically. If $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$, then we understand linear functions from $V$ to $W$ as $m \times n$ matrices. Then what does the element in the $i$-th row and $j$-th column of this matrix mean? Consider the matrix $A$ that has a 1 in the $i$-th row, $j$-th column position and 0 everywhere else. If we examine how this matrix acts on an element $v$ of $V$, it first projects onto the $j$-th coordinate of $v$, and then outputs that value to the $i$-th coordinate of $W$. From the tensor product perspective, we can therefore think of $A = e_i \otimes f^j$. Since these matrices form a basis for the space of all linear functions from $V$ to $W$, we have $$\{\text{linear maps } V \to W\} = W \otimes V^*.$$

Figure 4: Visualization of a canonical basis of matrix space. This visually shows the equation $A = e_i \otimes f^j$.
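The identity $A = e_i \otimes f^j$ can be checked concretely (the dimensions and indices below are our own illustrative choices): the outer product of the $i$-th standard basis vector of $W = \mathbb{R}^m$ with the $j$-th dual basis vector of $V = \mathbb{R}^n$ is exactly the matrix with a single 1 in row $i$, column $j$.

```python
import numpy as np

m, n, i, j = 3, 4, 1, 2
e_i = np.eye(m)[i]      # basis vector e_i of W = R^3
f_j = np.eye(n)[j]      # dual basis vector f^j of V = R^4 (as coefficients)
A = np.outer(e_i, f_j)  # e_i ⊗ f^j, an m x n matrix with a single 1 at (i, j)

v = np.array([5.0, 6.0, 7.0, 8.0])
# A first projects v onto its j-th coordinate (f^j(v)), then writes that
# value into the i-th coordinate of W (scaling e_i):
assert np.allclose(A @ v, (f_j @ v) * e_i)
```

Any matrix is a linear combination of such $e_i \otimes f^j$, which is the content of $\{\text{linear maps } V \to W\} = W \otimes V^*$.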

Return to Bundle

Returning to the discussion, we can now understand duals of vector bundles and tensor products between vector bundles. We can perform these operations on the vector spaces corresponding to the fibers at each point of the manifold, then collect and glue together all the resulting vector spaces to create new vector bundles. What we obtain in this way are the $E^{\vee}$ and $E \otimes F$ defined above.

Recall that $TM$ and $T^*M$ are objects that are defined as soon as we define $M$. From the perspective of the above proposition, $$\underbrace{TM \otimes \cdots \otimes TM}_{k} \otimes \underbrace{T^*M \otimes \cdots \otimes T^*M}_{l} \to M$$ is therefore an object that is immediately defined over $M$. We call such a canonical vector bundle a tensor bundle.

Now, we can understand tangent fields and cotangent fields as objects defined from sections of bundles.

Definition (section of a vector bundle) Let $E \to M$ be a vector bundle. A section of $E$ is a map $s: M \to E$ such that $s(p) \in E_p$ for all $p \in M$.

In other words, a section of a vector bundle is a map that assigns to each point of the manifold a vector in the fiber over that point. Therefore, tangent and cotangent fields can be redefined as follows.

Definition (tangent field, cotangent field) Let $M$ be a manifold. A tangent field on $M$ is a section of $TM$, and a cotangent field on $M$ is a section of $T^*M$.

Canonical operations in differential geometry

Given a map $f: M \to N$, how do objects like tangent vectors and cotangent vectors move canonically under $f$?

Philosophy A map between two manifolds “pushes” tangent vectors and “pulls” cotangent vectors.

To explain, we check explicitly how $f$ pushes and pulls such vectors. We start with the pushforward of a tangent vector.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the pushforward of a tangent vector $X_p \in T_pM$ is the tangent vector $f_*X_p \in T_{f(p)}N$, which is the differential operator acting on a smooth function $g: N \to \mathbb{R}$ by $$f_*X_p(g) = X_p(g \circ f).$$

Example 7 Suppose we have a smooth map $f: \mathbb{R}^m \to \mathbb{R}^n$. For a tangent vector $X_p = a^i \partial_i \in T_p\mathbb{R}^m$, we calculate the pushforward $f_*X_p$. Take a function $g: \mathbb{R}^n \to \mathbb{R}$; we have $$f_*X_p(g) = X_p(g \circ f) = a^i \partial_i (g \circ f) = a^i \left( \frac{\partial g}{\partial x^j} \circ f \right) \frac{\partial f^j}{\partial x^i}.$$ Therefore, $f_*X_p = a^i \frac{\partial f^j}{\partial x^i} \partial_j$.

Example 8 When we have a manifold $M$ and a point $p$ on it, intuitively, a differential operator at $p$ measures the rate of change of a function along a local path passing through $p$. Let’s write this in a mathematically rigorous way. A path through $p$ is expressed as a parametrized path $\gamma: (-\epsilon, \epsilon) \to M$ with $\gamma(0) = p$. The differential operator we intuitively described above is actually $\gamma_*\left(\frac{d}{dt}\right)$. Verifying this mathematically, for a smooth function $f: M \to \mathbb{R}$ we have

$$\gamma_*\left(\frac{d}{dt}\right) f = \frac{d}{dt}(f \circ \gamma).$$

That is, $\gamma_*\left(\frac{d}{dt}\right)$ is an operator that gives the derivative of $f$ along the path $\gamma$.

A line segment embedded in a manifold; a tangent vector of the segment is pushed forward to a tangent vector of the manifold.
Figure 5: A line segment is embedded in a manifold. A tangent vector of the segment is pushed forward to a tangent vector of the manifold.

Since a smooth map pushes vectors, from the perspective of tangent spaces, we can see that it defines a linear map that pushes between tangent spaces.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the differential of $f$ at $p$ is the linear map $df_p: T_pM \to T_{f(p)}N$ which maps a tangent vector $X_p \in T_pM$ to the tangent vector $df_p(X_p) \in T_{f(p)}N$ defined by $$df_p(X_p) = f_*(X_p).$$

Example 9 Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be a smooth map. From the previous example, we have $df_p(a^i \partial_i) = a^i \frac{\partial f^j}{\partial x^i} \partial_j$. Therefore, $df_p$ is the linear map $T_p\mathbb{R}^m \to T_{f(p)}\mathbb{R}^n$ which maps $$a^i \partial_i \mapsto a^i \frac{\partial f^j}{\partial x^i} \partial_j.$$ Here, we see that $\frac{\partial f^j}{\partial x^i}$ is the matrix that represents the map $df_p$. This matrix is called the Jacobian matrix of $f$ at $p$.
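The formula $f_*X_p = a^i \frac{\partial f^j}{\partial x^i} \partial_j$ says the pushforward is just the Jacobian matrix acting on the coefficient vector $a$. A sketch (the helper and the test map are our own illustrative choices), using a numerical Jacobian:

```python
import numpy as np

def jacobian(f, p, h=1e-6):
    """Numerical Jacobian of f: R^m -> R^n at p, shape (n, m)."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        cols.append((f(p + e) - f(p - e)) / (2 * h))  # column ∂f/∂x^i
    return np.stack(cols, axis=1)

# f: R^2 -> R^2, f(x, y) = (x*y, x + y)
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
p = np.array([2.0, 3.0])
a = np.array([1.0, -1.0])          # coefficients of the tangent vector a^i ∂_i
pushforward = jacobian(f, p) @ a   # coefficients of f_* X_p in the basis ∂_j
# analytically: J = [[y, x], [1, 1]] = [[3, 2], [1, 1]], so J a = (1, 0)
```

The result matches the hand computation $J a = (1, 0)$ at $p = (2, 3)$.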

Now we see how a smooth map acts on cotangent vectors. Intuitively, a cotangent space is the dual space of a tangent space, so the natural direction of the action should be reversed. Therefore, we may infer that a smooth map will pull cotangent vectors.

Definition Let $f: M \to N$ be a smooth map. For each $p \in M$, the pullback of a cotangent vector $\alpha_{f(p)} \in T^*_{f(p)}N$ is the cotangent vector $f^*\alpha_{f(p)} \in T^*_pM$, which is the linear map $T_pM \to \mathbb{R}$ defined by $$f^*(\alpha_{f(p)}) = \alpha_{f(p)} \circ df_p.$$

Example 10 In the previous example, we defined the differential of a smooth real-valued function on a manifold, and we saw that this object is a cotangent vector field. In fact, it can be understood as a pullback. First, define the canonical cotangent vector field $dt$ on the real line satisfying $$dt\left(\frac{d}{dt}\right) = 1.$$ Now let’s prove that the pullback of $dt$ by $f$ is $df$. Fix $p \in M$ and take a tangent vector $X_p$ at $p$. Then $$f^*(dt)(X_p) = dt(df_p(X_p)) = dt(f_* X_p).$$ We claim that $f_* X_p = X_p(f)\, \frac{d}{dt}\big|_{f(p)}$. To see this, take any smooth function $g: \mathbb{R} \to \mathbb{R}$; by the chain rule, $$f_* X_p(g) = X_p(g \circ f) = g'(f(p))\, X_p f = \left(X_p f\, \frac{d}{dt}\Big|_{f(p)}\right) g.$$ Therefore, going back to the above equation, $$f^*(dt)(X_p) = dt(f_* X_p) = dt\left(X_p f\, \frac{d}{dt}\Big|_{f(p)}\right) = X_p f = df_p(X_p).$$

So far, we have examined how $f: M \to N$ acts on tangent vectors and cotangent vectors. Now let’s describe this at the level of tangent bundles and cotangent bundles. Before that, let’s first define a morphism between vector bundles.

Definition (morphism of vector bundles) Let $E \to M$ and $F \to N$ be vector bundles. A smooth map $\tilde{f}: E \to F$ together with a map $f: M \to N$ is called a morphism of vector bundles if

  • $\tilde{f}$ maps each fiber of $E$ to a fiber of $F$. Precisely, if $p \in M$ is a point and $E_p$ is the fiber of $E$ at $p$, then $\tilde{f}(E_p) \subset F_{f(p)}$.
  • $\tilde{f}: E_p \to F_{f(p)}$ is a linear map.

Example 11 Let $f: M \to N$ be a smooth map. Then the pushforward $f_*$ induces a morphism of vector bundles, defined as follows: $$df: TM \to TN, \quad df((p, X_p)) = (f(p), f_*(X_p)).$$

Unfortunately, the pullback $f^*$ does not induce a morphism of vector bundles, since there is no map from $N$ to $M$. However, we can still construct a canonical object.

Definition (Global section of a vector bundle) Let $E \to M$ be a vector bundle over $M$. A global section of $E$ is a map $s: M \to E$ such that $s(p) \in E_p$ for all $p \in M$. The collection of all smooth global sections of $E$ is denoted by $\Gamma(E)$.

Example 12 Let $M$ be a smooth manifold. The collection of all smooth vector fields on $M$ is $\Gamma(TM)$.

Example 13 Let $M$ be a smooth manifold and let $f: M \to \mathbb{R}$ be a smooth function. Then the differential $df$ is an element of $\Gamma(T^*M)$.

Definition (Pullback of a global section) Let $f: M \to N$ be a smooth map. Take $\alpha \in \Gamma(T^*N)$. The pullback of $\alpha$ is the global section of $T^*M$ defined by $$f^*\alpha: M \to T^*M, \quad (f^*\alpha)|_p = f^*(\alpha_{f(p)}).$$ Therefore, $f^*$ induces a map $f^*: \Gamma(T^*N) \to \Gamma(T^*M)$.

Therefore, given a smooth map $f: M \to N$, we have canonical maps

  • $df: TM \to TN$,
  • $f^*: \Gamma(T^*N) \to \Gamma(T^*M)$.

It can be shown that these extend to tensor bundles: for a positive integer $k$, we have canonical maps

  • $df: T^k M \to T^k N$,
  • $f^*: \Gamma(T^{*k}N) \to \Gamma(T^{*k}M)$.

What is a metric?

In the previous section, we studied various geometric objects on a smooth manifold, including bundles and global sections. In this section, we study a metric on a smooth manifold and understand it as a global section of a tensor bundle. To do so, we first look at the easiest example of a metric: the Euclidean distance arising from an inner product on a vector space.

Definition (Inner product) Let $V$ be a vector space. An inner product on $V$ is a map $g: V \times V \to \mathbb{R}$ that satisfies the following properties:

  • Bilinearity: $g(av + bw, z) = a\, g(v,z) + b\, g(w,z)$ for all $a, b \in \mathbb{R}$ and $v, w, z \in V$.
  • Symmetry: $g(v,w) = g(w,v)$ for all $v, w \in V$.
  • Positive-definiteness: $g(v,v) \geq 0$ for all $v \in V$, and $g(v,v) = 0$ if and only if $v = 0$.

Then we can equip the vector space with a norm $|v| = \sqrt{g(v,v)}$, and this norm induces a metric on $V$.

What is important here are two things:

  • An inner product is symmetric, bilinear, and positive-definite.
  • An inner product takes two vectors and returns a real number.

When a vector space $V$ is given, since an inner product is an operator that takes two vectors and outputs a single real value, it can be understood as an element of $V^* \otimes V^*$. Thinking from the perspective of manifolds and tangent bundles, we can then intuitively understand a metric on a manifold as a 2-tensor field: an object that, at each point, takes two tangent vectors and returns a real number, i.e. a pair of cotangent directions tensored together at each point of the manifold. More rigorously, this can be written as follows.

Definition (Metric) [7] Let $M$ be a smooth manifold. A metric $g$ on $M$ is a global section of $T^*M \otimes T^*M$ which is bilinear, symmetric, and positive-definite.

A metric g on a sphere, taking two tangent vectors at a point p and returning a real value

Figure 6: A metric $g$ on a sphere. At a point $p$, $g$ takes two tangent vectors and gives one real value.

Example 14 The standard metric on $\mathbb{R}^n$ is given by $$g = \sum_{i=1}^n dx^i \otimes dx^i,$$ where $dx^i$ is the differential of the $i$-th coordinate function $x^i: \mathbb{R}^n \to \mathbb{R}$. Fix $p \in \mathbb{R}^n$ and suppose we have a tangent vector $X_p = a^i \partial_i \in T_p\mathbb{R}^n$. Then we have $$g_p(X_p, X_p) = g_p(a^i \partial_i, a^j \partial_j) = a^i a^j g_p(\partial_i, \partial_j) = \sum_{i=1}^n (a^i)^2.$$ Since a tangent vector represents an infinitesimal direction of change, this value can be understood as the square of the infinitesimal change in the tangent direction.
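As a quick numerical illustration of Example 14 (the dimension and coefficients are our own choices): in the basis $\{\partial_i\}$, the standard metric at any point is the identity matrix, so $g_p(X_p, X_p) = \sum_i (a^i)^2$.

```python
import numpy as np

g_p = np.eye(3)                  # matrix of g_p in the basis {∂_1, ∂_2, ∂_3}
a = np.array([1.0, 2.0, 2.0])    # coefficients of X_p = a^i ∂_i

# g_p(X_p, X_p) as a quadratic form equals the sum of squared coefficients
assert np.isclose(a @ g_p @ a, np.sum(a**2))
```

With a non-identity (but still symmetric positive-definite) matrix in place of `np.eye(3)`, the same quadratic form computes $g_p(X_p, X_p)$ for a general metric.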

From this definition, we can calculate the length of a curve explicitly.

Example 15 (Length of a curve) Let $M$ be a smooth manifold with a metric $g$, and let $\gamma: [0,1] \to M$ be a curve. The length of $\gamma$ is given by $$L(\gamma) = \int_0^1 \sqrt{g(\gamma'(t), \gamma'(t))}\, dt.$$ Here, $\gamma'(t)$ is the tangent vector at the point $\gamma(t)$, tangent to the curve $\gamma$, defined by $$\gamma'(t) = \gamma_*\left(\frac{d}{dt}\Big|_t\right).$$
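A minimal numerical check of the length formula (the curve and discretization are our own illustrative setup): the unit circle $\gamma(t) = (\cos 2\pi t, \sin 2\pi t)$, $t \in [0,1]$, should have length $2\pi$ in the standard metric on $\mathbb{R}^2$.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
# γ'(t) for γ(t) = (cos 2πt, sin 2πt)
gamma_dot = np.stack([-2 * np.pi * np.sin(2 * np.pi * t),
                      2 * np.pi * np.cos(2 * np.pi * t)], axis=1)
# sqrt(g(γ'(t), γ'(t))) with the standard metric is the Euclidean speed
speed = np.sqrt(np.sum(gamma_dot**2, axis=1))
# trapezoid-rule approximation of L(γ) = ∫ sqrt(g(γ', γ')) dt
length = np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(t))
```

The computed `length` agrees with $2\pi$ to numerical precision, since the speed is constant here.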

One common misconception is thinking that when there is a manifold and a function $f$ on it, there naturally exists an object called the gradient. This is incorrect; a gradient only exists when there is a metric on the manifold. In fact, as we saw in the previous section, when there is a function on a manifold, the naturally existing object is the differential of the function, which belongs to $\Gamma(T^*M)$. Since a gradient is a vector field, we can only consider a gradient when there exists a means of transferring objects in $\Gamma(T^*M)$ to $\Gamma(TM)$.

When there is a vector space $V$ with an inner product $\langle \cdot, \cdot \rangle$, we can define an isomorphism between $V$ and its dual space $V^*$ as follows: $$V \to V^*, \quad v \mapsto \langle v, \cdot \rangle.$$ Through this, we can view elements of the dual space $V^*$ as elements of $V$.

Using this, when there is a metric $g$, we can define a bundle isomorphism $\sharp: T^*M \to TM$ as follows: $$(p, \alpha|_p) \mapsto (p, \alpha|_p^\sharp), \quad \text{where} \quad g_p(\alpha|_p^\sharp, \cdot) = \alpha|_p.$$

Definition (Gradient) Let $M$ be a smooth manifold and let $f: M \to \mathbb{R}$ be a smooth function. The gradient of $f$ is the vector field $\nabla f$ on $M$ defined by $$\nabla f = \sharp\, df.$$

Example 16 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$, and a smooth function $f: \mathbb{R}^n \to \mathbb{R}$. The differential of $f$ is given by $$df = \sum_{i=1}^n \frac{\partial f}{\partial x^i} dx^i.$$ Therefore, the gradient of $f$ is given by $$\nabla f = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \partial_i.$$
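In coordinates, the musical isomorphism $\sharp$ is easy to compute: if the metric at a point is represented by a positive-definite matrix $G$, then $g(\nabla f, \cdot) = df$ reads $G\,(\text{grad coefficients}) = (\text{coefficients of } df)$, so $\nabla f = G^{-1}\, df$. A sketch with an illustrative non-Euclidean metric of our own choosing:

```python
import numpy as np

G = np.diag([1.0, 4.0])        # a non-Euclidean diagonal metric on R^2
df = np.array([2.0, 8.0])      # coefficients of df in the basis dx^i at a point

# ∇f = ♯ df: solve G @ grad = df, i.e. grad = G^{-1} df
grad = np.linalg.solve(G, df)
# with the Euclidean metric G = I the gradient would equal df;
# here the second component is rescaled by 1/4
```

This is the coordinate form of the statement that a gradient exists only once a metric is chosen: the same $df$ yields different vector fields $\nabla f$ under different $G$.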

We end this section by understanding the gradient from another perspective.

Definition Let $M$ be a smooth manifold and let $g$ be a metric on $M$. The metric $g$ induces a tensor $g^\sharp \in \Gamma(TM \otimes TM)$ by $$g^\sharp(\alpha, \beta) = g(\alpha^\sharp, \beta^\sharp).$$

Example 17 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$. Then the metric $g$ induces the tensor $g^\sharp \in \Gamma(T\mathbb{R}^n \otimes T\mathbb{R}^n)$ given by $$g^\sharp = \sum_{i=1}^n \partial_i \otimes \partial_i.$$

Definition (gradient using $g^\sharp$) Let $M$ be a smooth manifold with a metric $g$. The gradient of a smooth function $f: M \to \mathbb{R}$ using $g^\sharp$ is given by $$\nabla_g f = g^\sharp(df).$$ Here, $g^\sharp$ is a 2-tensor, so applying it to a 1-covector gives a 1-tensor, which is a tangent vector.

Example 18 Consider the Euclidean space $\mathbb{R}^n$, equipped with the standard metric $g = \sum_{i=1}^n dx^i \otimes dx^i$. Then the gradient of a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ using $g^\sharp$ is given by $$g^\sharp(df) = \left(\sum_{i=1}^n \partial_i \otimes \partial_i\right)(df) = \sum_{i=1}^n \partial_i\, \langle \partial_i, df \rangle = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \partial_i.$$ This is identical to the gradient of $f$ in the previous example.

Differential geometric setting for DNN

In this section we define the mathematical objects that appear in deep learning and understand them geometrically.

  • Let $X_0$ be an input space.
  • Let $X_1$ be an output space.
  • Let $\mathcal{F}$ be a space of functions from $X_0$ to $X_1$.
  • Let $\mathcal{P}$ be a manifold of parameters. We use $\theta$ to denote a point in $\mathcal{P}$.
  • Let $F$ be a model, which is a map $F: \mathcal{P} \to \mathcal{F}$.
  • Let $L: \mathcal{F} \to \mathbb{R}$ be a loss function.

Since DNNs use gradient descent to optimize the parameters, we understand that the parameter space $\mathcal{P}$ must be equipped with a metric.

Remark The spaces $X_0$, $X_1$, and $\mathcal{P}$ are usually chosen to be Euclidean spaces. Therefore, $\mathcal{F}$ is also a vector space (a vector space of functions). Since the tangent space of a vector space can be identified with the space itself, all of their tangent spaces are canonically identified with the spaces themselves.

Remark Different metrics assigned to $\mathcal{P}$ induce different optimization algorithms. For example, when $\mathcal{P}$ is equipped with the Euclidean metric, the resulting algorithm is standard gradient descent. However, by assigning a spectral norm (on matrix space), we obtain different optimization algorithms such as Muon or Shampoo [3], [4], [5].

Diagram showing a model F pushing the 2-tensor g^sharp to the NTK over function space F

Figure 7: A model $F$ pushes the 2-tensor $g^\sharp$ to the NTK $F_*(g^\sharp)$ living over the function space $\mathcal{F}$.

Neural Tangent Kernel

The strength of this framework is that we can understand the NTK directly. The NTK is fundamentally an approach that understands parameter changes as function changes [1], [2].

Definition (Neural Tangent Kernel) Let $F: \mathcal{P} \to \mathcal{F}$ be a model and let $g$ be a metric on $\mathcal{P}$. The NTK $\Theta$ at $\theta \in \mathcal{P}$ is $$\Theta(\theta) = F_*(g^\sharp|_{\theta}).$$

Recall the previous definition: we can compute the gradient using the 2-tensor $g^\sharp$. In the DNN setting, there is a gradient flow on $\mathcal{P}$, and this gradient is that of the composition $L \circ F$. That is, we can understand this gradient flow by taking the differential of $L \circ F$ and contracting it with $g^\sharp$. If we want to view this flow not on $\mathcal{P}$ but on $\mathcal{F}$, we push the whole situation forward to $\mathcal{F}$, obtaining a flow on function space by contracting the differential of $L$ with the pushforward of $g^\sharp$. In other words, the NTK is what allows us to view the gradient flow in the sense of function space.

We also verify that this definition is equivalent to the one in the literature by an explicit calculation.

Example 19 Equipping $\mathcal{P}$ with the Euclidean metric, we have $$F_*(g^\sharp) = F_*\left(\sum_{p=1}^{P} \partial_{\theta_p} \otimes \partial_{\theta_p}\right) = \sum_{p=1}^{P} \partial_{\theta_p}F \otimes \partial_{\theta_p}F.$$ From [2], we have $$\Theta(\theta) = \sum_{p=1}^{P} \partial_{\theta_p}F \otimes \partial_{\theta_p}F.$$ Therefore, they are identical.
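The formula $\Theta = \sum_p \partial_{\theta_p}F \otimes \partial_{\theta_p}F$ can be computed concretely for a toy model (everything below — the linear model, the features, and the sample points — is our own illustrative setup, not from the references). For a linear model $F(\theta)(x) = \sum_p \theta_p\, \phi_p(x)$ evaluated on finitely many inputs, $\partial_{\theta_p}F(x) = \phi_p(x)$, so the empirical NTK with the Euclidean metric on parameters is a Gram matrix of the feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 sample inputs in R^3

# features φ_p(x) of a hypothetical linear model F(θ)(x) = Σ_p θ_p φ_p(x)
phi = np.stack([X[:, 0], X[:, 1] ** 2, np.sin(X[:, 2])], axis=1)

# ∂_{θ_p} F(x) = φ_p(x), independent of θ for a linear model
jac = phi                    # shape (num_inputs, num_params)
ntk = jac @ jac.T            # Θ(x, x') = Σ_p ∂F(x) ∂F(x'), a 5 x 5 kernel

# the NTK is symmetric positive semi-definite, as a pushed-forward g^♯ should be
assert np.allclose(ntk, ntk.T)
assert np.linalg.eigvalsh(ntk).min() > -1e-10
```

For a nonlinear network the Jacobian `jac` would depend on $\theta$ (e.g. via autodiff), but the contraction $\Theta = J J^\top$ is the same.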

References

[1] Arora, Sanjeev et al. On Exact Computation with an Infinitely Wide Neural Net. NeurIPS 2019.

[2] Arthur Jacot, Franck Gabriel, Clément Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, arXiv:1806.07572

[3] Gupta, Vineet, et al. “Shampoo: Preconditioned stochastic tensor optimization” (2018)

[4] Jeremy Bernstein and Laker Newhouse. “Old optimizer, new norm: An anthology.” arXiv preprint arXiv:2409.20325 (2024).

[5] Keller Jordan, Muon: A Matrix Norm Optimizer for Deep Learning, https://kellerjordan.github.io/posts/muon/

[6] Lee, John M. Introduction to Smooth Manifolds, (2002)

[7] Lee, John M. Introduction to Riemannian Manifolds, (2018)