Terra Incognita

TorchStation Prototype V1 – GPUs panel

Christian S. Perone — Wed, 28 May 2025 15:04:15 +0000

I finally had some time over the holidays to complete the first panel of the TorchStation. The core idea is to have a monitor box that sits on your desk and tracks distributed model training. The panel shown below is a prototype for displaying GPU usage and memory. I’ll continue to post updates as I add more components. The main challenge with this board was power: the LED bars alone drew around 1.2A (at full brightness with every segment lit), so I had to use an external power supply and share a common ground with the MCU. For the panel I used matte PLA at 3mm. Wiring was the worst part: this panel alone required at least 32 wires, but the panel hides them quite well. I’m planning to support up to 8 GPUs per node, which aligns with the typical maximum found in many cloud instances. Here is the video, which was quite tricky to capture because the camera’s exposure metering kept changing due to the LEDs (the video doesn’t do justice to how cool these LEDs are; they are very bright and clear even in daylight):

For the interface I’m using an Arduino Mega (which uses the ATmega2560) with Grove ports to make it easier to connect everything, but I had to remove the VCC line from all the ports so that they could be supplied from the external power supply. In the end the wiring looks like this:

                 ┌────────── PC USB (5V, ≤500 mA)
                 │
            +5 V │
                 ▼
     ┌─────────────────┐
     │  Arduino Mega   │  Data pins D2…D13, D66…D71 → LED bars
     └─────────────────┘
                ▲  GND (common reference)
                │
   ┌────────────┴──────────────┐
   │ 5V, ≥3 A switching PSU    │  ← external PSU
   └───────┬───────────┬───────┘
           │           │
           │ +5V       │ GND
           ▼           ▼
┌─────────────────────────────────┐
│ Grove Ports (VCC rail)          │ <– external 5V injected here
│ 8 × LED Bar cables              │
└─────────────────────────────────┘

The post TorchStation Prototype V1 – GPUs panel first appeared on Terra Incognita.

VectorVFS: your filesystem as a vector database

Christian S. Perone — Tue, 29 Apr 2025 12:44:57 +0000

PS: thanks for all the interest, here are some discussions about VectorVFS as well:
Hacker News: discussion thread
Reddit: discussion thread

When I released EuclidesDB in 2018, which was the first modern vector database before Milvus, Pinecone, etc., I was still missing a simple piece of software for retrieval that is local, easy to use, and requires no daemon, server or external index. After quite some time trying to figure out the best way to store embeddings in the filesystem, I ended up with the design of VectorVFS.

The main goal of VectorVFS is to store the data embeddings (vectors) in the filesystem itself, without requiring an external database. We don’t want to change the file contents, and we also don’t want to create extra loose files in the filesystem. What we want is to store the embeddings on the files themselves without changing their data. How can we accomplish that?

It turns out that all major Linux file systems (e.g. Ext4, Btrfs, ZFS, and XFS) support a feature called extended attributes (also known as xattr). This metadata is stored in the inode itself (in a reserved space at the end of each inode) and not in the data blocks (depending on the size). Some file systems impose limits on the attributes; Ext4, for example, requires them to fit within a file system block (e.g. 4 KB).

That is exactly what VectorVFS does: it embeds files (right now only images) using an encoder (Perception Encoder for the moment) and then stores the embedding in the extended attributes of the file, so we don’t need to create any other file for the metadata. The embedding is also automatically linked to the file that was embedded, and there is no risk of it being copied by mistake. It seems almost like a perfect solution for storing embeddings for retrieval: the mechanism has been there in many filesystems, but it was never explored for that purpose before.
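
To make the mechanism concrete, here is a minimal sketch of storing a vector in an extended attribute using only Python’s standard library on Linux. This is not VectorVFS’s actual implementation: the attribute name user.example.embedding, the float32 packing and all helper names are assumptions made for illustration.

```python
import os
import struct

# Hypothetical attribute name, chosen for this example only.
XATTR_NAME = "user.example.embedding"


def pack_embedding(vector):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_embedding(raw):
    """Inverse of pack_embedding: bytes back to a list of floats."""
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


def store_embedding(path, vector):
    """Attach the embedding to the file without touching its contents (Linux only)."""
    os.setxattr(path, XATTR_NAME, pack_embedding(vector))


def load_embedding(path):
    """Read the embedding back from the file's extended attributes (Linux only)."""
    return unpack_embedding(os.getxattr(path, XATTR_NAME))
```

With this packing, a 512-dimensional float32 embedding takes 2048 bytes, which stays under a 4 KB block limit; larger embeddings may need lower precision or a filesystem with a higher limit.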

If you are interested, here is the documentation on how to install and use it. Contributions are welcome!

The post VectorVFS: your filesystem as a vector database first appeared on Terra Incognita.

PyTorch 2 Internals – Talk

Christian S. Perone — Mon, 11 Dec 2023 12:52:32 +0000

Just sharing ~100 slides about PyTorch 2 internals focusing on recent innovations (Dynamo, Inductor, and ExecuTorch). I had a lot of fun preparing this and hope you’ll enjoy it. I’m planning to record it soon.

The download link for the slides is here.
Slides in SlideShare are here.

The post PyTorch 2 Internals – Talk first appeared on Terra Incognita.

Torch Titan distributed training code analysis

Christian S. Perone — Wed, 21 Aug 2024 10:07:05 +0000

I really like to peek into different ML codebases for distributed training and this is a very short post on some things I found interesting in Torch Titan:

Disabling and controlling Python’s garbage collector (GC): the titan codebase disables the Python GC and then manually forces a collection at the beginning of every training step. This makes sense, but I’m not sure what the gains are. I think collecting at every step can be too much, and I’m not sure that taking manual control of the GC is worth it, especially depending on the complexity of the other dependencies you use, as it could cause unintended behavior that would be difficult to trace back to the GC;

Custom GPU memory monitoring: titan has a custom class to monitor GPU memory that is quite nice. It resets the peak stats and empties the CUDA caching allocator upon initialization. Then, at every step, they collect the peak stats for both small and large pools by capturing the stats for active and reserved memory, as well as failed retries and the number of OOMs. It is very common for people to just monitor max GPU usage externally from NVML; however, this ignores the fact that PyTorch uses a caching allocator and that you need to look at the internal memory management mechanism inside PyTorch. If you don’t do that, you will certainly be misled by what you are getting from NVML;

Custom profiling context manager: they wrote a context manager for profiling, where they measure the time it takes to dump the profiling data per rank. It is interesting that there is a barrier at the end, which makes sense, but this is often the pain point of distributed training with PyTorch + NCCL;

Measuring data loading: this is of minor interest, but I liked the idea of not iterating over the data loader in the loop statement itself and instead manually calling next() to get the batches. That makes it easier to measure the data loading time, which they average at the end of each epoch;

Logging MFU (model FLOPS utilization): they also compute and log MFU, which is quite helpful;

Delete predictions before backward: titan also deletes the model predictions before the backward() call to avoid memory peaks. This can be quite effective since you really don’t need this tensor anymore and you can delete it immediately before the backward pass;

Reduction of NCCL timeout: after the first training step, they reduce the NCCL timeout from the default 10 min to 100 sec. This is nice if your replicas are well behaved and you don’t need to do anything more complex, but 100 sec is a very short timeout that I would be careful with. It might be a good fit for your workload, but if your replicas drift a bit more, you will need to keep adding barriers to avoid timeouts, which can be incredibly difficult to debug and cause a lot of headaches;

Distributed checkpointing with mid-epoch checkpoint support: this is a very cool implementation, it uses distributed checkpointing from PyTorch. They create some wrappers (e.g. for optimizer) where they implement the Stateful protocol to support checkpointing. They also use the StatefulDataLoader from torchdata to do checkpointing of mid-epoch data loader state;

Misc: there are of course other interesting things, but it is cool to mention that they also implemented a no-frills LLaMA model without relying on thousands of different libs (it seems it has become fashionable nowadays to keep adding dependencies), so kudos to them for keeping it simple.
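
Two of the ideas above, manual GC control and timing data loading by calling next() explicitly, can be sketched with the standard library alone. This is not Torch Titan’s code: the function name, the gc_every parameter and the placeholder train_step callable are assumptions for illustration.

```python
import gc
import time


def train_one_epoch(data_loader, train_step, gc_every=1):
    """Run one epoch with manual GC and explicit data-loading timing.

    data_loader: any iterable of batches.
    train_step: hypothetical callable that consumes one batch.
    gc_every: collect garbage every N steps (1 means every step).
    Returns the average time spent waiting for a batch, in seconds.
    """
    gc_was_enabled = gc.isenabled()
    gc.disable()  # no automatic collections while the loop runs
    load_times = []
    batches = iter(data_loader)
    step = 0
    try:
        while True:
            if step % gc_every == 0:
                gc.collect()  # collection happens at a point we control
            start = time.perf_counter()
            try:
                batch = next(batches)  # timed separately from the compute
            except StopIteration:
                break
            load_times.append(time.perf_counter() - start)
            train_step(batch)
            step += 1
    finally:
        if gc_was_enabled:
            gc.enable()  # restore the previous GC state
    return sum(load_times) / len(load_times) if load_times else 0.0
```

Raising gc_every is one way to address the concern about collecting at every step, at the cost of letting garbage accumulate for longer between collections.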

The post Torch Titan distributed training code analysis first appeared on Terra Incognita.

Memory-mapped CPU tensor between Torch, Numpy, Jax and TensorFlow

Christian S. Perone — Tue, 13 Aug 2024 09:51:36 +0000

This is just a fun experiment to answer the question: how can I share a memory-mapped tensor from PyTorch with Numpy, Jax and TensorFlow on the CPU, without copying, while making sure that changes made in memory by torch are reflected in all of these shared tensors?

One approach is shown below:

import torch
import tensorflow as tf
import numpy as np
import jax.numpy as jnp
import jax.dlpack

# Create the tensor and persist
t = torch.randn(10, dtype=torch.float32)
t.numpy().tofile("tensor.pt")

# Memory-map the file with PyTorch
t_mapped = torch.from_file("tensor.pt", shared=True, size=10, dtype=torch.float32)

# Memory-map it with numpy, the same tensor
n_mapped = np.memmap("tensor.pt", dtype='float32', mode='r+', shape=(10,))

# Convert it to Jax, will reuse the same buffer
j_mapped = jnp.asarray(n_mapped)

# Convert it to dlpack capsule and load in TensorFlow
dlcapsule = jax.dlpack.to_dlpack(j_mapped)
tf_mapped = tf.experimental.dlpack.from_dlpack(dlcapsule)

Now the fun part begins: I will change the tensor in PyTorch and we will check what happens in the Numpy, Jax and TensorFlow tensors:

>>> t_mapped.fill_(42.0) # Changing only PyTorch tensor
tensor([42., 42., 42., 42., 42., 42., 42., 42., 42., 42.])

>>> n_mapped # Numpy Array
memmap([42., 42., 42., 42., 42., 42., 42., 42., 42., 42.], dtype=float32)

>>> j_mapped # Jax Array
Array([42., 42., 42., 42., 42., 42., 42., 42., 42., 42.], dtype=float32)

>>> tf_mapped # TensorFlow Tensor

As you can see above, the changes made to the torch tensor are reflected in the Numpy, Jax and TensorFlow tensors; that’s the magic of memmap().

The post Memory-mapped CPU tensor between Torch, Numpy, Jax and TensorFlow first appeared on Terra Incognita.

The geometry of data: the missing metric tensor and the Stein score [Part II]

Christian S. Perone — Tue, 12 Nov 2024 15:33:55 +0000

Credit: ESA/Webb, NASA & CSA, J. Rigby. / The James Webb Space Telescope captures gravitational lensing, a phenomenon that can be modeled using differential geometry.

Note: This is a continuation of the previous post: Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I], so if you haven’t read it yet, please consider reading as I won’t be re-introducing in depth the concepts (e.g., the two scores) that I described there already. This article became a bit long, so if you are familiar already with metric tensors and differential geometry you can just skip the first part.

I was planning to write a paper about this topic, but my spare time is limited, so I decided it would be much more fun and educational to write this article in the form of a tutorial. If you like it, please consider citing it:

Cite this article as: Christian S. Perone, "The geometry of data: the missing metric tensor and the Stein score [Part II]," in Terra Incognita, 12/11/2024, https://blog.christianperone.com/2024/11/the-geometry-of-data-part-ii/.

Introduction

Hyperbolic triangle. (2024, August 24). Wikimedia Commons.

I’m writing this second part of the series because I couldn’t find any formalisation of this metric tensor that naturally arises from the Stein score (especially when used with learned models), much less blog posts or articles about it, which is surprising given the deep connection between score-based generative models, diffusion models and the geometry of the data manifold. I think there is an emerging field of “data geometry” that will be as impactful as information geometry (where the Stein score “counterpart”, the Fisher information, is used to construct the Fisher information metric tensor, the metric tensor of the statistical manifold; fun fact: I used to live very close to where Fisher spent his childhood in north London). It is very unfortunate, though, that the term “Geometric Deep Learning” has become synonymous with Deep Learning on graphs when there is so much more to be explored.

As you will see later, this metric tensor opens up new possibilities for defining data-driven geometries that adapt to the data manifold. One thing that became clear to me is that score-based generative models tell data how to move and data tells score-based generative models how to curve. Given the connection between score-based models and diffusion, this is a very exciting area to explore, where we can find many interesting connections and use the entire framework of differential geometry, which is quite a unique perspective on how we see these models and what they are learning.

I have made many simplifications to make this an educational article instead of just dumping formulas and definitions, mainly because differential geometry and tensor calculus are not very common in ML outside of information geometry, geometric deep learning and physics. The examples I will show below using a 2D gaussian distribution are one such simplification: it is obviously easier to find geodesics there without relying on optimisation, but I’m assuming the score could come from a much more complex score-estimation model, so I treat it as a black box instead of exploiting the analytical definition of a gaussian.

Manifolds $M$ and Tangent Spaces $T_pM$

If you work in ML, you have probably heard about the manifold hypothesis already. This hypothesis posits that “many high-dimensional data sets that occur in the real world actually lie along low-dimensional latent manifolds inside that high-dimensional space”. This doesn’t tell you, however, what a manifold actually is, especially the intuition, so I will give a basic intuition and define it while focusing on the tangent space, which is essential to understand the metric tensor that I will talk about later. I won’t go through all components of a manifold and will try to keep it high level, with a focus on what we need to understand the metric tensor.

So, imagine stretching and bending a piece of rubber without tearing (or gluing it :D). The resulting shape, no matter how complex, captures the essence of a smooth manifold. These manifolds are surfaces or higher-dimensional spaces that locally resemble flat Euclidean space, but globally can take on rich structure, just like the structure of our universe (e.g., see the image at the top of the post showing gravitational lensing). Now, formally, a smooth manifold is a topological space that is locally homeomorphic to Euclidean space and has a smooth structure. This means that around each point, we can find a neighbourhood that looks just like a piece of $\mathbb{R}^n$, and the transitions between these local views are smooth.

Manifold M and the tangent space TpM at the point p as two-dimensional objects. The geodesic γ(t) starts at p in the direction v. Image from Visualization of Nil, Sol, and SL2(R) geometries by Tiago Novello, Vinícius da Silva, Luiz Velho, Computers & Graphics, Volume 91, 2020.

Now, at each point $p$ on a manifold, we can imagine all possible directions in which we could move. These directions form a vector space called the tangent space, which serves as a local linear approximation of the manifold at that point. The notation used for it is usually $T_pM$, which means the tangent space $T$ of the manifold $M$ at the point $p$. You can see in the image on the left the point $p$ in the manifold and the cross section there showing the tangent space $T_pM$. You can also see a curve $\gamma(t)$ representing a geodesic, but I will talk about the geodesics a bit later once we have the metric tensor. Later on we will be using stochastic gradient descent (SGD) to optimize a discrete “curve” path made of multiple segments using the metric tensor we will define.
While the tangent space provides a linear approximation of the manifold at a point, it still doesn’t allow us to define lengths of vectors or angles between them (we still cannot calculate an inner product). For this we need the metric tensor, which we will talk about below.

Metric tensor $g$: equipping the inner product $\langle u, v \rangle$

Imagine you’re an ant walking on the surface of an orange. To you, and to flat earthers especially (although ants have the license to believe in it), the world seems flat, yet you’re traversing a curved surface in a higher-dimensional space. At the very core of differential geometry, we have the metric tensor, which we call a tensor but it is actually a tensor field.
There is often confusion between the metric tensor in differential geometry and the metric in metric spaces or measure theory (here we are again with the overloading issues, which can get much worse if you start using tensor calculus notation with lower and upper indices). In a metric space, a metric refers to a function that defines the distance between two points (e.g., the Euclidean distance).
Now, the metric tensor is a different animal: it is a tensor field on a manifold that defines an inner product $\langle \cdot, \cdot \rangle$ on the tangent space at each point, so that we can use it to compute distances and angles (locally, by integrating along curves). It is basically a generalisation of the idea of distance to curved spaces (as opposed to flat spaces), but we will get concrete with this later and it will be clear when you actually see the code, especially if you have never used a metric tensor before (although you have been using one all the time, the identity metric tensor, which is the metric tensor that induces the dot product in Euclidean geometry).

The metric tensor is often denoted by the symbol $g$ and is defined as:

$$ g_p: T_pM \times T_pM \rightarrow \mathbb{R}, p \in M $$

Where we have the $g_p$ which is the metric tensor at the point $p$. You can also find the notation as $g_x$ which is at the point $x$, as I will use later because this will become data samples and we often denote data with $x$ in ML. We also have the tangent space $T_pM$ and manifold $M$ that we mentioned before and the $T_pM \times T_pM \rightarrow \mathbb{R}$ which tells us that the metric tensor takes two vectors from the tangent space as input and maps to the real numbers. The most important thing you need to understand here is that the $g_p$ can be different for each point $p$ (or constant everywhere as in Euclidean flat spaces). I think a good intuition to understand the metric tensor is to think that at every point in a curved space you have a curvature and this curvature can change depending on where you are in this curved space, therefore you need a “measuring stick” to be able to compute inner products, without that you cannot compute lengths because you don’t know how much each axis will stretch or contract. We will see that our new tensor derived from the Stein score will change at every point in space according to the learned data curvature, which is very interesting in my opinion.

One interesting parallel we can take here is the Fisher information metric (not to be confused with the Fisher information) where each point $p$ is actually a different parametrisation of a distribution in the statistical manifold. You can see a very nice animation tool that shows you a geodesic (we will talk about what this means soon) between two gaussian distributions that can have different parametrisation, so you can compute a distance between distributions in the statistical manifold using the Fisher information metric. This metric tensor is basically the core of information geometry, which is dealing with distribution parameters as the point $p$, allowing you to compute inner products, distances, lengths, geodesics, angles in the statistical manifold. For more information about it and the relationship of the Fisher information metric with optimization and its use in natural gradient please take a look at my slides about gradient-based optimization.

Now, I made this parallel because the form of the Fisher information is very similar to what we will do, except that in our metric tensor constructed from the Stein score we will define each point $p$ not as parameters of a distribution but as the data itself. The Fisher metric tensor will tell you the curvature in the statistical manifold, of the information geometry, and our metric tensor built from the Stein score will tell us the curvature of the data geometry.

Here is where we stand now in terms of concepts that we learned:

Example: Metric tensor $g$ in Euclidean space

The simplest metric tensor we can use as an example is the metric tensor in the Euclidean space $\mathbb{R}^n$, in this case it is defined as:

$$ g = I_n $$

Where $I_n$ is the $n \times n$ identity matrix for $\mathbb{R}^n$. Now, let’s use this metric tensor to compute the inner product and see where we end up. To compute the inner product, we just do:

$$ \langle u, v \rangle_g = u^T g v $$

Note that we are omitting the point here because this metric tensor is the same everywhere (it is constant), but one could write the inner product definition as $ \langle u, v \rangle_g(x)$ where $x$ is the point at which the metric tensor is evaluated (which can be different from $u$ and $v$, as we will see later in an example). So let’s now plug in our metric tensor definition, which is the identity matrix, and we get an enormous simplification:

$$ \langle u, v \rangle_g = u^T I_n v = u^T v $$

The inner product is simply the dot product of the vectors $u$ and $v$, which we are very familiar with. The identity matrix does not change the computation, but it’s important to understand that the metric tensor in this case is $I_n$.

Note that the inner product immediately induces a norm as well:

$$ \|v\|_g = \sqrt{\langle v, v \rangle_g} $$

Thus, the norm $\|v\|_g$ represents the length of the vector $v$ in the geometry defined by the metric tensor $g$.
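
As a small worked example of the two formulas above, here is a sketch in plain Python that computes $u^T g v$ and the induced norm for the identity metric and for a second, made-up metric. The helper names and the “stretched” metric are assumptions introduced only for illustration.

```python
import math


def inner(u, v, g):
    """Inner product <u, v>_g = u^T g v, with g given as a nested list."""
    n = len(u)
    return sum(u[i] * g[i][j] * v[j] for i in range(n) for j in range(n))


def norm(v, g):
    """Length of v in the geometry defined by the metric g."""
    return math.sqrt(inner(v, v, g))


identity = [[1.0, 0.0], [0.0, 1.0]]   # Euclidean metric tensor I_2
stretched = [[4.0, 0.0], [0.0, 1.0]]  # hypothetical metric that stretches the first axis

u, v = [1.0, 2.0], [3.0, 4.0]
print(inner(u, v, identity))  # 11.0, the usual dot product
print(norm(v, identity))      # 5.0, the usual Euclidean length
print(norm(v, stretched))     # sqrt(52), about 7.21
```

The same vector $v$ has a different length under the second metric, which is exactly the “measuring stick” role the metric tensor plays.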

While in Euclidean space the metric tensor is the identity matrix and simply gives us the standard dot product, it becomes more interesting on curved or more complex manifolds. In these cases, the metric tensor varies from point to point and encodes the curvature and structure of the space. Distances and angles in such spaces are no longer as simple as in the Euclidean case, but the same principles apply: the metric tensor provides the framework for computing inner products, distances, and angles, adapted to the specific geometry of the manifold.

Diagram showing basis vectors ($e_1$, $e_2$, $e_3$) and the vector $u$ as a combination of them. Image from: Wang, Rui, et al. 2021. Geometric Algebra-Based ESPRIT Algorithm for DOA Estimation.

Now, if you want to understand a bit better how the identity matrix suddenly appears as the metric tensor for the Euclidean space, you need to understand basis vectors. They basically (or basically ? I know I should stop with these bad jokes) provide a set of directions that we use to describe any vector in a space. In 3D space ($\mathbb{R}^3$) for example, the standard basis vectors point along the x, y, and z axes and any vector can be represented as a combination of these basis vectors, using a certain amount of each, we just combine them. In the end, the intuitive view that you can have is that they act like a coordinate system, that allows us to describe and navigate that space.

The standard basis vectors for a 3D space are the following:

$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad
\mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad
\mathbf{e}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$

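Each one points in a single direction. The components of the metric tensor are the pairwise inner products of the basis vectors, $g_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j$, and since the standard basis vectors are orthonormal, you can now see where the Euclidean metric tensor (the identity matrix) comes from: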

$$ g_{ij} = \delta_{ij} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

Where $\delta_{ij}$ is the Kronecker delta, which gives us the simple rule below that makes it convenient to represent the Euclidean orthonormal metric tensor:

$$ \delta_{ij} =
\begin{cases}
1 & \text{if } i = j \\
0 & \text{if } i \neq j
\end{cases} $$
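
To see the Kronecker delta appear, the sketch below builds the metric components as the pairwise dot products of the basis vectors, $g_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j$. The function names and the skewed basis are assumptions added for illustration.

```python
def dot(a, b):
    """Ordinary dot product of two coordinate lists."""
    return sum(x * y for x, y in zip(a, b))


def metric_from_basis(basis):
    """Metric components g_ij = e_i . e_j for a list of basis vectors."""
    return [[dot(ei, ej) for ej in basis] for ei in basis]


standard = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
skewed = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]  # hypothetical non-orthogonal basis

print(metric_from_basis(standard))  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(metric_from_basis(skewed))    # [[1, 1, 0], [1, 2, 0], [0, 0, 1]]
```

With the orthonormal standard basis the result is the identity matrix; as soon as the basis vectors are not orthonormal, the off-diagonal and diagonal entries change and the metric is no longer $\delta_{ij}$.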

A common misconception is to think that Euclidean space is not a Riemannian manifold, but it actually is. A Riemannian manifold is basically a manifold that has more structure, namely a smoothly varying and positive-definite metric tensor. The identity matrix used in Euclidean space meets these criteria, so the Euclidean manifold is indeed a Riemannian manifold when equipped with its standard positive-definite metric tensor (the identity matrix).

That’s enough for our Euclidean example, so let’s now move on to how we define curves and geodesics on a manifold.

Curves $\gamma(t)$ and Geodesics

Curves, Length and Energy

Let’s now shift the focus to curves. A curve on a manifold is a smooth path $\gamma(t)$ parameterized by $t$ that traces a path between two points on the manifold. The natural way to measure the length of a curve is by integrating the infinitesimal distances along it. Now that we know that a metric tensor $g$ exists, we will integrate these infinitesimal distances using the inner product equipped with the metric tensor:

$$ L[\gamma] = \int_0^1 \sqrt{\underbrace{\langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_g}_{\text{Inner product}}} \, dt $$

The 0 to 1 integration you see in the equation means that 0 is the beginning of the curve and 1 is the end of it (as parametrized by $t \in \left[0, 1\right]$). This curve maps from the interval $\left[0, 1\right]$ into a manifold $M$, i.e., $\gamma(t) : \left[0, 1\right] \rightarrow M$. At each value of $t$, $\gamma(t)$ gives a point on the manifold. You might not be used to the dot at top of the $\dot{\gamma}$ curve, as it is not common in ML to use that notation, but this notation basically means that we are taking the derivative of the curve with respect to the parameter $t$:

$$ \dot{\gamma}(t) = \frac{d}{dt}\gamma(t) $$

Understanding the length formula is very simple: the square root of $\langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_g$ gives the instantaneous speed at each point along the curve, and the integral accumulates the “instantaneous distances” as you move along the curve from $t=0$ to $t=1$. This accumulation gives the total distance traveled, or the length of the curve, and given that we are using the metric tensor $g$, it will give us the length of the curve on the manifold, taking into consideration the local curvature along the way.

One important definition here that might help understand a bit better as well, is that the $L[\gamma]$ is a functional, which is a special type of mathematical object that maps functions to real numbers. More specifically, it’s a function that takes another function (or curve) as its input and outputs a scalar.

You now probably have a good idea of what the length is computing and how we built the understanding coming from the inner product definition equipped with the metric tensor. Let’s now talk about an even simpler concept which is the energy functional $E[\gamma]$ defined below:

$$ E[\gamma] = \int_0^1 \langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_g \, dt $$

This is very similar to the length, except that we don’t have the square root. The reason why energy is more convenient, and we will see a concrete example later when we start optimizing geodesics using energy, is that since energy depends quadratically on the velocity, it has better analytic properties (e.g. smoothness). Note the brackets in the notation, which denote that it is a functional (it takes the curve $\gamma$ as its argument).
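
Both functionals can be approximated on a discretized curve. The sketch below assumes a path given as a list of points sampled uniformly in $t \in [0, 1]$, uses finite differences for $\dot{\gamma}$, and evaluates the metric at each segment midpoint; these discretization choices and all names are assumptions for illustration, not the only way to do it.

```python
import math


def euclidean_metric(x):
    """Metric tensor at point x; here the constant 2D identity."""
    return [[1.0, 0.0], [0.0, 1.0]]


def length_and_energy(path, metric):
    """Approximate L[gamma] and E[gamma] for a piecewise-linear path."""
    dt = 1.0 / (len(path) - 1)  # uniform steps covering t in [0, 1]
    length = 0.0
    energy = 0.0
    for a, b in zip(path[:-1], path[1:]):
        mid = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
        vel = [(bi - ai) / dt for ai, bi in zip(a, b)]  # finite-difference velocity
        g = metric(mid)
        d = len(vel)
        speed_sq = sum(vel[i] * g[i][j] * vel[j] for i in range(d) for j in range(d))
        length += math.sqrt(speed_sq) * dt  # integrand of L
        energy += speed_sq * dt             # integrand of E
    return length, energy


# Straight line from (0, 0) to (3, 4) in 10 segments.
straight = [[3.0 * k / 10, 4.0 * k / 10] for k in range(11)]
length, energy = length_and_energy(straight, euclidean_metric)
print(length, energy)  # close to 5 and 25
```

For this constant-speed straight line the energy equals the squared length ($25 = 5^2$), which is the equality case of the general bound $L[\gamma]^2 \leq E[\gamma]$ on $[0, 1]$.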

A curve $\gamma(t)$ between two points in the manifold that parametrizes two gaussian distributions. Image adapted from: Miyamoto, H.K. et al. On closed-form expressions for the Fisher–Rao distance. Info. Geo. (2024).

Let’s now think about an interesting example from information geometry. Remember that we mentioned that in information geometry you are dealing with the statistical manifold? We can define curves in this statistical manifold where each point $p$ actually represents the parameters of a distribution. A very simple example is a gaussian distribution: each point in the statistical manifold represents a valid parametrisation of the gaussian distribution. In the figure on the left, we can see the two points and their respective gaussians. Note as well that we can draw curves between these distributions, such as the $\gamma(t)$ shown in the figure connecting the two gaussians.

In Euclidean spaces, the shortest path between two points is always a straight line. But on more complex and rich manifolds, this no longer applies. We find the shortest path between points by finding the curve that minimizes the length between them, which we call a “geodesic”, although a geodesic might not always be the globally shortest path, and we will see later the limitations of minimizing the energy to find a geodesic between points.

Geodesics

The curved path is the shortest path. Sorry flat earthers.

Geodesics are a special kind of curve that represent the shortest path (again, not always globally) between two points on the manifold, much like a straight line in Euclidean space. Geodesics are defined by the metric tensor that we just introduced, and they are crucial for understanding the intrinsic geometry of the manifold.
Now, in data geometry, geodesics reflect the natural paths between data points, with distances defined by the data manifold itself. This helps us understand the “shape” of the data and explore relationships between points in a geometrically meaningful way. Note the difference here between the data manifold, where the points represent data samples, and the statistical manifold from information geometry, where points represent parametrisations of a distribution. We will focus here on the data manifold, and later we will use the Stein score to build a metric tensor for it.

The easiest way to understand geodesics is to imagine that you want to get from point A to point B using the least possible effort (or energy, as we just discussed in the previous section). On a sphere you’ll naturally follow what’s called a “great circle”, like how airplanes follow curved paths on long-distance flights that look strange on flat maps but are actually the shortest paths. In differential geometry, a geodesic generalizes this idea to any curved surface or space. It’s the path that an object would follow if it were moving “straight ahead” without any external forces, just following the natural curvature of the space it’s in.

Geodesics as energy minimization

Light following a curved geodesic due to a massive star. Source: http://hyperphysics.phy-astr.gsu.edu/hbase/Relativ/grel.html#c5.

You can solve geodesic equations (that use Christoffel symbols) using many different methods. I will focus here, however, on how we can find a geodesic from an optimization perspective (because we are an ML community and we love optimization). My goal is not to provide an extensive collection of methods for finding geodesics, but to give you an understanding of the geodesics that will show up later and the tools needed to find them. There are obviously many things you can do when you are dealing with very simple expressions, but when we bring learned networks into the equation, things become much more expensive and difficult. For that reason I’m trying to provide a way to find geodesics assuming that the Stein score we will be using later is a black box (e.g. it was learned with a deep neural network), which is why the optimization perspective is a good fit here.

Let’s review the Energy functional that we introduced earlier:

$$ E[\gamma] = \int_0^1 \langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_g \, dt $$

All critical points of the energy functional correspond to geodesics, but these critical points can be of different types (minima, maxima, or saddle points), for our problem we won’t have issues with saddle points but it is good to know (if you are interested in learning more about it, Keenan has an amazing video on it, with the most beautiful slide deck you will ever find in differential geometry).

The goal of our optimization will be to find the curve $\gamma$ that minimizes this energy functional $E[\gamma]$, this would be:

$$ \theta^\ast = \arg \min_{\theta} E[\gamma_\theta] = \arg \min_{\theta} \left( \frac{1}{2} \int_{a}^{b} \langle \dot{\gamma}(t; \theta), \dot{\gamma}(t; \theta) \rangle_g \, dt \right) $$

where the $\theta$ are parameters of the curve (we will use a parametrized curve, actually a discretization of it to make it easier to understand). So what we want to do is basically find the curve/path that minimizes this energy. If we want to find the geodesic between two points (which we will see later) we can just hold the first and last points fixed and optimize the path between them. We will be using a gradient-based optimizer (Adam) to optimize this curve later, but note that we are not optimizing a neural network: we are optimizing the parameters of a curve, of a path. We want to find a path, so keep that in mind to avoid confusion later.

The missing metric tensor

Stein score function

Now that we have most of the introduction covered with all required tools, we can start the interesting part. So, the Stein score is the derivative of the log density with respect to the data (and not parameters as in the Fisher information matrix used in information geometry, please see my first post for more about it):

$$ s(x) = \nabla_x \log p(x) $$

The Stein score function measures how sensitive the log-density $\log p(x)$ is to changes in the data $x$. This is a very interesting quantity and it plays a fundamental role in Stein’s method. What is important to understand about the Stein score is that it is a vector field, so it gives you a direction towards the data for every point, as you can see in the image below:

The animation shows a 2D gaussian moving around and the effect of it in the Stein score vector field. Note how the Stein score points towards regions of high density.

You can also see how the Stein score changes when we also change the covariance of the 2D gaussian:

The animation shows a 2D gaussian with varying covariance matrix and the effect of it in the Stein score vector field.

You can derive the closed-form Stein score function by just differentiating the log density or you can also learn it. There is a deep and fundamental connection between score matching/score estimation and diffusion models that I won’t go into details here but the main point is that this score can be learned from data.

For the sake of explanation and understanding, I will assume that we have data $x \in \mathbb{R}^2$ and that this data is sampled from a 2D gaussian. The reason for this is that we can easily see what is going on, but do not limit your imagination to gaussians, the idea is to treat the score as a black box that could’ve been learned from the data.

The missing Stein metric tensor for the data manifold

The main question that we want to answer here is: how can we devise a metric tensor for the data manifold?

This would be a very interesting metric tensor because it will allow us to compute distances between data samples, to compute geodesics between data samples, to measure angles in this data manifold, to compute inner products, etc.

One way to construct a metric tensor is to start from requirements that we need:

We need this metric tensor to contract space in the direction of the data (now you are starting to understand why the Stein score);
We want this metric tensor to be a Riemannian metric tensor (e.g. being positive definite), otherwise computing lengths and other properties becomes challenging;
The geodesics should be “attracted” or follow paths where the density (of data) is higher;

We have the Stein score function $s(x)$ (which can also be learned, $s_{\theta}(x)$) that gives us the direction to the data and whose magnitude depends on where we are in the data space. So we need to construct a metric tensor from this building block that allows us to contract space in the direction of the data. One way to do that is to use the outer product:

$$ u \otimes v = uv^T $$

If we use the outer product of the Stein score with itself, in a $x \in \mathbb{R}^2$ this will yield the following matrix:

$$ s(x) \otimes s(x) = \begin{bmatrix} s_1^2 & s_1 s_2 \\ s_2 s_1 & s_2^2 \end{bmatrix} $$

When we look from the matrix transformation perspective, when we transform a vector with this Stein score outer product (which is symmetric), we will get a transformation that will project the vector into the direction of the Stein score and it will scale the vector if the vector is parallel to the Stein score as you can see in the plot below:

Where $T$ is the outer product of the Stein score ($s(x) = [2.0, 2.0]$) and $P$ is the vector we are transforming. You can see that the transformation result $TP$ was projected along the direction of $s(x)$ and scaled (the maximum scaling happens when $P$ points exactly in the direction of the Stein score). Now, if we look at what happens when the vector is orthogonal to the Stein score you will see that there is contraction happening:

This is very helpful because it tells that we have a mechanism to contract the vector when it is orthogonal to the Stein score. But there is one problem: what we want is the opposite of this. One way to achieve this is to just invert the matrix as in:

$$ [s(x) \otimes s(x)]^{-1} $$

The problem is: outer products yield rank-1 matrices and we cannot invert rank-1 matrices as is. Now, if you read the first part of this article you are probably familiar with the Fisher information matrix and its use as a metric tensor in information geometry. The Fisher-Rao metric also uses the outer product:

$$ g_{ij} = \mathbb{E} \left[ \nabla_{\theta} \ln p(x | \theta) \otimes \nabla_{\theta} \ln p(x | \theta) \right] $$

However, what makes the rank-1 matrix be positive definite is the expectation over data because when you add a lot of directions in the outer product it becomes invertible and positive definite (under certain reasonable regularity assumptions of course).
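To make the last few paragraphs concrete, here is a minimal NumPy sketch of what the outer product does as a transformation, and of why a single outer product is singular while an average of many is not. The random vectors below are only a stand-in for "many different directions"; they are an assumption for illustration and are not Fisher scores of any real model.

```python
import numpy as np

# Stein score value used in the plot discussed above.
s = np.array([2.0, 2.0])
T = np.outer(s, s)  # s ⊗ s = s s^T

# T p = s (s . p), so the result always lies along s.
p_parallel = s / np.linalg.norm(s)                    # unit vector along s
p_orthogonal = np.array([1.0, -1.0]) / np.sqrt(2.0)   # unit vector orthogonal to s

scaled = T @ p_parallel       # stretched by ||s||^2 = 8 along s
collapsed = T @ p_orthogonal  # mapped to the zero vector

# A single outer product has rank 1, so it cannot be inverted.
rank_single = np.linalg.matrix_rank(T)

# Averaging outer products over many different directions (illustrative
# random vectors here) yields a full-rank, invertible matrix.
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))
T_mean = np.mean([np.outer(v, v) for v in directions], axis=0)
rank_mean = np.linalg.matrix_rank(T_mean)

print(scaled, collapsed, rank_single, rank_mean)
```

The same mechanism is what the expectation provides in the Fisher-Rao case: each individual term is rank-1, and only the sum over many directions fills the space.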

Now, from the Bayesian perspective, the natural next step here would be to take the expectation over the posterior, over the parameters of the model estimating the Score function (when learned), and that would probably make it positive definite. We can, however, adopt other approaches, one of them being “regularizing” the matrix, which in optimization is also called damping (please see Section 6 from K-FAC paper if you are interested), where we add a small multiple of the identity matrix before inverting the matrix (as it is often done with the natural gradient). One interesting aspect is that if we just add the identity matrix we will get an interpolation between the Euclidean metric and the metric tensor that we are building that will contract the space in the direction of the Stein score, therefore yielding a very smooth interpolation between the two metric tensors. What I will do here is add the identity matrix before taking the inverse and then we will see what it looks like, but be aware that you can instead add a small multiple of the identity matrix, which will make the contraction much stronger, to the point of making geodesics go through singularities.

Here is what we have until now:

$$ g = [s(x) \otimes s(x) + I]^{-1} $$

Or if you are a Bayesian and want to integrate over the posterior (if you have a score model $s_{\theta}(x)$ instead of a score function), then the metric tensor becomes the one below (which will propagate the uncertainty to the geodesics as well, which is quite amazing):

$$ g = \int_{\Theta} \left[ s_{\theta}(x) \otimes s_{\theta}(x) + I \right]^{-1} p(\theta \mid \mathcal{D}) \, d\theta $$

I’m calling it $g$ because that is usually the name of the metric tensor in differential geometry. Let’s see how this transformation now affects a vector on the unit circle:

Now you can see that we are seeing the behavior that we wanted, it is contracting the vector $P$ when it is in the direction of the Stein score while keeping it the same scale if it is orthogonal. There is one more thing we can do before we start using and analyzing this metric tensor $g$, which is the efficient computation of the inverse.
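The behavior described above can be checked numerically. This is a small sketch using the same example score $s(x) = [2.0, 2.0]$ from the earlier plot and a plain matrix inverse; the variable names are mine and are only illustrative.

```python
import numpy as np

s = np.array([2.0, 2.0])                       # example Stein score
g = np.linalg.inv(np.outer(s, s) + np.eye(2))  # g = (s ⊗ s + I)^{-1}

# A few vectors P on the unit circle and the length of g P for each of them.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
lengths = np.linalg.norm(circle @ g.T, axis=1)

# Unit vectors along and across the score direction.
along = s / np.linalg.norm(s)
across = np.array([-along[1], along[0]])

contraction_along = np.linalg.norm(g @ along)    # 1 / (1 + ||s||^2) = 1/9
contraction_across = np.linalg.norm(g @ across)  # 1, i.e. left unchanged

print(lengths)
print(contraction_along, contraction_across)
```

Vectors parallel to the score shrink by a factor of $1 / (1 + \|s(x)\|^2)$, vectors orthogonal to it keep their length, and everything in between falls between those two extremes.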

Efficient inversion with Sherman-Morrison formula

The Sherman-Morrison formula equips us with a cheap way to compute the inverse of a rank-1 update to a matrix whose inverse has previously been computed:

$$ (A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} $$

In our case, since $A = I$, $A^{-1} = I$, $u = s(x)$, and $v^{\intercal} = s(x)^{\intercal}$, then we have this elegant formula for efficiently computing the metric tensor:

$$ g = I - \frac{s(x) s(x)^\top}{1 + \|s(x)\|^2} $$

This final formula ended up being similar to the Lorentz contraction, which is very interesting because we are also contracting length in the data manifold.
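As a quick sanity check, the closed form can be compared against a direct matrix inverse. This is only a sketch: `stein_metric` is a helper name I am introducing here, and the score vector is an arbitrary example.

```python
import numpy as np

def stein_metric(s):
    """g = I - s s^T / (1 + ||s||^2), the Sherman-Morrison form of (s s^T + I)^{-1}."""
    s = np.asarray(s, dtype=float)
    return np.eye(s.size) - np.outer(s, s) / (1.0 + s @ s)

s = np.array([0.5, -1.5, 2.0])                        # arbitrary example score
g_direct = np.linalg.inv(np.outer(s, s) + np.eye(3))  # explicit inverse
g_fast = stein_metric(s)                              # no matrix inversion needed

matches = np.allclose(g_direct, g_fast)
eigenvalues = np.linalg.eigvalsh(g_fast)  # all positive: g is positive definite

print(matches, eigenvalues)
```

The closed form avoids the matrix inversion entirely, which matters when the score comes from a network and the dimension is large. The eigenvalues also confirm the requirement stated earlier that $g$ is positive definite: one eigenvalue equals $1 / (1 + \|s(x)\|^2)$ and the rest equal 1.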

This metric tensor is very interesting because it basically defines the geometry of the data manifold so we can find geodesics in the data manifold, which are basically paths across data samples on the curved data manifold. The other interesting aspect of it is that it is telling us one very interesting aspect of score matching and score-based models (and also diffusion models): they are learning the building block of the metric tensor from the data manifold.

The fact that we can just plug the Stein score and efficiently compute the metric tensor for the data manifold is also very interesting because as I said earlier, there are multiple ways of computing or estimating the Stein score. This metric tensor puts the data as coordinates in the data manifold, just as the Fisher-Rao metric puts parameters as coordinates of the statistical manifold. I find this super exciting and with a lot of different connections with other concepts that could yield faster and high quality data samplers for score-based models, faster optimization algorithms, etc.

Now that we have our metric tensor $g$ and we can compute it efficiently, let’s see some concrete cases of using it.

Optimizing geodesics on the data manifold

You can use many different parametrized curves but I will simplify here and use a discretization of a curve to find geodesics in the data manifold. To do that, I will use a path $\gamma(t; \theta)$ that has multiple segments and then compute the energy $E[\gamma]$ using the metric tensor $g$ at each midpoint of these segments. Then what I’m going to do is use Adam to minimize this energy and plot the animation of this procedure. What we want to do is minimize the following objective:

$$ \theta^\ast = \arg \min_{\theta} E[\gamma_\theta] = \arg \min_{\theta} \left( \frac{1}{2} \int_{a}^{b} \langle \dot{\gamma}(t; \theta), \dot{\gamma}(t; \theta) \rangle_g \, dt \right) $$

Where we want to optimize the parameters of the discrete curve $\gamma(t; \theta)$ to minimize the energy functional that we discussed in the introduction of this article. For the sake of example, I’m sampling from a 2D gaussian distribution that has a diagonal covariance matrix, so I can calculate the Stein score analytically and construct the metric tensor $g$ with it. Keep in mind that just like I mentioned earlier, I’m using a parametric distribution but you can learn that from data and replace the Stein score with an estimator of it.
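One way to write the discretized objective, assuming a uniform time step and a midpoint rule (this specific quadrature is my assumption, chosen to match the description of evaluating $g$ at each segment midpoint), is:

$$ E[\gamma_\theta] \approx \frac{1}{2} \sum_{i=0}^{N-1} \frac{1}{\Delta t} \, (\gamma_{i+1} - \gamma_i)^\top \, g(m_i) \, (\gamma_{i+1} - \gamma_i), \qquad m_i = \frac{\gamma_i + \gamma_{i+1}}{2}, \qquad \Delta t = \frac{b - a}{N} $$

Here $\gamma_0, \dots, \gamma_N$ are the points of the path, the parameters $\theta$ are the interior points $\gamma_1, \dots, \gamma_{N-1}$, and the endpoints $\gamma_0$ and $\gamma_N$ are held fixed. Each term comes from approximating $\dot{\gamma}$ on a segment by $(\gamma_{i+1} - \gamma_i) / \Delta t$ and multiplying the integrand by $\Delta t$.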

Deriving the Stein score from a multivariate Gaussian

It is often illuminating and interesting to look at what the Stein score function looks like for Gaussians. The derivation can be a bit cumbersome but the end result is very simple.

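Start from the definition of the Stein score:
$$ s(x) = \nabla_x \log p(x) $$
For a multivariate Gaussian with mean $\mu$ and covariance $\Sigma$, the log density is a quadratic term plus a constant that does not depend on $x$:
$$ \log p(x) = -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) + \text{const} $$
Take the gradient of $\log p(x)$ with respect to $x$ (the constant vanishes):
$$\nabla_x \log p(x) = -\frac{1}{2} \nabla_x \left( (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
Expand the quadratic term:
$$ (x - \mu)^\top \Sigma^{-1} (x - \mu) = \sum_{i,j} (x_i - \mu_i) \Sigma_{ij}^{-1} (x_j - \mu_j) $$
Differentiate with respect to $x$ (using the fact that $\Sigma^{-1}$ is symmetric):
$$\nabla_x (x - \mu)^\top \Sigma^{-1} (x - \mu) = 2 \Sigma^{-1} (x - \mu)$$
And the Score function is:
$$\nabla_x \log p(x) = -\Sigma^{-1} (x - \mu)$$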

In the end the Score function is very simple, and it is even more illuminating when you think of a zero-mean Gaussian, where it is simply $\nabla_x \log p(x) = -\Sigma^{-1} x$, which is the negative of the inverse covariance matrix multiplied by $x$ (and just $-x$ for a standard normal). This is basically scaling $x$ according to the covariance of the distribution. If you are familiar with the Mahalanobis Distance, it is easy to see how you can rewrite it in terms of the Stein score.
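The derivation can be verified numerically by comparing the closed-form score with a finite-difference gradient of the log density. This is a minimal sketch: the mean, covariance and test point are arbitrary values I picked for illustration. It also checks one way of writing the squared Mahalanobis distance in terms of the score, namely $s(x)^\top \Sigma \, s(x)$.

```python
import numpy as np

mu = np.array([1.0, -0.5])           # illustrative mean
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])       # illustrative covariance
Sigma_inv = np.linalg.inv(Sigma)

def log_density(x):
    # Gaussian log density up to the normalizing constant (which has no x dependence).
    d = x - mu
    return -0.5 * d @ Sigma_inv @ d

def score(x):
    # Closed-form Stein score derived above: -Sigma^{-1} (x - mu).
    return -Sigma_inv @ (x - mu)

def numerical_grad(f, x, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

x = np.array([0.3, 1.2])
s = score(x)
s_num = numerical_grad(log_density, x)

# Squared Mahalanobis distance, directly and rewritten with the score.
mahalanobis_sq = (x - mu) @ Sigma_inv @ (x - mu)
mahalanobis_from_score = s @ Sigma @ s

print(s, s_num)
print(mahalanobis_sq, mahalanobis_from_score)
```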

Visualizing the Geodesic optimization (Energy minimization)

Now that we have the Stein score $s(x) = \nabla_x \log p(x)$ we also have the metric tensor $g$ as we defined earlier, and we can then define a curve and optimize its parameters by minimizing the energy functional $E[\gamma_\theta]$, where we compute the inner product using our derived metric tensor. Below you can see an animation of what this optimization looks like. This was done using a path with 60 points (the blue line). This path is initialized with a straight path between the two dots (green and yellow). This initial straight line (shown in red) would be the shortest distance for the Euclidean metric, but as you will see, it is not the shortest distance in the data manifold. In the data manifold, the geodesic will bend towards the center of the Gaussian, where the data is, and therefore the shortest distance between two points is a curved geodesic and not a straight line anymore.

https://blog.christianperone.com/wp-content/uploads/2024/11/anim_simple_gaussian.mp4

As you can see in the animation above, the path was bent to follow the curvature of the data, to pass through regions of space where there was contraction so that the energy and distances are lower through these paths.
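Here is a minimal sketch of this kind of optimization for a standard 2D Gaussian, where the score is simply $s(x) = -x$. To keep it short and dependency-free it differs from the setup above in ways that are my own simplifications: it uses 17 points instead of 60, plain gradient descent instead of Adam, and finite-difference gradients instead of automatic differentiation. With a learned score you would replace `score` with your model and use autodiff.

```python
import numpy as np

def score(x):
    # Closed-form Stein score of a standard 2D Gaussian.
    return -x

def energy(path):
    # Discrete energy: 0.5 * sum_i delta_i^T g(m_i) delta_i / dt, with
    # g = I - s s^T / (1 + ||s||^2) evaluated at each segment midpoint m_i.
    deltas = np.diff(path, axis=0)
    mids = 0.5 * (path[:-1] + path[1:])
    s = score(mids)
    quad = np.sum(deltas**2, axis=1) - np.sum(deltas * s, axis=1) ** 2 / (
        1.0 + np.sum(s**2, axis=1)
    )
    dt = 1.0 / len(deltas)
    return 0.5 * np.sum(quad) / dt

def energy_grad(path, eps=1e-5):
    # Central finite differences over interior points; endpoints stay fixed.
    grad = np.zeros_like(path)
    for i in range(1, len(path) - 1):
        for j in range(path.shape[1]):
            bumped = path.copy()
            bumped[i, j] += eps
            e_plus = energy(bumped)
            bumped[i, j] -= 2.0 * eps
            e_minus = energy(bumped)
            grad[i, j] = (e_plus - e_minus) / (2.0 * eps)
    return grad

start, end = np.array([-4.0, -2.0]), np.array([4.0, -2.0])
path = np.linspace(start, end, 17)  # straight-line initialization, shape (17, 2)

initial_energy = energy(path)
learning_rate = 0.005
for _ in range(400):
    path -= learning_rate * energy_grad(path)
final_energy = energy(path)

print(initial_energy, final_energy)
print(path[1:-1, 1].mean())  # average height of the interior points
```

The interior points drift from $y = -2$ towards the center of the Gaussian while the energy goes down, which is the same qualitative behavior as in the animation. This short run is not meant to reach the fully converged geodesic.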

Geodesics can start to get a bit non-intuitive when you have regions of high curvature, as we can see below if we increase the correlation in the covariance matrix of the Gaussian, creating a narrower distribution (showing anisotropy):

https://blog.christianperone.com/wp-content/uploads/2024/11/anim_gaussian_08.mp4

Note now how the path coming from the left is bending to take advantage of the curvature at the bottom of the plot, and then we can see that the geodesic starts to be “attracted” by the regions of high data density. If we make the data follow an even narrower distribution, you can clearly see the effects of anisotropy in the metric tensor:

https://blog.christianperone.com/wp-content/uploads/2024/11/anim_gaussian_095.mp4

Note that towards the end of the video you will be able to see that many segments of the path are being pulled to where the high data density is. This happens because in that region the space is highly compressed, so the distances become smaller and therefore the energy becomes lower, and the geodesic tries to fit as many segments as possible in that region to reduce energy.

Understanding the Energy landscape

To understand why geodesics are following these paths, we can visualize the Energy landscape. It is quite tricky to visualize it since it depends on the direction of the displacement when computing the energy. We can, however, visualize the landscape for the initial position of the path (the red line shown above) and we can vary the data distribution to see how that changes energy:

What we are visualizing in this animation above is the Energy at each point of the space:

$$ E[\gamma] = \int_0^1 \langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_g \, dt $$

Note that we are using the path from -4 to 4 in the x-axis and fixing the y-axis at -2 for the path (just as you can see in the red path shown in the animations earlier). Now you can understand why the path bends in the middle when we have a standard normal: you can see in the beginning of the animation that the Energy is higher (reds) exactly in the middle of the path, so the path bends that part to the center of the data to reduce its total energy. Please note that the displacement is zero in the y-axis because we are keeping the y dimension fixed. That’s the limitation of plotting it this way.

This is the energy landscape for a displacement of [1.0, 0.0]. This is what the first step of optimization looks like with the standard 2D gaussian (the white line is the path along which we calculate the energy):

Plot of the energy landscape for a standard 2D gaussian. The white line is where we initialized our path. This plot is for a displacement of [1.0, 0.0].

Now with the line over the energy heatmap you can clearly see why the middle of the line bent to the center, it is because the energy in the middle and below the 0 limit of the y-axis is much higher.
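A landscape like this one can be reproduced, under the assumption that what is plotted is the integrand $d^\top g(x) \, d$ for the fixed displacement $d = [1.0, 0.0]$ at every point $x$ of a grid, with the score of a standard 2D Gaussian. The grid range and resolution below are arbitrary choices of mine.

```python
import numpy as np

def energy_density(points, displacement):
    # d^T g(x) d with g = I - s s^T / (1 + ||s||^2) and s(x) = -x
    # (score of a standard 2D Gaussian), evaluated for every point at once.
    s = -points
    d = np.asarray(displacement, dtype=float)
    return d @ d - (s @ d) ** 2 / (1.0 + np.sum(s**2, axis=-1))

# Energy density over a grid, suitable for plotting as a heatmap.
xs = np.linspace(-5.0, 5.0, 101)
ys = np.linspace(-5.0, 5.0, 101)
X, Y = np.meshgrid(xs, ys)
grid = np.stack([X, Y], axis=-1)             # shape (101, 101, 2)
landscape = energy_density(grid, [1.0, 0.0])

# Energy density along the initial path: x from -4 to 4 at y = -2.
line_x = np.linspace(-4.0, 4.0, 9)
line = np.stack([line_x, np.full_like(line_x, -2.0)], axis=-1)
along_line = energy_density(line, [1.0, 0.0])

print(along_line)
```

Along the initial path the density peaks at $x = 0$ and falls off towards both endpoints, which matches the observation that the middle of the line is the most expensive part. The `landscape` array can be passed to any heatmap plotting function.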

Some final thoughts

I hope you enjoyed this part II of the post, I will try to do a follow-up soon with more uses of the metric tensor. One very interesting use would be in Langevin sampling, which seems very helpful because now that we have a metric tensor, we can use it in the Riemann Manifold Langevin Dynamics (RMLD). There are still A LOT of topics to explore with this metric tensor because it gives us a very important building block of the data manifold. You can also simplify it a lot by just using the diagonal of the metric tensor (as it is often done with the Fisher metric as well). I think that this metric tensor can help build more efficient sampling algorithms, provide manifold interpolations and help us better understand our data manifolds. There are also very deep connections with diffusion and score-based generative models due to the building block we use to construct the metric tensor, which I will be exploring more in the next part of this series. It is also important to note that there are a lot of connections with physics as well, as differential geometry is one of the most important frameworks used to model spacetime, so there is a lot to explore!

Changelog

17 Nov 2024: added more details about the Bayesian integration.

Citation

The post The geometry of data: the missing metric tensor and the Stein score [Part II] first appeared on Terra Incognita.

Notes on Gilbert Simondon’s “On the Mode of Existence of Technical Objects” and Artificial Intelligence

Christian S. Perone — Thu, 09 Jan 2025 15:59:43 +0000

Happy new year! This is the first post of 2025 and this time it is not a technical article (but it is about philosophy of technology).

Gilbert Simondon (1924-1989). Photo by LeMonde.

This is a short opinion article to share some notes on the book by the French philosopher Gilbert Simondon called “On the Mode of Existence of Technical Objects” (Du mode d’existence des objets techniques) from 1958. Despite his significant contributions, Simondon still (and incredibly) remains relatively unknown, and it seems to me that this is partly due to the delayed translation of his works. I realized recently that his philosophy of technology aligns very well with an actionable understanding of AI/ML. His insights illuminated a lot for me on how we should approach modern technology and what cultural and societal changes are needed to view AI as an evolving entity that can be harmonised with human needs. This perspective offers an alternative to the current cultural polarization between technophilia and technophobia, which often leads to alienation and misoneism. I think that this work from 1958 provides more enlightening and actionable insights than many contemporary discussions of AI and machine learning, which often prioritise media attention over public education. Simondon’s book is very dense and it was very difficult to read (I found it more difficult than Heidegger’s work on philosophy of technology), so in my quest to simplify it, I might be guilty of simplism in some cases.

Culture positioning itself as a defensive system against technology

One of the key points that Simondon talks about is regarding how culture sees technology. According to Simondon, culture has constituted itself as a “defense system” against technology. Well, it doesn’t take much effort for us to understand that this seems to be getting worse lately and especially after the post-LLM world in ML, with more and more folks joining the chorus of apocalyptic views. If we look back, however, history shows that this has always been the case with technology. Culture ignores the human reality within the technical reality, it sees it as something artificial that is opposite to the natural world. Simondon says that:

Culture behaves toward the technical object as man toward a stranger, when he allows himself to be carried away by primitive xenophobia. Misoneism directed against machines is not so much a hatred of novelty as it is a rejection of a strange or foreign reality. However, this strange or foreign being is still human, and a complete culture is one which enables us to discover the foreign or strange as human.

– Gilbert Simondon, On the Mode of Existence of Technical Objects (p10), 1958.

The Leader of the Luddites. Published in May 1812 by Messrs. Walker and Knight, Sweetings Alley, Royal Exchange.

If we look at where we are today, with some people advocating to decelerate data center construction, it resembles the English Luddite movement of the 19th century, where workers destroyed textile machinery due to concerns relating to worker pay and output quality. The cause of this, according to Simondon, is alienation, and the most powerful cause of alienation is the misunderstanding of the machine and technology, which is not an alienation caused by technology but by the non-knowledge of its nature and its essence. This alienation, this inadequate and confused rapport between consumer, manufacturer and worker, should be replaced by an awareness of the mode of existence of technical objects, to integrate them back into culture and its values. There are also endless recent examples of how culture is acting in a defensive way against technology. Take for example the resistance to the use of LLMs in education: it is often seen as fostering diminished critical thinking and threatening traditional educational values rather than enhancing them. Even Stephen Hawking said that “the development of full artificial intelligence could spell the end of the human race” in a BBC interview, whatever “full” means.

Culture has turned us blind. It has reduced technology to a set of neutral instruments at the service of technocratic will, posing as if it were a simple tool with a utility, fostering this polarized view of technology and Artificial Intelligence as either technophilia (uncritical love of AI and technology) or technophobia (fear or rejection of AI and technology). These polarized views end up masking the rich human efforts and natural forces shaping technical objects as mediators between man and nature.

Automatism and the mythical representation of the robot

Coloured engraving from Joseph Racknitz’s 1789 pamphlet which attempted to reveal the secret workings of William Kempelen’s alleged chess-playing automaton “The Turk”.

Culture recognizes certain objects, like aesthetic objects, but sees technical objects as, according to Simondon, a “structureless world of things that have no signification but only a use, a utility function”. This defensive rejection makes people who have knowledge of technical objects, and who appreciate their signification, seek to justify their judgment by granting these technical objects a sacred status, embedded in the mythical idea of the “robot”, as explained by Simondon below:

We would like to show, precisely, that the robot does not exist, that it is not a machine, no more so than a statue is a living being, but that it is merely a product of the imagination and of fictitious fabrication, of the art of illusion. The notion of the machine as it currently exists in culture, however, incorporates to a great extent this mythical representation of the robot. An educated man would never dare to speak of objects or figures painted on a canvas as genuine realities, having interiority, good or ill will. However, this same man speaks of machines as threatening man, as if he attributed a soul and a separate, autonomous existence to them, conferring on them the use of sentiment and intention toward man.

– Gilbert Simondon, On the Mode of Existence of Technical Objects (p11), 1958.

This also exposes that culture has two contradictory attitudes toward technical objects. On one hand, Simondon argues, we treat technical objects as pure assemblages of matter with a mere utility, and on the other hand culture also supposes that these objects are robots and “that they are animated by hostile intentions toward man, or that they represent a permanent danger of aggression and insurrection against him”. Simondon attributes this contradiction to a very interesting fact that has its roots in the idea of automatism, which considers the level of perfection of a machine as proportional to its degree of automatism. Simondon argues that automatism is, however, a low degree of technical perfection, because when you make a machine automatic you are also sacrificing a wide range of possible usages. Now, the interesting aspect is that when the degree of technicity of machines is raised, what we actually see doesn’t correspond to higher levels of automatism but to a certain margin of indeterminacy, and this margin is what allows the machine to be sensitive to outside information. Simondon says that “(…) a purely automatic machine completely closed in on itself in a predetermined way of operating would only be capable of yielding perfunctory results.” and that a machine with a high degree of technicity is an open machine. You can make many links here with AI in regard to the level of indeterminacy.

From tool bearer to spectator

Old Spitalfields hand loom for silk weaving fitted with jacquard machine, 1810. Science Museum Photographic Archive. London Science museum has one of these if you want to visit.

From the seventeenth century to the eighteenth century, there was a clear continuous pace of evolution of technical objects according to Simondon, where instruments and tools became better (he cites the example of gears and screw threads being cut better in the eighteenth century than in the seventeenth century) and this gave the idea of continuity of progress. This evolution did not cause any upheaval: it improved fabrication processes without disruption and the craftsman was able to preserve the habitual method but now with better results, causing a sensation of optimism in progress without anxiety. But when transformations that provoke a break within the rhythms of everyday life appear, the situation changes. In the earlier period, when craftsmen exchanged their tools for new tools whose manipulation was the same, they had the feeling of being more precise and skilful. In the nineteenth century, however, when we started to witness the birth of the complete technical individuals, the confusion set in. (Do not read “individual” here as a human individual. We will discuss this later; it comes from the “individuation” concept, which is more nuanced. I mention it just to avoid thinking that Simondon is saying that technical objects are humans.)

Simondon mentions that as long as these technical individuals were merely replacing animals, as in the case of the steam engine, which did not replace humans, this perturbation wasn’t followed by frustration. But the automatic weaving loom, the forging press, etc., are what workers destroyed during riots, because these machines were their rivals: the machine is no longer just an engine, but a bearer of tools like humans were. Simondon explains this as below:

(…) From then on the most positive, most direct aspect, of the first notion of progress, is no longer experienced. The progress of the eighteenth century is a progress experienced by an individual through the force, speed, and precision of his gestures. The progress of the nineteenth century can no longer be experienced by the individual because it is no longer centralized with the individual as the center of command and perception in the adapted action. The individual becomes the mere spectator of the results of the functioning of the machines, or the one who is responsible for the organization of technical ensembles putting the machines to work.

– Gilbert Simondon, On the Mode of Existence of Technical Objects (p162-163), 1958.

The most important aspect that Simondon mentions, from my point of view, is that in the nineteenth century man doesn’t experience progress as a worker, he experiences it as an engineer or a user. Simondon also mentions that at the end of the first half of the 19th century, poets keenly felt progress to be the general march of humanity, with its “charge of risk and anxiety”, bearing something “triumphant and crepuscular”.

Alienation, not only from means of production

One interesting realization from Simondon is that alienation is not caused only by changes in the means of production. Marxism grasps the alienation produced after the nineteenth century revolution as having its root in the relation of the worker with the means of production (because now the means are not their property anymore). According to Simondon, however, alienation doesn’t derive only from the relation of property or non-property between the worker and the machine or instrument of work. Simondon says that there is a more profound relation:

(…) that of the continuity between the human individual and the technical individual, or of the discontinuity between these two beings. The reason why alienation arises is not solely because in the nineteenth century the human individual who works is no longer the owner of his means of production, whereas in the eighteenth century the craftsman was the owner of his instruments of production and of his tools. Alienation does indeed emerge the moment the worker is no longer the owner of his means of production, but it does not emerge solely because of this rupture in the link of property. It also emerges outside of all collective relation to the means of production, at the physiological and psychological level of the individual properly speaking. The alienation of man in relation to the machine does not only have a socio-economic sense; it also has a physio-psychological sense; the machine no longer prolongs the corporeal schema, neither for workers, nor for those who possess the machines.

– Gilbert Simondon, On the Mode of Existence of Technical Objects (p165), 1958.

What Simondon is saying here is that this alienation of capital is not alienation with respect to labor, but rather with respect to the technical object. What labor lacks is not what capital possesses, and what capital lacks is not what labor possesses, he says. I think that this same profound psychological nuance emerged recently in AI as well, it doesn’t matter if you are a big company that is developing technology or if you are a user of technology, the technical individual, as Simondon says, “… is not of the same era as the labor that enacts it and the capital that frames it”.

Technology and Artificial Intelligence are not just tools

Green Town. Friedensreich Hundertwasser (1928–2000). Venice, 1978.

Just like Heidegger in his “The Question Concerning Technology” (Die Frage nach der Technik), Simondon doesn’t see technology just as a tool or simply a means to an end, as it is seen today by some groups. Technology has always been a target of philosophical reflection as an instrument, as an economic reality and utility, and this misses its essence. Technical objects are much more complex than this instrumental reduction, and we need philosophical thought to understand their essence. In his philosophy, Simondon describes two important concepts: concretization and individuation. I will try to summarize them, but both concepts have a lot of nuances and there are entire articles just working on the exact definition of these concepts.

Concretization: concretization is about how technical objects evolve to become more integrated and self-coherent systems over time. Simondon often separates abstract and concrete technical objects: a concrete technical object is one where all its parts work together in mutual synergy, rather than just being assembled as separate components. He also gives the example of combustion engines: early internal combustion engines had separate systems for cooling (radiator), power generation (engine block), and lubrication (oil system). Over time, these evolved so that components served multiple functions (e.g. cooling fins became structural elements, oil helped cool and lubricate, etc). The engine became more “concrete” as its parts developed multiple integrated functions rather than remaining separate abstract elements. Concretization can be seen as a specific mode of evolution of technical objects. One important aspect here is that this process represents the technical and functional maturation of an object, where it increasingly resembles a natural system by being self-sustaining and autonomous. We will talk more about this later, on how technical objects don’t have to be seen as opposite to nature and how nuanced the notion of artificiality can be.

Individuation: describes the process by which entities (not only technical, but also biological) come into being as unique, coherent individuals within their associated milieu/environment. It is a dynamic process that involves the resolution of tensions or potentials between the entity and its environment. Note that in Simondon’s concept of individuation this is always an ongoing process: it is never complete, and beings are always in a process of becoming. Individuation is also always relational: a technical object achieves individuation when it establishes a reciprocal relationship with its environment (e.g. a turbine operating efficiently within a water flow system).

Now, I think the main revolutionary aspect of Simondon’s philosophy is that these concepts bridge the technical and natural worlds. Both natural and technical objects undergo similar processes of development, and this makes the usual separation between “natural” and “artificial” much more blurry. Rather than opposing nature and technology, we can start to see technology as an extension of nature: it is the extension of human agency in nature. But not only that, these technical objects co-exist and co-evolve with us, they shape the way we see and interact with nature, and they also change our culture. This is for me the most important reason why they are not just tools: they are mediators of our interaction with nature, and this mediation is not passive, as it actively influences both the natural world and human culture.

When we start to understand technical objects through the lens of individuation and concretization, it becomes clear that Machine Learning/AI is not just a tool, a collection of tools with some utility, or simply a means to an end. It is part of our relationship with the world and has its own mode of existence, which co-evolves with us and changes our culture. AI is shaping our future culture, society and cognition, so we cannot reduce it to a simple tool of the human will.

Small detour: concretization makes inductive study sensible

Another very interesting part of Simondon’s book, in the chapter “The process of concretization”, is when Simondon talks about concretization and inductive study. He says that technical concretization makes primitively artificial objects increasingly similar to natural objects. The consequences of this concretization are not only human and economic but also intellectual:

(…) since the mode of existence of the concretized technical object is analogous to that of natural spontaneously produced objects, one can legitimately consider them as one would natural objects; in other words, one can submit them to inductive study. They are no longer mere applications of certain prior scientific principles. By existing, they prove the viability and stability of a certain structure that has the same status as a natural structure, even if it might be schematically different from all natural structures. The study of the functioning of concrete technical objects bears scientific value, since its objects are not deduced from a single principle; they are testimony to a certain mode of functioning and compatibility that exists in fact and has been built before having been planned: this compatibility was not contained in each of the separate scientific principles that served to build the object; it was discovered empirically; one can work backward from the acknowledgement of this compatibility to the separate sciences in order to pose the problem of the correlation of their principles and ground a science of correlations and transformations that would be a general technology or mechanology.

This is very enlightening for me, and it is precisely why there is scientific value in inspecting and understanding deep neural networks. For example, even though we know that LLMs have very different structures from biological ones, these objects were not deduced from a single principle, so we can legitimately consider them as one would consider natural objects and submit them to inductive study, as we do today.

Nuances of artificiality in Simondon’s integrated view

The widely understood meaning of “artificial” often refers to something that is man-made, unnatural, or opposed to the natural world. Simondon’s philosophy offers a more integrated and dynamic perspective on artificiality. Simondon does not view artificial objects as fundamentally opposed to nature. Instead, he sees technology and artificial creations as extensions of natural processes. For him, technical objects are products of human creativity, which itself is part of nature, and artificiality can be seen as a stage in the evolution of technical objects. As we mentioned before, an “abstract” technical object might seem artificial because it has not yet undergone concretization, so it is not well integrated in its milieu (environment). As these technical objects evolve and adapt, their artificiality diminishes, as they become more aligned with natural principles and better integrated into human and environmental systems. There is a very interesting example about the double flower in Simondon’s book:

(…) Artificiality is not a characteristic denoting the fabricated origin of the object in opposition to spontaneous production in nature: artificiality is that which is internal to man’s artificializing action, whether this action intervenes on a natural object or on an entirely fabricated one; a flower, grown in a greenhouse, which yields only petals (a double flower) without being able to engender fruit, is the flower of an artificialized plant: man diverted the functions of this plant from their coherent fulfillment, to such an extent that it can no longer reproduce except through procedures such as grafting, requiring human intervention. Rendering a natural object artificial leads to the opposite results to that of technical concretization: the artificialized plant can only exist in a laboratory for plants, the greenhouse, with its complex system of thermal and hydraulic regulations. (…)
Conversely, technical concretization makes the primitively artificial object increasingly similar to a natural object. (…)

– Gilbert Simondon, On the Mode of Existence of Technical Objects (p57), 1958.

As you can see, his notion of artificiality is not rooted in the fabricated origin of an object. Artificiality can also be understood in terms of the alienation between humans and technical objects. When technology is poorly understood or remains “opaque”, it can seem foreign, unnatural, or even threatening. Simondon says that this perception of artificiality arises from a cultural failure to understand the internal logic and evolution of technical objects, rather than from any intrinsic quality of the objects. For Simondon, this dichotomy (of seeing technical objects as unnatural or artificial) is misleading because technical objects emerge from natural processes through human creativity, so technical objects are not an “other” to nature but a continuation of it through human agency. Please do not take this to mean that Simondon is saying that technical objects are natural living beings: he is very clear in differentiating their fundamental modes of individuation and their relationship to the environment. What is important to understand here is the continuity: just as living beings evolve over time, technical objects undergo an evolutionary process through human innovation, and the lines between the artificial and the natural are much more blurry than our current culture understands today.

The pedagogical proposal to technological literacy

What Simondon proposes to address the problem of alienation, when it is seen as arising mainly from the lack of knowledge of the mode of existence of technical objects and of how they evolve, is the reintegration of technology into culture and education. The cultural fear or rejection of technology often arises from a misunderstanding of the nature of technical objects and their role in human life. Simondon argues that technology is not something alien or opposed to humanity: it is an extension of human creativity and a natural product of human activity.

Humans and technology are part of a co-evolutionary process where both influence and shape one another. Worrying about technology “taking jobs”, for example, overlooks this broader interdependence. Jobs displaced by technology are not the destruction of human purpose but the transformation of human activity. Instead of fearing job displacement, society should focus on reorganizing and redefining work in ways that allow humans to collaborate with machines to create new opportunities.

It is clear that just reskilling is probably not enough for the next years and that there will be a lot of challenges arising from AI. But I think that, as a Machine Learning community, we are not doing enough to educate the public about technology, which is the only way people will be able to understand what can happen to their careers in the future and how they can adapt. Without an in-depth understanding of technology, the public is left at the mercy of disinformation and our polarized cultural legacy.

Hope you enjoyed these notes as much as I enjoyed writing them and reading Simondon’s work. I think it is an essential philosophy for the decades to come.

– Christian S. Perone

Updates

03/Feb 2025: thanks Pablo Borba for catching some errors.

The post Notes on Gilbert Simondon’s “On the Mode of Existence of Technical Objects” and Artificial Intelligence first appeared on Terra Incognita.

Generalisation, Kant’s schematism and Borges’ Funes el memorioso – Part I

Christian S. Perone — Sun, 09 Jun 2024 20:16:21 +0000

Introduction

Portrait of Immanuel Kant by Johann Gottlieb Becker, 1768.

One of the most interesting, but also obscure and difficult parts of Kant’s critique is schematism. Every time I reflect on generalisation in Machine Learning and how concepts should be grounded, it always leads to the same central problem of schematism. Friedrich H. Jacobi said that schematism was “the most wonderful and most mysterious of all unfathomable mysteries and wonders …” [1], and Schopenhauer also said that it was “famous for its profound darkness, because nobody has yet been able to make sense of it” [1].

It is very rewarding, however, to realize that it is impossible to read Kant without relating much of his revolutionary philosophy to the difficult problems we are facing (and always have been) in AI, especially regarding generalisation. The first edition of the Critique of Pure Reason (CPR) was published more than 240 years ago, so historical context is often required to understand Kant’s writing, and to make things worse there is a lot of debate and lack of consensus among Kant scholars. Even with these difficulties, it is still one of the most relevant works of philosophy today and well worth reading.

I’m perplexed that there are only sparse works about Kant’s philosophy in the ML community that are not related to ethics. Don’t get me wrong, ethics is as important as it ever was, but it won’t solve the riddle of generalisation. There is much more than ethics in philosophy.

In this article, I will do my best to show how Kant is very relevant to Machine Learning while avoiding doing exegesis of the CPR. I will, however, leave a lot of references in case you are interested in dedicating some more time to reading (which I hope you will). I think that as a Machine Learning community, we really need to pay more attention to philosophy. Neuroscience definitely has its place, but we need to remember that the brain is an organic solution to a problem, while the mind is a general solution. We cannot get distracted by looking at only one aspect of intelligence and keep correlating embeddings with brain activity to argue that we are going in the right direction.

Situating schematism in Kant

I won’t go through the entire CPR here, as there are many texts that help to understand it. One very good introduction is the Routledge Philosophy Guidebook by Sebastian Gardner [2], which is a fine companion for reading the CPR. I will try, however, to explain what is required to understand and appreciate Kant’s ideas, but I will of course take shortcuts for that. I’m no Kant expert, I just love his writings and I think there is still much we can learn from him.

Philosophy, showing Kant’s name and Konigsberg below, London Published for the Encyclopaedia Londinensis. 1825.

The first thing you need to understand is that for Kant, there are forms of thought which cannot be learned or derived by looking at the world: these are the “categories” or “pure concepts of understanding”. This can become clear with the example [3] of a billiard table with one red billiard ball on it. Our understanding of the concept “red” comes from seeing red things and noticing the attribute they have in common. However, we cannot learn about “oneness” by looking at the world: learning the notion of oneness by looking at a lot of single things presupposes the exact concept we’re expecting to learn, so we would be guilty of circularity.

Kant lists twelve categories, and then in the transcendental deduction chapter of the CPR, he proceeds by showing that these pure concepts of understanding are prerequisites to any empirical perception. In order to connect these abstract categories with the empirical world, Kant introduced the concept of schematism. Schematism is the process that mediates between pure, a priori concepts (the categories) and the sensory, empirical data we encounter. Without this intermediary step, the categories remain empty, and sensory data remains unordered. This mediation is crucial as it operationalises how abstract concepts apply to concrete experiences.

Kant posits that for each category there exists a corresponding “schema” or procedural rule that bridges the gap between the category and the sensory information. For example, the category of causality is tied to a schema that dictates how we perceive sequences of events and infer cause and effect. Without such schemata, abstract categories would have no way to manifest in any recognisable way in our perception and experience of the empirical world.

According to Kant, a schema is a “representation of a universal procedure of imagination in making an image for a concept” [1]. Kant thinks that no particular image of a triangle can be adequate to the general concept of a triangle, hence the need for the schematism and the “homogeneity” between what is intuited by sensibility and the concepts of the understanding. One thing that is very interesting here is that Kant attributes this procedure to imagination, and it is impossible not to link imagination with generative models. It is still not clear to me (and to many others, as far as I can tell), however, how Kant envisioned this role of imagination in going from concepts to particular experiences, from the general to the particular. Even though we can imagine all sorts of triangles by knowing the concept of a triangle, it doesn’t seem possible that we are conceiving all possible triangles to “match” a particular triangle (although some authors proposed this explanation of schematism).

A coincidental connection with DeepDreams (or inceptionism)

There is a very nice paper [4] from Jessica J. Williams with the title “The Shape of a Four-Footed Animal in General”: Kant on Empirical Schemata and the System of Nature, where the author argues, backed by a lot of historical references, that Kant had scientific illustrations in mind in his discussions of empirical images in the schematism chapter. The author cites the example of the Renaissance physician and naturalist Leonhart Fuchs, who developed a new kind of scientific realism that focused on depicting the essential characteristics of specimens. The author in [4] describes a clear example: images of plants as simultaneously bearing fruits and flowers. These images were not naturalistically realistic, but they were much more informative, and besides being images (individual intuitive representations), they were used to communicate general information.

Neural net “dreams”— generated purely from random noise, using a network trained on places by MIT Computer Science and AI Laboratory.

One very interesting connection with this idea of conveying information about a concept, in the same way as depicting images of plants simultaneously bearing fruits and flowers, appears when we look at the (now old) technique of inceptionism (or DeepDreams) from Google in 2015, which I experimented with in 2015 as well, right after it was released, using some images from the Codex Seraphinianus. In inceptionism you basically invert the optimization goal: you can start with pure noise or with an existing image, and then you do gradient ascent on the input to maximize a particular layer’s activations. As you can see in the images, it is hard not to see that what is happening here is the same as in the scientific illustrations mentioned by the author of [4], where we don’t have naturalistically realistic images but we have a lot of information (as much as is possible given the optimization and image constraints) about the concept of what is a building, what is a dog, etc.
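To make the "inverted optimization" concrete, here is a minimal sketch of gradient ascent on the input rather than on the weights. It is not the original inceptionism code: the "layer" is a hypothetical random linear map followed by a ReLU (the names `W`, `layer_activations` and `dream` are assumptions made up for this illustration), whereas the real technique uses a layer of a trained convolutional network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for "a particular layer": a fixed random linear map
# followed by a ReLU. A real setup would use a trained network instead.
W = rng.normal(size=(16, 64))

def layer_activations(x):
    return np.maximum(W @ x, 0.0)

def objective(x):
    # Quantity to maximize: half the squared norm of the layer's activations.
    return 0.5 * np.sum(layer_activations(x) ** 2)

def objective_grad(x):
    # Gradient of the objective w.r.t. the input (the weights stay fixed).
    return W.T @ layer_activations(x)

def dream(x0, steps=50, lr=0.01):
    x = x0.copy()
    history = [objective(x)]
    for _ in range(steps):
        g = objective_grad(x)
        # Normalized gradient *ascent* step on the input.
        x = x + lr * g / (np.linalg.norm(g) + 1e-8)
        history.append(objective(x))
    return x, history

x0 = rng.normal(size=64)  # start from pure noise
x_dream, history = dream(x0)
print(f"objective before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

The only difference from ordinary training is which variable receives the gradient: the parameters are frozen and the input is updated, so the input drifts towards whatever the chosen layer responds to most strongly.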

To be continued

I found these ideas quite interesting, and there are a lot of interesting questions on the table right now, as we don’t really know how generalisation works. What is the link between imagination and generalisation, and the mediation of concepts and experiences? What is the connection between generative models and schematism? I believe that solving the riddle of Kant’s schematism is deeply tied to generalisation. Kant, however, left us in a very difficult situation (even to understand his solution to the problem).

To be continued in Part II once I get more time

Cite this article as: Christian S. Perone, "Generalisation, Kant’s schematism and Borges’ Funes el memorioso – Part I," in Terra Incognita, 09/06/2024, https://blog.christianperone.com/2024/06/schematism/.

References

[1] Pendlebury, M. (1995). Making Sense of Kant’s Schematism. Philosophy and Phenomenological Research, 55(4), 777–797. https://doi.org/10.2307/2108332

[2] Gardner, Sebastian (1999). Routledge Philosophy Guidebook to Kant and the Critique of Pure Reason. Routledge.

[3] Bowie, Andrew (2003). Introduction to German Philosophy: From Kant to Habermas. Polity.

[4] Williams, Jessica J. (2020). “The Shape of a Four-Footed Animal in General”: Kant on Empirical Schemata and the System of Nature. Hopos: The Journal of the International Society for the History of Philosophy of Science 10 (1):1-23.


Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I]

Christian S. Perone — Tue, 26 Sep 2023 12:51:06 +0000

Different gaussian curvature surfaces. Image by Nicoguaro.

We are so used to Euclidean geometry that we often overlook the significance of curved geometries and the methods for measuring things that don’t reside on orthonormal bases. Just as understanding physics and the curvature of spacetime requires Riemannian geometry, I believe a profound comprehension of Machine Learning (ML) and data is also not possible without it. There is an increasing body of research that integrates differential geometry into ML. Unfortunately, the term “geometric deep learning” has predominantly become associated with graphs. However, modern geometry offers much more than just graph-related applications in ML.

I was reading the excellent article from Sander Dieleman about different perspectives on diffusion, so I thought it would be cool to try to contribute a bit with a new perspective.

A tale of two scores

Fisher information, metric and score

R.A. Fisher at his calculator in 1958 (courtesy of the Fisher Memorial Trust).

There are two important quantities that are widely known today and that keep popping up basically everywhere. The first one is the Fisher information matrix $ \mathbf{F}$ (or FIM):

$$\mathbf{F}_\theta = \mathop{\mathbb{E}} \left[ \nabla_\theta \log p_\theta(y \vert x) \, \nabla_\theta \log p_\theta(y \vert x)^T \right] \,$$ with $y \sim p_\theta (y \vert x)$ and $x \sim p_{\text{data}}$. Note that where $y$ comes from is very important and often a source of confusion: $y$ is sampled from the model’s predictive distribution (which is quite interesting, because it means you don’t need labels to estimate $ \mathbf{F}$). The FIM is used in many places, such as the Cramér-Rao bound, continual learning, posterior approximation, optimization, Bayesian priors, KL divergence curvature, etc. Note that there is a lot of debate about the FIM vs. the empirical FIM and their different properties, which I will skip here (I discussed this in the optimization context in this presentation if you are interested).
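The point about where $y$ comes from can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: a linear softmax classifier with parameters `theta`, random inputs standing in for $p_{\text{data}}$, and the helper names `score` and `estimate_fim`. The important detail is that $y$ is sampled from the model's own predictive distribution, so no labels appear anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score(theta, x, y):
    # Gradient of log p_theta(y | x) w.r.t. theta for a linear softmax model,
    # flattened into a vector: (onehot(y) - p) x^T.
    p = softmax(theta @ x)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return np.outer(onehot - p, x).ravel()

def estimate_fim(theta, xs, samples_per_x=20):
    n_params = theta.size
    fim = np.zeros((n_params, n_params))
    count = 0
    for x in xs:
        p = softmax(theta @ x)
        for _ in range(samples_per_x):
            # y is drawn from the model's predictive distribution, not from labels.
            y = rng.choice(len(p), p=p)
            s = score(theta, x, y)
            fim += np.outer(s, s)
            count += 1
    return fim / count

theta = rng.normal(size=(3, 4))   # hypothetical 3-class, 4-feature model
xs = rng.normal(size=(50, 4))     # stand-in for samples from p_data
F = estimate_fim(theta, xs)
print(F.shape)
```

Replacing the sampled `y` with observed labels would give the empirical FIM instead, which is the distinction mentioned above.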

Dr. C.R. Rao, during the Indian Statistical Institute (ISI) days.

The Fisher information matrix is also used in information geometry as a Riemannian metric, where it is called the Fisher-Rao metric (there are other names for it as well, which can be quite confusing). In this statistical manifold, where coordinates parametrize probability distributions, the metric (which equips the manifold) induces an inner product and allows us to compute norms and distances for distributions. Information geometry was pioneered by the late C. R. Rao and further developed and popularized by Shun-ichi Amari (who wrote some fine books about it).

We will talk more about the statistical manifold and what the metric actually does more intuitively later, but for now, note that the FIM uses the score, or what we can call, the Fisher score:

$$\mathbf{s}(\mathbf{\theta}) = \nabla_\mathbf{\theta} \log p(\mathbf{x} \vert \mathbf{\theta})$$

This score is the gradient of the log-likelihood w.r.t. its parameters $\theta$, so it is telling us the steepness of the likelihood, with the FIM being the (co)variance of this score. The FIM is also equivalent to the negative expectation of the Hessian matrix of the log-likelihood, which points to its significance as a curvature at a parameter point, hence its appearance as a metric tensor as well (to be precise, as a metric tensor field).
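As a small worked example of these two readings of the FIM, take a univariate Gaussian with known $\sigma$ and unknown mean $\mu$ (this specific model is only an illustrative assumption, chosen because everything can be computed by hand):

$$\log p(x \vert \mu) = -\frac{(x - \mu)^2}{2\sigma^2} + \text{const}, \qquad s(\mu) = \frac{\partial}{\partial \mu} \log p(x \vert \mu) = \frac{x - \mu}{\sigma^2}$$

$$\mathop{\mathbb{E}}\left[ s(\mu) \right] = 0, \qquad \mathop{\mathbb{E}}\left[ s(\mu)^2 \right] = \frac{\mathop{\mathbb{E}}\left[ (x - \mu)^2 \right]}{\sigma^4} = \frac{1}{\sigma^2}$$

$$-\mathop{\mathbb{E}}\left[ \frac{\partial^2}{\partial \mu^2} \log p(x \vert \mu) \right] = -\mathop{\mathbb{E}}\left[ -\frac{1}{\sigma^2} \right] = \frac{1}{\sigma^2}$$

Both routes give the same Fisher information, $1/\sigma^2$: the variance of the score and the negative expected second derivative of the log-likelihood agree, and a narrower Gaussian (smaller $\sigma$) means a more sharply curved likelihood around $\mu$.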

The other score, as in score-based models (aka Stein score)

Now, there is another score, which is the one used in score-based models and score matching, and which is often called the Stein score:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}\vert \mathbf{\theta})$$

Note that even though it looks similar and has a similar name to the previous score we showed, this is a very different score function. It doesn’t give you the gradients w.r.t. the distribution’s parameters but the gradients w.r.t. the data. It has been shown that we can estimate this score function from data even in the absence of ground truth for this quantity. Yang Song has a nice article explaining the motivation and recent developments.

The main point is that once you have this score function, you have a very powerful gradient field that tells you how samples should move in data space. You can then sample from the data distribution using Langevin sampling, which is basically SGD with noise, where the noise keeps the samples from collapsing into a single minimum.
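Here is a minimal sketch of that sampling procedure, using the unadjusted Langevin update $x \leftarrow x + \epsilon \, \mathbf{s}(x) + \sqrt{2\epsilon} \, z$ with $z \sim \mathcal{N}(0, I)$. To keep it self-contained, the score is the exact score of a Gaussian whose mean `MU` and standard deviation `SIGMA` are arbitrary assumed values. In a score-based model this function would be replaced by the learned network.

```python
import numpy as np

rng = np.random.default_rng(0)

MU = np.array([2.0, -1.0])   # hypothetical target mean
SIGMA = 0.5                  # hypothetical isotropic standard deviation

def stein_score(x):
    # Exact score of N(MU, SIGMA^2 I): grad_x log p(x) = -(x - MU) / SIGMA^2.
    return -(x - MU) / SIGMA**2

def langevin_sample(n_samples=2000, n_steps=500, step=0.01):
    x = rng.normal(size=(n_samples, 2)) * 3.0   # arbitrary initialization
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        # Unadjusted Langevin update: drift along the score plus Gaussian noise.
        x = x + step * stein_score(x) + np.sqrt(2.0 * step) * noise
    return x

samples = langevin_sample()
print(samples.mean(axis=0), samples.std(axis=0))
```

Without the noise term, every sample would simply climb to the mode at `MU`. With it, the samples spread out according to the target distribution, up to a small discretization bias that depends on the step size.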

The missing metric

If the Fisher score gives the building block of the metric tensor for the statistical manifold, which metric can we build with this (Stein) score and which manifold does it belong to? It is surprising that we still don’t seem to have a clear formalization of this, or at least I wasn’t able to find much about it. You can find some works about diffusion models on Riemannian manifolds, but not about using the estimated (through modern deep learning models) score to build a Riemannian metric.

There is a nice quote from the physicist John Wheeler about Einstein’s relativity:

Space-time tells matter how to move and matter tells space-time how to curve.

– John Wheeler

It is very interesting that we can build a metric using this estimated score function, with the same mathematical framework used in the theory of relativity, where the quote can be modified to our case as:

Diffusion models tell data how to move and data tells diffusion models how to curve.

I will start to explore the topic with some examples in a series of posts, but here is a glimpse of a geodesic using the Stein score as metric tensor, where a Gaussian is curving the data manifold and creating this structure where the shortest path between two points is not a straight line anymore:

This is a very interesting connection: seeing diffusion and score-based models as a metric tensor field can give us very interesting tools to explore data distances, geodesics, norms, etc, from the data manifold itself. We are still in the statistical domain, but the manifold is not the statistical manifold anymore, where Riemannian coordinates parametrize distributions; it is a manifold where the coordinates are the samples themselves. I think this connection of the score with the metric tensor field is an unexplored domain that is definitely very fertile, and it can give us a much deeper understanding not only of data but also of our sampling algorithms.

The inner product induced by the score metric is the following:

$$\langle \delta_{P}, \delta_{Q} \rangle_{g_x}$$

where the metric tensor $g_x$ is:

$$g_x = \nabla_{\mathbf{x}} \log p(\mathbf{x}\vert \mathbf{\theta}) \, \nabla_{\mathbf{x}} \log p(\mathbf{x}\vert \mathbf{\theta})^{T}$$

So the inner product becomes:

$$\langle \delta_{P}, \delta_{Q} \rangle_{g_x} = \delta_{P}^{T} \, g_x \, \delta_{Q}$$

Note that we are using the (Stein) score as the building block for our metric tensor $g_x$, and this score is replaced by the estimated one parametrized by a deep neural network. Notation can become a nightmare here, because the base point where the metric tensor is evaluated is already used as a lower index, so it can become $g^{\theta}_x$ to denote that this metric tensor is parametrized by $\theta$ (to make things worse, in differential geometry, index positions also have an important meaning).
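A small numerical sketch of this inner product is shown below. It reads the metric tensor as the outer product of the score with itself, which is an assumption about the intended shape (so that $g_x$ is a matrix acting on two tangent vectors). The score is the exact one of a unit Gaussian, and the names `MU`, `COV_INV`, `metric_tensor` and `inner_product` are hypothetical. With a learned model, `stein_score` would be the network output.

```python
import numpy as np

MU = np.zeros(2)        # hypothetical Gaussian mean
COV_INV = np.eye(2)     # hypothetical inverse covariance (unit Gaussian)

def stein_score(x):
    # Exact score of N(MU, COV): grad_x log p(x) = -COV^{-1} (x - MU).
    return -COV_INV @ (x - MU)

def metric_tensor(x):
    # g_x built as the outer product of the score with itself at base point x.
    s = stein_score(x)
    return np.outer(s, s)

def inner_product(x, delta_p, delta_q):
    # <delta_p, delta_q>_{g_x} = delta_p^T g_x delta_q
    return delta_p @ metric_tensor(x) @ delta_q

x = np.array([1.0, 2.0])
delta_p = np.array([1.0, 0.0])
delta_q = np.array([0.0, 1.0])
value = inner_product(x, delta_p, delta_q)
print(value)
```

Under this reading the inner product factorizes as the product of the projections of the two tangent vectors onto the score, so at a given base point it only measures displacement along the direction in which the log-density changes.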

Hope you like the idea. Please provide feedback and keep an eye out for the next posts of this series.

Updates

27 Sept 2023: added more details about the metric tensor definition using the (Stein) score;
3 Jun 2024: changes to improve clarity.

– Christian S. Perone

Cite this article as: Christian S. Perone, "Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I]," in Terra Incognita, 26/09/2023, https://blog.christianperone.com/2023/09/thoughts-on-riemannian-metrics-and-its-connection-with-diffusion-score-matching-part-i/.


Large language model data pipelines and Common Crawl (WARC/WAT/WET)

Christian S. Perone — Sat, 03 Jun 2023 19:17:05 +0000

Erik Desmazieres’s “La Bibliothèque de Babel”. 1997.

We have been training language models (LMs) for years, but finding valuable resources about the data pipelines commonly used to build the datasets for training these models is paradoxically challenging. It may be because we often take it for granted that these datasets exist (or at least existed, as replicating them is becoming increasingly difficult). However, one must consider the numerous decisions involved in creating such pipelines, as they can significantly impact the final model’s quality, as seen recently in the struggle of models aiming to replicate LLaMA (LLaMA: Open and Efficient Foundation Language Models). It might be tempting to think that now, with large models that can scale well, data is becoming more critical than modeling, since model architectures are not changing radically. However, data has always been critical.

This article provides a short introduction to the pipeline used to create the data to train LLaMA, but it allows for many variations and I will add details about other similar pipelines when relevant, such as RefinedWeb (The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only) and The Pile (The Pile: An 800GB Dataset of Diverse Text for Language Modeling). This article is mainly based on the pipeline described in CCNet (CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data) and LLaMA’s paper, both from Meta. CCNet was developed focusing on the data source that is often the largest one, but also the most challenging in terms of quality: Common Crawl.

The big picture

The entire pipeline of CCNet (plus some minor modifications made by LLaMA’s paper) can be seen below. It has the following stages: data source, deduplication, language, filtering, and the “is-reference” filtering which was added in LLaMA. I will go through each one of them in the sections below.

Visual overview of the CCNet pipeline with some modifications done in LLaMA. Click to enlarge.

Let’s dive into it !

Common Crawl

Common Crawl (or CC) data is the data coming from a non-profit organization of the same name that does massive crawling of websites and releases this archive under permissive terms. This is by no means an easy feat: consider the tasks of filtering spam, deciding which URLs to crawl, crawling massive amounts of data from different servers, handling different data formats, etc. That’s why you should consider donating if you use it.

Common Crawl provides different archival formats that you can use, and these formats have evolved over time. Nowadays the data is available in three main formats (besides the index): WARC, WAT, and WET.

WARC/WAT/WET formats

WARC: the WARC format is the largest one, as it is the least processed output of the crawl process. It is the raw data and has a very clever format that also records the HTTP response headers, so you can even get information about the server being used on each host. This format is seldom used for NLP, as it is really huge and has data that is not used for LLM training. However, this is the primary data format of CC, so it is very rich and might be useful for multimodal datasets. That’s why I think that WARC and WAT (described below) might start to see more use in the next years.

WAT and WET: these are secondary data sources of CC, as they are processed data. These two are the ones that are often used for training LMs, and here is where different pipelines start to diverge as well. These formats contain different types of records, with WAT having more metadata than WET and also including the content of HTML tags and links. WET is mainly a textual format.

If you want to see real examples of WARC/WAT/WET, take a look at this link as I omitted examples here to keep things short, but they are very interesting formats that are worth a look if you want to use them or understand how to load and parse them.

Now, CCNet (CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data) uses the WET format, which is purely textual, and that is where we will focus. However, some other pipelines start from the raw WARC files instead, arguing that to extract high-quality text you have to bypass the Common Crawl processing that produces the WET files. One example of a pipeline that doesn’t use the WET files is The Pile (The Pile: An 800GB Dataset of Diverse Text for Language Modeling), which uses jusText. The authors mention that this lets them extract higher-quality text than the WET files provide.

You probably realized that we have only just started with CC and there are already multiple options for extracting data from it. Another recent pipeline called RefinedWeb (used in Falcon) also uses WARC directly and skips the CC pipeline for text extraction (the one that generates the WET files). RefinedWeb, however, uses trafilatura instead of jusText for text extraction.

URL Filtering

Although it is not mentioned in CCNet, many pipelines filter URLs using publicly available blocklists of adult/violent/malware/etc. websites. In RefinedWeb, for example, they filter URLs using a blocklist of 4.6M domains and also apply a word-based filter to the URL itself. You can be very creative here and aggregate multiple blocklists from different sources.
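To make the idea concrete, here is a minimal sketch of a URL filter that combines a domain blocklist with a word-based check. The blocklist entries, the word list, and the function names are all hypothetical and only for illustration; this is not the RefinedWeb implementation, which is considerably larger and more careful.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklists, for illustration only. A real pipeline would
# load millions of domains aggregated from several public sources.
BLOCKED_DOMAINS = {"malware.example", "adult.example"}
BLOCKED_WORDS = {"casino", "porn"}


def domain_candidates(host):
    """Yield the host and each parent domain: a.b.c -> a.b.c, b.c, c."""
    parts = host.split(".")
    for i in range(len(parts)):
        yield ".".join(parts[i:])


def is_url_allowed(url, blocked_domains=BLOCKED_DOMAINS,
                   blocked_words=BLOCKED_WORDS):
    """Return False if the host is blocklisted or the URL has a blocked word."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # not a parseable absolute URL
    if any(d in blocked_domains for d in domain_candidates(host)):
        return False
    # Word-based check: split the whole URL on non-alphanumeric characters
    # and look for exact word matches (no substring matching).
    words = set(re.split(r"[^a-z0-9]+", url.lower()))
    return not (words & blocked_words)
```

Matching whole words rather than substrings is a design choice of this sketch: it avoids blocking unrelated URLs that merely contain a blocked word as part of a longer token, at the cost of missing obfuscated variants.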

Deduplication

Let’s now talk about deduplication, which can be a controversial matter. Deduplicating Training Data Makes Language Models Better gives an idea of what its authors found. There is, however, Pythia (Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling), whose authors report that “… deduplication of our training data has no clear benefit on language modeling performance.” [emphasis added]. I therefore think it is still an open debate. Given the excellent results from LLaMA, I wouldn’t leave deduplication aside for any new model, but we will probably see more work on this in the near future.

Let’s now discuss how deduplication works in CCNet. CC snapshots are big: for the March/April 2023 snapshot, WET is 8.7 TiB and WAT is 21.1 TiB (both already compressed!). The first thing CCNet does is break these WET snapshots into 5GB shards saved as JSON, where each entry corresponds to a crawled web page.

Erik Desmazieres’s “La Bibliothèque de Babel”. 1997.

The next step after sharding is paragraph normalization, as deduplication happens at the paragraph level. They normalize each paragraph by lower-casing it, replacing numbers with a placeholder, and removing all Unicode punctuation (you can also replace it) and accent marks. Next, they compute the SHA1 of each paragraph and use the first 64 bits to deduplicate them. After that, there is the option to deduplicate by comparing against all shards or against a fixed number of shards; see their paper for more details on this comparison.

It is interesting to note that on the RefinedWeb dataset, they seem to be much more aggressive, employing fuzzy deduplication with “strict settings” that led to removal rates “far higher than other datasets have reported” (as a baseline, CCNet reported that duplicated data accounted for 70% of the text). This can certainly have a significant impact on dataset diversity.

Another important aspect of deduplication is described in the CCNet paper: this step removes a lot of boilerplate (e.g. navigation menus, cookie warnings, and contact information). It also removes English content from pages in other languages and makes language identification, which we will discuss next, more robust.

Here is an overview of the process:

As you can see, it starts by stripping spaces, then lower-cases the text and replaces the digits with a placeholder (e.g. zero). After that, it removes (or replaces) Unicode punctuation, computes a SHA1 hash, and uses the first 8 bytes for deduplication comparisons (at the paragraph level). Do not confuse this normalized text with the training data: the normalization is done only to compute the hash and deduplicate, and the normalized text is not what is used to train the model.
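The normalization and hashing steps described above can be sketched as follows. This is a minimal illustration under my own assumptions, not CCNet’s actual code: the function names are hypothetical, digits are replaced with “0”, punctuation is removed rather than replaced, and the first occurrence of each paragraph is kept.

```python
import hashlib
import re
import unicodedata


def normalize_paragraph(text):
    """Normalize a paragraph for hashing only (never for training)."""
    text = text.strip().lower()
    text = re.sub(r"\d", "0", text)  # digits -> placeholder
    # Decompose accented characters so accents become combining marks.
    text = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")  # punctuation
        and unicodedata.category(ch) != "Mn"             # accent marks
    )


def paragraph_key(text):
    """First 8 bytes (64 bits) of the SHA1 of the normalized paragraph."""
    normalized = normalize_paragraph(text).encode("utf-8")
    return hashlib.sha1(normalized).digest()[:8]


def dedup_paragraphs(documents):
    """Drop paragraphs whose key was already seen in any earlier document."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n"):
            if not paragraph.strip():
                continue  # skip empty lines
            key = paragraph_key(paragraph)
            if key in seen:
                continue
            seen.add(key)
            kept.append(paragraph)  # keep the original, unnormalized text
        result.append("\n".join(kept))
    return result
```

Note that the output keeps the original paragraphs untouched; only the keys in `seen` are derived from the normalized text. Storing 8 bytes per paragraph instead of the full 20-byte digest is what keeps the set of seen hashes manageable at this scale.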

RefinedWeb, in turn, followed Gopher and removed repetitions before deduplication, dropping any document with excessive line, paragraph, or n-gram repetitions. After that, they ran a deduplication pipeline based on MinHash (On the resemblance and containment of documents), which they found to be very effective at removing SEO templates (placeholder SEO text repeated across websites, etc.). They also did exact deduplication, but since CC is enormous, they did something similar to the alternative mentioned in CCNet: the data is first sharded and deduplication happens at the individual shard level.
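For readers who have not seen MinHash before, here is a small self-contained sketch of the idea: each document becomes a set of word n-grams (shingles), and the fraction of matching minimums across many hash functions estimates the Jaccard similarity between two sets. The shingle size, the number of hashes, and the use of salted BLAKE2b as the hash family are assumptions of this sketch, not the settings used by RefinedWeb.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; n=3 is an arbitrary choice for this sketch."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_hashes=128):
    """One minimum per salted hash function over all items in the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where both signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In a real pipeline the signatures are not compared pairwise; they are usually split into bands and bucketed (locality-sensitive hashing) so that only documents sharing a bucket are compared, which is what makes fuzzy deduplication feasible on a corpus of this size.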

Language

Let’s take a look now at language identification, scoring, and filtering. In CCNet they employed fastText (Bag of Tricks for Efficient Text Classification), which was trained with data from Wikipedia, Tatoeba, and SETimes. fastText supports 176 languages and reports a score for each one of them.

In CCNet, if the score for the most probable language is not higher than 0.5 (50%), they discard the web page; otherwise, the page is labeled with the most probable language.
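The thresholding rule is simple enough to sketch in a few lines. In the snippet below, `predict` is a hypothetical callable that returns a dictionary mapping language codes to scores; it stands in for the language identification model and is not the fastText API. The function names and the `(language, page)` output format are also just assumptions for illustration.

```python
def identify_language(text, predict, threshold=0.5):
    """Return the top language if its score is above threshold, else None."""
    scores = predict(text)  # hypothetical: {"en": 0.93, "pt": 0.04, ...}
    if not scores:
        return None
    lang, score = max(scores.items(), key=lambda kv: kv[1])
    return lang if score > threshold else None


def label_pages(pages, predict, threshold=0.5):
    """Discard low-confidence pages and label the rest with their language."""
    labeled = []
    for page in pages:
        lang = identify_language(page, predict, threshold)
        if lang is not None:
            labeled.append((lang, page))
    return labeled
```

Keeping the threshold as a parameter makes it easy to experiment with stricter values and to inspect what kind of pages fall just below the cut.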

Note that although the LLaMA dataset filtered non-English data from the CC dataset, it was trained with other datasets that had other languages (e.g. Wikipedia). In my experience, LLaMA is impressively good at other languages as well (e.g. Portuguese).

The RefinedWeb pipeline, just like CCNet, also employed fastText to identify languages. One important distinction is that RefinedWeb uses a threshold of 0.65 instead of 0.5. They also switched the order of the two steps: they do language identification before deduplication, whereas CCNet does the reverse.

LM Filtering

At this point, we have deduplicated data, and language was identified and filtered. It doesn’t mean, however, that the quality is good. That’s the reason why CCNet does another filtering step: they use the perplexity from a language model trained on the target domain language, as they found this score to be a relatively good proxy for quality. They train a 5-gram Kneser-Ney model on the Wikipedia of the same language as the target domain and then use these models to compute per-paragraph perplexity.

With the perplexities in hand, you need to find a threshold. The CCNet paper describes splitting the distribution of perplexities of each language into 3 parts of equal size (head, middle, tail), since different languages showed very different perplexity distributions. However, there is an important excerpt from the paper that I quote below:

(…) Some documents despite being valid text ends up in the tail because they have a vocabulary very different from Wikipedia. This includes blog comments with spokenlike text, or very specialized forums with specific jargon. We decided to not remove content based on the LM score because we think that some of it could be useful for specific applications. (…)

This means that it really depends on your application, as you might end up removing important data by blindly thresholding with an LM trained only on Wikipedia. In RefinedWeb, they avoided using an LM for filtering and relied only on simple rules and heuristics. They used a pipeline very similar to the one used in Gopher, where outliers are filtered “... in terms of overall length, symbol-to-word ratio, and other criteria ensuring the document is actual natural language”. They remark that this needs to be done per language as well, which is the usual issue with heuristics that are too tied to the characteristics of a language.
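To illustrate the head/middle/tail split discussed above, here is a minimal sketch that computes perplexity from per-token log-probabilities and then assigns each document to one of three equally sized buckets by rank. It assumes natural-log probabilities coming from some language model (the 5-gram model itself is not shown), and the function names are my own. Since the perplexity distributions differ a lot between languages, the split would be computed separately for each language.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def split_head_middle_tail(perplexities):
    """Label each score as head, middle, or tail by rank (equal thirds).

    Head holds the lowest perplexities, i.e. the text that looks most
    like the data the language model was trained on.
    """
    order = sorted(range(len(perplexities)), key=lambda i: perplexities[i])
    n = len(order)
    buckets = [None] * n
    for rank, idx in enumerate(order):
        if 3 * rank < n:
            buckets[idx] = "head"
        elif 3 * rank < 2 * n:
            buckets[idx] = "middle"
        else:
            buckets[idx] = "tail"
    return buckets
```

Labeling documents instead of dropping them keeps the decision reversible: you can train on the head only, or on head plus middle, without recomputing the scores.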

“Is reference” filtering

This section is not present in CCNet, but it was used as an extra step in the LLaMA dataset, so I decided to add it here as well. This step is also not very well described in LLaMA’s paper, but it seems that a simple linear classifier was trained (I am not sure with which features) to classify pages used as references in Wikipedia vs. randomly sampled pages; pages classified as non-reference were then discarded.

I think that this step might look simple at first glance, but it can have a significant impact on dataset quality depending on the threshold used. It seems to me that for LLaMA’s dataset, the LM filtering was more conservative to avoid removing relevant data, and then they added this extra step to deal with the remaining quality issues, but this is me hypothesizing.

Addendum: RefinedWeb diagram

The RefinedWeb paper has a very nice Sankey diagram of their pipeline:

Image from: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Guilherme Penedo et al. 2023. https://arxiv.org/abs/2306.01116.

This is a very informative figure that tells us how much of the data was discarded. I was personally impressed by the amount of data removed during the repetition removal step.

Closing remarks

Erik Desmazieres’s “La Bibliothèque de Babel”. 1997.

I hope you enjoyed this article. The main goal was to give a brief overview of the steps and decisions you need to take before being able to train a large language model (LLM). There are obviously many other important aspects, such as the proportions of different datasets (mixing), tokenization, etc. Given that the CC dataset is usually the largest dataset in LLM training, I decided to focus on the pipelines that deal directly with this particular dataset before tokenization.

Many design decisions in data pre-processing pipelines are made with performance in mind, as we are dealing with large chunks of data from the CC dataset. It seems to me that you can find a better balance by investing a little more of the computing budget on the data side, especially when you consider the cost of LLM training. It is, however, very difficult to anticipate how different decisions taken in the data pipeline will affect the LLM after training. That’s why small experiments, manual data inspection, and exploratory data analysis are paramount to understanding what is going on.

In summary, every company has the dataset it deserves. It is a long-term investment that requires substantial experimentation, engineering effort, attention to detail, and good intuition to make bets under uncertainty. But it is an investment that pays off in the long run.

– Christian S. Perone

Changelog

6 June 2023: updated with more information made available by RefinedWeb paper.
3 June 2023: first version published.

Cite this article as: Christian S. Perone, "Large language model data pipelines and Common Crawl (WARC/WAT/WET)," in Terra Incognita, 03/06/2023, https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/.

