The mathematics behind the “curse of dimensionality” | Written by Maxim Wolf | April 2024



Delve deeper into the concept of the “curse of dimensionality” and understand the mathematics behind all the amazing phenomena that occur in higher dimensions.

Maxim Wolf
Towards data science
Image from Dall-E

Handling high-dimensional vectors is not just commonplace in machine learning; it is essential. The architecture of popular models such as Transformers illustrates this. For example, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes, which helps it capture complex patterns in the data. That is quite remarkable considering that our brains struggle to visualize anything beyond three dimensions.

Some machine learning and deep learning models excel in these high-dimensional settings, but high dimensionality also brings many challenges. In this article, we explore the concept of the “curse of dimensionality,” describe some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss the general implications for machine learning models.

Please note that a detailed mathematical proof related to this article is available on my website as a supplementary extension to this article.

People often assume that geometric concepts familiar from three dimensions behave similarly in higher-dimensional spaces. This is not the case. As the dimension increases, many interesting and counterintuitive phenomena arise. The “curse of dimensionality” is a term coined by Richard Bellman (a famous mathematician) to refer to all these surprising effects.

What's special about higher dimensions is how the “volume” of the space (more on this in a moment) grows exponentially. Take a (one-dimensional) graduated line from 1 to 10: it contains 10 integers. Extend it to two dimensions and you get a square with 10 × 10 = 100 points with integer coordinates. Now consider “only” 80 dimensions: you would already have 10⁸⁰ points, which is roughly the estimated number of atoms in the universe.

In other words, as the dimension increases, the volume of the space grows exponentially, and the data becomes increasingly sparse.
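To make the sparsity concrete, here is a minimal sketch that counts the integer grid points from the example above and shows how thinly a fixed-size dataset is spread over them. The function names and the dataset size of one million samples are illustrative assumptions, not part of the original article.

```python
def grid_points(points_per_axis, dimension):
    """Number of points with integer coordinates on a grid with the given resolution per axis."""
    return points_per_axis ** dimension


def samples_per_grid_point(num_samples, points_per_axis, dimension):
    """Average number of samples available per grid point for a fixed-size dataset."""
    return num_samples / grid_points(points_per_axis, dimension)


# Hypothetical dataset of one million samples, 10 grid points per axis
for n in (1, 2, 3, 6, 10, 80):
    print(n, grid_points(10, n), samples_per_grid_point(1_000_000, 10, n))
```

With a fixed number of samples, the average count per grid point drops by a factor of 10 for every added dimension, which is one way to read “increasingly sparse”.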

High-dimensional space is “empty”

Consider another example. Suppose we want to compute the largest distance between two points in a unit hypercube (where each side has length 1):

  • In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
  • In 2 dimensions (the hypercube is a square), the maximum distance is the length of the diagonal between the opposite corners [0,0] and [1,1], which is √2 by the Pythagorean theorem.
  • Extending this to n dimensions, the distance between the points [0,0,…,0] and [1,1,…,1] is √n. This formula arises because each additional dimension adds a squared 1 to the sum under the square root (again by the Pythagorean theorem).

Interestingly, as the number of dimensions n increases, the maximum distance within the hypercube grows only as O(√n). This is a diminishing returns effect: each increase in dimension yields a proportionally smaller gain in distance. This effect and its implications are explored in more detail in the next section of this article.
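The √n growth can be checked directly with a few lines of Python. This is only a sketch of the Pythagorean computation described above; the function name is an assumption made for illustration.

```python
import math


def max_distance_unit_hypercube(n):
    """Distance between the corners (0, ..., 0) and (1, ..., 1) of the n-dimensional unit hypercube."""
    # Each of the n coordinates contributes (1 - 0)^2 = 1 to the sum under the square root
    return math.sqrt(sum((1 - 0) ** 2 for _ in range(n)))


for n in (1, 2, 3, 100, 10_000):
    print(n, max_distance_unit_hypercube(n))
```

Going from 100 to 10,000 dimensions multiplies the number of coordinates by 100, but the longest possible distance only by 10.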

Let's dig deeper into the concept of distance that we started exploring in the previous section.

We had a first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this actually mean, and can we see this phenomenon mathematically?

Consider an experiment using the same n-dimensional unit hypercube defined earlier. First, we generate a dataset by randomly sampling many points in this cube, which effectively simulates a multivariate uniform distribution. Then we sample another point (the “query” point) from the same distribution and measure its distance to its nearest and farthest neighbors in the dataset.

The corresponding Python code is:

import numpy as np

def generate_data(dimension, num_points):
    ''' Generate random data points within [0, 1] for each coordinate in the given dimension '''
    data = np.random.rand(num_points, dimension)
    return data

def neighbors(data, query_point):
    ''' Returns the distances from query_point to its nearest and farthest points in data '''
    nearest_distance = float('inf')
    farthest_distance = 0
    for point in data:
        # Euclidean distance between the current point and the query point
        distance = np.linalg.norm(point - query_point)
        if distance < nearest_distance:
            nearest_distance = distance
        if distance > farthest_distance:
            farthest_distance = distance
    return nearest_distance, farthest_distance

You can also plot these distances.

Distance between nearest and farthest points as n increases (Image by author)

On a logarithmic scale, we observe that the relative difference between the distances to the nearest and farthest neighbors tends to decrease as the dimension increases.

This is very unintuitive behavior: as explained in the previous section, the volume of the space grows exponentially, so the points are very sparse, yet at the same time the relative differences between the distances become smaller.
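One way to quantify this effect is to compute the relative contrast (farthest − nearest) / nearest for a random query point. The sketch below is a vectorized variant of the experiment described above; the function name, the sample size of 1,000 points, and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np


def relative_contrast(dimension, num_points=1000, seed=0):
    """(farthest - nearest) / nearest for one random query point in the unit hypercube."""
    rng = np.random.default_rng(seed)
    data = rng.random((num_points, dimension))   # uniform samples in [0, 1]^dimension
    query = rng.random(dimension)                # query point from the same distribution
    distances = np.linalg.norm(data - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


for n in (2, 10, 100, 1000):
    print(n, relative_contrast(n))
```

In low dimensions the farthest point is many times farther away than the nearest one; as the dimension grows, the contrast shrinks toward 0.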

The concept of nearest neighbor disappears

This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, this poses problems for machine learning algorithms that rely solely on distances, such as kNN.

Next, let's talk about some other interesting phenomena. For this, we need the n-ball. An n-ball is the generalization of a ball to n dimensions. The n-ball of radius R is the set of points at distance at most R from the center of the space, 0.

Consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk bounded by the unit circle, with equation x² + y² ≤ 1. The 3-ball (what we usually call a “ball”) has equation x² + y² + z² ≤ 1. This definition extends to any dimension.

The question here is: what is the volume of this ball? This is not a trivial question and requires quite a bit of mathematics, which I won't go into in detail here. However, all the details can be found in my post on the volume of the n-ball on my website.

After a lot of fun (integral calculus), one can prove that the volume of an n-ball of radius R can be expressed as follows, where Γ denotes the gamma function:

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n$$

For example, for n = 2 the gamma factor is Γ(2) = 1, so the volume is πR², which equals π when R = 1. This is indeed the “volume” of the 2-ball (also called the “area” of a disk in this case).
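The formula is easy to evaluate numerically. The sketch below assumes the standard closed form V_n(R) = π^(n/2) · Rⁿ / Γ(n/2 + 1) and uses `math.gamma` from the Python standard library; the function name is an illustrative assumption.

```python
import math


def ball_volume(n, radius=1.0):
    """Volume of the n-ball of the given radius: pi^(n/2) * R^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n


print(ball_volume(1))  # length of the segment [-1, 1]
print(ball_volume(2))  # area of the unit disk
print(ball_volume(3))  # volume of the usual unit ball
```

For n = 1, 2, and 3 this recovers the familiar values 2, π, and 4π/3.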

However, in addition to being an interesting mathematical challenge, the volume of an n-ball also has some very surprising properties.

As the dimension n increases, the volume of the n-ball converges to 0.

This is true for every radius, but let's visualize this phenomenon for a few values of R.

Volume of the n-ball for different radii as the dimension increases (image by author)

As you can see, the volume does not simply converge to 0: it first increases and then decreases toward 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n at which the maximum is reached shifts to the right as R increases.
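This rise and fall can be verified by scanning the dimensions and looking for the maximum. The sketch below again assumes the standard closed form π^(n/2) · Rⁿ / Γ(n/2 + 1); the helper names and the search limit of 100 dimensions are assumptions made for illustration.

```python
import math


def ball_volume(n, radius=1.0):
    """Volume of the n-ball of the given radius: pi^(n/2) * R^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n


def peak_dimension(radius=1.0, max_dim=100):
    """Dimension in 1..max_dim at which the n-ball of the given radius has the largest volume."""
    return max(range(1, max_dim + 1), key=lambda n: ball_volume(n, radius))


for radius in (1.0, 1.5, 2.0):
    print(radius, peak_dimension(radius))

# Far past the peak, the unit ball's volume is vanishingly small
print(ball_volume(100))
```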

Below are the first values of the volume of the unit n-ball, up to n = 10.

Volume of the unit n-ball for different values of n (image by author)

The volume of a higher-dimensional unit sphere is concentrated near the surface.

In small dimensions, the volume of the ball looks very “uniform”, but in higher dimensions this is not the case.

spherical shell

Consider an n-ball of radius R and another of radius R-dR, where dR is very small. The portion of the n-ball that lies between these two balls is called a “shell” and corresponds to the part of the ball near its surface (see 3D visualization above). We can compute the ratio between the volume of the inner ball and the volume of the entire ball, that is, the share of the volume that is not in the thin shell.

Ratio (inner ball volume / total volume) as n increases (image by author)

As you can see, this ratio, (1 − dR/R)ⁿ, converges to 0 very quickly: in high-dimensional spaces, almost all of the volume is near the surface. For example, for R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This shows that in higher dimensions the volume is in the “corners”. This is again related to the distortion of the concept of distance that we saw earlier.
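The 92.3% figure follows from the fact that the volume of an n-ball scales as Rⁿ: the inner ball of radius R − dR holds a fraction (1 − dR/R)ⁿ of the total, so the shell holds the rest. Here is a minimal sketch of that computation; the function and parameter names are illustrative assumptions.

```python
def shell_fraction(n, radius=1.0, shell_thickness=0.05):
    """Fraction of an n-ball's volume lying within shell_thickness of its surface."""
    # Volume scales as radius^n, so the inner ball holds (1 - dR/R)^n of the total
    inner_fraction = (1 - shell_thickness / radius) ** n
    return 1 - inner_fraction


for n in (1, 3, 10, 50, 100):
    print(n, round(shell_fraction(n), 3))
```

The gamma function cancels out of the ratio, so no volume formula is needed here.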

Note that the volume of the hypercube of side length 2 centered at the origin, which encloses the unit ball, is 2ⁿ. The unit ball is essentially “empty” in very high dimensions, whereas this hypercube, in contrast, gets exponentially more points. Again, this shows that when n is large, the idea of a “nearest neighbor” of a point loses its effectiveness, because there are almost no points within distance R of a query point q.
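To see how empty the ball becomes relative to the cube around it, we can compare the two volumes directly. This sketch assumes the cube in question is [−1, 1]ⁿ (side length 2, volume 2ⁿ) and the standard closed form for the volume of the unit ball; the function names are illustrative.

```python
import math


def unit_ball_volume(n):
    """Volume of the unit n-ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def ball_to_cube_ratio(n):
    """Share of the cube [-1, 1]^n (volume 2^n) that is occupied by the unit n-ball."""
    return unit_ball_volume(n) / 2 ** n


for n in (2, 3, 5, 10, 20):
    print(n, ball_to_cube_ratio(n))
```

In two dimensions the disk fills about 79% of its bounding square; by n = 20 the ball occupies a negligible share of the cube, so a uniformly sampled point almost never falls within distance 1 of the center.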

The curse of dimensionality is closely related to overfitting. Because the volume of the space grows exponentially with the dimension, very large datasets are needed to adequately capture and model high-dimensional patterns. Even worse, the number of samples required to overcome this limitation grows exponentially with the dimension. This scenario, characterized by many features yet relatively few data points, is particularly prone to overfitting.

Occam's razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality comes into play), as it encourages reducing model complexity.

Applying Occam's razor to high-dimensional scenarios can mean reducing the dimensionality of the problem itself (through methods such as PCA or feature selection), which mitigates some effects of the curse of dimensionality. Simplifying the model structure or the feature space helps manage the sparse data distribution and makes distance metrics meaningful again. For example, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANN (Approximate Nearest Neighbors), have also emerged as a way to deal with high-dimensional scenarios.
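As a rough illustration of reducing the dimension before a nearest-neighbor search, here is a small NumPy sketch that performs PCA via the singular value decomposition and then runs a brute-force neighbor lookup in the reduced space. The toy data (3 latent factors embedded in 200 dimensions plus noise), the helper names, and the choice of 3 components are all assumptions made for this example; in practice a library implementation would typically be used.

```python
import numpy as np


def pca_project(data, num_components):
    """Project data onto its top principal components (PCA computed via SVD)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


def nearest_neighbor_index(data, query_index):
    """Index of the point closest to data[query_index], excluding the point itself."""
    distances = np.linalg.norm(data - data[query_index], axis=1)
    distances[query_index] = np.inf  # never return the query as its own neighbor
    return int(np.argmin(distances))


rng = np.random.default_rng(0)
# Toy data (assumption): 3 informative latent factors embedded linearly in 200 dimensions
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 200))
data = latent @ mixing + 0.1 * rng.normal(size=(500, 200))

reduced = pca_project(data, num_components=3)
print(reduced.shape)
print(nearest_neighbor_index(reduced, query_index=0))
```

Here the query is one of the dataset's own points; a new query would have to be centered and projected with the same mean and components before the search.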

Image by Dall-E

While we have outlined the challenges of high-dimensional settings in machine learning, there are also some advantages!

  • High dimensions can enhance linear separability, making techniques such as kernel methods more effective.
  • Additionally, deep learning architectures are particularly adept at navigating high-dimensional spaces and extracting complex patterns from them.
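The linear separability point can be illustrated with the classic XOR problem: no straight line separates the two classes in 2D, but adding a single product feature makes them separable by a plane in 3D. The sketch below uses an explicit feature map and hand-picked weights, both of which are assumptions chosen for illustration; kernel methods achieve a similar effect implicitly.

```python
def lift(x1, x2):
    """Map a 2D point to 3D by appending the product feature x1 * x2."""
    return (x1, x2, x1 * x2)


def linear_classifier(features, weights=(1.0, 1.0, -2.0), bias=-0.5):
    """Predict 1 if the point lies on the positive side of the plane, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# XOR labels: not linearly separable in the original 2D space
xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

predictions = {point: linear_classifier(lift(*point)) for point in xor_data}
print(predictions)
```

A single plane in the lifted space classifies all four points correctly.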

As always in machine learning, this is a tradeoff: To take advantage of these benefits, you need to balance the increased computational demands with the potential improvements in model performance.

I hope this gives you an idea of how “weird” geometry can be in higher dimensions, and of the many challenges it poses for machine learning model development. We saw that in high-dimensional spaces, data is not only very sparse but also tends to be concentrated in the corners, and that distances lose their usefulness. If you'd like to learn more about the n-ball and the mathematical proofs, I recommend checking out the expanded version of this article on my website.

The “curse of dimensionality” outlines significant limitations in high-dimensional spaces, but it is exciting to see how modern deep learning models are becoming increasingly adept at navigating these complexities. Consider, for example, embedding models and the latest LLMs, which use very high-dimensional vectors to discern and model textual patterns more effectively.


