About the curse of dimensionality


My question is about this topic, which I've been reading about a little. Basically, my understanding is that in high dimensions, all points end up very close to each other.

What I wonder is whether this means that computing distances (e.g. Euclidean) in the usual way is still valid. If it is, it would imply that when comparing vectors in high dimensions, the two most similar ones are barely more similar to each other than either is to a third, completely unrelated vector.

Is that correct? And if so, how can you tell whether you have a match or not?

Actually, the distance measure itself remains mathematically valid; however, with noisy "real world" data, the distances become meaningless.

The effect we're talking about is that a large distance along a single dimension is quickly swamped by small distances along all the other dimensions, so that in the end all points sit at roughly the same distance from each other. Here's a good illustration:
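You can see this concentration effect empirically. The sketch below (my own toy experiment, not from the answer) samples random points in the unit hypercube and compares the nearest and farthest distances from the origin; as the dimension grows, the ratio approaches 1, meaning "near" and "far" stop being distinguishable:

```python
import math
import random

def min_max_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from the origin,
    over uniformly random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    dists = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for d in (2, 10, 100, 1000):
    # The ratio climbs toward 1 as the dimension increases
    print(d, round(min_max_ratio(d), 3))
```

In 2 dimensions the closest point is many times closer than the farthest one; in 1000 dimensions they are nearly the same distance away.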

Say we want to classify data points based on their value in each dimension, where each dimension covers the range [0, 1]. We split each dimension once: values in [0, 0.5] are positive, values in (0.5, 1] are negative. With this rule, in 3 dimensions 12.5% of the space is covered. In 5 dimensions it is only 3.1%, and in 10 dimensions it is less than 0.1%.

So in every dimension we still allow half of the total value range, which sounds quite generous, yet it all adds up to less than 0.1% of the total space. The cut in each individual dimension is modest, but compounded across dimensions it shrinks the covered fraction to almost nothing.

You can go further and cut only 10% of the range in each dimension, allowing values in [0, 0.9]. You still end up with less than 35% of the entire space in 10 dimensions. In 50 dimensions, it is about 0.5%. So even a wide acceptance range in each dimension becomes a very small part of your search space.
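The numbers above are just the per-dimension fraction raised to the number of dimensions, which a few lines of Python can verify:

```python
def covered_fraction(keep_per_dim, dims):
    """Fraction of the unit hypercube kept when every dimension
    retains a fraction `keep_per_dim` of its range."""
    return keep_per_dim ** dims

for dims in (3, 5, 10, 50):
    print(dims,
          f"keep 50%: {covered_fraction(0.5, dims):.3%}",
          f"keep 90%: {covered_fraction(0.9, dims):.3%}")
```

This reproduces the figures in the example: 0.5^3 = 12.5%, 0.5^5 ≈ 3.1%, 0.5^10 ≈ 0.098%, 0.9^10 ≈ 34.9%, and 0.9^50 ≈ 0.5%.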

That's why you need dimensionality reduction, where you basically ignore the differences on the less informative axes.
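The answer doesn't name a specific technique; principal component analysis (PCA) is one common way to keep only the most informative axes. A minimal sketch with NumPy (the data shapes and noise level here are made up for illustration):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its k highest-variance (most informative) axes."""
    Xc = X - X.mean(axis=0)            # center each dimension
    # Right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # coordinates along the top-k axes

# Toy data: 100 points that really vary along only 2 directions,
# embedded in 50 dimensions with a little added noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(100, 50))

Z = pca_reduce(X, 2)
print(Z.shape)
```

Distances computed on `Z` compare points only along the two axes that actually carry signal, instead of accumulating noise across all 50 dimensions.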
