The image shows the data in 2D. They are clustered as expected.
Each ID is present ten times (a so-called group). For each group, I go through all 10 descriptions and search for the 10 nearest neighbours. This means that, across all 10 requests for a group, only products of the same group should be returned. But this does not happen.
Here is an example of this for the first data point in Group 7:
7 7
7 7
7 7
7 7
7 7
7 3
7 7
7 7
7 7
7 4
Groups 3 and 4 are far, far away from Group 7, and Group 7 is tightly clustered, so only results from Group 7 are expected.
EDIT:
Funnily enough, when projecting the data to 3 dimensions but only using the first two elements of the vector, I get the following representation:
Now the results actually make sense, but only when projecting to 3 dimensions and then treating the data as if it were only two-dimensional, ignoring the 3rd coordinate and thus losing a lot of information.
Hi @micartey - I’m curious. Is this with “real” vector embeddings from a language model, or arbitrary, low-dimensional vectors (like [7, 7])?
I ask as I’ve seen this behaviour before. I am not sure why this happens - but I suspect it’s some sort of side effect of ANN/HNSW. But I think the ANN search works well with “real” vector embeddings.
One way to test whether it works with real embeddings would be to build, say, a 300D array with floats between 0 and 1, or even to actually generate vectors yourself with a model. Then test it by brute-force search using a NumPy array (or something similar) versus using Weaviate.
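As a rough illustration of that idea (random vectors standing in for real embeddings, and only the brute-force half shown; the Weaviate side would be a nearVector query with the same vector):

```python
import numpy as np

# Random 300-D vectors standing in for "real" embeddings.
rng = np.random.default_rng(42)
vectors = rng.random((100, 300)).astype(np.float32)

def brute_force_top_k(query, vectors, k=10):
    """Exact nearest neighbours by cosine distance (no ANN involved)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - v @ q           # cosine distance to every stored vector
    return np.argsort(distances)[:k]  # indices of the k closest vectors

# With perfect recall, these indices should match the objects Weaviate
# returns for a nearVector search with the same query vector.
print(brute_force_top_k(vectors[0], vectors))
```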
Hi @micartey thanks for explaining! Sounds like I definitely didn’t understand your question haha. Sorry about that.
So let me try to understand your issue better - is the nearText algorithm not fetching the right vectors with the smallest distances? As in, if you had a Numpy array with the same vectors, would the results be very different?
I guess I’m trying to figure out whether the problem is:
1. in producing the embedding;
2. in looking at the reduced dimensions with UMAP; or
3. in the search results the database returns.
The projection to 3 and 2 dimensions looks as expected, but the distances returned by the database when searching seem to be wrong.
If you could tell me how I can retrieve all the data without altering it (no projections, just the “real” vectors), I can calculate the distances using Euclidean distance and tell you more. I just need to know how to get the full / real vector of my data points. If the results are different, it is most likely the ANN.
I would be interested to see what is going on. Please do keep in mind that ANN recall isn’t perfect (that’s what the ‘A’ in ANN stands for: approximate).
Hmm. Actually, that reminds me that the distance metric should be dot if you are using the default model for text2vec-cohere (Weaviate’s default distance is cosine). I wonder if this is causing issues.
(Source: Multilingual Embed Models)
It definitely shouldn’t be Euclidean. So if you do get the vectors manually, please compare the dot products.
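For reference, pulling the stored vectors back out could look something like this sketch (assuming the v3 Python client and a hypothetical “Product” class with a “group” property; adjust to the real schema):

```python
import numpy as np
import weaviate

client = weaviate.Client("http://localhost:8080")

# The `_additional { vector }` field exposes the stored vector of each object.
result = (
    client.query
    .get("Product", ["group"])   # hypothetical class/property names
    .with_additional("vector")
    .with_limit(200)
    .do()
)

objects = result["data"]["Get"]["Product"]
vectors = np.array([o["_additional"]["vector"] for o in objects])
groups = [o["group"] for o in objects]

# Dot-product similarity of the first object against all stored objects
# (higher means more similar for dot-product based models).
sims = vectors @ vectors[0]
top10 = np.argsort(-sims)[:10]
print([groups[i] for i in top10])
```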
Following the documentation, setting the distance to dot is not the issue, as I only used English data. Sadly, I could not test it, as my Cohere trial key just hit the monthly rate limit.
But the issue is the same with OpenAI.
To give you a relative score:
Cohere: 40.95 %
OpenAI: 45.09 %
I expected at least 90% due to the visualizations in both 2D and 3D.
Just two points are outside their clusters. A cluster has 10 elements, so normally the 10 nearest results for any point in a cluster should come from the same cluster (the first result is trivial, as it is the data point itself). This behaviour is data-independent as far as I can tell.
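The kind of check being described would look roughly like this (a sketch, not the actual code from this thread; it reuses the hypothetical “Product” class and “group” property from above, with `vectors` as plain Python lists and `groups` as their labels):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

def group_recall(vectors, groups, k=10):
    """For every point, fetch the k nearest neighbours from Weaviate and
    count how many of them belong to the same group ("relative score")."""
    hits, total = 0, 0
    for vec, grp in zip(vectors, groups):
        result = (
            client.query
            .get("Product", ["group"])
            .with_near_vector({"vector": vec})
            .with_limit(k)
            .do()
        )
        neighbours = result["data"]["Get"]["Product"]
        hits += sum(1 for n in neighbours if n["group"] == grp)
        total += k
    return hits / total  # ~1.0 would mean every neighbour is from the same group
```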
Two things - first, the default text2vec-cohere model is the multilingual one, so I think you should still be using dot products, AFAIK, as the vectors aren’t normalized.
But disregarding that, it’s really hard to comment on what is going on without looking at the source data. These models compare semantic (word) similarity, so it’s difficult to tell whether this is being caused by the ANN algorithm having imperfect recall, or by the descriptions not really reflecting their clustering.
For example, the descriptions could be similar even though the products have vastly different colors, IDs and categories, because those are just small parts of the whole text that is being vectorized.
It is a dataset I generated with GPT-4 for a research paper I have to write in my 4th semester (nothing big). I generated 10 categories with 10 product descriptions each. Importantly, a product description does NOT contain the category as a word. I want to test how good embeddings + vector databases are at understanding the semantic meaning of sentences, and investigate how well they work for content-based recommendation systems built on product descriptions. Judging from the plotted data in both 2D and 3D I would tend to say: really good. But that can’t be the case, given what is returned when querying the data.
Therefore, I am only vectorizing the descriptions (as in the schema I posted before).
I don’t mind sending you the dataset as well as the code via mail if you want to take a look at it.
I just don’t want to post it here for fear of plagiarism.
You could export the vectors and then evaluate their similarities using numpy/scipy. It doesn’t look like a huge dataset, so it should definitely be viable from a compute perspective.
I exported the dataset, wrote a little python script and the results are as expected – just like in the 2D representations and hence different from what the database returns for a query.
Although, I have to add that I used Euclidean distance for the exported dataset. As far as I understood, Euclidean distance is not used because of efficiency and the “curse of dimensionality”?
Is it sensible to project the data to 2D and then use Euclidean distance, or to use Euclidean distance directly on the full vectors - and if not, why?
Using the angle between vectors doesn’t take into account how far away a point lies along that direction, only differences in angle.
From my point of view, it is hardly usable if the results are only 50% correct due to “efficiency”. I’d rather give up some accuracy by projecting the data to 3D or 2D and use something like Euclidean distance to get to the 90-95%. But there are a lot of people more clever than me, so why is that not being done?
Projecting the data to 2/3D first and evaluating a distance is not a good method due to the loss of information.
Using the wrong (e.g. Euclidean when the model is trained on cosine) distances on vectors will produce incorrect results.
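A tiny made-up example of how the two metrics can disagree (not data from this thread): for unnormalized vectors, Euclidean and cosine distance can pick different nearest neighbours.

```python
import numpy as np

query = np.array([1.0, 1.0])
a = np.array([2.0, 2.0])   # same direction as the query, but further out
b = np.array([1.0, 0.5])   # different direction, but physically closer

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Euclidean says b is the nearest neighbour, cosine says a is.
print(np.linalg.norm(query - a), np.linalg.norm(query - b))   # ~1.414 vs 0.5
print(cosine_distance(query, a), cosine_distance(query, b))   # 0.0   vs ~0.051
```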
I’ll try my best with longer explanations
Distance metrics
Models (like OpenAI’s ada-002) that produce embeddings are trained to produce vectors that make sense in the context of the distance metric. In other words, the distance metric (like cosine) is the target that these models are trained on.
Changing the measurement distance after the vectors are produced would be like having athletes run a 100m race and then determining the winner by a different metric like the runner’s heights.
Projecting Dimensions
In terms of projecting a vector to 2D or 3D, the problem is that you lose way too much information. Imagine representing the geography of the Earth, or even a colour, in 1D. You will lose a lot of information because you can only capture one aspect, like how dark, how blue, or how yellow something is. The same thing happens when you reduce 300 or 1,500 numbers down to two.
So all these algorithms have to make choices about how much information to lose and where. Algorithms like PCA aim to lose the least amount of information, while others like UMAP or t-SNE aim to make clusters easier to visualise.
But none of them preserve the “real” distances, in the same way that no 1D representation of a colour will.
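A quick way to see this (a sketch on made-up data, assuming scikit-learn for the PCA step): project high-dimensional points to 2D and check how many of each point’s nearest neighbours survive the projection. On random data the overlap will typically be small, which is exactly the information loss being described; real, clustered embeddings fare better but are still distorted.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 points in 300 dimensions (standing in for real embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))

def top_k(dist_matrix, i, k=10):
    return set(np.argsort(dist_matrix[i])[1:k + 1])  # skip the point itself

# Pairwise Euclidean distances in the full space vs. after a 2D PCA projection.
full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
X2 = PCA(n_components=2).fit_transform(X)
proj = np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=-1)

# Fraction of each point's 10 nearest neighbours that survive the projection.
overlap = np.mean([len(top_k(full, i) & top_k(proj, i)) / 10 for i in range(100)])
print(f"average neighbour overlap after 2D projection: {overlap:.2f}")
```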
I had the time to test cosine similarity on the dataset for Cohere, which also uses cosine by default. The result was ~97% using Python and only ~52% using Weaviate, while operating on the “real” vectors - no projection to lower dimensions, etc.
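For reference, a script of that kind could look roughly like this (a sketch, not the exact script used here; it assumes `vectors` and `groups` exported from the database, e.g. as in the earlier sketch):

```python
import numpy as np

def local_group_score(vectors, groups, k=10):
    """Brute-force cosine version of the 'relative score': for each point,
    take its k most cosine-similar vectors and count same-group matches."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T                      # full cosine-similarity matrix
    hits, total = 0, 0
    for i, grp in enumerate(groups):
        top = np.argsort(-sims[i])[:k]  # includes the point itself, as above
        hits += sum(1 for j in top if groups[j] == grp)
        total += k
    return hits / total
```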