Hi,
I’ve been using the semantic_search utility that comes with the sentence-transformers library, and the results have been pretty decent. Yesterday I switched to Weaviate, and pure similarity search is performing a lot worse, i.e. recall is much lower. Following the Indexes page of the Weaviate documentation, I tried adjusting efConstruction and ef, but neither seems to have any effect: not on recall, and not even on query performance. Everything is just unchanged.
Any hints would be much appreciated!
Here’s my class:
class_obj = {
    "class": "Passage",
    "properties": [
        {
            "dataType": ["text"],
            "name": "text",
            "moduleConfig": {
                "text2vec-huggingface": {
                    "skip": False,
                    "vectorizePropertyName": False
                }
            }
        },
        {
            "dataType": ["text"],
            "name": "docid",
            "moduleConfig": {
                "text2vec-huggingface": {
                    "skip": True,
                    "vectorizePropertyName": False
                }
            }
        },
        {
            "dataType": ["text"],
            "name": "passage_id",
            "moduleConfig": {
                "text2vec-huggingface": {
                    "skip": True,
                    "vectorizePropertyName": False
                }
            }
        },
    ],
    "vectorIndexConfig": {
        "distance": "dot",
        "ef": 128,
        "efConstruction": 512
    },
    "invertedIndexConfig": {
        "stopwords": {
            "preset": "none",
            "additions": ["aber", "alle", ...],
            "removals": []
        }
    },
    "vectorizer": "text2vec-huggingface",
    "moduleConfig": {
        "text2vec-huggingface": {
            "model": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
            "options": {
                "waitForModel": True
            },
            "vectorizeClassName": False
        }
    }
}
client.schema.delete_all()
client.schema.create_class(class_obj)
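One way to rule out that the settings were silently ignored is to read the class definition back and compare it with what was sent. The sketch below is a minimal helper for that comparison. It assumes the stored definition is fetched with the v3 Python client's client.schema.get("Passage"); the helper name diff_index_config is hypothetical, and the "returned" dictionary is a hand-written stand-in, not real server output.

```python
def diff_index_config(expected, actual):
    """Return {key: (expected, actual)} for every expected setting that differs."""
    mismatches = {}
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            mismatches[key] = (want, got)
    return mismatches


# Settings sent in class_obj["vectorIndexConfig"].
expected = {"distance": "dot", "ef": 128, "efConstruction": 512}

# Stand-in for client.schema.get("Passage")["vectorIndexConfig"]
# (illustrative values only, not real server output).
returned = {"distance": "dot", "ef": 128, "efConstruction": 128, "maxConnections": 64}

print(diff_index_config(expected, returned))
```

An empty result would suggest the configuration was stored as sent, so the cause of the unchanged numbers would have to lie elsewhere.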
Hi @lnatspacy, that’s a really good question. I don’t know why you’re seeing that, but I’ve passed it on to the team.
In the meantime, what values have you tried?
Hi,
Thank you!
I’ve tried 64, 128, 512, and 1024. I also increased maxConnections to 128.
Nothing seemed to help.
Okay. That’s a bit beyond my knowledge (sorry!), but hopefully someone can get back to you soon.
Hey @lnatspacy, are you increasing efConstruction along with maxConnections? efConstruction controls the size of the queue used when selecting nearest-neighbor edges, so I think that if it is limited to, say, 32 while maxConnections is 128, you won’t get an optimized set of 128 neighbors. Instead, the new neighbors are just less likely to trigger pruning.
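For reference, a vectorIndexConfig that raises both parameters together might look like the sketch below. The values are simply the ones mentioned in this thread, not tuned recommendations. As far as the documentation indicates, efConstruction and maxConnections only take effect while the index is being built, so the class would need to be recreated and the data re-imported after changing them; treat that as an assumption to verify for your Weaviate version.

```python
vector_index_config = {
    "distance": "dot",
    "ef": 128,              # search-time candidate list size
    "efConstruction": 512,  # build-time candidate list size
    "maxConnections": 128,  # maximum connections per element
}

# Sanity check for the situation described above: a build-time queue
# smaller than maxConnections cannot fill all the edge slots well.
assert vector_index_config["efConstruction"] >= vector_index_config["maxConnections"]
```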
Hi @lnatspacy , are you able to provide the number of objects you have stored in this class?
In a small dataset, an ef of 128 vs. 1024 may produce the same results, so it is possible the issue is not with the vector index parameters but with the Weaviate module / sentence-transformers model config. Do you have an example of a query + results working well with sentence transformers but not in Weaviate?
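One way to separate the two possibilities is to run an exact (brute-force) search over the same vectors the database has stored and compare it with what the approximate index returns. The sketch below is a minimal, pure-Python illustration on toy vectors; the function names and the "ann" list are placeholders, not part of any library.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query_vec, doc_vecs, k):
    """Brute-force top-k doc ids by dot product (higher is better)."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def overlap_at_k(exact_ids, ann_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids[:k])
    return len(exact & set(ann_ids[:k])) / len(exact) if exact else 0.0


# Toy vectors standing in for the stored passage embeddings.
doc_vecs = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.2],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query_vec = [1.0, 0.1]

exact = exact_top_k(query_vec, doc_vecs, k=2)
ann = ["a", "c"]  # placeholder for the ids returned by the vector database
print(exact, overlap_at_k(exact, ann, k=2))
```

If the overlap on the stored vectors is close to 1.0, the index is probably not what is losing recall, and the embeddings themselves (model, pooling, normalization, input truncation) would be the next place to compare.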
Hi everyone,
Thank you for your responses!
@CShorten I’ve done both. I have increased efConstruction both with and without also increasing maxConnections; there didn’t seem to be a difference.
My dataset is currently indeed very small, it’s only 26k docs.
@trengrj My searches consist of natural language questions. An example could be “Welche Paragraphen regeln Mord” (roughly: “Which sections govern murder?”).
I should also note how I’m currently measuring recall (which is a bit clumsy):
I have pairs of a question and a target docid. I simply run each question and count for how many of them the target doc appears in the top-k results (k = 5, 10, 20, 100). Note: I always limit to 100 results, and the set consists of ~90 questions.
These are the results:
_sbert_semantic_utility@5 0.6590909090909091
_sbert_semantic_utility@10 0.7613636363636364
_sbert_semantic_utility@20 0.8522727272727273
_sbert_semantic_utility@100 0.9090909090909091
--------------------
similarity_search@5 0.4772727272727273
similarity_search@10 0.5795454545454546
similarity_search@20 0.7045454545454546
similarity_search@100 0.8068181818181818
--------------------
hybrid_search@5 0.6477272727272727
hybrid_search@10 0.7954545454545454
hybrid_search@20 0.875
hybrid_search@100 0.9090909090909091
--------------------
These numbers don’t change at all, regardless of what I put in the vectorIndexConfig.
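The measurement described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the function name recall_at_k, the question keys, and the docids are all made up.

```python
def recall_at_k(results, targets, k):
    """Share of questions whose target docid appears in the top-k results.

    results: {question: [docid, ...]} ranked best-first
    targets: {question: docid}
    """
    if not targets:
        return 0.0
    hits = sum(
        1 for question, target in targets.items()
        if target in results.get(question, [])[:k]
    )
    return hits / len(targets)


# Toy data; the question keys and docids are invented for illustration.
targets = {"q1": "d1", "q2": "d7", "q3": "d3", "q4": "d9"}
results = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d7", "d5"],
    "q3": ["d8", "d6", "d3"],
    "q4": ["d2", "d4", "d6"],
}

for k in (1, 2, 3):
    print(f"recall@{k}", recall_at_k(results, targets, k))
```

With roughly 90 questions, a single question moves the score by about one percentage point, which is worth keeping in mind when comparing small differences between runs.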
For hybrid search, I’m using an alpha of 0.55, which seems to work best for my current test set.
It’s possible that my dataset is just so small that none of the config changes will have any meaningful effect. I was just a little surprised by the relatively poor performance of similarity search compared to an exact nearest-neighbour approach. Obviously, I was expecting some difference, but this seemed a little much.
I’m happy about the results with hybrid search though, you guys have done a good job on that!
For hybrid search, I’m putting a lemmatized version of the text in a separate property that I skip for vectorization. It would be great if Weaviate supported this natively.
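A property like that could be declared along the lines of the sketch below, mirroring the property definitions in the class at the top of the thread. The property name text_lemmatized and the helper vectorized_property_names are hypothetical; the helper only inspects the dictionaries locally and does not talk to Weaviate.

```python
text_property = {
    "dataType": ["text"],
    "name": "text",
    "moduleConfig": {
        "text2vec-huggingface": {"skip": False, "vectorizePropertyName": False}
    },
}

lemma_property = {
    "dataType": ["text"],
    "name": "text_lemmatized",  # hypothetical property name
    "moduleConfig": {
        "text2vec-huggingface": {
            "skip": True,  # keep the lemmatized copy out of the embedding
            "vectorizePropertyName": False,
        }
    },
}


def vectorized_property_names(properties, module="text2vec-huggingface"):
    """Names of the properties that the vectorizer module does not skip."""
    return [
        prop["name"]
        for prop in properties
        if not prop.get("moduleConfig", {}).get(module, {}).get("skip", False)
    ]


print(vectorized_property_names([text_property, lemma_property]))
```

Only the original text contributes to the vector here, while the lemmatized copy stays available to the keyword side of hybrid search.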
Any ideas? I haven’t managed to make any progress on figuring this one out.