.near_text results are not satisfactory (distance scores too close)

I am using a locally set up Weaviate instance, and I have a class with the following schema:

class_obj = {
  "class": "Test3",
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}

I then ingested my data. The data is in a CSV file, so I generated JSON objects and added them to the class in batches.
Now I want to use .near_text() to get relevant results from the data with this query:

result = (
    client.query
    .get("Test3", ["title", "abstract", "number"])
    .with_near_text({"concepts": ["machine learning"]})
    # Request both scores through with_additional instead of embedding "_additional {distance}" in the property list
    .with_additional(["distance", "certainty"])
    .do()
)

But I am not satisfied with the results returned by Weaviate. My data doesn’t contain any information related to ‘machine learning’, yet I still got results with distance < 0.25 and certainty > 0.8. I am using ‘cosine’ distance here.
I expected to get no results, but the query returned every object from the file with approximately the same similarity.
Note: I also tried several keywords that do appear in my data, but the distance for the relevant results versus the irrelevant ones is about 0.17 versus 0.24, so even the irrelevant results look as if they could be relevant.
Please give me some guidance on this.

Hi @amani-acog - the similarity value comes from the model (in this case OpenAI's) and has nothing to do with Weaviate itself.

Additionally, the number itself would not necessarily tell you if two values are “sufficiently” similar, as there is a degree of judgment involved. I have seen our users choose whatever threshold works best for them, based on their domain knowledge and intuition.

So I would say that if a particular threshold seems too large, you can always reduce it to one that suits your data.
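
As a rough illustration of applying such a threshold, here is a minimal sketch that filters a query response by distance on the client side. The helper name `filter_by_distance`, the sample titles, and the 0.20 cutoff are all assumptions made up for this example; the two distances simply reuse the 0.17 and 0.24 figures from the question.

```python
from typing import Any, Dict, List


def filter_by_distance(
    response: Dict[str, Any], class_name: str, max_distance: float
) -> List[Dict[str, Any]]:
    """Keep only the objects whose distance is at most max_distance."""
    # Responses from client.query.get(...).do() are nested under data -> Get -> <class name>.
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [
        obj for obj in objects
        if obj["_additional"]["distance"] <= max_distance
    ]


# Hypothetical response, shaped like the output of the query in the question.
sample_response = {
    "data": {
        "Get": {
            "Test3": [
                {"title": "Paper A", "_additional": {"distance": 0.17}},
                {"title": "Paper B", "_additional": {"distance": 0.24}},
            ]
        }
    }
}

# Assumed cutoff; pick whatever value separates relevant from irrelevant in your data.
relevant = filter_by_distance(sample_response, "Test3", max_distance=0.20)
print([obj["title"] for obj in relevant])
```

The same cutoff can, as far as I know, also be applied in the query itself by adding a "distance" key to the near_text argument, for example .with_near_text({"concepts": ["machine learning"], "distance": 0.20}), so that objects beyond the threshold are never returned.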

I hope that helps,
JP

Seconding @jphwang’s answer. I’ve added this as an FAQ.