.near_text results are not satisfactory (distance scores too close)

amani-acog · June 15, 2023, 7:42am

I am using a locally set up Weaviate instance, and I have a class with the following schema:

class_obj = {
  "class": "Test3",
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}

I then ingested my data. It was in a CSV file, so I generated JSON objects and added them to the class in batches.
Now I want to use .near_text() to get relevant results from the data with this query:

result = (
    client.query
    .get("Test3", ["title", "abstract", "number", "_additional {distance}"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_additional(['certainty'])
    .do()
)

But I am not satisfied with the results returned by Weaviate. My data doesn’t contain any information related to ‘machine learning’, but I still got results with distance < 0.25 and certainty > 0.8. I am using ‘cosine’ similarity here.
I expected no results, but the query returned every object from the file, all with approximately the same similarity.
Note: I tried several keywords that do appear in my data, but the distance for the more similar vs. the irrelevant data is 0.17 vs. 0.24, so even the irrelevant data could be considered relevant.
Please provide some support with this.

jphwang · June 20, 2023, 12:37pm

Hi @amani-acog - the similarity value is determined by the vectors the model produces (in this case OpenAI's), not by Weaviate itself.
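
To make that concrete, here is a minimal sketch of how a cosine distance is derived from two embedding vectors. The vectors are made-up toy values, not real OpenAI embeddings, and the helper names are hypothetical. The certainty conversion assumes the mapping certainty = 1 - distance / 2 that Weaviate documents for cosine distance, so it is worth checking against the docs for your version.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def certainty_from_distance(distance):
    """Assumed mapping for cosine distance: certainty = 1 - distance / 2."""
    return 1.0 - distance / 2.0

# Toy 2-d vectors standing in for a query embedding and a document embedding.
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]

d = cosine_distance(query_vec, doc_vec)
print("distance:", d, "certainty:", certainty_from_distance(d))
```

Under that mapping, a distance of 0.25 corresponds to a certainty of 0.875, which is consistent with the "distance < 0.25 and certainty > 0.8" pair reported above. Both numbers depend only on the vectors the model returns.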

Additionally, the number itself would not necessarily tell you if two values are “sufficiently” similar, as there is a degree of judgment involved. I have seen our users choose whatever threshold works best for them, based on their domain knowledge and intuition.

So I would say that if a particular threshold seems too large, you can always reduce it to one that suits your data.
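
As a rough sketch of applying such a threshold, the snippet below filters a query result on the client side. It assumes the response has the nested shape the Python client returns from `.do()` (`data` → `Get` → class name), and that `distance` was requested under `_additional`. The function name `filter_by_distance`, the titles, and the 0.20 cutoff are hypothetical; the two distances are the 0.17 and 0.24 values mentioned in the question.

```python
def filter_by_distance(response, class_name, max_distance):
    """Keep only objects whose _additional.distance is <= max_distance."""
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [
        obj for obj in objects
        if obj["_additional"]["distance"] <= max_distance
    ]

# Hypothetical response, shaped like the result of the query above.
sample_response = {
    "data": {
        "Get": {
            "Test3": [
                {"title": "Relevant paper", "_additional": {"distance": 0.17}},
                {"title": "Irrelevant paper", "_additional": {"distance": 0.24}},
            ]
        }
    }
}

relevant = filter_by_distance(sample_response, "Test3", max_distance=0.20)
print([obj["title"] for obj in relevant])
```

As far as I know, the same cutoff can also be applied server-side by adding a `"distance"` key next to `"concepts"` in the dict passed to `.with_near_text()`, but please verify that against the documentation for your client version. Either way, the cutoff has to be picked by looking at the distances your own data produces.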

I hope that helps,
JP

dandv · June 20, 2023, 1:42pm

Seconding @jphwang’s answer. I’ve added this as an FAQ.
