Hi community,
I would like to use a hybrid metric (sparse + dense) to compute the similarity between two sentences, but I'm struggling with Weaviate's hybrid search. Is it possible to compute the similarity between two sentences using only where_filter and with_hybrid?
When I do this:
```python
code_content = "CNT16421"
clean_user_def_mem_recall = "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."
where_filter = {
    "path": ["code"],
    "operator": "Equal",
    "valueString": code_content,
}

# Use hybrid search to compute the similarity score between the theoretical
# and user memory-recall definitions
payload = {"text_list": [clean_user_def_mem_recall]}
clean_user_def_mem_recall_embed_list = await post_query_main(api_sent_embed_address_mr, payload)

search = (
    client.query
    .get("MemoryRecallLeitner", ["code", "memoryRecallDefinition"])
    .with_where(where_filter)
    .with_additional("score")
    .with_hybrid(clean_user_def_mem_recall, alpha=0.75, vector=clean_user_def_mem_recall_embed_list[0])
    .with_limit(1)
    .do()
)
```
I get the following result:

```
[{'_additional': {'score': '0.016393442'}, 'code': 'CNT16421', 'memoryRecallDefinition': "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."}]
```
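One detail I noticed while reading this score (my own observation, so please correct me if I'm wrong): 0.016393442 looks suspiciously like 1/61, which is exactly what reciprocal rank fusion with the usual constant k = 60 assigns to a rank-1 document. If the hybrid score is rank-based rather than distance-based, the top hit of a filtered single-object result would always get this value:

```python
# If hybrid scoring is rank fusion (RRF-like, with the common constant k = 60),
# a document at rank 1 in both the sparse and dense result lists gets
# alpha * 1/(k+1) + (1 - alpha) * 1/(k+1) = 1/(k+1), independent of alpha
# and independent of how similar it actually is to the query.
k = 60
rank = 1
rrf_score = 1 / (k + rank)
print(rrf_score)  # ~0.016393..., matching the returned score
```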
But if I modify the query:
```python
code_content = "CNT16421"
clean_user_def_mem_recall = "C'est la biodiversité quoi."
where_filter = {
    "path": ["code"],
    "operator": "Equal",
    "valueString": code_content,
}

# Use hybrid search to compute the similarity score between the theoretical
# and user memory-recall definitions
payload = {"text_list": [clean_user_def_mem_recall]}
clean_user_def_mem_recall_embed_list = await post_query_main(api_sent_embed_address_mr, payload)

search = (
    client.query
    .get("MemoryRecallLeitner", ["code", "memoryRecallDefinition"])
    .with_where(where_filter)
    .with_additional("score")
    .with_hybrid(clean_user_def_mem_recall, alpha=0.75, vector=clean_user_def_mem_recall_embed_list[0])
    .with_limit(1)
    .do()
)
```
I still get exactly the same score:

```
[{'_additional': {'score': '0.016393442'}, 'code': 'CNT16421', 'memoryRecallDefinition': "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."}]
```
By contrast, if I use a pure dense search on both input queries, I naturally get different scores:
```python
code_content = "CNT16421"
clean_user_def_mem_recall = "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."
where_filter = {
    "path": ["code"],
    "operator": "Equal",
    "valueString": code_content,
}

# Use pure dense (near-vector) search to compute the similarity score between
# the theoretical and user memory-recall definitions
payload = {"text_list": [clean_user_def_mem_recall]}
clean_user_def_mem_recall_embed_list = await post_query_main(api_sent_embed_address_mr, payload)
nearVector = {"vector": clean_user_def_mem_recall_embed_list[0]}

search = (
    client.query
    .get("MemoryRecallLeitner", ["code", "memoryRecallDefinition"])
    .with_where(where_filter)
    .with_additional("certainty")
    .with_near_vector(nearVector)
    .with_limit(1)
    .do()
)
```
This gives:

```
[{'_additional': {'certainty': 0.999999612569809}, 'code': 'CNT16421', 'memoryRecallDefinition': "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."}]
```
Whereas:
```python
code_content = "CNT16421"
clean_user_def_mem_recall = "C'est la biodiversité quoi."
where_filter = {
    "path": ["code"],
    "operator": "Equal",
    "valueString": code_content,
}

# Use pure dense (near-vector) search to compute the similarity score between
# the theoretical and user memory-recall definitions
payload = {"text_list": [clean_user_def_mem_recall]}
clean_user_def_mem_recall_embed_list = await post_query_main(api_sent_embed_address_mr, payload)
nearVector = {"vector": clean_user_def_mem_recall_embed_list[0]}

search = (
    client.query
    .get("MemoryRecallLeitner", ["code", "memoryRecallDefinition"])
    .with_where(where_filter)
    .with_additional("certainty")
    .with_near_vector(nearVector)
    .with_limit(1)
    .do()
)
```
gives this result:

```
[{'_additional': {'certainty': 0.8760087788105011}, 'code': 'CNT16421', 'memoryRecallDefinition': "La biodiversité, c'est l'ensemble des espèces vivantes qui existent sur la planète et qui vivent ensemble en interdépendance."}]
```
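So the dense scores do behave as I expect. For reference, if I understand the docs correctly, with the cosine distance metric Weaviate's certainty is just a rescaled cosine similarity, certainty = (1 + cosine) / 2, so these values can be mapped back to plain cosine similarity:

```python
def certainty_to_cosine(certainty: float) -> float:
    """Invert Weaviate's certainty = (1 + cosine_similarity) / 2 mapping
    (valid for the cosine distance metric, if my reading of the docs is right)."""
    return 2 * certainty - 1

print(certainty_to_cosine(0.999999612569809))   # ~1.0   (near-identical sentences)
print(certainty_to_cosine(0.8760087788105011))  # ~0.752 (loose paraphrase)
```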
Could anyone advise me on how to properly compute a hybrid score between two sentences, given that the reference sentence is stored in the Weaviate database (located with where_filter) while the second one is the query provided as input to the Weaviate search?
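In the meantime, my fallback is to blend the two signals myself on the client side: cosine similarity of the embeddings as the dense part, and a crude token-overlap score standing in for the sparse part. This is only a sketch of the idea (the function names are my own, not Weaviate API), mirroring the alpha weighting of with_hybrid:

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def token_overlap(s1, s2):
    # Crude sparse signal: Jaccard overlap of lower-cased tokens
    # (a real BM25 score would be better, this is just a stand-in).
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def hybrid_score(vec1, vec2, s1, s2, alpha=0.75):
    # alpha weights the dense part, (1 - alpha) the sparse part,
    # by analogy with the alpha parameter of with_hybrid.
    return alpha * cosine_similarity(vec1, vec2) + (1 - alpha) * token_overlap(s1, s2)
```

This at least yields a score that moves with the query, but I would much prefer to get it from Weaviate directly if hybrid search supports this use case.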