Filter and retrieve distinct documents

SoftwearEnginear · July 3, 2024, 9:57am

I understand that a similar thread has already been opened.

I tried using group results, but that is not the functionality I am looking for. It only groups the results after retrieval, whereas I need the grouping to be applied while the documents are being retrieved.

I need to retrieve the top k distinct documents from the database. For example, if the user requests 5 documents and requires them to be distinct, it should return the top 5 distinct documents.

The current implementation retrieves the top 5 document chunks, several of which can come from the same document.

This is what I tried:

from weaviate.classes.query import GroupBy
import weaviate

try:
  client = weaviate.connect_to_local()

  group_by = GroupBy(
    prop="file_name",
    objects_per_group=1,
    number_of_groups=5,  # Assuming the user requested at most 5 distinct documents
  )

  collection = client.collections.get("my_collection")
  question = "Tell me all the files that are related to XXX."
  response = collection.query.hybrid(
    query=question,
    group_by=group_by,
    limit=5,
    alpha=0.2,
  )

  for obj in response.objects:
    print(obj.properties['file_name'])

  print(f"Number of docs retrieved: {len(response.objects)}")

finally:
  client.close()

I was expecting 5 distinct documents, but the results contained fewer than 5.

DudaNogueira · July 3, 2024, 7:06pm

Hi!

Maybe, based on your dataset, all 5 of your results belong to the same grouped property value?

That is one reason I can see for it showing only a few groups.

Can you provide an example dataset and the expected outcome?
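Here is a rough sketch of what I mean, in plain Python with no Weaviate calls. It assumes grouping is applied to the `limit` hits after they have been retrieved, as you described. The function `group_after_retrieval` and the sample hits are made up for illustration and are not how the server is actually implemented.

```python
def group_after_retrieval(hits, prop, number_of_groups, objects_per_group):
    """Group an already-retrieved, score-ordered hit list by one property."""
    groups = {}  # insertion-ordered: property value -> hits kept for it
    for hit in hits:
        key = hit[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                continue  # group budget used up, ignore hits from new groups
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(hit)
    return groups


# Hypothetical top-5 chunks (limit=5) that all come from the same file.
hits = [{"file_name": "doc_a.pdf", "chunk": i} for i in range(5)]

groups = group_after_retrieval(
    hits, "file_name", number_of_groups=5, objects_per_group=1
)
print(f"Number of groups: {len(groups)}")
```

If every one of the 5 retrieved chunks shares the same `file_name`, only one group can be formed, no matter what `number_of_groups` is set to.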

Thanks!

SoftwearEnginear · July 4, 2024, 2:25am

I’m not sure if you can open them, but I opened them via Google Scholar.
I tested with the following 7 documents:

Code used for testing:

from weaviate.classes.query import GroupBy, MetadataQuery
import weaviate

try:
  client = weaviate.connect_to_local()

  group_by = GroupBy(
    prop="file_name",
    objects_per_group=1,
    number_of_groups=5,  # Assuming the user requested at most 5 distinct documents
  )

  collection = client.collections.get("my_collection")
  question = "Tell me all the files that are related to mobile phones and telecommunications."
  response = collection.query.hybrid(
    query=question,
    group_by=group_by,
    limit=5,
    alpha=0.2,
    return_metadata=MetadataQuery(
      distance=True,
      certainty=True,
      score=True,
      explain_score=True,
    ),
  )

  for i, obj in enumerate(response.objects):
    print(f"{i+1}. {obj.properties['file_name']}")  # It only printed 1 document: Mobile phones, communities and social networks.pdf
    print(obj.metadata)
  print(f"Number of docs retrieved: {len(response.objects)}")  # Printed 1

finally:
  client.close()

I also tested by commenting out the group_by to see which documents the document chunks were retrieved from:

1. Mobile phones, communities and social networks.pdf
2. Mobile phones, communities and social networks.pdf
3. Mobile phones, communities and social networks.pdf
4. Mobile phones, communities and social networks.pdf
5. Mobile phones, communities and social networks.pdf

Expected outcome:
My intention is that once X chunks have been retrieved from a document, where X is the defined objects_per_group, retrieval from that document should stop and move on to the next document. Among the retrieved chunks, the ones with the highest scores should come first.
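To make the rule concrete, here is a rough Python sketch of the selection I have in mind. It is not something Weaviate provides. The function `cap_per_document` and the sample chunks are made up for illustration, and it assumes the chunks are already sorted by score, best first.

```python
def cap_per_document(ranked_chunks, objects_per_group, number_of_groups):
    """Walk score-ordered chunks, keeping at most objects_per_group per file
    and at most number_of_groups distinct files."""
    taken = {}  # file_name -> number of chunks kept so far
    kept = []
    for chunk in ranked_chunks:
        name = chunk["file_name"]
        count = taken.get(name, 0)
        if count >= objects_per_group:
            continue  # this document already reached its cap
        if count == 0 and len(taken) >= number_of_groups:
            continue  # would start a new document beyond number_of_groups
        taken[name] = count + 1
        kept.append(chunk)
    return kept


# Hypothetical ranked chunks, highest score first.
ranked = [
    {"file_name": "a.pdf", "score": 0.9},
    {"file_name": "a.pdf", "score": 0.8},
    {"file_name": "b.pdf", "score": 0.7},
    {"file_name": "a.pdf", "score": 0.6},
    {"file_name": "c.pdf", "score": 0.5},
    {"file_name": "b.pdf", "score": 0.4},
]

one_each = cap_per_document(ranked, objects_per_group=1, number_of_groups=5)
print([c["file_name"] for c in one_each])
```

With `objects_per_group=1`, the second and third chunks of `a.pdf` are skipped, so `b.pdf` and `c.pdf` still make it into the result even though they score lower.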

What I am trying to do is akin to this SQL statement in a relational database:

SELECT file_name, content, score
FROM (
  SELECT file_name, content, score,
         ROW_NUMBER() OVER (PARTITION BY file_name ORDER BY score DESC) AS chunk_rank
  FROM documents
) AS ranked_docs
WHERE chunk_rank = 1
ORDER BY score DESC
LIMIT 5;
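To show what that query shape returns, here is a small sketch that runs it against an in-memory SQLite database from Python. The table contents are made-up sample rows, not my real data, and the window alias is called `chunk_rank` here to stay clear of the `RANK` keyword in some databases. It assumes a SQLite build with window function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (file_name TEXT, content TEXT, score REAL)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        # Hypothetical chunks: three from a.pdf, one each from b.pdf and c.pdf.
        ("a.pdf", "chunk 1", 0.9),
        ("a.pdf", "chunk 2", 0.8),
        ("b.pdf", "chunk 1", 0.7),
        ("a.pdf", "chunk 3", 0.6),
        ("c.pdf", "chunk 1", 0.5),
    ],
)

# Keep only the best-scoring chunk per file, then take the top 5 files.
rows = conn.execute(
    """
    SELECT file_name, content, score
    FROM (
      SELECT file_name, content, score,
             ROW_NUMBER() OVER (PARTITION BY file_name ORDER BY score DESC) AS chunk_rank
      FROM documents
    ) AS ranked_docs
    WHERE chunk_rank = 1
    ORDER BY score DESC
    LIMIT 5
    """
).fetchall()
conn.close()

for file_name, content, score in rows:
    print(file_name, content, score)
```

Each file appears once, represented by its highest-scoring chunk, which is the behaviour I would like from the retrieval step itself.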
