I understand that a similar thread has already been opened.
I tried using group results, but that is not the functionality I am looking for. It only groups the results after retrieval, whereas I need the grouping to happen while the documents are being retrieved.
I need to retrieve the top k distinct documents from the database. For example, if the user requests 5 documents and requires them to be distinct, the query should return the top 5 distinct documents.
The current implementation retrieves the top 5 document chunks, which can all come from the same document.
This was what I tried:
import weaviate
from weaviate.classes.query import GroupBy

# Connect before the try block so that `client` always exists when finally runs
client = weaviate.connect_to_local()
try:
  group_by = GroupBy(
    prop="file_name",
    objects_per_group=1,
    number_of_groups=5,  # Assuming the user requested at most 5 distinct documents
  )
  collection = client.collections.get("my_collection")
  question = "Tell me all the files that are related to XXX."
  response = collection.query.hybrid(
    query=question,
    group_by=group_by,
    limit=5,
    alpha=0.2,
  )
  for obj in response.objects:
    print(obj.properties["file_name"])

  print(f"Number of docs retrieved: {len(response.objects)}")
finally:
  client.close()
I was expecting 5 distinct documents, but the results contained fewer than 5.

Hi!
Maybe, based on your dataset, all 5 of your results fall into the same few groups for that property?
That is one reason I can see for it showing only a few groups.
Can you provide an example dataset and the expected outcome?
Thanks!

I'm not sure if you can open them, but I opened them via Google Scholar.
I tested with the following 7 documents:
- Analysis of Telecommunication Markets of India, Singapore and Thailand
- Developing location-based mobile advertising in Singapore
- Effects of Organisational Culture on Employees Performance: Case of Singapore Telecommunication
- Exploring factors affecting the adoption of mobile commerce in Singapore
- Mobile phones, communities and social networks
- Telecommunications, information and development - the Singapore experience
- The political economy of telecommunications in Malaysia and Singapore
Code used for testing:
import weaviate
from weaviate.classes.query import GroupBy, MetadataQuery

# Connect before the try block so that `client` always exists when finally runs
client = weaviate.connect_to_local()
try:
  group_by = GroupBy(
    prop="file_name",
    objects_per_group=1,
    number_of_groups=5,  # Assuming the user requested at most 5 distinct documents
  )
  collection = client.collections.get("my_collection")
  question = "Tell me all the files that are related to mobile phones and telecommunications."
  response = collection.query.hybrid(
    query=question,
    group_by=group_by,
    limit=5,
    alpha=0.2,
    return_metadata=MetadataQuery(
      distance=True,
      certainty=True,
      score=True,
      explain_score=True,
    ),
  )
  for i, obj in enumerate(response.objects):
    # It only printed 1 document: Mobile phones, communities and social networks.pdf
    print(f"{i+1}. {obj.properties['file_name']}")
    print(obj.metadata)
  print(f"Number of docs retrieved: {len(response.objects)}")  # Printed 1
finally:
  client.close()
I also tested by commenting out the group_by to see which documents the document chunks were retrieved from:
1. Mobile phones, communities and social networks.pdf
2. Mobile phones, communities and social networks.pdf
3. Mobile phones, communities and social networks.pdf
4. Mobile phones, communities and social networks.pdf
5. Mobile phones, communities and social networks.pdf
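One possible reading of this output, stated as an assumption rather than confirmed Weaviate behavior, is that `limit` caps the candidate chunks before they are grouped. The plain-Python sketch below only simulates that reading. The function `group_after_limit`, the file names, and the scores are all hypothetical and do not come from the real collection.

```python
def group_after_limit(ranked_chunks, limit, number_of_groups, objects_per_group):
    """Cut the ranked list to `limit` chunks first, then group what is left.

    ranked_chunks: list of (file_name, score) tuples, best score first.
    Returns a dict mapping file_name to the scores kept for that file.
    """
    groups = {}
    for file_name, score in ranked_chunks[:limit]:
        bucket = groups.setdefault(file_name, [])
        if len(bucket) < objects_per_group:
            bucket.append(score)
    # Dicts keep insertion order, so the first groups are the best-ranked ones
    return dict(list(groups.items())[:number_of_groups])


# Hypothetical ranking: the five best chunks all come from the same file
ranked = [
    ("A.pdf", 0.95), ("A.pdf", 0.93), ("A.pdf", 0.90), ("A.pdf", 0.88),
    ("A.pdf", 0.85), ("B.pdf", 0.80), ("C.pdf", 0.78), ("B.pdf", 0.75),
    ("D.pdf", 0.70), ("E.pdf", 0.65), ("F.pdf", 0.60),
]

narrow = group_after_limit(ranked, limit=5, number_of_groups=5, objects_per_group=1)
wide = group_after_limit(ranked, limit=len(ranked), number_of_groups=5, objects_per_group=1)

print(len(narrow))  # one group, because only A.pdf chunks survive the cut
print(list(wide))   # five distinct files once the cut is wide enough
```

If this reading is right, a `limit` larger than the number of groups requested would let more documents reach the grouping step. That is worth testing against the real collection before relying on it.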
Expected outcome:
My intention is that once X chunks have been retrieved from a document, where X is the configured objects_per_group, retrieval from that document should stop and move on to the next document. The retrieved chunks should be the ones with the top scores.
What I am trying to do is akin to this SQL statement in a relational database:
SELECT file_name, content, score
FROM (
  SELECT file_name, content, score,
     ROW_NUMBER() OVER (PARTITION BY file_name ORDER BY score DESC) as rank
  FROM documents
) AS ranked_docs
WHERE rank = 1
ORDER BY score DESC
LIMIT 5;
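To pin down the intended semantics outside SQL, here is a minimal plain-Python sketch of the same selection. It is not a Weaviate API and not a server-side solution. The function `top_k_distinct` and the sample rows are hypothetical, and it assumes the candidate rows are already in memory with `file_name`, `content`, and `score` keys.

```python
def top_k_distinct(rows, k, per_group=1):
    """Keep the best-scoring rows from at most k distinct files.

    rows: iterable of dicts with 'file_name', 'content' and 'score' keys.
    per_group: maximum rows kept per file (1 mirrors `WHERE rank = 1`).
    Returns the kept rows ordered by score, highest first.
    """
    counts = {}
    picked = []
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        name = row["file_name"]
        if name not in counts and len(counts) >= k:
            continue  # k distinct files are already represented
        if counts.get(name, 0) >= per_group:
            continue  # this file has reached its per-file cap
        counts[name] = counts.get(name, 0) + 1
        picked.append(row)
    return picked


# Hypothetical candidate chunks
rows = [
    {"file_name": "A.pdf", "content": "a1", "score": 0.95},
    {"file_name": "A.pdf", "content": "a2", "score": 0.93},
    {"file_name": "B.pdf", "content": "b1", "score": 0.80},
    {"file_name": "C.pdf", "content": "c1", "score": 0.78},
    {"file_name": "B.pdf", "content": "b2", "score": 0.75},
    {"file_name": "D.pdf", "content": "d1", "score": 0.70},
]

one_each = [r["content"] for r in top_k_distinct(rows, k=3)]
two_each = [r["content"] for r in top_k_distinct(rows, k=3, per_group=2)]
print(one_each)
print(two_each)
```

With `per_group=1` this matches the SQL above: the best chunk per file, ordered by score, cut off at k files. With a larger `per_group`, `k` still counts distinct files rather than rows.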