How to get unique results based on references

Description

Hi all! I have a question about how to write a search query correctly. I have two classes in my schema:

  1. Metadata for the document:
    Document {
    docname: …,
    author: …,

    }
  2. Chunks of the documents, each with a reference:
    Chunk {
    textBody: …,
    refToDocument: reference to the Document object
    }
So, searching through the chunks works well and I get good results, but the problem appears when I want results for 30 documents. Because I’m searching in the chunks, the ‘limit’ field is useless here: it counts chunks, not documents. The only way I see is to run the search with the maximum limit, then “page” the results manually and split them into blocks of 30 documents. That is an ugly and potentially resource-wasteful solution. Does Weaviate offer another way to do this?

Server Setup Information

  • Weaviate Server Version: 1.24.1
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python, 4.5.1

Hi @Spun ! Welcome to our community! :hugs:

Let me know if I understood it correctly.

Do you want to search the chunks of 30 specific documents you have?

If that’s the case, you can search the chunks and filter using their cross-references.

You will get all the chunks, specifically from those 30 documents, that are the closest to your query.

Let me know if this helps :slight_smile:

No, I have a different scenario. For example, I have 100K documents, each with 10 chunks (so 1M chunks in total). I search for “animals” and get 100K results: 100K chunk texts that contain something interesting about animals (it could be fewer if a chunk contains two or more different sentences about animals). But a few chunks refer to doc1, one chunk refers to doc2, 10 chunks refer to doc3, and so on (I don’t know the exact number of documents; it could be 10K, 40K or 60K). From these results I need only doc1, …, doc30, not all 10K or more documents. That is, I need the chunks that refer to the first 30 unique Document objects only, and of course I can’t predict which documents those will be. Then I will need doc31, …, doc60. But in my case the ‘limit’ and ‘offset’ options work on chunks only, not on documents. That’s why I’m looking for a solution similar to what I could do in SQL databases.
As I wrote previously, I see only one solution: get ALL the results from the search (there could be hundreds of thousands) and parse them with Python (or any other client) to build a “cache” for “animals”, then work with that cache until the search request changes. To me that’s a bad solution :(
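
To make the “cache” idea concrete, here is a minimal sketch in plain Python. It assumes the chunk hits have already been fetched and arrive best match first; `ChunkHit`, `group_by_document` and `document_page` are hypothetical names for illustration, not part of the Weaviate client.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class ChunkHit(NamedTuple):
    """One search hit. Field names are hypothetical, not Weaviate's."""
    doc_id: str  # ID of the Document the chunk refers to
    text: str    # the chunk's textBody


def group_by_document(hits: Iterable[ChunkHit]) -> Dict[str, List[str]]:
    """Group chunk texts by document.

    Documents keep the order of their best-ranked chunk, because dicts
    preserve insertion order and hits arrive best match first.
    """
    grouped: Dict[str, List[str]] = {}
    for hit in hits:
        grouped.setdefault(hit.doc_id, []).append(hit.text)
    return grouped


def document_page(
    grouped: Dict[str, List[str]], page: int, page_size: int = 30
) -> List[Tuple[str, List[str]]]:
    """Return one zero-based page of (doc_id, chunk_texts) pairs."""
    doc_ids = list(grouped)
    start = page * page_size
    return [(d, grouped[d]) for d in doc_ids[start:start + page_size]]
```

Page 0 would then be doc1 to doc30 and page 1 would be doc31 to doc60, but only after every hit has been pulled to the client, which is exactly the cost I would like to avoid.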

Hm, thinking a little more, I see another solution:

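# pseudocode
limit = 30
offset = 0
docs = []  # unique documents, in order of first appearance
while len(docs) < 30:
  results = search_animals_in_chunks(limit, offset)  # Weaviate search inside
  if not results:  # no more chunks to scan
    break
  add_unique_documents_from_chunks(results, docs)  # parse references here
  offset += limit  # move on to the next page of chunks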
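
A slightly fuller, runnable sketch of that loop, under some assumptions: `search` is a stand-in callable taking `(limit, offset)` and returning the referenced document ID of each matching chunk, best match first. It is not a real Weaviate call, and `next_unique_documents` is a hypothetical helper. The function also returns the offset and the set of seen documents so the next call can continue with doc31 to doc60.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

# Stand-in for a real Weaviate query: (limit, offset) -> document IDs of
# the matching chunks, best match first. The shape is an assumption.
SearchFn = Callable[[int, int], Sequence[str]]


def next_unique_documents(
    search: SearchFn,
    wanted: int = 30,
    chunk_limit: int = 30,
    offset: int = 0,
    seen: Optional[Set[str]] = None,
) -> Tuple[List[str], int, Set[str]]:
    """Scan chunk pages until `wanted` not-yet-seen documents are found.

    Returns (new_doc_ids, next_offset, seen) so the caller can resume.
    """
    seen = set() if seen is None else seen
    found: List[str] = []
    while len(found) < wanted:
        doc_ids = search(chunk_limit, offset)
        if not doc_ids:  # result set exhausted
            break
        for i, doc_id in enumerate(doc_ids):
            if doc_id not in seen:
                seen.add(doc_id)
                found.append(doc_id)
                if len(found) == wanted:
                    # Stop mid-page; resume right after this chunk.
                    return found, offset + i + 1, seen
        offset += len(doc_ids)  # advance by a whole page, not by 1
    return found, offset, seen
```

Calling it again with the returned offset and `seen` set gives the next block of documents. One caveat, as far as I know: Weaviate caps `offset + limit` through the `QUERY_MAXIMUM_RESULTS` setting (10,000 by default), so deep scans like this may hit that ceiling.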

What do you think about this solution?

I could also upgrade the previous code to use the after parameter instead of offset. It looks more suitable for my needs, but it seems it doesn’t work with query searches. I need to check.

Sadly, after and sort don’t work with query searches (bm25, context, hybrid) :frowning:

Hi @DudaNogueira ! It seems I’ve created the topic in the wrong forum section. Could you please move it to General?
Thanks )