Feature discussion: parent-document retrieval and scoring across chunked objects
Description
I would like to open a discussion around parent-document retrieval and scoring over chunked objects.
A common pattern when using Weaviate is to split large documents into chunks/pages because the full document may be too large to store, embed, or retrieve as a single object. Each chunk object then has a property such as doc_id pointing to the parent document.
In our use case, which is mainly legal and regulatory search, users are usually not looking for the most relevant isolated chunks. They are looking for the most relevant parent documents: judgments, regulations, contracts, reports, legal commentary, and so on.
From my current understanding, groupBy is helpful for presentation, but it works after retrieval. The search itself still happens at chunk/object level first, and only the retrieved chunks are then grouped by doc_id.
The capability I would like to discuss is whether Weaviate could support, now or in the future, some form of native retrieval and scoring over all chunks sharing the same parent identifier, such as doc_id.
I fully understand that this may be a complex search/indexing problem rather than a small operator-level feature. It may touch indexing, scoring, aggregation, query execution, distributed search semantics, and API design. So I am not presenting this as something simple or obvious to add. I am sharing it as a retrieval problem that we encounter in practice and that I think may also be relevant to other document-heavy use cases.
Current behavior, as I understand it
Assume we have a Chunk collection where each object represents one page or passage of a larger document:
{
  "doc_id": "document_123",
  "page": 1,
  "content": "..."
}
Search currently happens at the individual chunk/object level.
groupBy can group retrieved chunks by doc_id, but the grouping happens after retrieval. This means that parent documents are only considered if one or more of their chunks were already retrieved by the initial chunk-level search.
This is useful, and we do use similar post-processing approaches, but it does not fully solve cases where the relevance of a document is distributed across multiple chunks.
Example: keyword / BM25 search
Suppose a user searches for documents containing:
A B C
A relevant parent document may contain all three terms, but distributed across different chunks:
doc_id = document_123
chunk 1 contains A
chunk 2 contains B
chunk 7 contains C
No single chunk contains A, B, and C together.
At the parent-document level, document_123 is relevant because the document as a whole contains all three terms. But at chunk level, this is difficult to express natively.
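To make the gap concrete, here is a minimal client-side sketch in plain Python, not Weaviate code. The chunk records, the whitespace tokenizer, and the function names are assumptions I made up for illustration; document_456 is a hypothetical second document added for contrast.

```python
from collections import defaultdict

# Hypothetical chunk records mirroring the example above.
chunks = [
    {"doc_id": "document_123", "page": 1, "content": "A"},
    {"doc_id": "document_123", "page": 2, "content": "B"},
    {"doc_id": "document_123", "page": 7, "content": "C"},
    {"doc_id": "document_456", "page": 1, "content": "A B"},
]


def tokens(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


def chunk_level_matches(chunks, query):
    """Chunks that contain every query term on their own."""
    wanted = tokens(query)
    return [c for c in chunks if wanted <= tokens(c["content"])]


def parent_level_matches(chunks, query):
    """doc_ids whose chunks, taken together, contain every query term."""
    wanted = tokens(query)
    terms_by_doc = defaultdict(set)
    for c in chunks:
        terms_by_doc[c["doc_id"]] |= tokens(c["content"])
    return sorted(d for d, terms in terms_by_doc.items() if wanted <= terms)


print(chunk_level_matches(chunks, "A B C"))   # []
print(parent_level_matches(chunks, "A B C"))  # ['document_123']
```

The second function is the behavior I am asking about: the match is evaluated over the union of a parent's chunks rather than per chunk.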
Today, one workaround is to use a separate DBMS/search index to compute the matching doc_ids, then pass those IDs back into Weaviate as a filter. This can work, but it means duplicating part of the search logic outside Weaviate and can become awkward when the list of matching parent documents is large.
Example: semantic / vector / hybrid search
A similar issue can appear in semantic or hybrid retrieval.
A query may cover two separate topics, conditions, or legal criteria. A document may be highly relevant because it covers both dimensions, but on different pages:
doc_id = document_123
chunk 1 matches topic/condition X
chunk 8 matches topic/condition Y
Another document may only cover topic/condition X, but mention it repeatedly across several chunks.
At chunk level, both documents may look similarly relevant because individual chunks only capture partial relevance. But at parent-document level, the first document may be more complete because it covers more dimensions of the query.
So the issue is not only whether exact terms are present across chunks. It is also that chunk-level retrieval can fail to capture document-level completeness when relevance is distributed across several parts of the same parent document.
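As a rough illustration of what I mean by completeness, the sketch below compares a "best single chunk" parent score with a simple coverage score. Everything here is an assumption for the sake of the example: the similarity numbers are invented, the query is assumed to be already split into two facets X and Y, document_456 is a hypothetical document that only covers X, and averaging the best per-facet similarity is just one possible aggregation.

```python
from collections import defaultdict

# Toy per-chunk similarities to two query facets (invented numbers).
# Each row: (doc_id, page, {facet: similarity}).
chunk_facet_sims = [
    ("document_123", 1, {"X": 0.75, "Y": 0.25}),
    ("document_123", 8, {"X": 0.25, "Y": 0.75}),
    ("document_456", 2, {"X": 0.75, "Y": 0.25}),
    ("document_456", 3, {"X": 0.75, "Y": 0.25}),
    ("document_456", 5, {"X": 0.75, "Y": 0.25}),
]


def best_chunk_score(rows):
    """Parent score = score of its single best chunk (mean over facets)."""
    best = defaultdict(float)
    for doc_id, _page, sims in rows:
        chunk_score = sum(sims.values()) / len(sims)
        best[doc_id] = max(best[doc_id], chunk_score)
    return dict(best)


def coverage_score(rows):
    """Parent score = mean over facets of the best similarity in any chunk."""
    best_per_facet = defaultdict(dict)
    for doc_id, _page, sims in rows:
        for facet, sim in sims.items():
            previous = best_per_facet[doc_id].get(facet, 0.0)
            best_per_facet[doc_id][facet] = max(previous, sim)
    return {d: sum(f.values()) / len(f) for d, f in best_per_facet.items()}


print(best_chunk_score(chunk_facet_sims))
# {'document_123': 0.5, 'document_456': 0.5}
print(coverage_score(chunk_facet_sims))
# {'document_123': 0.75, 'document_456': 0.5}
```

With chunk-level scoring the two documents tie; only the parent-level view separates the document that covers both X and Y from the one that repeats X.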
Existing workaround and limitation
I know that one possible workaround is to retrieve a larger pool of chunks first, then apply post-search logic in the application layer.
For example, developers can:
- use groupBy, or group retrieved chunks manually by doc_id;
- aggregate chunk scores per parent document;
- apply custom scoring logic to decide which parent documents should be returned;
- return the best supporting chunks for each selected document.
This is broadly the kind of approach we currently use.
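For reference, the steps above can be sketched roughly as follows. This is a simplified illustration and not our production code: the shape of `hits` (dicts with doc_id, page, and score), the function name, and the choice of summing the top chunk scores per parent are all assumptions.

```python
from collections import defaultdict


def aggregate_by_parent(hits, top_docs=10, chunks_per_doc=2):
    """Group retrieved chunk hits by doc_id and rank the parents.

    Assumed aggregation: a parent's score is the sum of its best
    `chunks_per_doc` chunk scores. Other choices (max, mean, decay)
    would slot in at the same place.
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["doc_id"]].append(hit)

    results = []
    for doc_id, doc_hits in grouped.items():
        doc_hits.sort(key=lambda h: h["score"], reverse=True)
        supporting = doc_hits[:chunks_per_doc]
        results.append({
            "doc_id": doc_id,
            "score": sum(h["score"] for h in supporting),
            "supporting_chunks": supporting,
        })

    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_docs]


# Hypothetical chunk hits, as they might come back from a chunk-level search.
hits = [
    {"doc_id": "document_123", "page": 1, "score": 0.5},
    {"doc_id": "document_456", "page": 4, "score": 0.625},
    {"doc_id": "document_123", "page": 8, "score": 0.25},
]

for result in aggregate_by_parent(hits):
    pages = [h["page"] for h in result["supporting_chunks"]]
    print(result["doc_id"], result["score"], pages)
# document_123 0.75 [1, 8]
# document_456 0.625 [4]
```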
However, unless I am missing something, this still depends on the initial chunk-level retrieval pool. The parent-document scoring only happens after Weaviate has already selected which chunks to retrieve.
As a result, it does not fully solve the cases described above:
- a document may be missed if its relevance is distributed across chunks that individually do not rank highly enough;
- a document that covers several different query dimensions across different pages may be under-ranked;
- another document that repeats only one part of the query across several chunks may appear similarly or more relevant;
- for keyword/BM25-like use cases, developers may still need an external DBMS or search index to compute matching doc_ids before querying Weaviate.
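The first point can be shown with a toy simulation of the retrieval pool. The scores, the document names, the pool size, and the summation used as the parent score are all invented for illustration; the only claim is that any aggregation applied after a top-k cut cannot see chunks that fell outside the cut.

```python
from collections import defaultdict


def parent_scores(hits):
    """Sum chunk scores per doc_id (assumed aggregation)."""
    totals = defaultdict(float)
    for doc_id, score in hits:
        totals[doc_id] += score
    return dict(totals)


def top_k(hits, k):
    """Simulate the chunk-level retrieval pool: keep the k best chunks."""
    return sorted(hits, key=lambda h: h[1], reverse=True)[:k]


# Invented chunk scores for three hypothetical documents.
all_chunk_scores = [
    ("doc_single_hit", 0.75),
    ("doc_other", 0.625),
    ("doc_distributed", 0.5),
    ("doc_distributed", 0.5),
    ("doc_distributed", 0.5),
]

pool = top_k(all_chunk_scores, k=2)
print(parent_scores(pool))
# {'doc_single_hit': 0.75, 'doc_other': 0.625}
print(parent_scores(all_chunk_scores)["doc_distributed"])
# 1.5
```

Here doc_distributed would rank first if all of its chunks were considered, but it never enters the pool, so no amount of post-search grouping can recover it.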
So while post-search grouping and custom aggregation are useful, they are not quite the same as search-time parent-document retrieval and scoring. The question is whether Weaviate could eventually support retrieval/ranking where the parent document is treated as the logical searchable unit, even though the stored/indexed objects are chunks.
Possible desired behavior
One possible direction could be a parent-document retrieval mode where a property such as doc_id defines the logical search unit.
For example, the query could specify something conceptually like:
parent_property = doc_id
and Weaviate would retrieve/rank parent documents using all chunks belonging to the same parent.
Some potentially useful modes could be:
- Boolean/keyword matching across all chunks of a parent document.
- BM25 scoring where the logical searchable unit is the parent document, even though the stored objects are chunks.
- Hybrid search where lexical and vector signals are aggregated at parent-document level.
- Semantic/vector search where multiple chunks from the same parent can jointly contribute to the parent document’s score.
- Returning parent documents while optionally also returning the best supporting chunks.
For example, instead of only returning the top chunks, Weaviate could perhaps return something conceptually like:
{
  "doc_id": "document_123",
  "score": 0.87,
  "supporting_chunks": [
    {
      "page": 1,
      "reason": "matches condition X"
    },
    {
      "page": 8,
      "reason": "matches condition Y"
    }
  ]
}
This is only a conceptual example, not a proposed API.
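To clarify the scoring semantics I have in mind for the BM25 mode listed above, here is a small client-side sketch that treats the parent document as the BM25 unit by merging chunk tokens per doc_id. It uses a common textbook form of BM25 with k1 = 1.2 and b = 0.75 as assumed defaults; it is not meant to describe how Weaviate implements BM25 internally, and the tokenizer, helper names, and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def merge_chunks_by_parent(chunks):
    """Concatenate the tokens of all chunks that share a doc_id."""
    merged = defaultdict(list)
    for c in chunks:
        merged[c["doc_id"]].extend(c["content"].lower().split())
    return dict(merged)


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """BM25 over {doc_id: tokens}, where each entry is a whole parent."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs

    # Document frequency is counted per parent, not per chunk.
    doc_freq = Counter()
    for toks in docs.values():
        doc_freq.update(set(toks))

    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical sample data: one document spreads A, B, C over three chunks,
# the other repeats A in every chunk.
sample_chunks = [
    {"doc_id": "document_123", "page": 1, "content": "A"},
    {"doc_id": "document_123", "page": 2, "content": "B"},
    {"doc_id": "document_123", "page": 7, "content": "C"},
    {"doc_id": "document_456", "page": 1, "content": "A"},
    {"doc_id": "document_456", "page": 2, "content": "A"},
    {"doc_id": "document_456", "page": 3, "content": "A"},
]

parents = merge_chunks_by_parent(sample_chunks)
scores = bm25_scores(parents, "A B C")
print(max(scores, key=scores.get))  # document_123
```

Doing this client-side requires pulling every chunk out of the database, which is exactly what we would like to avoid; the sketch is only meant to pin down the ranking behavior.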
Why this would be useful
This would be very useful for legal, regulatory, scientific, financial, and enterprise document search.
In these domains, documents are often long and must be chunked for technical reasons. But the user’s mental model is usually document-level retrieval:
“Find me the most relevant judgments, contracts, regulations, reports, or scientific papers.”
not:
“Find me the most relevant isolated chunks.”
Chunking is often an implementation detail. The desired search result is still the parent document.
Native parent-document retrieval/scoring could reduce the need for external search systems, client-side aggregation, large doc_id filters, or maintaining duplicate document-level indexes.
Possible complexity
I understand this may be difficult to implement efficiently, especially in a distributed system. It may require parent-aware indexing, aggregation before final ranking, and clear scoring semantics for how multiple child chunks should contribute to a parent score.
There may also be reasons why this does not fit Weaviate’s architecture, or why it should remain application-layer logic. I may also be missing an existing recommended pattern.
So I am not presenting this as a simple missing feature, but rather as a retrieval capability that would be valuable in our use case and possibly in other document-heavy use cases.
Version
We are currently using Weaviate version: YOUR_VERSION_HERE
Question
Has Weaviate considered this type of parent-document retrieval/scoring, or is there an existing recommended pattern that better addresses this problem?
Thanks again for the work on Weaviate.