Choosing a schema for Chunked documents

TweedBeetle · October 31, 2023, 3:05pm

Let’s say I would like to create a vector of Books, each of which has multiple Chapters.

I would like to perform embedding similarity searches on entire books but also on individual chapters.

Would it be better to create a single Book class that has a property of type object that contains an array of Chapters?

Or would it make more sense to create both a Book and a Chapter class and have Books contain an array of refs to Chapters?

I am not sure what considerations I need to take into account.

Any help appreciated!

mjsteele · October 31, 2023, 7:15pm

I believe you’d need to create a separate Chapter class so that you can do vector search on it. That is what I’m doing with a Document class and a DocumentParagraph class, and it works well. I search on DocumentParagraph and pull metadata by traversing the cross-reference to Document.

sebawita · November 2, 2023, 12:49am

Hi @mjsteele,

In your case I would consider one of the following options:

Option A - BookChunks

The simplest solution would be to introduce a bit of repetition.

So, you could divide your book into chunks of text (e.g. you could split it per paragraph, or every x characters).

Then you could create a single collection that would contain:

title
book_id
chapter_number
text_chunk

But only vectorize on text_chunk.
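
As a rough sketch of what that collection could look like, here is a class definition written as a Python dict in the Weaviate schema format. The vectorizer module (text2vec-openai) and the data types are my assumptions, not something stated in this thread, so swap in whatever your instance actually uses. The per-property "skip" setting is what keeps the metadata out of the vector.

```python
# Hypothetical class definition for the single-collection layout (Option A).
# Assumption: a text2vec-* vectorizer module is enabled; text2vec-openai is
# only a placeholder here.
VECTORIZER = "text2vec-openai"


def metadata_property(name: str, data_type: str) -> dict:
    """Describe a property that is stored but left out of the embedding."""
    return {
        "name": name,
        "dataType": [data_type],
        "moduleConfig": {VECTORIZER: {"skip": True}},
    }


book_chunks_class = {
    "class": "BookChunks",
    "vectorizer": VECTORIZER,
    "properties": [
        metadata_property("title", "text"),
        metadata_property("book_id", "text"),
        metadata_property("chapter_number", "int"),
        {
            # The only property that contributes to the embedding.
            "name": "text_chunk",
            "dataType": ["text"],
            "moduleConfig": {
                VECTORIZER: {"skip": False, "vectorizePropertyName": False}
            },
        },
    ],
}
```

With the v3 Python client, a dict like this could then be passed to client.schema.create_class(book_chunks_class).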

Then you could create a new object per chunk and add it to the BookChunks collection. That would later allow you to search across all chunks of text and bring back the relevant results. And since each BookChunk object would contain the title and the chapter info, you could get all the necessary info in one place.

If you would like to avoid multiple results coming back per book/chapter, you could use group_by (see the docs for examples).
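
To make the "one object per chunk" idea concrete, here is a minimal sketch in plain Python. The helper names (split_into_chunks, build_chunk_objects) and the 500-character default are made up for illustration. It splits each chapter per paragraph, caps each piece at a maximum length, and repeats the book metadata on every chunk.

```python
from typing import Dict, Iterator, List


def split_into_chunks(text: str, max_chars: int = 500) -> Iterator[str]:
    """Split on blank lines (paragraphs), then cap each piece at max_chars."""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        # An empty paragraph yields nothing because range(0, 0) is empty.
        for start in range(0, len(paragraph), max_chars):
            yield paragraph[start:start + max_chars]


def build_chunk_objects(
    book_id: str, title: str, chapters: List[str], max_chars: int = 500
) -> List[Dict]:
    """Build one dict per chunk, repeating the book metadata on each."""
    objects = []
    for chapter_number, chapter_text in enumerate(chapters, start=1):
        for chunk in split_into_chunks(chapter_text, max_chars):
            objects.append(
                {
                    "title": title,
                    "book_id": book_id,
                    "chapter_number": chapter_number,
                    "text_chunk": chunk,
                }
            )
    return objects
```

Each dict in the returned list maps onto one object in the BookChunks collection, so you could feed the list to whatever batch import you already use.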

    .with_group_by(
        ["book_id", "chapter_number"],
        groups=5,
        objects_per_group=10
    )
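
If it helps to picture what groups and objects_per_group do, here is a small plain-Python illustration of the idea. This is not how Weaviate implements grouping, and group_hits is a made-up helper. It just walks a best-first result list, keeps the first few distinct groups, and caps how many hits each group may hold.

```python
from typing import Dict, List, Tuple


def group_hits(
    hits: List[dict], keys: List[str], groups: int, objects_per_group: int
) -> Dict[Tuple, List[dict]]:
    """Group best-first hits by the given keys, capping both dimensions."""
    grouped: Dict[Tuple, List[dict]] = {}
    for hit in hits:  # hits are assumed to be sorted best match first
        group_key = tuple(hit[k] for k in keys)
        if group_key not in grouped:
            if len(grouped) == groups:
                continue  # enough groups already, ignore new ones
            grouped[group_key] = []
        if len(grouped[group_key]) < objects_per_group:
            grouped[group_key].append(hit)
    return grouped
```

So with groups=5 and objects_per_group=10 you would get back at most five groups, each holding at most ten of its best-matching chunks.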

The downside of this approach is that you would insert the same title many times, so it would take a bit more space. But in exchange, you would get a simpler solution to follow, and since you won’t be doing any reference joins, queries should be faster too.

Option B - Books + Chunks

The other approach is what you proposed yourself, but I would skip chapters and go straight from books to text chunks.
Then your book object could contain all the metadata about the book, like the title, author, etc., while each chunk could have a reference to the parent book object.

This approach would save you from data repetition, but you would need to perform reference joins, which will make the solution more complex.
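
Here is a rough sketch of how the two classes could be laid out, again as Python dicts in the Weaviate schema format. The names Chunk and fromBook are placeholders I picked for illustration. A cross-reference is a property whose dataType names the target class instead of a primitive type, and the last line shows the kind of property selection you would request to pull book metadata through that reference.

```python
# Hypothetical class definitions for the Books + Chunks layout (Option B).
book_class = {
    "class": "Book",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "author", "dataType": ["text"]},
    ],
}

chunk_class = {
    "class": "Chunk",
    "properties": [
        {"name": "text_chunk", "dataType": ["text"]},
        # Cross-reference: the dataType is the name of the target class.
        {"name": "fromBook", "dataType": ["Book"]},
    ],
}


def reference_properties(class_def: dict, known_classes: set) -> list:
    """List the properties of class_def that point at another class."""
    return [
        prop["name"]
        for prop in class_def["properties"]
        if prop["dataType"][0] in known_classes
    ]


# Properties to request when searching chunks: the chunk text plus the
# parent book's metadata, resolved through the reference.
chunk_query_properties = [
    "text_chunk",
    "fromBook { ... on Book { title author } }",
]
```

The Book class has to exist before Chunk is created, since the reference property points at it.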

You might still get multiple results per book, so again you could use group_by to merge the results per book.

Summary

There might be other possible solutions, depending on your exact needs.
But I hope this helps you find the right path.
