Hi @mjsteele,
In your case, I would consider one of the following options:
Option A - BookChunks
The simplest solution would be to introduce a bit of repetition.
So, you could divide your book into chunks of text (e.g. you could split per paragraph, or every x characters).
Then you could create a single collection that would contain:
- title
- book_id
- chapter_number
- text_chunk
But only vectorize on text_chunk.
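To make "only vectorize on text_chunk" a bit more concrete, here is a rough sketch of what the class definition could look like as a plain dict in the JSON schema format. The vectorizer name (text2vec-openai) and the skipped helper are placeholders I made up for illustration, so please swap in whichever module you actually use and double check the property settings against the docs.

```python
# Assumption: "text2vec-openai" is only a placeholder for your vectorizer module.
VECTORIZER = "text2vec-openai"


def skipped(name, data_type):
    """Hypothetical helper: a property that is stored but left out of the vector."""
    return {
        "name": name,
        "dataType": [data_type],
        "moduleConfig": {VECTORIZER: {"skip": True}},
    }


book_chunks_class = {
    "class": "BookChunks",
    "vectorizer": VECTORIZER,
    # Keep the class name out of the vector as well.
    "moduleConfig": {VECTORIZER: {"vectorizeClassName": False}},
    "properties": [
        skipped("title", "text"),
        skipped("book_id", "text"),
        skipped("chapter_number", "int"),
        {
            # The only property that contributes to the vector.
            "name": "text_chunk",
            "dataType": ["text"],
            "moduleConfig": {
                VECTORIZER: {"skip": False, "vectorizePropertyName": False}
            },
        },
    ],
}
```

The other properties are still stored and returned with each result, they just do not influence the vector search.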
Then you could create a new object per chunk and add it to the BookChunks collection. This would later allow you to search across all chunks of text and bring back the relevant results. And since each BookChunk object would contain the title and the chapter info, you would get all the necessary info in one place.
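Preparing the objects could look roughly like the sketch below. It splits each chapter per paragraph, cuts overly long paragraphs every max_chars characters, and repeats the book info on every chunk. The function names, the max_chars default, and the sample book are made up for illustration, and the actual insert into Weaviate is left out.

```python
def split_into_chunks(text, max_chars=1000):
    """Split on blank lines (paragraphs); cut paragraphs longer than max_chars."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        # An empty paragraph yields an empty range, so it is skipped.
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks


def build_chunk_objects(title, book_id, chapters, max_chars=1000):
    """chapters: list of chapter texts in order. Returns one dict per chunk."""
    objects = []
    for chapter_number, chapter_text in enumerate(chapters, start=1):
        for chunk in split_into_chunks(chapter_text, max_chars):
            objects.append({
                "title": title,  # repeated on every chunk
                "book_id": book_id,
                "chapter_number": chapter_number,
                "text_chunk": chunk,
            })
    return objects


# Made-up sample data: two chapters, three paragraphs in total.
objects = build_chunk_objects(
    "Example Book",
    "book-001",
    ["First paragraph.\n\nSecond paragraph.", "Only paragraph of chapter two."],
)
```

Each dict in objects would then become one object in the BookChunks collection.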
If you would like to avoid multiple results coming back per book/chapter, you could use group_by (see the docs for examples):
.with_group_by(
["book_id", "chapter_number"],
groups=5,
objects_per_group=10
)
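If it helps to picture what the two limits do, here is a small local illustration in plain Python. This is not how Weaviate implements grouping internally, only my understanding of the effect: groups caps how many distinct book/chapter groups come back, and objects_per_group caps how many chunks are kept in each. The group_hits function and the sample hits are made up.

```python
def group_hits(hits, keys, groups, objects_per_group):
    """hits: list of dicts, already sorted best match first."""
    grouped = {}
    for hit in hits:
        group_key = tuple(hit[k] for k in keys)
        if group_key not in grouped:
            if len(grouped) == groups:
                continue  # group limit reached, ignore hits from new groups
            grouped[group_key] = []
        if len(grouped[group_key]) < objects_per_group:
            grouped[group_key].append(hit)
    return grouped


# Made-up search results, best match first.
hits = [
    {"book_id": "a", "chapter_number": 1, "text_chunk": "x"},
    {"book_id": "a", "chapter_number": 1, "text_chunk": "y"},
    {"book_id": "b", "chapter_number": 3, "text_chunk": "z"},
    {"book_id": "a", "chapter_number": 1, "text_chunk": "w"},
]
result = group_hits(
    hits, ["book_id", "chapter_number"], groups=2, objects_per_group=2
)
```

Here the fourth hit is dropped, because the group for book "a", chapter 1 already holds two chunks.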
The downside of this approach is that you would insert the same title many times, which would take a bit more space. But in exchange, you would get a solution that is simpler to follow, and since you won't be doing any reference joins, queries should be faster too.
Option B - Books + Chunks
The other approach is what you proposed yourself, but I would skip chapters and go straight from books to text chunks.
Then your book object could contain all the metadata about the book, like the title, author, etc., while each chunk could have a reference to the parent book object.
This approach would save you from data repetition, but you would need to perform reference joins, which will make the solution more complex.
You might still get multiple results per book, so again you could use group_by to merge the results per book.
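As a rough picture of the data layout for Option B, here is a plain Python sketch. In Weaviate the link would be a cross-reference property on the chunk; in this sketch a plain book_id stands in for it, and the merge_per_book function and the sample data are made up for illustration.

```python
# Book metadata is stored once per book.
books = {
    "book-001": {"title": "Example Book", "author": "Example Author"},
}

# Chunks carry only their text and a pointer to the parent book.
chunk_hits = [
    {"book_id": "book-001", "text_chunk": "First matching chunk."},
    {"book_id": "book-001", "text_chunk": "Second matching chunk."},
]


def merge_per_book(hits, books):
    """Follow each chunk's reference and collect the chunks under their book."""
    merged = {}
    for hit in hits:
        book_id = hit["book_id"]
        if book_id not in merged:
            # The "join": copy the book metadata the first time we see it.
            merged[book_id] = {**books[book_id], "chunks": []}
        merged[book_id]["chunks"].append(hit["text_chunk"])
    return merged


merged = merge_per_book(chunk_hits, books)
```

The lookup through books is the extra step that Option A avoids by repeating the title on every chunk.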
Summary
There might be other possible solutions, depending on your exact needs.
But I hope this helps you find the right path.