Best way to vectorize and store a large document in Weaviate?

alt-glitch · August 10, 2023, 8:19pm

Hi!
I am going to be vectorising and storing web pages, and ideally I want fast and accurate search that imitates a “Google-like” search as closely as possible. Since web pages tend to be on the longer side for text documents, what tips should I follow when vectorising them?
Here are the points I am confused about:

1. I want search results to be web pages (and not the chunk-level documents). Should I vectorize the whole page at once or split it into chunks?
2. Splitting a web page into chunks (documents), storing their embeddings, and then running Weaviate’s hybrid search on them gives me the relevant “documents”, many of which are chunks of the same web page. I believe the autocut feature prunes some web pages which might be relevant, while still showing some which aren’t (because one of their chunks happened to be slightly relevant). How might I overcome this?
3. What steps could I follow to reduce the latency of Weaviate searches? Does using my own vectors make it faster?

I’d really appreciate any pointers and tips I can get to be able to get the “parent” document from the search results on document chunks.
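
To make the “parent document” idea in point 2 concrete, here is a minimal sketch of rolling chunk-level hits up into page-level results on the client side. It assumes each hit has already been fetched and carries a `url` field pointing at its parent page and a `score` where higher is better; both field names, and the function `rank_pages`, are hypothetical and not part of Weaviate’s API. Taking the best chunk score per page is only one possible aggregation.

```python
from collections import defaultdict


def rank_pages(chunk_hits, top_k=10):
    """Collapse chunk-level hits into one result per parent page.

    chunk_hits: iterable of dicts with a 'url' (parent page) and a
    'score' (higher is better). Both keys are assumptions for this sketch.
    """
    scores_by_page = defaultdict(list)
    for hit in chunk_hits:
        scores_by_page[hit["url"]].append(hit["score"])

    pages = [
        {"url": url, "score": max(scores), "matching_chunks": len(scores)}
        for url, scores in scores_by_page.items()
    ]
    # Best chunk score first; the number of matching chunks breaks ties.
    pages.sort(key=lambda p: (p["score"], p["matching_chunks"]), reverse=True)
    return pages[:top_k]
```

With this kind of roll-up it probably helps to request more chunks than the number of pages you want to show, since several chunks can collapse into a single page.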

DudaNogueira · August 16, 2023, 9:12pm

hi @alt-glitch !

Were you able to find something here?

It would be interesting to shed some light on this with our team.

Thanks!

alt-glitch · August 17, 2023, 4:46am

Hi!
As of now I’m still exploring and have a couple of architectures in mind. If the Weaviate team can help me out with this, I could still use the help.
Otherwise, I’m happy to discuss it with you all in private, as I’m still a novice and am tinkering with what works and what doesn’t. Right now, I’m trying to cross-reference smaller chunks of a large document with a general URI. I’m also thinking of creating an evaluation pipeline to see what works better.

Happy to chat further.
Thanks for your response
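
As a rough illustration of cross-referencing chunks with a general URI, the sketch below splits a page’s text into overlapping word windows and tags every chunk with the page URL, so any chunk returned by a search can be traced back to its parent. The function name `chunk_page`, the property names, and the window sizes are all assumptions for illustration; word-count windows are just one simple chunking strategy, not a recommendation from this thread.

```python
def chunk_page(url, text, chunk_words=200, overlap_words=40):
    """Split page text into overlapping word windows tagged with the parent URL."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_words]
        chunks.append({
            "url": url,            # shared by every chunk of the same page
            "chunk_index": index,  # position of the chunk within the page
            "text": " ".join(window),
        })
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the page
    return chunks
```

Each returned dict could then be stored as one object, with `url` acting as the key that groups chunks back into their page.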

jphwang · August 17, 2023, 3:02pm

Hi @alt-glitch

I saw this from your workshop question. I have a very early draft of a chunking unit - would you be willing to take a look at the draft, with the understanding that it’s a work-in-progress?

It would also be very useful if you can answer a few questions afterwards.

jphwang · August 17, 2023, 3:04pm

Just on 3 - inputting your own vector for search would mean Weaviate doesn’t have to convert your question into a vector, so the query itself would be faster.

But then you would have to generate the vector - so it wouldn’t change the overall time required.
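
One way to check this on a real setup is to time the two stages separately. The sketch below uses hypothetical stand-ins (`fake_embed`, `fake_search`) in place of a real embedding call and a real Weaviate query, so it only shows the bookkeeping: the end-to-end time is the embedding time plus the search time, wherever the embedding happens.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_embed(text):
    # Hypothetical stand-in for the embedding model call.
    return [float(len(text))]


def fake_search(vector):
    # Hypothetical stand-in for a vector search against the database.
    return ["page-1"]


def end_to_end(query):
    vector, embed_s = timed(fake_embed, query)
    hits, search_s = timed(fake_search, vector)
    timings = {
        "embed_s": embed_s,
        "search_s": search_s,
        "total_s": embed_s + search_s,
    }
    return hits, timings
```

Swapping the stand-ins for real calls would show which stage dominates before deciding where to optimise.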

alt-glitch · August 18, 2023, 2:12pm

I’m happy to take a look and chat more!
Let me know how we can connect

jphwang · August 18, 2023, 2:27pm

Hi @alt-glitch, I just recognized your name from the workshop, haha!

The easiest would be through our community Slack. I’m JP (Weaviate) there - if you’re not there already, you can join here: Slack.

Cheers!
