I am going to be vectorising and storing web pages, and ideally I want fast and accurate search that imitates a “Google-like” search as closely as possible. Since web pages tend to be on the longer side of text documents, what are some tips I should follow when vectorising these documents?
Here are some points I am confused about:
1. I want to get search results on the web pages (and not on the individual chunk documents). Should I just vectorize the whole page at once or split it into chunks?
2. Splitting a web page into chunks (documents), storing their embeddings, and then using Weaviate’s hybrid search on them gives me the relevant “documents”, many of which are chunks of the same web page. I believe using the autocut feature prunes some web pages that might be relevant but also keeps some that aren’t (because one of their chunks happened to be slightly relevant). How might I overcome this?
3. What are some steps I could follow to reduce the latency of Weaviate searches? Does using my own vectors make it faster?
I’d really appreciate any pointers and tips I can get to be able to get the “parent” document from the search results on document chunks.
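One possible way to get from chunk hits back to “parent” pages is to store the page URL on every chunk and collapse the hits by that URL after the search. The sketch below is plain Python and makes no Weaviate calls. The `hits` list, its `url` and `score` keys, and the scoring rule (best chunk score plus a small bonus for each additional matching chunk) are all assumptions for illustration, not something confirmed in this thread.

```python
from collections import defaultdict


def rank_parent_pages(hits, top_k=10, extra_chunk_bonus=0.1):
    """Collapse chunk-level hits into a ranked list of parent pages.

    hits: iterable of dicts with hypothetical keys "url" and "score"
          (a higher score is assumed to mean a more relevant chunk).
    Returns a list of (url, page_score, matching_chunk_count) tuples.
    """
    scores_by_url = defaultdict(list)
    for hit in hits:
        scores_by_url[hit["url"]].append(hit["score"])

    ranked = []
    for url, scores in scores_by_url.items():
        scores.sort(reverse=True)
        # The best chunk dominates; each further chunk adds a small bonus.
        page_score = scores[0] + extra_chunk_bonus * sum(scores[1:])
        ranked.append((url, page_score, len(scores)))

    # Highest page score first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up chunk hits: two chunks from page "a", one each from "b" and "c".
hits = [
    {"url": "https://example.com/a", "score": 0.90},
    {"url": "https://example.com/b", "score": 0.80},
    {"url": "https://example.com/a", "score": 0.50},
    {"url": "https://example.com/c", "score": 0.30},
]
pages = rank_parent_pages(hits)
```

With this kind of post-processing it probably makes sense to fetch more chunks than the number of pages you need, and to apply any cutoff to the page scores rather than to the raw chunk scores.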
hi @alt-glitch !
Were you able to find something here?
It would be interesting to shed some light on this with our team.
As of now I’m still exploring and have a couple of architectures in mind. If the Weaviate team can help me out with this, I could still use the help.
Otherwise, I’m happy to discuss/talk it out with you all in private as I’m still a novice and am tinkering with what works and what doesn’t. Right now, I’m trying to cross reference smaller chunks of a large document with a general URI. But I’m also thinking of creating an evaluation pipeline to see what works better.
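As a rough illustration of cross-referencing smaller chunks with the URI of the page they came from, here is a minimal sketch in plain Python. The word-based window, the overlap, the default sizes, and the `url`, `chunk_index`, and `text` field names are assumptions chosen for the example, not a recommended configuration.

```python
def chunk_page(url, text, chunk_words=200, overlap_words=40):
    """Split page text into overlapping word windows tagged with the page URL."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_words]
        chunks.append({
            "url": url,            # reference back to the parent page
            "chunk_index": index,  # position of the chunk within the page
            "text": " ".join(window),
        })
        # Stop once this window has reached the end of the page.
        if start + chunk_words >= len(words):
            break
    return chunks


# A made-up 500-word page.
page_text = " ".join(f"word{i}" for i in range(500))
chunks = chunk_page("https://example.com/a", page_text)
```

Each chunk carries the same `url`, so a search over chunks can always be mapped back to the page, and different chunk sizes can later be compared in an evaluation pipeline.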
Happy to chat further.
Thanks for your response
I saw this from your workshop question. I have a very early draft of a chunking unit - would you be willing to take a look at the draft, with the understanding that it’s a work-in-progress?
It would also be very useful if you can answer a few questions afterwards.
Just on 3 - inputting your own vector for search would mean Weaviate doesn’t have to convert your question into a vector, so the query itself would be faster.
But then you would have to generate the vector - so it wouldn’t change the overall time required.
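To see where the time goes in your own setup, you could time the two stages separately. The sketch below only shows the measurement pattern: `embed_query` and `search_by_vector` are hypothetical stand-ins simulated with `time.sleep`, not Weaviate functions, and the sleep durations are made up.

```python
import time


def embed_query(text):
    """Hypothetical stand-in for the model that turns a query into a vector."""
    time.sleep(0.02)  # simulated embedding cost
    return [0.0, 1.0, 0.0]


def search_by_vector(vector):
    """Hypothetical stand-in for a vector search call against the database."""
    time.sleep(0.01)  # simulated search cost
    return ["https://example.com/a"]


def timed_search(text):
    """Run embed + search and report how long each stage took, in seconds."""
    t0 = time.perf_counter()
    vector = embed_query(text)
    t1 = time.perf_counter()
    results = search_by_vector(vector)
    t2 = time.perf_counter()
    timings = {"embed_s": t1 - t0, "search_s": t2 - t1, "total_s": t2 - t0}
    return results, timings


results, timings = timed_search("how do I chunk web pages?")
```

Replacing the two stand-ins with your real embedding call and your real search call would show whether the embedding step or the search step dominates the total.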
I’m happy to take a look and chat more!
Let me know how we can connect
Hi @alt-glitch, I just recognized you from the workshop, haha!
The easiest would be through our community Slack. I’m JP (Weaviate) there - if you’re not there already, you can join here: Slack.