I am trying to create a local Weaviate database where I store document chunks (which I call passages) together with their vector embeddings (computed with a Hugging Face model). I plan to store millions of passages.
I understand I cannot use the Hugging Face API because there are way too many passages and I do not have a paid account. I also understand how to insert the vectors manually, using a locally downloaded model from HF, with something like this:
from weaviate.util import generate_uuid5

with collection_obj.batch.dynamic() as batch:
    for p in corpus.passages:
        props = p.metadata
        props['passage'] = p.text
        batch.add_object(
            properties=props,
            uuid=generate_uuid5(props),
            vector=self.model.embed_passage(p)
        )
This works well, but it is very slow! I don't quite understand the Weaviate batching system, because it looks like I am embedding one passage at a time inside the loop. Isn't there a way to parallelize the embedding step as well, instead of iteratively embedding one passage at a time inside the batch? I cannot pre-vectorize the whole corpus beforehand because, again, there are too many vectors and I run out of memory.
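What I imagine is something like the sketch below: embed a mini-batch of passages in a single call, then hand the precomputed vectors to the Weaviate batch. Note that model.encode() with a batch_size argument is just an assumption on my part (a sentence-transformers style interface); my actual embed_passage() wrapper only takes one passage at a time.

from itertools import islice
from weaviate.util import generate_uuid5

def batched(iterable, n):
    # yield successive lists of up to n passages
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

with collection_obj.batch.dynamic() as batch:
    for chunk in batched(corpus.passages, 64):
        # hypothetical batched encoder call (sentence-transformers style),
        # embedding 64 passages in one forward pass
        vectors = self.model.encode([p.text for p in chunk], batch_size=64)
        for p, vec in zip(chunk, vectors):
            props = p.metadata
            props['passage'] = p.text
            batch.add_object(
                properties=props,
                uuid=generate_uuid5(props),
                vector=vec,
            )

That way only one mini-batch of vectors would be in memory at a time, so I would not need to pre-vectorize the whole corpus. Is this the right direction, or does Weaviate offer something built in for this?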
What is the recommended way to SCALE UP this "manual vectorization" with Weaviate, beyond the usual tutorial examples?
Thanks for the advice!