Vectorize large amounts of data locally

I am trying to create a local Weaviate database where I store document chunks (which I call passages) and their vectorized form (using any Hugging Face model). I plan to store millions of passages.

I understand I cannot use the HF API because there are way too many passages and I do not have a paid account. I also understand how to insert the vectors manually, using a locally downloaded model from HF, with something like this:

from weaviate.util import generate_uuid5  # deterministic UUIDs from the properties

with collection_obj.batch.dynamic() as batch:
    for p in corpus.passages:
        props = p.metadata
        props['passage'] = p.text
        batch.add_object(
            properties=props,
            uuid=generate_uuid5(props),
            vector=self.model.embed_passage(p),  # one embedding call per passage
        )

This works well, but it is veeeery slow! I don’t quite get the Weaviate batching system, because I feel that what is happening is that I am embedding one passage at a time… isn’t there a way to also parallelize the embedding step, instead of iteratively embedding one passage at a time inside the batch? I cannot pre-vectorize the whole corpus beforehand because, again, there are too many vectors and I run out of memory.
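What I have in mind is something like the sketch below: embed a whole chunk of passages in one model call, and only ever keep that one chunk of vectors in memory. Here embed_passages is a hypothetical batched version of my embed_passage method:

from itertools import islice

CHUNK_SIZE = 256  # passages per model call (just an example value)

passages = iter(corpus.passages)
while chunk := list(islice(passages, CHUNK_SIZE)):
    # hypothetical batched call: one forward pass embeds the whole chunk
    vectors = self.model.embed_passages([p.text for p in chunk])
    # ...then add each (passage, vector) pair to the Weaviate batch as above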

What is the advice for SCALING UP this “manual vectorization” with Weaviate, beyond the usual tutorial examples?

Thanks for the advice!

Hi @daza-science!! Welcome to our community! :hugs:

Yes! Vectorizing a lot of data concurrently will require beefy hardware.

Apart from enabling GPU/CUDA, it is about balancing the load across more nodes, AFAIK.
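On the client side, the biggest win is usually to let the model embed whole chunks of passages on the GPU instead of one passage per call, and to stream each chunk of vectors straight into the batch. A rough sketch with sentence-transformers (the model name, chunk size, and batch size are just examples; corpus and collection_obj are from your snippet above):

from itertools import islice

from sentence_transformers import SentenceTransformer
from weaviate.util import generate_uuid5

# example model; any Hugging Face embedding model can be used here
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

passages = iter(corpus.passages)
with collection_obj.batch.dynamic() as batch:
    # pull 1024 passages at a time so only one chunk of vectors is ever in memory
    while chunk := list(islice(passages, 1024)):
        texts = [p.text for p in chunk]
        # one encode() call embeds the whole chunk; batch_size controls GPU memory use
        vectors = model.encode(texts, batch_size=64)
        for p, vec in zip(chunk, vectors):
            props = {**p.metadata, "passage": p.text}
            batch.add_object(
                properties=props,
                uuid=generate_uuid5(props),
                vector=vec.tolist(),
            )

If you have several GPUs, sentence-transformers also has start_multi_process_pool() / encode_multi_process() to spread the encoding step across devices.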

There are ways of running your own models on hardware as a service. We recently published an article on exactly that:

Another project/community that is interesting to look at is Ollama:

That’s on running your own models :slight_smile:

We have a new recipe on using Weaviate with Ollama:
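Roughly, that recipe lets Weaviate call a locally running Ollama server to do the vectorization for you, so you no longer pass vectors in yourself. A minimal sketch, assuming an Ollama instance on the default port and an example embedding model (the exact config call depends on your Weaviate server and Python client versions):

# sketch: let Weaviate call a local Ollama server for vectorization
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

client.collections.create(
    "Passage",
    properties=[Property(name="passage", data_type=DataType.TEXT)],
    vectorizer_config=Configure.Vectorizer.text2vec_ollama(
        api_endpoint="http://host.docker.internal:11434",  # example endpoint
        model="nomic-embed-text",                           # example embedding model
    ),
)

# objects inserted without an explicit vector get embedded by Ollama
passages = client.collections.get("Passage")
passages.data.insert({"passage": "some chunk of text"})

client.close()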

Let me know if this helps :slight_smile: