Vectorize large amounts of data locally

I am trying to create a local Weaviate database where I store document chunks (which I call passages) and their vectorized form (using any Hugging Face model). I plan to store millions of passages.

I understand I cannot use the HF API because there are way too many passages and I do not have a paid account. I also understand how to insert the vectors manually, using a locally downloaded model from HF, with something like this:

from weaviate.util import generate_uuid5  # deterministic UUIDs from the properties

with collection_obj.batch.dynamic() as batch:
    for p in corpus.passages:
        props = p.metadata
        props['passage'] = p.text
        batch.add_object(
            properties=props,
            uuid=generate_uuid5(props),
            vector=self.model.embed_passage(p),  # one embedding call per passage
        )

This works well, but it is veeeery slow! I don’t quite get the Weaviate batching system, because I feel that what is happening is that I am embedding one passage at a time… isn’t there a way to also parallelize the embedding step, instead of iteratively embedding one passage at a time inside the batch? I cannot pre-vectorize the whole corpus beforehand because, again, there are too many vectors and I run out of memory.
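What I have in mind is something like the sketch below: embed a whole chunk of passages in one model call, and only ever keep that one chunk of vectors in memory. Here embed_passages is a hypothetical batched version of my embed_passage method:

from itertools import islice

CHUNK_SIZE = 256  # passages per model call (just an example value)

passages = iter(corpus.passages)
while chunk := list(islice(passages, CHUNK_SIZE)):
    # hypothetical batched call: one forward pass embeds the whole chunk
    vectors = self.model.embed_passages([p.text for p in chunk])
    # ...then add each (passage, vector) pair to the Weaviate batch as above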

What is the advice for SCALING UP this “manual vectorization” with Weaviate, beyond the usual tutorial examples?

Thanks for the advice!

Hi @daza-science!! Welcome to our community! :hugs:

Yes! Vectorizing a lot of data concurrently will require beefy hardware.

Apart from enabling GPU/CUDA, it is about balancing the load across more nodes, AFAIK.
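On the client side, the biggest win is usually to let the model embed whole chunks of passages on the GPU instead of one passage per call, and to stream each chunk of vectors straight into the batch. A rough sketch with sentence-transformers (the model name, chunk size, and batch size are just examples; corpus and collection_obj are from your snippet above):

from itertools import islice

from sentence_transformers import SentenceTransformer
from weaviate.util import generate_uuid5

# example model; any Hugging Face embedding model can be used here
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

passages = iter(corpus.passages)
with collection_obj.batch.dynamic() as batch:
    # pull 1024 passages at a time so only one chunk of vectors is ever in memory
    while chunk := list(islice(passages, 1024)):
        texts = [p.text for p in chunk]
        # one encode() call embeds the whole chunk; batch_size controls GPU memory use
        vectors = model.encode(texts, batch_size=64)
        for p, vec in zip(chunk, vectors):
            props = {**p.metadata, "passage": p.text}
            batch.add_object(
                properties=props,
                uuid=generate_uuid5(props),
                vector=vec.tolist(),
            )

If you have several GPUs, sentence-transformers also has start_multi_process_pool() / encode_multi_process() to spread the encoding step across devices.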

There are ways of running your own models on hardware as a service. We recently published an article on exactly that:

Another project/community that is interesting to look at is Ollama:

That’s on running your own models :slight_smile:

We have a new recipe on using Weaviate with Ollama:
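Roughly, that recipe lets Weaviate call a locally running Ollama server to do the vectorization for you, so you no longer pass vectors in yourself. A minimal sketch, assuming an Ollama instance on the default port and an example embedding model (the exact config call depends on your Weaviate server and Python client versions):

# sketch: let Weaviate call a local Ollama server for vectorization
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

client.collections.create(
    "Passage",
    properties=[Property(name="passage", data_type=DataType.TEXT)],
    vectorizer_config=Configure.Vectorizer.text2vec_ollama(
        api_endpoint="http://host.docker.internal:11434",  # example endpoint
        model="nomic-embed-text",                           # example embedding model
    ),
)

# objects inserted without an explicit vector get embedded by Ollama
passages = client.collections.get("Passage")
passages.data.insert({"passage": "some chunk of text"})

client.close()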

Let me know if this helps :slight_smile: