Using a local reranker-transformers slows hybrid queries by 100x

Description

Using Weaviate running from docker-compose
Image: cr.weaviate.io/semitechnologies/weaviate:1.25.1
Reranker: cr.weaviate.io/semitechnologies/reranker-transformers:cross-encoder-ms-marco-MiniLM-L-6-v2

Server Setup Information

  • Weaviate Server Version: 1.25.1
  • Deployment Method: docker-compose
  • Multi Node? No
  • Client Language and Version: Python 3.12.3

Any additional Information

When running

from weaviate.classes.query import MetadataQuery, Rerank

# Hybrid search with reranking on the "text" property
collection = clientv4.collections.get(collection_name)
hybrid_documentsv4 = collection.query.hybrid(
    query=user_input,
    limit=4,
    query_properties=["text", "key"],
    rerank=Rerank(prop="text", query=user_input),
    return_metadata=MetadataQuery(score=True),
)

responses take 50000+ ms

When I disable reranking:

from weaviate.classes.query import MetadataQuery

# Same hybrid search, but without the rerank step
collection = clientv4.collections.get(collection_name)
hybrid_documentsv4 = collection.query.hybrid(
    query=user_input,
    limit=4,
    query_properties=["text", "key"],
    return_metadata=MetadataQuery(score=True),
)

responses take 500 ms (100x faster)
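For completeness, the comparison above was essentially just timing the two calls back to back. A rough sketch of that (the time_hybrid helper is only illustrative, not my actual code):

import time

from weaviate.classes.query import MetadataQuery, Rerank

def time_hybrid(collection, user_input, with_rerank: bool) -> float:
    # Run the same hybrid query with or without rerank and return wall time in ms
    start = time.perf_counter()
    collection.query.hybrid(
        query=user_input,
        limit=4,
        query_properties=["text", "key"],
        rerank=Rerank(prop="text", query=user_input) if with_rerank else None,
        return_metadata=MetadataQuery(score=True),
    )
    return (time.perf_counter() - start) * 1000

collection = clientv4.collections.get(collection_name)
print(f"with rerank:    {time_hybrid(collection, user_input, True):.0f} ms")
print(f"without rerank: {time_hybrid(collection, user_input, False):.0f} ms")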

My reranker-transformers Docker container peaks at roughly 109% of one of my 10 CPU cores and about 1.6 GB of RAM during the ~50-second call. Even when I run parallel Python threads, reranking gets no faster, and I cannot get the container to use more CPU or RAM from the host.
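The parallel attempt was roughly along these lines (a sketch; run_rerank_query is just a wrapper around the hybrid call shown above):

from concurrent.futures import ThreadPoolExecutor

from weaviate.classes.query import MetadataQuery, Rerank

def run_rerank_query(query_text: str):
    # Same hybrid + rerank call as in the first snippet
    collection = clientv4.collections.get(collection_name)
    return collection.query.hybrid(
        query=query_text,
        limit=4,
        query_properties=["text", "key"],
        rerank=Rerank(prop="text", query=query_text),
        return_metadata=MetadataQuery(score=True),
    )

# Several queries in parallel threads -- total wall time does not improve,
# and the reranker container never uses more than about one CPU core.
queries = [user_input] * 4
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_rerank_query, queries))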

I have confirmed that my Docker machine has access to up to 10 cores and 14 GB of RAM and that there is no resource contention while the hybrid search runs. This seems to be purely an issue with the container, or with how the Weaviate client does the reranking. I am unsure how to improve performance; 50 seconds is dreadfully slow!

Hi!

I believe that to improve performance you will need CUDA-enabled hardware and to make sure the reranker container actually runs with it.
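If you do have an NVIDIA GPU available, the reranker container can be pointed at it from docker-compose. This is only a rough sketch (the service name, port, and Weaviate environment variables are the usual defaults, so adjust them to your own compose file):

# Sketch of a reranker-transformers service with CUDA enabled.
# Requires an NVIDIA GPU, drivers, and the NVIDIA Container Toolkit on the host.
services:
  reranker-transformers:
    image: cr.weaviate.io/semitechnologies/reranker-transformers:cross-encoder-ms-marco-MiniLM-L-6-v2
    environment:
      ENABLE_CUDA: '1'   # '0' falls back to CPU, which is what you are seeing now
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  weaviate:
    # ... your existing weaviate service ...
    environment:
      ENABLE_MODULES: 'reranker-transformers'
      RERANKER_INFERENCE_API: 'http://reranker-transformers:8080'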

Let me know if that helps.

Thanks!

I had a similar issue. Same docker container. In my case, the memory usage of the container climbed a bit with each rerank call. Eventually the host machine needed to use swap, which reduced performance. Then the host machine crashed. Took some time to diagnose as I couldn’t even ssh to the machine when out of memory!
So there must be a memory leak in the reranker container somewhere. I note from the source on GitHub that the CrossEncoder class uses the thread pool, and there are some issues filed against sentence-transformers relating to memory leaks and the thread pool.
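For reproduction, the pattern on our side was just repeated rerank queries in a loop while watching docker stats on the host; roughly like this sketch (reusing the collection and query from the snippets above):

import time

from weaviate.classes.query import MetadataQuery, Rerank

# Repeated hybrid + rerank calls; the container's memory climbs a little on
# each call (observe with `docker stats` alongside this loop).
collection = clientv4.collections.get(collection_name)
for _ in range(200):
    collection.query.hybrid(
        query=user_input,
        limit=4,
        query_properties=["text", "key"],
        rerank=Rerank(prop="text", query=user_input),
        return_metadata=MetadataQuery(score=True),
    )
    time.sleep(1)  # optional pacing between calls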

hi @Nicholas_Miller !!

Welcome to our community!!

You mean this code, right?

So, based on your findings, this could be something coming from upstream?

Could you open an issue so we can keep track of that, and mention this thread in it?

Thanks for helping us on that!