Description
Problem with Docker Configuration and CUDA for Reranker-Transformers
I followed the tutorial and used the following Docker Compose configuration:
```yaml
reranker-transformers:
  # Set the name of the inference container
  image: semitechnologies/reranker-transformers:baai-bge-reranker-v2-m3
  container_name: reranker-transformers-container
  network_mode: "bridge"
  volumes:
    - .:/usr/src/app
  ports:
    - 50051:8080
  restart: always
  environment:
    ENABLE_CUDA: 0 # set to 1 to enable
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
```
The container starts successfully, but when I set `ENABLE_CUDA` to `1`, calling the service results in the following error:
```
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Batches: 0%| | 0/1 [00:00<?, ?it/s]
ERROR: Something went wrong while scoring the input.
Traceback (most recent call last):
File "/app/app.py", line 55, in read_item
return await cross_encoder.do(item)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/crossencoder.py", line 56, in do
return await asyncio.wrap_future(self.executor.submit(self._rerank, item))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/crossencoder.py", line 53, in _rerank
return self._perform_rerank(item)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/crossencoder.py", line 45, in _perform_rerank
return self._batch_rerank(item)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/crossencoder.py", line 37, in _batch_rerank
scores = self.model.predict(sentences)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 375, in predict
self.model.to(self._target_device)
File "/usr/local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2883, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in convert
return t.to(
^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 305, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
However, if I set `ENABLE_CUDA` to `0`, there is no issue.
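
For reference, a quick way to check which PyTorch build ships inside the reranker image (a diagnostic sketch; `reranker-transformers-container` is the `container_name` from the config above):

```bash
# Print the torch version, the CUDA version it was built against, and
# whether CUDA is usable; torch.version.cuda is None for CPU-only wheels.
docker exec reranker-transformers-container \
  python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

# Separately, check that the GPU is visible inside the container at all
# (assumes the NVIDIA Container Toolkit makes nvidia-smi available there).
docker exec reranker-transformers-container nvidia-smi
```

If `torch.version.cuda` prints `None`, the image ships a CPU-only PyTorch wheel, which would produce this `AssertionError` no matter how the GPU is configured in the compose file.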
Why does this issue only happen with the reranker? I ask because I used the `transformers-inference` image with `ENABLE_CUDA: 1` without any problems:
```yaml
t2v-transformers:
  image: semitechnologies/transformers-inference:baai-bge-m3-onnx
  container_name: ccpg-t2v-transformers-container # Separate from JiaoDa naming
  network_mode: "bridge"
  volumes:
    - .:/usr/src/app
  ports:
    - 50049:8080
  restart: always
  environment:
    ENABLE_CUDA: 1 # set to 1 to enable
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
```
Additionally, I would like to know how to configure these services to load models that have already been downloaded to the local machine, instead of fetching them at startup.
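
To illustrate what I mean, here is a sketch of the kind of setup I am imagining. It assumes the image resolves models through the standard Hugging Face cache variables (`HF_HOME` / `TRANSFORMERS_CACHE`); I have not confirmed that these images do:

```yaml
reranker-transformers:
  image: semitechnologies/reranker-transformers:baai-bge-reranker-v2-m3
  volumes:
    # Pre-downloaded models on the host, mounted where the Hugging Face
    # libraries keep their cache (assumption, not confirmed for this image).
    - ./models:/root/.cache/huggingface
  environment:
    ENABLE_CUDA: 1
    HF_HOME: /root/.cache/huggingface # standard Hugging Face cache variable
```

Is mounting the cache like this the right approach for these images, or is there a dedicated mechanism?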
Server Setup Information
- GPU hardware: GH200 (aarch64)