Custom text2vector container gets killed because the kernel runs out of memory

Description

The custom vectorizer container gets killed, probably by the Linux OOM killer. How do I manage the resources of Weaviate plus a custom model on a development machine?

This is the way I define and start my two containers:

  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.24.6
    command:
      - "--host=0.0.0.0"
      - "--port=8080"
      - "--scheme=http"
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: unless-stopped
    environment:
      LOG_LEVEL: debug 
      ENABLE_CUDA: 0
      LIMIT_RESOURCES: true
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: true
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      CLUSTER_HOSTNAME: finland
      ENABLE_MODULES: text2vec-transformers
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-e5-mistral:8080
    depends_on:
      t2v-e5-mistral:
        condition: service_healthy

  t2v-e5-mistral:
    build:
      context: /home/mema/llms/e5-mistral-7b-instruct  
      dockerfile: Dockerfile
    image: e5-mistral-7b-instruct 
    environment:
      ENABLE_CUDA: '0'
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080/docs"]
      interval: 30s
      timeout: 10s
      retries: 2
      start_period: 10s

The custom t2v-e5-mistral container is built with the following Dockerfile:

FROM semitechnologies/transformers-inference:custom
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
RUN MODEL_NAME=intfloat/e5-mistral-7b-instruct ./download.py

I then spin up the two containers with a "docker compose up -d weaviate". Since weaviate's service definition has a depends_on on t2v-e5-mistral, that container starts first, and once it is healthy weaviate starts without any problems in the logs. But after some variable time the custom model container (t2v-e5-mistral) gets killed. Here follows a "docker compose logs -f t2v-e5-mistral" log of three starts and the subsequent kills. As you can see, the last two get killed immediately.

t2v-e5-mistral-1  | INFO:     Started server process [7]
t2v-e5-mistral-1  | INFO:     Waiting for application startup.
t2v-e5-mistral-1  | INFO:     CUDA_PER_PROCESS_MEMORY_FRACTION set to 1.0
t2v-e5-mistral-1  | INFO:     Running on CPU
Loading checkpoint shards: 100%|██████████| 6/6 [00:06<00:00,  1.13s/it]
t2v-e5-mistral-1  | INFO:     Application startup complete.
t2v-e5-mistral-1  | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
t2v-e5-mistral-1  | INFO:     127.0.0.1:59906 - "GET /docs HTTP/1.1" 200 OK
t2v-e5-mistral-1  | INFO:     192.168.160.4:52866 - "GET /meta HTTP/1.1" 200 OK
t2v-e5-mistral-1  | INFO:     127.0.0.1:47926 - "GET /docs HTTP/1.1" 200 OK
t2v-e5-mistral-1  | INFO:     127.0.0.1:56692 - "GET /docs HTTP/1.1" 200 OK
... skipped a dozen identical messages...
t2v-e5-mistral-1  | INFO:     127.0.0.1:54584 - "GET /docs HTTP/1.1" 200 OK
t2v-e5-mistral-1  | INFO:     127.0.0.1:44640 - "GET /docs HTTP/1.1" 200 OK
t2v-e5-mistral-1  | Killed
t2v-e5-mistral-1  | INFO:     Started server process [7]
t2v-e5-mistral-1  | INFO:     Waiting for application startup.
t2v-e5-mistral-1  | INFO:     CUDA_PER_PROCESS_MEMORY_FRACTION set to 1.0
t2v-e5-mistral-1  | INFO:     Running on CPU
Loading checkpoint shards:  83%|████████▎ | 5/6 [00:05<00:01,  1.22s/it]Killed
t2v-e5-mistral-1  | INFO:     Started server process [7]
t2v-e5-mistral-1  | INFO:     Waiting for application startup.
t2v-e5-mistral-1  | INFO:     CUDA_PER_PROCESS_MEMORY_FRACTION set to 1.0
t2v-e5-mistral-1  | INFO:     Running on CPU
Loading checkpoint shards:  83%|████████▎ | 5/6 [00:05<00:01,  1.15s/it]Killed

The model I am using for this container is intfloat/e5-mistral-7b-instruct on Hugging Face (https://huggingface.co/intfloat/e5-mistral-7b-instruct).

The development machine is a 64 GB / 16-core Linux server; its normal usage is shown in this nmon snapshot:

[nmon screenshot: normal system usage]

As soon as I launch the t2v-e5-mistral service container I see the free memory rapidly going to zero, then climbing back up after the container is killed. In this other snapshot you can see two CPU spikes, as I launched the container twice while trying to capture the picture when memory was at its lowest:

[nmon screenshot: memory almost depleted during the two launches]
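
For a rough sense of scale, here is a quick back-of-envelope of what just loading this model costs (my own rough figures; I have not verified which dtype the inference container actually uses on CPU):

# Back-of-envelope for e5-mistral-7b-instruct (~7.1e9 parameters); values are rough assumptions
params = 7.1e9
for dtype, nbytes in (("fp16", 2), ("fp32", 4)):
    print(f"{dtype}: ~{params * nbytes / 1024**3:.0f} GiB for the weights alone")
# prints: fp16: ~13 GiB, fp32: ~26 GiB -- and loading the checkpoint shards can transiently need more on top of that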

I would be very grateful if you could suggest some environment variables or other means to keep the weaviate and custom text2vector containers within the limits of my dev machine. Thanks in advance.

Hi!

This is interesting. Running those on CPU really takes some memory.

Is this happening even with only one object?

Or does it only happen on a big batch?

It also happens with very few objects.

I have now decided to build my own vectorizer container, over which I have a lot more control. It is much slimmer than the standard one and has a lot more debugging output, which I control with a DEBUG_LEVEL environment variable supporting DEBUG, INFO and ERROR levels. Also, the average response time is around 600 milliseconds instead of 10 seconds :slight_smile:
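
In essence the level switch just maps the environment variable onto Python's standard logging levels. A minimal sketch of the idea (illustrative only, not the actual llm_fast code):

# Minimal sketch of a DEBUG_LEVEL switch (illustrative, not the real llm_fast module)
import logging
import os

LEVELS = {"DEBUG": logging.DEBUG, "INFO": logging.INFO, "ERROR": logging.ERROR}

logging.basicConfig(
    level=LEVELS.get(os.getenv("DEBUG_LEVEL", "INFO").upper(), logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("llm_fast")

logger.debug("tokenized input: ...")          # only emitted when DEBUG_LEVEL=DEBUG
logger.info("vectorized 1 text in 0.6 s")     # emitted at INFO and DEBUG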

Is it based on our Dockerfile?

It would be nice to take a look, maybe we can improve our own container too.

Hey Duda, how are you? No, I wrote it from scratch and am quite happy with it.

This is the Dockerfile:

FROM python:3.11-slim-bookworm as base
RUN apt-get update && \
    apt-get install -y iputils-ping curl && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt

FROM base as application
WORKDIR /app
COPY llm_fast/*.py llm_fast/
COPY llm_fast/llmlib llm_fast/llmlib/
COPY resources resources/
COPY .env ./


FROM application as runtime
RUN useradd -ms /bin/bash llmuser
RUN \
    chown -R llmuser:llmuser /app \
    && mkdir -p /home/llmuser/.cache \
    && chown -R llmuser:llmuser /home/llmuser/.cache

Of course I could use a smaller base image, but I like the slim Debian bases. I also install curl to be able to run a health check. The last part, as you can see, is about running the container as a non-privileged user and preparing a mountpoint for a permanent cache of the downloaded models.

The following is the simple docker compose file:

services:
  llm_fast:
    image: llm_fast:v0.5
    command: gunicorn -w 1 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8086 --timeout 300 llm_fast.vectorize:app
    ports:
      - "8086:8086"
    volumes:
      - models_cache:/home/llmuser/.cache
    user: "1000:1000"

volumes:
  models_cache:

The app itself is a FastAPI app with an app.add_event_handler("startup", startup_event) hook to preload all desired models; once that completes, the worker starts the uvicorn process listening for connections.
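
A minimal sketch of that startup pattern (the endpoint shape, helper names and model id here are placeholders, not the actual llm_fast code):

# Sketch of a FastAPI vectorizer that preloads its model on startup (illustrative only)
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
models: dict[str, SentenceTransformer] = {}

def startup_event() -> None:
    # Runs once per worker before the first request is accepted; with the home
    # cache mounted as the models_cache volume, the downloaded weights survive restarts.
    models["default"] = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

app.add_event_handler("startup", startup_event)

@app.post("/vectors")
def vectorize(payload: dict) -> dict:
    # Encode the incoming text with the preloaded model and return the embedding.
    vector = models["default"].encode(payload["text"]).tolist()
    return {"text": payload["text"], "vector": vector}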

If you want more details just let me know, and if needed we can have a video call. Take care.
