Vectorization failed 404 http://host.docker.internal:11434/api/embed

Description

I'm running Windows Subsystem for Linux (WSL2) with Docker Desktop handling containerization from Windows. I have Ollama started with a model, and it works just fine when testing it with ollama run llama3.1.

I spin up the Ollama container with:

docker run -d --gpus=all --name ollama --restart always -v ollama:/root/.ollama --add-host=host.docker.internal:host-gateway -p 11434:11434 ollama/ollama:0.3.10

My docker compose file picks up these env vars from my .env file:

OLLAMA_URL=http://host.docker.internal:11434/
OLLAMA_MODEL=llama3.1:latest
OLLAMA_EMBED_MODEL=llama3.1

This works as expected: I can start up Verba on port 8000 and select the docker deployment in the UI. The "Chat" tab shows "0 documents embedded by llama3.1:latest", so it's definitely connecting and reading the right model; otherwise this would show a connection error.

But going to the "Import Data" tab and trying to add and import a simple txt file containing "Why is the sky blue" throws up:

✘ No documents imported 0 of 1 succesful tasks
ℹ FileStatus.ERROR | why_oh_why.txt | Import for why_oh_why.txt failed:
Import for why_oh_why.txt failed: Batch vectorization failed: Vectorization
failed for some batches: 404, message='Not Found',
url=URL('http://host.docker.internal:11434/api/embed') | 0

I even tried adding ollama to the same network as docker compose (docker network connect verba_default ollama) and got to the same point, but with "http://ollama:11434/api/embed" failing in the same way.

I jumped into the code to start debugging the OllamaEmbedder:

    async def vectorize(self, config: dict, content: list[str]) -> list[float]:

        model = config.get("Model").value

        data = {"model": model, "input": content}

        # Debug trace: log the method/URL/headers of each request once it completes.
        async def on_request_end(session, trace_config_ctx, params):
            print(f"Ending request:\n   method: {params.method}\n   url: {params.url}\n   headers: {params.headers}")

        trace_config = aiohttp.TraceConfig()
        trace_config.on_request_end.append(on_request_end)

        async with aiohttp.ClientSession(trace_configs=[trace_config]) as session:
            # self.url is the configured OLLAMA_URL; the API path is appended by plain concatenation.
            async with session.post(self.url + "/api/embed", json=data) as response:
                response.raise_for_status()
                data = await response.json()
                embeddings = data.get("embeddings", [])
                return embeddings

And I was thoroughly confused by the printout showing the method changing to GET:

Ending request:
   method: GET
   url: http://host.docker.internal:11434/api/embed
   headers: <CIMultiDict()>

But maybe that's down to my poor understanding of Python and these async libraries / middleware changing things along the way?

Either way, when I use curl from the verba-verba-1 container I'm able to get the embeddings just fine:

curl http://host.docker.internal:11434/api/embed -d '{"model": "llama3.1","input": "Why is the sky blue?"}'

So now I'm at a loss as to what else to try. Any ideas?

Server Setup Information

  • Verba commit: 59a46d06e382dc88cc90d9d217e7c5a2a8f950dc
  • Deployment Method: local docker compose
  • OS: Windows + WSL2

hi @Kieran_Sears !!

Welcome to our community :hugs:

I was just playing around with Verba + Ollama all in docker :slight_smile:

I am not sure exactly how WSL2 plays with Windows + Docker, but can you try running everything in Docker?

One thing to note: Whenever you start Verba, your ollama must have the models available, otherwise they will not be listed in Verba. Verba will connect to Ollama at startup and read all available models.

Here is how I am doing it:

First, create a docker-compose.yaml file like this:

---

services:
  verba:
    image: semitechnologies/verba
    ports:
      - 8000:8000
    environment:
      - WEAVIATE_URL_VERBA=http://weaviate:8080
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.2
      - OLLAMA_EMBED_MODEL=llama3.2

    volumes:
      - ./data:/data/
    depends_on:
      weaviate:
        condition: service_healthy
    healthcheck:
      test: wget --no-verbose --tries=3 --spider http://localhost:8000 || exit 1
      interval: 5s
      timeout: 10s
      retries: 5
      start_period: 10s

  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.25.10
    ports:
      - 8080:8080
      - 3000:8080
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    healthcheck:
      test: wget --no-verbose --tries=3 --spider http://localhost:8080/v1/.well-known/ready || exit 1
      interval: 5s
      timeout: 10s
      retries: 5
      start_period: 10s
    environment:
      OPENAI_APIKEY: $OPENAI_API_KEY
      COHERE_APIKEY: $COHERE_API_KEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'e'
      CLUSTER_HOSTNAME: 'node1'

  ollama:
    image: ollama/ollama:0.3.14
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - 11434:11434
      
volumes:
  weaviate_data: {}
  ollama_data: {}
...

Now, let's make sure we have the model we selected (in this case, llama3.2) available:

docker compose exec -ti ollama ollama pull llama3.2

You can check if the model is listed here:
http://localhost:11434/api/tags
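
Or, from a script, a small sketch like this hits the same endpoint and prints the model names (adjust the host and port if your Ollama is reachable elsewhere):

import json
import urllib.request

# List the models this Ollama instance currently serves; Verba reads this
# list at startup, so a model must appear here to be selectable in Verba.
OLLAMA_URL = "http://localhost:11434"  # adjust if Ollama runs elsewhere

with urllib.request.urlopen(OLLAMA_URL + "/api/tags") as response:
    tags = json.load(response)

for model in tags.get("models", []):
    print(model["name"])  # e.g. "llama3.2:latest"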

OK, now we can start everything up:

docker compose up -d

Now proceed to import a document in Verba, which should be running at:
http://localhost:8000/

A little after the import starts, you should see ollama eating up resources.

Obs: llama was quite slow to vectorize :thinking: and with large documents it was crashing :grimacing:

Let me know if this helps!

Thanks!


So now, if you want to add a new model, for example nomic-embed-text, you should:

docker compose exec -ti ollama ollama pull nomic-embed-text
docker compose restart verba

You should now see both models listed in Verba.

Ps: While vectorizing large documents, I have faced an error which I believe may be due to some docker env variable that needs to be set.

Hey @DudaNogueira, thanks for the warm welcome!

I found my issue, and it's painfully straightforward.

I was running it all within docker: I just had verba and weaviate in my docker compose, with the ollama image running independently in its own container rather than in the same docker compose file. As I say, this can be done by ensuring the ollama container is put onto the same network as verba. My ollama does have a model in it (llama3.1), which I've yet to benchmark, but it seems to run slick considering I set it up with GPU acceleration (see the docker image for details on how); I'll test it with larger files now I've got it working. Lord knows if the embedding works as expected, but considering it's producing content I can't see why it wouldn't.

The solution

But the issue was a trailing forward slash at the end of my OLLAMA_URL environment variable. I did think it was strange that my curl command could hit the endpoint but the service couldn't. So, after removing it:

- OLLAMA_URL=http://host.docker.internal:11434/
+ OLLAMA_URL=http://host.docker.internal:11434

The issue disappeared, and the logs even showed the expected POST method:

Ending request:
   method: POST
   url: http://host.docker.internal:11434/api/embed
   headers: <CIMultiDict()>

To me it's still quite infuriating that the method changed just because the resource wasn't found. If someone can point to the part of the standard where it says this should happen, please add it as a reply here!
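
For what it's worth, here is a minimal, self-contained sketch of what I suspect actually happened. It assumes the doubled slash made the server answer with a 301/302 redirect before the eventual 404; the port and paths below are made up for the demo, and this is neither Verba nor Ollama code. aiohttp, like browsers and many other clients, rewrites a redirected POST into a GET on 301/302 responses, which RFC 7231 permits "for historic reasons":

import asyncio
from aiohttp import ClientSession, web


async def redirecting_path(request):
    # Stand-in for a server that answers the badly joined URL with a redirect
    # instead of serving it directly.
    raise web.HTTPMovedPermanently("/api/embed")


async def embed(request):
    # Report which HTTP method actually arrived after the redirect was followed.
    return web.json_response({"arrived_as": request.method})


async def main():
    app = web.Application()
    app.router.add_route("*", "/redirected/api/embed", redirecting_path)
    app.router.add_route("*", "/api/embed", embed)

    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 8089).start()  # arbitrary local port

    async with ClientSession() as session:
        # POST to the redirecting path: aiohttp follows the 301 with a GET,
        # dropping the JSON body, so the far end never sees a POST at all.
        async with session.post(
            "http://127.0.0.1:8089/redirected/api/embed",
            json={"model": "llama3.1", "input": ["Why is the sky blue?"]},
        ) as response:
            print(await response.json())  # {'arrived_as': 'GET'}

    await runner.cleanup()


asyncio.run(main())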

Thank you kindly for your prompt reply, by the way! If you wanted to have more of a play around, you could look into setting up the ollama image in docker compose with GPU acceleration; that would absolutely speed things up, as it's practically instantaneous on my machine with an NVIDIA GeForce RTX 3070.
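
For reference, the compose-file equivalent of the --gpus=all flag from my docker run command is a deploy section on the ollama service, roughly like the fragment below. This assumes the NVIDIA Container Toolkit is installed on the host, and it is a sketch rather than a tested config:

  ollama:
    image: ollama/ollama
    # ... volumes and ports as in your compose file ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]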

I'm quite new to contributing to open source, but I'd like to prevent anyone else from making such a rookie error, which was difficult to debug. Should I open a PR that:

  1. updates the readme with a note to avoid trailing slashes.
  2. updates the config to validate env var URLs and improves error handling so the logs show when something is amiss.
  3. updates the config to just strip any trailing slashes automagically.

I'd like to know your thoughts.

Kind regards!

Ohhhhh. :grimacing:

I believe we can improve this in Verba by properly joining the URL paths here:

and here

Please feel free to open an issue so we can tackle this. We are always open to contributions, especially those that improve DX.

Something like this could prevent the issue:

from urllib.parse import urljoin

base_url = "https://example.com/"
relative_path = "/api/v1/users"
joined_url = urljoin(base_url, relative_path)
print(joined_url)  # https://example.com/api/v1/users, no doubled slash
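
For the OLLAMA_URL case specifically, something along these lines would make both forms of the env var (with and without the trailing slash) produce the same request URL; it is just an illustration, not Verba's actual code:

import os
from urllib.parse import urljoin

# Illustration only: strip any trailing slash from the configured base URL,
# then join the API path, so "http://host:11434" and "http://host:11434/"
# both yield the same result.
base_url = os.environ.get("OLLAMA_URL", "http://localhost:11434").rstrip("/")
embed_url = urljoin(base_url + "/", "api/embed")
print(embed_url)  # http://localhost:11434/api/embed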

There are other models that can also suffer from this issue.

As I use a Mac, I only run models on CPU. :frowning:
I will eventually get my hands on a proper GPU to play around with some load :slight_smile:

If you are opening this issue, make sure to link to this thread for context!

Thanks!