Hybrid Queries on new OpenAI Embedding Models failing after server restart

@DudaNogueira I thought I would create a new thread for this issue rather than hijacking the "New OpenAI Embedding Models" thread (#21 by SomebodySysop).

The problem: once a Weaviate server is configured with an OpenAI vectorizer using the new text-embedding-3-large model with dimensions set to 1024, hybrid queries fail after a server restart with the error message "vector search: vector lengths don't match: 1024 vs 3072".
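To make the error message concrete, here is a minimal sketch of the kind of length check that produces it. This is not Weaviate's actual implementation; `check_dimensions` is a hypothetical helper for illustration only. The two numbers are the 1024 dimensions configured on the collection and 3072, which is the default output size of text-embedding-3-large when no `dimensions` value is passed.

```python
from typing import Sequence


def check_dimensions(index_dim: int, query_vector: Sequence[float]) -> None:
    """Hypothetical check: a query vector must match the stored vector length."""
    if len(query_vector) != index_dim:
        raise ValueError(
            f"vector lengths don't match: {index_dim} vs {len(query_vector)}"
        )


STORED_DIM = 1024         # dimensions configured for the collection
MODEL_DEFAULT_DIM = 3072  # text-embedding-3-large default output size

# A query vector with the configured size passes the check.
check_dimensions(STORED_DIM, [0.0] * STORED_DIM)

# A query vector with the model's default size does not.
try:
    check_dimensions(STORED_DIM, [0.0] * MODEL_DEFAULT_DIM)
except ValueError as err:
    message = str(err)
    print(message)
```

If this reading is right, the stored objects keep their 1024-dimension vectors while the query text is being embedded at the model default after the restart, but that is an assumption on my part and not something the error alone confirms.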

I was able to replicate this issue on CodeSandbox, using Weaviate v1.23.10 and Python client 4.4.4.

Steps to reproduce

  1. https://codesandbox.io/p/sandbox/interesting-morse-hgvggd
  2. Sign-in using SSO of choice
  3. Open up setup.py and query.py and update line 16 of each with an OpenAI API key. As you do this, CodeSandbox will “seamlessly fork” to your own private sandbox. If the URL does not change, you may have to go back to the CodeSandbox dashboard, go to My drafts, and open the newly created sandbox.
  4. Go to the top left corner and select the “Restart Devbox” option. This should trigger sandbox initialization. Wait for the container to start and the pip install -r requirements.txt job to complete.
  5. Open up a new terminal in the center bottom pane.
  6. Run the following in sequence:
  • docker compose down -v
  • docker compose up -d
  • python setup.py
  • python query.py

    Note the following:

    1. setup.py creates a collection and inserts a single object
    2. The single object we stored in Weaviate has a vector length of 1024, indicating the vectorizer is working properly
    3. We can fetch that object from Weaviate, confirming that the inserted object is persisted
    4. We can run a hybrid query against Weaviate
  7. Now run:
  • docker compose restart
  • python query.py

All we’ve done here is restart the Weaviate container. Notice that we can still fetch the inserted object (see the output above the exception), but the hybrid query now fails with a vector length mismatch error.
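For anyone who wants to repeat the last step several times, here is a rough sketch that runs the restart and the query back to back. `run_steps` and `RESTART_STEPS` are names I made up for this example. It assumes it is run from the sandbox directory that contains the compose file and query.py.

```python
import subprocess
from typing import Callable, List, Sequence

# The two commands from the final repro step, in order.
RESTART_STEPS: List[List[str]] = [
    ["docker", "compose", "restart"],
    ["python", "query.py"],
]


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable = subprocess.run,
) -> List[int]:
    """Run each command in order and collect the exit codes."""
    codes = []
    for cmd in steps:
        # check=False so a failing query.py does not abort the loop.
        result = runner(list(cmd), check=False)
        codes.append(result.returncode)
    return codes


# Usage (needs Docker and the sandbox files, so it is left commented out):
# print(run_steps(RESTART_STEPS))
```

A non-zero exit code for the second command would correspond to query.py raising on the hybrid query, assuming the script lets the exception propagate.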

Hi @D3x !

Thanks for reporting.

I will try to reproduce this on my end and get back to you!

Hi @DudaNogueira, were you able to reproduce this given the instructions?

Hi! Sorry, I couldn’t get to it yet.

Have you tried running this locally?

Those sandboxes usually have a lot of limitations that may affect it, so removing that component may give us a hint as to whether the issue is in the sandbox or in the server.

@DudaNogueira yes, this is reproducible locally.

The same behavior I noted above persists. A simple server restart makes hybrid queries fail, which seems like a fairly serious problem. I would appreciate your team’s attention on this as soon as possible.

Hi D3x!

Sorry for the delay here.

I was not able to reproduce this:

❯ python3 setup.py
UUID for new object created: 117a7993-a2aa-4847-9bd2-f69cbdac1160
fetch_objects: 117a7993-a2aa-4847-9bd2-f69cbdac1160 (1024) | Properties: {'text': 'Some data'}
hybrid query: 117a7993-a2aa-4847-9bd2-f69cbdac1160 (1024) | Properties: {'text': 'Some data'}
❯ python3 query.py
fetch_objects: 117a7993-a2aa-4847-9bd2-f69cbdac1160 (1024) | Properties: {'text': 'Some data'}
hybrid query: 117a7993-a2aa-4847-9bd2-f69cbdac1160 (1024) | Properties: {'text': 'Some data'}

Could we connect in Slack so I can take a closer look?