WeaviateQueryError max_tokens is too large with generative search and gpt-4-1106-preview

Hi!

I have a problem which I can't seem to figure out.
I created a schema with the following config:

{
    "class": "article",
    "descripition": "Article collection",
    "properties": [
        {
            "indexFilterable": false,
            "indexSearchable": false,
            "name": "dirId",
            "dataType": [
                "string"
            ],
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": true
                },
                "qna-openai": {
                    "skip": true
                },
                "generative-openai": {
                    "skip": true
                }
            }
        },
        {
            "name": "title",
            "dataType": [
                "text"
            ],
            "description": "Article Title",
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "text-embedding-3-large",
                    "dimensions": 1024
                },
                "qna-openai": {
                    "model": "gpt-3.5-turbo-instruct"
                }
            }
        },
        {
            "name": "content",
            "dataType": [
                "text"
            ],
            "description": "Article Content",
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "text-embedding-3-large",
                    "dimensions": 1024
                },
                "qna-openai": {
                    "model": "gpt-3.5-turbo-instruct"
                }
            }
        },
        {
            "name": "url",
            "dataType": [
                "string"
            ],
            "description": "Article URL",
            "indexFilterable": true,
            "indexSearchable": false,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": true
                },
                "qna-openai": {
                    "skip": true
                },
                "generative-openai": {
                    "skip": true
                }
            }
        },
        {
            "name": "language",
            "dataType": [
                "string"
            ],
            "description": "Article Language"
        }
    ],
    "moduleConfig": {
        "generative-openai": {
            "model": "gpt-4-1106-preview",
            "maxTokensProperty": 4096
        },
        "text2vec-openai": {
            "model": "text-embedding-3-large",
            "dimensions": 1024
        },
        "qna-openai": {
            "model": "gpt-3.5-turbo-instruct"
        }
    }
}

During the import of articles I made sure to chunk them so each chunk stays within the roughly 8,000-token limit for vectorization, as estimated by the tiktoken npm package with the WASM bindings.
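
For illustration, a token-counted chunker along those lines could look like this (not my exact import code; the helper name, the 8,000-token cutoff, and the cl100k_base encoding choice are just illustrative):

import { get_encoding } from 'tiktoken';

// Split a long article into chunks of at most `maxTokens` tokens.
// cl100k_base is the encoding used by the current OpenAI embedding models.
function chunkByTokens(text: string, maxTokens = 8000): string[] {
    const enc = get_encoding('cl100k_base');
    const tokens = enc.encode(text);
    const decoder = new TextDecoder();
    const chunks: string[] = [];

    for (let i = 0; i < tokens.length; i += maxTokens) {
        // Cutting at arbitrary token boundaries can split a multi-byte
        // character right at a chunk edge; good enough for a sketch.
        const slice = tokens.slice(i, i + maxTokens);
        chunks.push(decoder.decode(enc.decode(slice)));
    }

    enc.free(); // release the WASM-side encoder
    return chunks;
}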

I imported 100 articles as a test, which came out to 101 chunks in total, meaning only one article exceeded the 8k chunk boundary.

Now if I try to run a simple groupedTask test to generate some responses across multiple articles, I run into a token limit. This is the error I receive:

WeaviateQueryError: Query call with protocol gRPC failed with message: /weaviate.v1.Weaviate/Search UNKNOWN: connection to: OpenAI API failed with status: 400 error: max_tokens is too large: 10189. This model supports at most 4096 completion tokens, whereas you provided 10189.

This is the code I used to test the generative Search function:

const collection = client.collections.get('Article');

const result = await collection.generate.nearText(search, {
    groupedTask: prompt,
    groupedProperties: ['title', 'content'],
}, {
    autoLimit: 2,
});

I tried using limit vs. autoLimit, and tried setting it to 1 instead of 2 (although I should be able to use somewhere around 16 articles as input, right? gpt-4-turbo has a 128k input window).

The error suggests to me that I need to set max_tokens somewhere else, but I already tried to do that during the creation of my class.

Could anybody point me in the right direction?

Thanks!
Steve

Hi @steveHimself! Welcome to our community :hugs:

This error message comes directly from OpenAI.

One thing you could do is intercept the payload that goes to OpenAI by replacing the base URL at query time.

So you do a vector search (to avoid triggering the vectorization), and it will give you the exact payload sent to OpenAI.
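
The endpoint that receives the payload can be as simple as a local server that logs whatever arrives, for example this rough sketch (port 4000 and the empty JSON response are placeholders, not a full OpenAI-compatible mock):

import { createServer } from 'node:http';

// Rough sketch of a local "OpenAI" endpoint: it only logs whatever Weaviate
// sends to the replaced base URL so you can inspect the generative payload.
createServer((req, res) => {
    let body = '';
    req.on('data', (chunk) => (body += chunk));
    req.on('end', () => {
        console.log(req.method, req.url);
        console.log(body);
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end('{}'); // not a valid OpenAI response, just enough to see the request
    });
}).listen(4000);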

Let me know if this helps!

Thanks!

But isn’t this regarding the vectorization and not the generation?
The vectorization should work fine. My problem seems to be with the generative search.

And how would I set custom headers to replace the base URL with the TypeScript client?

I can’t really find anything about that.

Thanks!

Update:

I created a small Bun server on port 4000 on localhost.

I then initialized the old weaviate-ts-client like this:

import weaviateAlt from 'weaviate-ts-client';

....
const altClient = await weaviateAlt.client({
    scheme: 'http',
    host: 'weaviate:8080',
    headers: {
        "X-OpenAI-Api-Key": process.env.OPENAI_API_KEY || '',
        "X-OpenAI-BaseURL": "http://host.docker.internal:4000"
    }
});

const altResult = altClient.graphql
    .get()
    .withClassName('Article')
    .withNearText({
        concepts: [search]
    })
    .withGenerate({
        groupedTask: proompt,
    })
    .withLimit(LIMIT) // This is 50 at the moment to test
    .withFields('title content id')
    .do();

On my Bun server the situation looks like this:

Bun.serve({
    port: 4000,
    async fetch(req) {
        console.log({ receivedBodyOutside: { req: req, body: await req.json() } });
        const url = new URL(req.url);

        if (url.pathname === '/') {
            console.log({ receivedBody: req.body });
            return new Response('Hello Bun!');
        }

        return new Response('Not found');
    }
})

And finally this is the result that is being logged:

{
  receivedBodyOutside: {
    req: Request (0 KB) {
      method: "POST",
      url: "http://host.docker.internal:4000/v1/embeddings",
      headers: Headers {
        "host": "host.docker.internal:4000",
        "user-agent": "Go-http-client/1.1",
        "content-length": "78",
        "authorization": "****",
        "content-type": "application/json",
        "accept-encoding": "gzip",
      }
    },
    body: {
      input: [ "<my-search-term>" ],
      model: "text-embedding-3-large",
      dimensions: 1024,
    },
  },
}

Couple of questions:

  • Am I right in thinking that this is just the payload for the vectorization of the search term? Can I skip this? You mentioned I should “just” do a vector search to skip vectorization. Do you mean I should use nearVector() instead of nearText()?
  • Is this what you meant with “intercepting the OpenAI Calls”?

Thank you for your help @DudaNogueira

Hi!

That’s right.

Weaviate will first vectorize your query (if using nearText or hybrid without a vector).

So in order to avoid triggering vectorization for your generative search, you need to use a nearVector (provide your query vector) or a nearObject (provide the UUID of an object from the same collection).
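
With the graphql client you are using, that would be something along these lines (just a sketch; the UUID and the prompt variable are placeholders, the UUID being any object that already exists in your collection):

const result = await altClient.graphql
    .get()
    .withClassName('Article')
    // nearObject skips query vectorization entirely; alternatively use
    // withNearVector({ vector: [...] }) if you already have a query vector.
    .withNearObject({ id: '00000000-0000-0000-0000-000000000000' })
    .withGenerate({ groupedTask: prompt })
    .withFields('title content')
    .withLimit(2)
    .do();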

Let me know if this helps!

Thanks!

I finally came around to testing this more.
I still couldn't make any progress.

This is my setup at the moment:

This is how I call weaviate to generate results based on search + prompt:

const altClient = await weaviateAlt.client({
        scheme: 'http',
        host: 'weaviate:8080',
        headers: {
            "X-OpenAI-Api-Key": process.env.OPENAI_API_KEY || '',
            "X-OpenAI-BaseURL": "http://host.docker.internal:4000"
        }
    });

 const altResult = altClient.graphql
        .get()
        .withClassName('Article')
        // .withNearText({
        //     concepts: [search]
        // })
        .withNearVector({
            vector: [...],
        })
        .withFields('title content')
        .withGenerate({
            singlePrompt: proompt,
        })
        .withLimit(LIMIT) // moved this down to 2
        .do();

With this setup I still get the same response in my Bun console output:

{
  receivedBodyOutside: {
    req: Request (0 KB) {
      method: "POST",
      url: "http://host.docker.internal:4000/v1/embeddings",
      headers: Headers {
        "host": "host.docker.internal:4000",
        "user-agent": "Go-http-client/1.1",
        "content-length": "94",
        "authorization": "Bearer sk-proj-CUjrXQPfsox5hc2GM03oT3BlbkFJM3w1UfyT59c5lURlcWsK",
        "content-type": "application/json",
        "accept-encoding": "gzip",
      }
    },
    body: {
      input: [ "<my-search-query>" ],
      model: "text-embedding-3-large",
      dimensions: 1024,
    },
  },
}

I’ve now also made sure to split each of my articles at 8k tokens, because I got a similar error while batch importing.
The import now runs through, so the splitting must have solved the issue there.

Hence, I believe the content length should not be the problem?

And given the output in my Bun server, I still only seem to get a vectorization request, even though I'm using withNearVector() instead of withNearText().

Am I doing something critically wrong here? And could you point me in the right direction?

Thanks for your help so far!