Is nearText() completely a Weaviate cloud operation, with no outbound LLM call?

Does the Weaviate nearText() search make an outbound LLM call? If it does not, why is the OpenAI key used when initializing the Weaviate client, as shown in the tutorial?

In other words, if nearText() happens completely within my Weaviate Cloud cluster, could I initialize the Weaviate client without the OpenAI API key, thereby omitting the “headers” section from the connectToWCS() call parameters?:

  const client = await weaviate.connectToWCS(
    WEAVIATE_CLUSTER_ENDPOINT_URL_1,
    {
      authCredentials: new weaviate.ApiKey(WEAVIATE_API_KEY_ADMIN_READ_WRITE),
    }
  )

hi @Robert_Oschler !

When you use nearText(), Weaviate will vectorize your query using the same vectorizer you used to generate the vectors of your objects, and then search against those objects.

So if you have chosen OpenAI, Weaviate will indeed connect to OpenAI, using that key, to vectorize your query and perform the search.
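
For reference, that key travels in the “headers” section of the connection call, which is why the tutorial includes it. A sketch reusing the question’s constants (`X-OpenAI-Api-Key` is the header Weaviate forwards to OpenAI):

```typescript
const client = await weaviate.connectToWCS(
  WEAVIATE_CLUSTER_ENDPOINT_URL_1,
  {
    authCredentials: new weaviate.ApiKey(WEAVIATE_API_KEY_ADMIN_READ_WRITE),
    headers: {
      // Forwarded to OpenAI whenever Weaviate has to vectorize a nearText query
      'X-OpenAI-Api-Key': OPENAI_API_KEY,
    },
  }
)
```

So omitting the headers would make nearText() fail, while nearVector() would still work.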

So no, nearText() doesn’t happen entirely within Weaviate.

nearVector(), on the other hand, queries your objects using a vector you supply, so in that case everything does happen within Weaviate.
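
To make the distinction concrete, here is a toy, self-contained sketch in which a stand-in embedder plays the role of the vectorizer service: nearText amounts to “embed the query, then run nearVector”. Nothing here is Weaviate API code.

```typescript
// Illustrative only: a toy embedder and in-memory search that mirror the
// relationship between nearText and nearVector. In reality the embedding
// comes from your configured vectorizer (e.g. OpenAI), not from code like this.

type StoredObject = { textChunk: string; vector: number[] };

// Hypothetical stand-in for the vectorizer service call.
function toyEmbed(text: string): number[] {
  // Trivial "embedding": character-code sums over 3 buckets, normalized.
  const v = [0, 0, 0];
  for (let i = 0; i < text.length; i++) v[i % 3] += text.charCodeAt(i);
  const norm = Math.hypot(...v) || 1;
  return v.map((x) => x / norm);
}

function cosine(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// nearVector: pure vector comparison, no external call needed.
function nearVector(objects: StoredObject[], vector: number[]): StoredObject {
  return objects.reduce((best, o) =>
    cosine(o.vector, vector) > cosine(best.vector, vector) ? o : best
  );
}

// nearText: embed the query first (the step that needs the vectorizer key),
// then fall through to the same vector search.
function nearText(objects: StoredObject[], query: string): StoredObject {
  return nearVector(objects, toyEmbed(query));
}
```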

Let me know if that helps.

Thanks!


I should have been a little clearer with my question. I figured it used an embeddings endpoint to vectorize the user input. What I really meant to ask is: does it do anything at all with an outbound service after the user input is vectorized, when searching the collection? Since you mentioned the nearVector method, I’m assuming that objects imported into the collection are vectorized automatically? If so, do I need to tell Weaviate which fields to vectorize and which fields to leave out of the resulting vector?

Here is my use case. Perhaps it is better if I take the app notes route.

I want to use my Weaviate Cloud Cluster to do the pre-filter operation for a RAG search. However, I want to do the RAG search myself.

What I was hoping to do was:

  • Upload all my chunks of text in a batch import operation, with each object containing a field named textChunk and another field named sourceUrl, which will hold the URL of the original HTML document the text chunk came from.

  • I assumed that Weaviate would vectorize the objects during the import. Am I wrong about this? I’m hoping it’s true, so that the text chunks don’t have to be re-vectorized every time I want to do a similarity search, and I would only have to vectorize the user input.

Then I will take the result of the nearText or nearVector search and pass that to the LLM model I am using to complete the RAG operation.

Can I do this with Weaviate? If so, what methods do I call to make this happen?

Note, if I do use nearVector in my use case, how do I tell it not to use the sourceUrl field in any comparison operations?

Oh, ok!

Considering that you have enabled the vectorizer for a collection, and that you provide the API key for that vectorizer service, Weaviate will vectorize the data at import.

Check here for a nice graphic about it:

Now, whenever you do a nearText, Weaviate will vectorize the query and search, using that vector, against your vectors.

So your assumption is correct.

When you create your collection, you can specify the vectorizer and the generative module (more on that below), and also specify the properties.

By default, because AUTO_SCHEMA is turned ON, if you just throw data at Weaviate like you have (textChunk and sourceUrl), each field will be created automatically as a property with the data_type TEXT and with skip: false, which means all those fields (including the collection name) will be concatenated and used to generate the vector of that object.
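
A rough, self-contained illustration of that concatenation behavior (Weaviate’s exact rules for ordering, casing, and class-name handling are more nuanced, so treat this only as a sketch of what skip: true does):

```typescript
// Rough illustration of how the text to be vectorized is assembled from an
// object's properties. Properties with skip: true are left out entirely.

type PropertyConfig = { name: string; skip: boolean };

function textToVectorize(
  collectionName: string,
  props: PropertyConfig[],
  obj: Record<string, string>
): string {
  // The collection name participates in the vectorized text by default.
  const parts = [collectionName.toLowerCase()];
  for (const p of props) {
    if (!p.skip && obj[p.name] !== undefined) parts.push(obj[p.name].toLowerCase());
  }
  return parts.join(" ");
}
```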

This is described here:

Now, you mentioned you want to do a RAG operation. You can either get the objects from your nearText or nearVector search, generate a prompt with them, and pass it to your LLM, or use one of Weaviate’s generative modules.

That way Weaviate will do all that heavy lifting for you, while exposing your RAG flow through simple-to-use APIs.
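
The first of those routes can be sketched in plain TypeScript: collect the hits from your nearText/nearVector search, assemble a prompt, and send it to your own LLM. The Hit shape below is illustrative, not the client’s exact return type:

```typescript
// Sketch of the "do the RAG step yourself" route: take the objects returned
// by a nearText/nearVector query and assemble a prompt for your own LLM call.

type Hit = { textChunk: string; sourceUrl: string };

function buildRagPrompt(question: string, hits: Hit[]): string {
  // Number each retrieved chunk and keep its source URL for attribution.
  const context = hits
    .map((h, i) => `[${i + 1}] ${h.textChunk} (source: ${h.sourceUrl})`)
    .join("\n");
  return `Answer the question using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}
```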

Also, check our events page, as we have lots of free webinars that can help you advance in your development:

Let me know if this helps.

Thanks!


Thanks, very helpful. What data type should I use for the URL field, since I don’t want it in the vectorized text? I looked at the available data types and I didn’t see one for URLs, and “string” is deprecated:

Should I make a tiny object just to house it like:

{
    url: <URL value>
}

And then use the “object” data type?

As an example, one of my objects I submit using a batch import would look like this:

{
    chunkText: "The quick brown fox jumped over the lazy grouse.",
    attributionUrl: {
        url: "https://dummy-domain.com/article/1234"
    }
}

With my schema marking the chunkText field as “text” and the attributionUrl field as “object”:

const schema = {
    name: 'Articles',
    properties: [
    {
      name: 'chunkText',
      dataType: 'text' as const,
      description: 'The searchable text' as const,
      tokenization: 'lowercase' as const,
      vectorizePropertyName: false,
    },
    {
      name: 'attributionUrl',
      dataType: 'object' as const,
      description: 'The URL of the source article' as const,
      tokenization: 'whitespace' as const,
      vectorizePropertyName: false,
    }
  ],
}

Is this the best approach?

Hi!

The downside of using the object data type, for now, is that such properties are neither indexed nor filterable:

from docs:

As of 1.22, object and object[] datatype properties are not indexed and not vectorized.
Future plans include the ability to index nested properties, for example to allow for filtering on nested properties and vectorization options.

For the URL, you can also use text; however, since you don’t want it added as part of the vectorization, you mark it to skip, like so:

const schema = {
  name: 'Articles',
  properties: [
    {
      name: 'chunkText',
      dataType: 'text' as const,
      description: 'The searchable text' as const,
      tokenization: 'lowercase' as const,
      vectorizePropertyName: false,
    },
    {
      name: 'attributionUrl',
      dataType: 'text' as const,
      description: 'The URL of the source article' as const,
      tokenization: 'whitespace' as const,
      vectorizePropertyName: false,
      moduleConfig: {
        // Module names are hyphenated, so the key must be quoted
        'text2vec-openai': {
          skip: true,
          vectorizePropertyName: false,
        },
      },
    },
  ],
}

Let me know if this helps!


That’s what I needed. That “skip” property, thanks.

Regarding:

" As of 1.22 , object and object[] datatype properties are not indexed and not vectorized.
Future plans include the ability to index nested properties, for example to allow for filtering on nested properties and vectorization options."

That’s OK as long as I get the object back in any search results from nearText() or nearVector(). I just need it for a “Further reading” section that I will show to the user underneath the main answer, so they can go and read the original articles.

I do get that field back in any search results, right?

Regarding “text2vec-openai”:

I assume that block tells the vectorizer to skip that field when using the OpenAI vectorizer? Or is that an arbitrary field name you chose? If it is the former and not the latter, is there a different reserved field name for each vectorizer? In other words, is there a different outer property name for the “skip” declaration for OpenAI vs. Google vs. Cohere, etc. (e.g. text2vec-openai, text2vec-google, text2vec-cohere, etc.)?

If so, is there a web page or document that lists the correct values? Also, if there are multiple reserved property names that the developer needs to know, one for each outbound LLM, wouldn’t it be better if there were a generic way to tell the vectorizer to “skip” the field, regardless of which outbound LLM is going to vectorize the input query?

Two more questions:

  • Is the date type indexed and/or searchable?

  • Can I use a date in a filter() operation? If so, what would be the syntax for:

    • Filtering all results before a certain date
    • Filtering all results after a certain date
    • Filtering all results between two dates (i.e. - after date_1 but before date_2)

Note, I am using the beta/V3 version of the JavaScript client, which has a much nicer syntax! :smile:

Hi!

Yes, they are!

Here’s how:

Unfortunately, the linked page on conditional filters with timestamps doesn’t have an example for the js v3 beta client yet.
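
Until that example lands, the three range semantics you asked about are plain date comparisons. In the v3 client they map onto the filter builder (names such as collection.filter.byProperty('publishDate').greaterThan(...) and Filters.and(...) are my reading of the docs, so verify against your client version). The semantics themselves, as self-contained TypeScript predicates:

```typescript
// The three timestamp filters expressed as plain predicates:
// before a cutoff, after a cutoff, and strictly between two dates.

const before = (cutoff: Date) => (d: Date) => d.getTime() < cutoff.getTime();
const after = (cutoff: Date) => (d: Date) => d.getTime() > cutoff.getTime();
const between = (d1: Date, d2: Date) => (d: Date) =>
  d.getTime() > d1.getTime() && d.getTime() < d2.getTime();
```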

Glad you are enjoying our new v3 js client!


Thanks. If you get the chance, I’d like to know the answers to the questions in my reply that precedes the one you answered.

Sorry, I think I got lost here, hehehe.

What is the pending question?

Thanks!

The one above that starts with:

" That’s what I needed. That “skip” property, thanks.

Regarding:

" As of 1.22 , object and object[] datatype properties are not indexed and not vectorized.
Future plans include the ability to index nested properties, for example to allow for filtering on nested properties and vectorization options. …"

Ah, ok! Thanks!

So, when you use the object data type, you will get those values back, as expected.

However, you will not be able to use that content for searching, filtering, or vectorization.

So, based on your use case, you should be fine.

Let me know if this helps :slight_smile:
