Problem with near_text Query and Metadata Filtering in Weaviate

  • Weaviate Server Version: 1.27.0
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: single node
  • Client Language and Version: Client: 4.9.3, python
  • Multitenancy?: no

I’m encountering an issue with filtering results in Weaviate using the near_text query.

First, here is how I create the collection:

client.collections.create(
    name="RAG",
    properties=[
        wc.Property(name="transcription", data_type=wc.DataType.TEXT),
        wc.Property(name="data", data_type=wc.DataType.DATE),  # note: timestamp indexing belongs in the collection-level inverted_index_config, not on the property
        wc.Property(name="hora_inicio_video", data_type=wc.DataType.TEXT),
        wc.Property(name="hora_fim_video", data_type=wc.DataType.TEXT),
        wc.Property(name="chave_unica", data_type=wc.DataType.TEXT),
        wc.Property(name="highlights_assunto", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="highlight_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="highlight_end", data_type=wc.DataType.NUMBER),
        wc.Property(name="action_log", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="action_log_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="action_log_end", data_type=wc.DataType.NUMBER),
        wc.Property(name="location", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="offset_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="offset_end", data_type=wc.DataType.NUMBER)
    ],
    vectorizer_config=wc.Configure.Vectorizer.text2vec_huggingface(model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    inverted_index_config=wc.Configure.inverted_index(index_timestamps=True),
    generative_config=wc.Configure.Generative.google(
        project_id=project,
        model_id="gemini-1.5-pro-preview-0514",
        temperature=0.3,
    )
)

Here is my initial query that works fine:

response_time_near_text = rag.query.near_text(
    query="limpar a piscina",
    limit=1,
    return_metadata=wq.MetadataQuery(distance=True),
)

This query returns data and metadata successfully. However, when I take one of the returned property values (in this case chave_unica, though the same happens with other properties) and use it as an exact-match filter, the query returns no results:

response_time_near_text = rag.query.near_text(
    query="limpar a piscina",
    limit=1,
    return_metadata=wq.MetadataQuery(distance=True),
    filters=wq.Filter.by_property("chave_unica").equal("060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80")
)

QueryReturn(objects=[])

The filter value "060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80" is copied exactly from the chave_unica value returned by the first query, yet with this filter applied the query returns no results.
Additionally, if I use:

response_time_near_text = rag.query.near_text(
    query="limpar a piscina",
    limit=1,
    return_metadata=wq.MetadataQuery(distance=True),
    filters=wq.Filter.by_property("chave_unica").not_equal("60a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80")
)

In other words, when I run the query with not_equal and a deliberately incorrect value, the query returns data correctly.
Any advice on why this happens or how to fix it would be greatly appreciated. Thank you!

I’ve performed some additional testing.

If I populate my collection with the following code (using langchain):

import json
import os

# Import path may vary with your langchain version
from langchain_core.documents import Document

def carregar_documentos(pasta):
    documentos = []
    arquivos = os.listdir(pasta)
    for arquivo in arquivos:
        caminho_completo = os.path.join(pasta, arquivo)
        if os.path.isfile(caminho_completo):
            with open(caminho_completo, 'r') as f:
                dados = json.load(f)
                documentos.extend([
                    Document(page_content=chunk['transcription'], metadata=chunk['metadata'])
                    for chunk in dados
                ])
    return documentos

documents = carregar_documentos('data/data_constructed')

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")  # e.g. from langchain_huggingface import HuggingFaceEmbeddings

db = WeaviateVectorStore.from_documents(documents, embedding=embeddings, client=client, index_name="RAG")

The search works fine. The issue is that I want to avoid having to define the embeddings explicitly, as in embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2").
So I tried this other way (batch.dynamic()):

def carregar_documentos(pasta):
    documentos = []
    arquivos = os.listdir(pasta)
    for arquivo in arquivos:
        caminho_completo = os.path.join(pasta, arquivo)
        if os.path.isfile(caminho_completo):
            with open(caminho_completo, "r") as f:
                dados = json.load(f)
                documentos.extend(
                    [
                        {
                            "page_content": chunk["transcription"],
                            "metadata": chunk["metadata"],
                        }
                        for chunk in dados
                    ]
                )
    return documentos

data_rows = carregar_documentos('data/data_constructed')

rag = client.collections.get("RAG")

with rag.batch.dynamic() as batch:
    for data_row in data_rows:
        try:
            batch.add_object(properties=data_row)
            print(f"Added: {data_row}")
        except Exception as e:
            print(f"Error adding {data_row}: {e}")

In both cases, if I run the test:

response = rag.aggregate.over_all(total_count=True)

everything is fine.

hi @lisascat !!

Welcome to our community :hugs:

I was not able to reproduce this.

Here is some code we can use to compare:

client.collections.delete("Test")
collection = client.collections.create(
    "Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="chave_unica", data_type=wvc.config.DataType.TEXT),
    ]
)
collection.data.insert_many(
    objects=[
        {
            "chave_unica": "060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80",
            "text": "Clean swimming pool"
        },
        {
            "chave_unica": "111a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80",
            "text": "Clean garage"
        },
    ]
)
query = collection.query.near_text(
    query="limpar a piscina",
    limit=1,
    return_metadata=wvc.query.MetadataQuery(distance=True),
    filters=wvc.query.Filter.by_property("chave_unica").equal("060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80")
)
print(query.objects)

Note that if you have an ID for your objects, you can use it, as explained here:

That way, when performing batch operations, you can insert or update your objects.
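As a sketch of that idea in plain stdlib Python (mimicking what a helper such as weaviate.util.generate_uuid5 provides; the namespace constant and the choice of chave_unica as the key are assumptions for this example):

```python
import uuid

# Hypothetical namespace for this example; any fixed UUID works,
# as long as it never changes between ingestion runs.
NAMESPACE = uuid.UUID("8a6e0804-2bd0-4672-b79d-d97027f9071a")

def deterministic_id(chave_unica: str) -> str:
    """Derive a stable object ID from the unique key, so re-running a
    batch insert updates existing objects instead of creating duplicates."""
    return str(uuid.uuid5(NAMESPACE, chave_unica))

# The same key always produces the same ID.
key = "060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80"
deterministic_id(key) == deterministic_id(key)  # True
```

You would then pass this ID as the uuid argument when adding objects in a batch, so repeated ingestion runs overwrite rather than duplicate.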

Hello @DudaNogueira
thank you for your response!

The insertion worked, and the search is also functioning correctly. What I don’t understand is why, when using:

batch_size = 100  # Batch size
for i in range(0, len(data_rows), batch_size):
    batch_data = data_rows[i : i + batch_size]
    try:
        rag.data.insert_many(objects=batch_data)
        print(f"Batch {i // batch_size + 1} successfully inserted!")
    except Exception as e:
        print(f"Error inserting batch {i // batch_size + 1}: {e}")

the chunks returned by the search with query.near_text are different from those generated when using:

db = WeaviateVectorStore.from_documents(documents, embedding=embeddings, client=client, index_name="RAG")

It seems that the embedding model is not being loaded correctly. The “distance” values are also inconsistent.

Oh, I see.

When you use WeaviateVectorStore.from_documents, the framework (not sure whether you are using llamaindex or langchain) will embed it for you following its own logic. This means it will concatenate each doc, vectorize it, and then bring the vectors to Weaviate.

When you let Weaviate vectorize it for you, you control which fields you want to be part of the vector.

With that said, and considering the collection schema you pasted, there are probably more properties being used as part of your vector.

So I believe that the two different ways of ingesting data will output different vectors, hence the difference.
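As a rough plain-Python illustration of why the vectors differ (a simplification, not Weaviate's or LangChain's actual code; the exact property-concatenation scheme below is an assumption):

```python
def framework_embedding_input(chunk: dict) -> str:
    # LangChain-style ingestion: only the page content is embedded.
    return chunk["transcription"]

def server_side_embedding_input(chunk: dict) -> str:
    # Server-side vectorization: by default, every text property can
    # contribute to the vector (simplified concatenation scheme).
    text_props = sorted((k, v) for k, v in chunk.items() if isinstance(v, str))
    return " ".join(f"{k} {v}" for k, v in text_props)

chunk = {
    "transcription": "limpar a piscina",
    "chave_unica": "060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80",
}

# The two ingestion paths feed different text to the embedding model,
# so the resulting vectors (and hence the distances) differ.
```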

Hi @DudaNogueira!

Thanks again for your previous help. The solution you suggested worked perfectly, and I was able to resolve the initial issue. However, I encountered other challenges as I adapted the code to my specific needs.

Ideally, I want the semantic search to happen only on one specific field: transcription. I managed to make this work using Named Vectors:


client.collections.create(
    name="RAG",
    properties=[
        wc.Property(name="transcription", data_type=wc.DataType.TEXT),
        wc.Property(name="data", data_type=wc.DataType.DATE),
        wc.Property(name="hora_inicio_video", data_type=wc.DataType.TEXT),
        wc.Property(name="hora_fim_video", data_type=wc.DataType.TEXT),
        wc.Property(name="chave_unica", data_type=wc.DataType.TEXT),
        wc.Property(name="highlights_assunto", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="highlight_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="highlight_end", data_type=wc.DataType.NUMBER),
        wc.Property(name="action_log", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="action_log_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="action_log_end", data_type=wc.DataType.NUMBER),
        wc.Property(name="location", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="offset_start", data_type=wc.DataType.NUMBER),
        wc.Property(name="offset_end", data_type=wc.DataType.NUMBER)    
    ],
    # Define the vectorizer module
    vectorizer_config=[
        wc.Configure.NamedVectors.text2vec_huggingface(
            name="transcription_vector",
            source_properties=["transcription"],
            model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
        )
    ],
    # Define the generative module
    generative_config=wc.Configure.Generative.google(
        project_id=project_id,
        model_id="gemini-1.5-pro-preview-0514", 
        temperature=0.3,
    )
)

The following query works as expected:

response = rag.query.near_text( 
    query="o que aconteceu na piscina",
    target_vector="transcription_vector",
    limit=5,
    return_metadata=wq.MetadataQuery(distance=True),
)

However, when I try to use a filter along with the semantic search, like this:

response_unique_key = rag.query.near_text(
    query="limpar a piscina",
    target_vector="transcription_vector",
    limit=1,
    return_metadata=wq.MetadataQuery(distance=True),
    filters=wq.Filter.by_property("chave_unica").equal("060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80")
)

It does not work. My understanding is that this might happen because I vectorized only the transcription field. To address this, I tried adding another Named Vector for the chave_unica field:

wc.Configure.NamedVectors.text2vec_huggingface(
    name="chave_unica_vector",
    source_properties=["chave_unica"],
    model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

And then queried like this:

response_unique_key = rag.query.near_text(
    query="limpar a piscina",
    target_vector="chave_unica_vector",
    limit=1,
    return_metadata=wq.MetadataQuery(distance=True),
    filters=wq.Filter.by_property("chave_unica").equal("060a2b340101010101010f0013-000000-00000141d33e6dbe-060e2b347f7f-2a80")
)

But this still does not work…

What I Need

I need the semantic search to occur only in the transcription field, but I also need to be able to apply filters on other fields during the semantic search.

Is there a recommended way to achieve this? tks for all the support!!!

Hi!

Great! Glad to hear we are making progress :slight_smile:

Now, you need to understand that in your scenario, Weaviate will have two indices:

1 - A Named Vector called transcription_vector
2 - An inverted index with all the tokenized content that is searchable and filterable.

So there is no need to add a second NamedVector just for chave_unica; the filtering happens on the inverted index, not on the vector index.
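As a toy sketch of how these two indices interact on a filtered query (made-up helper names, not Weaviate internals):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def filtered_search(objects, query_vector, prop, value, limit=1):
    # Step 1 (inverted index): keep only objects whose property matches the filter.
    candidates = [o for o in objects if o["props"].get(prop) == value]
    # Step 2 (vector index): rank the survivors by distance to the query vector.
    candidates.sort(key=lambda o: cosine_distance(query_vector, o["vector"]))
    return candidates[:limit]

objects = [
    {"props": {"chave_unica": "key-a"}, "vector": [1.0, 0.0]},
    {"props": {"chave_unica": "key-b"}, "vector": [0.0, 1.0]},
]

# The filter restricts results to key-b even though key-a is closer to the query.
filtered_search(objects, [1.0, 0.0], "chave_unica", "key-b")
```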

Your near_text with that filter should work. One thing you can try, though it shouldn't make a difference, is to set the tokenization to field for the chave_unica property, like so:

...
wc.Property(name="chave_unica", data_type=wc.DataType.TEXT, tokenization=wc.Tokenization.FIELD),
...

this will ensure that the entire value you set for chave_unica is treated as a single token.

That said, I find it strange, because it should already work.

Let me know if you can share the dataset so I can try reproducing it on my end.

Thanks!

If you want to learn more on tokenization:

So, for example, if you keep the default, word, and create a url property, this is what happens when your property has the value google.com:

WITH WORD TOKENIZATION
google.com becomes 2 tokens: google and com

now, when you search for the url property equal to google.com, you will NOT find that object.

WITH FIELD TOKENIZATION
google.com becomes 1 token: google.com

now, when you search for the url property equal to google.com, you will find that object :slight_smile:
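The behaviour above can be sketched in plain Python (an approximation of the two tokenizers, not Weaviate's actual implementation):

```python
import re

def word_tokenize(value: str) -> list[str]:
    # word tokenization (the default): split on non-alphanumeric characters
    return [t for t in re.split(r"[^0-9A-Za-z]+", value.lower()) if t]

def field_tokenize(value: str) -> list[str]:
    # field tokenization: the whole (lowercased, trimmed) value is one token
    return [value.lower().strip()]

# An Equal filter matches against tokens, so:
"google.com" in word_tokenize("google.com")   # False: tokens are ['google', 'com']
"google.com" in field_tokenize("google.com")  # True: the single token matches
```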

Hello @DudaNogueira, thank you again for your response!

Everything worked out; a well-hidden typo in my code was causing the error…

1 Like

Oh!! that happens, hahaha

And when that happens, it can drive us nuts! :crazy_face:

Thanks for sharing, whenever you need help with your Weaviate journey, we are here to help!

Happy building!

1 Like