Filters do not seem to be working as expected

Description

Server Setup Information

  • Weaviate Server Version:
  • Deployment Method: WCS
  • Multi Node? Number of Running Nodes: no
  • Client Language and Version: php curl
  • Multitenancy?: no

Any additional Information

I think these queries should be bringing back only objects which contain ‘not working yet’ in the title, content, location or summary properties.

However, they are not. I am getting results that indicate that the keyword filters aren’t being executed.

{
  Get {
    SolrCopy(
      limit: 10
      nearText: {
        concepts: ["specific text discussing query configurations in \"It's not working yet\""]
      }
      where: {
        operator: And, operands: [
          {path: ["site"], operator: Equal, valueText: "https://master1and1.schoolboard.net/"},
          {operator: Or, operands: [
            {path: ["groups"], operator: Equal, valueText: "Development"}
          ]},
          {operator: Or, operands: [
            {operator: And, operands: [
              {path: ["content"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["title"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["location"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["summary"], operator: Like, valueText: "'not working yet'"}
            ]}
          ]}
        ]
      }
    ) {
      _additional { distance certainty }
      docId site title nid type public url content taxonomy groups date summary questions sourceUrl solrId location
    }
  }
}

{
  Get {
    SolrCopy(
      limit: 10
      hybrid: {query: "query configurations in \"It's not working yet\"", alpha: 0.5}
      where: {
        operator: And, operands: [
          {path: ["site"], operator: Equal, valueText: "https://master1and1.schoolboard.net/"},
          {operator: Or, operands: [
            {path: ["groups"], operator: Equal, valueText: "Development"}
          ]},
          {operator: Or, operands: [
            {operator: And, operands: [
              {path: ["content"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["title"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["location"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["summary"], operator: Like, valueText: "'not working yet'"}
            ]}
          ]}
        ]
      }
    ) {
      _additional { distance score }
      docId site title nid type public url content taxonomy groups date summary questions sourceUrl solrId location
    }
  }
}

I’m executing these commands in the Weaviate console, and when I search the results, ‘not working yet’ is not found in any of the returned content.

I do not see any error in the response. The only thing that may be unusual is that the “location” property is null as I just added it and it doesn’t contain any data. Other than that the output seems correct – it’s just not filtered by the keywords.

Update: I removed the “location” property and got the same results.

Any suggestions? I can send you sample output if you like (it’s pretty long).

Hi @SomebodySysop,

By default, Weaviate indexes text word by word, which means a match is considered successful when the individual keywords match.

This behaviour is defined by tokenization.
In your case, you need to change the tokenization to field for the properties you want to match exactly.

Here is a docs example on how to set tokenization when you create a collection.

Note: I don’t think Weaviate allows updating tokenization on an existing collection.

I do not use Python. All of my requests are by curl.

My filter code has worked for at least a year. I just noticed that it had stopped working recently, so I would estimate that it stopped within the past few weeks, if not months.

Below is my schema. “content”, “title”, “summary” and “location” are all text properties. Are you saying I now need to add a “tokenization” element to each text property? Wouldn’t “word” already be the default tokenizer for text?

$schema = [
    "class" => "SolrAI",
    "description" => "Class representing the SolrAI index",
    "vectorizer" => "text2vec-openai",
    "moduleConfig" => [
      "generative-openai" => [
        "model" => "gpt-4o",
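        ],
      "text2vec-openai" => [
        "model" => "text-embedding-3-large",
        "dimensions" => 3072,
        "type" => "text",
        // The original listed "vectorizeClassName" twice (true, then false);
        // PHP keeps the last value, so only that one is shown here.
        "vectorizeClassName" => false
        ],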
    ],
    "multiTenancyConfig" => [
	  "text2vec-openai" => True
    ],
	"properties" => [
	[
		"name" => "docId",
		"dataType" => ["text"],
		"description" => "The unique identifier of the document - nid-dstype-dsid-delta",
		"moduleConfig" => [ "text2vec-openai" => 
			[ "skip" => false, "vectorizePropertyName" => false ]
			]
		],
	[
		"name" => "site",
		"dataType" => ["text"],
		"description" => "The site name.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],
	[
		"name" => "title",
		"dataType" => ["text"],
		"description" => "The title of the document",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "summary",
		"dataType" => ["text"],
		"description" => "Summarization of the source document from which this text taken",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "nid",
		"dataType" => ["int"],
		"description" => "ID of node associated with this content.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true]]
	],
	[
		"name" => "public",
		"dataType" => ["text"],
		"description" => "Is this content public (Y or N)..",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false]]
	],
	[
		"name" => "url",
		"dataType" => ["text"],
		"description" => "The url to the content on site",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],
	[
		"name" => "type",
		"dataType" => ["text"],
		"description" => "Solr datasource type",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],
	[
		"name" => "timestamp",
		"dataType" => ["date"],
		"description" => "Last Solr index date",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],
	[
		"name" => "content",
		"dataType" => ["text"],
		"description" => "Document content",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "groups",
		"dataType" => ["text[]"],
		"tokenization" => "word",
		"description" => "The group(s) to which this text document belongs.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "date",
		"dataType" => ["text"],
		"description" => "The date for this event / activity.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "taxonomy",
		"dataType" => ["text[]"],
	  "tokenization" => "word", 
		"description" => "The taxonomy or taxonomies (tag or tags) assigned to this text document.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
		"name" => "solrId",
		"dataType" => ["text"],
		"description" => "The Solr database id.",
		"moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],
	[
	  "name" => "questions",
	  "dataType" => ["text"],
	  "description" => "Questions that this document answers.",
	  "moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
	],
	[
	  "name" => "dsid",
	  "dataType" => ["int"],
	  "description" => "Datasource ID.",
	  "moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
	],	
        [
          "name" => "sourceUrl",
          "dataType" => ["text"],
          "description" => "URL to original source document.",
          "moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => false ]]
        ],
        [
          "name" => "categories",
          "dataType" => ["text[]"],
          "description" => "Keyword categories.",
          "moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
        ],
        [
          "name" => "location",
          "dataType" => ["text"],
          "description" => "Location of document text (breadcrumb trail).",
          "moduleConfig" => [ "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]]
        ]
  ]
];

This is exactly what I want to happen. But, it’s not.

According to the documentation:

The word tokenization is the default tokenization method in Weaviate.

So, all of the filter elements in my query should be matching by word, but they are not.


Yes, that is correct. word is the default, while to achieve an exact match on the whole field, you need field tokenization.

Some background

Word tokenization keeps only alphanumeric characters, lowercases them, and treats everything else as a separator.
For example, it converts "Hello, (beautiful) world" into "hello", "beautiful", "world".

So later, when you query for "hello-world", that also gets tokenized into "hello", "world". And then Weaviate looks for a match of “hello” and “world”, which results in a positive match.

Field tokenization preserves the whole field as a single index item.
For example, "Hello, (beautiful) world" stays as "Hello, (beautiful) world".
Then the query "hello-world" won’t match, as it is different.
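
The difference can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not Weaviate's actual implementation: the function names (tokenize_word, tokenize_field, matches) are made up for illustration, the word tokenizer handles ASCII only, and trimming whitespace in the field tokenizer is an assumption.

```python
import re

def tokenize_word(text: str) -> list[str]:
    """Approximate `word` tokenization: lowercase, keep alphanumeric runs (ASCII only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_field(text: str) -> list[str]:
    """Approximate `field` tokenization: the whole value is a single token.
    Trimming surrounding whitespace is an assumption made for this sketch."""
    return [text.strip()]

def matches(stored: str, query: str, tokenizer) -> bool:
    """A query matches when every query token is among the stored tokens."""
    stored_tokens = set(tokenizer(stored))
    return all(token in stored_tokens for token in tokenizer(query))

stored = "Hello, (beautiful) world"
print(tokenize_word(stored))                           # ['hello', 'beautiful', 'world']
print(matches(stored, "hello-world", tokenize_word))   # True
print(matches(stored, "hello-world", tokenize_field))  # False
```

Note that in this sketch word order and adjacency play no role in the word-tokenized match; only the presence of each token matters.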


OK, I see. Thanks.

So, how do I fix this?

Also, if I change the tokenization on these objects to accommodate keyword filtering, how will that affect my NearText and Hybrid semantic queries which, to this point, have been working very well?

I think you might need to recreate the collection, as I don’t think you can update tokenization.

@DudaNogueira do you know if we can update tokenization on an existing collection?

This won’t affect NearText, as it only uses vector embeddings for search.

However, yes, this would affect the keyword component of the Hybrid search.

I don’t necessarily recommend changing all properties to Field tokenization. However, if you need an exact match on say the URL, then you could set that property to use Field tokenization.
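
As a rough illustration of that suggestion, a single property definition with field tokenization might look like the sketch below. It assumes the same property layout as the schema posted earlier in this thread (name, dataType, description, tokenization) and only builds and prints the JSON; it does not call Weaviate. Since tokenization reportedly cannot be changed on an existing collection, a definition like this would have to be supplied when the collection is created.

```python
import json

# Illustrative property definition: only "url" is switched to field tokenization.
# The other text properties would keep the default (word) tokenization.
url_property = {
    "name": "url",
    "dataType": ["text"],
    "description": "The url to the content on site",
    "tokenization": "field",  # index the whole value as one token
}

# Serialize it the way it would appear in a JSON schema payload.
print(json.dumps(url_property, indent=2))
```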


@sebawita Tokenization is immutable.

The collection needs to be recreated.

Regards,
Mohamed Shahin,
Support Engineer


I don’t know. It sounds like FIELD tokenization may not be the solution here. The entire field may be 2500 characters long, and all I want to do is match a string within that field. But, it doesn’t sound like that’s possible with FIELD tokenization either – it sounds like I’d need to match the entire field.

That’s not what I want to do either.

It would be nice if you guys had a way to match a keyword string within a field, whether with WORD or FIELD tokenization.

Thank you @Mohamed_Shahin and @sebawita for your help.

I thought the filtering supported phrase searching because we normally use it with nouns - unique persons, places and things. It did not work as expected with ‘not working yet’ because we are dealing with a less unique verb and adverbs.

Having worked with database and text retrieval systems for the past 40+ years, I am a bit surprised at discovering that Weaviate does not support substring phrase searching in keyword filters.

Nonetheless, I will follow @Mohamed_Shahin’s advice and submit a feature request.


Hey @SomebodySysop,

I misread your original question, as I thought that you wanted an exact match on the whole field. :man_facepalming:

You don’t need to change to Field; that was incorrect advice on my part.
Apologies :pray:

Question

When you get your results back, do any of the properties contain the 3 words “not”, “working”, “yet” (not necessarily next to each other)?

Bonus

As a side note, I dived deeper into your GraphQL query.
There seems to be something odd about your where filter, as you have a few AND/OR operators with only one operand inside. Although, I am not sure if that affects your query.

For example, this OR contains only “groups”==“Development”

{operator: Or, operands: [ #THIS OR only contains "groups"=="Development"
  {path: ["groups"], operator: Equal, valueText: "Development"}
]},

And there is a bunch of AND operators like this:

{operator: And, operands: [
  {path: ["content"], operator: Like, valueText: "'not working yet'"}
]},

Here is your fully formatted GraphQL:

{
  Get {
    SolrCopy(
      limit: 10
      hybrid: {query: "query configurations in \"It's not working yet\"", alpha: 0.5}
      where: {
        operator: And, operands: [ # AND1 START
          {path: ["site"], operator: Equal, valueText: "https://master1and1.schoolboard.net/"},
          {operator: Or, operands: [ #THIS OR only contains "groups"=="Development"
            {path: ["groups"], operator: Equal, valueText: "Development"}
          ]},
          {operator: Or, operands: [ 
            {operator: And, operands: [
              {path: ["content"], operator: Like, valueText: "'not working yet'"}
            ]}, 
            {operator: And, operands: [
              {path: ["title"], operator: Like, valueText: "'not working yet'"}
            ]},
            {operator: And, operands: [
              {path: ["location"], operator: Like, valueText: "'not working yet'"}
            ]}, 
            {operator: And, operands: [
              {path: ["summary"], operator: Like, valueText: "'not working yet'"}
            ]}
          ]}
        ]}# AND1 END
    ) 
{ _additional { distance score } docId site title nid type public url content taxonomy groups date summary questions sourceUrl solrId location } } }

Your query could be trimmed to:

{
  Get {
    SolrCopy(
      limit: 10
      hybrid: {query: "query configurations in \"It's not working yet\"", alpha: 0.5}
      where: {
        operator: And, operands: [ # AND1 START
          {path: ["site"], operator: Equal, valueText: "https://master1and1.schoolboard.net/"},
          {path: ["groups"], operator: Equal, valueText: "Development"},
          {operator: Or, operands: [ 
              {path: ["content"], operator: Like, valueText: "'not working yet'"}, 
              {path: ["title"], operator: Like, valueText: "'not working yet'"},
              {path: ["location"], operator: Like, valueText: "'not working yet'"},
              {path: ["summary"], operator: Like, valueText: "'not working yet'"}
          ]}
        ]}# AND1 END
    )
{ _additional { distance score } docId site title nid type public url content taxonomy groups date summary questions sourceUrl solrId location } } }

Yes, that is correct.

And there is a bunch of AND operators like this:

There could be other restrictions as well. The query looks a little clunky because it’s generated by code designed to dynamically create the filters according to the options it receives. I could try using a model to do this, but code is more reliable.
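
For generated filters like this, the single-operand groups could be collapsed mechanically before the query is sent. The following is a minimal sketch in Python rather than the PHP actually used here; the filter is modelled as nested dictionaries mirroring the GraphQL where structure, and the helper names (simplify, like_any) are hypothetical.

```python
def simplify(node: dict) -> dict:
    """Recursively unwrap And/Or groups that hold exactly one operand."""
    operands = node.get("operands")
    if operands is None:
        return node  # leaf condition (has a "path"); nothing to unwrap
    simplified = [simplify(child) for child in operands]
    if len(simplified) == 1:
        return simplified[0]  # And/Or over a single operand is just that operand
    return {"operator": node["operator"], "operands": simplified}

def like_any(paths: list[str], text: str) -> dict:
    """Build the Or-of-single-And groups the way the generated query does."""
    return {"operator": "Or", "operands": [
        {"operator": "And", "operands": [
            {"path": [p], "operator": "Like", "valueText": text}
        ]} for p in paths
    ]}

where = {"operator": "And", "operands": [
    {"path": ["site"], "operator": "Equal",
     "valueText": "https://master1and1.schoolboard.net/"},
    {"operator": "Or", "operands": [
        {"path": ["groups"], "operator": "Equal", "valueText": "Development"}
    ]},
    like_any(["content", "title", "location", "summary"], "'not working yet'"),
]}

trimmed = simplify(where)
print(trimmed)
```

Unwrapping a group with one operand does not change the logic of the filter, so the trimmed structure should select the same objects as the generated one.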

Yes, the results will contain all the words in at least one of the property fields listed. Sort of like the “contains all” filter, but not necessarily as a phrase ‘not working yet’, which is what I was looking for.
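
To make the distinction concrete, here is a small Python sketch contrasting the two behaviours. It is not how Weaviate evaluates filters; the function names and sample sentences are made up, and the word splitting is a simplified ASCII-only approximation of word tokenization.

```python
import re

def words(text: str) -> list[str]:
    """Lowercase and split into alphanumeric words (ASCII-only approximation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_all_words(text: str, phrase: str) -> bool:
    """True when every word of the phrase occurs somewhere in the text."""
    have = set(words(text))
    return all(w in have for w in words(phrase))

def contains_phrase(text: str, phrase: str) -> bool:
    """True only when the phrase words occur consecutively and in order."""
    hay, needle = words(text), words(phrase)
    n = len(needle)
    return n > 0 and any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))

scattered = "It is not clear yet whether the working group agrees."
adjacent = "It's not working yet."

print(contains_all_words(scattered, "not working yet"))  # True
print(contains_phrase(scattered, "not working yet"))     # False
print(contains_phrase(adjacent, "not working yet"))      # True
```

A check like contains_phrase could in principle be applied client-side to the returned objects to keep only true phrase matches, though that is a workaround of my own framing here, not something proposed in this thread.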

Thanks for the assistance!

Feature request posted: “Keyword filters should support substring phrase searching” (Issue #7206, weaviate/weaviate on GitHub)