Possible bug in Equal operator?

Description

Hi,

When making a GraphQL search request against our database, we filter on the article and provider:

example query:

{
  query:
    Get {
      test_collection(
        limit: 5,
        nearVector: {
          vector: [vector]
        },
        where: {
          operator: And,
          operands: [
            {
              path: ["article"],
              operator: Equal,
              valueString: "Article 3.6"
            },
            {
              path: ["provider"],
              operator: Equal,
              valueString: "provider 1"
            }
          ]
        }
      ) {
        title
        content
        provider
        article
        _additional {
          distance
          id
        }
      }
    }
  }

The problem is that it returns results for both Article 3.6 and Article 6.3. I was wondering how the Equal operator works?

Does it match strings?

Thank you for your time.

Hey @Steve,

That’s very odd! Yes, Equal does match on the keywords, but this might be a tokenization issue.

Could you please provide more details on the following:

  • WeaviateDB version
  • Schema config in full

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)

Hi,

Thank you for the quick reply:

Our schema:

{
  'class': 'test_collection',
  'invertedIndexConfig': {
    'bm25': {
      'b': 0.75,
      'k1': 1.2
    },
    'cleanupIntervalSeconds': 60,
    'stopwords': {
      'additions': None,
      'preset': 'en',
      'removals': None
    }
  },
  'moduleConfig': {
    'text2vec-openai': {
      'baseURL': 'baseurl',
      'deploymentId': 'text-embedding-ada-002',
      'model': 'ada',
      'modelVersion': '002',
      'resourceName': 'text-embedding-ada-002',
      'skip': True,
      'vectorizeClassName': True
    }
  },
  'multiTenancyConfig': {
    'autoTenantActivation': False,
    'autoTenantCreation': False,
    'enabled': False
  },
  'properties': [
    {
      'dataType': [
        'text'
      ],
      'description': 'title of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'title',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'law to which the chunk belongs',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'law',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'article of the law of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'article',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'section of the law to wich the chunk belongs',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'section',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'jci of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'jci',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'uri of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'uri',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'provider of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'provider',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'content of the chunk',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'content',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text[]'
      ],
      'description': 'list of accessible ids',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'access',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'location of the found chunk (Kluwer commentaar)',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'location',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'uuid'
      ],
      'description': 'id of the parent document',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': False,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'parent_id'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'valid_from',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'valid_from',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'valid_until',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'valid_until',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': 'hash of the content',
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'baseURL': 'baseurl',
          'deploymentId': 'text-embedding-ada-002',
          'model': 'ada',
          'modelVersion': '002',
          'resourceName': 'text-embedding-ada-002',
          'skip': True,
          'vectorizePropertyName': False
        }
      },
      'name': 'hash',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': "This property was generated by Weaviate's auto-schema feature on Mon Mar  3 13:38:22 2025",
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'skip': False,
          'vectorizePropertyName': False
        }
      },
      'name': 'document_type',
      'tokenization': 'word'
    },
    {
      'dataType': [
        'text'
      ],
      'description': "This property was generated by Weaviate's auto-schema feature on Mon Mar  3 13:38:22 2025",
      'indexFilterable': True,
      'indexRangeFilters': False,
      'indexSearchable': True,
      'moduleConfig': {
        'text2vec-openai': {
          'skip': False,
          'vectorizePropertyName': False
        }
      },
      'name': 'type',
      'tokenization': 'word'
    }
  ],
  'replicationConfig': {
    'asyncEnabled': False,
    'factor': 1
  },
  'shardingConfig': {
    'actualCount': 1,
    'actualVirtualCount': 128,
    'desiredCount': 1,
    'desiredVirtualCount': 128,
    'function': 'murmur3',
    'key': '_id',
    'strategy': 'hash',
    'virtualPerPhysical': 128
  },
  'vectorIndexConfig': {
    'bq': {
      'enabled': False
    },
    'cleanupIntervalSeconds': 300,
    'distance': 'l2-squared',
    'dynamicEfFactor': 8,
    'dynamicEfMax': 500,
    'dynamicEfMin': 100,
    'ef': -1,
    'efConstruction': 128,
    'filterStrategy': 'sweeping',
    'flatSearchCutoff': 40000,
    'maxConnections': 64,
    'pq': {
      'bitCompression': False,
      'centroids': 256,
      'enabled': False,
      'encoder': {
        'distribution': 'log-normal',
        'type': 'kmeans'
      },
      'segments': 0,
      'trainingLimit': 100000
    },
    'skip': False,
    'sq': {
      'enabled': False,
      'rescoreLimit': 20,
      'trainingLimit': 100000
    },
    'vectorCacheMaxObjects': 1000000000000
  },
  'vectorIndexType': 'hnsw',
  'vectorizer': 'text2vec-openai'
}

WeaviateDBVersion: 1.28

Hi @Steve !!

Since your article property uses word tokenization, the string Article 3.6 becomes three tokens: article, 3, and 6. Article 6.3 produces the same three tokens, which is why both are returned.

If you want to use article to filter results, you will need to set the tokenization to field, which gives you a single token with the value Article 3.6.

In that scenario, your Equal comparison will work as you expect.
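To make the difference concrete, here is a small standalone Python sketch. It does not call Weaviate; it only imitates the two tokenizers under simplifying assumptions: word splits on every non-alphanumeric character (ASCII only here) and lowercases, field keeps the trimmed value as a single token, and Equal is assumed to match when every token of the filter value is present in the stored value. The function names are made up for illustration.

```python
import re


def tokenize_word(value):
    # Simplified model of `word` tokenization: split on anything that is
    # not an ASCII letter or digit, then lowercase each piece.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def tokenize_field(value):
    # Simplified model of `field` tokenization: the whole trimmed value
    # is kept as one token.
    return [value.strip()]


def equal_matches(query, stored, tokenize):
    # Assumption: Equal matches when every query token is among the
    # tokens of the stored value.
    stored_tokens = set(tokenize(stored))
    return all(tok in stored_tokens for tok in tokenize(query))


print(tokenize_word("Article 3.6"))   # ['article', '3', '6']
print(tokenize_word("Article 6.3"))   # ['article', '6', '3']
print(equal_matches("Article 3.6", "Article 6.3", tokenize_word))   # True
print(equal_matches("Article 3.6", "Article 6.3", tokenize_field))  # False
```

With word tokenization both values reduce to the same set of tokens, so the filter cannot tell them apart; with field tokenization the values are compared as whole strings.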

Check here for more information on tokenization: Overview of tokenization | Weaviate Documentation

Let us know if this helps :slight_smile:

Ah, alright. Is there any way to change the tokenization of a live collection? We already have some data and want to avoid re-vectorizing the database.

You can add a new property with field tokenization and set its value.
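As a rough sketch, the property definition could look like the one built below. The name article_exact and its description are hypothetical; the keys mirror the ones in your schema dump. To my understanding, a definition like this can be sent to the REST endpoint POST /v1/schema/{className}/properties, but please check that against the docs for your version. Existing objects will not have a value for the new property until you write one.

```python
import json


def field_property_payload(name, description):
    # Build a property definition that stores the value as a single
    # token, so Equal filters compare the whole string.
    return {
        "name": name,
        "dataType": ["text"],
        "description": description,
        "indexFilterable": True,
        "tokenization": "field",
    }


# `article_exact` is a hypothetical name for the filter-only copy of `article`.
payload = field_property_payload(
    "article_exact",
    "article of the law of the chunk, kept as one token for exact filtering",
)
print(json.dumps(payload, indent=2))
```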

By the way, a nice trick is to duplicate a field you want to both search and filter on: one copy with word tokenization for searching, and the other with field tokenization for filtering.

Also, you don’t need to re-vectorize your data. You can migrate your dataset from one collection to a new one in the same cluster, with the new configuration, and carry over the vectors.

Check below. Note that it specifies the vectors explicitly for the target collection.
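A minimal sketch of such a migration, assuming the Python client v4 and a single unnamed vector per object (exposed under the key "default"). The function name, the target collection name in the usage comment, and the extra article_exact property are hypothetical; the target collection is assumed to already exist with the new tokenization settings.

```python
def migrate_with_vectors(source, target, vector_name="default"):
    """Copy every object from `source` to `target`, reusing stored vectors.

    `source` and `target` are expected to be collection handles from the
    Weaviate Python client v4 (e.g. client.collections.get(...)).
    """
    copied = 0
    with target.batch.dynamic() as batch:
        # include_vector=True returns the stored vector with each object,
        # so nothing has to be sent to the vectorizer again.
        for obj in source.iterator(include_vector=True):
            props = dict(obj.properties)
            # Hypothetical filter-only copy of `article`; the target
            # schema would define it with `field` tokenization.
            props["article_exact"] = props.get("article")
            batch.add_object(
                properties=props,
                vector=obj.vector[vector_name],  # carry the vector over as-is
                uuid=obj.uuid,                   # keep the same object id
            )
            copied += 1
    return copied


# Usage sketch (needs a running instance; collection names are placeholders):
# import weaviate
# client = weaviate.connect_to_local()
# src = client.collections.get("test_collection")
# tgt = client.collections.get("test_collection_v2")
# print(migrate_with_vectors(src, tgt))
# client.close()
```

Because the vector is passed explicitly for each object, the target collection stores it directly instead of calling the vectorizer.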

Let me know if this helps!

Happy coding!

Thank you for all the help!
