Lowercase tokenization doesn't seem to be working

JohannesPertl · February 13, 2024, 1:40am

This is my schema:

{
    "dataType": ["text"],
    "description": "The embedded content",
    "moduleConfig": {
        "text2vec-transformers": {
            "skip": False,
            "vectorizePropertyName": False,
        },
        "reranker-cohere": {
            "model": "rerank-multilingual-v2.0",
        },
    },
    "name": "aiContent",
    "tokenization": "lowercase",
},

A search for “jugend” does not find relevant pages, while a search for “Jugend” correctly finds this: Angebote und Programme speziell für Jugendliche

How can this be if the tokenization lowercases the entire text?
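To make the question concrete, here is a minimal sketch of what `lowercase` tokenization does according to the Weaviate documentation: lowercase the text, then split on whitespace. The helper `lowercase_tokenize` is a hypothetical stand-in written for illustration, not Weaviate's actual implementation.

```python
def lowercase_tokenize(text: str) -> list[str]:
    """Approximate `lowercase` tokenization: lowercase, then split on whitespace."""
    return text.lower().split()


doc_tokens = lowercase_tokenize("Angebote und Programme speziell für Jugendliche")
print(doc_tokens)

for query in ("jugend", "Jugend"):
    query_tokens = lowercase_tokenize(query)
    # A keyword match needs a whole token to be equal, not just a prefix.
    exact = any(q in doc_tokens for q in query_tokens)
    # A wildcard filter such as *jugend* only needs a substring match.
    substring = any(q in t for q in query_tokens for t in doc_tokens)
    print(query, "exact token match:", exact, "substring match:", substring)
```

Under these assumptions both spellings of the query reduce to the same token, so the keyword index alone should treat “jugend” and “Jugend” identically. Note that the stored token is `jugendliche`, so neither spelling is an exact token match for this sentence.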

DudaNogueira · February 15, 2024, 1:44pm

Hi!

I was not able to reproduce this.

Here is what I got:

import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

# Start from a clean collection so the example is repeatable
client.collections.delete("Tokenization")
col = client.collections.create(
    name="Tokenization",
    vectorizer_config=None,
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            description="The text.",
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.LOWERCASE,
        ),
    ],
)

# Add two objects: one containing "Jugendliche", one containing lowercase "jugend"
col.data.insert({"text": "Angebote und Programme speziell Jugendliche"})
col.data.insert({"text": "I like jugend"})

# Filter with a lowercase wildcard pattern
col = client.collections.get("Tokenization")
q = col.query.fetch_objects(filters=wvc.query.Filter.by_property("text").like("*jugend*"))
for r in q.objects:
    print(r.properties)

# Release the connection when done
client.close()

This yields both objects:
{'text': 'Angebote und Programme speziell Jugendliche'}
{'text': 'I like jugend'}

Thanks!

JohannesPertl · February 20, 2024, 10:58pm

I experimented with this, and the issue only occurs with vector similarity or hybrid search. With hybrid search, rankedFusion leads to slightly better results, but when searching for ‘jugend’, it also ranks other, unrelated pages way higher than some pages that include the word ‘Jugend’. A search for ‘Jugend’ works correctly.

Hybrid search:

const searchResult = await client.graphql
    .get()
    .withClassName('Page')
    .withFields(`aiContent _additional { score explainScore }`)
    .withHybrid({ query: query, fusionType: FusionType.relativeScoreFusion })
    .withAutocut(1)
    .do();

nearText:

const searchResult = await client.graphql
    .get()
    .withClassName('Page')
    .withFields(`aiContent _additional { score explainScore }`)
    .withNearText({ concepts: [query] })
    .withAutocut(1)
    .do();

I have to use relativeScoreFusion because I need withAutocut: without it, the search always returns my whole dataset, regardless of which query is used.
What I don’t understand is why a search for ‘jugend’ leads to different results than a search for ‘Jugend’. That’s why I initially thought the lowercase tokenization was the problem.
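For readers comparing the two fusion types mentioned here, the following is a rough sketch of how they differ, based on how the Weaviate documentation describes them: rankedFusion scores each object by its rank in each result list, while relativeScoreFusion min-max normalises the raw scores before combining them. The function names, the constant `k=60`, the equal weights and all scores below are assumptions made up for illustration; they are not taken from the dataset in this thread.

```python
def ranked_fusion(result_lists, weights, k=60):
    """Sum weight / (k + rank) over each result list; rank starts at 0."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        ordered = sorted(results, key=results.get, reverse=True)
        for rank, doc_id in enumerate(ordered):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return fused


def relative_score_fusion(result_lists, weights):
    """Min-max normalise each list's scores to [0, 1], then take a weighted sum."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        lo, hi = min(results.values()), max(results.values())
        span = hi - lo
        for doc_id, score in results.items():
            norm = (score - lo) / span if span else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * norm
    return fused


# Hypothetical scores for three pages from the keyword and vector searches
keyword = {"page_a": 4.2, "page_b": 0.3, "page_c": 0.1}
vector = {"page_a": 0.91, "page_b": 0.90, "page_c": 0.35}
weights = [0.5, 0.5]

ranked = ranked_fusion([keyword, vector], weights)
relative = relative_score_fusion([keyword, vector], weights)
for doc_id in keyword:
    print(doc_id, round(ranked[doc_id], 5), round(relative[doc_id], 5))
```

In this toy example the rank-based scores are almost evenly spaced, while the relative scores keep the large gap between the first page and the rest. That might explain why a score-gap cutoff such as autocut behaves more usefully with relativeScoreFusion, although it does not by itself explain the difference between ‘jugend’ and ‘Jugend’.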
