Lowercase tokenization doesn't seem to be working

This is my schema:

{
            "dataType": ["text"],
            "description": "The embedded content",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False,
                },
                "reranker-cohere": {
                    "model": "rerank-multilingual-v2.0",
                },
            },
            "name": "aiContent",
            "tokenization": "lowercase",
        },

A search for “jugend” does not find relevant pages, while a search for “Jugend” correctly finds this: Angebote und Programme speziell für Jugendliche

How can this be if the tokenization lowercases the entire text?

Hi!

I was not able to reproduce this.

Here is what I got:

import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

client.collections.delete("Tokenization")
col = client.collections.create(
    name="Tokenization",
    vectorizer_config=None,
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            description="The text.",
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.LOWERCASE,
        ),
    ],
)

# Add two objects: one containing "Jugendliche", one containing "jugend"
col.data.insert({'text': 'Angebote und Programme speziell Jugendliche'})
col.data.insert({'text': 'I like jugend'})

col = client.collections.get("Tokenization")
q = col.query.fetch_objects(filters=wvc.query.Filter.by_property("text").like("*jugend*"))
for r in q.objects:
    print(r.properties)

client.close()  # release the connection when done

This yields both objects:
{'text': 'Angebote und Programme speziell Jugendliche'}
{'text': 'I like jugend'}
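
To make it clearer what the filter above is matching against, here is a rough pure-Python sketch of what `lowercase` tokenization does, as I understand it from the docs: the text is lowercased and split on whitespace, and the resulting tokens go into the inverted index that filters and keyword search use. The helper names `lowercase_tokenize` and `like_matches` are made up for illustration, and the wildcard matching via `fnmatchcase` is only an approximation of the Like operator, not Weaviate's actual implementation.

```python
from fnmatch import fnmatchcase


def lowercase_tokenize(text: str) -> list[str]:
    """Approximate `lowercase` tokenization: lowercase, then split on whitespace."""
    return text.lower().split()


def like_matches(text: str, pattern: str) -> bool:
    """Rough stand-in for a Like filter: wildcard match against each token."""
    return any(fnmatchcase(token, pattern) for token in lowercase_tokenize(text))


docs = [
    "Angebote und Programme speziell Jugendliche",
    "I like jugend",
]

for doc in docs:
    tokens = lowercase_tokenize(doc)
    # "jugend" as an exact token vs. "*jugend*" as a wildcard pattern
    print(tokens, "jugend" in tokens, like_matches(doc, "*jugend*"))
```

Note that in this sketch the first object only contains the token `jugendliche`, so it matches the wildcard `*jugend*` but not the exact token `jugend`. None of this touches the vector: the embedding is produced separately by the vectorizer module.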

Thanks!

I experimented with this, and the issue only occurs with vector similarity or hybrid search. With hybrid search, rankedFusion leads to slightly better results, but when searching for ‘jugend’, it also ranks other, unrelated pages way higher than some pages that include the word ‘Jugend’. A search for ‘Jugend’ works correctly.

Hybrid search:

const searchResult = await client.graphql
    .get()
    .withClassName('Page')
    .withFields(`aiContent _additional { score explainScore }`)
    .withHybrid({ query: query, fusionType: FusionType.relativeScoreFusion })
    .withAutocut(1)
    .do();

nearText:

const searchResult = await client.graphql
    .get()
    .withClassName('Page')
    .withFields(`aiContent _additional { score explainScore }`)
    .withNearText({concepts: [query]})
    .withAutocut(1)
    .do();

I have to use relativeScoreFusion, because if I don’t use withAutocut, the search always returns my whole dataset, regardless of which query is used.
What I don’t understand is why a search for ‘jugend’ leads to different results than a search for ‘Jugend’. That’s why I initially thought the lowercase tokenization was the problem.
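
For anyone comparing the two fusion types mentioned above, here is a minimal pure-Python sketch of how they could combine a keyword result list with a vector result list. It is only an illustration under assumptions: the reciprocal-rank form with the constant 60 and 1-based ranks, the min-max normalization, the `alpha` value, and all document ids and scores are hypothetical and not taken from my dataset or from Weaviate's source code.

```python
def ranked_fusion(result_lists, weights, k=60):
    """Sum weight / (k + rank) over every list a document appears in."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def relative_score_fusion(result_lists, weights):
    """Min-max normalize each list's scores to [0, 1], then take a weighted sum."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        if not results:
            continue
        scores = [score for _, score in results]
        low, span = min(scores), max(scores) - min(scores)
        for doc_id, score in results:
            normalized = (score - low) / span if span else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * normalized
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results, each sorted best-first as (id, raw score).
keyword = [("page_jugend", 2.1), ("page_other", 0.4)]
vector = [("page_unrelated", 0.62), ("page_jugend", 0.60), ("page_other", 0.35)]

alpha = 0.5  # assumed weight of the vector search; keyword gets 1 - alpha
weights = [1 - alpha, alpha]

ranked = ranked_fusion([keyword, vector], weights)
relative = relative_score_fusion([keyword, vector], weights)
print(ranked)
print(relative)
```

In this sketch the keyword side is identical for ‘jugend’ and ‘Jugend’ once the tokens are lowercased, so any difference between the two queries would have to come from the vector list.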