This is my schema
{
    "dataType": ["text"],
    "description": "The embedded content",
    "moduleConfig": {
        "text2vec-transformers": {
            "skip": False,
            "vectorizePropertyName": False,
        },
        "reranker-cohere": {
            "model": "rerank-multilingual-v2.0",
        },
    },
    "name": "aiContent",
    "tokenization": "lowercase",
},
A search for "jugend" does not find relevant pages, while a search for "Jugend" correctly finds this: "Angebote und Programme speziell für Jugendliche"
How can this be if the tokenization lowercases the entire text?
Hi!
I was not able to reproduce this.
Here is what I got:
import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

# Start from a clean collection
client.collections.delete("Tokenization")
col = client.collections.create(
    name="Tokenization",
    vectorizer_config=None,
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            description="The text.",
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.LOWERCASE,
        ),
    ],
)

# Let's add two objects
col.data.insert({'text': 'Angebote und Programme speziell Jugendliche'})
col.data.insert({'text': 'I like jugend'})

# Filter with a lowercase wildcard pattern
col = client.collections.get("Tokenization")
q = col.query.fetch_objects(filters=wvc.query.Filter.by_property("text").like("*jugend*"))
for r in q.objects:
    print(r.properties)
This will yield both objects:
{'text': 'Angebote und Programme speziell Jugendliche'}
{'text': 'I like jugend'}
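This result is consistent with how `lowercase` tokenization is documented to behave: the text is lowercased and split on whitespace before it is indexed. Below is a minimal sketch of that idea in plain Python. It is not Weaviate's actual implementation; `tokenize_lowercase` and `like_match` are hypothetical helper names, `fnmatch` stands in for the `Like` wildcard matching, and lowercasing the pattern as well is an assumption that the result above suggests.

```python
from fnmatch import fnmatchcase


def tokenize_lowercase(text: str) -> list[str]:
    """Simplified model of `lowercase` tokenization: lowercase, then split on whitespace."""
    return text.lower().split()


def like_match(text: str, pattern: str) -> bool:
    """Return True if any token matches the wildcard pattern.

    Assumption: the pattern is lowercased in the same way as the indexed text.
    """
    pattern = pattern.lower()
    return any(fnmatchcase(token, pattern) for token in tokenize_lowercase(text))


docs = [
    "Angebote und Programme speziell Jugendliche",
    "I like jugend",
]
matches = [d for d in docs if like_match(d, "*jugend*")]
print(matches)
```

Under this model both documents match, regardless of the casing used in the stored text or in the pattern.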
Thanks!
I experimented with this, and the issue only occurs with vector similarity or hybrid search. With hybrid search, rankedFusion gives slightly better results, but when searching for "jugend", it still ranks other, unrelated pages far higher than some pages that include the word "Jugend". A search for "Jugend" works correctly.
Hybrid search:
const searchResult = await client.graphql
  .get()
  .withClassName('Page')
  .withFields(`aiContent _additional { score explainScore }`)
  .withHybrid({ query: query, fusionType: FusionType.relativeScoreFusion })
  .withAutocut(1)
  .do();
nearText:
const searchResult = await client.graphql
  .get()
  .withClassName('Page')
  .withFields(`aiContent _additional { score explainScore }`)
  .withNearText({ concepts: [query] })
  .withAutocut(1)
  .do();
I have to use relativeScoreFusion because I need withAutocut: without it, the search always returns my whole dataset, regardless of which query is used.
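For context on why the fusion type matters here, the sketch below models the two hybrid fusion strategies as they are commonly described: ranked fusion scores each result only by its position, roughly 1 / (rank + 60), while relative score fusion min-max normalizes the raw scores of each search before combining them. This is a simplified illustration, not Weaviate's code; the function names, the constant 60, the `alpha` weighting, and the sample scores are all assumptions made for the example.

```python
def ranked_fusion(vector_ids, keyword_ids, alpha=0.5, k=60):
    """Combine two ranked id lists using rank-based scores, ignoring raw score values."""
    scores = {}
    for weight, ids in ((alpha, vector_ids), (1 - alpha, keyword_ids)):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + k)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def relative_score_fusion(vector_scores, keyword_scores, alpha=0.5):
    """Min-max normalize each score dict to [0, 1], then take the weighted sum."""
    def normalize(raw):
        if not raw:
            return {}
        lo, hi = min(raw.values()), max(raw.values())
        if hi == lo:
            return {doc_id: 1.0 for doc_id in raw}
        return {doc_id: (s - lo) / (hi - lo) for doc_id, s in raw.items()}

    fused = {}
    for weight, raw in ((alpha, vector_scores), (1 - alpha, keyword_scores)):
        for doc_id, s in normalize(raw).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores: "a" and "b" are close in the vector search, "c" is far behind.
vector_scores = {"a": 0.91, "b": 0.90, "c": 0.40}
keyword_scores = {"b": 3.2, "c": 1.1}

ranked = ranked_fusion(["a", "b", "c"], ["b", "c"])
relative = relative_score_fusion(vector_scores, keyword_scores)
print([doc_id for doc_id, _ in ranked])
print([doc_id for doc_id, _ in relative])
```

With these made-up numbers, ranked fusion places "c" above "a" even though its vector score is far lower, whereas relative score fusion keeps the score gap visible. As far as I understand, autocut looks for exactly such jumps in the score, which would explain why it is paired with relativeScoreFusion.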
What I don't understand is why a search for "jugend" leads to different results than a search for "Jugend". That's why I initially thought the lowercase tokenization was the problem.
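One way to narrow this down is to run the same term in both casings through each search mode separately and compare the returned IDs. If only the vector-based modes differ, the keyword index and its tokenization are unlikely to be the cause. The helper below is a minimal, client-agnostic sketch: `compare_casings` and the `search` callable are hypothetical names, and the two stub functions with their made-up page IDs merely stand in for real calls such as `col.query.bm25(...)`, `col.query.near_text(...)`, or `col.query.hybrid(...)` in the Python v4 client.

```python
from typing import Callable, Sequence


def compare_casings(search: Callable[[str], Sequence[str]], term: str) -> dict:
    """Run `search` for the lowercase and capitalized term and report how the results differ."""
    lower_ids = list(search(term.lower()))
    upper_ids = list(search(term.capitalize()))
    return {
        "same_order": lower_ids == upper_ids,
        "only_lowercase": sorted(set(lower_ids) - set(upper_ids)),
        "only_capitalized": sorted(set(upper_ids) - set(lower_ids)),
    }


# Stub searches standing in for real queries (hypothetical data).
def keyword_stub(query: str) -> list[str]:
    # Models a case-insensitive keyword index: same hits for both casings.
    return ["page-1", "page-2"] if query.lower() == "jugend" else []


def vector_stub(query: str) -> list[str]:
    # Models the assumption that the embedding differs per casing.
    return ["page-1", "page-2"] if query == "Jugend" else ["page-7", "page-2"]


keyword_report = compare_casings(keyword_stub, "jugend")
vector_report = compare_casings(vector_stub, "jugend")
print(keyword_report)
print(vector_report)
```

To use it against a real collection, replace each stub with a function that runs one query type and returns the object IDs in result order.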