How to specify stopwords with the Python V4 API

Hi,
I’m using the Python V4 API and I’d like to specify stopwords for French.
In V3, this comes down to setting up invertedIndexConfig.
I can’t find the equivalent for V4.

So far I have:

client.collections.create(
    name=collection_name,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    inverted_index_config=wvc.Configure.inverted_index( ??? ),
    properties=[ list of properties ],
)

What parameters should I pass to wvc.Configure.inverted_index( ... ) so that I can specify a list of stopwords?

Thanks

Digging further into the documentation and classes, I found the following solution to add a list of stopwords to the collection.

To see the configuration of the collection use:

collection.config.get()
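To look at just the stopwords part of that configuration, something like the following may help. It is a minimal sketch: summarize_stopwords is a hypothetical helper name, and it assumes the object returned by collection.config.get() exposes inverted_index_config.stopwords with preset, additions and removals attributes.

```python
def summarize_stopwords(config):
    """Return the stopwords settings of a collection config as a plain dict.

    `config` is assumed to be the object returned by collection.config.get().
    """
    stopwords = config.inverted_index_config.stopwords
    return {
        # The preset may be an enum member; fall back to the raw value otherwise.
        "preset": getattr(stopwords.preset, "value", stopwords.preset),
        "additions": list(stopwords.additions or []),
        "removals": list(stopwords.removals or []),
    }
```

It would be called as summarize_stopwords(collection.config.get()).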

To set the stopwords using the wvc.Configure.inverted_index( ... ) function, first define its parameters (all default values except for stopwords_additions):

params = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps":  False,
    "index_property_length":  False,
    "index_null_state":  False,
    "stopwords_preset": None,
    "stopwords_additions":  list_stopwords,
    "stopwords_removals": None,
}

Then create the collection, passing params to the function:

collection = client.collections.create(
    name= <collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    inverted_index_config = wvc.Configure.inverted_index(**params),
    properties=[
        < list of properties>
    ]
)

However, this fails to set stopwords_preset to None, although None seems to be a valid value for that parameter.
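Since only stopwords_additions differs from the defaults, the dictionary can probably be reduced to the entries that are actually set. A minimal sketch, assuming every argument of wvc.Configure.inverted_index is optional and that omitted arguments fall back to the defaults (explicit_kwargs is a hypothetical helper name):

```python
def explicit_kwargs(params):
    """Keep only the entries that were explicitly set (value is not None)."""
    return {key: value for key, value in params.items() if value is not None}


params = {
    "stopwords_preset": None,
    "stopwords_additions": ["le", "la", "il", "elle"],
    "stopwords_removals": None,
}
kwargs = explicit_kwargs(params)
print(kwargs)  # {'stopwords_additions': ['le', 'la', 'il', 'elle']}
```

The result would then be passed as wvc.Configure.inverted_index(**kwargs). This only shortens the call; it does not change the preset behaviour described above.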

You can also create the collection without specifying the stopwords, and later update them with:

collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)

I wrote a short blog post to summarize everything:

Adding French stopwords to a Weaviate collection with Python V4 API

Hi @alexisperrier !!

Wow! Thanks a lot for writing and sharing your blog post! Glad you figured it out!

We should be updating the docs soon.

Thanks!

As of May 2024, the problem is still present: the EN stopwords are added to the collection configuration even if you explicitly pass “stopwords_preset”: None.

I did the same as above but still see <StopwordsPreset.EN: ‘en’> in the output of config.get().

How can I properly define my collection to use ONLY my provided additions?

PS: As of May 2024 there’s a small change. You should use
import weaviate.classes.config as wvcc
to be able to use
inverted_index_config=wvcc.Configure.inverted_index(**bm25_params)

Hi @rjalex

Could you share a code sample? I was not able to find an existing issue for this, so we could open one.

Thanks!

@DudaNogueira here you go. I used the embedded version for brevity, but in my real case I am using 1.24.11 and the behaviour is the same. Thank you.

import weaviate
import weaviate.classes.config as wvcc

COLL_NAME_STR = "no_english_stopwords"

client = weaviate.connect_to_embedded()
client.collections.delete(COLL_NAME_STR)
bm25_params = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": ["di", "a", "da", "in", "con", "su", "per", "tra", "fra"],
    "stopwords_removals": None,
}

client.collections.create(
    name=COLL_NAME_STR,
    description="A collection with only a custom list of stopwords",
    vectorizer_config=None,
    inverted_index_config=wvcc.Configure.inverted_index(**bm25_params),
    properties=[
        wvcc.Property(
            name="blahblah",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        )
    ],
)

mycollection = client.collections.get(COLL_NAME_STR)
COLL_CONFIG = mycollection.config.get()
assert (
    COLL_CONFIG.inverted_index_config.stopwords.preset is None
), f"Expected stopwords preset to be None as declared in the inverted index configuration, but got {COLL_CONFIG.inverted_index_config.stopwords.preset}"
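One variant of the reproduction above that may be worth trying is to pass an explicit "none" preset instead of Python's None. This is a sketch, not a confirmed fix: it assumes the installed client exposes wvcc.StopwordsPreset.NONE and that the server then applies only the additions. only_custom_stopwords is a hypothetical helper, and the wvcc module is passed in as an argument so the sketch does not import the client itself.

```python
def only_custom_stopwords(wvcc, additions):
    """Build an inverted index config with no preset list, only custom stopwords.

    `wvcc` is assumed to be the module imported via
    `import weaviate.classes.config as wvcc`.
    """
    return wvcc.Configure.inverted_index(
        # Assumption: StopwordsPreset.NONE selects the empty preset,
        # unlike None, which appears to leave the default (EN) in place.
        stopwords_preset=wvcc.StopwordsPreset.NONE,
        stopwords_additions=list(additions),
    )
```

In the script above this would replace the bm25_params call, i.e. inverted_index_config=only_custom_stopwords(wvcc, ["di", "a", "da", "in", "con", "su", "per", "tra", "fra"]). The assertion would then need to compare the preset against wvcc.StopwordsPreset.NONE rather than None.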