V4 client with custom vectorizer question

As V4 is very new, I am struggling a bit to understand the details from various videos and tutorials. I am working with a class/collection that has three properties:

  • kicker: this holds a paragraph of text and I want to vectorize it with my custom model
  • author: a string of names; I wish to tokenize, index and search terms to find the names of an author.
  • slug: this is an article id; I need neither to vectorize it nor to run text searches on this field.

The following is the code I am using to create the collection/class:

    import weaviate
    import weaviate.classes.config as wc  # assumed import behind the `wc` alias

    # schema_name and openai_key are defined elsewhere
    with weaviate.connect_to_local(  # connects, then closes implicitly on exit
        host="localhost",
        port=8077,
        headers={
            "X-OpenAI-Api-Key": openai_key,  # for generative queries
        },
    ) as client:
        client.collections.delete(schema_name)
        client.collections.create(
            schema_name,
            description="A class to store articles with a semantic kicker and searchable author.",
            vectorizer_config=None,
            generative_config=wc.Configure.Generative.openai(),
            inverted_index_config=wc.Configure.inverted_index(
                index_property_length=True
            ),
            vector_index_config=wc.Configure.VectorIndex.hnsw(
                distance_metric=wc.VectorDistances.COSINE
            ),
            properties=[
                wc.Property(name="kicker", data_type=wc.DataType.TEXT, skip_vectorization=True),
                wc.Property(name="slug", data_type=wc.DataType.TEXT, skip_vectorization=True),
                wc.Property(name="author", data_type=wc.DataType.TEXT, skip_vectorization=True),
            ],
        )
        print(f"Successfully created the {schema_name} schema.")
        articles = client.collections.get(schema_name)
        response = articles.aggregate.over_all(
            total_count=True
        )
        print(f"We have {response.total_count} in the {schema_name} collection")

Is it right to declare skip_vectorization like this, given that I am providing the vector manually at insertion time?

How do I declare that author is the only field on which I will do BM25 searches?

Thanks

Hi!

As you are providing the vector yourself and the class has no vectorizer, the skip_vectorization parameter has no effect and can be dropped.
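
To make the "bring your own vector" part concrete, here is a minimal sketch of an insert that passes the vector alongside the properties. It assumes the v4 `collection.data.insert(properties=..., vector=...)` call; the helper name `insert_article` is hypothetical and only wraps that call.

```python
from typing import Any, Sequence


def insert_article(collection: Any, kicker: str, author: str, slug: str,
                   vector: Sequence[float]) -> Any:
    """Insert one article together with a vector computed outside Weaviate."""
    return collection.data.insert(
        properties={"kicker": kicker, "author": author, "slug": slug},
        vector=list(vector),  # embedding of the kicker from your own model
    )
```

With the collection from the question this would be called as `insert_article(client.collections.get(schema_name), kicker, author, slug, vector)`, where `vector` is whatever your custom model returns for the kicker text.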

When performing a BM25 search, you can not only specify which properties to search but also weight them.

For example:

    collection = client.collections.get(schema_name)
    response = collection.query.bm25(
        query="Pelé",
        query_properties=["kicker^2", "author"],
        limit=3
    )

    for o in response.objects:
        print(o.properties)
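
Since the original question was about keyword search on author only, the same call can be restricted to that single property. This is a sketch under the assumption that `query_properties` accepts a one-element list; `search_by_author` is a hypothetical helper name.

```python
from typing import Any


def search_by_author(collection: Any, name: str, limit: int = 3) -> list:
    """Run a BM25 query that only looks at the `author` property."""
    response = collection.query.bm25(
        query=name,
        query_properties=["author"],  # ignore kicker and slug at query time
        limit=limit,
    )
    return [o.properties for o in response.objects]
```

Restricting the query this way does not change what is indexed; it only limits which properties are scored for this particular search.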

Let me know if that helps :slight_smile:

Edson Arantes do Nascimento is a planetary myth and I do not need Weaviate to find him. He’s always in our hearts!!! :slight_smile:
More seriously, I will have to find the right V4 way of declaring which fields I want indexed for BM25 searches.
If I understand your example correctly, the “^2” is a weight, right? Can you point me to the new V4 BM25 query syntax so I can grasp it better?
Muito obrigado.
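
On the declaration side, one possible approach is to switch off the searchable (BM25) index per property when the collection is created. The sketch below assumes that `Property` in the v4 Python client accepts an `index_searchable` keyword and that the config classes live in `weaviate.classes.config`; check both against your client version. `PROPERTY_SPECS`, `build_properties` and `bm25_searchable_names` are hypothetical names.

```python
# Per-property index flags; the keys mirror keyword arguments of Property
# in the v4 Python client (index_searchable is assumed to exist there).
PROPERTY_SPECS = [
    {"name": "kicker", "index_searchable": False},  # vector search only
    {"name": "slug", "index_searchable": False},    # plain identifier
    {"name": "author", "index_searchable": True},   # the only BM25 field
]


def build_properties() -> list:
    """Turn the specs into Property objects (needs weaviate-client v4 installed)."""
    import weaviate.classes.config as wc  # lazy import; module path is an assumption

    return [wc.Property(data_type=wc.DataType.TEXT, **spec) for spec in PROPERTY_SPECS]


def bm25_searchable_names() -> list:
    """List the properties that keep a searchable index."""
    return [spec["name"] for spec in PROPERTY_SPECS if spec["index_searchable"]]
```

The result of `build_properties()` would be passed as `properties=` to `client.collections.create(...)`. Keeping the flags in plain dictionaries makes it easy to review which fields are meant to be keyword-searchable before touching the server.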


Mestre Pelé :slight_smile:

The syntax is in the doc:

Note that you have tabs for different languages. :slight_smile:

Thank you @DudaNogueira. The docs you link are very clear, but my problems probably arise from mismatches between the way I am setting up the server, instantiating the client, declaring the V4 collection, and writing the queries. Each piece might be correct on its own while the combination still hides some problems.
