[Feedback] Update to the Python client – collections, search, CRUD operations

Hi all, we are working on updating our language clients – starting with the Python Client – to make it easier to:

  • create and manage collections,
  • configure what vectors are used,
  • perform CRUD data operations,
  • search.

We also want to enable modern IDEs to help you build with Weaviate. For example, wouldn’t you love to get IntelliSense support, with suggestions for which params are available?

We would love to hear your feedback about what we are about to propose below.
Note: the examples below are a subset of the use cases we are looking at, but you should get the general gist of the direction we are heading in.

Collections

A quick note, but not the main topic for this post:

There is a little confusion around Schemas and Classes in Weaviate, so we thought that we could make it easier for everyone to understand what is what, and introduce the concept of Collections.

A collection (currently called a class) is where you store your data with vector embeddings.

Create a collection

Creating a new collection should be as easy as:

client.collection.create(name="Articles")

Create a collection - select vectorizer

You could also create a new collection with a vectorizer module:

client.collection.create(
  name="Articles",
  vectorizerConfig=VectorizerConfig(vectorizer="text2vec-openai")
)

Note: VectorizerConfig would be defined as a class with a named set of parameters, so your IDE could help you pick the parameters you need to pass. Like this:

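from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VectorizerConfig:
    vectorizer: str
    # Optional settings default to None so that only the vectorizer is required
    alias: Optional[str] = None
    model: Optional[str] = None
    vectorProperties: Optional[list[str]] = None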

Create a collection - select model and properties to vector

vc = VectorizerConfig(
    vectorizer="text2vec-cohere",
    model="multilingual-22-12",
    vectorProperties=["title", "description"]
)

client.collection.create(
  name="Articles",
  vectorizerConfig=vc
)

Note: vectorProperties indicates which properties should be used for vectorization. Currently, this is done as part of the schema property definition, where we exclude properties from vectorization. That is a bit of a problem when your data objects are made of several properties, but you only want to vectorize 1-2 of them.
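
As a rough sketch of how a client could translate vectorProperties into today's exclude-style schema without any server change: the helper below marks every property that is not listed as skipped in its moduleConfig. The function name to_property_schema and the input shape are made up for illustration, and the skip flag reflects the current per-property module configuration as I understand it, so treat the details as assumptions.

```python
from typing import Any


def to_property_schema(
    properties: list[dict[str, Any]],
    vectorizer: str,
    vector_properties: list[str],
) -> list[dict[str, Any]]:
    """Translate an allow-list of vectorized properties into per-property skip flags."""
    allowed = set(vector_properties)
    result = []
    for prop in properties:
        translated = dict(prop)  # shallow copy so the input is not mutated
        # Assumption: the server reads moduleConfig.<vectorizer>.skip per property
        translated["moduleConfig"] = {
            vectorizer: {"skip": prop["name"] not in allowed}
        }
        result.append(translated)
    return result


props = [
    {"name": "title", "dataType": ["text"]},
    {"name": "description", "dataType": ["text"]},
    {"name": "url", "dataType": ["text"]},
]
schema_props = to_property_schema(props, "text2vec-cohere", ["title", "description"])
```

Here url would be the only property marked as skipped, which is the inverse of how the proposal lets you express it.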

Create a collection – with property definition

p = [
  Property(name="title", description="The title of the article", dataType="string"),
  Property(name="content", dataType="string"),
  Property(name="url", dataType="string"),
  Property(name="img", dataType="blob")
]

client.collection.create(
  name="Articles",
  vectorizerConfig=vc,
  properties=p
)

The property definition is what most databases out there consider a data schema. This is where we can define properties (and their types) for our data collections. For example, Articles are made of title, content, url, etc.

Get collection configuration

Getting a collection configuration should be as simple as calling getConfiguration:

configuration = client.collection.getConfiguration(name="Articles")

print(configuration.vectorizerConfig)
print(configuration.properties)

Alternatively, we could use the configuration namespace like this:

configuration = client.collection.configuration.get(name="Articles")

What do you think about these two options?

Update collection configuration

Updating a collection configuration should be done with a call to updateConfiguration:

# define properties
p = {...}
# define new vector configuration
v = VectorizerConfig(...)

client.collection.updateConfiguration(
  name="Articles",
  vectorizerConfig=v,
  properties=p
)

Alternatively, it could be done with configuration.update:

client.collection.configuration.update(...)

Delete Collection

To delete a collection, you can call:

client.collection.delete(name="Articles")

Data Operations

Following the concept of collections, we propose to introduce collection.data, which can be used for data operations and search.

Data Insert

For example, to insert a new object, first we can get a data object for the Articles collection. Then we can use the data object (called “data” here) to insert a new object, like this:

data = client.collection.data("Articles")
data.insert({"name": "foo", "description": "bar"})

Insert multiple objects

data.insert([
  {"name": "foo", "description": "bar"},
  {"name": "ping", "description": "pong"},
  {"name": "cat", "description": "kitten"},
  {"name": "dog", "description": "puppy"}
])

Data Get

To get a number of objects, you could call get:

items = data.get(limit=5)
print(items)

Get with a filter

items = data.get(
  where=Filter(
    property="price",
    operator=Operator.GreaterThan,
    value=100
  ),
  limit=5
)

Note the use of the enum for the operator: Operator.GreaterThan
This will help you see what filter operators are available and get code predictions in your IDE.
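
To make the enum idea concrete, here is a minimal sketch of how Filter and Operator could be defined on the client side and translated into a where clause. The to_where method, the subset of operators, and the value-type mapping are assumptions for illustration; the output shape follows the path/operator/valueInt style of the existing where filter as I understand it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Union


class Operator(str, Enum):
    # Only a few operators are listed here; the real set would be larger
    Equal = "Equal"
    NotEqual = "NotEqual"
    GreaterThan = "GreaterThan"
    LessThan = "LessThan"


@dataclass(frozen=True)
class Filter:
    property: str
    operator: Operator
    value: Union[int, float, str, bool]

    def to_where(self) -> dict[str, Any]:
        # bool must be checked before int because bool is a subclass of int
        if isinstance(self.value, bool):
            key = "valueBoolean"
        elif isinstance(self.value, int):
            key = "valueInt"
        elif isinstance(self.value, float):
            key = "valueNumber"
        else:
            key = "valueText"
        return {
            "path": [self.property],
            "operator": self.operator.value,
            key: self.value,
        }


where = Filter(property="price", operator=Operator.GreaterThan, value=100).to_where()
```

Because the value type is inferred from the Python value, the caller would not have to spell out valueInt or valueText by hand.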

Loop through data in a collection

data = client.collection.data("Articles")

for item in data.iterate(20):
  print(item)
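
One way iterate could work under the hood is as a generator that fetches a page at a time and resumes after the last object it has seen. The sketch below is only an illustration: fetch_page is a hypothetical callable standing in for the actual request, and fake_fetch is an in-memory stand-in for the server.

```python
from typing import Any, Callable, Iterator, Optional

Page = list[dict[str, Any]]


def iterate(
    fetch_page: Callable[[int, Optional[str]], Page],
    page_size: int = 20,
) -> Iterator[dict[str, Any]]:
    """Yield objects one by one, fetching page_size objects per request."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(page_size, cursor)
        if not page:
            return  # an empty page means the collection is exhausted
        yield from page
        cursor = page[-1]["id"]  # resume after the last object seen


# In-memory stand-in for the server, for illustration only
_objects = [{"id": str(i), "name": f"item-{i}"} for i in range(45)]


def fake_fetch(limit: int, after: Optional[str]) -> Page:
    start = 0 if after is None else int(after) + 1
    return _objects[start:start + limit]


names = [obj["name"] for obj in iterate(fake_fetch, page_size=20)]
```

The caller only sees a plain for loop, while the paging stays inside the client.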

Update

To update an object by ID, you could call:

article = {"name": "foo", "description": "bar"}
data.update_by_id(uuid="1234-1234-1234", object=article)

Delete

To delete objects, we can call data.delete().

Delete by ID

To delete an object by ID, you could call:

data.delete(uuid="1234-1234-1234")

Delete where

To delete based on a where filter:

data.delete(
  where=Filter(
    property="price",
    operator=Operator.GreaterThan,
    value=100
  )
)

Search

Here are a couple of examples of what the new search syntax might look like:

new – nearText

result = data.textSearch(
  concept="marvel avengers",
  properties=["title", "description"],
  limit=10
)

new - nearImage

result = data.imageSearch(
  base64=img,
  properties=["title", "description", "url"],
  where=Filter(property="price", operator=Operator.GreaterThan, value=100),
  distance=123,
  limit=10
)

We will share the examples for search in a separate thread, as that is a whole different discussion.

Hi, thanks for the proposal. There is a lot in here, so before I dive into details, maybe first up a high-level question.

How does this proposal relate to the Models Proposal for the Python client, which received quite a lot of positive feedback when it was first introduced?

If I see it correctly, the proposal above mainly focuses on changing the schema CRUD methods, but doesn’t address the underlying problem that the object itself would still be an untyped dict, as opposed to being a typed class as in the Models proposal.

I’m not a Python-person, but at first sight the models proposal looks more pythonic to me. Would love to hear the opinions of folks with more Python experience than me on this.

EDIT: This is probably less about what’s more pythonic, but more about being data-object-centric vs being schema-centric. I think the proposal above is still schema-centric (with a nicer API) whereas the ORM-style Models proposal is data-object-centric. I generally like the data-object-centric style from working with other databases. For example, I still consider the Google Cloud Datastore Go client one of the best DB-clients I’ve ever used myself. The Go doesn’t directly translate to Python, but the data-centricity should.

Another high-level question:

You mention that this is meant as a proposal for the Python client, but it seems that it changes quite a few underlying concepts of the Weaviate server configuration itself (for example VectorConfig is currently module-centric, but in the proposal it’s not). I’m not sure if this is something that could be built in the client in isolation. This looks more like a proposal for a v2 API for the server and client?

EDIT: To add a bit more clarity. What I mean is that renaming something is easy to abstract in the client. For example, we could call it collection in the client even when it’s still called class on the server. We then just need to make sure that we replace the name everywhere in transit. However, I’m not sure if the same could be done with, for example, VectorConfig, which currently doesn’t exist in this form on the server, and translating that to the current structures may be quite hard.
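
To illustrate the "replace the name in transit" part, a minimal sketch: the client works with a name key, and two small helpers rename it to and from the class key the server expects. The helper names and the flat dict shape are made up for illustration; a real client would need to apply the same renaming to every payload and response that mentions a class.

```python
from typing import Any


def collection_to_class_payload(collection: dict[str, Any]) -> dict[str, Any]:
    """Rename the client-side 'name' key to the server-side 'class' key."""
    payload = {k: v for k, v in collection.items() if k != "name"}
    payload["class"] = collection["name"]
    return payload


def class_to_collection(payload: dict[str, Any]) -> dict[str, Any]:
    """Inverse mapping for responses coming back from the server."""
    collection = {k: v for k, v in payload.items() if k != "class"}
    collection["name"] = payload["class"]
    return collection


request_body = collection_to_class_payload(
    {"name": "Articles", "vectorizer": "text2vec-openai"}
)
round_trip = class_to_collection(request_body)
```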

PS: Since I mentioned VectorConfig, I think what you describe is more of a VectorizerConfig? Under VectorConfig I would expect to find what you currently find under vectorIndexConfig.

I like the proposed collections interface especially data.insert([Array]) and other usability enhancements.

I think the key question around feasibility is whether this requires large breaking changes in Weaviate or is purely a client design overhaul.

For instance, simply replacing the Class naming convention with Collection would be quite a massive undertaking in Weaviate itself (though I understand the motivation, with class being a reserved word in Python and other programming languages).

Will users get confused if they create Collections in the clients but see Classes in the API? Ideally we introduce as few additional abstractions as possible and don’t break existing usage.

:100:

To give us a feasible implementation path, I really like this idea of breaking this up into two parts:

  • Step 1: everything that’s possible in the short-term, we can get started immediately, only non-breaking changes, can be handled by the “Clients” team
  • Step 2: the long-term goals, including breaking changes on the server-side, e.g. for Weaviate v2 or for introducing a /v2/ API, requires all Core teams to collab on this (something we don’t have the capacity for at the moment)

Some examples of things that would fall into either Step 1 or Step 2:

  • Provide nicer Python APIs with structured Classes to allow for auto-completion, etc. → Step 1
  • Rework how modules are configured (breaking change), e.g. from making it property-centric to a new VectorizerConfig → Step 2

The above proposal shows changes (with a strong focus on making it easier to write and read the related code) to the following:

  • Schema CRUD,
  • Data CRUD + iterator
  • and a preview of how we could update the search syntax

I am still new to Python, but my usual approach would be to use generics, which could help us address the problem. In TypeScript this would look like this:

interface Article {
  name: string;
  description: string;
  readLength: number;
}

const collection = client.collection.get<Article>('Articles')

Then any operation would require us to provide data of the right type:

const item1 = {
  name: 'foo',
  description: 'bar',
  readLength: 20,
}

const item2 = {
  name: 'boob',
  description: 'bop',
  readLength: 'five minutes',
}

// the signature for insert would be
// collection.insert(item: Article)

collection.insert(item1) // this is fine
collection.insert(item2) // this would be a compile-time type error - readLength should be a number

But also, queries could follow the generic types:

// the signature for textSearch would be
// textSearch(...) : Result<Article>

const result = await collection.textSearch({ prompt: 'programming concepts', limit: 2 });

console.log(result.items) // items would be of type Article[]

@etiennedi Can we use generics in Python like the above example?
If yes, then this could help us address the challenge :wink:
This should probably be a post on a thread of its own.

Good point, I’ve updated the examples to reflect that

In the longer term, that would require us to refactor the core Weaviate endpoints.
However, to begin with, we could do it purely at the language client level.

Most of the changes I propose (except for the new VectorizerConfig), don’t require any update to the underlying Weaviate endpoints.

If we introduce the new syntax under client.collection, we can have a transition period during which developers could work with Weaviate using either the current (old) syntax or the new syntax.

So, that wouldn’t introduce an immediate breaking change.

That sounds reasonable.

We could start by implementing the client.collection namespace, which would cover:

  • data operations
  • new query syntax – I will share more in a separate thread
  • generics for data types

That would be a big UX boost.

Then, we could plan the new approach for collection CRUD operations, which would require a new endpoint /v2/collection

Hey there…
It’s great that you want to improve the dev UX here…
Although Weaviate is the leader from a feature perspective, the CRUD operations have always been a huge struggle…
I think it’s OK if you have fixed queries (you still need to write that boilerplate)… but if you want to manage your model without rewriting the Weaviate code, it’s a pretty hard task…

Please consider something like what @sebawita is proposing (for TS obviously) also for Python…
Something like an ORM layer would be so helpful…
pydantic is becoming an industry standard now… so that would be nice… even if you didn’t actually take a dependency on it, testing your ideas against a use case where the developer has a model structure in some dataclasses (with a tenant field :wink: ) will give you a good perspective :slight_smile:

I just rewrote my Weaviate wrapper from scratch, so I have quite a few opinionated ideas, but I’m not going to spill them all here (unless you explicitly ask me to :) )… instead let me share the biggest pain points:

  1. selecting referenced fields…
    let’s assume a data model with an owner and a child… when I want to create a query equivalent of select * from collection, it’s nearly impossible… I have two options… analyzing my target class (using pydantic)… or analyzing the Weaviate schema…
    combine this with the fact that multitenancy is still a proposal, and you need to propagate not only the schema info but also the tenant info to all the sub-methods… you need to understand what type the field is, whether it is a reference, and then encode it as a class name…
    Unless you have a static schema (a static set of classes)… this is an extremely difficult task

  2. filtering… a very similar problem, because the path for references includes the referenced classes (something that I’ve never seen anywhere else)…
    the fact that I need to encode the value type for a filter is far from ideal… luckily the text/string problem is over, so it can be decoded from the entity_type class, but the fact that it’s even needed is another roadblock to overcome

  3. decoding data from what was returned from Weaviate to a pydantic class again…
    multiple hiccups, but the biggest challenge here is the fact that all references are always returned as a list… now pydantic has a sweet parsing mechanism: if you feed it a dict, it will populate its children, but this can’t be leveraged for data loaded from Weaviate since the references come back as a list… so I need to analyze the class first, to understand whether the field is a list type (parsing the annotation, which can be List[Entity], Optional[List[Entity]], etc.) and based on this convert the data back into its original shape…

  4. the weird thing about id and _additional… this is not such a big challenge to solve, but an inconvenience anyway… id is a reserved word anyway, yet the id is actually wrapped in an _additional object… so it needs to be popped out and set again as an “id” property on the main dict
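
For readers who haven't hit the last two points, here is a rough sketch of the kind of glue code they force you to write: one helper that promotes _additional.id to a top-level id, and one that checks whether an annotation is a list type so that single references can be unwrapped. The helper names are hypothetical, and the annotation check only covers the typing.List and typing.Optional spellings, not the newer X | None syntax.

```python
from typing import Any, List, Optional, Union, get_args, get_origin


def flatten_additional(obj: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of obj with _additional.id promoted to a top-level 'id' key."""
    flat = {k: v for k, v in obj.items() if k != "_additional"}
    additional = obj.get("_additional") or {}
    if "id" in additional:
        flat["id"] = additional["id"]
    return flat


def is_list_annotation(annotation: Any) -> bool:
    """True for List[X] and Optional[List[X]] style annotations."""
    if get_origin(annotation) is Union:
        return any(is_list_annotation(arg) for arg in get_args(annotation))
    return get_origin(annotation) is list


def unwrap_reference(value: list, annotation: Any) -> Any:
    """References always arrive as a list; keep it only if the model expects one."""
    if is_list_annotation(annotation):
        return value
    return value[0] if value else None


raw = {"title": "foo", "_additional": {"id": "1234-1234-1234"}}
article = flatten_additional(raw)
single_owner = unwrap_reference([{"name": "Juraj"}], Optional[dict])
many_authors = unwrap_reference([{"name": "Juraj"}], Optional[List[dict]])
```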

this is not a criticism, just feedback meant with :heart: … I’ve already overcome these issues (multiple times :sweat:)… but it would be great if other people didn’t need to…

Btw… MongoDB has probably the best dev UX I’ve used for a DB… not that you should copy it… just so you know where I’m coming from :)

Since we’re having this discussion - I thought I’d bring up what Dirk was considering (he’s on holiday). He’s been considering using dataclasses like this:

Class(
    name="testClass",
    properties=[
        Property(name="Prop1", dataType=DataType.UUID),
        Property(name="Prop2", dataType=DataType.TEXT_ARRAY),
    ],
)

I think this would be a huge improvement personally. What do you think?

Thank you for your feedback @ju-bezdek. This is helpful.

Incidentally, I worked at MongoDB in my previous life, so there might be some similarities here and there :wink:

I am not sure if I understand this well enough.
I have a feeling that there are multiple pain challenges in this one bullet point. :thinking:

Although this post is not so much about managing references, we will make sure to take it into consideration as part of this design exercise.

I also know that Dirk is working on a great idea for helper classes in Python, which should make constructing a query easier using References.

The question I have here is: how much of the underlying data schema should we bring into the language client? When I work with NoSQL databases in JavaScript, there is no out-of-the-box mechanism that would make my JS environment aware of the collection structure in the database. However, I can create interfaces for each data collection, and with generics, I can enhance my queries to let the IDE know what properties to expect.

Is your suggestion that Weaviate should automatically check the underlying schema as you type your query, to provide you with code suggestions and highlight mistakes?
(I imagine that would be really hard to implement, and the mechanism would be very different for each programming language we support)

I think we could try to tackle this, this is why my example had a where filter like this:

where=Filter(property="price", operator=Operator.GreaterThan, value=100)

I don’t know much about pydantic (I am still new to Python), but I do think that we could improve the structure of the returned data.

For example, a simple one-collection query should return the data in a standard structure:

result = collection.textSearch(
  concept="foo",
  properties=["title", "description"],
  limit=2
)

print(result.items)

# prints:
#   [
#     { "title": "foo", "description": "bar" },
#     { "title": "tik", "description": "tock" }
#   ]

And a query with a reference (this is completely made up):

result = collection.textSearch(
  concept="foo",
  properties=[
    "title", "description", 
    Reference(collection="Author", label="authors", linkOn="some_id", properties=["name", "handle"])
  ],
  limit=2
)

print(result.items)

# prints:
#   [
#     { "title": "foo", "description": "bar", "authors": [
#        { "name": "Sebastian", "handle": "sebawita" },
#        { "name": "Juraj", "handle": "ju-bezdek" },
#     ]},
#     { "title": "tik", "description": "tock", "authors": [...] }
#   ]

My idea for some of the standard properties, like id, is to use a $ notation as shorthand, like this:

items = collection.get(
  limit=5,
  properties=["title", "description", "$id"]
)

print("first id: " + items.result[0]["$id"])

I’m not sure I know what I’m looking at…

but I assume it’s a schema description…

looks good… but I never struggled with this part… having the possibility to build a schema in this way is nice to have, but not a painkiller

I don’t really know how to structure the answer here :sweat_smile: so that it’s readable… will do my best

I am not sure if I understand this well enough.
I have a feeling that there are multiple pain challenges in this one bullet point. :thinking:

… basically what I meant here is to make it easier to select everything… there is no way of doing it… and if you have references & multitenancy, it gets super complicated…
I’m not suggesting a solution… just pointing out a pain point when fetching data… my code for this is way over a hundred lines… in its best form… that’s not right

as for the example… assume this scenario

  • assume classes Book, Chapter, Passage … Book has Authors.
  • now you want to be able to fetch a Passage with references to its chapter and the book with its author
  • now add the constraint that you need to have a specific class for each tenant… so you need to propagate that
  • the most important one: I don’t want the code that handles fetching data from Weaviate to be so tightly coupled with the model… if I make some changes in the model, and I have multiple queries… that’s just too many places to fix every time I change something…

Right now I’m auto-generating the properties collection based on the model… I can filter the data by passing a list of property paths, but if it’s not set, everything gets fetched…

… one simple way of doing it would be if I could fetch the whole object as it is just by selecting the property authors… without specifying the fields… the whole
*list_of_passage_properties, aditional {id}, author{tenent} {… on Chapter_{tenant} { *list_of_this_class_properties, aditional {id}, {… on Book{tenant} { * book_properties, _additional{id} }}}

… that is like super complicated to construct … the same thing can be achieved with find(filter) in mongo:)

… , like id , is to use a $ notation…

$id is better than _additional {id} for sure, however I still don’t get it… in one place Weaviate uses uuid, in another (filter) it uses id, in a third one _additional {id}…
… and the best part is that “id” is already a reserved word, so if you want to set id as your own property, you can’t (the insert fails)… so what is the reason not to use id for the actual id then? Why come up with a 4th version of the “id” syntax?

Awesome feedback and discussion, thanks all.

@ju-bezdek since you explicitly brought up cross-references, I’m curious about your opinion on the proposal for refs in the ORM/Models proposal:

            and_(
                Author.first_name == "Kazuo", 
                Author.writes_for >> Publication.name == "Faber and Faber"
            ))

I don’t know Python well enough to understand the magic behind the >> but I love how concise this is and how easy it is to read. I’m wondering if this is something that could be incorporated with the above? I guess a requirement for that would be that Weaviate Classes/Collections are strongly typed on the client-side?

EDIT: A quick Google indicated that they’re bitwise operators, that’s both more and less magic than I expected :rofl:
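
For anyone else wondering about the magic: Python lets a class redefine what >> and == mean through the __rshift__ and __eq__ methods, and >> binds tighter than ==, so the expression reads as (Author.writes_for >> Publication.name) == "Faber and Faber". The sketch below is not the Models proposal's implementation, just a minimal guess at the mechanism; Field, Condition and the path layout are made-up names and shapes.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Condition:
    path: tuple[str, ...]
    operator: str
    value: Any


class Field:
    """A property reference that records the path taken to reach it."""

    def __init__(self, *path: str) -> None:
        self.path = path

    def __rshift__(self, other: "Field") -> "Field":
        # a >> b: follow reference a, then address property b
        return Field(*self.path, *other.path)

    def __eq__(self, value: object) -> Condition:  # type: ignore[override]
        # Comparison builds a filter condition instead of returning a bool
        return Condition(self.path, "Equal", value)


class Author:
    first_name = Field("first_name")
    writes_for = Field("writes_for")


class Publication:
    # Assumption: a reference path includes the referenced class name
    name = Field("Publication", "name")


def and_(*conditions: Condition) -> dict[str, Any]:
    return {"operator": "And", "operands": list(conditions)}


query = and_(
    Author.first_name == "Kazuo",
    Author.writes_for >> Publication.name == "Faber and Faber",
)
```

So the bitwise operator never shifts any bits here; it is only borrowed as a readable arrow for "follow this reference".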

Thanks, this was the point I tried to make when introducing the Models/ORM proposal. The pain point is more about the objects being untyped than about how the schema is created.

The example on the pydantic main page looks (to my untrained eye) quite close to the Models/ORM proposal?

Ok, the ORM proposal is very close to what I had in mind…

I dropped some thoughts under that issue, to keep the thread in one place…

tl;dr: that proposal is very nice… it’s a bit too opinionated, magicky, and heavy for my taste… too many non-standard things…
as for the bitwise operator… I don’t see a clear value to it there… but I might be missing something… It looks cool and nerdy, but might scare a lot of people as some kind of witchcraft… although I might enjoy the syntax, I wouldn’t choose it for a library that aims to target a mass audience…

That said, I still think there are quite some good ideas there, and with some refining, it might be a nice addition…

BTW… the whole typing issue is only an issue because Weaviate is so picky about it, and you need to include the type/class everywhere…

If there were a way not to enforce the types at all, and only allow stronger typing on top, that would be the best way IMHO…

It is possible to implement something similar with Python generics in its current form but the syntax is likely to change once Python 3.12 rolls around. PEP 695 does a good job of summarising the current state of generics and the proposed syntactical changes. This PEP has been accepted and will roll-out with 3.12 as and when!

Some proof-of-concept pseudo-code for 3.7 < x < 3.12 might look like the following.

import requests
from dataclasses import dataclass
from typing import Generic, TypeVar, Type

@dataclass
class Person:
    name: str
    age: int

@dataclass
class Pet:
    name: str
    age: int
    breed: str

T = TypeVar('T')

class Collection(Generic[T]):
    _type: Type[T]

    def __init__(self, _type: Type[T]) -> None:
        self._type = _type

    def create(self, item: T) -> T:
        res = requests.post(
            url='http://localhost:8080/v1/objects',
            json={
                'class': item.__class__.__name__,
                'properties': item.__dict__
            }
        )
        data: dict = res.json()
        return self._type(**data.get('properties'))

class Client:
    def collection(self, _type: Type[T]) -> Collection[T]:
        return Collection[T](_type)
    
client = Client()

person = Person(name='Tommy', age=28)
pet = Pet(name='Momo', age=5, breed='Cat')

created_person = client.collection(Person).create(person)
created_pet = client.collection(Pet).create(pet)

# mypy (if used) will flag the following line as an error
error = client.collection(Person).create(pet)

The error thrown by mypy for this code is:

src/main.py:48: error: Argument 1 to "create" of "Collection" has incompatible type "Pet"; expected "Person"  [arg-type]
Found 1 error in 1 file (checked 2 source files)

Sadly this sort of compile-time error isn’t rendered in your IDE like for TypeScript with VS Code, but the compile check can still be factored into the development workflow with mypy.

For Python 3.12, the syntax that you have in your TypeScript-like pseudo-code will be possible as: client.collection[Person]().create(person), but will require a refactoring of the given code above as described in PEP 695. The potential syntactical changes are not massively breaking but there will likely be a need to maintain two separate codebases for <3.12 and >=3.12.

My two cents: I reckon that any implementation using generics is preferable to one that establishes an opinionated ORM specification, e.g. using Pydantic or a Weaviate-defined base model, since the more general definition of the client with generics allows users to pick and choose their data validation library/solution of choice without being tightly coupled to the Weaviate-chosen specification.

However, to make the pseudo-code above general enough for this, the methods by which the generic classes are serialised to and from JSON will need to be abstracted away using some combination of typing.Protocol and/or abc.ABC. This sort of thing is easily achievable in statically typed languages but, alas! Python is awfully tricky with types :sweat_smile:
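
As a rough idea of what that abstraction might look like, here is a sketch using typing.Protocol: the collection only requires that a type can turn itself into a dict and back, and the user decides how (dataclasses here, but Pydantic or anything else would fit). Serializable, to_dict, from_dict, encode and decode are all hypothetical names, and the HTTP call is left out so that only the typing pattern is shown.

```python
from dataclasses import asdict, dataclass
from typing import Any, Generic, Protocol, Type, TypeVar


class Serializable(Protocol):
    """Anything that can convert itself to and from a plain dict."""

    def to_dict(self) -> dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Serializable": ...


T = TypeVar("T", bound=Serializable)


class Collection(Generic[T]):
    def __init__(self, _type: Type[T]) -> None:
        self._type = _type

    def encode(self, item: T) -> dict[str, Any]:
        # Build the request body; serialisation is delegated to the item
        return {"class": self._type.__name__, "properties": item.to_dict()}

    def decode(self, payload: dict[str, Any]) -> T:
        # Rebuild a typed object from a response body
        return self._type.from_dict(payload["properties"])  # type: ignore[return-value]


@dataclass
class Person:
    name: str
    age: int

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Person":
        return cls(**data)


people = Collection(Person)
payload = people.encode(Person(name="Tommy", age=28))
restored = people.decode(payload)
```

Because Serializable is a structural type, Person never has to inherit from a Weaviate-provided base class.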
