How to get unique results based on property?

Hi, I'm looking for a better way to get unique results based on a property. So far I am using these combined calls:

import base64

import weaviate
from weaviate.classes.query import Filter, MetadataQuery

def search_similar_authors_ids_by_image_data(image_data_bytes, limit=2):
    weaviate_client = weaviate.connect_to_local() # Connect with default parameters
    artworks = weaviate_client.collections.get("Artworks")
    image_data_base64 = base64.b64encode(image_data_bytes).decode('utf-8')

    responses = []
    filters = None

    for _ in range(limit):
        response = artworks.query.near_image(
            near_image=image_data_base64,
            limit=1,
            filters=filters,
            return_metadata=MetadataQuery(distance=True)
        )

        if response.objects:
            responses.append(response.objects[0])
            author_psql_id = response.objects[0].properties['author_psql_id']
            new_filter = Filter.by_property("author_psql_id").not_equal(author_psql_id)

            if filters is None:
                filters = new_filter
            else:
                filters = filters & new_filter

    weaviate_client.close()
    return responses

Hi @jan-miksik! Welcome to our community! :hugs:

Are you trying to find, as in your example, 2 unique authors from an image search on the Artworks collection?

I believe the best way is to run the search once and iterate over the results, extracting the authors and breaking out of the loop when you reach the number of authors you are looking for.

That way you only hit the database once.
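As a minimal sketch of that single-pass idea: the helper below walks a result list once and keeps the first hit per author. It assumes the results arrive sorted by distance and that each object exposes a `properties` dict, as `response.objects` does in the question. The name `first_per_author` and the sample data are made up for illustration.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def first_per_author(objects: Iterable[Any], max_authors: int,
                     key: str = "author_psql_id") -> List[Any]:
    """Keep the first (closest) hit for each author, up to max_authors."""
    seen = set()
    unique = []
    for obj in objects:  # assumed to be ordered by distance already
        if len(unique) >= max_authors:
            break  # enough distinct authors collected
        author = obj.properties[key]
        if author in seen:
            continue  # this author is already represented
        seen.add(author)
        unique.append(obj)
    return unique


# Stand-in data shaped like response.objects (hypothetical values)
hits = [
    SimpleNamespace(properties={"author_psql_id": 7, "work": "A"}),
    SimpleNamespace(properties={"author_psql_id": 7, "work": "B"}),
    SimpleNamespace(properties={"author_psql_id": 3, "work": "C"}),
    SimpleNamespace(properties={"author_psql_id": 9, "work": "D"}),
]
picked = first_per_author(hits, max_authors=2)
print([o.properties["work"] for o in picked])  # ['A', 'C']
```

With a real query you would pass `response.objects` instead of `hits`, using a `limit` large enough to cover the number of authors you need.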

Please, let me know if this is what you are looking for.

Thanks!

Hey, thanks for the reply. Primarily I am looking for the more resource-efficient way to search for, let's say, 20 authors/artists who make similar artworks. The issue is that the number of artworks per person in the database can vary from 1 to 100, and there is a big chance that an author makes similar pieces. One option is to give the search a big enough buffer, for example setting the limit to 2000, and then filter the results. The other is to run multiple queries, 20 in this case. Which is more optimal in terms of resources and speed?
Or maybe there is some way to query for unique values by property.

Hi @jan-miksik,

Resource and speed-wise, it is best to run a single query.

Also, running multiple near_x queries will perform multiple vectorisations. Each vectorisation is time-consuming, and if you use an external service, you will incur cost for each.

Group By

You could try running an aggregate query with a group_by and grouping by the artist name/ID.

Example 1: Group by author_psql_id

from weaviate.classes.aggregate import GroupByAggregate

response = artworks.aggregate.near_image(
    near_image=image_data_base64,

    # The total number of objects to collect from the query before the group_by begins
    object_limit=200,

    # set a group by condition, to group artwork for each artist
    group_by=GroupByAggregate(
        prop="author_psql_id", # property name used for group by - i.e. group results by author_psql_id 
        limit=3                # the number of groups to return - this will return 3 artist groups
    ),
)

Then, to get the author_psql_id, run:

for group in response.groups:
    # Here is how you access each author_psql_id that was used for group_by
    print("Author ID: ", group.grouped_by.value)
    print("Matched artwork count: ", group.total_count)

Example 2: Group by author_psql_id + return artwork name

You can also return a selected property for each artist, for example, the name of the artwork.

from weaviate.classes.query import Metrics

response = artworks.aggregate.near_image(
    near_image=image_data_base64,
    object_limit=200,
    group_by=GroupByAggregate(
        prop="author_psql_id",
        limit=3
    ),

    # Here is how to tell Weaviate to return a specific property
    # use it to return the title of the artwork or some other identifying property
    return_metrics=Metrics("work").text(),
)

And here is how you can display the results:

for group in response.groups:
    print("Artist: ", group.grouped_by.value)

    # Here is how you access each matched piece of work for the given artist
    list_of_work = group.properties["work"].top_occurrences
    for work in list_of_work:
        print("Work: ", work.value)

FYI, I’ve created a GitHub issue, which should make it easier to return properties in a group. Feel free to upvote :wink:

Oh, I’ve just learnt that there is also group_by in query.near_x.

You can call it like this:

from weaviate.classes.query import GroupBy

response = artworks.query.near_image(
    near_image=image_data_base64,
    group_by=GroupBy(
        prop="author_psql_id",
        number_of_groups=3,
        objects_per_group=5
    )
)

Then you can use belongs_to_group to get the matched author_psql_id, while properties will give you access to other matched properties.

for item in response.objects:
    print(item.belongs_to_group)
    print(item.properties)
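If you want the matches organised per author rather than as a flat list, one option is to bucket them by `belongs_to_group`. This is only a sketch: `objects_by_group` is a hypothetical helper, and the sample objects are stand-ins that mimic the two attributes used above (`belongs_to_group` and `properties`).

```python
from collections import defaultdict
from types import SimpleNamespace


def objects_by_group(objects):
    """Map each group value to the list of property dicts matched for it."""
    grouped = defaultdict(list)
    for item in objects:
        grouped[item.belongs_to_group].append(item.properties)
    return dict(grouped)


# Stand-in data shaped like response.objects (hypothetical values)
sample = [
    SimpleNamespace(belongs_to_group="7", properties={"work": "A"}),
    SimpleNamespace(belongs_to_group="7", properties={"work": "B"}),
    SimpleNamespace(belongs_to_group="3", properties={"work": "C"}),
]
per_author = objects_by_group(sample)
for author_id, works in per_author.items():
    print(author_id, [w["work"] for w in works])
```

The keys of the resulting dict are the distinct author IDs, so `len(per_author)` tells you how many unique authors the query actually returned.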