How to specify the number of results for topOccurrences?

Description

Is it possible to change the number of results returned from topOccurrences in Aggregate GraphQL queries? Preferably using the Go client.

For example, I have a dataset with 16 distinct values for a field “entities” in a class “Chunk”. When querying topOccurrences for this field, I only get 5 results.

I’ve included the relevant code snippets below.

Thanks in advance.

Server Setup Information

  • Weaviate Server Version: 1.28.4
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Go client (weaviate v1.30.1 and weaviate-go-client/v5 v5.1.0)
  • Multitenancy?: Yes

Any additional Information

Code snippets

graphql.Field{
	Name: propertyName,
	Fields: []graphql.Field{
		{Name: "type"},
		{Name: "count"},
		{
			Name: "topOccurrences",
			Fields: []graphql.Field{
				{Name: "value"},
				{Name: "occurs"},
			},
		},
	},
}
// Execute the aggregate query
result, err := p.client.GraphQL().Aggregate().
	WithClassName("Chunk").
	WithTenant(orgId).
	WithFields(fields...).
	WithWhere(where).
	Do(ctx)

Docker compose file

This is the docker compose used for local development

services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.28.4
    ports:
    - 8089:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      # Disable anonymous access.
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      # Enables API key authentication.
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      # List one or more keys in plaintext separated by commas. Each key corresponds to a specific user identity below.
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: '<my-key>'
      # List one or more user identities, separated by commas. Each identity corresponds to a specific key above.
      AUTHENTICATION_APIKEY_USERS: '<my-user>'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      CLUSTER_HOSTNAME: 'node1'

volumes:
  weaviate_data:

Hey @Willera,

It’s lovely to have you here in our community :hugs:, and welcome to Weaviate :partying_face:.

Have you tried to set limit to 1?

See here:

I’ve not written Go before, so I’m not sure how to set the limit, but wouldn’t it be something like this?


graphql.Field{
    Name: "topOccurrences (limit: 1)", 
    Fields: []graphql.Field{
        {Name: "value"},
        {Name: "occurs"},
    },
}

I may need to look into how to add the limit to the Go snippet :upside_down_face:.

By setting limit to 1, you’re telling Weaviate to include all values that appear at least once, which should give you all 16 distinct values in your case.

One quick note: if your dataset grows and ends up with many unique values, requesting all of them might impact performance.

Best regards,
Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

Hello @Mohamed_Shahin, thank you for the quick response.

I have not tried setting the limit before. My understanding was that it only acts as a “minimum occurrences” threshold for a value to be included in topOccurrences?

In my test dataset, all 16 distinct values only occurred once, so I didn’t think the limit would make a difference.

I just gave it a try with your suggestion (including the limit in the name of the field). It does solve my issue, but it also behaved differently from what you wrote. If I set the limit to 1, I only get a single topOccurrences entry. If I set the limit to 20, it actually returns all 16 entries (each with an occurs of 1).

This is exactly the functionality I was hoping for. I don’t really need all distinct values for larger datasets. Just the top 100 or so would be enough. But it does seem to behave differently from what the documentation says it would do?

For reference, this is the query value in runGraphQLQuery (gql.go) that the Go SDK uses to call the /graphql REST endpoint:

{Aggregate{Chunk(tenant: "my-tenant", where:{operator: Equal path: ["bucket_id"] valueString: "default"}){meta{count} document_types{type count topOccurrences (limit: 100){value occurs}} entities{type count topOccurrences (limit: 100){value occurs}}}}}
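For anyone landing here later, this is a minimal sketch of how the field name in that query could be built. The helper topOccurrencesFieldName is a hypothetical name of my own, not part of the Go client; it only formats the string that ends up in graphql.Field.Name, which is the workaround described above.

```go
package main

import "fmt"

// topOccurrencesFieldName is a hypothetical helper (not part of the
// Weaviate Go client). It embeds the limit argument directly in the
// field name, which is the workaround that worked in this thread.
func topOccurrencesFieldName(limit int) string {
	return fmt.Sprintf("topOccurrences (limit: %d)", limit)
}

func main() {
	// With the client, the result would be used roughly like this:
	//   graphql.Field{
	//       Name:   topOccurrencesFieldName(100),
	//       Fields: []graphql.Field{{Name: "value"}, {Name: "occurs"}},
	//   }
	fmt.Println(topOccurrencesFieldName(100))
}
```

Since the argument is passed through as part of the name, the client does not validate it, so a typo only shows up as a GraphQL error from the server.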

Regarding the performance on larger datasets: This is something we need very infrequently and would do during times of low use to get some statistics. If it starts causing issues, we can still fall back to iterating over all data using pagination (with the after cursor) and counting manually.
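The manual counting part of that fallback could look roughly like the sketch below. It assumes the property values have already been fetched page by page and flattened into a slice of strings; the type and function names are my own for illustration, and fetching with the after cursor is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// occurrence mirrors the value/occurs pair returned by topOccurrences.
// It is a local illustration type, not a type from the Go client.
type occurrence struct {
	Value  string
	Occurs int
}

// countTopOccurrences counts how often each value appears and returns
// the n most frequent ones. Ties are broken alphabetically so the
// result is deterministic. A negative n returns all values.
func countTopOccurrences(values []string, n int) []occurrence {
	counts := make(map[string]int)
	for _, v := range values {
		counts[v]++
	}
	out := make([]occurrence, 0, len(counts))
	for v, c := range counts {
		out = append(out, occurrence{Value: v, Occurs: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Occurs != out[j].Occurs {
			return out[i].Occurs > out[j].Occurs
		}
		return out[i].Value < out[j].Value
	})
	if n >= 0 && n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	// Made-up sample values standing in for the fetched property values.
	values := []string{"acme", "globex", "acme", "initech", "globex", "acme"}
	for _, o := range countTopOccurrences(values, 2) {
		fmt.Printf("%s: %d\n", o.Value, o.Occurs)
	}
}
```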

Thanks a lot

Good morning @Willera,

Thank you so much for testing and letting me know. My understanding was based on the docs, and I was confused about how topOccurrences works:

  • I thought that: “The limit parameter acts as a minimum count threshold, where only values occurring at least that many times would be included in the results.”

I really appreciate your input here — I will take that to the docs team and get the docs updated with the right understanding.

I am glad you have it working now and everything is ok.

Have a lovely weekend!

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)
