Shard Status cannot be updated via Python API

Hi all,

I’m facing the problem that ingesting data / updating items no longer works because shards are in READONLY mode. Weaviate (1.20.0) is deployed in Google Kubernetes Engine with 10 pods.

The K8s logs show “lsm_compaction” entries with the message “compaction halted due to shard READONLY status” for all shards.

However, when I use the Python shard API, all shards are shown as READY. In addition, setting all shards to READY via Python before updating an item has no effect – the pod logs still show the shards in READONLY mode, in contrast to the READY status I had just set via Python.

Even though IOPS to the disk may be a problem, I don’t get why:
(1) K8s shows different shard information (READY vs. READONLY) compared to the Python API, and
(2) inserting/updating an item doesn’t work because of READONLY mode even though I forced the class shards to READY.

Can someone please explain to me what I’m doing wrong or misunderstand?

Thanks in advance!

Hi @troasted - and welcome.

My knowledge on this topic is limited, but I think this is sometimes caused by memory pressure: once memory usage reaches roughly 90%, shards are put into READONLY mode.

Is it possible that this is happening? (You can cap the available memory with the GOMEMLIMIT environment variable, I believe.)
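For example, in the helm chart’s values this could look like the following (a hypothetical snippet – GOMEMLIMIT should sit somewhat below the container’s memory limit so Go’s garbage collector kicks in before the pod hits the limit):

```yaml
# values.yaml (hypothetical snippet)
env:
  GOMEMLIMIT: "6GiB"
```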

(Edit: I’ve passed on your specific questions to the team also)


thanks for the welcome and the fast response! Memory should not be a problem, as each pod is heavily overprovisioned even for peak loads, so all pods are well below 90% or even 80% memory pressure. And currently, if I run an ingestion into one class with a batch size of 100, single-threaded, the maximum memory usage of a node is <50% and CPU <20%.

Right, thanks for that. I’ve asked so hopefully someone will get back to you soon. I’ll pass this on too.

1 Like

Hi @troasted - so the likely reasons are:

  1. The setup ran out of memory
  2. The setup ran out of disk space

If it is either of these two, there should be multiple log entries: first warnings leading up to the (disk or memory) threshold, then finally a message confirming that the shard was marked READONLY.
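As a quick sanity check on the disk-space theory, the usage of the data volume can be compared against Weaviate’s DISK_USE_READONLY_PERCENTAGE threshold (default 90; warnings begin at DISK_USE_WARNING_PERCENTAGE, default 80). A minimal sketch, assuming the default data path /var/lib/weaviate:

```python
import shutil

def disk_use_percent(path: str = "/var/lib/weaviate") -> float:
    """Percentage of the volume backing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Weaviate marks shards READONLY once usage crosses
# DISK_USE_READONLY_PERCENTAGE (default: 90); warnings start
# at DISK_USE_WARNING_PERCENTAGE (default: 80).
if __name__ == "__main__":
    print(f"data volume: {disk_use_percent('/'):.1f}% used")
```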

As to this:

K8s shows different shard information (READY or READONLY) compared to the Python API - why might this be?

The k8s status is a k8s thing; it only has Ready true/false. Ready means “give me traffic”; not ready means “don’t include me in the load balancing”. That is unrelated to a shard’s READY vs READONLY status.
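To make the distinction concrete: the /v1/schema/&lt;CLASS_NAME&gt;/shards endpoint returns Weaviate’s own per-shard status, which is independent of the k8s Ready condition. A small sketch of inspecting that payload (the sample response body here is hypothetical):

```python
import json

# Hypothetical response body from GET /v1/schema/<CLASS_NAME>/shards
sample = '[{"name": "Q9v1RipHZD0P", "status": "READONLY"}]'

def readonly_shards(payload: str) -> list:
    """Names of shards that Weaviate itself reports as READONLY --
    unrelated to whether the k8s pod is Ready for traffic."""
    return [s["name"] for s in json.loads(payload) if s["status"] == "READONLY"]

print(readonly_shards(sample))  # ['Q9v1RipHZD0P']
```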

Does that help? If you take a look at the logs leading to where they get marked as READONLY, does it indicate why this might be happening?

Hey @jphwang

memory cannot be the issue, as the problem arises even when updating just one single item via client.data_object.update.

At the time the item gets updated, all pods show a memory usage of at most 40%.

Disk space, at least as I understand the term, should also not be an issue, but I will dig into this a bit more:

In the helm-chart I didn’t update the “storage” block meaning that it’s kept with default values:

  size: 32Gi
  storageClassName: ""

In K8s, this results in 10 volumes, one per pod, with storage class “standard-rwo”, requested storage of 32Gi, and capacity of 32Gi. Looking at the persistent volume claim details, the requested access mode is “ReadWriteOnce”.

Inspecting the persistent volumes, no events are shown in GKE console.

Let me share an example of the warning arising in the logs:

{action: lsm_compaction, class: <CLASS_NAME>, index: <class_name>, level: warning, msg: compaction halted due to shard READONLY status, path: /var/lib/weaviate/<class_name>_Q9v1RipHZD0P_lsm, shard: <shard_name>}

This happens right after an item update, as shown above. I’m using a delta-load logic to ingest data, implemented as follows. The problem always arises in the “client.data_object.update” section:

print("Extraction running…")
source_data = get_source_data()

# Upserting
print("Upserting running…")
with client.batch(
    batch_size=100,  # single-threaded ingestion, batches of 100
) as batch:
    for item in tqdm(source_data, mininterval=1, desc=weaviate_class_name):
        item.prompt = prompt_generator(item)
        properties = asdict(item)
        item_id = generate_weaviate_id_from_item(item)

        # Fetch the existing object, if any
        try:
            weaviate_obj = client.data_object.get(
                uuid=item_id,
                class_name=weaviate_class_name,
            )
        except Exception:
            weaviate_obj = None

        # 1. Item exists --> Check for updating the item
        if item_exist_condition():
            weaviate_item = MyDataClass(**weaviate_obj.get("properties"))

            # 1.1 Update properties, not embedding
            if property_update_condition():
                client.data_object.update(
                    data_object=properties,
                    class_name=weaviate_class_name,
                    uuid=item_id,
                )
            # 1.2 Item is the same --> No update necessary
            elif weaviate_item == item:
                continue

        # 2. Item does not yet exist --> Insert the item
        else:
            batch.add_data_object(
                data_object=properties,
                class_name=weaviate_class_name,
                uuid=item_id,
            )
And even when I change the update section as follows, the shards still end up in READONLY mode and the update fails:

            if property_update_condition():

Thanks for that. Let us know what you find re: disk usage.

And if disk space is a problem, I think you should see warnings from Weaviate leading up to the error you posted. Please let us know if you see anything like that.

Disk usage is also not a problem, unfortunately…

It seems to be an issue of having multiple replicas (in my case 10): before the logs report that the shards of a class are put into READONLY, the error message below pops up. I could imagine that the item to be updated is simply not found (nil pointer dereference), because the batch has been written across multiple pods.

I will try to recreate the database with 1 replica, although horizontal scaling will then no longer work.

{"error":"runtime error: invalid memory address or nil pointer dereference","level":"error","method":"GET","msg":"runtime error: invalid memory address or nil pointer dereference","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/objects/TpacsItem_ada02/cc626292-e9d4-0b77-d3db-2142383892b9","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"include=vector","Fragment":"","RawFragment":""},"time":"2023-07-14T12:18:06Z"}
{"action":"requests_total","api":"rest","class_name":"","error":"runtime error: invalid memory address or nil pointer dereference","level":"error","msg":"unexpected error","query_type":"","time":"2023-07-14T12:18:06Z"}
goroutine 6089 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24 +0x65
panic({0x15f30a0, 0x2689e10})
    /usr/local/go/src/runtime/panic.go:884 +0x213
(*objectHandlers).getObject(0xc0039e6000, {0xc003a99600, {0xc002bc6610, 0xf}, 0x0, {0xc005d72930, 0x24}, 0xc003fd9670, 0x0, 0x0}, ...)
(*ObjectsClassGet).ServeHTTP(0xc0039a0de0, {0x1c09c30, 0xc0039e4540}, 0xc003a99600)
[... net/http middleware frames trimmed ...]
net/http.serverHandler.ServeHTTP({0xc00287d9b0?}, {0x1c09c30, 0xc0039e4540}, 0xc003a99400)
    /usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc00049e6c0, {0x1c0b288, 0xc0032bf950})
    /usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:3089 +0x5ed

Nope, still no luck with it… I redeployed with 1 replica, and when I do an item update I get

{error: runtime error: invalid memory address or nil pointer dereference, level: error, method: GET, msg: runtime error: invalid memory address or nil pointer dereference, path: {…}}

and afterwards the shard gets locked to READONLY:

{action: lsm_compaction, class: <CLASS_NAME>, index: <class_name>, level: warning, msg: compaction halted due to shard READONLY status, path: /var/lib/weaviate/<class_name>_xGihpxwxWDOi_lsm, shard: xGihpxwxWDOi}


Yet the shard API still reports READY:

curl -s <IP>/v1/schema/<CLASS_NAME>/shards -H "Authorization: Bearer <MY_TOKEN>" | jq
[
  {
    "name": "xGihpxwxWDOi",
    "status": "READY"
  }
]

Hmm. Sorry to hear you are still having issues.

How many objects do you have, and would you be able to share hardware specs?

The number of objects over all schemas is <300,000, since the database is just in a proof-of-concept state. However, the database is currently almost blank (only 1 schema with 1,019 items), as I recreated it by dropping all data and loading just one schema with a small amount of data to check whether the “item update leads to READONLY shards” problem persists (see our discussion above).

Hardware specs: We’re on a GKE Autopilot cluster with following information

Node overview:
gk3-tacs-pool-1-9cbe58f4-myc5 Ready 23d v1.26.5-gke.1200
gk3-tacs-pool-1-9e6b258a-m6jh Ready 17d v1.26.5-gke.1200
gk3-tacs-pool-2-19276bca-5752 Ready 10d v1.26.5-gke.1200
gk3-tacs-pool-3-75f67287-mw4f Ready 4d21h v1.26.5-gke.1200

Capacity per node (the single Weaviate replica totals 4 CPU and 20GB of memory):

"cpu": "2",
"ephemeral-storage": "98831908Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "8145592Ki",
"pods": "32"

"cpu": "2",
"ephemeral-storage": "98831908Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "8145592Ki",
"pods": "32"

"cpu": "4",
"ephemeral-storage": "98831908Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "16390360Ki",
"pods": "32"

"cpu": "8",
"ephemeral-storage": "98831908Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "32879884Ki",
"pods": "32"

Thanks for your help!

Problem is solved – it arose from a wrong way of upserting: “client.data_object.update” does not work inside the batch context manager “with client.batch(…) as batch” – see above. Instead, one must use “batch.add_data_object” to perform updates.
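For reference, the working pattern relies on deterministic object IDs: adding an object via batch.add_data_object with a UUID that already exists overwrites the stored object, so re-adding is effectively an upsert. A minimal sketch of such an ID helper (uuid5-based; the real generate_weaviate_id_from_item used above may differ):

```python
import uuid

def generate_item_id(source_key: str) -> str:
    """Deterministic UUID for a source item. Re-adding an object with
    the same UUID via batch.add_data_object overwrites it in Weaviate,
    which is what makes the batch-based upsert work."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, source_key))

print(generate_item_id("item-42") == generate_item_id("item-42"))  # True
```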