Unable to restart my Weaviate container

Description

I’m running Weaviate 1.25.0 as a container in AWS ECS.
I’m persisting storage via EFS.

I’m able to start up Weaviate, create my schema, load in a bunch of data, and run some test queries to make sure it’s all working well.

Then I run the following test:
I redeploy the container by scaling the ECS service down to 0 and back up to 1, which forces ECS to kill the old container and start a new one.
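For reference, this is roughly how I trigger the redeploy (cluster and service names are placeholders for my actual setup):

    aws ecs update-service --cluster my-dev-cluster --service weaviate-dev --desired-count 0
    aws ecs update-service --cluster my-dev-cluster --service weaviate-dev --desired-count 1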

At this point, Weaviate fails to start up correctly.

/v1/schema throws an error: {"error":[{"message":"could not read schema with strong consistency: failed to execute query: leader not found"}]}
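I’m hitting the endpoint with a plain REST call along these lines (the host is a placeholder and the key is my redacted API key):

    curl -H "Authorization: Bearer ae...fd" http://<weaviate-host>:8080/v1/schema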

In the log I’m seeing the following:
{"action":"bootstrap","error":"could not join a cluster from [172.17.0.22:8300]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["172.17.0.22:8300"],"time":"2024-05-14T03:07:18Z","voter":true}
{"action":"bootstrap","level":"info","msg":"notified peers this node is ready to join as voter","servers":["172.17.0.22:8300"],"time":"2024-05-14T03:07:18Z"}

The following env vars are set at the moment:

                "name": "AUTHENTICATION_APIKEY_ENABLED",
                "value": "true"

                "name": "CLUSTER_HOSTNAME ",
                "value": "weaviate-dev"

                "name": "AUTHENTICATION_APIKEY_ALLOWED_KEYS",
                "value": "ae...fd"

                "name": "AUTHENTICATION_APIKEY_USERS",
                "value": "a...r"

                "name": "PERSISTENCE_DATA_PATH",
                "value": "/var/lib/weaviate"

                "name": "ENABLE_CUDA",
                "value": "0"

I clearly have something set up wrong, so any advice at all is welcome. I’m trying to set up the dev env with just a single host, and it’s telling me it wants to join a non-existent cluster.

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: Docker (AWS ECS)
  • Multi Node? Number of Running Nodes: single node
  • Client Language and Version: just using GraphQL for testing

Any additional Information

Here’s the output from the weaviate container:

2024-05-14T03:09:00Z INF action=startup default_vectorizer_module=none msg=the default vectorizer modules is set to "none", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer
2024-05-14T03:09:00Z INF action=startup auto_schema_enabled=true msg=auto schema enabled setting is set to "true"
2024-05-14T03:09:00Z INF msg=No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true
2024-05-14T03:09:00Z INF msg=open cluster service servers={"2983fa3610b4":8300}
2024-05-14T03:09:00Z INF address=172.17.0.22:8301 msg=starting cloud rpc server ...
2024-05-14T03:09:00Z INF msg=starting raft sub-system ...
2024-05-14T03:09:00Z INF address=172.17.0.22:8300 msg=tcp transport tcpMaxPool=3 tcpTimeout=10000000000
2024-05-14T03:09:00Z INF metadata_only_voters=false msg=construct a new raft node name=2983fa3610b4
2024-05-14T03:09:00Z INF action=raft index=1 msg=raft initial configuration servers=[[{Suffrage:Voter ID:04aa32840caa Address:172.17.0.22:8300}]]
2024-05-14T03:09:00Z INF last_log_applied_index=3 last_snapshot_index=0 msg=raft node raft_applied_index=0 raft_last_index=3
2024-05-14T03:09:00Z INF action=raft follower={} leader-address= leader-id= msg=raft entering follower state
2024-05-14T03:09:02Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:02Z INF action=bootstrap candidates=[{"Suffrage":0,"ID":"2983fa3610b4","Address":"172.17.0.22:8300"}] msg=starting cluster bootstrapping
2024-05-14T03:09:02Z ERR action=bootstrap error=bootstrap only works on new clusters msg=could not bootstrapping cluster
2024-05-14T03:09:02Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:02Z warning msg=raft heartbeat timeout reached, not part of a stable configuration or a non-voter, not triggering a leader election
2024-05-14T03:09:02Z INF action=grpc_startup msg=grpc server listening at [::]:50051
2024-05-14T03:09:03Z INF action=restapi_management msg=Serving weaviate at http://[::]:8080
2024-05-14T03:09:03Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:03Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:04Z INF action=telemetry_push msg=telemetry started payload=&{MachineID:0eba0978-f3b0-4d55-92d0-32a900f97751 Type:INIT Version:1.25.0 Modules: NumObjects:0 OS:linux Arch:amd64}
2024-05-14T03:09:04Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:04Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:05Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:05Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:07Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:07Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:08Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:08Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]
2024-05-14T03:09:09Z warning action=bootstrap error=could not join a cluster from [172.17.0.22:8300] msg=failed to join cluster, will notify next if voter servers=["172.17.0.22:8300"] voter=true
2024-05-14T03:09:09Z INF action=bootstrap msg=notified peers this node is ready to join as voter servers=["172.17.0.22:8300"]


Looks like this is an issue with 1.25.0.

I just tried rolling my configuration back to 1.24.12, and after loading some data and restarting, it’s still working fine.
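Concretely, I just pinned the image tag in my task definition, something like this (image name as published on Docker Hub):

                "image": "semitechnologies/weaviate:1.24.12"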

However, a different error comes up now:
{"error":[{"message":"list objects: search index searchvideo: remote shard object search q6648XVHIeWP: resolve node name \"240a004f6aec\" to host"}]}

I found this discussion here: Error resolving node name to host

Does this mean my CLUSTER_HOSTNAME env variable isn’t being applied correctly?

omg, I had an extra space in my CLUSTER_HOSTNAME env variable. :sob:

Seems like that was my 1.24.12 issue.
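For anyone else hitting this, the fix was simply removing the stray space from the variable name in my task definition:

                { "name": "CLUSTER_HOSTNAME", "value": "weaviate-dev" }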

hi @maddios !! Welcome to our community :hugs:

Thanks for sharing!!

I’ve migrated from 1.22.x to 1.25.0 and started getting the same error when connecting to the Weaviate instance via weaviate-ts-client:

Error: usage error (403): {"error":[{"message":"could not read schema with strong consistency: failed to execute query: leader not found"}]}

Falling back to 1.24.12 helped, but there is still an error with 1.25.0.

hi @evenfrost !

If you are running a multi-node cluster on k8s, you need to follow this migration guide in order to upgrade to 1.25:

Let me know if this helps!

Thanks!

Hi @DudaNogueira, no, I’m using a Docker image locally.

That’s on a multi-node deployment, right?

There are some environment variable changes needed in order to leverage the new Raft consensus algorithm.

Notice the new RAFT_ environment variables here:
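For a single-node setup, a minimal sketch of those variables in ECS task-definition form would look something like this (the node name is a placeholder, and the exact set of variables may vary by version, so do check the docs):

                { "name": "CLUSTER_HOSTNAME", "value": "node1" },
                { "name": "RAFT_JOIN", "value": "node1" },
                { "name": "RAFT_BOOTSTRAP_EXPECT", "value": "1" }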

@evenfrost, do you mind opening a new thread so we can have better visibility on that?

I will try to reproduce this scenario later this week.

Thanks!

Hi @DudaNogueira , sorry for the late reply.

I created a separate topic here: Error when upgrading to Weaviate 1.25.1.