Description
We are seeing periodic "transferring leadership" errors in Weaviate, but we are running a single-node cluster, so a Raft leadership transfer makes no sense here. Weaviate sometimes takes hours to recover, and it brings down our whole application.
We also see these errors in the client when this happens:
API_ERROR:
UnexpectedStatusCodeError
message: Collection may not exist.! Unexpected status code: 500, with response body: {'error': [{'message': 'failed to execute query: leader not found'}]}.
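As a stopgap on the client side, a retry with backoff around each query might hide short leadership blips, although it would not help with the hours-long outages. A minimal sketch, assuming the error text can be matched on the exception message; `is_leader_error` and `call_with_retry` are hypothetical helpers of ours, not part of the Weaviate client:

```python
import time


def is_leader_error(exc: Exception) -> bool:
    """Return True if the exception text looks like the Raft errors quoted above."""
    text = str(exc).lower()
    return "leader not found" in text or "transferring leadership" in text


def call_with_retry(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a leader-related error, wait and retry with exponential backoff.

    Any other exception, or a leader error on the final attempt, is re-raised.
    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_leader_error(exc) or attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The query itself would be wrapped in a zero-argument callable, for example `call_with_retry(lambda: run_query())`, where `run_query` stands for whatever client call currently raises the `UnexpectedStatusCodeError`.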
These are our environment variables:
ENV QUERY_DEFAULTS_LIMIT=25
ENV AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
ENV PERSISTENCE_DATA_PATH=/var/lib/weaviate
ENV DEFAULT_VECTORIZER_MODULE=none
ENV ENABLE_MODULES=text2vec-cohere,text2vec-huggingface,text2vec-palm,text2vec-openai,generative-openai,generative-cohere,generative-palm,ref2vec-centroid,reranker-cohere,qna-openai,backup-filesystem
ENV CLUSTER_HOSTNAME=node1
ENV BACKUP_FILESYSTEM_PATH='/var/lib/weaviate/backups'
One proposed solution I saw somewhere is setting:
ENV RAFT_BOOTSTRAP_EXPECT=1
Will that help?
Server Setup Information
- Weaviate Server Version: Docker file semitechnologies/weaviate:1.26.5
- Deployment Method: k8s
- Multi Node? No. Number of Running Nodes: 1
- Client Language and Version: Python 3.10
- Multitenancy?: No
Any additional Information
For some reason we are unable to reproduce the problem in our development environment; it only happens in production. The only difference I can see is the size of the database: production is much larger.
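To compare the two environments more concretely, we could poll the node status endpoint on both and log the result when the errors start. A rough sketch, assuming the `/v1/nodes` REST endpoint is reachable and that each entry carries `name`, `status`, and a `stats.objectCount` field (my understanding of the response shape, not something we have verified on 1.26.5); the base URL is a placeholder:

```python
import json
import urllib.request


def summarize_nodes(payload: dict) -> list[tuple[str, str, int]]:
    """Reduce a /v1/nodes response to (name, status, object count) per node.

    Missing fields fall back to "?" or 0 so a partial response does not crash.
    """
    rows = []
    for node in payload.get("nodes", []):
        stats = node.get("stats") or {}
        rows.append(
            (node.get("name", "?"), node.get("status", "?"), stats.get("objectCount", 0))
        )
    return rows


def fetch_nodes(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/v1/nodes and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/nodes", timeout=timeout) as resp:
        return json.load(resp)


# Example (assumed URL): print(summarize_nodes(fetch_nodes("http://localhost:8080")))
```

Running this on a schedule in both dev and prod would at least show whether the node reports itself as unhealthy during the outage and how the object counts differ.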