Weaviate error "transferring leadership" on single node cluster

Description

We are seeing periodic "transferring leadership" errors in Weaviate, but we are only running a single-node cluster, so the Raft algorithm makes no sense here. It sometimes takes hours for Weaviate to recover, and it brings down our whole application.

We also see these errors in the client when this happens:

API_ERROR:
UnexpectedStatusCodeError
message: Collection may not exist.! Unexpected status code: 500, with response body: {'error': [{'message': 'failed to execute query: leader not found'}]}.

These are our environment variables:

ENV QUERY_DEFAULTS_LIMIT=25
ENV AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
ENV PERSISTENCE_DATA_PATH=/var/lib/weaviate
ENV DEFAULT_VECTORIZER_MODULE=none
ENV ENABLE_MODULES=text2vec-cohere,text2vec-huggingface,text2vec-palm,text2vec-openai,generative-openai,generative-cohere,generative-palm,ref2vec-centroid,reranker-cohere,qna-openai,backup-filesystem
ENV CLUSTER_HOSTNAME=node1
ENV BACKUP_FILESYSTEM_PATH='/var/lib/weaviate/backups'

One proposed solution I saw somewhere is setting:

ENV RAFT_BOOTSTRAP_EXPECT=1

Will that help?

Server Setup Information

  • Weaviate Server Version: Docker image semitechnologies/weaviate:1.26.5
  • Deployment Method: k8s
  • Multi Node?: No
  • Number of Running Nodes: 1
  • Client Language and Version: Python 3.10
  • Multitenancy?: No

Any additional Information

For some reason we are unable to reproduce the problem on our development environment, only in production. The only difference I can see is the size of the database, production is much larger.

Good morning @Stefan_Edlund :sunrise_over_mountains:

Welcome to our community! It’s lovely to have you on board.

The Raft implementation is used to ensure fault tolerance in multi-node clusters. In a single-node setup, Raft cannot elect leaders or followers because there are no peers. As a result, it is unable to complete a leadership transfer, which is why you're seeing those logs.

These leadership failure messages are not critical issues, and Weaviate should continue functioning as expected in a single-node configuration. For stability and performance, especially in production environments, we typically recommend a 3-node setup as the minimum.

You can set the following:

  • name: RAFT_JOIN
    value: weaviate-0
  • name: RAFT_BOOTSTRAP_EXPECT
    value: "1"

That said, setting these variables is not strictly necessary, as the error logs do not affect Weaviate's operations.

Best regards,
Mohamed Shahin,
Weaviate Support Engineer

Hi Mohamed,

the fix using RAFT_BOOTSTRAP_EXPECT did not work, so we reverted and removed it. Weaviate is taking forever to start up for us. We see lots of entries like this in the log:
{"action":"lsm_recover_from_active_wal","build_git_commit":"353d907","build_go_version":"go1.22.7","build_image_tag":"1.26.5","build_wv_version":"1.26.5","class":"INTERNET_SEARCH_cf5bac2e_94d4_45b6_9dd0_37fc3c4f39c5","index":"internet_search_cf5bac2e_94d4_45b6_9dd0_37fc3c4f39c5","level":"warning","msg":"empty write-ahead-log found. Did weaviate crash prior to this or the tenant on/loaded from the cloud? Nothing to recover from this file.","path":"/var/lib/weaviate/internet_search_cf5bac2e_94d4_45b6_9dd0_37fc3c4f39c5/eCdjnhtViEoQ/lsm/property__id/segment-1733796788431478066","shard":"eCdjnhtViEoQ","time":"2024-12-13T17:53:23Z"}


Are they harmless? It takes forever for Weaviate to start up, and it's bringing our application to a halt; we could very much use some help here.

Hi @Stefan_Edlund !!

Those messages will usually appear after a crash.

Do you, by any chance, have a lot of collections?

The more collections you have, the longer Weaviate can take to start.

If you have multiple collections, all of them with the same properties, you should leverage the multi-tenancy feature.

Let me know if this is your scenario here.

Also, we suggest always using the latest version. There have been a lot of improvements since 1.26.5.

Thanks!

Hi Duda,

likely we have lots of collections. What’s the latest version?

We also see these timeout errors in the Python client, what might cause these:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/langchain_core/retrievers.py", line 377, in aget_relevant_documents
result = await self._aget_relevant_documents(
File "/usr/local/lib/python3.10/site-packages/langchain/retrievers/merger_retriever.py", line 56, in _aget_relevant_documents
merged_documents = await self.amerge_documents(query, run_manager)
File "/usr/local/lib/python3.10/site-packages/langchain/retrievers/merger_retriever.py", line 105, in amerge_documents
retriever_docs = await asyncio.gather(
File "/usr/local/lib/python3.10/site-packages/langchain_core/retrievers.py", line 384, in aget_relevant_documents
raise e
File "/usr/local/lib/python3.10/site-packages/langchain_core/retrievers.py", line 377, in aget_relevant_documents
result = await self._aget_relevant_documents(
File "/workspace/app/langchain_tool/web_research_retriever.py", line 272, in _aget_relevant_documents
self.vectorstore.delete_collection()
File "/workspace/app/langchain_tool/vectorstore.py", line 113, in delete_collection
self._client.collections.delete(self._index_name)
File "/usr/local/lib/python3.10/site-packages/weaviate/collections/collections.py", line 203, in delete
self._delete(_capitalize_first_letter(name))
File "/usr/local/lib/python3.10/site-packages/weaviate/collections/base.py", line 93, in _delete
self._connection.delete(
File "/usr/local/lib/python3.10/site-packages/weaviate/connect/v4.py", line 466, in delete
return self.__send(
File "/usr/local/lib/python3.10/site-packages/weaviate/connect/v4.py", line 449, in __send
res = self._client.send(req)
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/httpx/patch.py", line 166, in _wrapped_sync_send
resp = wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 914, in send
response = self._send_handling_auth(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 942, in _send_handling_auth
response = self._send_handling_redirects(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
response = self._send_single_request(request)
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1015, in _send_single_request
response = transport.handle_request(request)
File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 232, in handle_request
with map_httpcore_exceptions():
File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadTimeout: timed out

Hi!

The latest version is 1.28.0

In order to upgrade, we suggest first upgrading to the latest patch of each minor release along the way.

So for example:

1.26.5 → 1.26.12 → 1.27.8 → 1.28.0

Note that LangChain already supports multi-tenancy.

Here you can find a recipe I wrote for this exact scenario:

We have identified some issues when there are a lot of collections.

One of them: the startup time grows with the number of collections.

Also, each time you add a new collection, the GraphQL schema needs to be rebuilt, which can affect some queries on that endpoint.

Let me know if this helps!

Thanks!

Hi,

we have a sick Weaviate, and still need help figuring out what we can do.

The k8s liveness probe often restarts Weaviate when pings fail, and when it restarts it goes through a recovery process that takes an hour.

How can we diagnose this? We do likely have lots of collections, what’s a quick way to clean that up, since I think most of them are unimportant?

We are also struggling with the “leader not found” problem still, it seems to happen when Weaviate has been idle for a while, and suddenly wakes up.

One thing we’re trying to do is deleting collections from the file system (mounted on /var/lib/weaviate), but Weaviate keeps recreating them. How can we prevent that?

hi @Stefan_Edlund !!

Deleting from the file system is not advisable :grimacing:

You can try increasing the probe timeouts (readiness and liveness) on k8s, allowing Weaviate enough time to start up. Because of the number of collections and the resources allocated, it can take quite some time.
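As a rough illustration of relaxing the probes, here is a sketch for a Weaviate pod spec; the exact numbers are hypothetical and should be tuned to your observed startup times (the `/v1/.well-known/live` and `/v1/.well-known/ready` endpoints are Weaviate's standard health checks):

```yaml
# Hypothetical probe settings; tune delays/thresholds to your startup times.
livenessProbe:
  httpGet:
    path: /v1/.well-known/live
    port: 8080
  initialDelaySeconds: 900   # give Weaviate time to replay write-ahead logs
  periodSeconds: 30
  failureThreshold: 20
readinessProbe:
  httpGet:
    path: /v1/.well-known/ready
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 60
```

A generous liveness `failureThreshold` matters most here: it keeps k8s from killing the pod mid-recovery and restarting the whole replay from scratch.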

Then clear all the collections you can, deleting them using the client, not removing them from the filesystem.
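A minimal sketch of that client-side cleanup, assuming the v4 Python client; the `KEEP` allow-list is a hypothetical placeholder for the collections you want to preserve:

```python
# Sketch assuming weaviate-client >= 4 (imported lazily below).

KEEP = {"ImportantCollection"}  # hypothetical allow-list of collections to preserve

def unwanted(names, keep=KEEP):
    """Pure helper: which collection names should be deleted."""
    return sorted(n for n in names if n not in keep)

def delete_unwanted(client, keep=KEEP):
    # collections.list_all() returns a dict keyed by collection name
    for name in unwanted(client.collections.list_all().keys(), keep):
        client.collections.delete(name)  # delete via the API, never the filesystem

if __name__ == "__main__":
    import weaviate
    with weaviate.connect_to_local() as client:  # adjust connection for your cluster
        delete_unwanted(client)
```

Deleting through the API keeps Weaviate's Raft-replicated schema in sync with the data on disk, which is why filesystem deletion appears to "undo" itself.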

Once your cluster is up, running, and stable, you'll need to spin up a new cluster using the latest version and migrate your data using multi-tenancy.

So each customer collection, say Customer1234, will become tenant ID 1234 in a single Customer collection, for example.

You then change your code to initialize the collection with the given tenant.
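A sketch of that migration shape, assuming the v4 Python client's multi-tenancy API; the `Customer` prefix and the iteration details are illustrative, not a drop-in script:

```python
# Sketch of the Customer1234 -> tenant "1234" migration (weaviate-client >= 4).

def tenant_id(collection_name, prefix="Customer"):
    """Pure helper: map 'Customer1234' to tenant id '1234'."""
    assert collection_name.startswith(prefix)
    return collection_name[len(prefix):]

def migrate(old_client, new_client, prefix="Customer"):
    from weaviate.classes.config import Configure
    from weaviate.classes.tenants import Tenant

    # One multi-tenant collection replaces the per-customer collections.
    customers = new_client.collections.create(
        prefix,
        multi_tenancy_config=Configure.multi_tenancy(enabled=True),
    )
    for name in old_client.collections.list_all():
        if not name.startswith(prefix):
            continue
        tid = tenant_id(name, prefix)
        customers.tenants.create(Tenant(name=tid))
        dst = customers.with_tenant(tid)  # all writes scoped to this tenant
        src = old_client.collections.get(name)
        for obj in src.iterator(include_vector=True):
            dst.data.insert(properties=obj.properties, vector=obj.vector)
```

Application code then opens `client.collections.get("Customer").with_tenant("1234")` instead of `client.collections.get("Customer1234")`.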

Let me know if this helps.

Thanks!

Following your advice, I wrote a Python script to delete the unwanted collections. We have about 6,000 of them.

It took about 30 seconds per deletion (!), and after some time Weaviate crashed and restarted (possibly due to the liveness probe). When it came back up, the deleted collections had been re-created :(.

They were recreated? :confused:

With all the same content inside?

That doesn’t make sense.

Please, if possible, can you reach out to me in our public slack?
https://weaviate.io/slack

That way I can take a closer look on it.

Thanks!

Hi,

the application is in maintenance mode, but it's still popular and we want to keep it up.

We were able to get rid of the unwanted collections by disabling the liveness probe and running a script deleting them. We are not sure if the collections will be recreated again when Weaviate restarts.

It’s a temporary solution, but for now we’re okay.

Thanks for the help.


Great!

Let me know whenever there is any issue we can help with!

Thanks for using Weaviate :love_you_gesture: