Description
We currently run a docker-swarm-based deployment in production and are moving to a Kubernetes-based deployment. Everything comes up, including Weaviate, but we get Weaviate connection issues when inserting data in batches. It doesn't always fail, but it fails about 8 out of 10 times with a connection error (unable to connect to Weaviate). Weaviate also restarts automatically while data is being inserted; I've added the logs from the time of the restart below.
On the docker-swarm-based deployment, everything works fine.
Server Setup Information
- Weaviate Server Version: 1.22.5
- Deployment Method: K8s, helm
- Multi Node? Number of Running Nodes: only 1 node
- Client Language and Version: Python V3
Any additional Information
- I am using AWS EKS and EKS nodes are in a private subnet.
- I am using text2vec-openai as the vectorizer.
- We have multi-tenancy enabled and the tenants are HOT as well
- args while starting up weaviate:
args:
- '--host'
- '0.0.0.0'
- '--port'
- '8080'
- '--scheme'
- 'http'
- '--config-file'
- '/weaviate-config/conf.yaml'
- --read-timeout=200s
- --write-timeout=400s
- Weaviate is flooded with the below logs:
time="2024-05-19T15:13:32Z" level=trace msg="no segment eligible for compaction" action=lsm_compaction class=Paragraph index=paragraph path=/var/lib/weaviate/paragraph_DWI57WOAQ_lsm/property_metadata_searchable shard=DWI57WOAQ
- While checking the Weaviate logs I can see requests to openid-configuration as well, even though I’ve explicitly set OpenID (OIDC) authentication to false
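
Since the trace-level compaction lines drown out everything else, one way to read a captured log is to filter them out first. This is a minimal sketch, assuming the logs keep the `key=value` layout shown above; `parse_logfmt` and `drop_trace` are hypothetical helper names, not part of Weaviate or its client.

```python
import re

# Matches key=value pairs where the value is either "quoted" or a bare token.
# Keys may contain '#' and '.' because the logs use fields like #partitions.
_PAIR = re.compile(r'([#\w.]+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line: str) -> dict:
    """Parse one logrus-style line into a dict of its key=value fields."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            # Strip the surrounding quotes and unescape embedded \" sequences.
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def drop_trace(lines):
    """Yield only the lines whose level is not 'trace'.

    Lines without a level field (for example a Go panic) are kept.
    """
    for line in lines:
        if parse_logfmt(line).get("level") != "trace":
            yield line
```

Piping the output of `kubectl logs` through `drop_trace` would leave the debug-level request lines and any panic output, which are the parts relevant to the restarts.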
Troubleshooting steps used:
- My app is able to connect to Weaviate: I made a curl request and got the schema and metadata back.
- I set `timeout_retries` when configuring my batch, and `batch_size` is only 10.
- There are no resource issues: I monitor resources while training, and I’ve doubled my server's resources and removed all resource limits from K8s, but Weaviate is still unstable and works only 2 out of 10 times.
- Doubled the `timeout_config` compared to my docker-swarm-based setup.
- Weaviate restarts automatically while inserting any data. Below are the logs from the previous (restarted) pod, retrieved using
kubectl logs <weaviate-pod-name> --previous
time="2024-05-22T11:44:40Z" level=trace msg="no segment eligible for compaction" action=lsm_compaction class=Paragraph index=paragraph path=/var/lib/weaviate/paragraph_DWI57WOAQ_lsm/property_content_searchable shard=DWI57WOAQ
time="2024-05-22T11:44:48Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:44:48Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:44:51Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:44:51Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:45:03Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:45:03Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:45:06Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:45:06Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:45:18Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:45:18Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:45:21Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/.well-known/openid-configuration
time="2024-05-22T11:45:21Z" level=debug msg="received HTTP request" action=restapi_request method=GET url=/v1/meta
time="2024-05-22T11:45:30Z" level=debug msg="received HTTP request" action=restapi_request method=POST url=/v1/graphql
time="2024-05-22T11:45:31Z" level=debug msg="received HTTP request" action=restapi_request method=POST url=/v1/schema/Paragraph/tenants
time="2024-05-22T11:45:31Z" level=trace msg="number of partitions for class \"Paragraph\" does not match number of requested tenants" #partitions=0 #requested=1 action=add_tenants
time="2024-05-22T11:45:31Z" level=debug msg="saving updated schema to configuration store" action=schema.add_tenants
time="2024-05-22T11:45:31Z" level=debug msg="received HTTP request" action=restapi_request method=POST url=/v1/batch/objects
time="2024-05-22T11:45:31Z" level=debug msg="received HTTP request" action=restapi_request method=DELETE url="/v1/batch/objects?tenant=DWI57WOAQ"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 43.872µs" action=store_object_store_determine_status took="43.872µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 61.211µs" action=store_object_store_determine_status took="61.211µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 14.351µs" action=store_object_store_determine_status took="14.351µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 31.778µs" action=store_object_store_upsert_object_data took="31.778µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 31.299µs" action=store_object_store_upsert_object_data took="31.299µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 119.613µs" action=store_object_store_determine_status took="119.613µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 27.862µs" action=store_object_store_upsert_object_data took="27.862µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 39.337µs" action=store_object_store_determine_status took="39.337µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 27.194µs" action=store_object_store_upsert_object_data took="27.194µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 53.342µs" action=store_object_store_determine_status took="53.342µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 28.427µs" action=store_object_store_upsert_object_data took="28.427µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 30.167µs" action=store_object_store_upsert_object_data took="30.167µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 115.14µs" action=store_object_store_determine_status took="115.14µs"
time="2024-05-22T11:45:31Z" level=trace msg="retrieving previous and determining status in KV took 18.414µs" action=store_object_store_determine_status took="18.414µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 16.989µs" action=store_object_store_upsert_object_data took="16.989µs"
time="2024-05-22T11:45:31Z" level=trace msg="storing object data in KV took 30.053µs" action=store_object_store_upsert_object_data took="30.053µs"
time="2024-05-22T11:45:31Z" level=trace msg="object batch took 4.784573ms" action=batch_objects batch_size=10 took=4.784573ms
panic: close of nil channel
goroutine 103 [running]:
github.com/weaviate/weaviate/adapters/repos/db.(*vectorQueue).releaseChunk(0xc002624150, 0xc02c7ea000)
/go/src/github.com/weaviate/weaviate/adapters/repos/db/index_queue.go:732 +0x28
github.com/weaviate/weaviate/adapters/repos/db.asyncWorker(0x0?, {0x1d38810, 0xc0033d5480}, 0x0?)
/go/src/github.com/weaviate/weaviate/adapters/repos/db/repo.go:366 +0x1b4
github.com/weaviate/weaviate/adapters/repos/db.New.func1()
/go/src/github.com/weaviate/weaviate/adapters/repos/db/repo.go:169 +0x6b
created by github.com/weaviate/weaviate/adapters/repos/db.New in goroutine 1
/go/src/github.com/weaviate/weaviate/adapters/repos/db/repo.go:166 +0x76d
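
For context, the `POST /v1/batch/objects` request with `batch_size=10` that immediately precedes the panic comes from a batch insert in the Python v3 client. The following is a minimal sketch of roughly what such an insert looks like, not my exact code: the function signature is illustrative, the `timeout_retries` value of 3 is an assumption (the actual value isn't stated in this post), and `client` is expected to be a `weaviate.Client` that is passed in so the sketch carries no import of its own.

```python
def batch_insert_data(client, objects, class_name, tenant,
                      batch_size=10, timeout_retries=3):
    """Insert objects through the v3 client's batch context manager.

    `client` is assumed to be a weaviate.Client (v3 API). The value of
    timeout_retries is a placeholder, not the setting used in production.
    """
    client.batch.configure(
        batch_size=batch_size,            # auto-flush once this many objects are queued
        timeout_retries=timeout_retries,  # retry a batch request that timed out
    )
    added = 0
    with client.batch as batch:
        for obj in objects:
            batch.add_data_object(
                data_object=obj,
                class_name=class_name,
                tenant=tenant,  # required because the class is multi-tenant
            )
            added += 1
    # Leaving the `with` block flushes whatever is still queued.
    return added
```

In real use the client would be created with something like `weaviate.Client("http://weaviate:80", timeout_config=...)`, using the service URL that appears in the application logs and whatever timeout values the deployment uses.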
My application logs (newest first):
2024-05-22T11:45:31.50544861Z stdout F 2024-05-22 11:45:31.505 | DEBUG | urllib3.connectionpool:_make_request:474 - http://weaviate:80 "POST /v1/batch/objects HTTP/1.1" 200 None
2024-05-22T11:45:31.505275694Z stderr F [2024-05-22 11:45:31,505: DEBUG/MainProcess] http://weaviate:80 "POST /v1/batch/objects HTTP/1.1" 200 None
2024-05-22T11:45:31.287547948Z stdout F
2024-05-22T11:45:31.287545334Z stdout F celery.exceptions.InvalidTaskError: Failed to insert data into weaviate
2024-05-22T11:45:31.287543009Z stdout F
2024-05-22T11:45:31.287540948Z stdout F raise InvalidTaskError("Failed to insert data into weaviate")
2024-05-22T11:45:31.287538623Z stdout F File "/app/backend/embeddings/service.py", line 106, in task_create_embeddings
2024-05-22T11:45:31.287536314Z stdout F
2024-05-22T11:45:31.287534282Z stdout F return self.run(*args, **kwargs)
2024-05-22T11:45:31.287531846Z stdout F File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 760, in __protected_call__
2024-05-22T11:45:31.287529249Z stdout F R = retval = fun(*args, **kwargs)
2024-05-22T11:45:31.287526631Z stdout F > File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
2024-05-22T11:45:31.287523642Z stdout F
2024-05-22T11:45:31.28699494Z stderr F [2024-05-22 11:45:31,286: ERROR/MainProcess] Task backend.embeddings.service.task_create_embeddings[20518e81-af2a-4c29-a19d-d9c4625e3c4b] raised unexpected: InvalidTaskError('Failed to insert data into weaviate')
2024-05-22T11:45:31.2852683Z stdout F 2024-05-22 11:45:31.285 | DEBUG | urllib3.connectionpool:_make_request:474 - http://weaviate:80 "DELETE /v1/batch/objects?tenant=DWI57WOAQ HTTP/1.1" 200 270
2024-05-22T11:45:31.285111655Z stderr F [2024-05-22 11:45:31,285: DEBUG/MainProcess] http://weaviate:80 "DELETE /v1/batch/objects?tenant=DWI57WOAQ HTTP/1.1" 200 270
2024-05-22T11:45:31.2833688Z stdout F 2024-05-22 11:45:31.283 | DEBUG | urllib3.connectionpool:_get_conn:291 - Resetting dropped connection: weaviate
2024-05-22T11:45:31.283229575Z stderr F [2024-05-22 11:45:31,283: DEBUG/MainProcess] Resetting dropped connection: weaviate
2024-05-22T11:45:31.282412984Z stdout F
2024-05-22T11:45:31.282407219Z stdout F requests.exceptions.ConnectionError: Batch was not added to weaviate.
2024-05-22T11:45:31.282404783Z stdout F
2024-05-22T11:45:31.282402276Z stdout F └ <class 'requests.exceptions.ConnectionError'>
2024-05-22T11:45:31.282395753Z stdout F raise RequestsConnectionError("Batch was not added to weaviate.") from conn_err
2024-05-22T11:45:31.282386678Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 742, in _create_data
2024-05-22T11:45:31.282378524Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282375411Z stdout F │ └ <function Batch._create_data at 0x7f39feaffc10>
2024-05-22T11:45:31.282372926Z stdout F response = self._create_data(
2024-05-22T11:45:31.282370749Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1099, in _flush_in_thread
2024-05-22T11:45:31.282368107Z stdout F └ None
2024-05-22T11:45:31.282365727Z stdout F │ └ None
2024-05-22T11:45:31.282363534Z stdout F │ │ └ None
2024-05-22T11:45:31.282361294Z stdout F result = self.fn(*self.args, **self.kwargs)
2024-05-22T11:45:31.282356209Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
2024-05-22T11:45:31.282353501Z stdout F └ None
2024-05-22T11:45:31.282351137Z stdout F raise self._exception
2024-05-22T11:45:31.282348709Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
2024-05-22T11:45:31.282346586Z stdout F └ None
2024-05-22T11:45:31.282344277Z stdout F return self.__get_result()
2024-05-22T11:45:31.282340587Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
2024-05-22T11:45:31.282329781Z stdout F └ <Future at 0x7f3a025c8ee0 state=finished raised ConnectionError>
2024-05-22T11:45:31.282327321Z stdout F │ └ <function Future.result at 0x7f3a02727310>
2024-05-22T11:45:31.28232506Z stdout F response_objects, nr_objects = done_future.result()
2024-05-22T11:45:31.282322448Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
2024-05-22T11:45:31.282319901Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282317637Z stdout F │ └ <function Batch._send_batch_requests at 0x7f39fea860d0>
2024-05-22T11:45:31.282314719Z stdout F self._send_batch_requests(force_wait=False)
2024-05-22T11:45:31.282302087Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1242, in _auto_create
2024-05-22T11:45:31.28229852Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282295618Z stdout F │ └ <function Batch._auto_create at 0x7f39fea86160>
2024-05-22T11:45:31.282292253Z stdout F self._auto_create()
2024-05-22T11:45:31.282287575Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 569, in add_data_object
2024-05-22T11:45:31.282285173Z stdout F
2024-05-22T11:45:31.282282937Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282276776Z stdout F │ └ <function Batch.add_data_object at 0x7f39feaffaf0>
2024-05-22T11:45:31.282274603Z stdout F batch.add_data_object(
2024-05-22T11:45:31.282272183Z stdout F File "/app/backend/utils.py", line 382, in batch_insert_data
2024-05-22T11:45:31.282265159Z stdout F
2024-05-22T11:45:31.282262959Z stdout F └ None
2024-05-22T11:45:31.282255779Z stdout F raise self._exception
2024-05-22T11:45:31.282253444Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
2024-05-22T11:45:31.282251042Z stdout F └ None
2024-05-22T11:45:31.282248506Z stdout F return self.__get_result()
2024-05-22T11:45:31.282241499Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
2024-05-22T11:45:31.282239296Z stdout F └ <Future at 0x7f3a025c8ee0 state=finished raised ConnectionError>
2024-05-22T11:45:31.282237053Z stdout F │ └ <function Future.result at 0x7f3a02727310>
2024-05-22T11:45:31.282234831Z stdout F response_objects, nr_objects = done_future.result()
2024-05-22T11:45:31.282231203Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
2024-05-22T11:45:31.282227617Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282224022Z stdout F │ └ <function Batch._send_batch_requests at 0x7f39fea860d0>
2024-05-22T11:45:31.282220159Z stdout F self._send_batch_requests(force_wait=True)
2024-05-22T11:45:31.28221752Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1252, in flush
2024-05-22T11:45:31.282215315Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282212882Z stdout F │ └ <function Batch.flush at 0x7f39fea861f0>
2024-05-22T11:45:31.282210617Z stdout F self.flush()
2024-05-22T11:45:31.282208306Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1646, in __exit__
2024-05-22T11:45:31.282205936Z stdout F
2024-05-22T11:45:31.28220378Z stdout F return True
2024-05-22T11:45:31.282201666Z stdout F File "/app/backend/utils.py", line 385, in batch_insert_data
2024-05-22T11:45:31.282199549Z stdout F
2024-05-22T11:45:31.282197427Z stdout F └ None
2024-05-22T11:45:31.282195114Z stdout F raise self._exception
2024-05-22T11:45:31.282192284Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
2024-05-22T11:45:31.282190032Z stdout F └ None
2024-05-22T11:45:31.282187796Z stdout F return self.__get_result()
2024-05-22T11:45:31.282185522Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
2024-05-22T11:45:31.282183242Z stdout F └ None
2024-05-22T11:45:31.28218058Z stdout F │ └ <Future at 0x7f3a025c8ee0 state=finished raised ConnectionError>
2024-05-22T11:45:31.282178398Z stdout F │ │ └ <function Future.result at 0x7f3a02727310>
2024-05-22T11:45:31.282172037Z stdout F response_objects, nr_objects = done_future.result()
2024-05-22T11:45:31.282167096Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
2024-05-22T11:45:31.282165054Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282162978Z stdout F │ └ <function Batch._send_batch_requests at 0x7f39fea860d0>
2024-05-22T11:45:31.282160702Z stdout F self._send_batch_requests(force_wait=False)
2024-05-22T11:45:31.282158218Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1242, in _auto_create
2024-05-22T11:45:31.282154646Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282151111Z stdout F │ └ <function Batch._auto_create at 0x7f39fea86160>
2024-05-22T11:45:31.282147332Z stdout F self._auto_create()
2024-05-22T11:45:31.282143699Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 569, in add_data_object
2024-05-22T11:45:31.282140465Z stdout F
2024-05-22T11:45:31.282138078Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282135774Z stdout F │ └ <function Batch.add_data_object at 0x7f39feaffaf0>
2024-05-22T11:45:31.282133317Z stdout F batch.add_data_object(
2024-05-22T11:45:31.282130609Z stdout F File "/app/backend/utils.py", line 382, in batch_insert_data
2024-05-22T11:45:31.282128249Z stdout F
2024-05-22T11:45:31.282126079Z stdout F └ None
2024-05-22T11:45:31.282123751Z stdout F raise self._exception
2024-05-22T11:45:31.282121377Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
2024-05-22T11:45:31.282118753Z stdout F └ None
2024-05-22T11:45:31.282116396Z stdout F return self.__get_result()
2024-05-22T11:45:31.282108389Z stdout F File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
2024-05-22T11:45:31.282099414Z stdout F └ None
2024-05-22T11:45:31.282097089Z stdout F │ └ <Future at 0x7f3a025c8ee0 state=finished raised ConnectionError>
2024-05-22T11:45:31.28209463Z stdout F │ │ └ <function Future.result at 0x7f3a02727310>
2024-05-22T11:45:31.282092296Z stdout F response_objects, nr_objects = done_future.result()
2024-05-22T11:45:31.282089857Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1151, in _send_batch_requests
2024-05-22T11:45:31.282087413Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282085129Z stdout F │ └ <function Batch._send_batch_requests at 0x7f39fea860d0>
2024-05-22T11:45:31.282082873Z stdout F self._send_batch_requests(force_wait=True)
2024-05-22T11:45:31.28208024Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1252, in flush
2024-05-22T11:45:31.282077525Z stdout F └ <weaviate.batch.crud_batch.Batch object at 0x7f3a0214c280>
2024-05-22T11:45:31.282073453Z stdout F │ └ <function Batch.flush at 0x7f39fea861f0>
2024-05-22T11:45:31.282070048Z stdout F self.flush()
2024-05-22T11:45:31.282066665Z stdout F File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 1646, in __exit__
2024-05-22T11:45:31.282063146Z stdout F
2024-05-22T11:45:31.282060696Z stdout F return True
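
The application log above is in newest-first order, so the traceback reads bottom to top. A small sketch for putting such lines back into chronological order, assuming every line starts with the container-runtime timestamp as shown; `timestamp_key` and `chronological` are hypothetical helper names.

```python
def timestamp_key(line: str) -> str:
    """Return a text-sortable key from the leading RFC 3339 timestamp."""
    stamp = line.split(" ", 1)[0].rstrip("Z")
    whole, _, frac = stamp.partition(".")
    # Trailing zeros are trimmed in these logs (e.g. ".2852683"), so pad the
    # fraction back to nanosecond width; otherwise a shorter fraction can
    # compare as later than a longer one that shares its prefix.
    return f"{whole}.{frac.ljust(9, '0')}"


def chronological(lines):
    """Sort log lines into oldest-first order (stable for equal timestamps)."""
    return sorted(lines, key=timestamp_key)
```

After sorting, the `requests.exceptions.ConnectionError` traceback comes first, followed by the "Resetting dropped connection" message, the `DELETE` request, and the Celery `InvalidTaskError`.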