I’m using the Spark connector to import nearly 200M records. I’d like to use bigger batches and take advantage of the asynchronous importing available from Weaviate version 1.22, but the Spark connector seems to have trouble with batch sizes above 200. Specifically, when going beyond 200, I often see errors like the following:
reason=ExceptionFailure(io.weaviate.spark.WeaviateResultError,error getting result and no more retries left. Error from Weaviate: [WeaviateErrorMessage(message=java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $, throwable=com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $), WeaviateErrorMessage(message=Failed ids: 42946687-9c7b-5a99-b5a5-60f2216e894d,...
Any help would be appreciated!
Hello, are you importing data into a WCS sandbox, or is it your own Weaviate setup?
Your exception message
Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $
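suggests that there was a timeout on the load balancer (LB) side: the client expected a JSON object but received a plain string, which is what you would see if something in front of Weaviate answered instead of Weaviate itself. It looks like you’re overwhelming your WCS instance, and that’s why you get this error from the LB.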
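To illustrate why a non-JSON reply produces this kind of parse error, here is a minimal Python sketch. It is not the connector's actual code; `looks_like_json_object` is a hypothetical helper, and the sample bodies are made up for illustration. The point is only that a client expecting a JSON object fails on a plain-text or HTML page before it can read any real Weaviate error.

```python
import json


def looks_like_json_object(body: str) -> bool:
    """Return True only if the body parses as a JSON object (a dict)."""
    try:
        return isinstance(json.loads(body), dict)
    except json.JSONDecodeError:
        return False


# Hypothetical response bodies, for illustration only.
proxy_reply = "upstream request timeout"   # plain text, not JSON
weaviate_reply = '{"error": []}'           # a JSON object

print(looks_like_json_object(proxy_reply))     # False
print(looks_like_json_object(weaviate_reply))  # True
```

If you can capture the raw response body when the error occurs, a check like this tells you whether the reply came from Weaviate or from something in front of it.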
Thank you for the reply. I’m running self-hosted Weaviate, with two transformer-inference containers, each running on a GPU, and 4 nodes, each running one Weaviate container as a shard. My load balancer timeout is set to 10 min, and I’m surprised that I’m overloading my instance with what feels like a marginal change (200 vs. 250). I will explore resource planning options and see where the error could be coming from.
Please do! I think you are overwhelming your current Weaviate infrastructure, and that’s why you are getting these kinds of errors when using the Spark connector. I think that the transformer-inference service might be the bottleneck here.
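While you investigate, it may help to keep the batch size and retry settings in one place so they are easy to tune. Below is a rough PySpark sketch under some assumptions: the option names (`batchSize`, `retries`, `timeout`, `scheme`, `host`, `className`) are as I recall them from the connector README, so please verify them against your connector version; the host, class name, and the two helper functions are hypothetical.

```python
def weaviate_write_options(host, class_name, batch_size=200, retries=2, timeout_s=60):
    """Build the option map passed to the Spark writer (all values as strings).

    Option names are assumed from the connector README; check your version.
    """
    return {
        "scheme": "http",
        "host": host,
        "className": class_name,
        "batchSize": str(batch_size),
        "retries": str(retries),
        "timeout": str(timeout_s),
    }


def write_to_weaviate(df, options):
    """Append a Spark DataFrame to Weaviate using the given options."""
    (df.write.format("io.weaviate.spark.Weaviate")
        .options(**options)
        .mode("append")
        .save())


# Hypothetical host and class name; start small and raise batch_size gradually.
opts = weaviate_write_options("localhost:8080", "Article", batch_size=100)
print(opts["batchSize"])  # 100
```

With this in place you would call `write_to_weaviate(df, opts)` on your DataFrame and step the batch size up while watching the load on the transformer-inference containers.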