[Question] Issue with data insertion using Spark into a Weaviate cluster

Hello Weaviate Support Team,

I’m encountering an issue when inserting data into a multi-node Weaviate cluster using Spark. I can successfully insert data into other collections, but when I attempt to insert data into the Ebright collection, I receive the following error:

WeaviateErrorMessage(message=local index "Ebright" not found: deadline exceeded for waiting for update: version got=3 want=108, throwable=null)

The error occurs when writing batches of data using the following Spark job:

from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()
df = spark.createDataFrame(documents)

# Write the DataFrame to Weaviate via the Spark connector
df.write.format("io.weaviate.spark.Weaviate") \
    .option("batchSize", 200) \
    .option("scheme", "http") \
    .option("host", weaviate_url) \
    .option("grpc:host", "10.2.0.13:50051") \
    .option("grpc:secured", "false") \
    .option("className", weaviate_table_name) \
    .mode("append").save()

The schema for the Ebright collection is as follows:

from weaviate.classes.config import Property, DataType

properties = [
    Property(name="chunk", data_type=DataType.TEXT, vectorize_property_name=False),
    Property(name="chunk_id", data_type=DataType.INT, vectorize_property_name=False),
    Property(name="chunk_unique_id", data_type=DataType.TEXT, vectorize_property_name=False),
    Property(name="abs_url", data_type=DataType.TEXT, vectorize_property_name=False, skip_vectorization=True),
    Property(name="file_name", data_type=DataType.TEXT, vectorize_property_name=False, skip_vectorization=True),
    Property(name="server_relative_url", data_type=DataType.TEXT, vectorize_property_name=False, skip_vectorization=True),
    Property(name="topic", data_type=DataType.TEXT, vectorize_property_name=False, skip_vectorization=True),
    Property(name="time_created", data_type=DataType.DATE, vectorize_property_name=False, skip_vectorization=True),
    Property(name="time_lastmodified", data_type=DataType.DATE, vectorize_property_name=False, skip_vectorization=True),
    Property(name="unique_id", data_type=DataType.TEXT, vectorize_property_name=False, skip_vectorization=True),
]
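
For context, the collection was created from these properties roughly as follows (a simplified sketch; the connection details are placeholders, and the replication/sharding settings are omitted here):

import weaviate
from weaviate.classes.config import Configure

# Simplified sketch of creating the Ebright collection from the properties above.
# Host/port are placeholders; replication and sharding settings are left out.
client = weaviate.connect_to_local(host="10.2.0.13", port=8080, grpc_port=50051)

client.collections.create(
    name="Ebright",
    properties=properties,
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
)

client.close()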

Cluster Setup:

  • 3-node cluster.
  • Using Weaviate version 1.26.6.
  • Spark for batch data insertion.
  • Weaviate Python client 4.9.0.

I can see that the collection already exists, so it's strange that the error says the index was not found:

test_col = client.collections.get("Ebright")
print(test_col)

<weaviate.Collection config={
  "name": "Ebright",
  "description": null,
  "generative_config": null,
  "inverted_index_config": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanup_interval_seconds": 60,
    "index_null_state": false,
    "index_property_length": false,
    "index_timestamps": false,
    "stopwords": {
      "preset": "en",
      "additions": null,
      "removals": null
    }
  },
  "multi_tenancy_config": {
    "enabled": false,
    "auto_tenant_creation": false,
    "auto_tenant_activation": false
  },
  "properties": [
    {
      "name": "chunk",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": false,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "chunk_id",
      "description": null,
      "data_type": "int",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": false,
      "nested_properties": null,
      "tokenization": null,
      "vectorizer_config": {
        "skip": false,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "chunk_unique_id",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": false,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "abs_url",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "file_name",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "server_relative_url",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "topic",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "time_created",
      "description": null,
      "data_type": "date",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": false,
      "nested_properties": null,
      "tokenization": null,
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "time_lastmodified",
      "description": null,
      "data_type": "date",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": false,
      "nested_properties": null,
      "tokenization": null,
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    },
    {
      "name": "unique_id",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": {
        "skip": true,
        "vectorize_property_name": false
      },
      "vectorizer": "text2vec-transformers"
    }
  ],
  "references": [],
  "replication_config": {
    "factor": 1,
    "async_enabled": false,
    "deletion_strategy": "NoAutomatedResolution"
  },
  "reranker_config": null,
  "sharding_config": {
    "virtual_per_physical": 128,
    "desired_count": 3,
    "actual_count": 3,
    "desired_virtual_count": 384,
    "actual_virtual_count": 384,
    "key": "_id",
    "strategy": "hash",
    "function": "murmur3"
  },
  "vector_index_config": {
    "quantizer": null,
    "cleanup_interval_seconds": 300,
    "distance_metric": "cosine",
    "dynamic_ef_min": 100,
    "dynamic_ef_max": 500,
    "dynamic_ef_factor": 8,
    "ef": -1,
    "ef_construction": 128,
    "filter_strategy": "sweeping",
    "flat_search_cutoff": 40000,
    "max_connections": 32,
    "skip": false,
    "vector_cache_max_objects": 1000000000000
  },
  "vector_index_type": "hnsw",
  "vectorizer_config": {
    "vectorizer": "text2vec-transformers",
    "model": {
      "poolingStrategy": "masked_mean"
    },
    "vectorize_collection_name": false
  },
  "vectorizer": "text2vec-transformers",
  "vector_config": null
}>

Could you please help me diagnose this issue or provide suggestions on how to resolve the index update problem? Is there any specific configuration or tuning required for multi-node cluster setups in this context?

Thank you in advance!

hi @Ahmed_Mahmoud !!

Sorry for the delay here.

Were you able to solve this?

Can you insert data directly into Weaviate using one of our clients?

That would be interesting to try first.
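
Something like this minimal sketch with the Python v4 client, for example (the host/port are placeholders, and only a few of your properties are filled in):

import weaviate

# Placeholder connection details -- point these at one of your nodes.
client = weaviate.connect_to_local(host="10.2.0.13", port=8080, grpc_port=50051)

ebright = client.collections.get("Ebright")

# Insert a couple of test objects directly, bypassing Spark.
with ebright.batch.dynamic() as batch:
    batch.add_object(properties={
        "chunk": "test chunk",
        "chunk_id": 1,
        "chunk_unique_id": "test-1",
    })

# Per-object errors are collected here instead of being raised.
print(ebright.batch.failed_objects)

client.close()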

Also, can you check whether all 3 nodes are healthy via the nodes API?
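
For example, with the Python client (reusing the client connection from the snippet above), or by calling GET /v1/nodes directly:

# Verbose output includes per-shard details for each node.
for node in client.cluster.nodes(output="verbose"):
    print(node.name, node.status, node.version)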