Debugging an insertion failure

EDIT: After writing this I tried exactly the same code with the Weaviate container configured to use text2vec-cohere instead of text2vec-transformers, and the problem does not appear. I therefore suspect the problem is with the latter.

I am trying to insert objects in a Weaviate collection with an external vectorization module, both run in local containers.

I am confused since there’s a discrepancy between the number of objects I’m trying to insert and the ones that are actually inserted (none), but the collection.batch.failed_objects and failed_references are both empty.

This is one of the weaviate container log lines:
weaviate-1 | {"action":"restapi_request","level":"debug","method":"POST","msg":"received HTTP request","time":"2024-04-08T09:12:30Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}
Nothing to be seen from the model container either (BTW, how can I add a timestamp here?):
t2v-e5-mistral-1 | INFO: 127.0.0.1:59582 - "GET /docs HTTP/1.1" 200 OK
Here is the complete code:

# pylint: disable=C0301,W0703,W0718,W0719
""" fill the Articles collection schema """
import sys
from datetime import date, timedelta
from dotenv import load_dotenv
from weaviate.client import WeaviateClient
from weaviate.exceptions import WeaviateBatchError, WeaviateBaseError
from manivectors.memalib.mema_utils import get_env_variable
from manivectors.memalib.mema_wv import wv_connect
from manivectors.memalib.mema_graph import get_articles
from manivectors.memalib.logconfig import APPLOG


def main() -> None:
    """
    Connects to a Weaviate instance and fills a collection of articles.

    This script loads environment variables, fetches the articles in the
    configured date range, batch inserts them into a collection named
    `Articles` (or another name specified in the environment variables), and
    then logs the resulting object count and the batch failure counters.
    """
    # prepare the environment
    load_dotenv()
    wv_host: str = get_env_variable("WV_HOST", default_value="weaviate")
    wv_port: int = int(get_env_variable("WV_PORT", default_value=8080))
    wv_grpcport: int = int(get_env_variable("WV_GRPCPORT", default_value=50051))
    # wv_ratelimit: int = int(get_env_variable("WV_RATELIMIT", default_value=5000))
    wv_artcollname: str = get_env_variable("WV_ARTCOLL", default_value="Articles")
    wv_artfrom: str = get_env_variable("WP_ARTFROM", default_value="2024-01-01")
    wv_artto: str = get_env_variable(
        "WP_ARTTO",
        default_value=(date.today() + timedelta(days=1)).strftime("%Y-%m-%d"),
    )  # tomorrow's ISO date string

    # prepare the data to be inserted
    data_rows = []
    for article in get_articles(from_date=wv_artfrom, to_date=wv_artto):
        article_text = (
            article.label + " " + article.kicker + " " + article.excerpt
        ).rstrip()
        if article_text.endswith("[…]"):
            article_text = article_text[:-3]
        article_data = {
            "articleextract": article_text.rstrip(),
            "kg_id": article.id,
        }
        data_rows.append(article_data)
    APPLOG.info("We have %d objects we need to insert in Weaviate.", len(data_rows))

    # connect to weaviate or die
    try:
        wv_client = wv_connect(
            wv_host=wv_host, wv_port=wv_port, wv_grpcport=wv_grpcport
        )
        assert isinstance(wv_client, WeaviateClient)
    except AssertionError as error:
        APPLOG.error(error)
        sys.exit(
            "Program ended abnormally because the weaviate client was not initialized properly."
        )
    except Exception as e:
        APPLOG.error("Failed to connect to Weaviate: %s", e)
        return

    # batch insert the data
    try:
        with wv_client.batch.dynamic() as batch:
            for data_row in data_rows:
                batch.add_object(collection=wv_artcollname, properties=data_row)
    except WeaviateBatchError as error:
        APPLOG.error("Weaviate batch error: %s", error)
    except WeaviateBaseError as error:
        APPLOG.error("Weaviate error: %s", error.message)

    # now check the insertion outcome
    wv_coll = wv_client.collections.get(wv_artcollname)
    response = wv_coll.aggregate.over_all(total_count=True)
    APPLOG.info(
        "We now have %d articles in the %s collection",
        response.total_count,
        wv_artcollname,
    )
    APPLOG.info("Failed Objects %d", len(wv_coll.batch.failed_objects))
    APPLOG.info("Failed References %d", len(wv_coll.batch.failed_references))

    # close the connection with weaviate
    wv_client.close()


if __name__ == "__main__":
    main()

and here is an output of the logger at INFO level:

2024-04-08 12:28:40 - wvschema_fill - main - INFO - We have 26 objects we need to insert in Weaviate.
2024-04-08 12:28:40 - mema_wv - wv_connect - DEBUG - Attempting a connect_to_local to HOST localhost, PORT 8080 and GRPC 50051
2024-04-08 12:28:41 - mema_wv - wv_connect - INFO - Successfully connected to Weaviate at localhost:8080
2024-04-08 12:30:21 - wvschema_fill - main - INFO - We now have 0 articles in the Articles collection
2024-04-08 12:30:21 - wvschema_fill - main - INFO - Failed Objects 0
2024-04-08 12:30:21 - wvschema_fill - main - INFO - Failed References 0

where you can see that the source data contains 26 objects while zero objects were inserted. The data visually looks valid throughout the 26 objects. Could there be “strange characters” that make the insert fail?
This is the collection creation:

client.collections.create(
                wv_artcollname,
                description="A collection of articles data",
                vectorizer_config=wvcc.Configure.Vectorizer.text2vec_transformers(),
                vector_index_config=wvcc.Configure.VectorIndex.hnsw(
                    distance_metric=wvcc.VectorDistances.COSINE
                ),
                properties=[
                    wvcc.Property(name="articleextract", data_type=wvcc.DataType.TEXT),
                    wvcc.Property(
                        name="kg_id",
                        data_type=wvcc.DataType.TEXT,
                        skip_vectorization=True,
                    ),
                ],
            )

and here is the result of querying the collection.config.get():
2024-04-08 12:30:21 - wvschema_fill - main - DEBUG - _CollectionConfig(name='Articles', description='A collection of articles data', generative_config=None, inverted_index_config=_InvertedIndexConfig(bm25=_BM25Config(b=0.75, k1=1.2), cleanup_interval_seconds=60, index_null_state=False, index_property_length=False, index_timestamps=False, stopwords=_StopwordsConfig(preset=<StopwordsPreset.EN: 'en'>, additions=None, removals=None)), multi_tenancy_config=_MultiTenancyConfig(enabled=False), properties=[_Property(name='articleextract', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=False, vectorize_property_name=True), vectorizer='text2vec-transformers'), _Property(name='kg_id', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=True, vectorize_property_name=True), vectorizer='text2vec-transformers')], references=[], replication_config=_ReplicationConfig(factor=1), reranker_config=None, sharding_config=_ShardingConfig(virtual_per_physical=128, desired_count=1, actual_count=1, desired_virtual_count=128, actual_virtual_count=128, key='_id', strategy='hash', function='murmur3'), vector_index_config=_VectorIndexConfigHNSW(quantizer=None, cleanup_interval_seconds=300, distance_metric=<VectorDistances.COSINE: 'cosine'>, dynamic_ef_min=100, dynamic_ef_max=500, dynamic_ef_factor=8, ef=-1, ef_construction=128, flat_search_cutoff=40000, max_connections=64, skip=False, vector_cache_max_objects=1000000000000), vector_index_type=<VectorIndexType.HNSW: 'hnsw'>, vectorizer_config=_VectorizerConfig(vectorizer=<Vectorizers.TEXT2VEC_TRANSFORMERS: 'text2vec-transformers'>, model={'poolingStrategy': 'masked_mean'}, vectorize_collection_name=True), 
vectorizer=<Vectorizers.TEXT2VEC_TRANSFORMERS: 'text2vec-transformers'>, vector_config=None)
Any ideas/practices on how to better debug such a situation?
The Weaviate Python client is 4.5.4 and the server is 1.24.6.

As stated above this problem DOES NOT appear when using text2vec-cohere. This is the new docker-compose section for the weaviate container:

  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.24.6
    command:
      - "--host=0.0.0.0"
      - "--port=8080"
      - "--scheme=http"
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: unless-stopped
    environment:
      LOG_LEVEL: debug 
      ENABLE_CUDA: 0
      LIMIT_RESOURCES: true
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: true
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      CLUSTER_HOSTNAME: finland
      ENABLE_MODULES: text2vec-transformers, text2vec-cohere 
      #DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      DEFAULT_VECTORIZER_MODULE: text2vec-cohere
      COHERE_APIKEY: myverysecretkey
      TRANSFORMERS_INFERENCE_API: http://t2v-e5-mistral:8080

  t2v-e5-mistral:
    build:
      context: /home/mema/llms/e5-mistral-7b-instruct  
      dockerfile: Dockerfile
    image: e5-mistral-7b-instruct 
    environment:
      ENABLE_CUDA: '0'
    ports:
      - "9090:8080"
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080/docs"]
      interval: 30s
      timeout: 10s
      retries: 2
      start_period: 10s

Of course, with this docker-compose setup the t2v-e5-mistral container will not be used.

Ciao @rjalex! How are you today? 🙂

Does this only happen when using batch insert? When you insert a single object, or use insert_many, does it work as expected?
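A minimal sketch of that test, assuming the v4 Python client, where a collection object exposes `data.insert(properties=...)`. The helper name `insert_one_by_one` is made up for illustration. Inserting one object at a time raises on each failing object, so a vectorizer problem should surface as an exception instead of disappearing inside the batch:

```python
from typing import Any


def insert_one_by_one(collection: Any, rows: list[dict]) -> list[tuple[int, str]]:
    """Insert rows individually and return (index, error text) for each failure.

    `collection` is assumed to be a v4 client collection, for example the
    result of client.collections.get("Articles"). Only data.insert() is used.
    """
    errors: list[tuple[int, str]] = []
    for index, row in enumerate(rows):
        try:
            collection.data.insert(properties=row)
        except Exception as error:  # broad on purpose: this is a diagnostic helper
            errors.append((index, str(error)))
    return errors
```

With the script above this could be called as `insert_one_by_one(wv_client.collections.get(wv_artcollname), data_rows)`; an empty result means every object was accepted.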

This is a weird issue.

Also, I do agree that text2vec-transformers could have better logging. For example, once I wanted to inspect the payload Weaviate sends to the vectorizer, and needed to add an ugly print right into the vectors endpoint hehehe

It would be nice to have a DEBUG env var that enables this, along with the timestamp.
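On the timestamp question: the access-log line in the original post looks like uvicorn output, so, assuming the inference container is served by uvicorn (an assumption, nothing in this thread confirms it), a standard `logging` format string containing `%(asctime)s` might be enough. The sketch below uses only the standard library to show the format; the logger name `t2v-demo` is made up.

```python
import io
import logging

# Format string with a timestamp in front of the level and the message.
TIMESTAMPED_FORMAT = "%(asctime)s %(levelname)s: %(message)s"


def make_timestamped_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes timestamped lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(TIMESTAMPED_FORMAT, datefmt="%Y-%m-%d %H:%M:%S")
    )
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_timestamped_logger("t2v-demo", buffer)
log.info('127.0.0.1:59582 - "GET /docs HTTP/1.1" 200 OK')
print(buffer.getvalue(), end="")
```

If the container does run uvicorn, the same format string could probably go into a logging config file passed through uvicorn's `--log-config` option, but I have not checked that against the transformers inference image.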

Thanks!

Hey @DudaNogueira, thanks for chiming in. I hope you’re doing great. This morning I will try your suggestion of inserting without batch and then report the results.

I have a question right away though: why are the failed objects and references zero when no insert has actually happened?
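One possible explanation, not confirmed anywhere in this thread: the script batches through `wv_client.batch.dynamic()` but then reads `wv_coll.batch.failed_objects`. Assuming the v4 Python client records failures on the batch wrapper that actually ran the import, the client-level lists would be the ones to check. A minimal sketch; the helper name `summarize_batch_failures` is made up, and it relies only on the `failed_objects` and `failed_references` attributes:

```python
from typing import Any


def summarize_batch_failures(batch_owner: Any, max_shown: int = 5) -> dict:
    """Count the failures recorded on the object that ran the batch.

    `batch_owner` is assumed to be whatever exposed the batch context
    manager: the client for client.batch.dynamic(), or a collection for
    collection.batch.dynamic().
    """
    failed_objects = list(batch_owner.batch.failed_objects)
    failed_references = list(batch_owner.batch.failed_references)
    return {
        "failed_objects": len(failed_objects),
        "failed_references": len(failed_references),
        # A few entries as text, to see the error messages themselves.
        "sample": [str(item) for item in failed_objects[:max_shown]],
    }
```

In my script this would mean calling `summarize_batch_failures(wv_client)` right after the `with` block instead of looking at `wv_coll.batch`.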

Talk soon.

Ok @DudaNogueira I can confirm your hunch was correct. If I insert the objects one at a time, the text2vec-transformers container with my mistral-e5 model does vectorize and the same data is properly inserted along with its vectors.

An observation though. The process is VERY slow. I am inserting objects with just two text properties, the first being a string of around 400 chars and the second a URL of approx 100 chars (the latter only indexed not embedded). On a small test set of 26 objects it takes from 2 to 10 seconds per object !!!
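To put those numbers in perspective, here is the arithmetic using only values already quoted in this thread: 26 objects at 2 to 10 seconds each, plus the timestamps from the batch run logged in my first post (connection at 12:28:41, count check at 12:30:21).

```python
from datetime import datetime

OBJECT_COUNT = 26
SECONDS_PER_OBJECT = (2, 10)  # observed range for one-by-one inserts

total_low = OBJECT_COUNT * SECONDS_PER_OBJECT[0]
total_high = OBJECT_COUNT * SECONDS_PER_OBJECT[1]

# Timestamps copied from the logger output of the batch attempt.
fmt = "%Y-%m-%d %H:%M:%S"
batch_start = datetime.strptime("2024-04-08 12:28:41", fmt)
batch_end = datetime.strptime("2024-04-08 12:30:21", fmt)
batch_seconds = (batch_end - batch_start).total_seconds()

print(f"one-by-one insert: {total_low}-{total_high} s for {OBJECT_COUNT} objects")
print(f"batch attempt (nothing inserted): {batch_seconds:.0f} s")
```

So the one-by-one run takes roughly 52 to 260 seconds for the test set, while the batch attempt spent 100 seconds between connecting and counting zero objects.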

Will have to think about how to proceed …

Thanks
