EDIT: After writing this, I ran exactly the same code with the weaviate container configured to use text2vec-cohere instead of text2vec-transformers, and the problem did not appear. I therefore suspect the problem is with the latter.
I am trying to insert objects in a Weaviate collection with an external vectorization module, both run in local containers.
I am confused since there’s a discrepancy between the number of objects I’m trying to insert and the ones that are actually inserted (none), but the collection.batch.failed_objects and failed_references are both empty.
This is one of the weaviate container log lines:
weaviate-1 | {"action":"restapi_request","level":"debug","method":"POST","msg":"received HTTP request","time":"2024-04-08T09:12:30Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}
Nothing to be seen from the model container either (BTW, how can I add a timestamp here?):
t2v-e5-mistral-1 | INFO: 127.0.0.1:59582 - "GET /docs HTTP/1.1" 200 OK
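On the timestamp aside: the log line above looks like the default uvicorn access format, so one option is a logging configuration with `%(asctime)s` in the formatter. The sketch below assumes the model container really runs uvicorn and that its logger names are `uvicorn.error` and `uvicorn.access`; the formatter name `timestamped` is made up for illustration.

```python
import logging
import logging.config

# Hypothetical logging config; assumes the inference server logs through
# the "uvicorn.error" and "uvicorn.access" loggers.
LOG_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "timestamped": {
            "format": "%(asctime)s %(levelname)s: %(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "timestamped",
            "stream": "ext://sys.stderr",
        }
    },
    "loggers": {
        "uvicorn.error": {"handlers": ["console"], "level": "INFO", "propagate": False},
        "uvicorn.access": {"handlers": ["console"], "level": "INFO", "propagate": False},
    },
}

# Apply the configuration and emit one line in the same shape as the log above.
logging.config.dictConfig(LOG_CONFIG)
logging.getLogger("uvicorn.access").info('127.0.0.1:59582 - "GET /docs HTTP/1.1" 200 OK')
```

If the server is started from Python, the same dictionary could probably be passed as `log_config` to `uvicorn.run`. Without touching the container at all, `docker compose logs --timestamps` prefixes every line with the time Docker received it.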
Here is the complete code:
# pylint: disable=C0301,W0703,W0718,W0719
""" fill the Articles collection schema """
import sys
from datetime import date, timedelta

from dotenv import load_dotenv
from weaviate.client import WeaviateClient
from weaviate.exceptions import WeaviateBatchError, WeaviateBaseError

from manivectors.memalib.mema_utils import get_env_variable
from manivectors.memalib.mema_wv import wv_connect
from manivectors.memalib.mema_graph import get_articles
from manivectors.memalib.logconfig import APPLOG


def main() -> None:
    """
    Connects to a Weaviate instance and manages a collection of articles.
    This script loads environment variables, connects to a Weaviate instance,
    and either creates or updates a collection named `Articles` (or another name
    specified in the environment variables) with a predefined schema. It then
    retrieves and prints the schema of the created or updated collection.
    """
    # prepare the environment
    load_dotenv()
    wv_host: str = get_env_variable("WV_HOST", default_value="weaviate")
    wv_port: int = int(get_env_variable("WV_PORT", default_value=8080))
    wv_grpcport: int = int(get_env_variable("WV_GRPCPORT", default_value=50051))
    # wv_ratelimit: int = int(get_env_variable("WV_RATELIMIT", default_value=5000))
    wv_artcollname: str = get_env_variable("WV_ARTCOLL", default_value="Articles")
    wv_artfrom: str = get_env_variable("WP_ARTFROM", default_value="2024-01-01")
    wv_artto: str = get_env_variable(
        "WP_ARTTO",
        default_value=(date.today() + timedelta(days=1)).strftime("%Y-%m-%d"),
    )  # tomorrow's ISO date string

    # prepare the data to be inserted
    data_rows = []
    for article in get_articles(from_date=wv_artfrom, to_date=wv_artto):
        article_text = (
            article.label + " " + article.kicker + " " + article.excerpt
        ).rstrip()
        if article_text.endswith("[…]"):
            article_text = article_text[:-3]
        article_data = {
            "articleextract": article_text.rstrip(),
            "kg_id": article.id,
        }
        data_rows.append(article_data)
    APPLOG.info("We have %d objects we need to insert in Weaviate.", len(data_rows))

    # connect to weaviate or die
    try:
        wv_client = wv_connect(
            wv_host=wv_host, wv_port=wv_port, wv_grpcport=wv_grpcport
        )
        assert isinstance(wv_client, WeaviateClient)
    except AssertionError as error:
        APPLOG.error(error)
        sys.exit(
            "Program ended abnormally because of weaviate client not initialized properly."
        )
    except Exception as e:
        APPLOG.error("Failed to connect to Weaviate: %s", e)
        return

    # batch insert the data
    try:
        with wv_client.batch.dynamic() as batch:
            for data_row in data_rows:
                batch.add_object(collection=wv_artcollname, properties=data_row)
    except WeaviateBatchError as error:
        APPLOG.error("Weaviate batch error: %s", error)
    except WeaviateBaseError as error:
        APPLOG.error("Weaviate error: %s", error.message)

    # now check the insertion outcome
    wv_coll = wv_client.collections.get(wv_artcollname)
    response = wv_coll.aggregate.over_all(total_count=True)
    APPLOG.info(
        "We now have %d articles in the %s collection",
        response.total_count,
        wv_artcollname,
    )
    APPLOG.info("Failed Objects %d", len(wv_coll.batch.failed_objects))
    APPLOG.info("Failed References %d", len(wv_coll.batch.failed_references))

    # close the connection with weaviate
    wv_client.close()


if __name__ == "__main__":
    main()
and here is an output of the logger at INFO level:
2024-04-08 12:28:40 - wvschema_fill - main - INFO - We have 26 objects we need to insert in Weaviate.
2024-04-08 12:28:40 - mema_wv - wv_connect - DEBUG - Attempting a connect_to_local to HOST localhost, PORT 8080 and GRPC 50051
2024-04-08 12:28:41 - mema_wv - wv_connect - INFO - Successfully connected to Weaviate at localhost:8080
2024-04-08 12:30:21 - wvschema_fill - main - INFO - We now have 0 articles in the Articles collection
2024-04-08 12:30:21 - wvschema_fill - main - INFO - Failed Objects 0
2024-04-08 12:30:21 - wvschema_fill - main - INFO - Failed References 0
As you can see, the source data contains 26 objects while zero were inserted. On visual inspection the data looks valid throughout all 26 objects. Could there be “strange characters” that make the insert fail?
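To rule out the “strange characters” theory instead of relying on visual inspection, the text of each row could be scanned for invisible or control characters. This is only a sketch using the standard library; the function name `find_suspicious_chars` and the choice of Unicode categories to flag are my own assumptions, not anything Weaviate prescribes.

```python
import unicodedata

# Unicode general categories flagged here: control, format, private use,
# surrogate and unassigned. This selection is an assumption for illustration.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cs", "Cn"}


def find_suspicious_chars(text):
    """Return (index, code point, Unicode name) for each flagged character."""
    suspicious = []
    for index, char in enumerate(text):
        if char in "\n\t":
            continue  # ordinary whitespace controls are not reported
        if unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
            name = unicodedata.name(char, "<unnamed>")
            suspicious.append((index, f"U+{ord(char):04X}", name))
    return suspicious


def report_rows(rows, text_key="articleextract", id_key="kg_id"):
    """Map each row id to its flagged characters, skipping clean rows."""
    findings = {}
    for row in rows:
        hits = find_suspicious_chars(row[text_key])
        if hits:
            findings[row[id_key]] = hits
    return findings
```

Calling `report_rows(data_rows)` right before the batch insert would list any row whose `articleextract` contains such characters. An empty result would make the character theory less likely, though it would not prove the text is acceptable to the vectorizer.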
This is the collection creation:
client.collections.create(
    wv_artcollname,
    description="A collection of articles data",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_transformers(),
    vector_index_config=wvcc.Configure.VectorIndex.hnsw(
        distance_metric=wvcc.VectorDistances.COSINE
    ),
    properties=[
        wvcc.Property(name="articleextract", data_type=wvcc.DataType.TEXT),
        wvcc.Property(
            name="kg_id",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),
    ],
)
and here is the result of querying the collection.config.get():
2024-04-08 12:30:21 - wvschema_fill - main - DEBUG - _CollectionConfig(name='Articles', description='A collection of articles data', generative_config=None, inverted_index_config=_InvertedIndexConfig(bm25=_BM25Config(b=0.75, k1=1.2), cleanup_interval_seconds=60, index_null_state=False, index_property_length=False, index_timestamps=False, stopwords=_StopwordsConfig(preset=<StopwordsPreset.EN: 'en'>, additions=None, removals=None)), multi_tenancy_config=_MultiTenancyConfig(enabled=False), properties=[_Property(name='articleextract', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=False, vectorize_property_name=True), vectorizer='text2vec-transformers'), _Property(name='kg_id', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=True, vectorize_property_name=True), vectorizer='text2vec-transformers')], references=[], replication_config=_ReplicationConfig(factor=1), reranker_config=None, sharding_config=_ShardingConfig(virtual_per_physical=128, desired_count=1, actual_count=1, desired_virtual_count=128, actual_virtual_count=128, key='_id', strategy='hash', function='murmur3'), vector_index_config=_VectorIndexConfigHNSW(quantizer=None, cleanup_interval_seconds=300, distance_metric=<VectorDistances.COSINE: 'cosine'>, dynamic_ef_min=100, dynamic_ef_max=500, dynamic_ef_factor=8, ef=-1, ef_construction=128, flat_search_cutoff=40000, max_connections=64, skip=False, vector_cache_max_objects=1000000000000), vector_index_type=<VectorIndexType.HNSW: 'hnsw'>, vectorizer_config=_VectorizerConfig(vectorizer=<Vectorizers.TEXT2VEC_TRANSFORMERS: 'text2vec-transformers'>, model={'poolingStrategy': 'masked_mean'}, vectorize_collection_name=True), 
vectorizer=<Vectorizers.TEXT2VEC_TRANSFORMERS: 'text2vec-transformers'>, vector_config=None)
Any ideas/practices on how to better debug such a situation?
The Weaviate Python client is 4.5.4 and the server is 1.24.6.
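One thing worth checking: the insert goes through the client-level batch (`wv_client.batch.dynamic()`), while the code reads the failure lists from the collection-level `wv_coll.batch`. As far as I understand the v4 client, failures from a client-level batch should show up on `wv_client.batch.failed_objects` once the `with` block has exited, so the empty lists may simply be the wrong lists. Below is a small sketch to print whatever they contain; the attribute names `message` and `object_` are my assumption about the client's error objects, so they are read defensively.

```python
from types import SimpleNamespace


def describe_failures(failed_objects):
    """Return one readable line per failed batch object."""
    lines = []
    for failure in failed_objects:
        # Assumed attribute names; fall back gracefully if they differ.
        message = getattr(failure, "message", repr(failure))
        obj = getattr(failure, "object_", None)
        properties = getattr(obj, "properties", None)
        lines.append(f"{message} | properties={properties}")
    return lines


# Stand-in for a real failure entry, used only to show the output shape.
example = SimpleNamespace(
    message="vectorizer returned an error",
    object_=SimpleNamespace(properties={"kg_id": "42"}),
)
for line in describe_failures([example]):
    print(line)
```

In the script this would be called as `describe_failures(wv_client.batch.failed_objects)` after the batch context manager closes and before `wv_client.close()`.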
As stated above this problem DOES NOT appear when using text2vec-cohere. This is the new docker-compose section for the weaviate container:
weaviate:
  image: cr.weaviate.io/semitechnologies/weaviate:1.24.6
  command:
    - "--host=0.0.0.0"
    - "--port=8080"
    - "--scheme=http"
  ports:
    - "8080:8080"
    - "50051:50051"
  volumes:
    - weaviate_data:/var/lib/weaviate
  restart: unless-stopped
  environment:
    LOG_LEVEL: debug
    ENABLE_CUDA: 0
    LIMIT_RESOURCES: true
    QUERY_DEFAULTS_LIMIT: 25
    AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: true
    PERSISTENCE_DATA_PATH: /var/lib/weaviate
    CLUSTER_HOSTNAME: finland
    ENABLE_MODULES: text2vec-transformers, text2vec-cohere
    #DEFAULT_VECTORIZER_MODULE: text2vec-transformers
    DEFAULT_VECTORIZER_MODULE: text2vec-cohere
    COHERE_APIKEY: myverysecretkey
    TRANSFORMERS_INFERENCE_API: http://t2v-e5-mistral:8080
t2v-e5-mistral:
  build:
    context: /home/mema/llms/e5-mistral-7b-instruct
    dockerfile: Dockerfile
  image: e5-mistral-7b-instruct
  environment:
    ENABLE_CUDA: '0'
  ports:
    - "9090:8080"
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:8080/docs"]
    interval: 30s
    timeout: 10s
    retries: 2
    start_period: 10s
Of course, with this docker-compose setup the t2v-e5-mistral container is not used.
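Since the suspicion falls on text2vec-transformers, the model container could also be probed directly from the host through the published port 9090, bypassing Weaviate. The sketch below assumes the custom image exposes the same `POST /vectors` endpoint, with a `{"text": ...}` request body and a `vector` list in the response, as the stock transformers inference image. I have not verified that for this e5-mistral build, so treat the path and the field names as assumptions.

```python
import json
import urllib.request


def build_vectors_request(base_url, text):
    """Build a POST request for the assumed /vectors endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/vectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_vector_payload(payload):
    """Return a list of problems found in the response; empty means it looks fine."""
    problems = []
    vector = payload.get("vector") if isinstance(payload, dict) else None
    if not isinstance(vector, list) or not vector:
        problems.append("missing or empty 'vector'")
    elif not all(isinstance(value, (int, float)) for value in vector):
        problems.append("'vector' contains non-numeric values")
    return problems


def probe(base_url, text="hello", timeout=120):
    """Send one text to the container and validate the reply (performs a network call)."""
    request = build_vectors_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return check_vector_payload(payload)
```

Running `probe("http://localhost:9090")` from the host would show whether the container returns a vector at all and roughly how long a single request takes. The generous default timeout is deliberate: with `ENABLE_CUDA: '0'` the model runs on CPU, which I would expect to be slow for a 7B model.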