WCS DEPLOYMENT of text2vec-transformer

Ricky_D · February 7, 2024, 4:27pm

Hi,
I am very new to Weaviate. I have created a schema in my local Docker instance and imported the data from a CSV. I vectorized it with the SBERT paraphrase-multilingual-mpnet-base-v2 model, because I am building a multilingual application.

When I wanted to deploy this in WCS, I found that text2vec-transformers is not available as a module. So I used text2vec-huggingface with the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model. Unfortunately, I cannot import the data.

Is there any other way or any other model that I should use?

Please help.

Thanks

DudaNogueira · February 7, 2024, 6:39pm

hi @Ricky_D !

Welcome to our Community

As you mentioned, we do not provide all available modules in WCS.

So using text2vec-huggingface is a way to use that same model.

Why were you not able to import the data? Were there any errors or logs?

Thanks!

Ricky_D · February 8, 2024, 9:09am

Hi @DudaNogueira
I have tried this code:

import weaviate
from weaviate import Config
import weaviate.classes as wvc
import weaviate.exceptions
import os
import pandas as pd
import numpy as np
import json
import datetime
from datetime import datetime

# Starting up the weaviate client

auth_config = weaviate.auth.AuthApiKey(api_key="API")
client = weaviate.Client(
    url="https://someweaviate.network",
    auth_client_secret=auth_config,
    additional_headers={
        "X-HuggingFace-Api-Key": "API_hf"
    }
)
client.is_ready()

# Deleting any previously existing "MachineFailures" class

print("delete previous")
client.schema.delete_class("MachineFailures")

# Creating a new class with the defined schema

# With "vectorizer": "text2vec-transformers", the default transformer model used was bert-base-uncased.
# If you need to use a specific transformer model, such as paraphrase-multilingual-mpnet-base-v2,
# and Weaviate's default model does not meet your requirements, you may need to vectorize your text data outside of Weaviate using the Sentence Transformers library,
# as I described in the previous answer, and then store the resulting vectors in Weaviate manually.

# Created the properties with "text" so they work with both semantic and keyword search (hybrid search)

client.schema.create_class(
    {
        "class": "MachineFailures",
        "description": "A class to store machine failure records",
        "vectorIndexConfig": {
            "distance": "cosine"
        },
        "vectorIndexType": "hnsw",
        "vectorizer": "text2vec-huggingface",
        "moduleConfig": {
            "text2vec-huggingface": {
                "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
            }
        },
        "properties": [
            {
                "name": "description",
                "dataType": ["text"],
            },
            {
                "name": "wo_number",
                "dataType": ["number"],
            },
            {
                "name": "heading",
                "dataType": ["text"],
            },
            {
                "name": "fail_source",
                "dataType": ["text"],
            },
            {
                "name": "fail_cause",
                "dataType": ["text"],
            },
            {
                "name": "wo_closed_on",
                "dataType": ["date"],
            },
            {
                "name": "repairman",
                "dataType": ["text"],
            },
            {
                "name": "working_hours",
                "dataType": ["number"],
            },
            {
                "name": "machine_no",
                "dataType": ["text"],
            },
            {
                "name": "workdone_comments",
                "dataType": ["text"],
            },
        ],
    }
)

# Checking whether the class was created successfully
# (uses the v3 schema API to match weaviate.Client; the class name typo is fixed)

print("create new")
print(client.schema.exists("MachineFailures"))

# Importing the data using pandas

data = pd.read_csv('./data/MachineFailures.csv', index_col=0)

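# Getting the "MachineFailures" collection that was created earlier
# (commented out: the v3 weaviate.Client created above has no "collection" attribute, and the result is not used below)

# Machine_Failures_data = client.collection.get("MachineFailures")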

# Function to format dates in Weaviate ISO8601 format

def format_date_weaviate(date):
    # Convert the date to a pandas datetime object
    date_obj = pd.to_datetime(date, errors='coerce')  # 'coerce' will convert invalid dates to NaT
    # Check if the date is NaT (Not-a-Time)
    if pd.isna(date_obj):
        return None  # or return a default date string in the correct format, if applicable
    # Format the date to Weaviate's expected ISO8601 format
    return date_obj.strftime("%Y-%m-%dT%H:%M:%S+00:00")

# Iterating through the dataset and storing it all in a list to be inserted later

objects_to_add = [
    {
        "description": row["description"],
        "wo_number": row["wo_number"],
        "heading": row["heading"],
        "fail_source": row["fail_source"],
        "fail_cause": row["fail_cause"],
        "wo_closed_on": format_date_weaviate(row["wo_closed_on"]),
        "repairman": row["repairman"],
        "working_hours": row["working_hours"],
        "machine_no": row["machine_no"],
        "workdone_comments": row["workdone_comments"],
    }
    for index, row in data.iterrows()
]

# Define a function to replace non-compliant float values

def replace_non_compliant_values(value):
    if isinstance(value, float) and (np.isnan(value) or np.isinf(value)):
        return None  # Replace with None or an appropriate placeholder
    return value

# Inserting the data into the class

for obj in objects_to_add:
    # Replace non-compliant float values in the object
    sanitized_obj = {k: replace_non_compliant_values(v) for k, v in obj.items()}
    client.data_object.create(sanitized_obj, "MachineFailures")

# Fetching any 5 objects from the class and printing the response

query_string = """
{
  Get {
    MachineFailures(limit: 5) {
      description
      wo_number
      machine_no
      heading
      repairman
    }
  }
}
"""
response = client.query.raw(query_string)
print("Output")
print(response)

But I received this error:
weaviate.exceptions.UnexpectedStatusCodeException: Creating object! Unexpected status code: 500, with response body: {'error': [{'message': 'update vector: failed with status: 503 error: Model sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is currently loading estimated time: 44.490128'}]}.

In my local Docker setup, using the text2vec-transformers module, I was able to import the same data successfully.
What could be the problem? Thanks in advance.
//Ricky
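
The 503 inside the error above is reported by the Hugging Face side and says the model is still loading, which is usually a transient state. A minimal sketch of one way to cope with it on the client, assuming it is acceptable to pause the import: retry the failing call after the delay hinted at in the message. The helper names `loading_wait_seconds` and `create_with_retry` are hypothetical and not part of the Weaviate client.

```python
import re
import time

# Matches the hint in the error body, e.g. "estimated time: 44.490128".
ESTIMATED_TIME = re.compile(r"estimated time: (\d+(?:\.\d+)?)")


def loading_wait_seconds(message, default=20.0):
    """Return the wait hinted at in a 'currently loading' error, else `default`."""
    match = ESTIMATED_TIME.search(message)
    return float(match.group(1)) if match else default


def create_with_retry(create, max_attempts=5, sleep=time.sleep):
    """Call `create()` and retry while the error says the model is still loading.

    `create` is any zero-argument callable, for example a lambda wrapping
    client.data_object.create(...). Any other error is re-raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return create()
        except Exception as err:
            message = str(err)
            # Give up on unrelated errors, or once the attempts are used up.
            if "currently loading" not in message or attempt == max_attempts:
                raise
            sleep(loading_wait_seconds(message))
```

In the import loop this could be used as `create_with_retry(lambda: client.data_object.create(sanitized_obj, "MachineFailures"))`.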

DudaNogueira · February 8, 2024, 1:32pm

Hi!

This is most likely an issue with the inference model, which is causing the HTTP 500 response. My bet is that the inference model is not yet ready for vectorization, or failed to start.
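
If the model is only still loading, one setting that may help is the `waitForModel` option of the text2vec-huggingface module, which asks the Hugging Face endpoint to hold the request until the model is ready rather than answer with a 503. The sketch below assumes the module accepts an `options` block in the class definition, as described in the Weaviate module documentation; it has not been checked against this particular WCS instance.

```python
# Same vectorizer settings as the class in the post above, plus the wait option.
module_config = {
    "text2vec-huggingface": {
        "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        "options": {
            # Ask the Hugging Face endpoint to wait for the model to load
            # instead of returning "currently loading" straight away.
            "waitForModel": True,
        },
    }
}

class_definition = {
    "class": "MachineFailures",
    "vectorizer": "text2vec-huggingface",
    "moduleConfig": module_config,
    # "properties": [...] unchanged from the original class definition
}
```

The dictionary would be passed to `client.schema.create_class(class_definition)` in place of the original one, with the `properties` list added back.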

Can you try vectorizing something directly in that vectorizer to make sure it is working properly?
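
A rough sketch of such a direct check, using only the standard library. The endpoint URL pattern and the `wait_for_model` option are assumptions based on the hosted Hugging Face Inference API as it was around the time of this thread and may have changed since; `build_embedding_request` and `embed` are hypothetical helper names.

```python
import json
import urllib.request

MODEL = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
# Assumed URL pattern for the hosted feature-extraction endpoint.
BASE_URL = "https://api-inference.huggingface.co/pipeline/feature-extraction/"


def build_embedding_request(texts, api_key, model=MODEL):
    """Build (but do not send) a POST request asking for embeddings of `texts`."""
    payload = {"inputs": texts, "options": {"wait_for_model": True}}
    return urllib.request.Request(
        BASE_URL + model,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(texts, api_key):
    """Send the request and return the decoded JSON response body."""
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())
```

Calling `embed(["machine failure"], "API_hf")` with a real key should either return embeddings or surface the same "currently loading" message, which would confirm that the problem sits with the hosted model rather than with Weaviate.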

Ricky_D · February 8, 2024, 2:05pm

Hi,
My local docker image is:

version: '3.4'
image: semitechnologies/weaviate:1.23.2
DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
ENABLE_MODULES: 'text2vec-transformers'
CLUSTER_HOSTNAME: 'node1'

t2v-transformers:
image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-mpnet-base-v2

and the import code is below; it vectorizes with the transformers-inference:sentence-transformers-paraphrase-multilingual-mpnet-base-v2 model.

client.schema.create_class(
    {
        "class": "MachineFailures",
        "description": "A class to store machine failure records",
        "vectorIndexConfig": {
            "distance": "cosine"
        },
        "vectorIndexType": "hnsw",
        "vectorizer": "text2vec-transformers",
        "properties": [

I don't know if that is exactly what you meant.
Thanks

Ricky_D · February 8, 2024, 3:14pm

Hi
With Cohere, I managed to import and run my app.
Thanks for the help.
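
For readers who want to try the same route: the thread does not show the Cohere configuration that was used, so the following is only a sketch of what the class definition might look like. The model name `embed-multilingual-v3.0` and the placeholder key `API_cohere` are assumptions, chosen because the application is multilingual.

```python
# Hypothetical Cohere variant of the MachineFailures class definition.
cohere_class = {
    "class": "MachineFailures",
    "description": "A class to store machine failure records",
    "vectorizer": "text2vec-cohere",
    "moduleConfig": {
        "text2vec-cohere": {
            # Assumed model; any multilingual Cohere embedding model would do.
            "model": "embed-multilingual-v3.0",
        }
    },
    # "properties": [...] same list as in the original class definition
}

# Header to pass as additional_headers when creating the client (placeholder key).
additional_headers = {"X-Cohere-Api-Key": "API_cohere"}
```

The rest of the import script would stay as it was, with `additional_headers` replacing the Hugging Face header in `weaviate.Client(...)`.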
