[Question] Please help me debug. I have been looking for a solution for hours

I am using my own OpenAI embedding model and trying to create a collection like this, but it is not working:


import weaviate
import weaviate.classes.config as wc
import os
from weaviate.auth import AuthApiKey
from langchain_weaviate.vectorstores import WeaviateVectorStore

weaviate_key = "my api"

# Connect to a WCS instance
client = weaviate.connect_to_wcs(
    cluster_url="my url",
    auth_credentials=AuthApiKey(weaviate_key),
)

try:
    collection_rg = client.collections.create(
        [
            {
                "class": "Document",
                "description": "A collection for storing document entities",
                "vectorIndexType": "hnsw",
                "vectorizer": "text2vec-contextionary",
                "properties": [
                    {
                        "name": "title",
                        "description": "The title of the document",
                        "dataType": ["string"],
                        "indexFilterable": True,  # Changed 'true' to 'True'
                        "indexSearchable": True   # Changed 'true' to 'True'
                    },
                    {
                        "name": "description",
                        "description": "The description of the document",
                        "dataType": ["string"],
                        "indexFilterable": True,  # Changed 'true' to 'True'
                        "indexSearchable": True   # Changed 'true' to 'True'
                    }
                ],
                "invertedIndexConfig": {
                    "indexTimestamps": False,  # Changed 'false' to 'False'
                    "indexNullState": False,   # Changed 'false' to 'False'
                    "indexPropertyLength": False  # Changed 'false' to 'False'
                }
            },
            {
                "class": "Section",
                "description": "A collection for storing different sections of a document",
                "vectorIndexType": "hnsw",
                "vectorizer": "text2vec-contextionary",
                "properties": [
                    {
                        "name": "content",
                        "description": "The main content of the section",
                        "dataType": ["text"],
                        "indexFilterable": True,  # Changed 'true' to 'True'
                        "indexSearchable": True   # Changed 'true' to 'True'
                    },
                    {
                        "name": "contentVector",
                        "description": "Vector representation of the section content",
                        "dataType": ["vector(float[])"]
                    }
                ],
                "invertedIndexConfig": {
                    "indexTimestamps": False,  # Changed 'false' to 'False'
                    "indexNullState": False,   # Changed 'false' to 'False'
                    "indexPropertyLength": False  # Changed 'false' to 'False'
                }
            }
        ]
    )
finally:
    client.close()  # Ensure the connection is closed

Hi @Rohit !

Welcome to our community :hugs:

To create a collection from a dict definition, in your case you should use:

client.collections.create_from_dict({"class": "Document", .....})
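
Since `create_from_dict` takes one definition at a time, a list of definitions like the one in the question can be created in a loop. The following is a minimal sketch, not a definitive implementation: `create_all` is a hypothetical helper name, and the trimmed definitions assume `"vectorizer": "none"` and `text` properties (externally computed vectors are supplied when objects are inserted, not declared as a `contentVector` property).

```python
from typing import Any

# Trimmed versions of the two definitions from the question (assumed values).
DEFINITIONS: list[dict[str, Any]] = [
    {
        "class": "Document",
        "description": "A collection for storing document entities",
        "vectorizer": "none",  # assumption: vectors are computed outside Weaviate
        "properties": [
            {"name": "title", "dataType": ["text"]},
            {"name": "description", "dataType": ["text"]},
        ],
    },
    {
        "class": "Section",
        "description": "A collection for storing different sections of a document",
        "vectorizer": "none",  # assumption: vectors are computed outside Weaviate
        "properties": [
            {"name": "content", "dataType": ["text"]},
        ],
    },
]


def create_all(client: Any, definitions: list[dict[str, Any]]) -> list[str]:
    """Create one collection per dict definition and return the class names."""
    created = []
    for definition in definitions:
        # One call per definition; the method does not take a list.
        client.collections.create_from_dict(definition)
        created.append(definition["class"])
    return created
```

With a connected client this would be called as `create_all(client, DEFINITIONS)`.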

Let me know if this helps :slight_smile:

By the way, I see you are using LangChain.

I have just updated this recipe to use the new Python v4 client and the new integration:

@DudaNogueira - Thanks buddy, this helps. I was initially experimenting with LangChain, but their library does not fit my use case. By the way, I am building a use case in Weaviate and would highly appreciate it if you could help me out.
My use case:

  1. Defining Class Schemas

Class: ArticleLevel

Properties:
Title: Stores the title of the article.
Summary: Stores a brief summary of the article.
ArticleContentLink: Links to an instance of the ArticleContent class for detailed content.

Class: ArticleContent

Properties:
ArticleSplit: Contains the text of the article, split according to predefined criteria (e.g., by section or paragraph).
Embedding: Stores the embedding of the article's content, computed externally with OpenAI's large embedding model, since I can't see support for it internally in Weaviate.
ArticleLevelLink: Links back to the corresponding ArticleLevel instance for high-level details.

  2. Implementing Hybrid Search

Step A: Keyword Search

Operation: Perform a keyword-based search via a Weaviate hybrid query on the ArticleLevel instances, targeting either the Title or the Summary.

Step B: Vector Embedding Search

Operation: Once the relevant ArticleLevel instance is identified by the keyword search, retrieve the linked ArticleContent instances. Use the embedding stored in the Embedding property to perform a similarity (nearest-neighbor) search to find relevant content or similar articles.

  3. Integration with GPT-4 Turbo

Operation: After retrieving the necessary article content through the hybrid search, pass the content to the GPT-4 Turbo model for further analysis or generation tasks.
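
The flow in steps A and B can be illustrated without a database. The following is a minimal in-memory sketch, assuming toy 3-dimensional vectors in place of real embeddings; the names `ARTICLES`, `keyword_stage`, and `vector_stage` are hypothetical, and none of this is Weaviate API. Stage A matches a keyword against title and summary; stage B ranks only the content chunks linked to the matched articles by cosine similarity to a query vector.

```python
import math

# Toy data: each article-level entry links to its content chunks (assumed structure).
ARTICLES = [
    {
        "title": "Insurance basics",
        "summary": "How premiums and claims work",
        "content": [
            {"text": "Premiums are paid monthly", "embedding": [1.0, 0.0, 0.0]},
            {"text": "Claims need documentation", "embedding": [0.0, 1.0, 0.0]},
        ],
    },
    {
        "title": "Gardening",
        "summary": "Growing roses",
        "content": [
            {"text": "Roses need sun", "embedding": [0.0, 0.0, 1.0]},
        ],
    },
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_stage(articles, keyword):
    """Stage A: keep articles whose title or summary contains the keyword."""
    needle = keyword.lower()
    return [
        a for a in articles
        if needle in a["title"].lower() or needle in a["summary"].lower()
    ]


def vector_stage(articles, query_vector, limit=1):
    """Stage B: rank the content chunks linked to the matched articles."""
    chunks = [c for a in articles for c in a["content"]]
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(c["embedding"], query_vector),
        reverse=True,
    )
    return [c["text"] for c in ranked[:limit]]


matched = keyword_stage(ARTICLES, "insurance")
best = vector_stage(matched, [0.1, 0.9, 0.0])
print(best)
```

In Weaviate, the first stage would correspond to a keyword or hybrid query on ArticleLevel and the second to a vector query restricted to the linked ArticleContent objects.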

When you say you are using your own OpenAI embedding, do you mean you have your own embedding models?

If you are using OpenAI to generate your embeddings, that example can help you achieve what you are looking for.

The model I will be using is text-embedding-3-large from OpenAI.

Secondly, how would complex searching work with the example you gave for my use case? I recently moved from Pinecone to Weaviate and am finding it difficult to get a handle on the different functionalities available. Last night I worked on another piece of code:


#!pip install weaviate-client
import weaviate
import json
import os
from weaviate.auth import AuthApiKey
from weaviate.classes.query import QueryReference
from weaviate.classes.config import Configure, Property, DataType, ReferenceProperty

weaviate_key = "xx"
openai_api_key="xx"

# Connect to a WCS instance
viking = weaviate.connect_to_wcs(
    skip_init_checks=True,
    cluster_url="xx",  # Your Weaviate URL
    auth_credentials=AuthApiKey(weaviate_key),
    headers={
        "X-OpenAI-Api-Key": "xx"  # Replace with your inference API key
    }
)

viking.collections.delete("ArticleLevel")

viking.collections.delete("ArticleContent")

viking.collections.create(
    "ArticleContent",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-large",dimensions=3072),
    properties=[
        Property(name="content", data_type=DataType.TEXT)
    ]
)

viking.collections.create(
    "ArticleLevel",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-large", dimensions=3072),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="summary", data_type=DataType.TEXT),
    ],
    references=[
        ReferenceProperty(name="hasContent", target_collection="ArticleContent")
    ],
)



from weaviate.classes.config import ReferenceProperty

qwerty = viking.collections.get("ArticleLevel")

qwerty.data.insert({
    "title": "JK",
    "summary": "This is jk summary page"
})



qwerty1 = viking.collections.get("ArticleContent")

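# A dict literal cannot repeat the "content" key (only the last value is kept),
# so the two texts are inserted as separate objects.
qwerty1.data.insert_many([
    {"content": "Roses are red, violet is blue"},
    {"content": "Dogs are good"},
])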

# Assuming `viking` is your Weaviate client instance
article_level = viking.collections.get("ArticleLevel")

# Add a reference to the ArticleContent
article_level.data.reference_add(
    from_uuid='204fe6c1-6d03-4347-896d-8f0fbb62638a',  # UUID of the ArticleLevel object
    from_property="hasContent",  # The name of the reference property
    to='31cffce4-e7b3-49c8-9b4d-999b71f0c542'  # UUID of the ArticleContent object
)

print(article_level)


from weaviate.classes.query import MetadataQuery

reviews = viking.collections.get("ArticleContent")
response = reviews.query.near_text(
    query="nice",
    limit=1,
    return_metadata=MetadataQuery(distance=True)
)

print(response)

article_content = response.objects[0].properties['content']  # Access the 'content' of the first result
print(article_content)

reviews1 = viking.collections.get("ArticleLevel")
try:
    response1 = reviews1.query.near_text(
        query="good",
        limit=1
    )
    print(response1)
except Exception as e:
    print(f"An error occurred: {e}")


from weaviate.classes.query import QueryReference

response5 = reviews1.query.fetch_objects(
    return_references=[
        QueryReference(
            link_on="hasContent",
            return_properties=["content"]
        ),
    ],
    limit=2
)

print(response5)


from weaviate.classes.query import Filter, QueryReference

jeopardy = viking.collections.get("ArticleLevel")
response10 = jeopardy.query.fetch_objects(
    filters=Filter.by_ref(link_on="hasContent").by_property("content").like("*dog*"),
    return_references=QueryReference(link_on="hasContent", return_properties=["content"]),
    limit=3
)

print(response10)

… What I want is that when I do a keyword-based search on ArticleLevel, it should bring back content from the ArticleLevel title and summary and also from ArticleContent. Secondly, my intent is to create a hybrid query in which a set of objects in ArticleLevel is linked to specific objects in ArticleContent: first a keyword-based search and then, once a match is found in ArticleLevel, a vector-embedding cosine search over the linked content. I saw this in the docs and it seemed intuitive, but I don't know how to implement it:

TWO-STAGE QUERIES

Because cross-references do not affect vectors, you cannot use vector searches to filter objects based on properties of the target object.

However, you could use two separate queries to achieve a similar result. For example, you could perform a vector search to identify JeopardyCategory objects that are similar to a given vector, resulting in a list of JeopardyCategory objects. You could then use the unique title properties of these objects in a second query to filter the results as shown above. This will result in JeopardyQuestion objects that are cross-referenced to the JeopardyCategory objects identified in the first query.

Hi!

In the scenario you pointed out, you can search for the categories and get all the questions that are cross-referenced to each category.

However, the question is added to the category as a cross reference, and not the other way around.

Check this example:

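from weaviate.classes.config import Property, DataType, ReferenceProperty
from weaviate.classes.query import QueryReference
from weaviate.util import generate_uuid5

viking.collections.delete("JeopardyQuestion")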
question = viking.collections.create(
    name="JeopardyQuestion",
    properties=[
        Property(name="question", data_type=DataType.TEXT),
        Property(name="answer", data_type=DataType.TEXT),        
    ],
)

viking.collections.delete("JeopardyCategory")
category = viking.collections.create(
    name="JeopardyCategory",
    description="A Jeopardy! category",
    properties=[
        Property(name="name", data_type=DataType.TEXT)
    ],
    references=[
        ReferenceProperty(
            name="hasQuestion",
            target_collection="JeopardyQuestion"
        )
    ]
)

q1 = question.data.insert({"question": "q1", "answer": "a1"}, uuid=generate_uuid5("q1"))
q2 = question.data.insert({"question": "q2", "answer": "a2"}, uuid=generate_uuid5("q2"))
# we now add a category, and add two cross references to q1 and q2
c1 = category.data.insert(
    {"name": "c1"}, 
    references={
        "hasQuestion": [
            generate_uuid5("q1"),
            generate_uuid5("q2"),
        ]
    }
)

query = category.query.fetch_objects(
    return_references=QueryReference(link_on="hasQuestion", return_properties=["question", "answer"])
)
print(query.objects[0].properties)
print(query.objects[0].references.get("hasQuestion").objects[0].properties)
print(query.objects[0].references.get("hasQuestion").objects[1].properties)

This will be the output:

{'name': 'c1'}
{'answer': 'a1', 'question': 'q1'}
{'answer': 'a2', 'question': 'q2'}
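
The example works without looking up any IDs because the UUIDs are derived deterministically from a string, so the same input always gives the same UUID. A minimal sketch of that idea with the standard library follows; `make_uuid` is a hypothetical helper, and the namespace is an assumption that need not match what the client's `generate_uuid5` uses internally.

```python
from uuid import NAMESPACE_DNS, UUID, uuid5


def make_uuid(name: str) -> str:
    """Derive a stable version-5 UUID from a string identifier."""
    return str(uuid5(NAMESPACE_DNS, name))


q1_id = make_uuid("q1")
q2_id = make_uuid("q2")

# The same input always maps to the same UUID; different inputs differ.
print(q1_id == make_uuid("q1"))
print(q1_id != q2_id)
print(UUID(q1_id).version)
```

Because the ID can be recomputed from the original string, a cross reference can be written before or after the target object is fetched.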

Now, if you want each referenced vector to influence the meaning of the object that references it (in the example above, each question you add to a category would contribute to the category vector), you can check whether this module is what you are looking for:

We also have a more comprehensive blog post about it:

Let me know if this is what you are looking to accomplish.

Thanks!

I tried your approach, but it only works if I explicitly hard-code the strings in the code. Here is my code:

viking.collections.delete("PdfContent")
documentcontent = viking.collections.create(
    name="PdfContent",
    properties=[
        Property(name="pdfcontent", data_type=DataType.TEXT)
    ],
)

viking.collections.delete("PdfLevel")
documenttop = viking.collections.create(
    name="PdfLevel",
    description="This is Pdf Level Information",
    properties=[
        Property(name="pdftitle", data_type=DataType.TEXT)
    ],
    references=[
        ReferenceProperty(
            name="hasContent",
            target_collection="PdfContent"
        )
    ]
)

q1 = documentcontent.data.insert({"pdfcontent": "This content is related in pdfititle Insurance"}, uuid=generate_uuid5("q1"))

# we now add a PdfLevel object with a cross reference to q1
c1 = documenttop.data.insert(
    {"pdftitle": "Insurance also"},
    references={
        "hasContent": [
            generate_uuid5("q1")
        ]
    }
)

# query the PdfLevel collection (the pasted code used the earlier `category` variable)
query = documenttop.query.fetch_objects(
    return_references=QueryReference(link_on="hasContent", return_properties=["pdfcontent"])
)
print(query.objects[0].properties)
print(query.objects[0].references.get("hasContent").objects[0].properties)

This is where I get errors: I use a .csv file with a column named Section to insert all the pdfcontent entries linked to a single pdf title…

#!pip install pandas
import pandas as pd
from uuid import uuid5, NAMESPACE_DNS

def generate_uuid5(name):
    return str(uuid5(NAMESPACE_DNS, name))

csv_file_path = 'mypath'
data = pd.read_csv(csv_file_path)

uuids = []
for index, row in data.iterrows():
    content_uuid = generate_uuid5(f"pdfcontent{index}")
    documentcontent.data.insert({"pdfcontent": row['Section']}, uuid=content_uuid)
    uuids.append(content_uuid)

documenttop.data.update_by_id(
    c1,  # Assuming this is the ID of your existing PdfLevel entry
    updates={
        "references": {
            "hasContent": uuids
        }
    }
)

# Optionally, fetch to verify
query = documenttop.query.fetch_objects(
    return_references=QueryReference(link_on="hasContent", return_properties=["pdfcontent"])
)
print(query.objects[0].properties)
for ref in query.objects[0].references.get("hasContent").objects:
    print(ref.properties)
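
One way to link every CSV row to a single PdfLevel object might be to insert each row with a deterministic UUID and then add the references one at a time with `reference_add`, the same call used earlier in this thread. The following is a minimal sketch under assumptions: `link_sections` is a hypothetical helper, the CSV is read with the standard `csv` module rather than pandas, and the collection objects are passed in as arguments so the snippet itself does not need a running cluster.

```python
import csv
import io
from uuid import NAMESPACE_DNS, uuid5


def generate_uuid5(name):
    """Same helper as in the post: a stable UUID derived from a string."""
    return str(uuid5(NAMESPACE_DNS, name))


def link_sections(csv_text, content_collection, level_collection, level_uuid):
    """Insert one PdfContent object per CSV row and link each to one PdfLevel object."""
    uuids = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content_uuid = generate_uuid5(f"pdfcontent{index}")
        content_collection.data.insert({"pdfcontent": row["Section"]}, uuid=content_uuid)
        # Same keyword arguments as the reference_add call earlier in the thread.
        level_collection.data.reference_add(
            from_uuid=level_uuid,
            from_property="hasContent",
            to=content_uuid,
        )
        uuids.append(content_uuid)
    return uuids
```

With the objects from the post, this would be called roughly as `link_sections(open(csv_file_path).read(), documentcontent, documenttop, c1)`.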