I don't understand the Weaviate schema structure

I am not a data engineer, so I am not familiar with schema structures, as I don’t have prior experience with BigQuery, PySpark, etc. (As a data scientist, I only know SQL.) So I don’t really know what schema structure I should write for my multi-PDF querying app.

For example, here are 3 schemas I used for my app:

# Schema 1
class_obj = {
    "class": "doc-query",
    "vectorizer": "text2vec-openai",
}

# Schema 2
class_obj = {
    "class": "doc-query",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": True
        }
    }
}

# Schema 3
class_obj = {
    "class": "doc-query",
    "description": "Documents for chatbot",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"model": "ada", "type": "text"},
    },
    "properties": [
        {
            "dataType": ["text"],
            "description": "The content of the paragraph",
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
            "name": "content",
        },
    ],
}

In my experiments with all three, I see pretty much the same quality of retrieval responses: all 3 give a similar level of correctness in their responses, and all 3 return pretty much the same sources too (when using RetrievalQAWithSourcesChain, for example). So what’s the difference between these schema structures? Or does it not matter which schema we use, and it’s just a placeholder of sorts?

P.S. There’s not much explanation about schemas in the Weaviate documentation either; only a handful of examples are provided, without going into the nitty-gritty of how, when and where different schema structures can and should be used. To that end, once I understand the schema structure, I want to create a comprehensive comparative analysis of different schema structures: how to know which properties should be present in the schema based on the problem at hand, how to compare different schemas on the same problem to quantify the differences in results, etc. I’d love it if anyone is up to collab.

Hi @Kristada673 !

I think I might be able to clarify this for you.

When you create the class with your first definition, Weaviate fills in all the default configuration for you.

For example, a vectorizer that is not specified will come from DEFAULT_VECTORIZER_MODULE, with all its default values.

So your 3 classes are basically the same.

And because you probably have the auto-schema feature turned on, whenever you import data with properties that don’t exist yet, Weaviate will figure those out too.
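
If you want to check this yourself, one option is to read back what Weaviate actually stored for each class and compare the definitions. This is only a rough sketch, assuming the v3 Python client, where client.schema.get(class_name) returns the stored class definition as a dict. The helper name config_differences and the class names in the usage comment are made up for illustration.

```python
def config_differences(stored_a, stored_b, ignore=("class", "description")):
    """Return the top-level keys whose values differ between two class definitions."""
    keys = (set(stored_a) | set(stored_b)) - set(ignore)
    return sorted(k for k in keys if stored_a.get(k) != stored_b.get(k))


# Usage against a running instance (not executed here; class names are hypothetical):
#   a = client.schema.get("DocQueryA")
#   b = client.schema.get("DocQueryB")
#   print(config_differences(a, b))  # an empty list means no top-level difference
```

The comparison is deliberately shallow: it only tells you which top-level sections (vectorizer, moduleConfig, properties, and so on) differ, which is usually enough to see where to look next.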

Let me know if this helps :slight_smile:

OK, understood that these 3 schemas are pretty much the same. But what would a different schema look like? How do I know if it’s the best/correct schema for the multi-PDF query application? How about for a different application, say an app to upload a single PDF and query that: how would the schema be different? You know what I mean? I don’t know which properties to include in a schema for a given task, on what basis to select those properties, or what values to set them to. Like I said, there’s no exhaustive explanation of the schema structure in Weaviate’s documentation. Maybe the assumption is that Weaviate is best for people coming from data engineering backgrounds, or MLOps experts who are already familiar with the schema structure?

The key would be changing the above.

That is, the vectorizer strongly affects search. Turning skip on or off for a property will also affect search. So setting "skip": True in your 3rd example should give you very different results.
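
To try that out, you could derive a second definition from the third schema with the flag flipped. A minimal sketch follows; the helper with_skip and the class name DocQuery are hypothetical, and only the dict layout is taken from the schemas above.

```python
import copy


def with_skip(class_obj, property_name, skip, module="text2vec-openai"):
    """Return a copy of class_obj with `skip` set on one property's module config."""
    new_obj = copy.deepcopy(class_obj)
    for prop in new_obj.get("properties", []):
        if prop["name"] == property_name:
            prop.setdefault("moduleConfig", {}).setdefault(module, {})["skip"] = skip
            return new_obj
    raise KeyError(f"no property named {property_name!r}")


base = {
    "class": "DocQuery",  # hypothetical class name
    "vectorizer": "text2vec-openai",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {"skip": False, "vectorizePropertyName": False}
            },
        },
    ],
}

# Same class definition, but `content` no longer contributes to the vector.
skipped = with_skip(base, "content", True)
```

To compare the two, I would create the variant under its own class name and import the same data into both, rather than assume the setting can be changed on a class that already holds data.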

For a multi-PDF/multi-document app, you can do what LangChain does, for example.

You can have a property called source that identifies which file a chunk came from, and a text property that stores the data. Then, when you search, you can limit the objects to the specific sources you want. For example, these are the properties LangChain will create when you use it:

'properties': [{'dataType': ['text'],
   'indexFilterable': True,
   'indexSearchable': True,
   'moduleConfig': {'text2vec-openai': {'skip': False,
     'vectorizePropertyName': False}},
   'name': 'text',
   'tokenization': 'word'},
  {'dataType': ['text'],
   'description': "This property was generated by Weaviate's auto-schema feature on Wed Nov  1 17:34:53 2023",
   'indexFilterable': True,
   'indexSearchable': True,
   'moduleConfig': {'text2vec-openai': {'skip': False,
     'vectorizePropertyName': False}},
   'name': 'source',
   'tokenization': 'word'},
  {'dataType': ['number'],
   'description': "This property was generated by Weaviate's auto-schema feature on Wed Nov  1 17:34:53 2023",
   'indexFilterable': True,
   'indexSearchable': False,
   'moduleConfig': {'text2vec-openai': {'skip': False,
     'vectorizePropertyName': False}},
   'name': 'page'}]
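
Limiting a search to certain files could then look roughly like this. It is a sketch assuming the v3 Python client's query builder and the text, source and page properties shown above; the function names source_filter and search_in_sources are made up for illustration.

```python
def source_filter(sources):
    """Build a `where` filter matching objects whose `source` is one of the given files."""
    if not sources:
        raise ValueError("need at least one source")
    operands = [
        {"path": ["source"], "operator": "Equal", "valueText": s} for s in sources
    ]
    if len(operands) == 1:
        return operands[0]
    return {"operator": "Or", "operands": operands}


def search_in_sources(client, class_name, question, sources, limit=4):
    """Vector search restricted to the given source files (needs a running instance)."""
    return (
        client.query.get(class_name, ["text", "source", "page"])
        .with_near_text({"concepts": [question]})
        .with_where(source_filter(sources))
        .with_limit(limit)
        .do()
    )
```

One thing to keep in mind: source uses 'word' tokenization in the schema above, so as far as I know the Equal match is token-based rather than an exact string comparison. If file names share tokens, a 'field' tokenization on source may be the safer choice.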

I have just crafted a PR to LangChain to allow schema creation beforehand

This will allow us to create the class beforehand, specifying vectorizer and modules, so we can import data with LangChain and use Weaviate without it too.

You can add any properties to your class. Note that if you set a property to skip, it will not be part of the vector generated for that object, so its meaning will not be taken into account in vector search.

By the way, we have a free, weekly intro workshop:

We would love to have you there.

Let me know if this helps :slight_smile:

Hi @Kristada673,

Welcome to the community :slight_smile:

Yes, you are right on both counts:

Choosing Vectorizer

Changing text2vec-openai to a different vectorizer (like text2vec-cohere), or changing the model in the vectorizer, will affect your results, as some models or vectorizers are better than others depending on the task. (How they differ is a whole lecture :sweat_smile:)
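
If you want to compare vectorizers side by side, one approach is to copy the class definition under a new name with a different module. A rough sketch, where the helper with_vectorizer and the class names are hypothetical; it drops the class-level settings of the old module (such as the model), since those are module-specific, and carries over the per-property skip / vectorizePropertyName settings.

```python
import copy


def with_vectorizer(class_obj, new_class_name, vectorizer):
    """Copy a class definition under a new name, swapping the vectorizer module."""
    new_obj = copy.deepcopy(class_obj)
    old = new_obj.get("vectorizer")
    new_obj["class"] = new_class_name
    new_obj["vectorizer"] = vectorizer
    # Class-level module settings (e.g. the model) belong to the old module: drop them.
    new_obj.get("moduleConfig", {}).pop(old, None)
    # Per-property settings are moved to the new module key.
    for prop in new_obj.get("properties", []):
        cfg = prop.get("moduleConfig", {})
        if old in cfg:
            cfg[vectorizer] = cfg.pop(old)
    return new_obj


openai_class = {
    "class": "DocQueryOpenAI",  # hypothetical class name
    "vectorizer": "text2vec-openai",
    "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "moduleConfig": {"text2vec-openai": {"skip": False}},
        },
    ],
}

cohere_class = with_vectorizer(openai_class, "DocQueryCohere", "text2vec-cohere")
```

You would then import the same documents into both classes and run the same questions against each.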

Skip true/false

Setting "skip: True" affects what properties do or don’t get used to generate vector embeddings. So, this is a good way to tell Weaviate, please ignore these fields when it comes to vectors, but I still want to store the corresponding data.

Tbh. I only ever define "properties": [...] when I have properties that I don’t want to vectorize. A good example is a collection of Books, with:

  • title: vectorize, as it is useful to find books based on the title
  • description: vectorize, as it can help us find books that match a topic
  • publisher or author: skip, as vectorizers won’t necessarily generate useful vector embeddings based on these properties.
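
As a sketch, the Books example could be written like this. The class name Book, the helper text_property and the choice of text2vec-openai are assumptions for illustration; the skip and vectorizePropertyName keys are the same ones used in the schemas earlier in this thread.

```python
def text_property(name, vectorize, module="text2vec-openai"):
    """A text property that is either included in or skipped from the object vector."""
    return {
        "name": name,
        "dataType": ["text"],
        "moduleConfig": {
            module: {"skip": not vectorize, "vectorizePropertyName": False}
        },
    }


book_class = {
    "class": "Book",  # hypothetical class name
    "vectorizer": "text2vec-openai",
    "properties": [
        text_property("title", vectorize=True),
        text_property("description", vectorize=True),
        text_property("publisher", vectorize=False),
        text_property("author", vectorize=False),
    ],
}

# With the v3 Python client, this would be registered with:
#   client.schema.create_class(book_class)
```

The skipped properties are still stored and returned with each object; they just don't contribute to the embedding.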

Target audience

As a side note, we are continuously evolving our APIs and how you interact with Weaviate, as we want Weaviate to be accessible to everyone (not just data scientists). In fact, a big part of our audience is developers who don’t have a background in data science or ML.

We may not be there just yet, but more and more great improvements arrive with every release that improve the developer experience, which is my personal mission at the company :wink:
Thank you for sharing your feedback, as it will help us improve the future installments of Weaviate :pray:

I think, based on my experience of using Weaviate so far, that this is in fact the community Weaviate is more suitable for, as they are already familiar with Kubernetes, graph query, etc. A data scientist cannot get going with Weaviate within a day or two, for sure, at least the way it is now.

Hey @Kristada673,
I guess it depends on what tools you are used to working with.
If a big part of the stack that Weaviate uses is new to you, then I can see why this might feel more challenging.

Tbh, you shouldn’t need to know GraphQL to use Weaviate.
I don’t know GraphQL; I interact with Weaviate through either the Python or JavaScript client, and that does the trick for me.

We strive to make Weaviate as easy to use as possible. Hopefully, over time we will make it super straightforward for everyone. :wink:
