Recommendations for metadata or knowledge graphs?

As we grow with Weaviate, we realize that retaining document structure is important. For example, when vectorizing FAQ docs, we want to keep track of which answer corresponds to which question. Or, when modeling Word documents, we want to track which portion of the document is a Title vs. Paragraph vs. Section Header, etc. All of these document structure details are important.

I realize we can build a schema for each structure, i.e.:

 FAQSchema
 TitleParagraphSectionSchema

But this feels tedious and manual. Instead, I imagine using metadata tags.

GeneralSchema{
  # FAQ Implementation
   metadata: {
      is-question: True
      answer: <id-of-answer-object>
   }
}

Each document type would impose structure through this metadata.

The problem is that, as far as I understand, Weaviate schemas are fixed. So the tags would either have to be known a priori, or they would be stored inefficiently in a list like:

metadata: ['tag1:value1', 'tag2:value2']

It would then be inefficient or impossible to do searches like “where tag = foo”.

Or am I wrong and schemas can/should be dynamic? I.e., we can add additional fields, and it's OK as long as they are roughly the same throughout (i.e., all the docs in the schema have the same metadata)?

Do you guys (Weaviate team) have any recommendations or best practices for how to approach something like this? My initial thoughts are:

  1. Go with dynamic schemas but try to keep objects with similar metadata isolated (i.e., FAQ docs aren’t mixed with other docs)

  2. Add nameless placeholder tags into the shared schema and let callers manage them

    GeneralSchema{
       ...
       tag1: str
       tag2: str
       tag3: str
    }

These can be empty, or optionally used like this:

   # Faq impl
   GeneralSchema { 
       tag1: "some question"
       tag2: "answer-id"
   }

Where it is up to the caller to understand that tag1 is a question and tag2 is an answer ID and so on. By the way, does adding empty fields like these tags add overhead to indexing/search performance, even if they remain empty for most use cases?
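One way the caller could keep track of what each placeholder means is a small per-document-type mapping from meaningful names to the generic slots. This is only a minimal sketch in plain Python with no Weaviate calls; `TAG_LAYOUTS`, `to_generic`, `from_generic`, and the `word_doc` layout are hypothetical names made up for illustration:

```python
# Hypothetical mapping from meaningful field names to the generic
# placeholder properties (tag1, tag2, tag3) of the shared schema.
TAG_LAYOUTS = {
    "faq": {"question": "tag1", "answer_id": "tag2"},
    "word_doc": {"section_type": "tag1"},
}


def to_generic(doc_type, fields):
    """Translate meaningful field names into placeholder tag names."""
    layout = TAG_LAYOUTS[doc_type]
    return {layout[name]: value for name, value in fields.items()}


def from_generic(doc_type, properties):
    """Translate placeholder tags read from storage back into meaningful names."""
    layout = TAG_LAYOUTS[doc_type]
    return {
        name: properties[tag]
        for name, tag in layout.items()
        if properties.get(tag) is not None  # unused placeholders are left out
    }


stored = to_generic("faq", {"question": "some question", "answer_id": "answer-id"})
print(stored)
print(from_generic("faq", stored))
```

The mapping lives entirely in the calling code, so the shared schema never has to know what `tag1` means for a given document type.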

(OP here) I’m requesting to close this because I believe Weaviate supports whatever we want to do, and it’s outside of the scope of the Weaviate forum to be asking it.

Ultimately, AutoSchema is one way to accomplish this; the other is to find a way for my calling code to be more intelligent about custom fields in schemas.


Hi Adam!

Thanks for following up.

Data modeling in Weaviate is a topic we get a lot of questions about. It is also always evolving :wink:

For example, in our recently released 1.22 version, we introduced Nested Object Storage, so now you can add data just like you have imagined :star_struck:

It’s the first iteration of this feature, but it will for sure make the modeling part a more delightful experience.
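To make this concrete, here is a rough sketch of what a class definition with a nested `metadata` object could look like, written as a plain Python dict in Weaviate's JSON schema format. It assumes the 1.22 format where a property uses `dataType: ["object"]` together with `nestedProperties`. The class name `GeneralDocument`, the property names, and the helper `undeclared_nested_keys` are hypothetical. Since this is the first iteration of the feature, it is worth checking the docs for what nested properties currently support (filtering, vectorization) in your version.

```python
import json

# Hypothetical class definition using a nested object property.
general_class = {
    "class": "GeneralDocument",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {
            "name": "metadata",
            "dataType": ["object"],
            "nestedProperties": [
                {"name": "isQuestion", "dataType": ["boolean"]},
                {"name": "answerId", "dataType": ["text"]},
            ],
        },
    ],
}

# An FAQ object shaped like the metadata idea from the question.
faq_object = {
    "content": "How do I reset my password?",
    "metadata": {"isQuestion": True, "answerId": "<id-of-answer-object>"},
}


def undeclared_nested_keys(class_def, prop_name, obj):
    """Return nested keys in obj[prop_name] that the class does not declare."""
    prop = next(p for p in class_def["properties"] if p["name"] == prop_name)
    declared = {nested["name"] for nested in prop.get("nestedProperties", [])}
    return set(obj.get(prop_name, {})) - declared


print(json.dumps(general_class, indent=2))
print(undeclared_nested_keys(general_class, "metadata", faq_object))
```

The dict is what you would hand to your client's class-creation call; the helper is just a client-side sanity check and not part of Weaviate.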

Properties of a collection in Weaviate are not fixed; you can add properties later on. However, the same property cannot hold different data types across objects, as this would break the logic for filters, for example.

It is good practice to create your classes beforehand to make sure that the class-related config (invertedIndexConfig, vectorizer, etc.) is the way you want it. Keep in mind that only some settings can be changed after a class is created without reindexing/reimporting your data.

Auto Schema will try its best to create a “good for all” class definition. What you can do is take that auto-created class schema and tweak it to your own needs.

As for empty fields: they will only add overhead if you deliberately configure the class to index null state. On the other hand, doing so will allow you to filter by null fields.

This feature was introduced in 1.16.
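As a rough illustration, null-state indexing is switched on through `indexNullState` in the class's `invertedIndexConfig`, and null values can then be matched with the `IsNull` filter operator. The sketch below only builds the dicts; the class name `GeneralDocument`, the `tag1` property, and the helper `is_null_filter` are hypothetical:

```python
# Hypothetical class definition; only the invertedIndexConfig part matters here.
class_with_null_index = {
    "class": "GeneralDocument",
    "invertedIndexConfig": {"indexNullState": True},
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "tag1", "dataType": ["text"]},
    ],
}


def is_null_filter(prop_name, is_null=True):
    """Build a where filter matching objects whose property is (or is not) null."""
    return {"path": [prop_name], "operator": "IsNull", "valueBoolean": is_null}


# Objects that never set the placeholder tag1.
unused_tag1 = is_null_filter("tag1")
# Objects that did set it.
used_tag1 = is_null_filter("tag1", is_null=False)
print(unused_tag1)
print(used_tag1)
```

If you leave `indexNullState` off, empty fields are simply not indexed for this purpose, which is the default behavior described above.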

Let me know if that helps!

Thanks!

This is super helpful thank you.

  1. Can’t wait to try 1.22 - I totally missed this announcement :slight_smile:

  2. Thank you as well for the point about null indexing - that’s very useful and also went right over my head.

Properties of a collection in Weaviate are not fixed; you can add properties later on. However, the same property cannot hold different data types across objects, as this would break the logic for filters, for example.

Is this bad for performance though? If I add a new property, does it have to re-index all the vectors or do some otherwise slow batch operations in the background to stay in sync? We don’t anticipate doing this often, but maybe once in a while.

Like for example - would these systems be similarly performant:

  1. A fixed schema that never changes
  2. AutoSchema, but all the data always has the same fields, so for all intents and purposes the schema never changes after it’s originally inferred, but we don’t know it a priori.
  3. A fixed schema, but an additional property here and there.

When you add a new property, by default it will be created with skip: false, meaning it is included in vectorization.

As it will impact the generated vector, the object will need to be revectorized.

So it is up to you to decide. If you do not want that property to be part of the vectorized content, you need to set skip to true. Then, any time you change or add it on an object, Weaviate will not revectorize the object.
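For reference, a sketch of how the skip setting can look in a property definition. It lives under the property's `moduleConfig`, keyed by the vectorizer module name. The module `text2vec-openai` is only an assumption here (use whichever vectorizer your class is configured with), and `text_property` is a hypothetical helper:

```python
def text_property(name, vectorizer_module, skip):
    """Build a text property definition with an explicit skip setting."""
    return {
        "name": name,
        "dataType": ["text"],
        "moduleConfig": {vectorizer_module: {"skip": skip}},
    }


VECTORIZER = "text2vec-openai"  # assumption: replace with your class's vectorizer

# Main content: included in the vector (skip: false is the default anyway).
content = text_property("content", VECTORIZER, skip=False)
# Metadata-style tag: kept out of the vector so it does not affect it.
tag1 = text_property("tag1", VECTORIZER, skip=True)

print(content)
print(tag1)
```

Keeping metadata-style fields at `skip: true` means they can still be stored and filtered on without influencing the vector.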

Auto Schema basically adds properties for you. If you send content in a property that is not yet defined, Weaviate will define it for you and try to find the best default configuration for it, considering data type, etc.

The number of properties should not degrade performance, AFAIK.

Let me know if that helps :slight_smile:

Thanks!
