Schema and vectorization questions

I have a few questions related to vectorization and schema creation.

  1. Does it make sense to have an arbitrary number of vectors per object, and does Weaviate support that? (I am NOT talking about named vectors, because that is still a fixed number of vectors per object.) Does the following make sense (if not, why?):
    Consider 4 objects that have 5, 1, 10, and 8 entries of a certain section A, and 2, 3, 7, and 8 related entries of another section B, respectively.
    Does it make sense for object 1 to have 7 vectors, object 2 to have 4, and so on (each vector covering a single entry of a single section)? In that case every object would have a different number of vectors.

Is that supported by Weaviate, and more importantly, does it make sense, or would we lose too much context breaking it up into so many small sections?

  2. Is there a way to vectorize array properties? Currently I join the array into a single block of text and then format it back into an array when returning results, but is there a way to vectorize text[] properties directly?

Hi @aritraban !!

From what I understood, for 1., it doesn’t make sense to have multiple vectors per object.

Let me know if this is what you are after:

Consider you have 1 object from the Person collection per person.

One object for me, and one object for you :slight_smile:

Now, you want to create other objects, let’s say in a Skill collection, that will cross-reference a Person object.

Let’s say you have 10 skills, and I have 5 skills.

If you want the Person object to have the average of all (in your case 10, and in mine 5) skills, you are probably looking for the ref2vec module/integration:

Here is also a nice blog post about it:

Other than that, you cannot have an arbitrary number of vectors tied to one object in the same vector index, but you can have the centroid of multiple cross-referenced vectors “condensed” into the vector of the “parent” object.
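
To make the “centroid” idea concrete, here is a minimal sketch in plain Python. It is not the ref2vec implementation, only an illustration of averaging vectors element by element; the function name centroid and the 3-dimensional Skill vectors are made up for the example.

```python
from typing import List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Return the element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical vectors of two Skill objects referenced by one Person
skill_vectors = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
person_vector = centroid(skill_vectors)
print(person_vector)
```

However many Skill objects a Person references, the Person still ends up with exactly one vector of the same dimension.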

Let me know if this is closer to what you are looking for.

Thanks!

What is the use of cross references? Like what is the benefit of adding Skills as a collection and cross-referencing it with Person, rather than having every object in Person schema have a Skills property? Could you tell me the difference that would convey?

Oops! Forgot about the 2. :see_no_evil:

I was investigating the array vectorization, and had some findings.

I will bring more info here about this soon :slight_smile:

Yep @DudaNogueira that would be great.

OK, text arrays (text[]) will be vectorized.

A nice way to check what ends up in the payload sent to the vectorizer is to override the base URL at query time (check here) using the headers, and point it to a service like https://webhook.site/

Let me know if this helps.

@DudaNogueira However, the Weaviate docs don’t state that; they say only text is vectorized. Can we update that then?

This is the code I used:

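import weaviate
import os
from weaviate import classes as wvc

# Point the OpenAI vectorizer at webhook.site so the outgoing request can be inspected
client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-BaseURL": "https://webhook.site/0df6133f-f933-42e4-9a2b-07ff7c5afc69",
        #"X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY"),
    }
)

# No vectorizer is configured here, so this relies on the server's default vectorizer
collection = client.collections.create(
    "VectorizeArray",
    properties=[
        wvc.config.Property(name="array", data_type=wvc.config.DataType.TEXT_ARRAY)
    ]
)
# Inserting an object triggers the vectorization request
collection.data.insert({"array": ["this", "is", "an", "array"]})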

And this is the payload I received:

{
  "input": [
    "vectorize array this is an array"
  ],
  "model": "text-embedding-ada-002"
}

Those are the related docs I have found about this:

Have you seen a different doc around?

Thanks!


It’s not immediately obvious whether arrays (text[]) are vectorized, or how. E.g., is the content of the array stitched together as a block of text and passed to the vectorizer? I guess in this case we are using the array version of OpenAI’s vectorizer, but I just want to understand what happens when we have a mixture of text AND text[] in the same collection.

I think explicitly stating that would be helpful to understand how text2vec relates to directly using embeddings as I think the formatting of documents can affect the embeddings/searches.

Also, is there a way to visualize/query the “Documents” that are returned with explainScore? In the following:

I would like to understand what part of the text made this come up, so how to query that “Document”? Is it just the uuid of the object?

@DudaNogueira

This is the resulting payload if you have a text array and a text:

{
  "input": [
    "vectorize array this is an array this is a text"
  ],
  "model": "text-embedding-ada-002"
}

This was the insert instruction:

collection.data.insert({"array_property": ["this", "is", "an", "array"], "text_property": "this is a text"})
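
As a rough sketch of what those two payloads suggest, the function below rebuilds the same input strings in plain Python. It is not Weaviate’s actual implementation: the name sketch_vectorizer_input is made up, and the rules (split the collection name into lowercase words, skip property names, walk properties in sorted name order, join array items with spaces) are assumptions inferred only from the payloads above.

```python
import re
from typing import Dict, List, Union

Value = Union[str, List[str]]


def sketch_vectorizer_input(collection_name: str, properties: Dict[str, Value]) -> str:
    """Rebuild the observed payload string (assumed rules, see the text above)."""
    # Split a CamelCase collection name into lowercase words
    words = [w.lower() for w in re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", collection_name)]
    # Assumption: properties are visited in sorted name order
    for name in sorted(properties):
        value = properties[name]
        if isinstance(value, list):
            words.extend(value)  # array items are joined like plain text
        else:
            words.append(value)
    return " ".join(words)


print(sketch_vectorizer_input("VectorizeArray", {"array": ["this", "is", "an", "array"]}))
print(sketch_vectorizer_input(
    "VectorizeArray",
    {"array_property": ["this", "is", "an", "array"], "text_property": "this is a text"},
))
```

The two printed strings match the "input" values received on webhook.site for the array-only and the array-plus-text inserts.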

I have raised awareness about this, and we’ll also improve the statements about nested objects.

The explain score text comes attached to the object itself.

Here is how to get both the object and the explain score:

response = collection.query.hybrid(query="array", return_metadata=wvc.query.MetadataQuery(explain_score=True))
# "obj" avoids shadowing the Python builtin "object"
for obj in response.objects:
    print("####")
    print(obj, obj.metadata.explain_score)

Let me know if this helps.

Thanks!

@DudaNogueira No, I know how to get the explain score with _additional or Metadata. I want to understand what the “Document uuid” is that the explainScore refers to in my picture above.

Oh. That’s strange, as it seems to have two keyword,bm25 entries.

Can you give us a reproducible example?