How to manage list of strings

rjalex · March 6, 2024, 4:23pm

Data types | Weaviate - Vector Database lists text as a data type.

I have some original data which is inherently a list of strings (names of places mentioned in an article, for example ['Paris', 'Rome', 'London', 'New York']).

Given the available query operators, would it be handier to just join them into a single string ("Paris Rome London New York"), or is there some advantage to keeping them as a list of separate strings?

Thanks for any suggestions
Is there an example of the Property declaration for such objects and a query?

DudaNogueira · March 6, 2024, 8:37pm

Hi!

When it comes to searching, I don’t think it really makes a difference. Considering that the default tokenization is word, you will end up with an object with tokens like paris, rome, london, new, york.
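To make the tokenization point concrete, here is a minimal pure-Python sketch. It only approximates word tokenization (split on non-alphanumeric characters, then lowercase); the helper `word_tokens` is hypothetical and is not part of the Weaviate client or server. It shows why a list and a joined string can end up with the same tokens.

```python
import re

def word_tokens(value):
    """Approximate 'word' tokenization (an assumption, not Weaviate's code):
    split on non-alphanumeric characters and lowercase each piece."""
    values = value if isinstance(value, list) else [value]
    tokens = []
    for v in values:
        tokens.extend(t.lower() for t in re.split(r"[^0-9A-Za-z]+", v) if t)
    return tokens

cities_list = ["Paris", "Rome", "London", "New York"]
cities_string = "Paris,Rome,London,New York"

list_tokens = word_tokens(cities_list)
string_tokens = word_tokens(cities_string)

print(list_tokens)
print(string_tokens)
print(list_tokens == string_tokens)
```

Under this approximation both forms produce the same tokens. Note that "New York" is split into new and york in either case, so a filter on "York" alone would presumably match as well.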

I did a little lab to check this:

import weaviate
client = weaviate.connect_to_local()
client.collections.delete("MyCollection")
collection = client.collections.create("MyCollection")
content = [
    {"cities": ['Paris','Rome','London','New York'], "cities_string": "Paris,Rome,London,New York"},
    {"cities": ['Rio','London','New Hampshire'], "cities_string": "Rio,London,New Hampshire"},
]
with collection.batch.dynamic() as batch:
    for data_row in content:
        batch.add_object(
            properties=data_row
        )
# The batch was opened on the collection, so read failures from collection.batch
if collection.batch.failed_objects:
    print(collection.batch.failed_objects)

for i in collection.query.fetch_objects(include_vector=True).objects:
    print(i.properties, i.vector)

I got the same results with for example:

from weaviate import classes as wvc
collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities_string").contains_any(["Rome", "Rio"])
).objects

and

from weaviate import classes as wvc
collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities").contains_any(["Rome", "Rio"])
).objects

Let me know if this helps

rjalex · March 7, 2024, 9:34am

Dear @DudaNogueira thanks a lot.

I have developed a small collection of 10,000 entries similar to yours (thank you ChatGPT).

With this larger collection I have done several timing runs and, unless this is a random result, I see slightly faster results when retrieving from the string as opposed to retrieving from the list.

Fetch Rome Rio from string took 0.12191414833068848 seconds to complete.
Fetch Rio Rome from string took 0.12638115882873535 seconds to complete.
Fetch Rome Rio from list took 0.1296849250793457 seconds to complete.
Fetch Rio Rome from list took 0.12851190567016602 seconds to complete.

The above is just one run. Can you think of any reason, or is it just random? Obrigado

DudaNogueira · March 15, 2024, 9:05pm

hi there @rjalex

A list will indeed be very slightly slower than a string, because we have to run some extra code to process the list.

The difference is small, though, and shouldn't have much impact.
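One way to check whether a gap of a few milliseconds is real or just noise is to repeat each query many times and compare medians and spread rather than single runs. The sketch below uses only the standard library; `time_runs`, `summarize`, and `fake_query` are hypothetical names, and `fake_query` is a stand-in workload that you would replace with the actual `fetch_objects` call.

```python
import statistics
import time

def time_runs(fn, repeats=20):
    """Call fn `repeats` times and return the per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Return median, standard deviation, and minimum of the durations."""
    return {
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations),
        "min": min(durations),
    }

def fake_query():
    # Stand-in workload (an assumption); swap in the real query here.
    sum(range(10_000))

stats = summarize(time_runs(fake_query))
print(stats)
```

If the medians for the string and list queries differ by less than the standard deviation of either, the difference in a single run is probably not meaningful.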

By the way, we are planning a large rewrite of the keyword search this year. Stay tuned
