Collections have random letter as folder name

00.lope.naughts · May 24, 2024, 5:54pm

Description

Docker container with image cr.weaviate.io/semitechnologies/weaviate:1.24.11

Using weaviate_client v4 to create a new collection (e.g. “listing_image”), I noticed that it generates a folder with a random name after I upload data records to it. I think this is where the majority of the actual data lives.

My question is: how is this random name generated? Is there a way to control it?

The motivation is this scenario:

1. Create a container and a collection, then insert data (into a folder mapped to a location outside the Docker container, so that it isn’t lost even when I delete the container).
2. Delete the container from (1).
3. Create a new container pointing to the exact same data folder mount as in (1).
4. Issue: I don’t see the collection.
5. I proceeded to create the collection with the exact same definition.
6. Issue: I don’t see the data.

What I observed is that the new collection is created under a different folder with another random name. So the hack I used was simply to copy everything from the old random folder to the new random folder. After that everything worked, and I was able to see my data again.
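For reference, the copy workaround amounts to something like the sketch below. It only mirrors the hack described above and is not a supported restore path; it assumes the Weaviate container is stopped while copying, and the function name and folder arguments are hypothetical placeholders.

```python
import shutil
from pathlib import Path


def copy_shard_contents(old_shard_dir: str, new_shard_dir: str) -> int:
    """Copy everything from the old shard folder into the new one.

    Returns the number of files found in the new folder afterwards.
    """
    src = Path(old_shard_dir)
    dst = Path(new_shard_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"old shard folder not found: {src}")
    # dirs_exist_ok=True merges into the existing folder and overwrites
    # files that share the same relative path (requires Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())
```

Both arguments would be the two random-looking folders inside the mounted data directory: the one left by the old container and the one created by the new container.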

So I am guessing I may have to add some extra metadata when recreating the collection, so that the folder name it generates points to the old folder.


DudaNogueira · May 31, 2024, 5:56pm

Hi!

What is the version you are using?

Also, why not leverage the backup tool for that?

I just ran a test, and the top-level folder has the same name as the collection. Only the internal folder, which corresponds to the shard ID, has a generated name.
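To check that layout on a mounted data folder, here is a minimal sketch using only the standard library. It assumes the two-level structure described above (collection folder, then shard folder); the function name is hypothetical, and not every top-level folder in the data path is necessarily a collection.

```python
from pathlib import Path
from typing import Dict, List


def list_collection_shards(data_root: str) -> Dict[str, List[str]]:
    """Map each top-level folder to the names of its direct subfolders."""
    layout: Dict[str, List[str]] = {}
    for entry in sorted(Path(data_root).iterdir()):
        if entry.is_dir():
            # Under the assumed layout, these subfolders are the shard folders.
            layout[entry.name] = sorted(
                child.name for child in entry.iterdir() if child.is_dir()
            )
    return layout
```

Calling `list_collection_shards(...)` on the host path that is mounted into the container, before and after recreating the collection, should show whether only the inner folder name changed.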

Let me know if this helps.

Thanks!

00.lope.naughts · June 3, 2024, 4:38pm

I used the following image to create the Docker container: cr.weaviate.io/semitechnologies/weaviate:1.24.11, and I am using the Python weaviate_client v4. If that random-looking folder is the shard ID, I wonder if there’s a way to enforce a particular one when I recreate the Docker container and collections from scratch.

This is for the following scenario:

1. Use a particular compose yml to create a Weaviate instance container.
2. Do a lot of testing, find out I have to reconfigure the Weaviate server, and want to start from scratch.
3. Create a new compose yml with some new config.
4. Want to reuse the data in the collections created by the previous container.
5. If I run client code to create the collections on the new container, it generates new shard IDs, even though I mount the data in the same location as the old container.

You also mentioned the backup tool. I am not sure whether the scenario above counts as a backup or not. If it does, I can take a look at that. I just thought it would be far simpler to have a way to dictate those shard IDs without trying yet another tool.

DudaNogueira · June 7, 2024, 6:49pm

Hi! I believe this is a scenario for backup.

You can back up just one specific collection and restore it.
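A minimal sketch of what that could look like against the REST backup endpoints, using only the standard library. It assumes the `filesystem` backup backend is enabled on the server (`ENABLE_MODULES=backup-filesystem` plus `BACKUP_FILESYSTEM_PATH`); the base URL, backup ID, collection name, and helper names are illustrative assumptions, not values from this thread.

```python
import json
import urllib.request
from typing import List


def _post_json(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_backup_request(
    base_url: str, backend: str, backup_id: str, collections: List[str]
) -> urllib.request.Request:
    """POST /v1/backups/{backend}: back up only the listed collections."""
    url = f"{base_url.rstrip('/')}/v1/backups/{backend}"
    return _post_json(url, {"id": backup_id, "include": collections})


def build_restore_request(
    base_url: str, backend: str, backup_id: str, collections: List[str]
) -> urllib.request.Request:
    """POST /v1/backups/{backend}/{id}/restore: restore the listed collections."""
    url = f"{base_url.rstrip('/')}/v1/backups/{backend}/{backup_id}/restore"
    return _post_json(url, {"include": collections})
```

The request would be sent with `urllib.request.urlopen(...)`: the backup request against the old container, the restore request against the new one. The collection should not already exist on the new instance when restoring.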

There isn’t a way to define that folder name.

In your step 2, what do you mean by reconfiguring Weaviate?

Depending on the changes you make, cluster-wide settings should not affect the collection.

Now, if you want to change some collection config (like indexing null states, metadata timestamps, etc.), then you need to reindex your data. You can use this migration guide for that:

Let me know if this helps.

Thanks!

00.lope.naughts · June 8, 2024, 11:48am

Thanks. If this is a “backup” case, then we will focus on investigating whether it solves that scenario.

Regarding step 2, I think I described it too generically. A concrete example might be ef_construction, or anything that forces us to re-create either the container or the collection from scratch (possibly because we don’t know of any other way).

And this could be another scenario:

1. Configure a container to point to a shared drive to store data.
2. Create your collection and import data (with the data ending up in that shared location).
3. Another team has to take over. All I can give them is the compose yml to spin up a new container, the Python code I used to create the collection definition, and the shared drive location containing the collection and data.

I guess this is indeed a backup, and I have to use the backup tool to restore? I was thinking that just spinning up a new Weaviate container and rerunning the Python client code to define the collections would be faster. But as I mentioned, if I naively do this, it won’t work because of the new shard name the new instance would use. I imagine the backup restore is as slow as a re-import?

I actually tried hacking it by copying everything in the old shard folder into the new shard folder (in that shared data location), and the new instance worked just fine.

DudaNogueira · June 10, 2024, 8:54pm

The main difference here is that:

If you use the migration guide, you will be reindexing all your data, meaning that a new index will be created. Hopefully you also pass the vectors along; otherwise you need to add those as well. Of course, this is very CPU-bound, and depending on the size of your collections, a backup is the better option.

A backup, on the other hand, copies all data and indexes, so you can restore faster because no reindexing happens.

You can generate backups with specific IDs on a shared bucket or S3.

That way, all containers can spin up with that configuration, and you can just restore from that backup ID.

I’m not sure this would work for your scenario, but those are roughly the options I see that could fit.
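Since backup and restore requests run asynchronously, a freshly started container would typically poll the status until the restore finishes. Below is a minimal polling sketch; `fetch_status` is a hypothetical callable you supply (for example, one that reads the `status` field from `GET /v1/backups/{backend}/{id}/restore`), and the terminal values `SUCCESS` and `FAILED` reflect my understanding of the backup API and should be checked against your version.

```python
import time
from typing import Callable


def wait_for_backup(
    fetch_status: Callable[[], str],
    poll_seconds: float = 1.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll fetch_status() until it reports a terminal state or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in ("SUCCESS", "FAILED"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"backup still in state {status!r} after timeout")
        time.sleep(poll_seconds)
```

Passing the status lookup in as a function keeps the loop independent of how the request is made, so the same helper can be used for both backup creation and restore.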

Let me know if this helps.

Thanks!
