Using the internal VectorIndex and hnsw structs

Hi,

I was able to get some of the lower-level functions working together to do HNSW vector searches, but I am having trouble figuring out how to get the store working so that I can save to disk and open and close it like a proper database.

I’m not interested in using the Go client functions; I’d rather interact directly with the database interfaces, if that is possible.

I have attached some of the init code I am using to bring up the db, but I am unsure what needs to be called to load from the same exact path after the Flush() and Shutdown() VectorIndex interface functions are called.

Any assistance would be appreciated.
Thank you.

// newStore opens an lsmkv store rooted at rootDir, creating the directory if needed.
func newStore(rootDir string, logger logrus.FieldLogger) (*lsmkv.Store, error) {
	if err := os.MkdirAll(rootDir, 0o755); err != nil {
		return nil, fmt.Errorf("failed to create directory %s: %w", rootDir, err)
	}

	compactCallbacks := cyclemanager.NewCallbackGroup("compact", logger, 1)
	flushCallbacks := cyclemanager.NewCallbackGroup("flush", logger, 1)
	tombstoneCallbacks := cyclemanager.NewCallbackGroup("tombstone", logger, 1)

	store, err := lsmkv.New(
		rootDir, rootDir, logger, nil,
		compactCallbacks, flushCallbacks, tombstoneCallbacks,
	)
	return store, err
}

// initHNSW builds the HNSW index for this table; vectors are looked up
// lazily from the store via VectorForIDThunk.
func (b *BSONTable) initHNSW() error {
	// makeCL lets the index create its own commit logger under hnswPath.
	makeCL := func() (hnsw.CommitLogger, error) {
		return hnsw.NewCommitLogger(b.hnswPath, b.Name, b.logger, cyclemanager.NewCallbackGroup("commitLoggerThunk", b.logger, 1))
	}
	index, err := hnsw.New(hnsw.Config{
		RootPath:              b.hnswPath,
		ID:                    fmt.Sprintf("%s_%s", b.Name, b.VectorField),
		MakeCommitLoggerThunk: makeCL,
		DistanceProvider:      distancer.NewL2SquaredProvider(),
		VectorForIDThunk: func(ctx context.Context, id uint64) ([]float32, error) {
			key := make([]byte, 8)
			binary.LittleEndian.PutUint64(key, id)
			data, err := b.GetRow(key)
			if err != nil {
				return nil, err
			}
			if vec, ok := data[b.VectorField].([]float32); ok {
				return vec, nil
			}
			return nil, fmt.Errorf("vector for ID %d not found", id)
		},
	}, ent.UserConfig{
		CleanupIntervalSeconds: 10,
		VectorCacheMaxObjects:  1000000,
		Distance:               "l2-squared",
		DynamicEFMin:           20,
		DynamicEFMax:           100,
		DynamicEFFactor:        8,
		EFConstruction:         200,
		MaxConnections:         10,
	}, cyclemanager.NewCallbackGroup("cml", b.logger, 1), b.store)

	if err != nil {
		return err
	}
	b.HnswIndex = index
	return nil
}

hi @Matthew_Peterkort !!

Do you mean that after ingesting data into a given collection and restarting your Weaviate, the data is not persisted?

Also, can you share the version and deployment method?

Thanks!

Yes. I am not sure which functions I need to call to load the data from disk, and I suspect I am not configuring the HNSW index correctly to be able to do this.

I’m trying to use the db.VectorIndex interface functions located at “github.com/weaviate/weaviate/adapters/repos/db”

and the hnsw.New() function located at “github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw”.

I was trying to call the hnsw.New() function a second time with the exact same paths that were used the first time, attempting to load the data from disk, but wasn’t getting very far with that.

No deployment method currently; I’m calling the functions directly from the Go source code, using version 1.29.1 of the package.
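To make my intent concrete, the lifecycle I am trying to get working looks roughly like this (a sketch, not compiling code; the exact Flush/Shutdown signatures, and whether the lsmkv store needs its own shutdown, are my assumptions):

```go
// after ingesting vectors:
b.HnswIndex.Flush()       // VectorIndex interface function, per above
b.HnswIndex.Shutdown(ctx) // signature assumed
b.store.Shutdown(ctx)     // assumption: the lsmkv store also wants a clean shutdown

// later, in a fresh process, pointing at the exact same directories:
store, err := newStore(rootDir, logger) // same rootDir as before
if err != nil {
	return err
}
b.store = store
if err := b.initHNSW(); err != nil { // same hnswPath and same Config.ID
	return err
}
// hope: hnsw.New finds and replays the commit log under RootPath/ID
```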

Would greatly appreciate any insight on how to do this, your implementation of hnsw is very fast.

hi @Matthew_Peterkort !!

Ah, I see.

I am not sure I will be able to help you here, as I usually don’t call Weaviate functions at that level, but use a client to interact with it.

For that I will need to escalate with our core team :slight_smile:

I will poke internally. Thanks!

hi @Matthew_Peterkort !!

I got some feedback. It’s definitely not a good idea.

Weaviate was not built to be used as a library, but as a full application in itself, and because of that, there are a lot of other components in play.

So this is not recommended, and it will certainly break at some point in the future.

Is that a hard requirement, or are you experimenting?

Have you tried using Weaviate as a “stand alone” service?

Thanks!

Ok thank you for asking around.

I was hoping to integrate some of the great performance that Weaviate offers into a command-line tool I was building for doing vector searches on genomics data on my local machine.

I understand there is a lot going on under the hood and that this type of use case will never be supported, but architecturally, I’m curious: where in the source code are the higher-level, server-side read, write, delete, and search interface functions defined?

Thank you

Here’s an example for a vector search: weaviate/adapters/repos/db/shard_read.go at 8bb46676ae81dcc77d1f8a6802bff4b4ba5ae442 · weaviate/weaviate · GitHub

And here’s one for a single insert:

In both cases you should be able to follow the various function definitions. There’s a lot to unpack, because for every write there are many auxiliary indexes (inverted index, etc.).

It may actually be easier to treat a shard as a black box and try to run an entire shard in your CLI, but shards are also not necessarily meant to run as stand-alone units, because they assume other shards exist (in the cluster) and will try to reach out to them. Also, the shard in turn depends on the schema.

The way all of this bubbles up, in the end you’d basically be running an entire Weaviate server; maybe the only things you wouldn’t be running are the gRPC/REST and cluster APIs.

Long story short: you’ll definitely have some fun exploring, but as others have said, a lot of the components were built with other components in mind, so it’s not super easy to run something in isolation :wink:

If you are OK with ignoring object storage and inverted indexes and only want to run HNSW, you can check the tests in the HNSW package. They do basically that :wink:
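If it helps while reading those tests: the contract being verified is small. Given a query vector, return the ids with the smallest l2-squared distance (the Distance setting in the UserConfig above). A brute-force reference in plain Go, handy as an oracle for checking HNSW results on small data (names here are mine, not Weaviate’s):

```go
package main

import (
	"fmt"
	"sort"
)

// l2Squared is the same distance metric as the "l2-squared" config setting:
// the sum of squared componentwise differences (no square root).
func l2Squared(a, b []float32) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// bruteForceKNN returns the ids of the k vectors closest to query.
// It is a reference oracle for small datasets, not a fast index.
func bruteForceKNN(vectors map[uint64][]float32, query []float32, k int) []uint64 {
	type hit struct {
		id   uint64
		dist float32
	}
	hits := make([]hit, 0, len(vectors))
	for id, v := range vectors {
		hits = append(hits, hit{id, l2Squared(query, v)})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].dist < hits[j].dist })
	if k > len(hits) {
		k = len(hits)
	}
	ids := make([]uint64, k)
	for i := range ids {
		ids[i] = hits[i].id
	}
	return ids
}

func main() {
	vectors := map[uint64][]float32{
		1: {0, 0}, 2: {1, 1}, 3: {5, 5},
	}
	fmt.Println(bruteForceKNN(vectors, []float32{0.9, 0.9}, 2)) // [2 1]
}
```

An HNSW index should return the same ids here; on larger data it returns approximately the same set, much faster.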


Ok, thank you for your insight. Interesting; I wouldn’t have thought to instantiate the db from a single shard struct. I got sidetracked by some other tasks, but I will explore this strategy further in the future.