Using the internal VectorIndex and hnsw structs

Hi,

I was able to get some of the lower-level functions working together to do HNSW vector searches, but I am having trouble figuring out how to get the store working so that I can save to disk and open and close it like a proper database.

I’m not interested in using the Go client functions; I’d rather interact directly with the database interfaces, if that is possible.

I have attached some of the init code I am using to bring up the db, but I am unsure what needs to be called to load from the same exact path after the Flush() and Shutdown() VectorIndex interface functions are called.

Any assistance would be appreciated.
Thank you.

// newStore opens an lsmkv store rooted at rootDir, creating the directory if needed.
func newStore(rootDir string, logger logrus.FieldLogger) (*lsmkv.Store, error) {
	if err := os.MkdirAll(rootDir, 0o755); err != nil {
		return nil, fmt.Errorf("failed to create directory %s: %w", rootDir, err)
	}

	compactCallbacks := cyclemanager.NewCallbackGroup("compact", logger, 1)
	flushCallbacks := cyclemanager.NewCallbackGroup("flush", logger, 1)
	tombstoneCallbacks := cyclemanager.NewCallbackGroup("tombstone", logger, 1)

	store, err := lsmkv.New(
		rootDir, rootDir, logger, nil,
		compactCallbacks, flushCallbacks, tombstoneCallbacks,
	)
	return store, err
}

// initHNSW builds the HNSW index for this table; vectors are looked up
// lazily from the store via VectorForIDThunk.
func (b *BSONTable) initHNSW() error {
	// makeCL lets the index create its own commit logger under hnswPath.
	makeCL := func() (hnsw.CommitLogger, error) {
		return hnsw.NewCommitLogger(b.hnswPath, b.Name, b.logger, cyclemanager.NewCallbackGroup("commitLoggerThunk", b.logger, 1))
	}
	index, err := hnsw.New(hnsw.Config{
		RootPath:              b.hnswPath,
		ID:                    fmt.Sprintf("%s_%s", b.Name, b.VectorField),
		MakeCommitLoggerThunk: makeCL,
		DistanceProvider:      distancer.NewL2SquaredProvider(),
		VectorForIDThunk: func(ctx context.Context, id uint64) ([]float32, error) {
			key := make([]byte, 8)
			binary.LittleEndian.PutUint64(key, id)
			data, err := b.GetRow(key)
			if err != nil {
				return nil, err
			}
			if vec, ok := data[b.VectorField].([]float32); ok {
				return vec, nil
			}
			return nil, fmt.Errorf("vector for ID %d not found", id)
		},
	}, ent.UserConfig{
		CleanupIntervalSeconds: 10,
		VectorCacheMaxObjects:  1000000,
		Distance:               "l2-squared",
		DynamicEFMin:           20,
		DynamicEFMax:           100,
		DynamicEFFactor:        8,
		EFConstruction:         200,
		MaxConnections:         10,
	}, cyclemanager.NewCallbackGroup("cml", b.logger, 1), b.store)

	if err != nil {
		return err
	}
	b.HnswIndex = index
	return nil
}

hi @Matthew_Peterkort !!

Do you mean that after ingesting data into a given collection and restarting your Weaviate, the data is not persisted?

Also, can you share the version and deployment method?

Thanks!

Yes. I am not sure which functions I need to call to load the data from disk, and I suspect I am not configuring the HNSW index correctly to be able to do this.

I’m trying to use the db.VectorIndex interface functions located at “github.com/weaviate/weaviate/adapters/repos/db”

and the hnsw.New() function located at “github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw”.

I was trying to call the hnsw.New() function a second time with the exact same paths that were used the first time, attempting to load the data from disk, but wasn’t getting very far with that.

No deployment method currently; I’m calling the functions directly from the Go source code, using version 1.29.1 of the package.
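To make my intent concrete, the lifecycle I am trying to get working looks roughly like this (a sketch, not compiling code; the exact Flush/Shutdown signatures, and whether the lsmkv store needs its own shutdown, are my assumptions):

```go
// after ingesting vectors:
b.HnswIndex.Flush()       // VectorIndex interface function, per above
b.HnswIndex.Shutdown(ctx) // signature assumed
b.store.Shutdown(ctx)     // assumption: the lsmkv store also wants a clean shutdown

// later, in a fresh process, pointing at the exact same directories:
store, err := newStore(rootDir, logger) // same rootDir as before
if err != nil {
	return err
}
b.store = store
if err := b.initHNSW(); err != nil { // same hnswPath and same Config.ID
	return err
}
// hope: hnsw.New finds and replays the commit log under RootPath/ID
```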

Would greatly appreciate any insight on how to do this, your implementation of hnsw is very fast.

hi @Matthew_Peterkort !!

Ah, I see.

I am not sure I will be able to help you here, as I usually don’t call Weaviate functions at that level, but use a client to interact with it.

For that I will need to escalate with our core team :slight_smile:

I will poke internally. Thanks!

hi @Matthew_Peterkort !!

I got some feedback. It’s definitely not a good idea.

Weaviate was not built to be used as a library, but as a full application in itself, and because of that, there are a lot of other components in play.

So this is not recommended, and it will certainly break at some point in the future.

Is that a hard requirement, or are you experimenting?

Have you tried using Weaviate as a “stand alone” service?

Thanks!

Ok thank you for asking around.

I was hoping to integrate some of the great performance that Weaviate offers into a command-line tool I was building for doing vector searches on genomics data on my local machine.

I understand there is a lot going on under the hood and that this type of use case will never be supported, but architecturally, I’m curious: where in the source code are the higher-level, server-side read, write, delete, and search interface functions defined?

Thank you

Here’s an example for a vector search: weaviate/adapters/repos/db/shard_read.go at 8bb46676ae81dcc77d1f8a6802bff4b4ba5ae442 · weaviate/weaviate · GitHub

And here’s one for a single insert:

In both cases you should be able to follow the various function definitions. There’s a lot to unpack, because for every write there are many auxiliary indexes (inverted index, etc.).

It may actually be easier to treat a shard as a black box and try to run an entire shard in your CLI, but shards are also not necessarily meant to run as stand-alone units, because they assume other shards exist (in the cluster) and will try to reach out to them. Also, the shard in turn depends on the schema.

The way all of this bubbles up, in the end you’d basically be running an entire Weaviate server; maybe the only things you wouldn’t be running are the gRPC/REST and cluster APIs.

Long story short: you’ll definitely have some fun exploring, but as others have said, a lot of the components were built with other components in mind, so it’s not super easy to run something in isolation :wink:

If you are OK with ignoring object storage and inverted indexes and only want to run HNSW, you can check the tests in the HNSW package. They do basically that :wink:
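If it helps while reading those tests: the contract being verified is small. Given a query vector, return the ids with the smallest l2-squared distance (the Distance setting in the UserConfig above). A brute-force reference in plain Go, handy as an oracle for checking HNSW results on small data (names here are mine, not Weaviate’s):

```go
package main

import (
	"fmt"
	"sort"
)

// l2Squared is the same distance metric as the "l2-squared" config setting:
// the sum of squared componentwise differences (no square root).
func l2Squared(a, b []float32) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// bruteForceKNN returns the ids of the k vectors closest to query.
// It is a reference oracle for small datasets, not a fast index.
func bruteForceKNN(vectors map[uint64][]float32, query []float32, k int) []uint64 {
	type hit struct {
		id   uint64
		dist float32
	}
	hits := make([]hit, 0, len(vectors))
	for id, v := range vectors {
		hits = append(hits, hit{id, l2Squared(query, v)})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].dist < hits[j].dist })
	if k > len(hits) {
		k = len(hits)
	}
	ids := make([]uint64, k)
	for i := range ids {
		ids[i] = hits[i].id
	}
	return ids
}

func main() {
	vectors := map[uint64][]float32{
		1: {0, 0}, 2: {1, 1}, 3: {5, 5},
	}
	fmt.Println(bruteForceKNN(vectors, []float32{0.9, 0.9}, 2)) // [2 1]
}
```

An HNSW index should return the same ids here; on larger data it returns approximately the same set, much faster.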


Ok, thank you for your insight. Interesting; I wouldn’t have thought to instantiate the db from a single shard struct. I got sidetracked by some other tasks, but I will explore this strategy further in the future.