Distance doesn't correspond to projected points

Hello there,

After I resolved my previous question (“Visualize Database Contents”), I plotted the data using Python's matplotlib library and… let’s say I am a tiny bit confused.

I understand that there might be some losses when projecting something from n dimensions to 3.
But I cannot answer 2 questions:

  1. What ANN algorithm is used? Is it Euclidean?
  2. Can I change the ANN algorithm to other (built-in) functions?

EDIT: The interesting behavior is that 7 is nearer than everything else but only second nearest. 10 should be second nearest, but is 6. (Expected was 1 and 2 to be right near the query which is marked as a red x)

EDIT 2: I just realized that while the distance is always deterministic - the projection is not… I generated the same response several times and realized that the left values remain constant (as expected) but the representation changes, which means that while the distance is the same, the vectors are different. Sometimes the result looks good on the right and sometimes it doesn’t… This might be a huge design flaw if my understanding is right. It doesn’t change anything for queries, though.

EDIT 3: I also generated a 2D view and increased the iterations, as this is supposed to “[…] lead to more stable results […]”.

The same result described in the first EDIT can be seen.

Thanks in advance 🙂

Hi @micartey,

The algorithm used for the feature projection feature is t-SNE, via the go-tsne library (GitHub: danaugrs/go-tsne, t-Distributed Stochastic Neighbor Embedding (t-SNE) in Go).

More details are provided on Laurens van der Maaten's t-SNE page, including why the projection is different for each query. Quoting that page:

Every time I run t-SNE, I get a (slightly) different result?

In contrast to, e.g., PCA, t-SNE has a non-convex objective function. The objective function is minimized using a gradient descent optimization that is initiated randomly. As a result, it is possible that different runs give you different solutions. Notice that it is perfectly fine to run t-SNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.
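
To make that suggestion concrete, here is a minimal sketch using scikit-learn's TSNE rather than go-tsne, so it is not what Weaviate runs internally. The random `vectors` array is placeholder data standing in for real embeddings, and the seed range and perplexity are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 40 random 16-dimensional vectors standing in for real embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(40, 16))

runs = []
for seed in range(5):
    # Random initialisation, so each seed can converge to a different layout.
    tsne = TSNE(n_components=2, perplexity=10.0, init="random", random_state=seed)
    points = tsne.fit_transform(vectors)
    # kl_divergence_ is the final value of the objective for this run.
    runs.append((tsne.kl_divergence_, seed, points))

# Keep the run with the lowest objective value, as the FAQ suggests.
best_kl, best_seed, best_points = min(runs, key=lambda run: run[0])
print(f"best seed: {best_seed}, KL divergence: {best_kl:.4f}")
```

Fixing `random_state` also makes a single run reproducible, which should help when comparing plots of the same query.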

By ANN algorithm, I think you are referring to the distance metric. From looking at the go-tsne library, it seems to assume Euclidean distance and is not configurable. There is a scikit-learn t-SNE implementation which does have a metric parameter you could test with your data (see the sklearn.manifold.TSNE page in the scikit-learn 1.3.2 documentation).
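
As an illustration of why the metric matters, here is a small standard-library sketch with made-up 2D vectors. It assumes the search ranks results by cosine distance, which is an assumption about the setup and not something stated in this thread, while the projection works from Euclidean distance as noted above.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors: "a" points almost the same way as the query but is much longer,
# "b" is close in space but points in a noticeably different direction.
query = (1.0, 0.0)
candidates = {"a": (10.0, 0.5), "b": (0.9, 0.5)}

by_cosine = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
by_euclid = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))

print(by_cosine)  # ['a', 'b']
print(by_euclid)  # ['b', 'a']
```

If the two metrics disagree like this on the real data, the ordering by distance and the ordering in the plot can differ even before any loss from the projection itself.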

Hi @trengrj

Thank you for the response.
That solves half of my problems and helps me with further understanding.

Do you happen to know why the distance doesn’t correspond with the projection at all?
When running the same query several times, it looks different each time, but mainly just from an angle (it rotates). But the distance on the left and the points on the right have nothing to do with each other…

Could it be that the query is not at (0, 0) but somewhere else in space? If that is the case, how do I get the (projected) vector of my query?

Yes, that could be the issue. As featureProjection is an _additional property, it will only be returned for each object in Weaviate and not for the vector supplied to nearVector.

One workaround for this could be to use nearObject instead of nearVector. In this case the original vector / object will usually be returned in the results list (as it will be the closest vector to itself).

I am actually using nearText. Is there a way to get the query vector from within the result (e.g. as an _additional property)? I am using the Cohere model, so I am not really able to get the vector that is being calculated.