HNSW connect components can take an inordinate amount of time #14214
This can be reproduced with either of the following tests:

```java
public void testSameVectorIndexedMultipleTimes() throws IOException {
  try (Directory d = newDirectory()) {
    float[] vector = new float[16];
    Arrays.fill(vector, 0.5f);
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      for (int j = 1; j <= 1_000_000; j++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", vector, DOT_PRODUCT));
        w.addDocument(doc);
        if (j % 100 == 0) {
          w.flush();
        }
      }
      w.commit();
    }
  }
}

public void testFewDistinctVectors() throws IOException {
  try (Directory d = newDirectory()) {
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      float[][] f = new float[16][];
      for (int i = 0; i < 16; i++) {
        f[i] = new float[16];
        Arrays.fill(f[i], (float) i);
      }
      for (int i = 0; i < 1_000_000; i++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", f[random().nextInt(16)], DOT_PRODUCT));
        w.addDocument(doc);
      }
    }
  }
}
```

These tests take tens of minutes, and a thread dump shows threads stuck in connect components, whereas a similar test with random vectors finishes in roughly a tenth of the time.
I see 3 possible problems when the vector distribution is poor, one being **inefficient component exploration**.
Previously related PR: #12770. While my original change helped move us towards saner HNSW search behavior, it will still explore a candidate whose score is `==` the minimum accepted score. This devolves in the degenerate case where all vectors are the same. This change adjusts the minimum required candidate score with `Math.nextUp`, similar to TopScoreDocCollector. Related to (but doesn't fully solve): #14214
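For illustration, here is a minimal sketch of the comparison change described above (names are illustrative, not the actual Lucene internals):

```java
// Sketch only: `minAcceptedScore` stands for the score of the worst entry
// currently kept in the result queue.
static boolean shouldExplore(float candidateScore, float minAcceptedScore) {
  // Math.nextUp returns the smallest float strictly greater than its
  // argument, so candidates whose score is exactly equal to the minimum
  // (the all-duplicates degenerate case) are no longer explored, matching
  // TopScoreDocCollector's behavior.
  return candidateScore >= Math.nextUp(minAcceptedScore);
}
```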
So, verifying the "fewDistinct" slowness, here is how connect components works in this adverse case:
So, for a total of 39139 vectors, we end up with 36597 components to connect (so, almost every node is its own component). This then requires on average 9k vector ops to connect each component (worst case was 18056).
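Back-of-the-envelope, that is roughly 36,597 × 9,000 ≈ 330 million vector comparisons spent on connect-components fix-up alone, for a segment of only ~39k vectors.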
Interesting, let me quickly run those tests myself as well to see what the impact would be. Thanks for the logs.
So, these adverse scenarios where connect components has to do a ton of work all stem from us keeping the graph very sparse (e.g. only connecting diverse nodes). I wonder if we can augment our building algorithm per node: when we detect a highly clustered area for a node (e.g. the majority or all of the beam_width candidates' scores are within 1e-7 or so), we force MORE connections, bypassing the typical diversity forcing on forward connections. I am not sure we need to change the backlink behavior, but maybe we need to do that as well? The idea I have in mind is basically similar to the "delaunay type" options in nmslib: https://github.com/nmslib/nmslib/blob/2ae5378027a107474a952edae1e1c2dc2df941d2/similarity_search/src/method/hnsw.cc However, we wouldn't allow it to be configured; we would just pick the right one (per node) based on the score distribution.
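A minimal sketch of what that per-node detection could look like, assuming we can see the beam-width candidate scores while inserting a node (all names hypothetical):

```java
// Hypothetical sketch: decide per node whether its candidate scores are so
// tightly clustered that diversity pruning would discard nearly everything.
static boolean isClustered(float[] beamCandidateScores, float epsilon) {
  float min = Float.POSITIVE_INFINITY;
  float max = Float.NEGATIVE_INFINITY;
  for (float s : beamCandidateScores) {
    min = Math.min(min, s);
    max = Math.max(max, s);
  }
  // e.g. epsilon = 1e-7f: all scores are effectively indistinguishable
  return max - min < epsilon;
}
```

When `isClustered` returns true for a node, the builder would bypass the diversity heuristic on that node's forward connections and keep the raw nearest candidates, so the clustered region stays densely connected and connect components has little left to repair.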
In the original diversity impl we had allowed the neighbors array to fill without regard to any diversity criterion, and only started imposing it once the array was full. This meant we had to sort at that point, IIRC, but it might be a better, more robust choice?
I am not sure if it's "better"; I would assume it makes search slower on well-distributed, distinguishable vectors :/ We should be able to automatically do the right thing, as we know the distribution of the beam-width scores. Meaning: keep the sparsity for well-distinguished vectors & handle silly (clustered/duplicate) data.
Sorry, I didn't understand this. In the normal case, we'd populate the neighbors with non-diverse elements, and then once full, we impose the diversity criterion when adding new neighbors. That is: if an existing neighbor is non-diverse, we prefer a new neighbor that is diverse even if its score is more "distant".
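A rough sketch of that lazy-diversity behavior (hypothetical names, not the actual Lucene code; `isDiverse` stands in for HNSW's diversity test, i.e. whether a node is closer to the target than to any already-selected neighbor):

```java
import java.util.List;
import java.util.function.IntPredicate;

// Sketch: accept neighbors freely below capacity; only impose the diversity
// criterion once the array is full.
static void addNeighbor(
    List<Integer> neighbors, int candidate, int maxConn, IntPredicate isDiverse) {
  if (neighbors.size() < maxConn) {
    neighbors.add(candidate); // below capacity: no diversity check at all
    return;
  }
  // Full: prefer a diverse candidate over an existing non-diverse neighbor,
  // even if the candidate's score is more "distant".
  if (isDiverse.test(candidate)) {
    for (int i = 0; i < neighbors.size(); i++) {
      if (!isDiverse.test(neighbors.get(i))) {
        neighbors.set(i, candidate);
        return;
      }
    }
  }
}
```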
@msokolov I mean that during search, we would by default have more neighbors to look at. Ensuring diversity eagerly means that it's possible to have fewer than 32 connections (with m==16), thus allowing for faster search.
@benwtrent As far as I understand, your idea is to use Delaunay triangulation and skip connectComponents()?
I see, yes, that's true.
Description
Connect components on flush or merge, while good for graphs that are "almost OK" but need to be better connected, can simply destroy performance if the vector distribution is poor.
I don't readily have test data, but if you have tightly clustered vectors, or many duplicate vectors, it can take until the "heat death of the universe" to complete.
It seems to me that, since connect components is a "best effort fix-up" of the graph, we should add a "cap" on the amount of work it does.
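A minimal sketch of such a cap (hypothetical names, not Lucene's actual API):

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Sketch: bound the total work connect components may spend; once the budget
// is exhausted, stop and leave the remaining components unconnected.
static <C> void connectComponentsWithCap(
    List<C> components, ToLongFunction<C> connectAndCountOps, long maxOps) {
  long used = 0;
  for (C component : components) {
    if (used >= maxOps) {
      // Best effort: recall may degrade slightly, but build time stays
      // bounded instead of taking a CPU hostage for hours.
      return;
    }
    used += connectAndCountOps.applyAsLong(component);
  }
}
```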
Marking as a bug, as I have seen this hit real users and it effectively takes a CPU hostage for hours (maybe days).
Version and environment details
No response