
HNSW connect components can take an inordinate amount of time #14214

Open
benwtrent opened this issue Feb 7, 2025 · 11 comments

@benwtrent
Member

Description

Connect components on flush or merge, while good for graphs that are "almost OK" but need better connectivity, can destroy performance if the vector distribution is poor.

I don't readily have test data, but if you have tightly clustered or duplicate vectors, it can take until the "heat death of the universe" to complete.

It seems to me that since connectComponents is a "best effort" fix-up of the graph, we should put a cap on the amount of work it does (see the sketch below).

Marking this as a bug because I have seen it happen to real users, and it effectively takes a CPU hostage for hours (maybe days).
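
A rough sketch of what such a cap could look like (all names here are illustrative, not actual Lucene API):

import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical budgeted fix-up: stop reconnecting components once we have
// spent a fixed number of vector operations.
final class CappedFixUp<C> {
  private final long maxVectorOps; // illustrative budget, not an existing setting

  CappedFixUp(long maxVectorOps) {
    this.maxVectorOps = maxVectorOps;
  }

  // connectOne performs the (expensive) search + link for one component and
  // returns how many vector operations it spent. Returns true if every
  // component was processed, false if we bailed out early.
  boolean run(List<C> components, ToLongFunction<C> connectOne) {
    long spent = 0;
    for (C component : components) {
      if (spent >= maxVectorOps) {
        return false; // leave the rest disconnected rather than hold the CPU hostage
      }
      spent += connectOne.applyAsLong(component);
    }
    return true;
  }
}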

Version and environment details

No response

@tteofili
Contributor

tteofili commented Feb 7, 2025

This can be reproduced with either of the following tests:

public void testSameVectorIndexedMultipleTimes() throws IOException {
  try (Directory d = newDirectory()) {
    float[] vector = new float[16];
    Arrays.fill(vector, 0.5f);
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      for (int j = 1; j <= 1_000_000; j++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", vector, DOT_PRODUCT));
        w.addDocument(doc);
        if (j % 100 == 0) {
          w.flush();
        }
      }
      w.commit();
    }
  }
}

public void testFewDistinctVectors() throws IOException {
  try (Directory d = newDirectory()) {
    try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
      float[][] f = new float[16][];
      for (int i = 0; i < 16; i++) {
        f[i] = new float[16];
        Arrays.fill(f[i], (float) i);
      }
      for (int i = 0; i < 1_000_000; i++) {
        Document doc = new Document();
        doc.add(getKnnVectorField("field", f[random().nextInt(16)], DOT_PRODUCT));
        w.addDocument(doc);
      }
    }
  }
}

They take tens of minutes, and a thread dump shows the thread stuck in hnsw.HnswGraphBuilder.connectComponents.

A similar test with random vectors finishes roughly 10x faster.

@Vikasht34

// Sketch of the current behavior as I understand it; exploreComponent,
// findBestCandidate, and connectNodes stand in for the real logic.
private void connectComponents() {
    BitSet visited = new BitSet(graph.size());
    List<Integer> entryPoints = new ArrayList<>();

    for (int node = 0; node < graph.size(); node++) {
        if (!visited.get(node)) {
            // Walk the entire component reachable from this node
            List<Integer> component = new ArrayList<>();
            exploreComponent(node, visited, component);

            if (!entryPoints.isEmpty()) {
                // Search for the best node to attach this component to
                int bestCandidate = findBestCandidate(entryPoints, component);
                connectNodes(entryPoints.get(0), bestCandidate);
            }

            entryPoints.add(component.get(0)); // first node acts as the entry point
        }
    }
}

I see three reasons why this could be slow when the vector distribution is poor:

**Inefficient Component Exploration (exploreComponent)**

  • Recursively visits every unconnected node, leading to O(N²) complexity in worst-case scenarios.
  • Causes excessive CPU utilization when dealing with large numbers of unconnected components.

**Unbounded Connection Attempts (findBestCandidate)**

  • The method tries to optimally connect every component, which becomes extremely expensive for highly similar vectors.
  • Leads to severe slowdowns when vector clusters are densely packed or contain duplicates.

**Repeated Work (connectNodes)**

  • If multiple small components exist, the function makes many unnecessary connections.
  • This results in high CPU overhead, because the method attempts to fully optimize the graph even when minimal connectivity would suffice.

Capping is a quick fix to prevent runaway execution in connectComponents(), but it does not solve the root problem and remains computationally expensive: the function still performs O(N²) connectivity checks and simply gets forcefully terminated instead of completing properly. Most of the CPU time is still wasted on redundant checks, and the graph may remain unoptimized or disconnected.

What do you think of using union-find to solve this?

We replace the brute-force merging with Union-Find, which tracks components dynamically:

  1. Reduces Complexity to O(N log N)

    • Union-Find efficiently tracks connected components, avoiding repeated work.
  2. Eliminates Unnecessary Checks

    • Instead of checking every node again, we only merge truly disconnected components.
  3. Stops Early When Graph is Mostly Connected

    • We stop merging early if 95% of the graph is already connected, preventing wasted work.

Some pseudocode:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HnswGraphConnectivity {
    private final int[] parent;
    private final int[] rank;
    private final int size;

    public HnswGraphConnectivity(int size) {
        this.size = size;
        parent = new int[size];
        rank = new int[size];
        for (int i = 0; i < size; i++) parent[i] = i; // Each node starts in its own component
    }

    // Path compression: makes find() effectively O(1) amortized (inverse Ackermann)
    public int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]); 
        return parent[x];
    }

    // Union by rank: Ensures efficient merging
    public void union(int x, int y) {
        int rootX = find(x);
        int rootY = find(y);
        if (rootX != rootY) {
            if (rank[rootX] > rank[rootY]) parent[rootY] = rootX;
            else if (rank[rootX] < rank[rootY]) parent[rootX] = rootY;
            else {
                parent[rootY] = rootX;
                rank[rootX]++;
            }
        }
    }

    public boolean isConnected(int x, int y) {
        return find(x) == find(y);
    }

    // Returns size of the largest connected component
    public int largestComponentSize() {
        Map<Integer, Integer> componentSize = new HashMap<>();
        for (int i = 0; i < size; i++) {
            int root = find(i);
            componentSize.put(root, componentSize.getOrDefault(root, 0) + 1);
        }
        return Collections.max(componentSize.values());
    }
}

// The builder's connectComponents, rewritten as a sketch on top of the
// union-find helper above.
private boolean connectComponents(int level) throws IOException {
        int graphSize = hnsw.size();
        HnswGraphConnectivity connectivity = new HnswGraphConnectivity(graphSize);
        
        // Step 1: Initialize connectivity tracking
        FixedBitSet notFullyConnected = new FixedBitSet(graphSize);
        int maxConn = (level == 0) ? M * 2 : M;
        List<Component> components = HnswUtil.components(hnsw, level, notFullyConnected, maxConn);

        if (infoStream.isEnabled("HNSW")) {
            infoStream.message("HNSW", "Found " + components.size() + " components on level=" + level);
        }

        // Step 2: Record the existing intra-component edges.
        // Union-find with path compression mutates shared state, so this
        // must run sequentially rather than from a parallel stream.
        for (Component c : components) {
            for (int neighbor : c.nodes()) {
                if (neighbor != c.start()) {
                    connectivity.union(c.start(), neighbor);
                }
            }
        }

        // Step 3: Stop early if graph is sufficiently connected (~95%)
        if (connectivity.largestComponentSize() > (CONNECTIVITY_THRESHOLD_PERCENT / 100.0) * graphSize) {
            if (infoStream.isEnabled("HNSW")) {
                infoStream.message("HNSW", "Early stopping: " + CONNECTIVITY_THRESHOLD_PERCENT + "% of components are connected.");
            }
            return true;
        }

        // Step 4: Connect remaining components intelligently
        GraphBuilderKnnCollector beam = new GraphBuilderKnnCollector(2);
        int[] eps = new int[1];
        UpdateableRandomVectorScorer scorer = scorerSupplier.scorer();

        for (Component c : components) {
            if (!connectivity.isConnected(c.start(), eps[0])) {
                beam.clear();
                eps[0] = c.start();
                scorer.setScoringOrdinal(eps[0]);
                
                // Find the best connection candidate
                try {
                    graphSearcher.searchLevel(beam, scorer, level, eps, hnsw, notFullyConnected);
                    int bestNode = beam.popNode();
                    float score = beam.minimumScore();
                    link(level, bestNode, c.start(), score, notFullyConnected);
                    connectivity.union(bestNode, c.start());
                } catch (Exception e) {
                    if (infoStream.isEnabled("HNSW")) {
                        infoStream.message("HNSW", "Failed to connect component: " + c.start());
                    }
                }
            }
        }

        return true;
    }

benwtrent added a commit that referenced this issue Feb 11, 2025
previously related PR: #12770

While my original change helped move us towards saner HNSW search behavior, it will still explore a candidate if its score is `==` the minimum accepted score. This devolves in the degenerate case where all vectors are the same.

This change adjusts the minimum required candidate score using `Math.nextUp`, similar to TopScoreDocCollector.
related to (but doesn't fully solve): #14214
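
In effect (an illustrative sketch, not the actual patch):

// Require a candidate to strictly beat the current minimum accepted score,
// so that identical scores (e.g. duplicate vectors) are not endlessly
// re-explored. Math.nextUp(min) is the smallest float greater than min.
static boolean shouldExplore(float candidateScore, float minAcceptedScore) {
  return candidateScore >= Math.nextUp(minAcceptedScore);
}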
@benwtrent
Member Author

So, verifying the "fewDistinct" slowness, here is how connectComponents behaves in this adverse case:

 1> HNSW 1 [2025-02-12T20:14:45.641640Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: connect 36597 max component size=2543 min component_size=1 avg component size=1.06945924529333 components on level=0 notFullyConnected=39137 numNodesOnLevel=39139
  1> HNSW 1 [2025-02-12T20:15:00.406824Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: vectorOps: max=18056 min=4 std=5352.391988634614 avg=9055.092
  1> HNSW 1 [2025-02-12T20:15:00.408499Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: connect 2294 max component size=189 min component size=1 avg component size=1.0819529206625982 components on level=1 notFullyConnected=2480 numNodesOnLevel=2482
  1> HNSW 1 [2025-02-12T20:15:00.515135Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: vectorOps: max=1677 min=11 std=517.4268668130792 avg=871.3201
  1> HNSW 1 [2025-02-12T20:15:00.515518Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: connect 117 max component size=35 min component size=1 avg component size=1.2905982905982907 components on level=2 notFullyConnected=150 numNodesOnLevel=151
  1> HNSW 1 [2025-02-12T20:15:00.516252Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: vectorOps: max=116 min=26 std=42.01213116516541 avg=56.77587
  1> HNSW 1 [2025-02-12T20:15:00.516475Z; TEST-TestKnnFloatVectorQuery.testFewDistinctVectors-seed#[29F3C0E3B32A8A0D]]: connect 1 max component size=8 min component size=8 avg component size=8.0 components on level=3 notFullyConnected=8 numNodesOnLevel=8

So, for a total of 39139 vectors, we end up with 36597 components to connect (almost every node is its own component). Connecting each component then requires about 9k vector ops on average (worst case was 18056), i.e. on the order of 36597 × 9055 ≈ 331 million vector comparisons just for level 0.

@Vikasht34

Interesting, let me quickly run those tests myself to see what the impact would be. Thanks for the logs!

@benwtrent
Member Author

So, these adverse scenarios where connectComponents has to do a ton of work all stem from us keeping the graph very sparse (e.g. only connecting diverse nodes).

I wonder if we can augment our building algorithm per node: when we detect a highly clustered area around a node (e.g. a majority or all of the beam_width candidate scores are within 1e-7 of each other), we force MORE connections, bypassing the typical diversity check on forward connections.

I am not sure we need to change the backlink behavior, but maybe we need to handle that as well?

The idea I have in mind is basically similar to the "Delaunay"-type options here: https://github.com/nmslib/nmslib/blob/2ae5378027a107474a952edae1e1c2dc2df941d2/similarity_search/src/method/hnsw.cc

However, we wouldn't allow it to be configured; we would just pick the right one (per node) based on the score distributions.
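
Roughly, the detection could look like this (the 1e-7 threshold and the majority cutoff are placeholders, not tuned values):

// Hypothetical heuristic: if a majority of the beam_width candidate scores
// are within 1e-7 of the best score, diversity pruning cannot distinguish
// them, so the builder would relax it and keep more connections for this node.
static boolean isHighlyClustered(float[] candidateScores) {
  if (candidateScores.length < 2) {
    return false;
  }
  float best = Float.NEGATIVE_INFINITY;
  for (float s : candidateScores) {
    best = Math.max(best, s);
  }
  int nearTies = 0;
  for (float s : candidateScores) {
    if (best - s <= 1e-7f) {
      nearTies++;
    }
  }
  return nearTies * 2 >= candidateScores.length; // majority are effective ties
}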

@msokolov
Contributor

In the original diversity impl we allowed the neighbors array to fill without regard to any diversity criterion, and only started imposing it once the array was full. This means we have to sort at that point IIRC, but it might be a better, more robust choice?

@benwtrent
Member Author

This means we have to sort at that point IIRC, but it might be a better, more robust choice?

I am not sure it's "better"; I would assume it makes search slower on well-distributed, well-distinguished vectors :/

We should be able to automatically do the right thing, since we know the distribution of the beam-width scores: keep the graph sparse for well-distinguished vectors while still handling silly (clustered/duplicate) data.

@msokolov
Contributor

I am not sure it's "better"; I would assume it makes search slower on well-distributed, well-distinguished vectors :/

Sorry, I didn't understand this. In the normal case, we'd populate the neighbors with non-diverse elements, and then once full, we'd impose the diversity criterion when adding new neighbors. That is: if an existing neighbor is non-diverse, we prefer a new neighbor that is diverse, even if its score is more "distant".
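
Roughly, in pseudocode (the types and the diversity predicate are illustrative, not the actual implementation):

import java.util.ArrayList;
import java.util.List;

// Sketch of deferred diversity: fill the neighbor list freely, and only once
// it is full prefer evicting an existing non-diverse neighbor for a new one.
final class DeferredDiversityNeighbors {
  interface DiversityCheck {
    boolean isDiverse(int node);
  }

  private final int maxSize;
  private final List<Integer> neighbors = new ArrayList<>();

  DeferredDiversityNeighbors(int maxSize) {
    this.maxSize = maxSize;
  }

  void add(int candidate, DiversityCheck check) {
    if (neighbors.size() < maxSize) {
      neighbors.add(candidate); // below capacity: no diversity criterion yet
      return;
    }
    // Full: impose the criterion now (the real code would also need to sort
    // by score at this point, per the comment above). Prefer replacing an
    // existing non-diverse neighbor, even if the new candidate scores worse.
    for (int i = 0; i < neighbors.size(); i++) {
      if (!check.isDiverse(neighbors.get(i))) {
        neighbors.set(i, candidate);
        return;
      }
    }
  }
}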

@benwtrent
Member Author

Sorry, I didn't understand this.

@msokolov I mean that during search, we would by default have more neighbors to look at. Ensuring diversity eagerly means it's possible to have fewer than 32 connections (with m == 16), thus allowing for faster search.

@Vikasht34

@benwtrent As far as I understand, your idea is to use Delaunay-style triangulation and skip connectComponents()?

@msokolov
Contributor

I mean that during search, we would by default have more neighbors to look at.

I see, yes, that's true.
