README.md
+45-37
@@ -1,34 +1,22 @@
1
1
# Hnswlib - fast approximate nearest neighbor search
2
-
Header-only C++ HNSW implementation with python bindings.
2
+
Header-only C++ HNSW implementation with python bindings, insertions and updates.
3
3
4
4
**NEWS:**
5
5
6
+
**version 0.7.0**
6
7
7
-
**version 0.6.2**
8
-
9
-
* Fixed a bug in saving of large pickles. The pickles with > 4GB could have been corrupted. Thanks Kai Wohlfahrt for reporting.
10
-
* Thanks to ([@GuyAv46](https://github.com/GuyAv46)) hnswlib inner product is now more consistent across architectures (SSE, AVX, etc).
11
-
*
12
-
13
-
**version 0.6.1**
14
-
15
-
* Thanks to ([@tony-kuo](https://github.com/tony-kuo)) hnswlib AVX512 and AVX builds are not backwards-compatible with older SSE and non-AVX512 architectures.
16
-
* Thanks to ([@psobot](https://github.com/psobot)) there is now a sensible message instead of a segfault when passing a scalar to get_items.
17
-
* Thanks to ([@urigoren](https://github.com/urigoren)) hnswlib has a lazy index creation python wrapper.
18
-
19
-
**version 0.6.0**
20
-
* Thanks to ([@dyashuni](https://github.com/dyashuni)) hnswlib now uses github actions for CI, there is a search speedup in some scenarios with deletions. `unmark_deleted(label)` is now also a part of the python interface (note now it throws an exception for double deletions).
21
-
* Thanks to ([@slice4e](https://github.com/slice4e)) we now support AVX512; thanks to ([@LTLA](https://github.com/LTLA)) the cmake interface for the lib is now updated.
22
-
* Thanks to ([@alonre24](https://github.com/alonre24)) we now have Python bindings for brute-force (and examples for recall tuning: [TESTING_RECALL.md](TESTING_RECALL.md)).
23
-
* Thanks to ([@dorosy-yeong](https://github.com/dorosy-yeong)) a bug in the handling of large quantities of deleted elements and large K is fixed.
24
-
25
-
8
+
* Added support for filtering (#402, #430) by [@kishorenc](https://github.com/kishorenc)
9
+
* Added python interface for filtering (though note its performance is limited by GIL) (#417) by [@gtsoukas](https://github.com/gtsoukas)
10
+
* Added support for replacing the elements that were marked as deleted with newly inserted elements (to control the size of the index, #418) by [@dyashuni](https://github.com/dyashuni)
11
+
* Fixed data races/deadlocks in updates/insertion, added stress test for multithreaded operation (#418) by [@dyashuni](https://github.com/dyashuni)
* global linkages (#383) by [@MasterAler](https://github.com/MasterAler), USE_SSE usage in MSVC (#408) by [@alxvth](https://github.com/alxvth)
26
14
27
15
28
16
### Highlights:
29
17
1) Lightweight, header-only, no dependencies other than C++11
30
-
2) Interfaces for C++, Java, Python and R (https://github.com/jlmelville/rcpphnsw).
31
-
3) Has full support for incremental index construction. Has support for element deletions
18
+
2) Interfaces for C++ and Python; external support for Java and R (https://github.com/jlmelville/rcpphnsw).
19
+
3) Has full support for incremental index construction and updating the elements. Has support for element deletions
32
20
(by marking them in index). Index is picklable.
33
21
4) Can work with custom user defined distances (C++).
34
22
5) Significantly smaller memory footprint and faster build time compared to the current nmslib implementation.
@@ -50,37 +38,42 @@ Note that inner product is not an actual metric. An element can be closer to som
50
38
51
39
For other spaces use the nmslib library https://github.com/nmslib/nmslib.
52
40
53
-
#### Short API description
41
+
#### API description
54
42
* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.
55
43
56
44
`hnswlib.Index` methods:
57
-
* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index with no elements.
45
+
* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100, allow_replace_deleted = False)` initializes the index with no elements.
58
46
* `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
59
47
* `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
60
48
* `M` defines the maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).
49
+
* `allow_replace_deleted` enables replacing deleted elements with newly added ones.
61
50
62
-
* `add_items(data, ids, num_threads = -1)` - inserts the `data` (numpy array of vectors, shape:`N*dim`) into the structure.
51
+
* `add_items(data, ids, num_threads = -1, replace_deleted = False)` - inserts the `data` (numpy array of vectors, shape:`N*dim`) into the structure.
63
52
* `num_threads` sets the number of cpu threads to use (-1 means use default).
64
53
* `ids` is an optional N-size numpy array of integer labels for all elements in `data`.
65
54
- If the index already has elements with the same labels, their features will be updated. Note that the update procedure is slower than insertion of a new element, but more memory- and query-efficient.
55
+
* `replace_deleted` replaces deleted elements with the newly inserted ones. Note that this saves memory.
56
+
- To use it, `init_index` must be called with `allow_replace_deleted=True`.
66
57
* Thread-safe with other `add_items` calls, but not with `knn_query`.
67
58
68
59
* `mark_deleted(label)` - marks the element as deleted, so it will be omitted from search results. Throws an exception if it is already deleted.
69
-
*
60
+
70
61
* `unmark_deleted(label)` - unmarks the element as deleted, so it will no longer be omitted from search results.
71
62
72
63
* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
73
64
74
65
* `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
75
66
[ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
76
67
77
-
* `knn_query(data, k = 1, num_threads = -1)` makes a batch query for `k` closest elements for each element of the
68
+
* `knn_query(data, k = 1, num_threads = -1, filter = None)` makes a batch query for `k` closest elements for each element of the
78
69
* `data` (shape:`N*dim`). Returns two numpy arrays (labels and distances), each of shape `N*k`.
79
70
* `num_threads` sets the number of cpu threads to use (-1 means use default).
71
+
* `filter` filters elements by their labels and returns only elements with allowed ids. Note that search with a filter is slow in Python in multithreaded mode; it is recommended to set `num_threads=1`.
80
72
* Thread-safe with other `knn_query` calls, but not with `add_items`.
81
73
82
-
* `load_index(path_to_index, max_elements = 0)` loads the index from persistence into the uninitialized index.
74
+
* `load_index(path_to_index, max_elements = 0, allow_replace_deleted = False)` loads the index from persistence into the uninitialized index.
83
75
* `max_elements` (optional) resets the maximum number of elements in the structure.
76
+
* `allow_replace_deleted` specifies whether the index being loaded has enabled replacing of deleted elements.
84
77
85
78
* `save_index(path_to_index)` saves the index to persistence.
86
79
@@ -118,6 +111,12 @@ Properties of `hnswlib.Index` that support reading and writing:
118
111
119
112
120
113
#### Python bindings examples
114
+
[See more examples here](examples/python/EXAMPLES.md):
* Rust implementation for memory and thread safety purposes, with a trait that lets users implement their own distances. It takes as data slices of types `T` satisfying `T: Serialize + Clone + Send + Sync`: https://github.com/jean-pierreBoth/hnswlib-rs
271
279
272
280
### 200M SIFT test reproduction
273
281
To download and extract the bigann dataset (from root directory):
274
282
```bash
275
-
python3 download_bigann.py
283
+
python tests/cpp/download_bigann.py
276
284
```
277
285
To compile:
278
286
```bash
@@ -292,7 +300,7 @@ The size of the BigANN subset (in millions) is controlled by the variable **subs