Commit 8297326

Merge pull request #279 from nmslib/develop
Merge 0.5.0 changes to master
2 parents 2571bdb + e03162b commit 8297326

25 files changed: +1066 -455 lines changed

.gitignore

+8
@@ -0,0 +1,8 @@
+hnswlib.egg-info/
+build/
+dist/
+tmp/
+python_bindings/tests/__pycache__/
+*.pyd
+hnswlib.cpython*.so
+var/

.travis.yml

+29-8
@@ -1,16 +1,37 @@
 language: python
 
-matrix:
+jobs:
   include:
-    - python: 3.6
-    - python: 3.7
+    - name: Linux Python 3.6
+      os: linux
+      python: 3.6
+
+    - name: Linux Python 3.7
+      os: linux
+      python: 3.7
+
+    - name: Windows Python 3.6
+      os: windows
+      language: shell  # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.6.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python36:/c/Python36/Scripts:$PATH
+
+    - name: Windows Python 3.7
+      os: windows
+      language: shell  # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.7.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python37:/c/Python37/Scripts:$PATH
+
 install:
   - |
-    cd python_bindings
-    pip install -r requirements.txt
-    python setup.py install
+    python -m pip install .
 
 script:
   - |
-    cd python_bindings
-    python setup.py test
+    python -m unittest discover --start-directory python_bindings/tests --pattern "*_test*.py"

CMakeLists.txt

+2
@@ -23,4 +23,6 @@ endif()
 
 add_executable(test_updates examples/updates_test.cpp)
 
+add_executable(searchKnnCloserFirst_test examples/searchKnnCloserFirst_test.cpp)
+
 target_link_libraries(main sift_test)

MANIFEST.in

+2
@@ -0,0 +1,2 @@
+include hnswlib/*.h
+include LICENSE

Makefile

+15
@@ -0,0 +1,15 @@
+pypi: dist
+	twine upload dist/*
+
+dist:
+	-rm dist/*
+	pip install build
+	python3 -m build --sdist
+
+test:
+	python3 -m unittest discover --start-directory python_bindings/tests --pattern "*_test*.py"
+
+clean:
+	rm -rf *.egg-info build dist tmp var tests/__pycache__ hnswlib.cpython*.so
+
+.PHONY: dist

README.md

+56-18
@@ -3,9 +3,12 @@ Header-only C++ HNSW implementation with python bindings. Paper's code for the H
 
 **NEWS:**
 
-* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the perfromance/memory should not degrade as you update the element embeddinds).**
 
-* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not mutiple of 4**
+* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**
+
+* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the performance/memory should not degrade as you update the element embeddings).**
+
+* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not multiple of 4**
 
 * **Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib can now be installed via pip!**
 
@@ -37,7 +40,7 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
 #### Short API description
 * `hnswlib.Index(space, dim)` creates a non-initialized index an HNSW in space `space` with integer dimension `dim`.
 
-Index methods:
+`hnswlib.Index` methods:
 * `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index from with no elements.
     * `max_elements` defines the maximum number of elements that can be stored in the structure(can be increased/shrunk).
     * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
@@ -49,14 +52,14 @@ Index methods:
     * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
     * Thread-safe with other `add_items` calls, but not with `knn_query`.
 
-* `mark_deleted(data_label)` - marks the element as deleted, so it will be ommited from search results.
+* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.
 
 * `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
 
 * `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
 [ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
 
-* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closests elements for each element of the
+* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closest elements for each element of the
     * `data` (shape:`N*dim`). Returns a numpy array of (shape:`N*k`).
     * `num_threads` sets the number of cpu threads to use (-1 means use default).
     * Thread-safe with other `knn_query` calls, but not with `add_items`.
@@ -76,14 +79,34 @@ Index methods:
 
 * `get_current_count()` - returns the current number of element stored in the index
 
-
-
+Read-only properties of `hnswlib.Index` class:
+
+* `space` - name of the space (can be one of "l2", "ip", or "cosine").
+
+* `dim` - dimensionality of the space.
+
+* `M` - parameter that defines the maximum number of outgoing connections in the graph.
+
+* `ef_construction` - parameter that controls speed/accuracy trade-off during the index construction.
+
+* `max_elements` - current capacity of the index. Equivalent to `p.get_max_elements()`.
+
+* `element_count` - number of items in the index. Equivalent to `p.get_current_count()`.
+
+Properties of `hnswlib.Index` that support reading and writing:
+
+* `ef` - parameter controlling query time/accuracy trade-off.
+
+* `num_threads` - default number of threads to use in `add_items` or `knn_query`. Note that calling `p.set_num_threads(3)` is equivalent to `p.num_threads=3`.
+
+
 
 
 #### Python bindings examples
 ```python
 import hnswlib
 import numpy as np
+import pickle
 
 dim = 128
 num_elements = 10000
@@ -95,7 +118,7 @@ data_labels = np.arange(num_elements)
 # Declaring index
 p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
 
-# Initing index - the maximum number of elements should be known beforehand
+# Initializing index - the maximum number of elements should be known beforehand
 p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
 
 # Element insertion (can be called several times):
@@ -106,6 +129,18 @@ p.set_ef(50) # ef should always be > k
 
 # Query dataset, k - number of closest elements (returns 2 numpy arrays)
 labels, distances = p.knn_query(data, k = 1)
+
+# Index objects support pickling
+# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
+# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
+p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip
+
+### Index parameters are exposed as class properties:
+print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
+print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
+print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
+print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
+
 ```
 
 An example with updates after serialization/deserialization:
@@ -126,7 +161,7 @@ data2 = data[num_elements // 2:]
 # Declaring index
 p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
 
-# Initing index
+# Initializing index
 # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
 # during insertion of an element.
 # The capacity can be increased by saving/loading the index, see below.
@@ -160,7 +195,7 @@ print("Saving index to '%s'" % index_path)
 p.save_index("first_half.bin")
 del p
 
-# Reiniting, loading the index
+# Re-initializing, loading the index
 p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.
 
 print("\nLoading index from 'first_half.bin'\n")
@@ -181,17 +216,17 @@ print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(dat
 You can install from sources:
 ```bash
 apt-get install -y python-setuptools python-pip
-pip3 install pybind11 numpy setuptools
-cd python_bindings
-python3 setup.py install
+git clone https://github.com/nmslib/hnswlib.git
+cd hnswlib
+pip install .
 ```
 
 or you can install via pip:
 `pip install hnswlib`
 
 ### Other implementations
 * Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
-* Faiss libary by facebook, uses own HNSW implementation for coarse quantization (python, C++):
+* Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++):
 https://github.com/facebookresearch/faiss
 * Code for the paper
 ["Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors"](https://arxiv.org/abs/1802.02422)
@@ -203,21 +238,24 @@ https://github.com/dbaranchuk/ivf-hnsw
 * Python implementation (as a part of the clustering code by by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
 * Java implementation: https://github.com/jelmerk/hnswlib
 * Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
-* .Net implementation: https://github.com/microsoft/HNSW.Net
+* .Net implementation: https://github.com/microsoft/HNSW.Net
+* CUDA implementation: https://github.com/js1010/cuhnsw
 
 ### Contributing to the repository
 Contributions are highly welcome!
 
 Please make pull requests against the `develop` branch.
 
 ### 200M SIFT test reproduction
-To download and extract the bigann dataset:
+To download and extract the bigann dataset (from root directory):
 ```bash
 python3 download_bigann.py
 ```
 To compile:
 ```bash
-cmake .
+mkdir build
+cd build
+cmake ..
 make all
 ```
 
@@ -226,7 +264,7 @@ To run the test on 200M SIFT subset:
 ./main
 ```
 
-The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**.
+The size of the BigANN subset (in millions) is controlled by the variable **subset_size_millions** hardcoded in **sift_1b.cpp**.
 
 ### Updates test
 To generate testing data (from root directory):

examples/searchKnnCloserFirst_test.cpp

+84
@@ -0,0 +1,84 @@
+// This is a test file for testing the interface
+// >>> virtual std::vector<std::pair<dist_t, labeltype>>
+// >>> searchKnnCloserFirst(const void* query_data, size_t k) const;
+// of class AlgorithmInterface
+
+#include "../hnswlib/hnswlib.h"
+
+#include <assert.h>
+
+#include <vector>
+#include <iostream>
+
+namespace
+{
+
+using idx_t = hnswlib::labeltype;
+
+void test() {
+    int d = 4;
+    idx_t n = 100;
+    idx_t nq = 10;
+    size_t k = 10;
+
+    std::vector<float> data(n * d);
+    std::vector<float> query(nq * d);
+
+    std::mt19937 rng;
+    rng.seed(47);
+    std::uniform_real_distribution<> distrib;
+
+    for (idx_t i = 0; i < n * d; ++i) {
+        data[i] = distrib(rng);
+    }
+    for (idx_t i = 0; i < nq * d; ++i) {
+        query[i] = distrib(rng);
+    }
+
+
+    hnswlib::L2Space space(d);
+    hnswlib::AlgorithmInterface<float>* alg_brute = new hnswlib::BruteforceSearch<float>(&space, 2 * n);
+    hnswlib::AlgorithmInterface<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, 2 * n);
+
+    for (size_t i = 0; i < n; ++i) {
+        alg_brute->addPoint(data.data() + d * i, i);
+        alg_hnsw->addPoint(data.data() + d * i, i);
+    }
+
+    // test searchKnnCloserFirst of BruteforceSearch
+    for (size_t j = 0; j < nq; ++j) {
+        const void* p = query.data() + j * d;
+        auto gd = alg_brute->searchKnn(p, k);
+        auto res = alg_brute->searchKnnCloserFirst(p, k);
+        assert(gd.size() == res.size());
+        size_t t = gd.size();
+        while (!gd.empty()) {
+            assert(gd.top() == res[--t]);
+            gd.pop();
+        }
+    }
+    for (size_t j = 0; j < nq; ++j) {
+        const void* p = query.data() + j * d;
+        auto gd = alg_hnsw->searchKnn(p, k);
+        auto res = alg_hnsw->searchKnnCloserFirst(p, k);
+        assert(gd.size() == res.size());
+        size_t t = gd.size();
+        while (!gd.empty()) {
+            assert(gd.top() == res[--t]);
+            gd.pop();
+        }
+    }
+
+    delete alg_brute;
+    delete alg_hnsw;
+}
+
+} // namespace
+
+int main() {
+    std::cout << "Testing ..." << std::endl;
+    test();
+    std::cout << "Test ok" << std::endl;
+
+    return 0;
+}
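
Note on the test above: it asserts that `searchKnnCloserFirst` returns the same hits as `searchKnn`, with the heap's top (the farthest hit) landing in the last slot of the vector. The sketch below illustrates that relationship with the standard library only. `Dist`, `Label`, `Hit` and `closer_first` are hypothetical names chosen for illustration and are not part of hnswlib.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-ins for hnswlib's dist_t / labeltype (assumption, not the library's types).
using Dist = float;
using Label = std::size_t;
using Hit = std::pair<Dist, Label>;

// Drain a max-heap (farthest hit on top) into a vector ordered closest-first.
std::vector<Hit> closer_first(std::priority_queue<Hit> heap) {
    std::vector<Hit> out(heap.size());
    std::size_t slot = heap.size();
    while (!heap.empty()) {
        out[--slot] = heap.top();  // farthest remaining hit fills the last free slot
        heap.pop();
    }
    return out;
}
```

Filling the vector from the back avoids a separate sort or reverse pass, since popping a max-heap already yields hits in descending distance.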

hnswlib/bruteforce.h

-18
@@ -111,24 +111,6 @@ namespace hnswlib {
             return topResults;
         };
 
-        template <typename Comp>
-        std::vector<std::pair<dist_t, labeltype>>
-        searchKnn(const void* query_data, size_t k, Comp comp) {
-            std::vector<std::pair<dist_t, labeltype>> result;
-            if (cur_element_count == 0) return result;
-
-            auto ret = searchKnn(query_data, k);
-
-            while (!ret.empty()) {
-                result.push_back(ret.top());
-                ret.pop();
-            }
-
-            std::sort(result.begin(), result.end(), comp);
-
-            return result;
-        }
-
         void saveIndex(const std::string &location) {
             std::ofstream output(location, std::ios::binary);
             std::streampos position;
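
Note on the removal above: callers that depended on the comparator overload of `searchKnn` could reproduce its behaviour outside the library. This is a minimal sketch, assuming the results arrive as a `std::priority_queue` of (distance, label) pairs; `Hit` and `sorted_hits` are hypothetical names, not hnswlib API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Hit = std::pair<float, std::size_t>;  // assumed (distance, label) layout

// Drain the result heap, then order the hits with a caller-supplied comparator,
// mirroring what the removed overload did internally.
template <typename Comp>
std::vector<Hit> sorted_hits(std::priority_queue<Hit> heap, Comp comp) {
    std::vector<Hit> out;
    out.reserve(heap.size());
    while (!heap.empty()) {
        out.push_back(heap.top());
        heap.pop();
    }
    std::sort(out.begin(), out.end(), comp);
    return out;
}
```

A comparator such as `[](const Hit& a, const Hit& b) { return a.second < b.second; }` would order the hits by label instead of by distance.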
