Commit 1866a1d

authored Jun 30, 2021
Merge pull request #323 from nmslib/develop
Merge 0.5.2 changes into master
2 parents d59f8d9 + 2235aad commit 1866a1d

11 files changed, +147 -111 lines changed
 

‎.travis.yml

+27 -1
@@ -9,7 +9,15 @@ jobs:
     - name: Linux Python 3.7
       os: linux
       python: 3.7
-
+
+    - name: Linux Python 3.8
+      os: linux
+      python: 3.8
+
+    - name: Linux Python 3.9
+      os: linux
+      python: 3.9
+
     - name: Windows Python 3.6
       os: windows
       language: shell # 'language: python' is an error on Travis CI Windows
@@ -28,6 +36,24 @@ jobs:
         - python --version
       env: PATH=/c/Python37:/c/Python37/Scripts:$PATH
 
+    - name: Windows Python 3.8
+      os: windows
+      language: shell # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.8.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
+
+    - name: Windows Python 3.9
+      os: windows
+      language: shell # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.9.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python39:/c/Python39/Scripts:$PATH
+
 install:
   - |
     python -m pip install .

‎README.md

+20 -10
@@ -1,10 +1,11 @@
 # Hnswlib - fast approximate nearest neighbor search
-Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment
+Header-only C++ HNSW implementation with python bindings.
 
 **NEWS:**
 
+* **Hnswlib is now 0.5.2**. Bugfixes - thanks [@marekhanus](https://github.com/marekhanus) for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; [@apoorv-sharma](https://github.com/apoorv-sharma) for fixing the bug int the insertion/deletion logic; [@shengjun1985](https://github.com/shengjun1985) for simplifying the memory reallocation logic; [@TakaakiFuruse](https://github.com/TakaakiFuruse) for improved description of `add_items`; [@psobot ](https://github.com/psobot) for improving error handling; [@ShuAiii](https://github.com/ShuAiii) for reporting the bug in the python interface
 
-* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**
+* **Hnswlib is now 0.5.0**. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!
 
 * **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the performance/memory should not degrade as you update the element embeddings).**
 
@@ -41,18 +42,18 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
 * `hnswlib.Index(space, dim)` creates a non-initialized index an HNSW in space `space` with integer dimension `dim`.
 
 `hnswlib.Index` methods:
-* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index from with no elements.
+* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index from with no elements.
     * `max_elements` defines the maximum number of elements that can be stored in the structure(can be increased/shrunk).
     * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
     * `M` defines tha maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).
 
-* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
-    * `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
+* `add_items(data, ids, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
     * `num_threads` sets the number of cpu threads to use (-1 means use default).
-    * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
+    * `ids` are optional N-size numpy array of integer labels for all elements in `data`.
+      - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
     * Thread-safe with other `add_items` calls, but not with `knn_query`.
 
-* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.
+* `mark_deleted(label)` - marks the element as deleted, so it will be omitted from search results.
 
 * `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
 
@@ -113,7 +114,7 @@ num_elements = 10000
 
 # Generating sample data
 data = np.float32(np.random.random((num_elements, dim)))
-data_labels = np.arange(num_elements)
+ids = np.arange(num_elements)
 
 # Declaring index
 p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
@@ -122,7 +123,7 @@ p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or
 p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
 
 # Element insertion (can be called several times):
-p.add_items(data, data_labels)
+p.add_items(data, ids)
 
 # Controlling the recall by setting ef:
 p.set_ef(50) # ef should always be > k
@@ -295,4 +296,13 @@ To run test **with** updates (from `build` directory)
 
 ### References
 
-Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320
+@article{malkov2018efficient,
+  title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs},
+  author={Malkov, Yu A and Yashunin, Dmitry A},
+  journal={IEEE transactions on pattern analysis and machine intelligence},
+  volume={42},
+  number={4},
+  pages={824--836},
+  year={2018},
+  publisher={IEEE}
+}
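
The README change above reorders the documented `init_index` parameters so that `M` comes before `ef_construction`, while the README example keeps calling it with keyword arguments. The sketch below illustrates why keyword calls are unaffected by that ordering. The `init_index` function here is a hypothetical stand-in that only mirrors the documented signature; it is not the hnswlib binding.

```python
# Hypothetical stand-in that mirrors the signature documented in the README;
# it only records which value was bound to which parameter.
def init_index(max_elements, M=16, ef_construction=200, random_seed=100):
    return {
        "max_elements": max_elements,
        "M": M,
        "ef_construction": ef_construction,
        "random_seed": random_seed,
    }


# Keyword arguments bind by name, so their order in the call does not matter.
by_keyword = init_index(max_elements=10000, ef_construction=200, M=16)

# Positional arguments bind by position: 200 lands in M, 16 in ef_construction.
by_position = init_index(10000, 200, 16)

print(by_keyword["M"], by_keyword["ef_construction"])
print(by_position["M"], by_position["ef_construction"])
```

Passing these parameters by name, as the README example does, keeps a call independent of the documented order.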

‎examples/pyw_hnswlib.py

+5 -6
@@ -11,8 +11,8 @@ def __init__(self, space, dim):
         self.dict_labels = {}
         self.cur_ind = 0
 
-    def init_index(self, max_elements, ef_construction = 200, M = 16):
-        self.index.init_index(max_elements = max_elements, ef_construction = ef_construction, M = M)
+    def init_index(self, max_elements, ef_construction=200, M=16):
+        self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)
 
     def add_items(self, data, ids=None):
         if ids is not None:
@@ -55,8 +55,7 @@ def knn_query(self, data, k=1):
         labels_int, distances = self.index.knn_query(data=data, k=k)
         labels = []
         for li in labels_int:
-            line = []
-            for l in li:
-                line.append(self.dict_labels[l])
-            labels.append(line)
+            labels.append(
+                [self.dict_labels[l] for l in li]
+            )
         return labels, distances
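
The `knn_query` hunk above replaces an inner loop with a list comprehension. A minimal standalone sketch of that refactor follows, checking that both forms produce the same mapping. The data and the two helper function names are hypothetical and chosen only for illustration; only `dict_labels` and `labels_int` echo names from the diff.

```python
# Hypothetical data standing in for the wrapper's state and a query result.
dict_labels = {0: "apple", 1: "banana", 2: "cherry"}  # internal id -> user label
labels_int = [[0, 2], [1, 0]]                          # one row of ids per query


def map_labels_loop(labels_int, dict_labels):
    """Nested-loop form, as on the removed lines."""
    labels = []
    for li in labels_int:
        line = []
        for l in li:
            line.append(dict_labels[l])
        labels.append(line)
    return labels


def map_labels_comprehension(labels_int, dict_labels):
    """List-comprehension form, as on the added lines."""
    labels = []
    for li in labels_int:
        labels.append([dict_labels[l] for l in li])
    return labels


mapped = map_labels_comprehension(labels_int, dict_labels)
print(mapped)
```

Both helpers raise `KeyError` for an id missing from `dict_labels`, so the refactor does not change error behavior in this sketch.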

‎hnswlib/hnswalg.h

+10 -12
@@ -573,29 +573,23 @@ namespace hnswlib {
             visited_list_pool_ = new VisitedListPool(1, new_max_elements);
 
 
-
             element_levels_.resize(new_max_elements);
 
             std::vector<std::mutex>(new_max_elements).swap(link_list_locks_);
 
             // Reallocate base layer
-            char * data_level0_memory_new = (char *) malloc(new_max_elements * size_data_per_element_);
+            char * data_level0_memory_new = (char *) realloc(data_level0_memory_, new_max_elements * size_data_per_element_);
             if (data_level0_memory_new == nullptr)
                 throw std::runtime_error("Not enough memory: resizeIndex failed to allocate base layer");
-            memcpy(data_level0_memory_new, data_level0_memory_,cur_element_count * size_data_per_element_);
-            free(data_level0_memory_);
-            data_level0_memory_=data_level0_memory_new;
+            data_level0_memory_ = data_level0_memory_new;
 
             // Reallocate all other layers
-            char ** linkLists_new = (char **) malloc(sizeof(void *) * new_max_elements);
+            char ** linkLists_new = (char **) realloc(linkLists_, sizeof(void *) * new_max_elements);
             if (linkLists_new == nullptr)
                 throw std::runtime_error("Not enough memory: resizeIndex failed to allocate other layers");
-            memcpy(linkLists_new, linkLists_,cur_element_count * sizeof(void *));
-            free(linkLists_);
-            linkLists_=linkLists_new;
-
-            max_elements_=new_max_elements;
+            linkLists_ = linkLists_new;
 
+            max_elements_ = new_max_elements;
         }
 
         void saveIndex(const std::string &location) {
@@ -987,11 +981,15 @@ namespace hnswlib {
             auto search = label_lookup_.find(label);
             if (search != label_lookup_.end()) {
                 tableint existingInternalId = search->second;
-
                 templock_curr.unlock();
 
                 std::unique_lock <std::mutex> lock_el_update(link_list_update_locks_[(existingInternalId & (max_update_element_locks - 1))]);
+
+                if (isMarkedDeleted(existingInternalId)) {
+                    unmarkDeletedInternal(existingInternalId);
+                }
                 updatePoint(data_point, existingInternalId, 1.0);
+
                 return existingInternalId;
             }
997995

‎python_bindings/bindings.cpp

+3
@@ -97,6 +97,8 @@ class Index {
         else if(space_name=="cosine") {
             l2space = new hnswlib::InnerProductSpace(dim);
             normalize=true;
+        } else {
+            throw new std::runtime_error("Space name must be one of l2, ip, or cosine.");
         }
         appr_alg = NULL;
         ep_added = true;
@@ -162,6 +164,7 @@ class Index {
         }
         appr_alg = new hnswlib::HierarchicalNSW<dist_t>(l2space, path_to_index, false, max_elements);
         cur_l = appr_alg->cur_element_count;
+        index_inited = true;
     }
 
     void normalize_vector(float *data, float *norm_array){
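
The first hunk above adds a fallback branch that rejects any space name other than `l2`, `ip`, or `cosine`. A minimal Python sketch of the same validation idea is shown below. The `check_space_name` helper and the choice of `ValueError` are assumptions made for illustration; how the C++ exception surfaces in Python is not shown in this diff.

```python
# Names taken from the error message in the hunk above.
SUPPORTED_SPACES = ("l2", "ip", "cosine")


def check_space_name(space_name):
    """Hypothetical helper: reject unsupported space names up front."""
    if space_name not in SUPPORTED_SPACES:
        raise ValueError("Space name must be one of l2, ip, or cosine.")
    return space_name


accepted = check_space_name("cosine")

try:
    check_space_name("euclidean")
    rejected = False
except ValueError as err:
    rejected = True
    message = str(err)

print(accepted, rejected)
```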

‎python_bindings/tests/bindings_test.py

+4 -4
@@ -18,15 +18,15 @@ def testRandomSelf(self):
         # Declaring index
         p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
 
-        # Initing index
+        # Initiating index
         # max_elements - the maximum number of elements, should be known beforehand
         # (probably will be made optional in the future)
         #
         # ef_construction - controls index search speed/build speed tradeoff
         # M - is tightly connected with internal dimensionality of the data
-        # stronlgy affects the memory consumption
+        # strongly affects the memory consumption
 
-        p.init_index(max_elements = num_elements, ef_construction = 100, M = 16)
+        p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 
         # Controlling the recall by setting ef:
         # higher ef leads to better accuracy, but slower search
@@ -51,7 +51,7 @@ def testRandomSelf(self):
         p.save_index(index_path)
         del p
 
-        # Reiniting, loading the index
+        # Re-initiating, loading the index
         p = hnswlib.Index(space='l2', dim=dim) # you can change the sa
 
         print("\nLoading index from '%s'\n" % index_path)

‎python_bindings/tests/bindings_test_getdata.py

+2 -2
@@ -19,13 +19,13 @@ def testGettingItems(self):
         # Declaring index
         p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
 
-        # Initing index
+        # Initiating index
         # max_elements - the maximum number of elements, should be known beforehand
         # (probably will be made optional in the future)
         #
         # ef_construction - controls index search speed/build speed tradeoff
         # M - is tightly connected with internal dimensionality of the data
-        # stronlgy affects the memory consumption
+        # strongly affects the memory consumption
 
         p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 

‎python_bindings/tests/bindings_test_labels.py

+8 -8
@@ -21,13 +21,13 @@ def testRandomSelf(self):
         # Declaring index
         p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
 
-        # Initing index
+        # Initiating index
         # max_elements - the maximum number of elements, should be known beforehand
         # (probably will be made optional in the future)
         #
         # ef_construction - controls index search speed/build speed tradeoff
         # M - is tightly connected with internal dimensionality of the data
-        # stronlgy affects the memory consumption
+        # strongly affects the memory consumption
 
         p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 
@@ -47,7 +47,7 @@ def testRandomSelf(self):
         # Query the elements for themselves and measure recall:
         labels, distances = p.knn_query(data1, k=1)
 
-        items=p.get_items(labels)
+        items = p.get_items(labels)
 
         # Check the recall:
         self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3)
@@ -67,8 +67,8 @@ def testRandomSelf(self):
         print("Deleted")
 
         print("\n**** Mark delete test ****\n")
-        # Reiniting, loading the index
-        print("Reiniting")
+        # Re-initiating, loading the index
+        print("Re-initiating")
         p = hnswlib.Index(space='l2', dim=dim)
 
         print("\nLoading index from '%s'\n" % index_path)
@@ -80,17 +80,17 @@ def testRandomSelf(self):
 
         # Query the elements for themselves and measure recall:
         labels, distances = p.knn_query(data, k=1)
-        items=p.get_items(labels)
+        items = p.get_items(labels)
 
         # Check the recall:
         self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3)
 
         # Check that the returned element data is correct:
-        diff_with_gt_labels=np.mean(np.abs(data-items))
+        diff_with_gt_labels = np.mean(np.abs(data-items))
         self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # deleting index.
 
         # Checking that all labels are returned correctly:
-        sorted_labels=sorted(p.get_ids_list())
+        sorted_labels = sorted(p.get_ids_list())
         self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0)
 
         # Delete data1
