Merge 0.5.2 changes into master #323

Merged
merged 39 commits into from
Jun 30, 2021
39 commits
5b115a7
Merge pull request #1 from nmslib/master
apoorv-sharma Aug 17, 2020
fa974f8
Update the documentation of HNSWLib
Aug 18, 2020
42541d8
Update
Aug 18, 2020
b65f5e8
Update
Aug 18, 2020
034201f
Update
Aug 18, 2020
5d16c1f
Update
Aug 18, 2020
f445124
Update
Aug 26, 2020
1a6878a
Add support for Python 3.8
marekhanus Feb 2, 2021
0869ed5
Add support for Python 3.9
marekhanus Feb 2, 2021
2437830
List comprehension improves speed of KNN query
marekhanus Feb 2, 2021
dfcade7
Fix PEP 8: E251 unexpected spaces around keyword / parameter equals
marekhanus Feb 2, 2021
76347df
Fix PEP 8: E111 indentation is not a multiple of four
marekhanus Feb 2, 2021
177c362
Fix PEP 8: E228 missing whitespace around modulo operator
marekhanus Feb 2, 2021
4643db6
Fix PEP 8: E225 missing whitespace around operator
marekhanus Feb 2, 2021
0d15d86
Typo fixes
marekhanus Feb 2, 2021
20555b7
Fix PEP 8: E266 too many leading '#' for block comment
marekhanus Feb 2, 2021
87f2dc4
Merge pull request #283 from marekhanus/feature/travis
yurymalkov Feb 5, 2021
cd6c0fc
Merge pull request #285 from marekhanus/feature/improve-knn-query
yurymalkov Feb 5, 2021
0e3845f
Fix missed typo
marekhanus Feb 8, 2021
8481a4b
Fix PEP 8: E225 missing whitespace around operator
marekhanus Feb 8, 2021
cc2b94f
Merge pull request #284 from marekhanus/feature/pep
yurymalkov Feb 12, 2021
e2000e9
Improved description of `add_items`
TakaakiFuruse Feb 25, 2021
95d6b02
data_labels => ids
TakaakiFuruse Mar 6, 2021
9ba16e2
data_label => label
TakaakiFuruse Mar 6, 2021
1afdce0
fixed sample code, data_labels => ids
TakaakiFuruse Mar 6, 2021
af284e6
changed order of args
TakaakiFuruse Mar 6, 2021
8a295df
Merge pull request #289 from TakaakiFuruse/fix-add_item-doc-description
yurymalkov Mar 8, 2021
afaaeb5
Use realloc to simplify the code
Mar 10, 2021
1437b1e
Throw an exception if passed an unrecognized space_name.
psobot Mar 23, 2021
300c2b5
Merge pull request #297 from psobot/patch-1
yurymalkov Apr 5, 2021
6d14f35
Merge remote-tracking branch 'upstream/develop' into fix_resize
Apr 19, 2021
6300d58
Merge pull request #293 from shengjun1985/fix_resize
yurymalkov Apr 20, 2021
2f55fe4
Merge branch 'master' of https://github.com/nmslib/hnswlib
May 10, 2021
1492527
Modify hnsw update logic to unmark the deleted element
May 10, 2021
1e0e002
Merge pull request #311 from apoorv-sharma/master
yurymalkov May 17, 2021
6ec9bad
fix forgotten flag
yurymalkov Jun 2, 2021
aa3de3e
Bump version
yurymalkov Jun 29, 2021
8992ebb
add information about the 0.5.2 release
yurymalkov Jun 29, 2021
2235aad
Update README.md
yurymalkov Jun 30, 2021
28 changes: 27 additions & 1 deletion .travis.yml
@@ -9,7 +9,15 @@ jobs:
- name: Linux Python 3.7
os: linux
python: 3.7


- name: Linux Python 3.8
os: linux
python: 3.8

- name: Linux Python 3.9
os: linux
python: 3.9

- name: Windows Python 3.6
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
@@ -28,6 +36,24 @@ jobs:
- python --version
env: PATH=/c/Python37:/c/Python37/Scripts:$PATH

- name: Windows Python 3.8
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.8.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python38:/c/Python38/Scripts:$PATH

- name: Windows Python 3.9
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.9.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python39:/c/Python39/Scripts:$PATH

install:
- |
python -m pip install .
30 changes: 20 additions & 10 deletions README.md
@@ -1,10 +1,11 @@
# Hnswlib - fast approximate nearest neighbor search
Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment
Header-only C++ HNSW implementation with python bindings.

**NEWS:**

* **Hnswlib is now 0.5.2**. Bugfixes - thanks [@marekhanus](https://github.com/marekhanus) for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; [@apoorv-sharma](https://github.com/apoorv-sharma) for fixing the bug in the insertion/deletion logic; [@shengjun1985](https://github.com/shengjun1985) for simplifying the memory reallocation logic; [@TakaakiFuruse](https://github.com/TakaakiFuruse) for improved description of `add_items`; [@psobot](https://github.com/psobot) for improving error handling; [@ShuAiii](https://github.com/ShuAiii) for reporting the bug in the python interface

* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**
* **Hnswlib is now 0.5.0**. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!

* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).**

@@ -41,18 +42,18 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.

`hnswlib.Index` methods:
* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index with no elements.
* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index with no elements.
* `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
* `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
* `M` defines the maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).

* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
* `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
* `add_items(data, ids, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
* `num_threads` sets the number of cpu threads to use (-1 means use default).
* `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
* `ids` is an optional N-size numpy array of integer labels for all elements in `data`.
- If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
* Thread-safe with other `add_items` calls, but not with `knn_query`.

* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.
* `mark_deleted(label)` - marks the element as deleted, so it will be omitted from search results.

* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
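
The documented semantics of `add_items`, `mark_deleted` and `knn_query` can be modelled with a small brute-force stand-in. This is only a sketch for reasoning about the interface: `ToyIndex` is a hypothetical pure-Python class, not part of hnswlib, it searches exhaustively instead of using an HNSW graph, and it assumes squared L2 distance.

```python
class ToyIndex:
    """Brute-force stand-in for the documented interface (hypothetical, not hnswlib)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}     # label -> vector
        self.deleted = set()  # labels hidden from search results

    def add_items(self, data, ids):
        # Re-using an existing label overwrites its vector (an update),
        # mirroring the behaviour described for `add_items`.
        for vector, label in zip(data, ids):
            if len(vector) != self.dim:
                raise ValueError("vector has the wrong dimension")
            self.vectors[label] = list(vector)

    def mark_deleted(self, label):
        if label not in self.vectors:
            raise KeyError(label)
        self.deleted.add(label)

    def knn_query(self, vector, k=1):
        # Exhaustive search over live elements, ordered by squared L2 distance.
        live = [
            (sum((a - b) ** 2 for a, b in zip(stored, vector)), label)
            for label, stored in self.vectors.items()
            if label not in self.deleted
        ]
        live.sort()
        top = live[:k]
        return [label for _, label in top], [dist for dist, _ in top]


index = ToyIndex(dim=2)
index.add_items([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], ids=[10, 11, 12])
index.add_items([[4.0, 4.0]], ids=[11])  # same label: the vector is replaced
index.mark_deleted(10)                   # label 10 no longer appears in results
labels, distances = index.knn_query([0.0, 0.0], k=2)
print(labels, distances)
```

Label 10 would be the exact match for the query, but it is skipped because it was marked deleted; label 11 is ranked by its updated vector rather than the original one.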

@@ -113,7 +114,7 @@ num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
@@ -122,7 +123,7 @@ p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k
@@ -295,4 +296,13 @@ To run test **with** updates (from `build` directory)

### References

Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320
@article{malkov2018efficient,
title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs},
author={Malkov, Yu A and Yashunin, Dmitry A},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={42},
number={4},
pages={824--836},
year={2018},
publisher={IEEE}
}
11 changes: 5 additions & 6 deletions examples/pyw_hnswlib.py
@@ -11,8 +11,8 @@ def __init__(self, space, dim):
self.dict_labels = {}
self.cur_ind = 0

def init_index(self, max_elements, ef_construction = 200, M = 16):
self.index.init_index(max_elements = max_elements, ef_construction = ef_construction, M = M)
def init_index(self, max_elements, ef_construction=200, M=16):
self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)

def add_items(self, data, ids=None):
if ids is not None:
@@ -55,8 +55,7 @@ def knn_query(self, data, k=1):
labels_int, distances = self.index.knn_query(data=data, k=k)
labels = []
for li in labels_int:
line = []
for l in li:
line.append(self.dict_labels[l])
labels.append(line)
labels.append(
[self.dict_labels[l] for l in li]
)
return labels, distances
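
The change to `knn_query` above replaces the inner loop with a list comprehension without altering the result. A minimal standalone sketch of that equivalence follows; the function names `map_labels_loop` and `map_labels` and the sample data are illustrative assumptions, not code from the wrapper.

```python
def map_labels_loop(labels_int, dict_labels):
    """Nested-loop form, as in the version before this change."""
    labels = []
    for row in labels_int:
        line = []
        for internal in row:
            line.append(dict_labels[internal])
        labels.append(line)
    return labels


def map_labels(labels_int, dict_labels):
    """Comprehension form: one output row per query, one label per neighbour."""
    return [[dict_labels[internal] for internal in row] for row in labels_int]


dict_labels = {0: "a", 1: "b", 2: "c"}  # internal integer id -> user label
labels_int = [[0, 2], [1, 1]]           # two queries, k = 2 neighbours each
mapped = map_labels(labels_int, dict_labels)
print(mapped)
```

Both forms raise `KeyError` for an internal id that is missing from the dictionary, so error behaviour is unchanged as well.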
22 changes: 10 additions & 12 deletions hnswlib/hnswalg.h
@@ -573,29 +573,23 @@ namespace hnswlib {
visited_list_pool_ = new VisitedListPool(1, new_max_elements);



element_levels_.resize(new_max_elements);

std::vector<std::mutex>(new_max_elements).swap(link_list_locks_);

// Reallocate base layer
char * data_level0_memory_new = (char *) malloc(new_max_elements * size_data_per_element_);
char * data_level0_memory_new = (char *) realloc(data_level0_memory_, new_max_elements * size_data_per_element_);
if (data_level0_memory_new == nullptr)
throw std::runtime_error("Not enough memory: resizeIndex failed to allocate base layer");
memcpy(data_level0_memory_new, data_level0_memory_,cur_element_count * size_data_per_element_);
free(data_level0_memory_);
data_level0_memory_=data_level0_memory_new;
data_level0_memory_ = data_level0_memory_new;

// Reallocate all other layers
char ** linkLists_new = (char **) malloc(sizeof(void *) * new_max_elements);
char ** linkLists_new = (char **) realloc(linkLists_, sizeof(void *) * new_max_elements);
if (linkLists_new == nullptr)
throw std::runtime_error("Not enough memory: resizeIndex failed to allocate other layers");
memcpy(linkLists_new, linkLists_,cur_element_count * sizeof(void *));
free(linkLists_);
linkLists_=linkLists_new;

max_elements_=new_max_elements;
linkLists_ = linkLists_new;

max_elements_ = new_max_elements;
}

void saveIndex(const std::string &location) {
@@ -987,11 +981,15 @@ namespace hnswlib {
auto search = label_lookup_.find(label);
if (search != label_lookup_.end()) {
tableint existingInternalId = search->second;

templock_curr.unlock();

std::unique_lock <std::mutex> lock_el_update(link_list_update_locks_[(existingInternalId & (max_update_element_locks - 1))]);

if (isMarkedDeleted(existingInternalId)) {
unmarkDeletedInternal(existingInternalId);
}
updatePoint(data_point, existingInternalId, 1.0);

return existingInternalId;
}
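
The update path in this hunk can be pictured with a toy model: look the label up under a global lock, release it, pick a per-element lock by masking the internal id, clear the deleted mark if present, then overwrite the data. The sketch below is a hedged Python analogy, not the library's implementation; `ToyUpdatableIndex`, its attributes, and the lock count of 8 are all assumptions made for illustration (the real value of `max_update_element_locks` is not shown in this diff).

```python
import threading


class ToyUpdatableIndex:
    """Hypothetical model of the update path; names only loosely follow the C++ code."""

    def __init__(self, num_update_locks=8):
        # The mask trick in lock_for() only works for a power-of-two lock count.
        if num_update_locks <= 0 or num_update_locks & (num_update_locks - 1):
            raise ValueError("num_update_locks must be a power of two")
        self.update_locks = [threading.Lock() for _ in range(num_update_locks)]
        self.global_lock = threading.Lock()
        self.label_lookup = {}  # label -> internal id
        self.vectors = []       # internal id -> vector
        self.deleted = set()    # internal ids marked as deleted

    def lock_for(self, internal_id):
        # id & (n - 1) equals id % n when n is a power of two.
        return self.update_locks[internal_id & (len(self.update_locks) - 1)]

    def mark_deleted(self, label):
        self.deleted.add(self.label_lookup[label])

    def add_point(self, vector, label):
        with self.global_lock:
            internal_id = self.label_lookup.get(label)
            if internal_id is None:
                # New label: append and return while still holding the global lock.
                internal_id = len(self.vectors)
                self.label_lookup[label] = internal_id
                self.vectors.append(list(vector))
                return internal_id
        # Existing label: the global lock is released before the striped lock is taken.
        with self.lock_for(internal_id):
            self.deleted.discard(internal_id)  # re-adding revives a deleted element
            self.vectors[internal_id] = list(vector)
        return internal_id


idx = ToyUpdatableIndex()
first = idx.add_point([1.0], label=42)
idx.mark_deleted(42)
second = idx.add_point([2.0], label=42)  # same label: update and unmark
print(first, second, idx.deleted, idx.vectors)
```

Without the unmark step, the second `add_point` would refresh the vector but leave the element hidden from search results, which is the situation the added `isMarkedDeleted` check addresses.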

3 changes: 3 additions & 0 deletions python_bindings/bindings.cpp
@@ -97,6 +97,8 @@ class Index {
else if(space_name=="cosine") {
l2space = new hnswlib::InnerProductSpace(dim);
normalize=true;
} else {
throw std::runtime_error("Space name must be one of l2, ip, or cosine.");
}
appr_alg = NULL;
ep_added = true;
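
The validation added to the constructor can be summarised as "map a known space name to a distance kind and a normalisation flag, otherwise fail loudly". The following Python sketch mirrors that idea under stated assumptions: `resolve_space` and `SUPPORTED_SPACES` are hypothetical names, and only the `cosine` row (inner product with normalisation) is confirmed by the hunk above; the `l2` and `ip` rows are assumed from the error message.

```python
# space name -> (distance kind, normalize input vectors?)
SUPPORTED_SPACES = {
    "l2": ("l2", False),                 # assumed: branch not visible in this hunk
    "ip": ("inner_product", False),      # assumed: branch not visible in this hunk
    "cosine": ("inner_product", True),   # shown above: InnerProductSpace + normalize
}


def resolve_space(space_name):
    """Return (distance kind, normalize flag) or raise for an unknown name."""
    try:
        return SUPPORTED_SPACES[space_name]
    except KeyError:
        raise ValueError("Space name must be one of l2, ip, or cosine.") from None


print(resolve_space("cosine"))
try:
    resolve_space("euclidean")
except ValueError as exc:
    print("rejected:", exc)
```

Failing at construction time means a misspelled space name surfaces immediately instead of leaving the index without a usable distance function.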
@@ -162,6 +164,7 @@ class Index {
}
appr_alg = new hnswlib::HierarchicalNSW<dist_t>(l2space, path_to_index, false, max_elements);
cur_l = appr_alg->cur_element_count;
index_inited = true;
}

void normalize_vector(float *data, float *norm_array){
8 changes: 4 additions & 4 deletions python_bindings/tests/bindings_test.py
@@ -18,15 +18,15 @@ def testRandomSelf(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements = num_elements, ef_construction = 100, M = 16)
p.init_index(max_elements=num_elements, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
@@ -51,7 +51,7 @@ def testRandomSelf(self):
p.save_index(index_path)
del p

# Reiniting, loading the index
# Re-initiating, loading the index
p = hnswlib.Index(space='l2', dim=dim) # you can change the sa

print("\nLoading index from '%s'\n" % index_path)
4 changes: 2 additions & 2 deletions python_bindings/tests/bindings_test_getdata.py
@@ -19,13 +19,13 @@ def testGettingItems(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

16 changes: 8 additions & 8 deletions python_bindings/tests/bindings_test_labels.py
@@ -21,13 +21,13 @@ def testRandomSelf(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

@@ -47,7 +47,7 @@ def testRandomSelf(self):
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)

items=p.get_items(labels)
items = p.get_items(labels)

# Check the recall:
self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3)
@@ -67,8 +67,8 @@ def testRandomSelf(self):
print("Deleted")

print("\n**** Mark delete test ****\n")
# Reiniting, loading the index
print("Reiniting")
# Re-initiating, loading the index
print("Re-initiating")
p = hnswlib.Index(space='l2', dim=dim)

print("\nLoading index from '%s'\n" % index_path)
@@ -80,17 +80,17 @@ def testRandomSelf(self):

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
items=p.get_items(labels)
items = p.get_items(labels)

# Check the recall:
self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3)

# Check that the returned element data is correct:
diff_with_gt_labels=np.mean(np.abs(data-items))
diff_with_gt_labels = np.mean(np.abs(data-items))
self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # deleting index.

# Checking that all labels are returned correctly:
sorted_labels=sorted(p.get_ids_list())
sorted_labels = sorted(p.get_ids_list())
self.assertEqual(np.sum(~(np.asarray(sorted_labels) == np.asarray(range(num_elements)))), 0)

# Delete data1