Merge pull request #323 from nmslib/develop

yurymalkov · web-flow · commit 1866a1df7961 · 2021-06-29T22:17:40.000-07:00
Merge 0.5.2 changes into master
diff --git a/.travis.yml b/.travis.yml
@@ -9,7 +9,15 @@ jobs:
     - name: Linux Python 3.7
       os: linux
       python: 3.7
-    
+
+    - name: Linux Python 3.8
+      os: linux
+      python: 3.8
+
+    - name: Linux Python 3.9
+      os: linux
+      python: 3.9
+
     - name: Windows Python 3.6
       os: windows
       language: shell    # 'language: python' is an error on Travis CI Windows
@@ -28,6 +36,24 @@ jobs:
         - python --version
       env: PATH=/c/Python37:/c/Python37/Scripts:$PATH
 
+    - name: Windows Python 3.8
+      os: windows
+      language: shell    # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.8.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
+
+    - name: Windows Python 3.9
+      os: windows
+      language: shell    # 'language: python' is an error on Travis CI Windows
+      before_install:
+        - choco install python --version 3.9.0
+        - python -m pip install --upgrade pip
+        - python --version
+      env: PATH=/c/Python39:/c/Python39/Scripts:$PATH
+
 install:
   - |
     python -m pip install .
diff --git a/README.md b/README.md
@@ -1,10 +1,11 @@
 # Hnswlib - fast approximate nearest neighbor search
-Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment
+Header-only C++ HNSW implementation with python bindings.
 
 **NEWS:**
 
+* **Hnswlib is now 0.5.2**. Bugfixes - thanks [@marekhanus](https://github.com/marekhanus) for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; [@apoorv-sharma](https://github.com/apoorv-sharma) for fixing the bug in the insertion/deletion logic; [@shengjun1985](https://github.com/shengjun1985) for simplifying the memory reallocation logic; [@TakaakiFuruse](https://github.com/TakaakiFuruse) for improved description of `add_items`; [@psobot](https://github.com/psobot) for improving error handling; [@ShuAiii](https://github.com/ShuAiii) for reporting the bug in the python interface
 
-* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**
+* **Hnswlib is now 0.5.0**. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!
 
 * **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).**
 
@@ -41,18 +42,18 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
 * `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.
 
 `hnswlib.Index` methods:
-* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index from with no elements. 
+* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index with no elements.
     * `max_elements` defines the maximum number of elements that can be stored in the structure(can be increased/shrunk).
     * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
     * `M` defines the maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).
     
-* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure. 
-    * `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
+* `add_items(data, ids, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure. 
     * `num_threads` sets the number of cpu threads to use (-1 means use default).
-    * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
+    * `ids` is an optional N-size numpy array of integer labels for all elements in `data`.
+      - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
     * Thread-safe with other `add_items` calls, but not with `knn_query`.
     
-* `mark_deleted(data_label)`  - marks the element as deleted, so it will be omitted from search results.
+* `mark_deleted(label)`  - marks the element as deleted, so it will be omitted from search results.
 
 * `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
 
@@ -113,7 +114,7 @@ num_elements = 10000
 
 # Generating sample data
 data = np.float32(np.random.random((num_elements, dim)))
-data_labels = np.arange(num_elements)
+ids = np.arange(num_elements)
 
 # Declaring index
 p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
@@ -122,7 +123,7 @@ p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or
 p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
 
 # Element insertion (can be called several times):
-p.add_items(data, data_labels)
+p.add_items(data, ids)
 
 # Controlling the recall by setting ef:
 p.set_ef(50) # ef should always be > k
@@ -295,4 +296,13 @@ To run test **with** updates (from `build` directory)
 
 ### References
 
-Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320
+@article{malkov2018efficient,
+  title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs},
+  author={Malkov, Yu A and Yashunin, Dmitry A},
+  journal={IEEE transactions on pattern analysis and machine intelligence},
+  volume={42},
+  number={4},
+  pages={824--836},
+  year={2018},
+  publisher={IEEE}
+}
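
Review note on the README hunk above: the documented order of `M` and `ef_construction` in `init_index` is swapped, so a positional call written against the old text would bind its values to different parameters than the new text suggests. A minimal sketch of that effect follows. The two functions are hypothetical stand-ins that only mirror the two documented signatures; they are not part of hnswlib and say nothing about which order the installed binding actually uses.

```python
# Hypothetical stand-ins mirroring the two documented signatures.
# Neither function is part of hnswlib.
def init_index_old_doc(max_elements, ef_construction=200, M=16, random_seed=100):
    return {"max_elements": max_elements, "ef_construction": ef_construction, "M": M}


def init_index_new_doc(max_elements, M=16, ef_construction=200, random_seed=100):
    return {"max_elements": max_elements, "ef_construction": ef_construction, "M": M}


# The same positional call binds 100 and 32 to different parameters.
positional_old = init_index_old_doc(10000, 100, 32)
positional_new = init_index_new_doc(10000, 100, 32)

# Keyword calls do not depend on the documented order.
keyword_old = init_index_old_doc(max_elements=10000, ef_construction=100, M=32)
keyword_new = init_index_new_doc(max_elements=10000, ef_construction=100, M=32)

print(positional_old)
print(positional_new)
print(keyword_old == keyword_new)
```

The README example and the tests in this diff already pass `max_elements`, `ef_construction` and `M` by keyword, which sidesteps the ordering question.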
diff --git a/examples/pyw_hnswlib.py b/examples/pyw_hnswlib.py
@@ -11,8 +11,8 @@ def __init__(self, space, dim):
         self.dict_labels = {}
         self.cur_ind = 0
 
-    def init_index(self, max_elements, ef_construction = 200, M = 16):
-        self.index.init_index(max_elements = max_elements, ef_construction = ef_construction, M = M)
+    def init_index(self, max_elements, ef_construction=200, M=16):
+        self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)
 
     def add_items(self, data, ids=None):
         if ids is not None:
@@ -55,8 +55,7 @@ def knn_query(self, data, k=1):
         labels_int, distances = self.index.knn_query(data=data, k=k)
         labels = []
         for li in labels_int:
-            line = []
-            for l in li:
-                line.append(self.dict_labels[l])
-            labels.append(line)
+            labels.append(
+                [self.dict_labels[l] for l in li]
+            )
         return labels, distances
diff --git a/hnswlib/hnswalg.h b/hnswlib/hnswalg.h
@@ -573,29 +573,23 @@ namespace hnswlib {
             visited_list_pool_ = new VisitedListPool(1, new_max_elements);
 
 
-
             element_levels_.resize(new_max_elements);
 
             std::vector<std::mutex>(new_max_elements).swap(link_list_locks_);
 
             // Reallocate base layer
-            char * data_level0_memory_new = (char *) malloc(new_max_elements * size_data_per_element_);
+            char * data_level0_memory_new = (char *) realloc(data_level0_memory_, new_max_elements * size_data_per_element_);
             if (data_level0_memory_new == nullptr)
                 throw std::runtime_error("Not enough memory: resizeIndex failed to allocate base layer");
-            memcpy(data_level0_memory_new, data_level0_memory_,cur_element_count * size_data_per_element_);
-            free(data_level0_memory_);
-            data_level0_memory_=data_level0_memory_new;
+            data_level0_memory_ = data_level0_memory_new;
 
             // Reallocate all other layers
-            char ** linkLists_new = (char **) malloc(sizeof(void *) * new_max_elements);
+            char ** linkLists_new = (char **) realloc(linkLists_, sizeof(void *) * new_max_elements);
             if (linkLists_new == nullptr)
                 throw std::runtime_error("Not enough memory: resizeIndex failed to allocate other layers");
-            memcpy(linkLists_new, linkLists_,cur_element_count * sizeof(void *));
-            free(linkLists_);
-            linkLists_=linkLists_new;
-
-            max_elements_=new_max_elements;
+            linkLists_ = linkLists_new;
 
+            max_elements_ = new_max_elements;
         }
 
         void saveIndex(const std::string &location) {
@@ -987,11 +981,15 @@ namespace hnswlib {
                 auto search = label_lookup_.find(label);
                 if (search != label_lookup_.end()) {
                     tableint existingInternalId = search->second;
-
                     templock_curr.unlock();
 
                     std::unique_lock <std::mutex> lock_el_update(link_list_update_locks_[(existingInternalId & (max_update_element_locks - 1))]);
+
+                    if (isMarkedDeleted(existingInternalId)) {
+                        unmarkDeletedInternal(existingInternalId);
+                    }
                     updatePoint(data_point, existingInternalId, 1.0);
+                    
                     return existingInternalId;
                 }
 
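
Review note on the second `hnswalg.h` hunk above: when a label that already exists is added again, the update path now clears the deleted mark before calling `updatePoint`. The bookkeeping can be sketched with a toy Python model, assuming a plain dict and set in place of the real graph structures. `ToyIndex` and its methods are hypothetical and exist only for this illustration; they are not hnswlib code.

```python
class ToyIndex:
    """Toy label store: models only the delete/update bookkeeping, not HNSW."""

    def __init__(self):
        self.label_lookup = {}  # label -> internal id
        self.vectors = []       # internal id -> vector
        self.deleted = set()    # internal ids currently marked as deleted

    def mark_deleted(self, label):
        self.deleted.add(self.label_lookup[label])

    def add_point(self, vector, label):
        internal_id = self.label_lookup.get(label)
        if internal_id is not None:
            # Existing label: clear the deleted mark, then update in place.
            self.deleted.discard(internal_id)
            self.vectors[internal_id] = vector
            return internal_id
        # New label: append and register it.
        internal_id = len(self.vectors)
        self.vectors.append(vector)
        self.label_lookup[label] = internal_id
        return internal_id

    def visible_labels(self):
        return sorted(
            label for label, i in self.label_lookup.items() if i not in self.deleted
        )


index = ToyIndex()
index.add_point([0.0, 0.0], label=7)
index.add_point([1.0, 1.0], label=8)

index.mark_deleted(7)
hidden = index.visible_labels()       # label 7 is omitted while marked deleted

index.add_point([0.5, 0.5], label=7)  # re-adding the deleted label
restored = index.visible_labels()     # label 7 is visible again, with new data

print(hidden, restored)
```

Without the `discard` step, the re-added element would keep its deleted mark and stay out of the visible set even though its vector had been updated.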
diff --git a/python_bindings/bindings.cpp b/python_bindings/bindings.cpp
@@ -97,6 +97,8 @@ class Index {
     else if(space_name=="cosine") {
       l2space = new hnswlib::InnerProductSpace(dim);
       normalize=true;
+    } else {
+      throw std::runtime_error("Space name must be one of l2, ip, or cosine.");
     }
     appr_alg = NULL;
     ep_added = true;
@@ -162,6 +164,7 @@ class Index {
       }
       appr_alg = new hnswlib::HierarchicalNSW<dist_t>(l2space, path_to_index, false, max_elements);
       cur_l = appr_alg->cur_element_count;
+      index_inited = true;
     }
 
     void normalize_vector(float *data, float *norm_array){
diff --git a/python_bindings/tests/bindings_test.py b/python_bindings/tests/bindings_test.py
@@ -18,15 +18,15 @@ def testRandomSelf(self):
         # Declaring index
         p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip
 
-        # Initing index
+        # Initiating index
         # max_elements - the maximum number of elements, should be known beforehand
         #     (probably will be made optional in the future)
         #
         # ef_construction - controls index search speed/build speed tradeoff
         # M - is tightly connected with internal dimensionality of the data
-        #     stronlgy affects the memory consumption
+        #     strongly affects the memory consumption
 
-        p.init_index(max_elements = num_elements, ef_construction = 100, M = 16)
+        p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 
         # Controlling the recall by setting ef:
         # higher ef leads to better accuracy, but slower search
@@ -51,7 +51,7 @@ def testRandomSelf(self):
         p.save_index(index_path)
         del p
 
-        # Reiniting, loading the index
+        # Re-initiating, loading the index
         p = hnswlib.Index(space='l2', dim=dim)  # you can change the sa
 
         print("\nLoading index from '%s'\n" % index_path)
diff --git a/python_bindings/tests/bindings_test_getdata.py b/python_bindings/tests/bindings_test_getdata.py
@@ -19,13 +19,13 @@ def testGettingItems(self):
         # Declaring index
         p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip
 
-        # Initing index
+        # Initiating index
         # max_elements - the maximum number of elements, should be known beforehand
         #     (probably will be made optional in the future)
         #
         # ef_construction - controls index search speed/build speed tradeoff
         # M - is tightly connected with internal dimensionality of the data
-        #     stronlgy affects the memory consumption
+        #     strongly affects the memory consumption
 
         p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 
diff --git a/python_bindings/tests/bindings_test_labels.py b/python_bindings/tests/bindings_test_labels.py
@@ -21,13 +21,13 @@ def testRandomSelf(self):
             # Declaring index
             p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip
 
-            # Initing index
+            # Initiating index
             # max_elements - the maximum number of elements, should be known beforehand
             #     (probably will be made optional in the future)
             #
             # ef_construction - controls index search speed/build speed tradeoff
             # M - is tightly connected with internal dimensionality of the data
-            #     stronlgy affects the memory consumption
+            #     strongly affects the memory consumption
 
             p.init_index(max_elements=num_elements, ef_construction=100, M=16)
 
@@ -47,7 +47,7 @@ def testRandomSelf(self):
             # Query the elements for themselves and measure recall:
             labels, distances = p.knn_query(data1, k=1)
 
-            items=p.get_items(labels)
+            items = p.get_items(labels)
 
             # Check the recall:
             self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3)
@@ -67,8 +67,8 @@ def testRandomSelf(self):
             print("Deleted")
 
             print("\n**** Mark delete test ****\n")
-            # Reiniting, loading the index
-            print("Reiniting")
+            # Re-initiating, loading the index
+            print("Re-initiating")
             p = hnswlib.Index(space='l2', dim=dim)
 
             print("\nLoading index from '%s'\n" % index_path)
@@ -80,17 +80,17 @@ def testRandomSelf(self):
 
             # Query the elements for themselves and measure recall:
             labels, distances = p.knn_query(data, k=1)
-            items=p.get_items(labels)
+            items = p.get_items(labels)
 
             # Check the recall:
             self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3)
 
             # Check that the returned element data is correct:
-            diff_with_gt_labels=np.mean(np.abs(data-items))
+            diff_with_gt_labels = np.mean(np.abs(data-items))
             self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # deleting index.
 
             # Checking that all labels are returned correctly:
-            sorted_labels=sorted(p.get_ids_list())
+            sorted_labels = sorted(p.get_ids_list())
             self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0)
 
             # Delete data1
diff --git a/python_bindings/tests/bindings_test_pickle.py b/python_bindings/tests/bindings_test_pickle.py
diff --git a/python_bindings/tests/bindings_test_resize.py b/python_bindings/tests/bindings_test_resize.py
diff --git a/setup.py b/setup.py
