Commit 3c6a84f

Merge pull request #219 from nmslib/develop
Update master to 0.4.0
2 parents a97ec89 + 92e5b74 commit 3c6a84f

10 files changed, +765 -112 lines changed

CMakeLists.txt

+10-1
@@ -5,13 +5,22 @@ include_directories("${PROJECT_BINARY_DIR}")



-set(SOURCE_EXE main.cpp)
+set(SOURCE_EXE main.cpp)

 set(SOURCE_LIB sift_1b.cpp)

 add_library(sift_test STATIC ${SOURCE_LIB})


 add_executable(main ${SOURCE_EXE})
+if (CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
+SET( CMAKE_CXX_FLAGS "-Ofast -DNDEBUG -std=c++11 -DHAVE_CXX0X -openmp -march=native -fpic -ftree-vectorize")
+elseif (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
 SET( CMAKE_CXX_FLAGS "-Ofast -lrt -DNDEBUG -std=c++11 -DHAVE_CXX0X -openmp -march=native -fpic -w -fopenmp -ftree-vectorize -ftree-vectorizer-verbose=0" )
+elseif (CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
+SET( CMAKE_CXX_FLAGS "-Ofast -lrt -DNDEBUG -std=c++11 -DHAVE_CXX0X -openmp -march=native -fpic -w -fopenmp -ftree-vectorize" )
+endif()
+
+add_executable(test_updates examples/updates_test.cpp)
+
 target_link_libraries(main sift_test)

README.md

+32-4
@@ -1,9 +1,13 @@
 # Hnswlib - fast approximate nearest neighbor search
-Header-only C++ HNSW implementation with python bindings. Paper code for the HNSW 200M SIFT experiment
+Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment

 **NEWS:**

-**Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib is now can be installed via pip!**
+* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remains the same, but performance and memory usage should no longer degrade as you update element embeddings).**
+
+* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not a multiple of 4**
+
+* **Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib can now be installed via pip!**

 Highlights:
 1) Lightweight, header-only, no dependencies other than C++ 11.
@@ -23,10 +27,10 @@ Description of the algorithm parameters can be found in [ALGO_PARAMS.md](ALGO_PA
 | Distance | parameter | Equation |
 | ------------- |:---------------:| -----------------------:|
 |Squared L2 |'l2' | d = sum((Ai-Bi)^2) |
-|Inner product |'ip' | d = 1.0 - sum(Ai\*Bi)) |
+|Inner product |'ip' | d = 1.0 - sum(Ai\*Bi) |
 |Cosine similarity |'cosine' | d = 1.0 - sum(Ai\*Bi) / sqrt(sum(Ai\*Ai) * sum(Bi\*Bi))|

-Note that inner product is not an actual metric. An element can be closer to some other element than to itself.
+Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.

 For other spaces use the nmslib library https://github.com/nmslib/nmslib.
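As a rough illustration of the distance table and the inner-product note in the hunk above, the three formulas can be written directly in NumPy. This is a minimal sketch, not hnswlib's implementation: the function names `l2_sq`, `ip_distance`, `cosine_distance`, and `closest_to_self_mask` are hypothetical, and the vectors are made up to show an element that is closer to another element than to itself under `'ip'`.

```python
import numpy as np


def l2_sq(a, b):
    # 'l2': d = sum((Ai - Bi)^2)
    return float(np.sum((a - b) ** 2))


def ip_distance(a, b):
    # 'ip': d = 1.0 - sum(Ai * Bi)
    return float(1.0 - np.dot(a, b))


def cosine_distance(a, b):
    # 'cosine': d = 1.0 - sum(Ai * Bi) / sqrt(sum(Ai * Ai) * sum(Bi * Bi))
    return float(1.0 - np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


def closest_to_self_mask(x):
    # Hypothetical helper: True for rows whose 'ip' distance to themselves
    # is the smallest distance in their row of the pairwise distance matrix.
    dist = 1.0 - x @ x.T
    return np.diag(dist) <= dist.min(axis=1)


a = np.array([1.0, 0.0], dtype=np.float32)
b = np.array([3.0, 0.0], dtype=np.float32)

print(ip_distance(a, a))  # 0.0
print(ip_distance(a, b))  # -2.0, so `a` is "closer" to `b` than to itself

points = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], dtype=np.float32)
mask = closest_to_self_mask(points)
print(mask)  # row 0 is not closest to itself; rows 1 and 2 are
```

Rows where the mask is `False` are the elements the note refers to as not being closest to themselves.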

@@ -42,6 +46,7 @@ Index methods:
 * `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
 * `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
 * `num_threads` sets the number of cpu threads to use (-1 means use default).
+* `data_labels` specifies the labels for the data. If the index already contains elements with the same labels, their features will be updated. Note that the update procedure is slower than inserting a new element, but more memory- and query-efficient.
 * Thread-safe with other `add_items` calls, but not with `knn_query`.

 * `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.
@@ -223,6 +228,29 @@ To run the test on 200M SIFT subset:

 The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**.

+### Updates test
+To generate testing data (from root directory):
+```bash
+cd examples
+python update_gen_data.py
+```
+To compile (from root directory):
+```bash
+mkdir build
+cd build
+cmake ..
+make
+```
+To run test **without** updates (from `build` directory)
+```bash
+./test_updates
+```
+
+To run test **with** updates (from `build` directory)
+```bash
+./test_updates update
+```
+
 ### HNSW example demos

 - Visual search engine for 1M amazon products (MXNet + HNSW): [website](https://thomasdelteil.github.io/VisualSearch_MXNet/), [code](https://github.com/ThomasDelteil/VisualSearch_MXNet), demo by [@ThomasDelteil](https://github.com/ThomasDelteil)

examples/update_gen_data.py

+37
@@ -0,0 +1,37 @@
+import numpy as np
+import os
+
+def normalized(a, axis=-1, order=2):
+    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
+    l2[l2==0] = 1
+    return a / np.expand_dims(l2, axis)
+
+N=100000
+dummy_data_multiplier=3
+N_queries = 1000
+d=8
+K=5
+
+np.random.seed(1)
+
+print("Generating data...")
+batches_dummy= [ normalized(np.float32(np.random.random( (N,d)))) for _ in range(dummy_data_multiplier)]
+batch_final = normalized (np.float32(np.random.random( (N,d))))
+queries = normalized(np.float32(np.random.random( (N_queries,d))))
+print("Computing distances...")
+dist=np.dot(queries,batch_final.T)
+topk=np.argsort(-dist)[:,:K]
+print("Saving...")
+
+try:
+    os.mkdir("data")
+except OSError as e:
+    pass
+
+for idx, batch_dummy in enumerate(batches_dummy):
+    batch_dummy.tofile('data/batch_dummy_%02d.bin' % idx)
+batch_final.tofile('data/batch_final.bin')
+queries.tofile('data/queries.bin')
+np.int32(topk).tofile('data/gt.bin')
+with open("data/config.txt", "w") as file:
+    file.write("%d %d %d %d %d" %(N, dummy_data_multiplier, N_queries, d, K))
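
The script above writes raw `float32`/`int32` arrays with no header, so a reader has to take the shapes from `config.txt`. The following is a minimal sketch of reading the files back and checking them against the ground truth, assuming the layout shown in the diff. `load_update_data`, `recall_at_k`, and `write_tiny_dataset` are hypothetical helpers that are not part of the commit, and the tiny sizes are chosen only so the example runs quickly in a temporary directory.

```python
import os
import tempfile

import numpy as np


def load_update_data(data_dir):
    """Read the files written by update_gen_data.py back into arrays."""
    with open(os.path.join(data_dir, "config.txt")) as f:
        n, n_dummy, n_queries, d, k = (int(x) for x in f.read().split())

    def read(name, dtype, shape):
        # The .bin files are headerless, so dtype and shape come from the config.
        return np.fromfile(os.path.join(data_dir, name), dtype=dtype).reshape(shape)

    return {
        "batches_dummy": [
            read("batch_dummy_%02d.bin" % i, np.float32, (n, d)) for i in range(n_dummy)
        ],
        "batch_final": read("batch_final.bin", np.float32, (n, d)),
        "queries": read("queries.bin", np.float32, (n_queries, d)),
        "gt": read("gt.bin", np.int32, (n_queries, k)),
    }


def recall_at_k(found, gt):
    """Fraction of ground-truth neighbours that also appear in `found`, per row."""
    hits = sum(len(set(f) & set(g)) for f, g in zip(found.tolist(), gt.tolist()))
    return hits / gt.size


def write_tiny_dataset(data_dir, n=50, n_dummy=2, n_queries=4, d=8, k=5):
    """Hypothetical stand-in for the script: same file layout, much smaller sizes."""
    rng = np.random.RandomState(1)

    def unit_rows(rows):
        x = rng.random_sample((rows, d)).astype(np.float32)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    for i in range(n_dummy):
        unit_rows(n).tofile(os.path.join(data_dir, "batch_dummy_%02d.bin" % i))
    batch_final, queries = unit_rows(n), unit_rows(n_queries)
    batch_final.tofile(os.path.join(data_dir, "batch_final.bin"))
    queries.tofile(os.path.join(data_dir, "queries.bin"))
    topk = np.argsort(-np.dot(queries, batch_final.T))[:, :k]
    np.int32(topk).tofile(os.path.join(data_dir, "gt.bin"))
    with open(os.path.join(data_dir, "config.txt"), "w") as f:
        f.write("%d %d %d %d %d" % (n, n_dummy, n_queries, d, k))


with tempfile.TemporaryDirectory() as tmp:
    write_tiny_dataset(tmp)
    data = load_update_data(tmp)  # np.fromfile copies into memory, so this outlives tmp

# Recompute the top-K by brute force from the loaded arrays and compare with gt.bin.
k = data["gt"].shape[1]
brute_force = np.argsort(-np.dot(data["queries"], data["batch_final"].T))[:, :k]
recall = recall_at_k(brute_force, data["gt"])
print(data["batch_final"].shape, data["queries"].shape, data["gt"].shape)
print("recall@%d: %.3f" % (k, recall))
```

Against the real `data/` directory only the `load_update_data("data")` call is needed. The same `recall_at_k` helper could then score labels returned by an approximate index against `gt`.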
