This directory adds support running experiments against pgvector - a PostgreSQL extension for vector similarity search.
It supports both HNSW and IVFFlat index types, with configurable index creation and search parameters.
To run VSB against a pgvector instance, simply provide the --database=pgvector
flag:
vsb --database=pgvector --workload=mnist-test
This will use the default connection parameters for a pgvector instance running in
the local Docker container - see docker/pgvector/
for the Docker compose file.
VSB is setup to connect to a pgvector instance running in a local Docker container.
See docker/pgvector/
for the Docker compose file.
You can connect to this instance via psql
with the password postgres
:
> psql --host=0.0.0.0 --port=5432 --dbname=postgres --username=postgres
To connect to a different instance, specify the appropriate connection parameters:
--pgvector_host
: Hostname of the pgvector instance--pgvector_port
: Port of the pgvector instance--pgvector_user
: Username to connect to the pgvector instance. Can also be set via theVSB__PGVECTOR_USERNAME
environment variable.--pgvector_password
: Password to connect to the pgvector instance. Can also be set the via theVSB__PGVECTOR_PASSWORD
environment variable.
By default, the hnsw index type is used. To use the ivfflat index type, specify:
--pgvector_index_type=ivfflat
Possible values for --pgvector_index_type
are hnsw, ivfflat, gin, hnsw+gin,
ivfflat+gin, and none (which will default to an exhaustive kNN search).
The amount of memory to use when creating the index (maintenance_work_mem
) can be set
via --pgvector_maintenance_work_mem
. This defaults to 4GB (50% of the memory assigned
to the Docker container). This should generally be set to at least as large as
the index to avoid excessive index build times.
By default, the number of lists the IVFFlat index is split into is set automatically based on the suggested values(1) for the workload:
Choose an appropriate number of lists - a good place to start is
rows / 1000
for up to 1M rows andsqrt(rows)
for over 1M rows
However this value can be overridden via:
--pgvector_ivfflat_lists=<int>
By default, the number of search candidates (ef_search
for hnsw, nprobes
for
ivfflat) is set automatically based on the recommended values(2,1) for the workload:
- hnsw:
2 * top_k
- ivfflat:
sqrt(lists)
This can be overridden and an explicit value set via:
--pgvector_search_candidates=<int>
.