⚠️ This library is in active development: there is currently no release schedule!
This package is developed by Well-Typed LLP on behalf of Input Output Global, Inc. (IOG) and INTERSECT. The main contributors are Duncan Coutts, Joris Dral, Matthias Heinzel, Wolfgang Jeltsch, Wen Kokke, and Alex Washburn.
This package contains an efficient implementation of on-disk key–value storage, implemented as a log-structured merge-tree, or LSM-tree. An LSM-tree is a data structure for key–value mappings, similar to `Data.Map`, but optimized for large tables with a high insertion volume.

It has support for:
- Basic key–value operations, such as lookup, insert, and delete.
- Range lookups, which efficiently retrieve the values for all keys in a given range.
- Monoidal upserts, which combine the stored and new values.
- BLOB storage, which associates a large auxiliary BLOB with a key.
- Durable on-disk persistence and rollback via named snapshots.
- Cheap table duplication, where all duplicates can be independently accessed and modified.
- High-performance lookups on SSDs using I/O batching and parallelism.
This package exports two modules:
- `Database.LSMTree.Simple`

  This module exports a simplified API which picks sensible defaults for a number of configuration parameters. It does not support upserts or BLOBs, due to their unintuitive interaction; see Upsert and BLOB below. If you are looking at this package for the first time, it is strongly recommended that you start by reading this module.

- `Database.LSMTree`

  This module exports the full API.
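As a first impression, a session and table workflow with the simplified module might look roughly like the sketch below. The function names and signatures here are assumptions for illustration only; consult the Haddock documentation of `Database.LSMTree.Simple` for the actual API.

```haskell
import qualified Database.LSMTree.Simple as LSMT  -- hypothetical usage sketch

main :: IO ()
main =
  -- Open (or create) a session rooted at a directory. The session locks
  -- the directory, so only one process may use it at a time.
  LSMT.withOpenSession "_lsm_session" $ \session ->
    LSMT.withTable session $ \table -> do
      -- Basic key–value operations (names assumed for illustration):
      LSMT.insert table ("key-1" :: String) ("value-1" :: String)
      mv <- LSMT.lookup table "key-1"
      print mv
```

The bracketed `with…` style mirrors common Haskell resource-management idioms, ensuring the session and table are closed even if an exception is thrown.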
The interaction between upserts and BLOBs is unintuitive. An upsert updates the value associated with the key by combining the new and old values with a user-specified function. However, any BLOB associated with the key is simply deleted.
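The upsert semantics itself can be modelled with `Data.Map`: an upsert combines the new value with the stored one when the key is present, and inserts the new value otherwise. A minimal self-contained sketch of that semantics, using `Data.Map` as a stand-in for a table rather than the package's actual API:

```haskell
import qualified Data.Map.Strict as Map

-- Model of a monoidal upsert: combine the new and old values with a
-- user-specified function, or insert the new value if the key is absent.
upsert :: Ord k => (v -> v -> v) -> k -> v -> Map.Map k v -> Map.Map k v
upsert resolve = Map.insertWith resolve

main :: IO ()
main = do
  let t0 = Map.fromList [("hits", 1 :: Int)]
      t1 = upsert (+) "hits" 2 t0   -- key present: combines 2 + 1 = 3
      t2 = upsert (+) "miss" 5 t1   -- key absent: behaves as a plain insert
  print (Map.toList t2)             -- prints [("hits",3),("miss",5)]
```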
- This package only supports 64-bit, little-endian systems.
- On Windows, the package has only been tested with NTFS filesystems.
- On Linux, executables using this package, including test and benchmark suites, must be compiled with the `-threaded` RTS option enabled.
LSM-trees can be used concurrently, but with a few restrictions:
- Each session locks its session directory. This means that a database cannot be accessed from different processes at the same time.
- Tables can be used concurrently, and concurrent use of read operations such as lookups is deterministic. However, concurrent use of write operations such as insert or delete with any other operation results in a race condition.
The worst-case behaviour of the library is described using big-O notation. The documentation provides two measures of complexity:
- The time complexity of operations is described in terms of the number of disk I/O operations and referred to as the disk I/O complexity. In practice, the time taken by operations on LSM-trees is dominated by the number of disk I/O actions.
- The space complexity of tables is described in terms of the in-memory size of an LSM-tree table. Both the in-memory and the on-disk size of an LSM-tree table scale linearly with the number of physical entries. However, the in-memory size of an LSM-tree table is smaller than its on-disk size by a significant constant. This is discussed in detail below, under In-memory size of tables.
The complexities are described in terms of the following variables and constants:
- The variable n refers to the number of physical table entries. A physical table entry is any key–operation pair, e.g., `Insert k v` or `Delete k`, whereas a logical table entry is determined by all physical entries with the same key. If the variable n is used to describe the complexity of an operation that involves multiple tables, it refers to the sum of all table entries.
- The variable o refers to the number of open tables and cursors in the session.
- The variable s refers to the number of snapshots in the session.
- The variable b usually refers to the size of a batch of inputs/outputs. Its precise meaning is explained for each occurrence.
- The constant B refers to the size of the write buffer, which is a configuration parameter.
- The constant T refers to the size ratio of the table, which is a configuration parameter.
- The constant P refers to the average number of key–value pairs that fit in a page of memory.
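The distinction between physical and logical entries can be made concrete with a small self-contained model. This is an illustration of how the entries are counted, not the package's internal representation:

```haskell
import qualified Data.Map.Strict as Map

-- A physical entry is one key–operation pair as it is stored.
data Op v = Insert v | Delete
  deriving Show

-- Physical entries, newest first: n = 4 here.
physical :: [(String, Op Int)]
physical =
  [ ("a", Insert 2)  -- the newest entry for "a" wins logically
  , ("b", Delete)
  , ("a", Insert 1)
  , ("c", Insert 7)
  ]

-- The logical view keeps only the newest operation per key, so the four
-- physical entries above collapse to three logical entries (and the
-- logical entry for "b" is a deletion).
logical :: Map.Map String (Op Int)
logical = Map.fromListWith (\_newer older -> older) physical

main :: IO ()
main = print (length physical, Map.size logical)  -- prints (4,3)
```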
The following table summarises the cost of the operations on LSM-trees measured in the number of disk I/O operations. If the cost depends on the merge policy, the table contains one entry for each merge policy. Otherwise, the merge policy is listed as N/A.
| Resource | Operation | Merge policy | Cost in disk I/O operations |
|---|---|---|---|
| Session | Create/Open | N/A | O(1) |
| Session | Close | `MergePolicyLazyLevelling` | |
| Table | Create | N/A | O(1) |
| Table | Close | `MergePolicyLazyLevelling` | |
| Table | Lookup | `MergePolicyLazyLevelling` | |
| Table | Range Lookup | `MergePolicyLazyLevelling` | |
| Table | Insert/Delete/Update | `MergePolicyLazyLevelling` | |
| Table | Duplicate | N/A | O(0) |
| Table | Union | N/A | |
| Snapshot | Save | `MergePolicyLazyLevelling` | |
| Snapshot | Open | N/A | |
| Snapshot | Delete | `MergePolicyLazyLevelling` | |
| Snapshot | List | N/A | O(s) |
| Cursor | Create | `MergePolicyLazyLevelling` | |
| Cursor | Close | `MergePolicyLazyLevelling` | |
| Cursor | Read next entry | N/A | |
(*The variable b refers to the number of entries retrieved by the range lookup.)
TODO: Document the average-case behaviour of lookups.
The in-memory size of an LSM-tree is described in terms of the variable n, which refers to the number of physical database entries. A physical database entry is any key–operation pair, e.g., `Insert k v` or `Delete k`, whereas a logical database entry is determined by all physical entries with the same key.

The worst-case in-memory size of an LSM-tree is O(n).
- The worst-case in-memory size of the write buffer is O(B).

  The maximum size of the write buffer depends on the write buffer allocation strategy, which is determined by the `confWriteBufferAlloc` field of `TableConfig`. Regardless of the write buffer allocation strategy, the size of the write buffer may never exceed 4 GiB.

  `AllocNumEntries maxEntries`: The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair.
- The worst-case in-memory size of the Bloom filters is O(n).

  The total in-memory size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries. The required number of bits per physical entry is determined by the Bloom filter allocation strategy, which is determined by the `confBloomFilterAlloc` field of `TableConfig`.

  `AllocFixed bitsPerPhysicalEntry`: The number of bits per physical entry is specified as `bitsPerPhysicalEntry`.

  `AllocRequestFPR requestedFPR`: The number of bits per physical entry is determined by the requested false-positive rate, which is specified as `requestedFPR`.

  The false-positive rate scales exponentially with the number of bits per entry:

  | False-positive rate | Bits per entry |
  |---|---|
  | 1 in 10 | ≈ 4.77 |
  | 1 in 100 | ≈ 9.85 |
  | 1 in 1,000 | ≈ 15.79 |
  | 1 in 10,000 | ≈ 22.58 |
  | 1 in 100,000 | ≈ 30.22 |
- The worst-case in-memory size of the indexes is O(n).

  The total in-memory size of all indexes depends on the index type, which is determined by the `confFencePointerIndex` field of `TableConfig`. The in-memory size of the various indexes is described in reference to the size of the database in memory pages.

  `OrdinaryIndex`: An ordinary index stores the maximum serialised key for each memory page. The total in-memory size of all indexes is proportional to the average size of one serialised key per memory page.

  `CompactIndex`: A compact index stores the 64 most significant bits of the minimum serialised key for each memory page, as well as 1 bit per memory page to resolve clashes, 1 bit per memory page to mark overflow pages, and a negligible amount of memory for tie-breakers. The total in-memory size of all indexes is approximately 66 bits per memory page.
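As a back-of-envelope check on the sizes above, the following self-contained sketch computes the write-buffer bound for an `AllocNumEntries`-style allocation, the Bloom-filter size for a fixed bits-per-entry budget, and the compact-index size at roughly 66 bits per memory page. All concrete numbers are illustrative assumptions, not defaults of the package:

```haskell
-- Illustrative sizing arithmetic for an LSM-tree table.
-- All inputs below are made-up example values, not package defaults.

bytesPerMiB :: Double
bytesPerMiB = 1024 * 1024

main :: IO ()
main = do
  let n             = 1.0e8 :: Double  -- physical entries
      avgEntryBytes = 100   :: Double  -- avg key–operation pair size
      maxWBEntries  = 2.0e4 :: Double  -- AllocNumEntries bound
      bitsPerEntry  = 9.85  :: Double  -- roughly a 1-in-100 false-positive rate
      pairsPerPage  = 40    :: Double  -- P: key–value pairs per memory page

      -- Write buffer: max entries times average entry size (capped at 4 GiB).
      writeBufferBytes = maxWBEntries * avgEntryBytes
      -- Bloom filters: bits per physical entry times physical entries.
      bloomBytes       = bitsPerEntry * n / 8
      -- Compact index: ~66 bits per memory page, with n / P pages.
      indexBytes       = 66 * (n / pairsPerPage) / 8

  putStrLn $ "write buffer  ~ " ++ show (writeBufferBytes / bytesPerMiB) ++ " MiB"
  putStrLn $ "bloom filters ~ " ++ show (bloomBytes / bytesPerMiB) ++ " MiB"
  putStrLn $ "compact index ~ " ++ show (indexBytes / bytesPerMiB) ++ " MiB"
```

The point of the sketch is the proportions: for large n, the Bloom filters dominate the in-memory footprint, while the compact index costs only a few bits per entry.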
The total size of an LSM-tree must not exceed 2⁴¹ physical entries. Violation of this condition is checked and will throw a `TableTooLargeError`.
The implementation of LSM-trees in this package draws inspiration from:
- Chris Okasaki. 1998. "Purely Functional Data Structures". doi:10.1017/CBO9780511530104
- Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. "Monkey: Optimal Navigable Key-Value Store". doi:10.1145/3035918.3064054
- Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. 2021. "Constructing and Analyzing the LSM Compaction Design Space". doi:10.14778/3476249.3476274