Skip to content

IntersectMBO/lsm-tree

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

lsm-tree

Cardano Handbook Build Haddocks

⚠️ This library is in active development: there is currently no release schedule!

This package is developed by Well-Typed LLP on behalf of Input Output Global, Inc. (IOG) and INTERSECT. The main contributors are Duncan Coutts, Joris Dral, Matthias Heinzel, Wolfgang Jeltsch, Wen Kokke, and Alex Washburn.

Description

This package contains an efficient implementation of on-disk key–value storage, implemented as a log-structured merge-tree or LSM-tree. An LSM-tree is a data structure for key–value mappings, similar to Data.Map, but optimized for large tables with a high insertion volume. It has support for:

  • Basic key–value operations, such as lookup, insert, and delete.

  • Range lookups, which efficiently retrieve the values for all keys in a given range.

  • Monoidal upserts which combine the stored and new values.

  • BLOB storage which assocates a large auxiliary BLOB with a key.

  • Durable on-disk persistence and rollback via named snapshots.

  • Cheap table duplication where all duplicates can be independently accessed and modified.

  • High-performance lookups on SSDs using I/O batching and parallelism.

This package exports two modules:

  • Database.LSMTree.Simple

    This module exports a simplified API which picks sensible defaults for a number of configuration parameters.

    It does not support upserts or BLOBs, due to their unintuitive interaction, see Upsert and BLOB.

    If you are looking at this package for the first time, it is strongly recommended that you start by reading this module.

  • Database.LSMTree

    This module exports the full API.

Upsert and BLOB

The interaction between upserts and BLOBs is unintuitive. A upsert updates the value associated with the key by combining the new and old values with a user-specified function. However, any BLOB associated with the key is simply deleted.

Portability

  • This package only supports 64-bit, little-endian systems.

  • On Windows, the package has only been tested with NTFS filesystems.

  • On Linux, executables using this package, including test and benchmark suites, must be compiled with the -threaded RTS option enabled.

Concurrency

LSM-trees can be used concurrently, but with a few restrictions:

  • Each session locks its session directory. This means that a database cannot be accessed from different processes at the same time.

  • Tables can be used concurrently and concurrent use of read operations such as lookups is determinstic. However, concurrent use of write operations such as insert or delete with any other operation results in a race condition.

Performance

The worst-case behaviour of the library is described using big-O notation. The documentation provides two measures of complexity:

  • The time complexity of operations is described in terms of the number of disk I/O operations and referred to as the disk I/O complexity. In practice, the time of the operations on LSM-trees is dominated by the number of disk I/O actions.

  • The space complexity of tables is described in terms of the in-memory size of an LSM-tree table. Both the in-memory and on-disk size of an LSM-tree table scale linearly with the number of physical entries. However, the in-memory size of an LSM-tree table is smaller than its on-disk size by a significant constant. This is discussed in detail below, under In-memory size of tables.

The complexities are described in terms of the following variables and constants:

  • The variable n refers to the number of physical table entries. A physical table entry is any key–operation pair, e.g., Insert k v or Delete k, whereas a logical table entry is determined by all physical entries with the same key. If the variable n is used to describe the complexity of an operation that involves multiple tables, it refers to the sum of all table entries.

  • The variable o refers to the number of open tables and cursors in the session.

  • The variable s refers to the number of snapshots in the session.

  • The variable b usually refers to the size of a batch of inputs/outputs. Its precise meaning is explained for each occurrence.

  • The constant B refers to the size of the write buffer, which is a configuration parameter.

  • The constant T refers to the size ratio of the table, which is a configuration parameter.

  • The constant P refers to the the average number of key–value pairs that fit in a page of memory.

Disk I/O cost of operations

The following table summarises the cost of the operations on LSM-trees measured in the number of disk I/O operations. If the cost depends on the merge policy, the table contains one entry for each merge policy. Otherwise, the merge policy is listed as N/A.

Resource Operation Merge policy Cost in disk I/O operations
Session Create/Open N/A O(1)
Close MergePolicyLazyLevelling $O(o \: T \: \log_T \frac{n}{B})$
Table Create N/A O(1)
Close MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
Lookup MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
Range Lookup MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B} + \frac{b}{P})$ *
Insert/Delete/Update MergePolicyLazyLevelling $O(\frac{1}{P} \: \log_T \frac{n}{B})$
Duplicate N/A O(0)
Union N/A $O(\frac{n}{P})$
Snapshot Save MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
Open N/A $O(\frac{n}{P})$
Delete MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
List N/A O(s)
Cursor Create MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
Close MergePolicyLazyLevelling $O(T \: \log_T \frac{n}{B})$
Read next entry N/A $O(\frac{1}{P})$

(*The variable b refers to the number of entries retrieved by the range lookup.)

TODO: Document the average-case behaviour of lookups.

In-memory size of tables

The in-memory size of an LSM-tree is described in terms of the variable n, which refers to the number of physical database entries. A physical database entry is any key–operation pair, e.g., Insert k v or Delete k, whereas a logical database entry is determined by all physical entries with the same key.

The worst-case in-memory size of an LSM-tree is O(n).

  • The worst-case in-memory size of the write buffer is O(B).

    The maximum size of the write buffer on the write buffer allocation strategy, which is determined by the confWriteBufferAlloc field of TableConfig. Regardless of write buffer allocation strategy, the size of the write buffer may never exceed 4GiB.

    AllocNumEntries maxEntries
    The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair.

  • The worst-case in-memory size of the Bloom filters is O(n).

    The total in-memory size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries. The required number of bits per physical entry is determined by the Bloom filter allocation strategy, which is determined by the confBloomFilterAlloc field of TableConfig.

    AllocFixed bitsPerPhysicalEntry
    The number of bits per physical entry is specified as bitsPerPhysicalEntry.

    AllocRequestFPR requestedFPR
    The number of bits per physical entry is determined by the requested false-positive rate, which is specified as requestedFPR.

    The false-positive rate scales exponentially with the number of bits per entry:

    False-positive rate Bits per entry
    1 in 10  ≈ 4.77
    1 in 100  ≈ 9.85
    1 in 1, 000  ≈ 15.79
    1 in 10, 000  ≈ 22.58
    1 in 100, 000  ≈ 30.22
  • The worst-case in-memory size of the indexes is O(n).

    The total in-memory size of all indexes depends on the index type, which is determined by the confFencePointerIndex field of TableConfig. The in-memory size of the various indexes is described in reference to the size of the database in memory pages.

    OrdinaryIndex
    An ordinary index stores the maximum serialised key for each memory page. The total in-memory size of all indexes is proportional to the average size of one serialised key per memory page.

    CompactIndex
    A compact index stores the 64 most significant bits of the minimum serialised key for each memory page, as well as 1 bit per memory page to resolve clashes, 1 bit per memory page to mark overflow pages, and a negligable amount of memory for tie breakers. The total in-memory size of all indexes is approximately 66 bits per memory page.

The total size of an LSM-tree must not exceed 241 physical entries. Violation of this condition is checked and will throw a TableTooLargeError.

Implementation

The implementation of LSM-trees in this package draws inspiration from:

About

A Haskell library for on-disk tables based on LSM-Trees

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages