-
It might be better to read the secondary key build data files in a separate thread (or threads) and wait for them before starting to build the secondary keys (if we use the approach with separate index recovery data files).
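For illustration, a minimal sketch of that flow, assuming the per-checkpoint sort data is a plain file (the file name and the buffer handling here are hypothetical):

```c
/* Hypothetical sketch: load a sort-data file in a background thread
 * while the primary key is being recovered, then join before starting
 * the secondary key build. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct prefetch_task {
	const char *path;	/* e.g. "00000000000000000004.sortdata" */
	char *data;		/* loaded file contents */
	long size;
};

static void *
prefetch_file(void *arg)
{
	struct prefetch_task *task = arg;
	FILE *f = fopen(task->path, "rb");
	if (f == NULL)
		return NULL;
	fseek(f, 0, SEEK_END);
	task->size = ftell(f);
	fseek(f, 0, SEEK_SET);
	task->data = malloc(task->size);
	if (task->data != NULL)
		fread(task->data, 1, task->size, f);
	fclose(f);
	return NULL;
}

int
main(void)
{
	struct prefetch_task task = {"00000000000000000004.sortdata", NULL, 0};
	pthread_t tid;
	pthread_create(&tid, NULL, prefetch_file, &task);

	/* ... recover the primary keys from the snapshot here ... */

	pthread_join(tid, NULL);	/* wait for the sort data */
	/* ... build the secondary keys using task.data ... */
	free(task.data);
	return 0;
}
```

In Tarantool itself this would more likely go through the existing thread/coio machinery rather than a raw pthread, but the ordering (prefetch, recover PK, join, build SKs) is the point.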
-
It'd be good to do a small experiment proving the viability of the new approach:
(AFAIU using a hash table isn't really cache-friendly; I'm curious what the performance impact of random accesses would be)
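For what it's worth, the random-access cost can be estimated with a toy benchmark like the one below (not Tarantool code: it just compares sequential reads with reads through a shuffled index array):

```c
/* Toy experiment: sum 8-byte values accessed sequentially vs. through a
 * shuffled order array (emulating random tuple accesses). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)	/* 16M entries, ~128MB of data */

int
main(void)
{
	uint64_t *data = malloc(N * sizeof(*data));
	uint32_t *order = malloc(N * sizeof(*order));
	for (uint32_t i = 0; i < N; i++) {
		data[i] = i;
		order[i] = i;
	}
	/* Crude Fisher-Yates shuffle of the access order. */
	for (uint32_t i = N - 1; i > 0; i--) {
		uint32_t j = rand() % (i + 1);
		uint32_t tmp = order[i];
		order[i] = order[j];
		order[j] = tmp;
	}
	uint64_t sum = 0;
	clock_t t0 = clock();
	for (uint32_t i = 0; i < N; i++)
		sum += data[i];			/* sequential accesses */
	clock_t t1 = clock();
	for (uint32_t i = 0; i < N; i++)
		sum += data[order[i]];		/* random accesses */
	clock_t t2 = clock();
	printf("seq: %.3fs, random: %.3fs (sum=%llu)\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC,
	       (unsigned long long)sum);
	free(data);
	free(order);
	return 0;
}
```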
-
Could you please provide more information about how and when you will use these? To be more specific, let's say you have two spaces:

By the way, does building secondary indexes only after WAL recovery really provide any sensible performance gains? It would be simpler to always build secondary indexes after snapshot recovery.
-
I would elaborate a bit. The idea is to have a "fake" space (let's call it …). Such a space shouldn't actually store anything in memory, only process the tuples in a system … So, basically, the flow is:
-
As I understand, there are variants that can be implemented without any options. For this we need to ensure that the secondary key order data doesn't break old Tarantool versions in any way (including wasted memory). I think we can consider the following variants:

In my opinion, the blackhole engine/NOP operations are tricks, while holding the extra data in separate files looks like a direct solution, so it seems better to me. OTOH, we should verify how the snapshotting time changes in case of several secondary indexes. If it becomes much longer, maybe an option to refuse to write it is useful (but I'm not sure). It is also important to take into account that a modern SSD is able to serve several parallel write requests at a speed close to that of a single write. I think that writing several files in parallel may be very profitable here.

Technically speaking, shouldn't we check the presence of such triggers while reading the data, not while writing it? I guess that if we have the trigger while writing a snapshot, we likely have it after restart too, but there is no such guarantee.
-
Will this feature work for functional and multikey indexes? Please add this information to the RFC.
-
Hi there! Let's consider the following alternative: the snapshot process stores tuple addresses in additional fields of the INSERT commands:

$ tarantoolctl cat 00000000000000000004.snap
---
HEADER:
lsn: 393
type: INSERT
timestamp: 1738067327.6443
BODY:
space_id: 512
tuple: [1, 2, 3]
---
HEADER:
lsn: 394
type: INSERT
timestamp: 1738067327.6443
BODY:
space_id: 512
tuple: [2, 3, 4]
We can use an additional field in the HEADER or BODY of each record. The approach must be backward compatible.
-
I vote for this variant. But the space must have a format that is compatible with the oldest Tarantool version (for the downgrade process). So, I think its format can be:
There are two ways here:
Also, it would be nice to have an instruction …
-
Will this feature work with tuple compression?
-
So, actually, without hints … UPD: I've read this section again and understood what you meant. Let's just write below that we should multiply the memory consumption by …
-
> Let's add a tuple pointer to the struct.

Looks OK to me.

> I think memcs should store its data in different files.

Looks OK to me.
-
I think it is enough to have …

The other variants (per space, etc.) are overkill. PS: the second option is optional: a user could remove …
-
Please don't forget to handle corner cases when hints are enabled/disabled by a trigger on the _index space prior to recovery. You might end up with an index which has hints stored on disk while the hints are not requested by the index definition, and vice versa. Same (although rather unlikely) case: when an index is dropped completely by a before_replace trigger on the _index space. This might be needed in some disaster recovery scenarios. What I'm trying to say is that we must be ready for the data saved on disk not being needed, and we must be ready for the expected data not being present on disk. In both cases some warning should be logged.
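For example, the fallback could look roughly like this (a sketch only: the helper and the warning wording are made up, `say_warn()` is the usual logging macro):

```c
/* Hypothetical check before using on-disk sort data for an index;
 * any mismatch leads to a warning and the regular build path. */
#include <stdbool.h>
#include "say.h"

static bool
sort_data_usable(const char *index_name, bool index_uses_hints,
		 bool sort_data_present, bool sort_data_has_hints)
{
	if (!sort_data_present) {
		say_warn("sort data for index '%s' is not present on disk, "
			 "falling back to the regular SK build", index_name);
		return false;
	}
	if (index_uses_hints != sort_data_has_hints) {
		say_warn("sort data for index '%s' does not match the index "
			 "definition, falling back to the regular SK build",
			 index_name);
		return false;
	}
	return true;
}
```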
Why did you choose this name for the file?
Let's please add …
-
Reviewers
ToC
The problem
The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has `n*log(n)` complexity). Let's save the order of secondary keys in the snapshot to reduce the build complexity to `O(n)` (this, in theory, should speed up recovery of secondary keys).

The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is: save the tuple pointers as they were at checkpoint time (the PK pointers in PK order and, for each SK, the pointers in that SK's order), then on recovery map the old addresses to the new ones.

For example, suppose the saved PK tuple addresses were `[0x01, 0x02, 0x03]` and after PK recovery the same tuples got the addresses `[0xa3, 0xb2, 0xc1]`. Then we know that the first tuple in the primary index had address `0x01` and its new address is `0xa3`, the second one had `0x02` and its new address is `0xb2`, and so on. So we can easily build the old-to-new mapping. If the saved SK sort data is `[0xffff00001111, 0xffff00003333, 0xffff00002222]`, after applying the mapping it becomes `[0xffff0000aaaa, 0xffff0000cccc, 0xffff0000bbbb]`, an array of actual tuples in the required order.
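A toy sketch of the remapping step (plain C; a real implementation would likely use a hash table keyed by the old address rather than sort-plus-binary-search):

```c
/* Toy remapping of the saved SK order: each element is a tuple address
 * as it was at checkpoint time and becomes the address of the same
 * tuple after PK recovery. */
#include <stdint.h>
#include <stdlib.h>

struct addr_map_entry {
	uintptr_t old_addr;	/* tuple address at checkpoint time */
	uintptr_t new_addr;	/* tuple address after PK recovery */
};

static int
addr_map_cmp(const void *a, const void *b)
{
	const struct addr_map_entry *ea = a, *eb = b;
	if (ea->old_addr < eb->old_addr)
		return -1;
	return ea->old_addr > eb->old_addr;
}

static void
remap_sk_order(struct addr_map_entry *map, size_t pk_count,
	       uintptr_t *sk_order, size_t sk_count)
{
	qsort(map, pk_count, sizeof(*map), addr_map_cmp);
	for (size_t i = 0; i < sk_count; i++) {
		struct addr_map_entry key = {sk_order[i], 0};
		struct addr_map_entry *e =
			bsearch(&key, map, pk_count, sizeof(*map),
				addr_map_cmp);
		/* Now the entry points to the recovered tuple. */
		sk_order[i] = e->new_addr;
	}
}
```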
Extra: hints

If the index we're loading the new way is a TREE with hints enabled, then it's not reasonable to use this approach unless we save the tuple hints along with the tuple pointers in the index sort data files. The reason is the CPU cache effects described below.
Details on the cache effects.
Currently we perform the following steps on recovery: fill the SK build array with the tuples in PK order (computing the hints along the way, with mostly sequential memory accesses), then use `tt_sort` to reorder the array the way it should be in the SK.

But if we use the new approach, we insert the index data into the build array right in SK order. That means, since the tuples are located in memory in PK order, calculating the hint of each tuple inserted into the SK requires a lot of random memory accesses. This destroys the gain we've received from the O(n) build array generation: in our PoC we had 10 seconds wasted in a single data load instruction.

So the approach is useless for indexes with hints unless we save the hints in the same sort data file, so we don't have to calculate them at runtime.
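A schematic illustration of the two fill loops with toy types (names and layout are illustrative, not the actual memtx code):

```c
/* "Tuples" live in an array laid out in PK order; the hint is derived
 * from the tuple contents, emulating the real hint calculation. */
#include <stddef.h>
#include <stdint.h>

struct toy_tuple {
	uint64_t key;		/* the SK field the hint is derived from */
	char payload[48];
};

struct sk_node {
	struct toy_tuple *tuple;
	uint64_t hint;
};

/* Current flow: scan tuples sequentially in PK order, compute hints
 * (prefetch-friendly reads), then sort the array into SK order. */
static void
fill_in_pk_order(struct toy_tuple *pk, size_t n, struct sk_node *out)
{
	for (size_t i = 0; i < n; i++) {
		out[i].tuple = &pk[i];
		out[i].hint = pk[i].key;	/* sequential reads */
	}
	/* ... tt_sort-like reordering of 'out' happens here ... */
}

/* New flow without stored hints: the array is filled directly in SK
 * order, but every hint computation dereferences a tuple located at a
 * "random" address, which is the cache-unfriendly part. */
static void
fill_in_sk_order(struct toy_tuple **sk_order, size_t n, struct sk_node *out)
{
	for (size_t i = 0; i < n; i++) {
		out[i].tuple = sk_order[i];
		out[i].hint = sk_order[i]->key;	/* random reads */
	}
}
```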
Extra: multikey indexes
The multikey indexes have the same problem, since we have to access the tuple data in order to decide whether we should include a particular copy of the tuple into the index (due to the `exclude_null` option); but also, the hints are required to specify the tuple order in the index (the tuple pointer by itself does not tell which multikey member of the tuple is meant to be located here). So the multikey hints are to be saved in the sort data files too.

Extra: functional keys
Here we have the same problem as for multikey and regular indexes with hints, but this one does not seem to have a solution, so let's totally disable the feature for functional indexes.
Summary: the data to be stored
Implementation details
Here's the proposed sequence of events:
The sort data must only be used on non-system spaces, if they have no `before_replace` triggers and no `force_recovery` is specified (system spaces are less likely to benefit from it). Also, it can only be safely used during recovery if the `_index` space has no `before_replace` triggers registered, because otherwise such a trigger could change the index we saved the sort data for, so that the data is no longer applicable to it.
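A sketch of how such an applicability check might look; the helper and field names below (`space_is_system()`, the `before_replace` trigger lists) are assumptions about the surrounding code, not a reference to the actual implementation:

```c
/* Sketch: decide whether the saved sort data may be used for a space.
 * Presumes box/space.h-style definitions; names are illustrative. */
#include <stdbool.h>
#include "box/space.h"
#include "small/rlist.h"

static bool
memtx_sort_data_applicable(struct space *space, struct space *index_space,
			   bool force_recovery)
{
	if (force_recovery)
		return false;
	if (space_is_system(space))
		return false;
	/* A trigger may rewrite tuples, invalidating the saved order. */
	if (!rlist_empty(&space->before_replace))
		return false;
	/* A trigger on _index may redefine the index itself. */
	if (!rlist_empty(&index_space->before_replace))
		return false;
	return true;
}
```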
The storage

Let's store the sort data (a binary sequence of 8-byte pointers[ and hints]) in a `<vclock_signature>.sortdata` file. The file is created along with the regular snapshot file in `memtx_engine_begin_checkpoint`, but only for TREE indexes.

The structure, more thoroughly: each index gets a header entry like

`512/1: 0x0000000000000041, 0x0000000000000526, 00000000000000001536, gz, 0x0000000000003000\n`

where extra fields can be added to specify e.g. the compression algorithm and the original size.
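For illustration, such a header entry could be parsed along these lines; treating the first two numbers as the offset and size of the per-index data is an assumption made for this example:

```c
/* Hypothetical parser for one header entry of the .sortdata file, e.g.
 * "512/1: 0x0000000000000041, 0x0000000000000526, ...\n"; any extra
 * fields (compression, original size, ...) are ignored here. */
#include <inttypes.h>
#include <stdio.h>

struct sortdata_entry {
	uint32_t space_id;
	uint32_t index_id;
	uint64_t offset;	/* assumed: where the index data starts */
	uint64_t size;		/* assumed: how many bytes it occupies */
};

static int
sortdata_parse_entry(const char *line, struct sortdata_entry *e)
{
	int rc = sscanf(line, "%" SCNu32 "/%" SCNu32 ": %" SCNx64 ", %" SCNx64,
			&e->space_id, &e->index_id, &e->offset, &e->size);
	return rc == 4 ? 0 : -1;
}
```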
Alternatives considered.

Save the information in the snapshot file metadata: this way we're limited by the max metadata header size, which is pretty small (~2KB as of ef3775a).
Save the information as one of the entries in the snapshot file: we can't create a new fixheader type since it's strictly checked, and a new request type can't be used either, because memtx only accepts inserts (along with RAFT stuff) there.
A system blackhole space.
The idea is to insert the information into a system blackhole space at the end of the snapshot. If the snapshot does not contain the space, then the recovery is performed the old way (compatibility with old snapshots). Don't write the space on `box.snapshot()` after a downgrade (for backward compatibility). The blackhole space tuple format:

- `MP_BOOL`: `true` if the tuple is the last one for the given index.
- `MP_UINT`
- `MP_UINT`
- `MP_STRING`
The SK sort data of a particular index consists of a number of such tuples; this is required to reduce the tuple arena usage. The last of the tuples has the `is_last` flag set to `true` to mark the point where the index has all the information required and can be built using the new approach.

Since the space with the sort data is filled at the end of the snapshot, the old tuple addresses must be saved in some other way, so that we can create the old-to-new address map during PK recovery. Let's save the PK tuple pointers in the snapshot right inside the headers of `INSERT` entries (right next to the timestamp).
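For illustration, one such chunk tuple could be encoded with msgpuck as follows, assuming the two `MP_UINT` fields identify the space and the index and the `MP_STRING` field carries a chunk of packed 8-byte pointers (the field meaning is an assumption of this sketch):

```c
/* Hypothetical encoder of one sort-data chunk tuple for the blackhole
 * space; the caller is expected to have sized the buffer beforehand
 * (mp_sizeof_array() and friends). */
#include <stdbool.h>
#include <stdint.h>
#include "msgpuck.h"

static char *
encode_sort_data_tuple(char *pos, bool is_last, uint32_t space_id,
		       uint32_t index_id, const char *chunk,
		       uint32_t chunk_len)
{
	pos = mp_encode_array(pos, 4);
	pos = mp_encode_bool(pos, is_last);
	pos = mp_encode_uint(pos, space_id);
	pos = mp_encode_uint(pos, index_id);
	pos = mp_encode_str(pos, chunk, chunk_len);
	return pos;
}
```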
Alternative considered.

We could create another space for such data and write it before the user spaces, but that would require extra RAM to keep the PK tuple addresses around until we get to build the space the information is intended for. So we'd better store the information along with the tuples we load and forget it right after use.
Comparison to option 0:
➖ The sort data is built into the snapshot, so the latter can't be moved (backed up) separately.
➖ The sort data is first instantiated into a tuple and then provided to indexes (extra indirection).
➖ No nice way to send the data to replicas unless we generate it from scratch or hack the xlog reader.
➖ Additional ~1 byte per PK tuple of persistent storage.
Getting tuple pointers from the read view
It would be nice (although probably not necessary, this needs to be measured) to be able to receive both the tuple data and a pointer to the raw tuple from the index in one step. So let's return the tuple pointer along with the data in `struct read_view_tuple` (add a new `struct tuple *ptr` field).
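A sketch of the proposed extension; the existing field set shown here is abbreviated and may not match the actual struct exactly:

```c
#include <stdint.h>

struct tuple;

/* Read view tuple extended with a pointer to the in-memory tuple, so
 * the checkpoint writer can dump the pointer into the sort data without
 * an extra lookup. */
struct read_view_tuple {
	/* Raw tuple data, as returned today. */
	const char *data;
	uint32_t size;
	/* Proposed: the in-memory tuple the data belongs to. */
	struct tuple *ptr;
};
```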
Alternative considered.

Introduce a new `struct tuple **ptr` output parameter to the `index_read_view_iterator_base::next_raw` callback.

The configuration
The approach only makes sense if the database has many secondary keys of TREE type and fast persistent storage, because it increases the initial recovery time and relies on fast direct sort data reads. So a new boolean `memtx_sort_data_enabled` (or `memtx.sort_data_enabled`) configuration option is proposed to specify whether to use the new approach when building secondary keys and writing the snapshot.

The configuration variable is to be changeable at runtime. If one does not want to use the sort data on recovery but wants to write it during snapshotting, it can be set to `false` initially and then reconfigured to `true` prior to `box.snapshot()`. It works the opposite way too.

Expected results
Checkpointing
➖ Additional data to be written to the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK (a ballpark example follows the Recovery list below).
➖ Additional RAM required if we change the secondary keys during checkpointing (now we only create a read view of the PK, but we would have to read-view the SKs too).
Recovery
➖ Additional data to be read from the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ The old-to-new tuple address map is to be created and filled (more RAM and CPU time required).
➕ The build of secondary indexes is significantly sped up.
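As a ballpark illustration of these numbers, for an assumed dataset of 100M tuples with one TREE SK with hints:

```c
/* Back-of-the-envelope sort data size for an assumed dataset:
 * 100M tuples, one TREE SK with hints (8-byte pointer + 8-byte hint). */
#include <stdio.h>

int
main(void)
{
	const double tuples = 100e6;
	double pk_bytes = tuples * 8;		/* PK tuple pointers */
	double sk_bytes = tuples * (8 + 8);	/* SK pointers + hints */
	printf("sort data: %.1f GB\n", (pk_bytes + sk_bytes) / 1e9);
	/* Prints: sort data: 2.4 GB */
	return 0;
}
```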
Practical results (storage option 0, but they're simple binary files)
Configuration
CPU: Zen 5, 8 dedicated cores
Storage: NVMe (7000 MB/s read, 5000 MB/s write)
Schema: PK over field 1, SK over field 2; tuples with 2 random unsigned fields.
Checkpointing

- 0.119GB / 0.119GB
- 0.771GB / 0.771GB
- 7.305GB / 7.305GB
- 72.623GB / 72.623GB
Recovery

| tt_sort, 1 thread | tt_sort, 2 threads | sort data |
|---|---|---|
| Initial recovery: 0.302s<br>SK build: 0.098s<br>Mem: 0.131GB | Initial recovery: 0.314s<br>SK build: 0.061s<br>Mem: 0.131GB | Initial recovery: 0.357s<br>SK build: 0.043s<br>Mem: 0.139GB |
| Initial recovery: 2.266s<br>SK build: 1.104s<br>Mem: 0.852GB | Initial recovery: 2.281s<br>SK build: 0.699s<br>Mem: 0.852GB | Initial recovery: 2.666s<br>SK build: 0.224s<br>Mem: 1.209GB |
| Initial recovery: 21.776s<br>SK build: 13.644s<br>Mem: 8.121GB | Initial recovery: 21.694s<br>SK build: 7.806s<br>Mem: 8.120GB | Initial recovery: 25.619s<br>SK build: 3.301s<br>Mem: 11.312GB |
| Initial recovery: 3m19s<br>SK build: 3m49s<br>Mem: 80.897GB | Initial recovery: 3m36s<br>SK build: 1m57s<br>Mem: 80.897GB | Initial recovery: 4m10s<br>SK build: 46s<br>Mem: 101.628GB |
Perf details
Memory overhead:
SK build time:
Initial recovery overhead: