-
It might be better to read the secondary key build data files in a separate thread (or threads) and wait for them before starting to build the secondary keys (if we use the approach with separate index recovery data files).
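For illustration, a minimal sketch of that flow, assuming the per-checkpoint sort data is a plain file (the file name and the buffer handling here are hypothetical):

```c
/* Hypothetical sketch: load a sort-data file in a background thread
 * while the primary key is being recovered, then join before starting
 * the secondary key build. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct prefetch_task {
	const char *path;	/* e.g. "00000000000000000004.sortdata" */
	char *data;		/* loaded file contents */
	long size;
};

static void *
prefetch_file(void *arg)
{
	struct prefetch_task *task = arg;
	FILE *f = fopen(task->path, "rb");
	if (f == NULL)
		return NULL;
	fseek(f, 0, SEEK_END);
	task->size = ftell(f);
	fseek(f, 0, SEEK_SET);
	task->data = malloc(task->size);
	if (task->data != NULL)
		fread(task->data, 1, task->size, f);
	fclose(f);
	return NULL;
}

int
main(void)
{
	struct prefetch_task task = {"00000000000000000004.sortdata", NULL, 0};
	pthread_t tid;
	pthread_create(&tid, NULL, prefetch_file, &task);

	/* ... recover the primary keys from the snapshot here ... */

	pthread_join(tid, NULL);	/* wait for the sort data */
	/* ... build the secondary keys using task.data ... */
	free(task.data);
	return 0;
}
```

In Tarantool itself this would more likely go through the existing thread/coio machinery rather than a raw pthread, but the ordering (prefetch, recover PK, join, build SKs) is the point.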
-
It'd be good to do a small experiment proving the viability of the new approach:
(AFAIU using a hash table isn't really cache-friendly; I'm curious what the performance impact of random accesses would be)
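For what it's worth, the random-access cost can be estimated with a toy benchmark like the one below (not Tarantool code: it just compares sequential reads with reads through a shuffled index array):

```c
/* Toy experiment: sum 8-byte values accessed sequentially vs. through a
 * shuffled order array (emulating random tuple accesses). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)	/* 16M entries, ~128MB of data */

int
main(void)
{
	uint64_t *data = malloc(N * sizeof(*data));
	uint32_t *order = malloc(N * sizeof(*order));
	for (uint32_t i = 0; i < N; i++) {
		data[i] = i;
		order[i] = i;
	}
	/* Crude Fisher-Yates shuffle of the access order. */
	for (uint32_t i = N - 1; i > 0; i--) {
		uint32_t j = rand() % (i + 1);
		uint32_t tmp = order[i];
		order[i] = order[j];
		order[j] = tmp;
	}
	uint64_t sum = 0;
	clock_t t0 = clock();
	for (uint32_t i = 0; i < N; i++)
		sum += data[i];			/* sequential accesses */
	clock_t t1 = clock();
	for (uint32_t i = 0; i < N; i++)
		sum += data[order[i]];		/* random accesses */
	clock_t t2 = clock();
	printf("seq: %.3fs, random: %.3fs (sum=%llu)\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC,
	       (unsigned long long)sum);
	free(data);
	free(order);
	return 0;
}
```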
-
Could you please provide more information about how and when you will use these? To be more specific, let's say you have two spaces:

By the way, does building secondary indexes only after WAL recovery really provide any sensible performance gains? It would be simpler to always build secondary indexes after snapshot recovery.
-
I would elaborate a bit. The idea is to have a "fake" space (let's call it …). Such a space shouldn't actually store anything in memory, only process the tuples in a system … So, basically, the flow is:
-
As I understand, there are variants that can be implemented without any options. For this we need to ensure that the secondary key order data doesn't break old Tarantool versions in any way (including wasted memory). I think we can consider the following variants:

In my opinion, the blackhole engine/NOP operations are tricks, while holding the extra data in separate files looks like a direct solution, so it seems better to me. OTOH, we should verify how the snapshotting time changes in case of several secondary indexes. If it becomes much longer, maybe an option to refuse to write it is useful (but I'm not sure). It is also important to take into account that a modern SSD is able to serve several parallel write requests at a speed close to that of a single write. I think that writing several files in parallel may be very profitable here.

Technically speaking, shouldn't we check the presence of such triggers while reading the data, not while writing it? I guess that if we have the trigger while writing a snapshot, we likely have it after restart too, but there is no such guarantee.
-
Will this feature work for functional and multikey indexes? Please add this information to the RFC.
-
Hi there! Let's consider the following alternative: the snapshot process stores tuple addresses in additional fields of the INSERT commands:

$ tarantoolctl cat 00000000000000000004.snap
---
HEADER:
lsn: 393
type: INSERT
timestamp: 1738067327.6443
BODY:
space_id: 512
tuple: [1, 2, 3]
---
HEADER:
lsn: 394
type: INSERT
timestamp: 1738067327.6443
BODY:
space_id: 512
tuple: [2, 3, 4]
We can use an additional field in the HEADER or BODY of each record. The approach must be backward compatible.
-
I vote for this variant. But the space must have a format that is compatible with the oldest Tarantool version (for the downgrade process). So, I think its format can be:
There are two ways here:
Also, it would be nice to have an instruction …
-
Will this feature work with tuple compression?
-
So, actually, without hints … UPD: I've read this section again and understood what you meant. Let's just write below that we should multiply the memory consumption by …
-
> Let's add a tuple pointer to the struct.

Looks OK to me.

> I think memcs should store its data in different files.

Looks OK to me.
-
I think it is enough to have …

The other variants (per space, etc.) are overkill. PS: the second option is optional: a user could remove …
-
Please don't forget to handle corner cases when hints are enabled/disabled by a trigger on the _index space prior to recovery. You might end up with an index which has hints stored on disk while the hints are not requested by the index definition, and vice versa. Same (although rather unlikely) case: when an index is dropped completely by a before_replace trigger on the _index space. This might be needed in some disaster recovery scenarios. What I'm trying to say is that we must be ready for the data saved on disk not being needed, and we must be ready for the expected data not being present on disk. In both cases some warning should be logged.
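For example, the fallback could look roughly like this (a sketch only: the helper and the warning wording are made up, `say_warn()` is the usual logging macro):

```c
/* Hypothetical check before using on-disk sort data for an index;
 * any mismatch leads to a warning and the regular build path. */
#include <stdbool.h>
#include "say.h"

static bool
sort_data_usable(const char *index_name, bool index_uses_hints,
		 bool sort_data_present, bool sort_data_has_hints)
{
	if (!sort_data_present) {
		say_warn("sort data for index '%s' is not present on disk, "
			 "falling back to the regular SK build", index_name);
		return false;
	}
	if (index_uses_hints != sort_data_has_hints) {
		say_warn("sort data for index '%s' does not match the index "
			 "definition, falling back to the regular SK build",
			 index_name);
		return false;
	}
	return true;
}
```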
Why did you choose this name for the file?
Let's please add …
-
Reviewers
ToC
The problem
The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has `n*log(n)` complexity). Let's save the order of secondary keys in the snapshot to reduce the build complexity to `O(n)` (this, in theory, should speed up recovery of secondary keys).

The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is: save the tuple pointers as they were at checkpoint time (the PK pointers in PK order and, for each SK, the pointers in that SK's order), then on recovery map the old addresses to the new ones.

For example, suppose the saved PK tuple addresses were `[0x01, 0x02, 0x03]` and after PK recovery the same tuples got the addresses `[0xa3, 0xb2, 0xc1]`. Then we know that the first tuple in the primary index had address `0x01` and its new address is `0xa3`, the second one had `0x02` and its new address is `0xb2`, and so on. So we can easily build the old-to-new mapping. If the saved SK sort data is `[0xffff00001111, 0xffff00003333, 0xffff00002222]`, after applying the mapping it becomes `[0xffff0000aaaa, 0xffff0000cccc, 0xffff0000bbbb]`, an array of actual tuples in the required order.
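A toy sketch of the remapping step (plain C; a real implementation would likely use a hash table keyed by the old address rather than sort-plus-binary-search):

```c
/* Toy remapping of the saved SK order: each element is a tuple address
 * as it was at checkpoint time and becomes the address of the same
 * tuple after PK recovery. */
#include <stdint.h>
#include <stdlib.h>

struct addr_map_entry {
	uintptr_t old_addr;	/* tuple address at checkpoint time */
	uintptr_t new_addr;	/* tuple address after PK recovery */
};

static int
addr_map_cmp(const void *a, const void *b)
{
	const struct addr_map_entry *ea = a, *eb = b;
	if (ea->old_addr < eb->old_addr)
		return -1;
	return ea->old_addr > eb->old_addr;
}

static void
remap_sk_order(struct addr_map_entry *map, size_t pk_count,
	       uintptr_t *sk_order, size_t sk_count)
{
	qsort(map, pk_count, sizeof(*map), addr_map_cmp);
	for (size_t i = 0; i < sk_count; i++) {
		struct addr_map_entry key = {sk_order[i], 0};
		struct addr_map_entry *e =
			bsearch(&key, map, pk_count, sizeof(*map),
				addr_map_cmp);
		/* Now the entry points to the recovered tuple. */
		sk_order[i] = e->new_addr;
	}
}
```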
Extra: hints

If the index we're loading the new way is a TREE with hints enabled, then it's not reasonable to use this approach unless we save the tuple hints along with the tuple pointers in the index sort data files. The reason is the CPU cache effects described below.
Details on the cache effects.
Currently we perform the following steps on recovery: fill the SK build array with the tuples in PK order (computing the hints along the way, with mostly sequential memory accesses), then use `tt_sort` to reorder the array the way it should be in the SK.

But if we use the new approach, we insert the index data into the build array right in SK order. That means, since the tuples are located in memory in PK order, calculating the hint of each tuple inserted into the SK requires a lot of random memory accesses. This destroys the gain we've received from the O(n) build array generation: in our PoC we had 10 seconds wasted in a single data load instruction.

So the approach is useless for indexes with hints unless we save the hints in the same sort data file, so we don't have to calculate them at runtime.
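A schematic illustration of the two fill loops with toy types (names and layout are illustrative, not the actual memtx code):

```c
/* "Tuples" live in an array laid out in PK order; the hint is derived
 * from the tuple contents, emulating the real hint calculation. */
#include <stddef.h>
#include <stdint.h>

struct toy_tuple {
	uint64_t key;		/* the SK field the hint is derived from */
	char payload[48];
};

struct sk_node {
	struct toy_tuple *tuple;
	uint64_t hint;
};

/* Current flow: scan tuples sequentially in PK order, compute hints
 * (prefetch-friendly reads), then sort the array into SK order. */
static void
fill_in_pk_order(struct toy_tuple *pk, size_t n, struct sk_node *out)
{
	for (size_t i = 0; i < n; i++) {
		out[i].tuple = &pk[i];
		out[i].hint = pk[i].key;	/* sequential reads */
	}
	/* ... tt_sort-like reordering of 'out' happens here ... */
}

/* New flow without stored hints: the array is filled directly in SK
 * order, but every hint computation dereferences a tuple located at a
 * "random" address, which is the cache-unfriendly part. */
static void
fill_in_sk_order(struct toy_tuple **sk_order, size_t n, struct sk_node *out)
{
	for (size_t i = 0; i < n; i++) {
		out[i].tuple = sk_order[i];
		out[i].hint = sk_order[i]->key;	/* random reads */
	}
}
```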
Extra: multikey indexes
The multikey indexes have the same problem, since we have to access the tuple data in order to decide whether we should include a particular copy of the tuple into the index (due to the `exclude_null` option); but also, the hints are required to specify the tuple order in the index (the tuple pointer by itself does not tell which multikey member of the tuple is meant to be located here). So the multikey hints are to be saved in the sort data files too.

Extra: functional keys
Here we have the same problem as for multikey and regular indexes with hints, but this one does not seem to have a solution, so let's totally disable the feature for functional indexes.
Summary: the data to be stored
Implementation details
Here's the proposed sequence of events:
The sort data must only be used on non-system spaces, if they have no `before_replace` triggers and no `force_recovery` is specified (system spaces are less likely to benefit from it). Also, it can only be safely used during recovery if the `_index` space has no `before_replace` triggers registered, because otherwise such a trigger could change the index we saved the sort data for, so that the data is no longer applicable to it.
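A sketch of how such an applicability check might look; the helper and field names below (`space_is_system()`, the `before_replace` trigger lists) are assumptions about the surrounding code, not a reference to the actual implementation:

```c
/* Sketch: decide whether the saved sort data may be used for a space.
 * Presumes box/space.h-style definitions; names are illustrative. */
#include <stdbool.h>
#include "box/space.h"
#include "small/rlist.h"

static bool
memtx_sort_data_applicable(struct space *space, struct space *index_space,
			   bool force_recovery)
{
	if (force_recovery)
		return false;
	if (space_is_system(space))
		return false;
	/* A trigger may rewrite tuples, invalidating the saved order. */
	if (!rlist_empty(&space->before_replace))
		return false;
	/* A trigger on _index may redefine the index itself. */
	if (!rlist_empty(&index_space->before_replace))
		return false;
	return true;
}
```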
The storage

Let's store the sort data (a binary sequence of 8-byte pointers[ and hints]) in a `<vclock_signature>.sortdata` file. The file is created along with the regular snapshot file in `memtx_engine_begin_checkpoint`, but only for TREE indexes.

The structure, more thoroughly: each index gets a header entry like

`512/1: 0x0000000000000041, 0x0000000000000526, 00000000000000001536, gz, 0x0000000000003000\n`

where extra fields can be added to specify e.g. the compression algorithm and the original size.
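For illustration, such a header entry could be parsed along these lines; treating the first two numbers as the offset and size of the per-index data is an assumption made for this example:

```c
/* Hypothetical parser for one header entry of the .sortdata file, e.g.
 * "512/1: 0x0000000000000041, 0x0000000000000526, ...\n"; any extra
 * fields (compression, original size, ...) are ignored here. */
#include <inttypes.h>
#include <stdio.h>

struct sortdata_entry {
	uint32_t space_id;
	uint32_t index_id;
	uint64_t offset;	/* assumed: where the index data starts */
	uint64_t size;		/* assumed: how many bytes it occupies */
};

static int
sortdata_parse_entry(const char *line, struct sortdata_entry *e)
{
	int rc = sscanf(line, "%" SCNu32 "/%" SCNu32 ": %" SCNx64 ", %" SCNx64,
			&e->space_id, &e->index_id, &e->offset, &e->size);
	return rc == 4 ? 0 : -1;
}
```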
Alternatives considered.

Save the information in the snapshot file metadata: this way we're limited by the max metadata header size, which is pretty small (~2KB as of ef3775a).
Save the information as one of the entries in the snapshot file: we can't create a new fixheader type since it's strictly checked, and a new request type can't be used either, because memtx only accepts inserts (along with RAFT stuff) there.
A system blackhole space.
The idea is to insert the information into a system blackhole space at the end of the snapshot. If the snapshot does not contain the space, then the recovery is performed the old way (compatibility with old snapshots). Don't write the space on `box.snapshot()` after a downgrade (for backward compatibility). The blackhole space tuple format:

- `MP_BOOL`: `true` if the tuple is the last one for the given index.
- `MP_UINT`
- `MP_UINT`
- `MP_STRING`
The SK sort data of a particular index consists of a number of such tuples; this is required to reduce the tuple arena usage. The last of the tuples has the `is_last` flag set to `true` to mark the point where the index has all the information required and can be built using the new approach.

Since the space with the sort data is filled at the end of the snapshot, the old tuple addresses must be saved in some other way, so that we can create the old-to-new address map during PK recovery. Let's save the PK tuple pointers in the snapshot right inside the headers of `INSERT` entries (right next to the timestamp).
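For illustration, one such chunk tuple could be encoded with msgpuck as follows, assuming the two `MP_UINT` fields identify the space and the index and the `MP_STRING` field carries a chunk of packed 8-byte pointers (the field meaning is an assumption of this sketch):

```c
/* Hypothetical encoder of one sort-data chunk tuple for the blackhole
 * space; the caller is expected to have sized the buffer beforehand
 * (mp_sizeof_array() and friends). */
#include <stdbool.h>
#include <stdint.h>
#include "msgpuck.h"

static char *
encode_sort_data_tuple(char *pos, bool is_last, uint32_t space_id,
		       uint32_t index_id, const char *chunk,
		       uint32_t chunk_len)
{
	pos = mp_encode_array(pos, 4);
	pos = mp_encode_bool(pos, is_last);
	pos = mp_encode_uint(pos, space_id);
	pos = mp_encode_uint(pos, index_id);
	pos = mp_encode_str(pos, chunk, chunk_len);
	return pos;
}
```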
Alternative considered.

We could create another space for such data and write it before the user spaces, but that would require extra RAM to keep the PK tuple addresses around until we get to build the space the information is intended for. So we'd better store the information along with the tuples we load and forget it right after use.
Comparison to option 0:
➖ The sort data is built into the snapshot, so the latter can't be moved (backed up) separately.
➖ The sort data is first instantiated into a tuple and then provided to indexes (extra indirection).
➖ No nice way to send the data to replicas unless we generate it from scratch or hack the xlog reader.
➖ Additional ~1 byte per PK tuple of persistent storage.
Getting tuple pointers from the read view
It would be nice (although probably not necessary, this needs to be measured) to be able to receive both the tuple data and a pointer to the raw tuple from the index in one step. So let's return the tuple pointer along with the data in `struct read_view_tuple` (add a new `struct tuple *ptr` field).
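A sketch of the proposed extension; the existing field set shown here is abbreviated and may not match the actual struct exactly:

```c
#include <stdint.h>

struct tuple;

/* Read view tuple extended with a pointer to the in-memory tuple, so
 * the checkpoint writer can dump the pointer into the sort data without
 * an extra lookup. */
struct read_view_tuple {
	/* Raw tuple data, as returned today. */
	const char *data;
	uint32_t size;
	/* Proposed: the in-memory tuple the data belongs to. */
	struct tuple *ptr;
};
```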
Alternative considered.

Introduce a new `struct tuple **ptr` output parameter to the `index_read_view_iterator_base::next_raw` callback.

The configuration
The approach only makes sense if the database has many secondary keys of TREE type and fast persistent storage, because it increases the initial recovery time and relies on fast direct sort data reads. So a new boolean `memtx_sort_data_enabled` (or `memtx.sort_data_enabled`) configuration option is proposed to specify whether to use the new approach when building secondary keys and writing the snapshot.

The configuration variable is to be changeable at runtime. If one does not want to use the sort data on recovery but wants to write it during snapshotting, it can be set to `false` initially and then reconfigured to `true` prior to `box.snapshot()`. It works the opposite way too.

Expected results
Checkpointing
➖ Additional data to be written to the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK (a ballpark example follows the Recovery list below).
➖ Additional RAM required if we change the secondary keys during checkpointing (now we only create a read view of the PK, but we would have to read-view the SKs too).
Recovery
➖ Additional data to be read from the persistent storage: 8 bytes per tuple + 8 or 16 bytes per tuple in SK.
➖ The old-to-new tuple address map is to be created and filled (more RAM and CPU time required).
➕ The build of secondary indexes is significantly sped up.
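As a ballpark illustration of these numbers, for an assumed dataset of 100M tuples with one TREE SK with hints:

```c
/* Back-of-the-envelope sort data size for an assumed dataset:
 * 100M tuples, one TREE SK with hints (8-byte pointer + 8-byte hint). */
#include <stdio.h>

int
main(void)
{
	const double tuples = 100e6;
	double pk_bytes = tuples * 8;		/* PK tuple pointers */
	double sk_bytes = tuples * (8 + 8);	/* SK pointers + hints */
	printf("sort data: %.1f GB\n", (pk_bytes + sk_bytes) / 1e9);
	/* Prints: sort data: 2.4 GB */
	return 0;
}
```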
Practical results (storage option 0, but they're simple binary files)
Configuration
CPU: Zen 5, 8 dedicated cores
Storage: NVMe (7000 MB/s read, 5000 MB/s write)
Schema: PK over field 1, SK over field 2; tuples with 2 random unsigned fields.
Checkpointing

- 0.119GB / 0.119GB
- 0.771GB / 0.771GB
- 7.305GB / 7.305GB
- 72.623GB / 72.623GB
Recovery

| tt_sort, 1 thread | tt_sort, 2 threads | sort data |
|---|---|---|
| Initial recovery: 0.302s<br>SK build: 0.098s<br>Mem: 0.131GB | Initial recovery: 0.314s<br>SK build: 0.061s<br>Mem: 0.131GB | Initial recovery: 0.357s<br>SK build: 0.043s<br>Mem: 0.139GB |
| Initial recovery: 2.266s<br>SK build: 1.104s<br>Mem: 0.852GB | Initial recovery: 2.281s<br>SK build: 0.699s<br>Mem: 0.852GB | Initial recovery: 2.666s<br>SK build: 0.224s<br>Mem: 1.209GB |
| Initial recovery: 21.776s<br>SK build: 13.644s<br>Mem: 8.121GB | Initial recovery: 21.694s<br>SK build: 7.806s<br>Mem: 8.120GB | Initial recovery: 25.619s<br>SK build: 3.301s<br>Mem: 11.312GB |
| Initial recovery: 3m19s<br>SK build: 3m49s<br>Mem: 80.897GB | Initial recovery: 3m36s<br>SK build: 1m57s<br>Mem: 80.897GB | Initial recovery: 4m10s<br>SK build: 46s<br>Mem: 101.628GB |
Perf details
Memory overhead:
SK build time:
Initial recovery overhead: