### Rationale for this change
See #45506.
### What changes are included in this PR?
1. Abstract current overflow-prone block data access into functions that do proper type promotion to avoid overflow. Also remove the old block base address accessor.
2. Unify the data types used for various concepts as they naturally are (i.e., w/o explicit promotion): `uint32_t` for `block_id`, `int` for `num_xxx_bits/bytes`, `uint32_t` for `group_id`, `int` for `local_slot_id` and `uint32_t` for `global_slot_id`.
3. Abstract several constants and utility functions for readability and maintainability.
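The abstraction in point 1 can be sketched roughly as follows. The names (`BlockOffset`, `BlockData`) and layout values are illustrative only, not the actual Arrow functions; the point is that the promotion to 64 bits happens in exactly one place:

```cpp
#include <cstdint>

// Hypothetical helper in the spirit of this PR: compute the byte offset of
// the block_id-th block with explicit 64-bit promotion, so callers can no
// longer accidentally write a 32-bit multiplication.
inline uint64_t BlockOffset(int num_block_bytes, uint32_t block_id) {
  // Promote before multiplying: a plain `num_block_bytes * block_id`
  // would be a 32-bit multiplication and could wrap around.
  return static_cast<uint64_t>(num_block_bytes) * block_id;
}

// Hypothetical accessor built on top of the offset helper.
inline uint8_t* BlockData(uint8_t* blocks, int num_block_bytes,
                          uint32_t block_id) {
  return blocks + BlockOffset(num_block_bytes, block_id);
}
```

Centralizing the cast this way means each call site stays overflow-safe by construction rather than by convention.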
### Are these changes tested?
Existing tests should suffice.
It is really hard (gosh I did try) to create a concrete test case that fails w/o this change and passes w/ this change.
### Are there any user-facing changes?
None.
* GitHub Issue: #45506
Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Describe the enhancement requested

After fixing #44513 and #45334, I kept looking for possible overflow risks in our Swiss join implementation. My findings follow.
### Background

In the Swiss table, a "block" consists of 8 keys (rows). When the number of rows is large enough, a block occupies 40 bytes, a.k.a. `num_block_bytes`: 4 bytes for each key plus one 8-byte header. Blocks are stored contiguously in a buffer named `uint8_t* blocks_`, so locating the address of the `block_id`-th block requires indexing like:

`blocks_ + num_block_bytes * block_id`

### Risks
The number of rows in a Swiss table is limited to `2^32`, so there can be at most `2^32 / 8` blocks; the `block_id` is therefore normally represented using `uint32_t`. `num_block_bytes` is represented using a regular `int`. If no explicit type promotion is conducted, `num_block_bytes * block_id` performs a 32-bit multiplication, and overflow may happen (`2^32 / 8 * 40 > 2^32`).

In our code base, there are places where such calculations already promote to a 64-bit multiplication so that overflow is avoided, to name a few:
- `arrow/cpp/src/arrow/compute/key_map_internal.cc`, lines 262 to 263 (at `e79d60d`)
- `arrow/cpp/src/arrow/compute/key_map_internal.cc`, lines 408 to 409 (at `e79d60d`)
However, requiring such explicit type promotion is error-prone. What may cause real trouble is where such a calculation is still done in 32 bits; there is one:
- `arrow/cpp/src/arrow/compute/key_map_internal.cc`, lines 226 to 227 (at `e79d60d`)
(I wish I could come up with a concrete test case in which such overflow results in wrong data - it is possible, but it is non-trivial and would not be practical to run with limited resources.)
Given that such code is either correct but error-prone, or at real risk of overflow, and that debugging would be extremely painful once an issue surfaces, we should refactor it in a more overflow-safe fashion.
### Component(s)
C++