Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Acero] Swiss table still has risks of overflow #45506

Closed
zanmato1984 opened this issue Feb 12, 2025 · 1 comment
Closed

[C++][Acero] Swiss table still has risks of overflow #45506

zanmato1984 opened this issue Feb 12, 2025 · 1 comment

Comments

@zanmato1984
Copy link
Contributor

Describe the enhancement requested

After fixing #44513 and #45334, I kept looking for possible overflow risks in our Swiss join implementation. And my finding follows.

Background

In the Swiss table, a "block" consists of 8 keys (rows). When the number of rows is large enough, a block occupies 40 bytes, aka. num_block_bytes: 4 bytes for each key and one 8 bytes header. Blocks are stored continuously in a buffer namely uint8_t * blocks_. So locating the address of block_id-th block requires indexing like:

blocks_ + num_block_bytes * block_id

Risks

The limit of number of rows in Swiss table is 2^32. So we can have 2^32 / 8 blocks at most, therefore the block_id is normally represented using uint32_t. The num_block_bytes is represented using regular int. If no explicit type promotion is conducted, num_block_bytes * block_id will perform 32-bit multiplication and overflow may happen (2^32 / 8 * 40 > 2^32).

In our code base, there are places where such calculations are done with promoting to 64-bit multiplication so overflow is avoided, to name a few:

const uint8_t* blockbase =
blocks_->data() + static_cast<uint64_t>(iblock) * num_block_bytes;

const uint64_t num_block_bytes = (8 + num_groupid_bits);
blockbase = blocks_->mutable_data() + num_block_bytes * (start_slot_id >> 3);

However requiring such explicit type promotion is error-prone.

What may cause real trouble is where such calculations are still in 32-bit, there is one:

block = *reinterpret_cast<const uint64_t*>(blocks_->mutable_data() +
num_block_bytes * iblock);

(I wish I could come up with a concrete test case that such overflow results in wrong data - it's possible. But it's non-trivial and wouldn't be practical to run in limited resources.)

Given that such code are either correct but error-prone, or possible for real overflow, and once issues happen I can't imagine how painful the debugging will be, we should refactor them in a more overflow-safe fashion.

Component(s)

C++

@zanmato1984 zanmato1984 self-assigned this Feb 12, 2025
pitrou pushed a commit that referenced this issue Feb 17, 2025
### Rationale for this change

See #45506.

### What changes are included in this PR?

1. Abstract current overflow-prone block data access into functions that do proper type promotion to avoid overflow. Also remove the old block base address accessor.
2. Unify the data types used for various concepts as they naturally are (i.e., w/o explicit promotion): `uint32_t` for `block_id`, `int` for `num_xxx_bits/bytes`, `uint32_t` for `group_id`, `int` for `local_slot_id` and `uint32_t` for `global_slot_id`.
3. Abstract several constants and utility functions for readability and maintainability.

### Are these changes tested?

Existing tests should suffice.

It is really hard (gosh I did try) to create a concrete test case that fails w/o this change and passes w/ this change.

### Are there any user-facing changes?

None.

* GitHub Issue: #45506

Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 20.0.0 milestone Feb 17, 2025
@pitrou
Copy link
Member

pitrou commented Feb 17, 2025

Issue resolved by pull request 45515
#45515

@pitrou pitrou closed this as completed Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants