-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43057: [C++] Thread-safe AesEncryptor / AesDecryptor #44990
base: main
Are you sure you want to change the base?
Conversation
The test failures look like they might be caused by openssl/openssl#21955 The failing ASAN test run is using OpenSSL 3.02, but builds with other versions are passing. Eg. the Ubuntu 20.04 build uses OpenSSL 1.1.1f and the Conda build has OpenSSL 3.3.2. The fix for this was backported to the 3.0 branch and released in 3.0.13 (openssl/openssl#23102), but Ubuntu 22.04 still uses 3.0.2. If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time? |
Let's use the "slower" code path everywhere? I'm skeptical it will be that slow... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for stepping in and submitting this @EnricoMi . You'll find a number of comments below.
a2e3992
to
108c8cf
Compare
I have addressed all comments except for the test related ones. Will go over them tomorrow. Thanks for your input, that is highly appreciated! |
I tested concurrent scans of a larger dataset with uniform encryption, and this change doesn't completely fix that scenario: --- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -289,6 +289,8 @@ TEST_F(DatasetEncryptionTest, ReadSingleFile) {
// processing encrypted datasets over 2^15 rows in multi-threaded mode.
class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
public:
+ LargeRowEncryptionTest() : DatasetEncryptionTestBase(true) {}
+
// The dataset is partitioned using a Hive partitioning scheme.
void PrepareTableAndPartitioning() override {
// Specifically chosen to be greater than batch size for triggering prefetch.
@@ -307,7 +309,7 @@ class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
// Test for writing and reading encrypted dataset with large row count.
TEST_F(LargeRowEncryptionTest, ReadEncryptLargeRows) {
- ASSERT_NO_FATAL_FAILURE(TestScanDataset());
+ ASSERT_NO_FATAL_FAILURE(TestScanDataset(true));
}
} // namespace dataset This gives errors like:
I believe this is due to the decryptor AAD being mutable. It's updated for each data page read, so concurrent use will result in incorrect AADs being used in some threads. Rather than mutating the decryptor state, it might be possible to refactor this so that the AAD values are passed in to the decrypt method as a parameter. I think this should probably be addressed as a separate PR though as the changes should be orthogonal. |
Confirmed in ec9b054. Multi-threaded uniform encryption read fails. |
33ff64b
to
6508c5a
Compare
Any updates? |
4fc5b42
to
b652e06
Compare
b652e06
to
fbe3a1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EnricoMi Sorry for the delay and thanks a lot for persevering on this. I think the fix is still not 100% correct, see comments below. But we're nearing a solution.
Also, I can help on this PR if you want.
85c55ba
to
4d388fa
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Co-authored-by: Antoine Pitrou <[email protected]>
c4305bc
to
ed76e98
Compare
I pushed some more cleanups, in particular I switched to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think at some point a large cleanup and reorganization phase would be deserved on the Parquet encryption code:
|
Hmm, the Lint CI step is failing for unrelated reasons (see #45521). |
@github-actions crossbow submit -g cpp |
Revision: ed76e98 Submitted crossbow builds: ursacomputing/crossbow @ actions-5fc2004b28 |
@pitrou I took a moment to put your comment above into an enhancement issue here: |
Oh, thank you @raulcd ! |
Well, there's an interesting segfault on the Windows C++ build... Edit: it looks like a non-deterministic bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou quite some substantial refactoring, LGTM! Thanks for stepping in!
bcfd43a
to
71812ed
Compare
71812ed
to
4249ce4
Compare
After several retries, it seems the Windows segfaults can happen randomly in various places, even when writing an encrypted file. Example logs (two segfaults, including one in |
Given the symptoms, I'm not sure the Windows segfaults are directly caused by this PR's changes. I might try to reproduce on a Windows VM, though (unless someone beats me to it?). |
Rationale for this change
OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDencryptor, which is used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. Some thread may call
WipeOut
while other threads still use the instance.What changes are included in this PR?
WipeOut
methods and related datastructures entirely.CtrEncrypt
/CtrDecrypt
andGcmEncrypt
/GcmDecrypt
uses its ownEVP_CIPHER_CTX
instance, making this thread-safe.After fixing this
"AesDecryptor was wiped out"
issue, two other segmentation faults surfaced: GH-44988. This has also been addressed here as it can only be exposed after fixing the wipe-out issue.Fixes GH-43057.
Fixes GH-44852.
Fixes GH-44988.
Are these changes tested?
A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.
Are there any user-facing changes?
No.