GH-43057: [C++] Thread-safe AesEncryptor / AesDecryptor #44990

Open: wants to merge 36 commits into base: main

EnricoMi
Contributor

@EnricoMi EnricoMi commented Dec 10, 2024

Rationale for this change

OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDecryptor, which are used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. Some threads may call WipeOut while other threads are still using the instance.

What changes are included in this PR?

  • Remove the WipeOut methods and related data structures entirely.
  • Each call into CtrEncrypt / CtrDecrypt and GcmEncrypt / GcmDecrypt uses its own EVP_CIPHER_CTX instance, making this thread-safe.
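
The per-call context idea in the second bullet can be sketched as follows. This is a minimal illustration only, not the PR's code: it does not call OpenSSL, ToyCipherCtx is a hypothetical stand-in for EVP_CIPHER_CTX, ToyEncryptor and EncryptConcurrently are made-up names, and the XOR "cipher" is a placeholder with no security value. The point is that the only state shared between threads is the immutable key, while each call owns a fresh context that is released automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for EVP_CIPHER_CTX: mutable per-operation state that
// must not be shared between threads.
struct ToyCipherCtx {
  std::size_t keystream_pos = 0;
};

// Hypothetical stand-in for a deleter that would call EVP_CIPHER_CTX_free.
struct ToyCtxDeleter {
  void operator()(ToyCipherCtx* ctx) const { delete ctx; }
};
using ToyCtxPtr = std::unique_ptr<ToyCipherCtx, ToyCtxDeleter>;

class ToyEncryptor {
 public:
  explicit ToyEncryptor(std::string key) : key_(std::move(key)) {
    assert(!key_.empty());
  }

  // const method: the only member touched is the immutable key, so the same
  // instance can be used from several threads at once.
  std::string Encrypt(const std::string& plaintext) const {
    ToyCtxPtr ctx(new ToyCipherCtx());  // fresh context owned by this call
    std::string out = plaintext;
    for (char& c : out) {
      // Placeholder XOR transform; NOT real encryption.
      c = static_cast<char>(c ^ key_[ctx->keystream_pos % key_.size()]);
      ++ctx->keystream_pos;
    }
    return out;  // ctx is released here by ToyCtxDeleter
  }

 private:
  const std::string key_;
};

// Calls Encrypt on one shared instance from several threads.
std::vector<std::string> EncryptConcurrently(const ToyEncryptor& enc,
                                             const std::string& plaintext,
                                             int num_threads) {
  std::vector<std::string> results(num_threads);
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) {
    // Each thread writes only to its own slot of `results`.
    threads.emplace_back([&, i] { results[i] = enc.Encrypt(plaintext); });
  }
  for (auto& t : threads) t.join();
  return results;
}
```

Because no context outlives a call, there is nothing left for a WipeOut-style method to invalidate while another thread is mid-operation.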

After fixing this "AesDecryptor was wiped out" issue, two other segmentation faults surfaced: GH-44988. These are also addressed here, as they can only be exposed once the wipe-out issue is fixed.

Fixes GH-43057.
Fixes GH-44852.
Fixes GH-44988.

Are these changes tested?

A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added awaiting committer review and removed awaiting review labels Dec 11, 2024
@adamreeve
Contributor

The test failures look like they might be caused by openssl/openssl#21955

The failing ASAN test run is using OpenSSL 3.0.2, but builds with other versions are passing. E.g., the Ubuntu 20.04 build uses OpenSSL 1.1.1f and the Conda build has OpenSSL 3.3.2.

The fix for this was backported to the 3.0 branch and released in 3.0.13 (openssl/openssl#23102), but Ubuntu 22.04 still uses 3.0.2. If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time?

@pitrou
Member

pitrou commented Dec 11, 2024

If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time?

Let's use the "slower" code path everywhere? I'm skeptical it will be that slow...

Member

@pitrou pitrou left a comment

Thanks a lot for stepping in and submitting this @EnricoMi . You'll find a number of comments below.

cpp/src/arrow/dataset/file_parquet.cc (resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/parquet/encryption/encryption_internal.h (outdated, resolved)
cpp/src/parquet/encryption/encryption_internal.cc (outdated, resolved)
cpp/src/parquet/encryption/encryption_internal.cc (outdated, resolved)
cpp/src/parquet/encryption/encryption_internal.cc (outdated, resolved)
@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from a2e3992 to 108c8cf December 11, 2024 19:49
@EnricoMi
Contributor Author

I have addressed all comments except for the test-related ones. I will go over them tomorrow.

Thanks for your input, that is highly appreciated!

@adamreeve
Contributor

I tested concurrent scans of a larger dataset with uniform encryption, and this change doesn't completely fix that scenario:

--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -289,6 +289,8 @@ TEST_F(DatasetEncryptionTest, ReadSingleFile) {
 // processing encrypted datasets over 2^15 rows in multi-threaded mode.
 class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
  public:
+  LargeRowEncryptionTest() : DatasetEncryptionTestBase(true) {}
+
   // The dataset is partitioned using a Hive partitioning scheme.
   void PrepareTableAndPartitioning() override {
     // Specifically chosen to be greater than batch size for triggering prefetch.
@@ -307,7 +309,7 @@ class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
 
 // Test for writing and reading encrypted dataset with large row count.
 TEST_F(LargeRowEncryptionTest, ReadEncryptLargeRows) {
-  ASSERT_NO_FATAL_FAILURE(TestScanDataset());
+  ASSERT_NO_FATAL_FAILURE(TestScanDataset(true));
 }
 
 }  // namespace dataset

This gives errors like:

failed with IOError: Failed decryption finalization

I believe this is due to the decryptor AAD being mutable. It's updated for each data page read, so concurrent use will result in incorrect AADs being used in some threads.

Rather than mutating the decryptor state, it might be possible to refactor this so that the AAD values are passed in to the decrypt method as a parameter.
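
A minimal sketch of that refactoring idea, under stated assumptions: ToyPageCipher, SealedPage and ToyTag are hypothetical names, the XOR transform and the FNV-1a-style checksum are toy placeholders for AES-GCM and its authentication tag, and none of this mirrors the actual Parquet decryptor interface. It only shows that when the AAD is an argument of a const Decrypt method, there is no per-page state on the object for concurrent readers to overwrite.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <utility>

// Toy keyed checksum standing in for a GCM authentication tag. NOT secure.
uint32_t ToyTag(const std::string& key, const std::string& aad,
                const std::string& data) {
  uint32_t h = 2166136261u;  // FNV-1a offset basis
  for (const std::string* part : {&key, &aad, &data}) {
    for (unsigned char c : *part) {
      h = (h ^ c) * 16777619u;  // FNV-1a step
    }
    h = (h ^ 0xFFu) * 16777619u;  // separator between the three parts
  }
  return h;
}

struct SealedPage {
  std::string ciphertext;
  uint32_t tag;
};

class ToyPageCipher {
 public:
  explicit ToyPageCipher(std::string key) : key_(std::move(key)) {
    assert(!key_.empty());
  }

  SealedPage Seal(const std::string& plaintext, const std::string& aad) const {
    std::string ciphertext = Xor(plaintext);
    uint32_t tag = ToyTag(key_, aad, ciphertext);
    return SealedPage{std::move(ciphertext), tag};
  }

  // The AAD is passed per call instead of being stored on the object, so
  // concurrent callers cannot clobber each other's AAD.
  // Returns false if the tag does not match the given AAD.
  bool Decrypt(const SealedPage& page, const std::string& aad,
               std::string* out) const {
    if (ToyTag(key_, aad, page.ciphertext) != page.tag) return false;
    *out = Xor(page.ciphertext);
    return true;
  }

 private:
  // Placeholder XOR transform; NOT real encryption.
  std::string Xor(const std::string& in) const {
    std::string out = in;
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] = static_cast<char>(out[i] ^ key_[i % key_.size()]);
    }
    return out;
  }

  const std::string key_;
};
```

In this toy, decrypting a page with another page's AAD fails the tag check, which is the analogue of the "Failed decryption finalization" error above.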

I think this should probably be addressed in a separate PR though, as the changes should be orthogonal.

@EnricoMi
Contributor Author

EnricoMi commented Dec 12, 2024

failed with IOError: Failed decryption finalization

Confirmed in ec9b054. Multi-threaded uniform encryption read fails.

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch 2 times, most recently from 33ff64b to 6508c5a December 12, 2024 09:49
@EnricoMi
Contributor Author

failed with IOError: Failed decryption finalization

Confirmed in ec9b054. Multi-threaded uniform encryption read fails.

Attempt to fix multi-threaded AAD update: d77a081

The PageReader uses its own copy of the meta_decryptor / data_decryptor. Happy to move this into a separate PR.
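
Roughly, the copy-per-reader approach looks like the sketch below. The names ToyStatefulDecryptor, ToyPageReader, UpdateAad and NextPageAad are hypothetical and the AAD format is invented for illustration; this is not the code in d77a081. Each reader takes the decryptor by value, so AAD updates only touch that reader's private copy.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical decryptor that keeps the current AAD as mutable state.
class ToyStatefulDecryptor {
 public:
  explicit ToyStatefulDecryptor(std::string key) : key_(std::move(key)) {
    assert(!key_.empty());
  }
  void UpdateAad(std::string aad) { aad_ = std::move(aad); }
  const std::string& aad() const { return aad_; }

 private:
  std::string key_;
  std::string aad_;
};

// Hypothetical page reader: takes the decryptor by value, so it owns a
// private copy that no other reader can observe or modify.
class ToyPageReader {
 public:
  ToyPageReader(ToyStatefulDecryptor decryptor, std::string column)
      : decryptor_(std::move(decryptor)), column_(std::move(column)) {}

  // Advances to the next page and sets the AAD on this reader's copy only.
  // The "<column>/page<N>" format is invented for this sketch.
  const std::string& NextPageAad() {
    decryptor_.UpdateAad(column_ + "/page" + std::to_string(page_ordinal_++));
    return decryptor_.aad();
  }

 private:
  ToyStatefulDecryptor decryptor_;
  std::string column_;
  int page_ordinal_ = 0;
};
```

The trade-off compared with passing the AAD as a parameter is one extra copy of the decryptor per reader, in exchange for leaving the decrypt call signature unchanged.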

cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
cpp/src/parquet/column_reader.cc (outdated, resolved)
@mapleFU
Member

mapleFU commented Jan 21, 2025

Any updates?

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from 4fc5b42 to b652e06 January 21, 2025 08:55
@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from b652e06 to fbe3a1c January 29, 2025 18:25
Member

@pitrou pitrou left a comment

@EnricoMi Sorry for the delay and thanks a lot for persevering on this. I think the fix is still not 100% correct, see comments below. But we're nearing a solution.

Also, I can help on this PR if you want.

cpp/src/parquet/encryption/encryption_internal.cc (outdated, resolved)
cpp/src/parquet/column_reader.cc (outdated, resolved)
cpp/src/arrow/dataset/file_parquet_encryption_test.cc (outdated, resolved)
@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from 85c55ba to 4d388fa February 6, 2025 17:17
@pitrou

This comment was marked as outdated.

@pitrou pitrou force-pushed the thread-safe-endecryptor branch from c4305bc to ed76e98 February 13, 2025 09:56
@pitrou
Member

pitrou commented Feb 13, 2025

I pushed some more cleanups; in particular, I switched to std::unique_ptr for the non-shareable objects Decryptor and AesDecryptor.
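
As a rough illustration of that ownership choice (a sketch only: ToyDecryptor and ToyDecryptorFactory are hypothetical names and do not reflect the actual class layout in the PR), a non-copyable type handed out through std::unique_ptr makes "one owner per instance" something the compiler enforces:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical non-shareable decryptor: it can only be created through Make()
// and cannot be copied, so each instance has exactly one owner.
class ToyDecryptor {
 public:
  static std::unique_ptr<ToyDecryptor> Make(std::string key) {
    assert(!key.empty());
    return std::unique_ptr<ToyDecryptor>(new ToyDecryptor(std::move(key)));
  }

  ToyDecryptor(const ToyDecryptor&) = delete;
  ToyDecryptor& operator=(const ToyDecryptor&) = delete;

  const std::string& key() const { return key_; }

 private:
  explicit ToyDecryptor(std::string key) : key_(std::move(key)) {}
  std::string key_;
};

// Copying is rejected at compile time.
static_assert(!std::is_copy_constructible<ToyDecryptor>::value,
              "ToyDecryptor must not be copyable");

// Hypothetical factory: every caller receives its own fresh instance instead
// of a std::shared_ptr to a common one.
class ToyDecryptorFactory {
 public:
  explicit ToyDecryptorFactory(std::string key) : key_(std::move(key)) {}

  std::unique_ptr<ToyDecryptor> GetDecryptor() const {
    return ToyDecryptor::Make(key_);
  }

 private:
  std::string key_;
};
```

With std::shared_ptr, nothing in the type stops two readers from holding and mutating the same instance; with std::unique_ptr, sharing requires an explicit move that leaves the previous owner empty.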

Member

@wgtmac wgtmac left a comment

+1

@pitrou
Member

pitrou commented Feb 13, 2025

I think at some point a large cleanup and reorganization phase is warranted for the Parquet encryption code:

  • header and source file naming
  • object and method naming
  • reorganizing responsibilities across objects
  • wacky implementation details (e.g. why exactly do we need the mutable CryptoContext::start_decrypt_with_dictionary_page?)

@pitrou
Member

pitrou commented Feb 13, 2025

Hmm, the Lint CI step is failing for unrelated reasons (see #45521).

@pitrou
Member

pitrou commented Feb 13, 2025

@github-actions crossbow submit -g cpp

Revision: ed76e98

Submitted crossbow builds: ursacomputing/crossbow @ actions-5fc2004b28

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@raulcd
Member

raulcd commented Feb 13, 2025

@pitrou I took a moment to put your comment above into an enhancement issue here:

@pitrou
Member

pitrou commented Feb 13, 2025

Oh, thank you @raulcd !

@pitrou
Member

pitrou commented Feb 13, 2025

Well, there's an interesting segfault on the Windows C++ build...

Edit: it looks like a non-deterministic bug, but it might be due to the WipeoutDecryptionKeys call in the InternalFileDecryptor destructor. Will try to investigate more. Actually, it also happened without the WipeoutDecryptionKeys call when retrying CI jobs on another commit. It might not be related to this PR at all, as it is happening in the single-threaded Parquet encryption tests, not in the multi-threaded Dataset tests.

Contributor Author

@EnricoMi EnricoMi left a comment

@pitrou quite a substantial refactoring, LGTM! Thanks for stepping in!

@pitrou pitrou force-pushed the thread-safe-endecryptor branch 2 times, most recently from bcfd43a to 71812ed February 13, 2025 18:00
@pitrou pitrou force-pushed the thread-safe-endecryptor branch from 71812ed to 4249ce4 February 13, 2025 18:21
@pitrou
Member

pitrou commented Feb 13, 2025

After several retries, it seems the Windows segfaults can happen randomly in various places, even when writing an encrypted file. Example logs (two segfaults, including one in TestEncryptionKeyManagement.WrapLocally):
https://github.com/pitrou/arrow/actions/runs/13314446458/job/37184921711#step:10:266

@pitrou
Member

pitrou commented Feb 13, 2025

Given the symptoms, I'm not sure the Windows segfaults are directly caused by this PR's changes. I might try to reproduce on a Windows VM, though (unless someone beats me to it?).
