Feat(Source-S3): Use dataframe processing in place of singleton record operations (polars) #44194

Draft · wants to merge 70 commits into master
Conversation

@aaronsteers (Collaborator) commented Aug 16, 2024

What

This PR replaces the inner record loop with a dataframe-based transformation of records in bulk.

We use the Polars library, which:

  1. Is written in Rust.
  2. Is very fast.
  3. Is automatically parallel.
  4. Supports lazy operations.

One helpful way to think about this is as a move from procedural to functional programming, specifically for the operations we perform once per record and for which processing speed matters most. Rather than managing a step-by-step process to operate on each record, we define the operations and let the engine push down, consolidate, and parallelize as much as possible.
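As a loose illustration of that shift (a sketch, not code from this PR; the file name, column names, and transformations are hypothetical), a per-record loop becomes a lazy Polars pipeline that only executes when collected:

import polars as pl

# Declare the transformations once; Polars plans, pushes down, and parallelizes
# the work when the frame is collected.
lazy_frame = (
    pl.scan_ndjson("records.jsonl")  # lazy scan; nothing is read yet
    .with_columns(
        pl.col("updated_at").str.to_datetime().alias("_cursor_value"),  # hypothetical columns
    )
    .filter(pl.col("_cursor_value").is_not_null())
)
df = lazy_frame.collect()  # execution happens here, in parallel where possible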

How

Perf Profile (Tentative)


With the latest updates, running locally on my Mac:

[image: performance profile screenshot]

About a 4x improvement versus the Docker image, also running locally on my Mac:

[image: performance profile screenshot]

Still TODO

  • I need to revert a bunch of unnecessary changes.
  • Consider if we should add support for other file formats. (Currently only JSONL.)
  • Could use more tests, and/or creation of "scenarios".
  • Consider breaking into smaller PRs.
  • Very large files are sometimes blocking/locking. Need to research this...

Related Docs


@octavia-squidington-iii added the CDK Connector Development Kit label Aug 16, 2024
aaronsteers (Collaborator, Author) commented:

@clnoll, @pnilan - If you have a sec, could you review this file's changes?

This works in my testing; the revised/refactored version attempts to handle more edge cases predictably. As discussed, prior to this PR we were hitting the condition where concurrency was defined (allowing concurrency in full-refresh mode) but the cursor was not concurrent (disabling concurrency in incremental mode). We could probably add a test to check for this, but for now I warn explicitly.

I also used the continue pattern to make the code slightly easier to read, with less branching: once a record is handled by an earlier case, we can disregard the remainder of the loop body (rough sketch below).
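A minimal sketch of that pattern, using hypothetical helper and attribute names rather than the actual code in this file:

for stream_config in configured_streams:  # hypothetical names throughout
    if not stream_config.concurrency_defined:
        streams.append(build_default_stream(stream_config))
        continue  # handled; skip the concurrent-path logic below

    if stream_config.sync_mode == "incremental" and not cursor_is_concurrent(stream_config):
        logger.warning("Concurrency is defined but the cursor is not concurrent; falling back.")
        streams.append(build_default_stream(stream_config))
        continue

    streams.append(build_concurrent_stream(stream_config))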

Let me know what you think! Thanks!

from airbyte_cdk.sources.file_based.types import StreamState

logger = logging.Logger("source-S3")


-class Cursor(DefaultFileBasedCursor):
+class Cursor(FileBasedConcurrentCursor):
aaronsteers (Collaborator, Author) commented:

@clnoll - I took a pass at refactoring with the concurrent base class. It looks like it is working smoothly now, but let me know if anything looks off! Thanks!

Contributor:
Looks right to me!

clnoll (Contributor) left a comment:

@aaronsteers the switch to the concurrent cursor looks fine to me.

I did a superficial pass at some of the other code and left some minor comments, but I know this is a draft so they're probably already on your mind.

@@ -71,6 +101,11 @@ class FileBasedStreamConfig(BaseModel):
         default=None,
         gt=0,
     )
+    bulk_mode: BulkMode = Field(
+        title="Bulk Processing Optimizations",
+        description="The bulk processing mode for this stream.",
Contributor:

Since this will be surfaced to users, it would be nice to give them more information about how to choose. If we dynamically decide whether to use bulk mode when a user selects AUTO, we should also consider telling them the criteria we use.
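For illustration only, the option being discussed could take a shape along these lines; the actual enum values in this PR may differ:

from enum import Enum

class BulkMode(str, Enum):
    AUTO = "auto"          # assumption: the connector decides based on format/feature support
    ENABLED = "enabled"    # assumption: always use dataframe-based bulk processing
    DISABLED = "disabled"  # assumption: fall back to per-record processing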

Contributor:

Also, if this is only available for JSONL to start, it should probably be in the JsonlFormat file.

@@ -0,0 +1,46 @@
"""Simple script to download secrets from GCS.
clnoll (Contributor):

Out of curiosity, did you try using ci_credentials?

aaronsteers (Collaborator, Author) commented Oct 10, 2024:

@clnoll - Yeah, I tried that first before creating the GSMSecretManager class in PyAirbyte. The ci_credentials library didn't install or run cleanly when I tried it, and it printed secrets to console output by default. That printing is part of the GitHub "hide secrets" feature, but when run outside of GitHub Actions it has the reverse of the intended effect.

After a few attempts at using ci_credentials, I decided it would be easier to just bring the code into PyAirbyte. Some of the code was originally vendored from ci_credentials, but by now it is fairly specialized for the PyAirbyte use cases.

The docs here show how PyAirbyte handles building the secret manager and getting secrets: https://airbytehq.github.io/PyAirbyte/airbyte/secrets.html
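For reference, fetching a secret through PyAirbyte looks roughly like the following; this is a sketch based on my reading of the linked docs, so treat the import path, parameters, and secret name as assumptions rather than the exact API:

import os

from airbyte.secrets import GoogleGSMSecretManager  # assumption: exported from airbyte.secrets

secret_mgr = GoogleGSMSecretManager(
    project="my-gcp-project",                            # assumption: your GSM project ID
    credentials_json=os.environ["GCP_GSM_CREDENTIALS"],  # assumption: service-account JSON in an env var
)
secret = secret_mgr.get_secret("SECRET_SOURCE-S3_CREDS")  # hypothetical secret name
config_text = str(secret)  # SecretString coerces to plain text; handle with care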

Contributor:

Interesting, that's good to know. I think @alafanechere would probably be interested in the issues with ci_credentials.

with stream_reader.open_file(
    file=file,
    mode=self.file_read_mode,
    encoding=None,
Contributor:

Might a user have a non-default encoding?

df: pl.DataFrame = pl.read_ndjson(
    source=batch,
    # schema=schema,  # TODO: Add detected schema
    infer_schema_length=10,
Contributor:

Should this be using this config value?
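A sketch of what that could look like; the config attribute name here is hypothetical:

# Take the schema-inference row count from the stream config instead of hard-coding 10.
rows_to_infer = getattr(config, "schema_inference_row_limit", None) or 10  # hypothetical attribute
df: pl.DataFrame = pl.read_ndjson(
    source=batch,
    infer_schema_length=rows_to_infer,
)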

@aaronsteers changed the title from "Feat(Source-S3): Use dataframe processing in place of singleton record operations" to "Feat(Source-S3): Use dataframe processing in place of singleton record operations (polars)" on Oct 22, 2024
Labels: area/connectors (Connector related issues) · CDK (Connector Development Kit) · connectors/source/s3