add concurrency option to control max concurrent writes to db #460

prydonius · 2024-10-14T22:06:24Z

When using Synth to export millions of rows of generated data to a database, we consistently hit `pool timed out while waiting for an open connection` errors after some time. The problem is most noticeable when batch writes are slow (in our testing, batches eventually took several seconds each to write, especially for certain tables with triggers). As a result, Synth crashes after a few minutes and writes only part of the data.

We found that the issue is related to how Synth concurrently writes batches of rows to the database: it chunks the rows by 1000 and then spins up a task for each chunk, and every task waits to acquire a db connection from the pool. If writes take too long, these tasks start hitting the acquire timeout (which by default is 30s in sqlx).
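To make the failure mode concrete, here is a back-of-the-envelope model. It is a deliberate simplification and not how Synth or sqlx actually schedule work: it assumes every batch task is spawned at once, the pool serves them in fixed "waves", and every write takes the same time. The function name and the numbers in the note below are hypothetical.

```rust
/// Simplified model (an assumption, not Synth's or sqlx's real scheduling):
/// all batch tasks start waiting at the same moment, `pool_size` connections
/// serve them in waves, and each write takes exactly `write_secs` (must be > 0).
/// Returns how many batches obtain a connection within `timeout_secs`.
fn batches_served_before_timeout(
    total_batches: u64,
    pool_size: u64,
    write_secs: u64,
    timeout_secs: u64,
) -> u64 {
    // A task in wave k (0-based) waits k * write_secs for a connection,
    // so waves 0..=(timeout_secs / write_secs) get one in time.
    let waves_in_time = timeout_secs / write_secs + 1;
    total_batches.min(waves_in_time * pool_size)
}
```

Under these assumptions, 1000 queued batches, a pool of 10 connections, 4s writes, and a 30s timeout give `batches_served_before_timeout(1000, 10, 4, 30) == 80`: every task behind the first 80 would time out while waiting.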

This PR introduces a new concurrency parameter that limits how many tasks we spin up at a time, so that tasks are not unnecessarily waiting for a connection from the pool. This allows Synth to take as long as it needs to export a large amount of data, and lets users configure the concurrency as needed. The pool size is also set to this parameter, since there will be at most that many connections to the database at any one time.
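The idea of capping in-flight writes can be sketched as follows. This is a minimal illustration only, not the code in this PR: it uses std threads instead of async tasks and a connection pool, and `write_batch` and `export_bounded` are hypothetical names standing in for the real database write and export path.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for writing one batch to the database.
fn write_batch(batch: &[u64]) -> usize {
    thread::sleep(Duration::from_millis(5)); // simulate a slow insert
    batch.len()
}

/// Writes `rows` in batches of `batch_size` (must be > 0) with at most
/// `concurrency` batches in flight at once.
/// Returns (rows written, peak number of simultaneous writes).
fn export_bounded(rows: &[u64], batch_size: usize, concurrency: usize) -> (usize, usize) {
    let batches = Mutex::new(rows.chunks(batch_size));
    let written = AtomicUsize::new(0);
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|s| {
        // Only `concurrency` workers exist, so no more than that many
        // writes can ever be waiting on or using a connection.
        for _ in 0..concurrency.max(1) {
            s.spawn(|| loop {
                // Take the next batch; the lock is released before writing.
                let next = batches.lock().unwrap().next();
                let Some(batch) = next else { break };

                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                written.fetch_add(write_batch(batch), Ordering::SeqCst);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
    });

    (written.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}
```

Calling `export_bounded(&rows, 1000, 4)` on 10,000 rows writes all of them while never running more than four writes at once. Because a worker only picks up a new batch after finishing the previous one, slow writes lengthen the export instead of leaving a queue of tasks to time out.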

I've manually tested this change in our environment and it no longer produces the timeouts we were seeing. We use MySQL, so I haven't tested this with other database providers. If there's more testing I should do or add to the codebase, please let me know!

I have never written a line of Rust before, so I'd appreciate any feedback on this change, including ways to make it more idiomatic.

prydonius · 2024-10-14T22:08:08Z

core/src/schema/content/datasource.rs

@@ -16,6 +16,7 @@ impl Compile for DatasourceContent {
        let params = DataSourceParams {
            uri: URI::try_from(self.path.as_str())?,
            schema: None,
+            concurrency: 1,


Not really sure what this is used for, or what the default concurrency should be here.

prydonius · 2024-10-14T22:08:43Z

synth/src/cli/mod.rs

@@ -123,6 +123,7 @@ impl<'w> Cli {
            uri: URI::try_from(cmd.from.as_str())
                .with_context(|| format!("Parsing import URI '{}'", cmd.from))?,
            schema: cmd.schema,
+            concurrency: 1,


I've hardcoded import jobs to concurrency 1 since I think we only use a single db connection here, but let me know if it should also be exposed in the import command.

prydonius added 2 commits October 13, 2024 14:05

add concurrency option to control max concurrent writes to db

f400e01

cleanup

fdb6904


prydonius force-pushed the scaling-mysql branch from 38fd233 to fdb6904 Compare October 22, 2024 17:51
