Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve RoundRobin RepartitionExec #6043

Open
Dandandan opened this issue Apr 18, 2023 · 3 comments
Open

Improve RoundRobin RepartitionExec #6043

Dandandan opened this issue Apr 18, 2023 · 3 comments
Labels
bug Something isn't working performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Describe the bug

RoundRobin repartitioning currently does not distribute the input tasks evenly over the output channels, causing the work to be not distributed evenly.

To Reproduce

When loading the data in memory in the TPC-H benchmark, this can be seen in the number of batches in MemoryExec (which uses RoundRobin partitioning).

MemoryExec: partitions=32, partition_sizes=[32, 32, 32, 32, 32, 32, 32, 32, 26, 26, 26, 25, 25, 25, 25, 25, 25, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16], metrics=[]

It has a bias for the first output partitions/channels.

Expected behavior

Batches should be distributed more evenly over output channels.

Additional context

No response

@Dandandan Dandandan added bug Something isn't working performance Make DataFusion faster labels Apr 18, 2023
@cristian-ilies-vasile
Copy link

Batches should be distributed more evenly over output channels.

Seems to be a load balancing issue. If you could count the number of batches already distributed to each channel and not completed then the classical The Power of Two Choices in Randomized Load Balancing algorithm could be evaluated.

@Dandandan
Copy link
Contributor Author

@cristian-ilies-vasile yes, instead of round-robin repartitioning an improved scheme could be implemented based on number of buffered batches.

@cristian-ilies-vasile
Copy link

One good article describing this technique can be read here:
Deterministic Aperture: A distributed, load balancing algorithm
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/daperture-load-balancer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance Make DataFusion faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants