Replies: 1 comment
-
A couple of trial and test runs with our processors, the …
-
As the typical task graph grows with the number of input chunks, the task graph can grow unexpectedly quickly for complicated processors that need to work with small input chunk sizes (for example, if you are working on a physics analysis that requires PFCandidate-like collections).
The following code can take the output of `coffea.dataset_tools.preprocess` and slice the fileset into chunks such that each yielded entry does not exceed `max_chunks` chunks:
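A minimal sketch of what I have in mind; the helper name `slice_fileset` is a placeholder, and it assumes the preprocessed-fileset layout `{dataset: {"files": {path: {"steps": [...], ...}, ...}, ...}}` produced by `preprocess`:

```python
def slice_fileset(fileset, max_chunks):
    """Yield sub-filesets that each contain at most ``max_chunks`` steps ("chunks").

    Assumes the preprocessed-fileset layout
    ``{dataset: {"files": {path: {"steps": [...], ...}, ...}, ...}}``.
    """
    current, n_chunks = {}, 0
    for dataset, dataset_info in fileset.items():
        for fname, finfo in dataset_info["files"].items():
            for step in finfo["steps"]:
                if n_chunks >= max_chunks:
                    # Chunk budget exhausted: emit this slice and start a new one
                    yield current
                    current, n_chunks = {}, 0
                # Re-create the dataset/file entries in the current slice,
                # keeping everything except the steps we are redistributing
                dentry = current.setdefault(
                    dataset, {k: v for k, v in dataset_info.items() if k != "files"}
                )
                fentry = dentry.setdefault("files", {}).setdefault(
                    fname, {k: v for k, v in finfo.items() if k != "steps"}
                )
                fentry.setdefault("steps", []).append(step)
                n_chunks += 1
    if current:
        yield current
```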
I'm still not sure what the best path forward would be to make it intuitive how to use this function, though... We can use it like this:
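For instance, a sketch along the lines of the dask-based coffea workflow; here `fileset`, `my_analysis`, `merge`, and the `max_chunks` value are placeholders, and the `preprocess`/`apply_to_fileset` calls are assumed from `coffea.dataset_tools`:

```python
import dask
from coffea.dataset_tools import apply_to_fileset, preprocess
from coffea.nanoevents import NanoAODSchema

dataset_runnable, _ = preprocess(fileset, step_size=50_000)

outputs = []
for sub_fileset in slice_fileset(dataset_runnable, max_chunks=500):
    # Each slice builds a bounded task graph that is computed on its own
    to_compute = apply_to_fileset(my_analysis, sub_fileset, schemaclass=NanoAODSchema)
    (result,) = dask.compute(to_compute)
    outputs.append(result)

final = merge(outputs)  # placeholder: some way of combining the per-slice outputs
```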
But this will need some `merge` function to ensure all outputs can be merged (which I'm not sure is easily possible for arbitrary user returns, and I don't fancy resorting to limiting the processor outputs to a set of accumulator classes like what was done for coffea==0.6). I also don't have a handle on how best to estimate what a good `max_chunks` value is, as ultimately it will depend on how complicated the `processor` method is. I think we can get an estimate of the task graph complexity by taking the first file in each defined dataset:
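Roughly something like the sketch below: it builds (but does not compute) the graph for only the first file of each dataset, pulls the dask collections out of whatever the analysis returns via `dask.base.unpack_collections` / `collections_to_dsk`, and divides the task count by the number of steps in that file. The function name and the fileset layout are the same assumptions as above:

```python
import dask.base
from coffea.dataset_tools import apply_to_fileset
from coffea.nanoevents import NanoAODSchema

def tasks_per_chunk(dataset_runnable, analysis):
    """Estimate how many dask tasks a single chunk contributes, per dataset,
    from the graph of only the first file of each dataset."""
    estimates = {}
    for dataset, dataset_info in dataset_runnable.items():
        fname, finfo = next(iter(dataset_info["files"].items()))
        single_file = {dataset: dict(dataset_info, files={fname: finfo})}
        out = apply_to_fileset(analysis, single_file, schemaclass=NanoAODSchema)
        # Collect every dask collection hiding in the (arbitrary) user output
        collections, _ = dask.base.unpack_collections(out, traverse=True)
        n_tasks = len(dask.base.collections_to_dsk(collections, optimize_graph=False))
        estimates[dataset] = n_tasks / max(len(finfo["steps"]), 1)
    return estimates
```

The idea would then be to pick `max_chunks` so that the per-slice graph (roughly this estimate times `max_chunks` tasks) stays below whatever size the scheduler handles comfortably.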
Or similar... Let me know if this might be something worth pursuing.