Replies: 1 comment
-
A couple of trial and test runs with our processors, the …
-
As the typical task graph grows with the number of input chunks, the task graph can grow unexpectedly quickly for complicated processors that need to work with small input chunk sizes (for example, if you are working on a physics analysis that requires PFCandidate-like collections).
The following code can take the output of `coffea.dataset_tools.preprocess` and slice the fileset into chunks such that each yielded entry does not exceed `max_chunks` chunks:
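A minimal sketch of what I have in mind; the helper name `slice_fileset` is a placeholder, and it assumes the preprocessed-fileset layout `{dataset: {"files": {path: {"steps": [...], ...}, ...}, ...}}` produced by `preprocess`:

```python
def slice_fileset(fileset, max_chunks):
    """Yield sub-filesets that each contain at most ``max_chunks`` steps ("chunks").

    Assumes the preprocessed-fileset layout
    ``{dataset: {"files": {path: {"steps": [...], ...}, ...}, ...}}``.
    """
    current, n_chunks = {}, 0
    for dataset, dataset_info in fileset.items():
        for fname, finfo in dataset_info["files"].items():
            for step in finfo["steps"]:
                if n_chunks >= max_chunks:
                    # Chunk budget exhausted: emit this slice and start a new one
                    yield current
                    current, n_chunks = {}, 0
                # Re-create the dataset/file entries in the current slice,
                # keeping everything except the steps we are redistributing
                dentry = current.setdefault(
                    dataset, {k: v for k, v in dataset_info.items() if k != "files"}
                )
                fentry = dentry.setdefault("files", {}).setdefault(
                    fname, {k: v for k, v in finfo.items() if k != "steps"}
                )
                fentry.setdefault("steps", []).append(step)
                n_chunks += 1
    if current:
        yield current
```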
I'm still not sure what the best path forward would be to make it intuitive how to use this function, though... We can use it like this:
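For instance, a sketch along the lines of the dask-based coffea workflow; here `fileset`, `my_analysis`, `merge`, and the `max_chunks` value are placeholders, and the `preprocess`/`apply_to_fileset` calls are assumed from `coffea.dataset_tools`:

```python
import dask
from coffea.dataset_tools import apply_to_fileset, preprocess
from coffea.nanoevents import NanoAODSchema

dataset_runnable, _ = preprocess(fileset, step_size=50_000)

outputs = []
for sub_fileset in slice_fileset(dataset_runnable, max_chunks=500):
    # Each slice builds a bounded task graph that is computed on its own
    to_compute = apply_to_fileset(my_analysis, sub_fileset, schemaclass=NanoAODSchema)
    (result,) = dask.compute(to_compute)
    outputs.append(result)

final = merge(outputs)  # placeholder: some way of combining the per-slice outputs
```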
But this will need some `merge` function to ensure all outputs can be merged (which I'm not sure is easily possible for arbitrary user returns, and I don't fancy resorting to limiting the processor outputs to a set of accumulator classes like what was done for coffea==0.6). I also don't have a handle on how best to estimate what a good `max_chunks` value is, as ultimately it will depend on how complicated the `processor` method is. I think we can get an estimate of the task graph complexity by taking the first file in each defined dataset:
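Roughly something like the sketch below: it builds (but does not compute) the graph for only the first file of each dataset, pulls the dask collections out of whatever the analysis returns via `dask.base.unpack_collections` / `collections_to_dsk`, and divides the task count by the number of steps in that file. The function name and the fileset layout are the same assumptions as above:

```python
import dask.base
from coffea.dataset_tools import apply_to_fileset
from coffea.nanoevents import NanoAODSchema

def tasks_per_chunk(dataset_runnable, analysis):
    """Estimate how many dask tasks a single chunk contributes, per dataset,
    from the graph of only the first file of each dataset."""
    estimates = {}
    for dataset, dataset_info in dataset_runnable.items():
        fname, finfo = next(iter(dataset_info["files"].items()))
        single_file = {dataset: dict(dataset_info, files={fname: finfo})}
        out = apply_to_fileset(analysis, single_file, schemaclass=NanoAODSchema)
        # Collect every dask collection hiding in the (arbitrary) user output
        collections, _ = dask.base.unpack_collections(out, traverse=True)
        n_tasks = len(dask.base.collections_to_dsk(collections, optimize_graph=False))
        estimates[dataset] = n_tasks / max(len(finfo["steps"]), 1)
    return estimates
```

The idea would then be to pick `max_chunks` so that the per-slice graph (roughly this estimate times `max_chunks` tasks) stays below whatever size the scheduler handles comfortably.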
Or similar... Let me know if this might be something worth pursuing.