-
Thanks for the writeup! We are currently working on a publication with the latest MLPF datasets and code. It will largely be about the physics results, though there may be a minor computational aspect, as we have not yet published the PyTorch version of the code; this code and these datasets are therefore unpublished and unreviewed for now. However, we can share the current datasets that are compatible with the pre-publication version of the code so you can try integrating it into your project. From the MLPF side, we would ask to be kept in the loop about the results, and to be informed if they will be shown in any non-internal presentation. If you are aiming for a publication later this year, we will hopefully have released the latest version of the code and datasets by then, which you can cite in your paper.
-
Hi @matbun! I am running MLPF version 1.8 with the open CLIC dataset on Kubeflow. You can find an example. Thanks @jpata for your help!
-
Hi there, I am working at the intersection of ML and HPC in a European project called interTwin.
In this project, we are developing a tool called itwinai, which provides an abstraction layer that helps scientists scale their ML models to HPC by leveraging well-known tools such as Ray, Horovod, and DeepSpeed.
itwinai is being developed to support physics and Earth observation use cases, and so far we have integrated medium-sized deep learning models developed by research teams participating in the interTwin project. A HEP model is currently being integrated by CNRS as well.
As of now, itwinai can run distributed ML across multiple nodes using Torch DDP, Ray, Horovod, and Microsoft DeepSpeed. It also supports hyper-parameter optimization with Ray Tune and integrates that with distributed training via Ray Train, allowing users to scale each individual tuning trial to multiple nodes. Other features include configuration management and integration with popular logging frameworks. Moreover, we have developed a first version of a scalability report that profiles how well the training of a model scales to multiple accelerators across multiple nodes, also collecting data on communication vs. computation overhead and GPU energy consumption. You can find an example here: https://itwinai.readthedocs.io/latest/use-cases/virgo_doc.html#scalability-metrics
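To give an idea of the kind of numbers such a scalability report aggregates, here is a minimal sketch (not itwinai's actual implementation; the function name and timing values are made up for illustration) that computes speedup and parallel efficiency from average per-epoch times measured at different GPU counts:

```python
# Minimal sketch of the core metrics behind a scaling report.
# NOTE: illustrative only -- the function and the timings below are
# hypothetical, not taken from itwinai's real API or measurements.

def scaling_metrics(epoch_times: dict[int, float]) -> dict[int, dict[str, float]]:
    """Compute speedup and efficiency relative to the smallest GPU count.

    epoch_times maps number of GPUs -> average epoch time in seconds.
    """
    base_gpus = min(epoch_times)
    base_time = epoch_times[base_gpus]
    report = {}
    for n_gpus, t in sorted(epoch_times.items()):
        speedup = base_time / t
        # Ideal speedup grows linearly with the GPU count, so
        # efficiency is measured speedup over ideal speedup.
        efficiency = speedup / (n_gpus / base_gpus)
        report[n_gpus] = {
            "speedup": round(speedup, 2),
            "efficiency": round(efficiency, 2),
        }
    return report

# Hypothetical average epoch times for 1, 2, and 4 GPUs.
print(scaling_metrics({1: 100.0, 2: 55.0, 4: 32.0}))
```

The same ratio computed between two codebases (integration vs. baseline) at equal GPU counts would give the relative overhead rather than absolute scaling.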
We are also planning to add support for more advanced distributed ML algorithms, for instance model parallelism, and we would like to extend our work on GPU energy consumption and profiling of ML models on HPC.
I discovered MLPF thanks to @erwulff, a colleague of mine from CERN openlab, and I would like to integrate it into itwinai, with the following potential benefits:
Of course I cannot promise anything at the moment, also considering that this would be a side project.
As far as I understand from your citation & reuse policy, there are some constraints on using the latest versions of the datasets. If you prefer, I could generate a synthetic dataset replicating the structure of the real data (we did the same when working with Virgo, for instance).
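As a sketch of what such a synthetic dataset could look like: the snippet below generates random per-event particle lists with a fixed feature schema. The schema (a few float features plus an integer class label per particle) is purely hypothetical and is not the actual MLPF/CLIC data format; it only illustrates the approach of replicating structure without real data.

```python
# Hypothetical synthetic-data generator: mimics the *shape* of a
# particle-flow-style dataset without containing any real physics data.
# The schema is an assumption for illustration, not MLPF's real format.
import random

def make_synthetic_events(n_events: int, max_particles: int = 50,
                          n_features: int = 4, n_classes: int = 5,
                          seed: int = 0) -> list[dict]:
    rng = random.Random(seed)  # fixed seed for reproducibility
    events = []
    for _ in range(n_events):
        n_particles = rng.randint(1, max_particles)
        events.append({
            # One feature vector per particle, values in [-1, 1).
            "features": [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
                         for _ in range(n_particles)],
            # One target class per particle.
            "labels": [rng.randrange(n_classes) for _ in range(n_particles)],
        })
    return events

events = make_synthetic_events(100)
print(len(events), len(events[0]["features"][0]))
```

Since scaling behaviour mostly depends on tensor shapes and data volume rather than data content, this kind of stand-in is usually sufficient for benchmarking purposes.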
Also, if there are meaningful results and you agree, I would like to present the results of the integration at public events (e.g., the CERN openlab workshop) as an example of integrating HEP models into itwinai. Since the PyTorch version is still under development, I understand if you don't want me to publish plots showing how well MLPF x.y.z scales on HPC. Instead, I could show the relative scaling of the itwinai integration compared to the baseline (your plain code). This way I would not show how "well" MLPF scales, but only the improvement or overhead introduced by itwinai, which would be even better suited to my goal.
So, in summary: I really like your work, and I would be happy to help by understanding whether anything we developed in itwinai could be useful to you, and how we could improve it.