-
Thanks for the writeup! We are currently working on a publication with the latest MLPF datasets and code. It will largely be about the physics results, though there may be a minor computational aspect, as we have not yet published the PyTorch version of the code; this code and these datasets are therefore unpublished and unreviewed for now. However, we can share the current datasets that are compatible with the pre-publication version of the code so you can try integrating it into your project. From the MLPF side, we would ask to be kept in the loop about the results, and to be informed if they will be shown in any non-internal presentation. If you are aiming for a publication later this year, we will hopefully have released the latest version of the code and datasets by then, which you can cite in your paper.
-
Hi @matbun! I am running MLPF version 1.8 with the open CLIC dataset on Kubeflow. You can find an example. Thanks @jpata for your help!
-
Hi there, I am working at the intersection of ML and HPC in a European project called interTwin.
In this project, we are developing a tool called itwinai, which provides an abstraction layer that helps scientists scale their ML models to HPC by leveraging well-known tools such as Ray, Horovod, and DeepSpeed.
itwinai is being developed to support physics and Earth observation use cases, and so far we have integrated medium-sized deep learning models developed by research teams participating in the interTwin project. A HEP model is currently being integrated by CNRS as well.
As of now, itwinai can run distributed ML across multiple nodes using Torch DDP, Ray, Horovod, and Microsoft DeepSpeed. It also supports hyper-parameter optimization with Ray Tune and integrates that with distributed training via Ray Train, allowing users to scale each individual tuning trial to multiple nodes. Other features include configuration management and integration with popular logging frameworks. Moreover, we have developed a first version of a scalability report that profiles how well the training of a model scales to multiple accelerators across multiple nodes, also collecting data on communication vs. computation overhead and GPU energy consumption. You can find an example here: https://itwinai.readthedocs.io/latest/use-cases/virgo_doc.html#scalability-metrics
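To give an idea of the kind of numbers such a scalability report aggregates, here is a minimal sketch (not itwinai's actual implementation; the function name and timing values are made up for illustration) that computes speedup and parallel efficiency from average per-epoch times measured at different GPU counts:

```python
# Minimal sketch of the core metrics behind a scaling report.
# NOTE: illustrative only -- the function and the timings below are
# hypothetical, not taken from itwinai's real API or measurements.

def scaling_metrics(epoch_times: dict[int, float]) -> dict[int, dict[str, float]]:
    """Compute speedup and efficiency relative to the smallest GPU count.

    epoch_times maps number of GPUs -> average epoch time in seconds.
    """
    base_gpus = min(epoch_times)
    base_time = epoch_times[base_gpus]
    report = {}
    for n_gpus, t in sorted(epoch_times.items()):
        speedup = base_time / t
        # Ideal speedup grows linearly with the GPU count, so
        # efficiency is measured speedup over ideal speedup.
        efficiency = speedup / (n_gpus / base_gpus)
        report[n_gpus] = {
            "speedup": round(speedup, 2),
            "efficiency": round(efficiency, 2),
        }
    return report

# Hypothetical average epoch times for 1, 2, and 4 GPUs.
print(scaling_metrics({1: 100.0, 2: 55.0, 4: 32.0}))
```

The same ratio computed between two codebases (integration vs. baseline) at equal GPU counts would give the relative overhead rather than absolute scaling.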
We are also planning to add support for more advanced distributed ML algorithms, for instance model parallelism, and we would like to extend our work on GPU energy consumption and profiling of ML models on HPC.
I discovered MLPF thanks to @erwulff, a colleague of mine from CERN openlab, and I would like to integrate it into itwinai, with the following potential benefits:
Of course I cannot promise anything at the moment, also considering that this would be a side project.
As far as I understand from your citation & reuse policy, there are some constraints on using the latest versions of the datasets. If you prefer, I could generate a synthetic dataset replicating the structure of the real data (we did the same when working with Virgo, for instance).
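As a sketch of what such a synthetic dataset could look like: the snippet below generates random per-event particle lists with a fixed feature schema. The schema (a few float features plus an integer class label per particle) is purely hypothetical and is not the actual MLPF/CLIC data format; it only illustrates the approach of replicating structure without real data.

```python
# Hypothetical synthetic-data generator: mimics the *shape* of a
# particle-flow-style dataset without containing any real physics data.
# The schema is an assumption for illustration, not MLPF's real format.
import random

def make_synthetic_events(n_events: int, max_particles: int = 50,
                          n_features: int = 4, n_classes: int = 5,
                          seed: int = 0) -> list[dict]:
    rng = random.Random(seed)  # fixed seed for reproducibility
    events = []
    for _ in range(n_events):
        n_particles = rng.randint(1, max_particles)
        events.append({
            # One feature vector per particle, values in [-1, 1).
            "features": [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
                         for _ in range(n_particles)],
            # One target class per particle.
            "labels": [rng.randrange(n_classes) for _ in range(n_particles)],
        })
    return events

events = make_synthetic_events(100)
print(len(events), len(events[0]["features"][0]))
```

Since scaling behaviour mostly depends on tensor shapes and data volume rather than data content, this kind of stand-in is usually sufficient for benchmarking purposes.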
Also, if there are meaningful results and you agree, I would like to present the results of the integration at public events (e.g., the CERN openlab workshop) as an example of integrating HEP models into itwinai. Since the PyTorch version is still under development, I understand if you don't want me to publish plots showing how well MLPF x.y.z scales on HPC. Instead, I could show the relative scaling of the itwinai integration compared to the baseline (your plain code). This way I would not show how "well" MLPF scales, but only the improvement or overhead introduced by itwinai, which would be even better suited to my goal.
So, in summary: I really like your work, and I would be happy to help by understanding whether anything we developed in itwinai could be useful to you, and how we could improve it.