Project News ⚡
- [2023/12] The preprint of the Perseus paper is out here!
- [2023/10] We released Perseus, an energy optimizer for large model training. Get started here!
- [2023/09] We moved under `ml-energy`! Please stay tuned for new exciting projects!
- [2023/07] `ZeusMonitor` was used to profile GPU time and energy consumption for the ML.ENERGY leaderboard & Colosseum.
- [2023/03] Chase, an automatic carbon optimization framework for DNN training, will appear at an ICLR'23 workshop.
- [2022/11] Carbon-Aware Zeus won the second overall best solution award at Carbon Hack 22.
Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.
```python
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")

print(f"Energy: {measurement.total_energy} J")
print(f"Time  : {measurement.time} s")
```
Zeus silently profiles different power limits during training and converges to the optimal one.
```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

plo.on_epoch_begin()
for x, y in train_dataloader:
    plo.on_step_begin()
    # Learn from x and y!
    plo.on_step_end()
plo.on_epoch_end()
```
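Conceptually, once profiling has produced time and energy numbers for each candidate power limit, the optimizer picks the limit minimizing a user-weighted time-energy cost. The sketch below illustrates that selection step only; the cost form (a weighted sum of energy and time, following the NSDI'23 paper) is simplified here, and all names and numbers are illustrative, not Zeus's actual implementation:

```python
def pick_power_limit(profile, eta=0.5, max_power=300.0):
    """Pick the power limit minimizing eta * energy + (1 - eta) * max_power * time.

    `profile` maps a power limit (W) to (time in s, energy in J) for one
    training iteration, as measured during profiling. `eta` trades off
    energy savings (eta=1) against training speed (eta=0).
    """
    def cost(entry):
        _, (time, energy) = entry
        return eta * energy + (1 - eta) * max_power * time

    best_limit, _ = min(profile.items(), key=cost)
    return best_limit

# Illustrative numbers: lower limits save energy but slow each iteration down,
# and the lowest limit can slow training so much that energy goes back up.
profile = {
    300: (1.00, 280.0),
    250: (1.05, 240.0),
    200: (1.20, 215.0),
    150: (1.60, 230.0),
}
print(pick_power_limit(profile, eta=0.5))  # 250
```

With `eta=1.0` (energy only) the same profile selects 200 W, and with `eta=0.0` (time only) it selects 300 W.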
```console
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```
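The per-GPU totals above are consistent with integrating the sampled power over the elapsed time (roughly 67 W over about 3 s of samples is about 200 J per GPU). A small sketch of that idea, using simple trapezoidal integration over `(timestamp, watts)` samples; this is only an illustration of the concept, not Zeus's actual implementation:

```python
def energy_from_power_samples(samples):
    """Approximate energy (J) by trapezoidal integration of (time_s, power_w) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)
    return total

# One sample per second at a steady 66 W draw, similar to the log above.
samples = [(0.0, 66.0), (1.0, 66.0), (2.0, 66.0), (3.0, 66.0)]
print(energy_from_power_samples(samples))  # 198.0 J over 3 seconds
```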
```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
```
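The `Measurement` printed above carries the window's elapsed time and a per-GPU energy mapping, and the `total_energy` used in the first code example is the sum over all monitored GPUs. A minimal sketch of that shape, as a hypothetical stand-in rather than Zeus's actual class definition:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """Elapsed time (s) and per-GPU energy (J) for one measurement window."""
    time: float
    energy: dict  # GPU index -> energy in joules

    @property
    def total_energy(self) -> float:
        """Sum of energy across all monitored GPUs."""
        return sum(self.energy.values())

m = Measurement(time=3.448, energy={0: 224.3, 1: 232.8, 2: 233.3, 3: 234.5})
print(f"Energy: {m.total_energy} J")
```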
Please refer to our NSDI'23 paper and slides for details. Check out Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
```
.
├── zeus/                # ⚡ Zeus Python package
│   ├── optimizer/       #    - GPU energy and time optimizers
│   ├── run/             #    - Tools for running Zeus on real training jobs
│   ├── policy/          #    - Optimization policies and extension interfaces
│   ├── util/            #    - Utility functions and classes
│   ├── monitor.py       #    - `ZeusMonitor`: Measure GPU time and energy of any code block
│   ├── controller.py    #    - Tools for controlling the flow of training
│   ├── callback.py      #    - Base class for Hugging Face-like training callbacks
│   ├── simulate.py      #    - Tools for trace-driven simulation
│   ├── analyze.py       #    - Analysis functions for power logs
│   └── job.py           #    - Class for job specification
│
├── zeus_monitor/        # 🔌 GPU power monitor
│   ├── zemo/            #    - A header-only library for querying NVML
│   └── main.cpp         #    - Source code of the power monitor
│
├── examples/            # 🛠️ Examples of integrating Zeus
│
├── capriccio/           # 🌊 A drifting sentiment analysis dataset
│
└── trace/               # 🗃️ Train and power traces for various GPUs and DNNs
```
Refer to Getting started for complete instructions on environment setup, installation, and integration.
We provide a Docker image that comes with all dependencies and environments pre-configured. The only command you need is:
```sh
docker run -it \
    --gpus all                  `# Mount all GPUs` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --ipc host                  `# PyTorch DataLoader workers need enough shm` \
    mlenergy/zeus:latest \
    bash
```
Refer to Environment setup for details.
We provide working examples for integrating and running Zeus in the examples/
directory.
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.
The use of GPUs for training DNNs results in high energy consumption and carbon emissions. Building on top of Zeus, we introduce Chase, a carbon-aware solution. Chase dynamically controls the energy consumption of GPUs and adapts to shifts in carbon intensity during DNN training, reducing the carbon footprint with minimal compromises on training performance. To proactively adapt to shifting carbon intensity, Chase uses a lightweight machine learning algorithm to forecast the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our paper and the chase branch.
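The core intuition can be illustrated with a toy decision rule (purely illustrative, not Chase's actual algorithm, and all names and thresholds here are hypothetical): when the forecast carbon intensity is high, run the GPUs at a low-energy power limit; when the grid is clean, train at full speed.

```python
def pick_limit_for_intensity(intensity_gco2_per_kwh, dirty_threshold=300.0,
                             low_power_limit=150, high_power_limit=300):
    """Toy carbon-aware rule: save energy when forecast carbon intensity is high."""
    if intensity_gco2_per_kwh >= dirty_threshold:
        return low_power_limit   # dirty grid: trade speed for lower energy
    return high_power_limit      # clean grid: train at full speed

# Hypothetical hourly carbon intensity forecast (gCO2/kWh).
forecast = [420.0, 380.0, 290.0, 180.0]
print([pick_limit_for_intensity(x) for x in forecast])  # [150, 150, 300, 300]
```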
```bibtex
@inproceedings{zeus-nsdi23,
    title     = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
    author    = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
    booktitle = {USENIX NSDI},
    year      = {2023}
}
```
Jae-Won Chung ([email protected])