Project News ⚡
- [2023/12] The preprint of the Perseus paper is out here!
- [2023/10] We released Perseus, an energy optimizer for large model training. Get started here!
- [2023/09] We moved under `ml-energy`! Please stay tuned for new exciting projects!
- [2023/07] `ZeusMonitor` was used to profile GPU time and energy consumption for the ML.ENERGY leaderboard & Colosseum.
- [2023/03] Chase, an automatic carbon optimization framework for DNN training, will appear at an ICLR'23 workshop.
- [2022/11] Carbon-Aware Zeus won the second overall best solution award at Carbon Hack 22.
Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.
```python
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")

print(f"Energy: {measurement.total_energy} J")
print(f"Time  : {measurement.time} s")
```
Zeus silently profiles different power limits during training and converges to the optimal one.
```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

plo.on_epoch_begin()
for x, y in train_dataloader:
    plo.on_step_begin()
    # Learn from x and y!
    plo.on_step_end()
plo.on_epoch_end()
```
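Conceptually, once profiling has produced time and energy numbers for each candidate power limit, the optimizer picks the limit minimizing a user-weighted time-energy cost. The sketch below illustrates that selection step only; the cost form (a weighted sum of energy and time, following the NSDI'23 paper) is simplified here, and all names and numbers are illustrative, not Zeus's actual implementation:

```python
def pick_power_limit(profile, eta=0.5, max_power=300.0):
    """Pick the power limit minimizing eta * energy + (1 - eta) * max_power * time.

    `profile` maps a power limit (W) to (time in s, energy in J) for one
    training iteration, as measured during profiling. `eta` trades off
    energy savings (eta=1) against training speed (eta=0).
    """
    def cost(entry):
        _, (time, energy) = entry
        return eta * energy + (1 - eta) * max_power * time

    best_limit, _ = min(profile.items(), key=cost)
    return best_limit

# Illustrative numbers: lower limits save energy but slow each iteration down,
# and the lowest limit can slow training so much that energy goes back up.
profile = {
    300: (1.00, 280.0),
    250: (1.05, 240.0),
    200: (1.20, 215.0),
    150: (1.60, 230.0),
}
print(pick_power_limit(profile, eta=0.5))  # 250
```

With `eta=1.0` (energy only) the same profile selects 200 W, and with `eta=0.0` (time only) it selects 300 W.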
```console
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```
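The per-GPU totals above are consistent with integrating the sampled power over the elapsed time (roughly 67 W over about 3 s of samples is about 200 J per GPU). A small sketch of that idea, using simple trapezoidal integration over `(timestamp, watts)` samples; this is only an illustration of the concept, not Zeus's actual implementation:

```python
def energy_from_power_samples(samples):
    """Approximate energy (J) by trapezoidal integration of (time_s, power_w) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)
    return total

# One sample per second at a steady 66 W draw, similar to the log above.
samples = [(0.0, 66.0), (1.0, 66.0), (2.0, 66.0), (3.0, 66.0)]
print(energy_from_power_samples(samples))  # 198.0 J over 3 seconds
```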
```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
```
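The `Measurement` printed above carries the window's elapsed time and a per-GPU energy mapping, and the `total_energy` used in the first code example is the sum over all monitored GPUs. A minimal sketch of that shape, as a hypothetical stand-in rather than Zeus's actual class definition:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """Elapsed time (s) and per-GPU energy (J) for one measurement window."""
    time: float
    energy: dict  # GPU index -> energy in joules

    @property
    def total_energy(self) -> float:
        """Sum of energy across all monitored GPUs."""
        return sum(self.energy.values())

m = Measurement(time=3.448, energy={0: 224.3, 1: 232.8, 2: 233.3, 3: 234.5})
print(f"Energy: {m.total_energy} J")
```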
Please refer to our NSDI'23 paper and slides for details. Check out Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
```
.
├── zeus/                # ⚡ Zeus Python package
│   ├── optimizer/       #    - GPU energy and time optimizers
│   ├── run/             #    - Tools for running Zeus on real training jobs
│   ├── policy/          #    - Optimization policies and extension interfaces
│   ├── util/            #    - Utility functions and classes
│   ├── monitor.py       #    - `ZeusMonitor`: Measure GPU time and energy of any code block
│   ├── controller.py    #    - Tools for controlling the flow of training
│   ├── callback.py      #    - Base class for Hugging Face-like training callbacks
│   ├── simulate.py      #    - Tools for trace-driven simulation
│   ├── analyze.py       #    - Analysis functions for power logs
│   └── job.py           #    - Class for job specification
│
├── zeus_monitor/        # 🔌 GPU power monitor
│   ├── zemo/            #    - A header-only library for querying NVML
│   └── main.cpp         #    - Source code of the power monitor
│
├── examples/            # 🛠️ Examples of integrating Zeus
│
├── capriccio/           # 🌊 A drifting sentiment analysis dataset
│
└── trace/               # 🗃️ Train and power traces for various GPUs and DNNs
```
Refer to Getting started for complete instructions on environment setup, installation, and integration.
We provide a Docker image that comes with all dependencies and environments pre-configured. The only command you need is:
```sh
docker run -it \
    --gpus all                  `# Mount all GPUs` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --ipc host                  `# PyTorch DataLoader workers need enough shm` \
    mlenergy/zeus:latest \
    bash
```
Refer to Environment setup for details.
We provide working examples for integrating and running Zeus in the examples/
directory.
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.
The use of GPUs for training DNNs results in high energy consumption and carbon emissions. Building on top of Zeus, we introduce Chase, a carbon-aware solution. Chase dynamically controls the energy consumption of GPUs and adapts to shifts in carbon intensity during DNN training, reducing the carbon footprint with minimal compromises on training performance. To proactively adapt to shifting carbon intensity, Chase uses a lightweight machine learning algorithm to forecast the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our paper and the chase branch.
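The core intuition can be illustrated with a toy decision rule (purely illustrative, not Chase's actual algorithm, and all names and thresholds here are hypothetical): when the forecast carbon intensity is high, run the GPUs at a low-energy power limit; when the grid is clean, train at full speed.

```python
def pick_limit_for_intensity(intensity_gco2_per_kwh, dirty_threshold=300.0,
                             low_power_limit=150, high_power_limit=300):
    """Toy carbon-aware rule: save energy when forecast carbon intensity is high."""
    if intensity_gco2_per_kwh >= dirty_threshold:
        return low_power_limit   # dirty grid: trade speed for lower energy
    return high_power_limit      # clean grid: train at full speed

# Hypothetical hourly carbon intensity forecast (gCO2/kWh).
forecast = [420.0, 380.0, 290.0, 180.0]
print([pick_limit_for_intensity(x) for x in forecast])  # [150, 150, 300, 300]
```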
```bibtex
@inproceedings{zeus-nsdi23,
    title     = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
    author    = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
    booktitle = {USENIX NSDI},
    year      = {2023}
}
```
Jae-Won Chung ([email protected])