Bug description
When the optimizer contains any data of type float64, adding a LearningRateMonitor causes a TypeError on MPS backends with Apple silicon. See the self-contained, minimal example in "How to reproduce the bug" below.
The error is:
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 219, in <dictcomp>
name: torch.tensor(value, device=trainer.strategy.root_device) for name, value in latest_stat.items()
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
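The underlying issue is dtype inference: torch.tensor preserves the float64 dtype of NumPy scalars, while plain Python floats fall back to the float32 default. An illustrative check:
import numpy as np
import torch

print(torch.tensor(0.01).dtype)              # torch.float32 (default dtype for Python floats)
print(torch.tensor(np.float64(0.01)).dtype)  # torch.float64 (NumPy scalar dtype is preserved)
# On Apple silicon, adding device="mps" to the second call raises the
# TypeError above, because MPS has no float64 support.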
When the LearningRateMonitor is removed, the code runs through, so the optimizer itself is fine.
Note that the quick fix of removing lr=np.float64(0.01) works only for this minimal example. In my case, the optimizer is imported from an external module and has more parameters, making it much harder to change.
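When patching Lightning is not an option, a user-side workaround is to cast the optimizer's scalar hyperparameters to plain Python floats after the external optimizer has been constructed. A minimal sketch (cast_param_group_floats is a hypothetical helper, not part of Lightning):
import numbers

def cast_param_group_floats(optimizer):
    # Cast scalar hyperparameters (lr, momentum, weight_decay, ...) from
    # numpy float64 to Python float, so LearningRateMonitor only ever
    # converts float32-compatible values.
    for group in optimizer.param_groups:
        for key, value in group.items():
            if key != "params" and isinstance(value, numbers.Real) and not isinstance(value, bool):
                group[key] = float(value)
Calling this on the imported optimizer inside configure_optimizers, before returning it, avoids the crash without touching the external module.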
I also tried out four fixes in the PyTorch Lightning source code; all of them fix the problem, but each might have side effects or fail on other devices or in other configurations:
Replace torch.tensor(value, device=trainer.strategy.root_device) in this line with one of:
torch.tensor(value, device="cpu")
torch.tensor(value, device=value.device)
torch.tensor(value, device=trainer.strategy.root_device, dtype=torch.float32)
value
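For concreteness, the third variant applied to the dict comprehension in _extract_stats would look roughly like this (a sketch against the 2.4.0 source, untested on other backends):
# pytorch_lightning/callbacks/lr_monitor.py, around line 217 in v2.4.0.
# float32 is supported on MPS, so the conversion no longer fails there.
trainer.callback_metrics.update({
    name: torch.tensor(value, device=trainer.strategy.root_device, dtype=torch.float32)
    for name, value in latest_stat.items()
})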
Error messages and logs
Epoch 0: 98%|█████████▊| 49/50 [00:00<00:00, 188.15it/s, v_num=13]
Traceback (most recent call last):
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/test_lr_monitor.py", line 37, in <module>
trainer.fit(model, dataloader)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 233, in advance
call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 218, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 173, in on_train_batch_start
latest_stat = self._extract_stats(trainer, interval)
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 216, in _extract_stats
trainer.callback_metrics.update({
File "/Users/malteebnerlightly/Documents/GitHub/lightly-train/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 217, in <dictcomp>
name: torch.tensor(value, device=trainer.strategy.root_device) for name, value in latest_stat.items()
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
Epoch 0: 98%|█████████▊| 49/50 [00:00<00:00, 129.92it/s, v_num=13]
What version are you seeing the problem on?
v2.4
How to reproduce the bug
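A minimal sketch of the kind of script that triggers the error (class names, shapes, and the dataset here are illustrative, not the exact original file):
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).mean()

    def configure_optimizers(self):
        # The np.float64 learning rate is what later crashes LearningRateMonitor.
        return torch.optim.SGD(self.parameters(), lr=np.float64(0.01))

dataloader = DataLoader(TensorDataset(torch.randn(50, 4)), batch_size=1)
trainer = pl.Trainer(
    accelerator="mps",  # requires Apple silicon; the same script runs fine on "cpu"
    max_epochs=1,
    callbacks=[LearningRateMonitor(logging_interval="step")],
)
trainer.fit(Model(), dataloader)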
Environment
The machine is a MacBook Pro with an Apple M1 Pro chip.
Current environment
- GPU: None
- available: False
- version: None
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.1
- torchmetrics: 1.4.1
- torchvision: 0.19.1
- absl-py: 2.1.0
- aenum: 3.1.15
- aiohappyeyeballs: 2.4.0
- aiohttp: 3.10.5
- aiosignal: 1.3.1
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- async-timeout: 4.0.3
- attrs: 24.2.0
- autocommand: 2.2.2
- backports.tarfile: 1.2.0
- certifi: 2024.7.4
- charset-normalizer: 3.3.2
- exceptiongroup: 1.2.2
- filelock: 3.15.4
- frozenlist: 1.4.1
- fsspec: 2024.9.0
- grpcio: 1.65.5
- huggingface-hub: 0.24.6
- hydra-core: 1.3.2
- idna: 3.8
- importlib-metadata: 8.0.0
- importlib-resources: 6.4.0
- inflect: 7.3.1
- iniconfig: 2.0.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jinja2: 3.1.4
- licenseheaders: 0.8.8
- lightning-utilities: 0.11.7
- markdown: 3.7
- markupsafe: 2.1.5
- more-itertools: 10.3.0
- mpmath: 1.3.0
- multidict: 6.0.5
- mypy: 1.11.1
- mypy-extensions: 1.0.0
- networkx: 3.3
- numpy: 2.1.1
- omegaconf: 2.3.0
- packaging: 24.1
- pillow: 10.4.0
- platformdirs: 4.2.2
- pluggy: 1.5.0
- protobuf: 5.27.3
- psutil: 6.0.0
- pydantic: 1.10.18
- pydantic-core: 2.20.1
- pydeprecate: 0.3.2
- pytest: 8.3.2
- pytest-mock: 3.14.0
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pyyaml: 6.0.2
- regex: 2024.7.24
- requests: 2.32.3
- ruff: 0.6.1
- safetensors: 0.4.4
- setuptools: 74.1.2
- six: 1.16.0
- sympy: 1.13.2
- tensorboard: 2.17.1
- tensorboard-data-server: 0.7.2
- timm: 1.0.8
- tomli: 2.0.1
- torch: 2.4.1
- torchmetrics: 1.4.1
- torchvision: 0.19.1
- tqdm: 4.66.5
- typeguard: 4.3.0
- types-tqdm: 4.66.0.20240417
- typing-extensions: 4.12.2
- urllib3: 2.2.2
- werkzeug: 3.0.3
- wheel: 0.43.0
- yarl: 1.9.11
- zipp: 3.19.2
- OS: Darwin
- architecture:
- 64bit
- processor: arm
- python: 3.10.8
- release: 23.6.0
- version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000
More info
No response