System Info
PyTorch version: 2.2.0
Is debug build: False
CUDA used to build PyTorch: 11.8
GPU: A100 PCIe * 4
transformers: 4.45.2
Information
The official example scripts
My own modified scripts
🐛 Describe the bug
Hi, I am trying to fine-tune Llama2-7B using FSDP, but I found that there are two forward recomputations in a single backward pass when FSDP is used together with activation checkpointing, whereas normally there should only be one. This strange behavior decreases the training throughput of the model.
As you can see in the diagram, there are two flash_attention operations.
Here is my tracing file: fsdp_ac_cuda_profile.zip
It seems that enabling both FSDP's activation checkpointing and transformers' gradient checkpointing causes the problem (llama-recipes/src/llama_recipes/finetuning.py, lines 220 to 221 in 799e90e). FSDP's activation checkpointing uses the non-reentrant version of torch.utils.checkpoint, while transformers does the opposite (the reentrant version); please correct me if my understanding is incorrect.
When I delete model.gradient_checkpointing_enable(), the backward behavior becomes normal, and train_epoch_time drops from 248.88 s before the change to 185.07 s after it.
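Roughly, the two calls end up stacking like this (a minimal sketch of the pattern, assuming a llama-recipes-like setup; this is not the exact code at finetuning.py lines 220 to 221):

```python
# Minimal sketch of the double wrapping; helper name and structure are
# illustrative, not the actual llama-recipes code.
import functools
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def apply_both_checkpointing_paths(model):
    # 1) FSDP-side activation checkpointing: wraps every LlamaDecoderLayer
    #    with the non-reentrant torch.utils.checkpoint implementation.
    non_reentrant_wrapper = functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
    )
    # 2) transformers' gradient checkpointing: routes each decoder layer's
    #    forward through torch.utils.checkpoint again, so a single backward
    #    recomputes the same layer (including flash attention) twice,
    #    as observed in the trace.
    model.gradient_checkpointing_enable()
```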
Error logs
Please check the attachment above
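For reference, a trace like the attached one can be captured with torch.profiler and inspected for how many flash-attention kernels appear under a single backward (a generic sketch; the model, batch, and optimizer below stand in for the actual llama-recipes objects):

```python
# Generic torch.profiler sketch for capturing a Chrome trace of one training
# step; not taken from the original script.
from torch.profiler import ProfilerActivity, profile

def capture_one_step_trace(model, batch, optimizer, out_path="fsdp_ac_cuda_profile.json"):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        loss = model(**batch).loss  # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # With both checkpointing paths enabled, the backward region of the trace
    # shows the layer forward (including flash attention) recomputed twice.
    prof.export_chrome_trace(out_path)
```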
Expected behavior
Each backward pass should trigger only one forward recomputation.
Hi @mingyuanw-mt, thanks for flagging this. I think they are using reentrant as well, but in any case this will wrap with torch.utils.checkpoint twice, which leads to the trace. I'll be creating a related PR this week and can disable the second call in it.
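A possible shape for that change, sketched under the assumption that only one wrapper should be applied per layer (the boolean flag below is illustrative, not an actual llama-recipes config field):

```python
# Hypothetical guard that applies only one checkpointing mechanism.
import functools
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def setup_checkpointing(model, use_fsdp_activation_checkpointing: bool):
    if use_fsdp_activation_checkpointing:
        # FSDP path: wrap each decoder layer once, non-reentrant, and skip
        # transformers' gradient checkpointing entirely.
        apply_activation_checkpointing(
            model,
            checkpoint_wrapper_fn=functools.partial(
                checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
            ),
            check_fn=lambda m: isinstance(m, LlamaDecoderLayer),
        )
    else:
        # Otherwise fall back to transformers' built-in gradient checkpointing.
        model.gradient_checkpointing_enable()
```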