Suppose we previously trained a model and saved a checkpoint using a configuration with tensor_parallel_size=2 and pipeline_parallel_size=4. Now we want to load this checkpoint and continue training, but with a new configuration that has tensor_parallel_size=4 and pipeline_parallel_size=3.
With merge_checkpoints=True, instead of each rank saving its own partition, all partitions are merged into a single file and saved in a format that both an unparallelized model and a parallelized model can load.
With save_config=True, configuration values such as tensor_parallel_size and pipeline_parallel_size, along with the arguments of the XParallel and DistributedOptimizer classes, are saved if present.
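To make the merge semantics concrete, here is a minimal sketch of what combining per-rank partitions into one full state dict could look like. merge_partitions is a hypothetical helper, not part of this proposal, and concatenating every partitioned tensor along one fixed dimension is a simplification; the real partition dimension varies per parameter depending on how each layer was parallelized.

# Illustrative sketch only; not the library's actual merging code.
# Assumes each rank saved a state dict whose partitioned tensors were
# split along a known dimension (here: dim 0, a simplification).
import torch

def merge_partitions(partition_paths, partition_dim=0):
    """Combine per-rank checkpoint partitions into one full state dict."""
    rank_states = [torch.load(p, map_location="cpu") for p in partition_paths]
    merged = {}
    for key in rank_states[0]:
        shards = [state[key] for state in rank_states]
        if all(torch.equal(shards[0], s) for s in shards[1:]):
            # Replicated parameter: identical on every rank, keep one copy.
            merged[key] = shards[0]
        else:
            # Partitioned parameter: concatenate the shards back together.
            merged[key] = torch.cat(shards, dim=partition_dim)
    return merged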
APIs
# save checkpoints of a parallelized model
model.save_pretrained(
    save_directory="./checkpoints",
    save_config=True,          # default
    save_function=torch.save,  # default
    merge_checkpoints=True,    # False by default
)
# load checkpoints into a parallelized model
model.from_parallelized(path="./checkpoints")
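For the scenario above, a merged checkpoint can then be consumed by either load path. The following is a sketch, not documented usage: model and vanilla_model are placeholders for a parallelized and an unparallelized instance, and the merged file name pytorch_model.bin inside the save directory is an assumption.

import torch

# (a) Parallelized model: each rank re-partitions the merged checkpoint
# for the current topology (e.g., tensor_parallel_size=4,
# pipeline_parallel_size=3 instead of the original 2 and 4).
model.from_parallelized(path="./checkpoints")

# (b) Unparallelized model: load the merged file as an ordinary state dict
# (assuming the merged checkpoint is a plain torch-serialized state dict).
state_dict = torch.load("./checkpoints/pytorch_model.bin", map_location="cpu")
vanilla_model.load_state_dict(state_dict)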