Suppose we previously trained a model and saved a checkpoint using a configuration with tensor_parallel_size=2 and pipeline_parallel_size=4. Now we want to load this checkpoint and continue training, but with a new configuration that has tensor_parallel_size=4 and pipeline_parallel_size=3.
With merge_checkpoints=True, instead of each rank saving its own partition, all partitions are merged into a single file and saved in a format that both an unparallelized model and a parallelized model can load.
With save_config=True, configuration values such as tensor_parallel_size and pipeline_parallel_size, along with the arguments of the XParallel and DistributedOptimizer classes, are saved if present.
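To make the merge semantics concrete, here is a minimal sketch of what combining per-rank partitions into one full state dict could look like. merge_partitions is a hypothetical helper, not part of this proposal, and concatenating every partitioned tensor along one fixed dimension is a simplification; the real partition dimension varies per parameter depending on how each layer was parallelized.

# Illustrative sketch only; not the library's actual merging code.
# Assumes each rank saved a state dict whose partitioned tensors were
# split along a known dimension (here: dim 0, a simplification).
import torch

def merge_partitions(partition_paths, partition_dim=0):
    """Combine per-rank checkpoint partitions into one full state dict."""
    rank_states = [torch.load(p, map_location="cpu") for p in partition_paths]
    merged = {}
    for key in rank_states[0]:
        shards = [state[key] for state in rank_states]
        if all(torch.equal(shards[0], s) for s in shards[1:]):
            # Replicated parameter: identical on every rank, keep one copy.
            merged[key] = shards[0]
        else:
            # Partitioned parameter: concatenate the shards back together.
            merged[key] = torch.cat(shards, dim=partition_dim)
    return merged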
APIs
# save checkpoints of a parallelized model
model.save_pretrained(
    save_directory="./checkpoints",
    save_config=True,          # default
    save_function=torch.save,  # default
    merge_checkpoints=True,    # False by default
)
# load checkpoints into a parallelized model
model.from_parallelized(path="./checkpoints")
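For the scenario above, a merged checkpoint can then be consumed by either load path. The following is a sketch, not documented usage: model and vanilla_model are placeholders for a parallelized and an unparallelized instance, and the merged file name pytorch_model.bin inside the save directory is an assumption.

import torch

# (a) Parallelized model: each rank re-partitions the merged checkpoint
# for the current topology (e.g., tensor_parallel_size=4,
# pipeline_parallel_size=3 instead of the original 2 and 4).
model.from_parallelized(path="./checkpoints")

# (b) Unparallelized model: load the merged file as an ordinary state dict
# (assuming the merged checkpoint is a plain torch-serialized state dict).
state_dict = torch.load("./checkpoints/pytorch_model.bin", map_location="cpu")
vanilla_model.load_state_dict(state_dict)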