bloom-560m is supported. Support for hybrid 3D parallelism and the distributed optimizer for 🤗 transformers will be available in the coming weeks (the implementation is essentially done, but it does not yet work with 🤗 transformers). The snippet below trains bloom-560m with hybrid 2D parallelism (tensor + data parallelism); the lines marked with + are the additions pipegoose requires on top of a standard training loop.
from torch.utils.data import DataLoader
+ from torch.utils.data.distributed import DistributedSampler
from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
+ from pipegoose.distributed import ParallelContext, ParallelMode
+ from pipegoose.nn import DataParallel, TensorParallel
+ from pipegoose.optim import DistributedOptimizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token

BATCH_SIZE = 4
+ DATA_PARALLEL_SIZE = 2
+ # One process per GPU: 2-way tensor parallelism x 2-way data parallelism = 4 processes.
+ parallel_context = ParallelContext.from_torch(
+     tensor_parallel_size=2,
+     data_parallel_size=2,
+     pipeline_parallel_size=1
+ )
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = DataParallel(model, parallel_context).parallelize()
model.to("cuda")
+ device = next(model.parameters()).device

optim = Adam(model.parameters(), lr=1e-3)
+ # ZeRO-1: shard the optimizer states across the data parallel group.
+ optim = DistributedOptimizer(optim, parallel_context)

dataset = load_dataset("imdb", split="train")
+ # Each data parallel rank sees a disjoint shard of the dataset.
+ dp_rank = parallel_context.get_local_rank(ParallelMode.DATA)
+ sampler = DistributedSampler(dataset, num_replicas=DATA_PARALLEL_SIZE, rank=dp_rank, seed=42)
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE // DATA_PARALLEL_SIZE, shuffle=False, sampler=sampler)

for epoch in range(100):
+     sampler.set_epoch(epoch)
    for batch in dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=1024, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        labels = inputs["input_ids"]
        outputs = model(**inputs, labels=labels)

        optim.zero_grad()
        outputs.loss.backward()
        optim.step()
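The snippet assumes one process per GPU: with tensor_parallel_size=2 and data_parallel_size=2 it must be launched with 4 processes, for example (assuming the script above is saved as train.py; the filename is just for illustration):

torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py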
Install and try it out
You can install the package with the following command:
pip install pipegoose
Then try the hybrid tensor and data parallelism example script (you need at least 4 GPUs to run hybrid 2D parallelism):
cd pipegoose/examples
torchrun --standalone --nnodes=1 --nproc-per-node=4 hybrid_parallelism.py
We ran a small-scale correctness test by comparing the validation losses of a parallelized transformer against an unmodified baseline, starting from an identical checkpoint and identical training data; a minimal sketch of this kind of check follows the links below. We plan to run rigorous large-scale convergence and weak-scaling benchmarks against Megatron-LM and DeepSpeed in the near future if we can.
- Data Parallelism [link]
- Tensor Parallelism [link] (We've found a bug in convergence, and we are fixing it)
- Hybrid 2D Parallelism (TP+DP) [link]
- Distributed Optimizer ZeRO-1 Convergence: [sgd link] [adam link]
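For illustration, here is a minimal sketch of that kind of check; the validation_loss helper below is hypothetical, not part of pipegoose, and reuses the tokenizer from the snippet above. Train the parallelized model and an unmodified baseline from the same checkpoint on the same batches, then compare their validation losses on a held-out split.

import torch

@torch.no_grad()
def validation_loss(model, dataloader, device):
    # Average causal LM loss over a held-out split.
    model.eval()
    total, steps = 0.0, 0
    for batch in dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=1024, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        total += model(**inputs, labels=inputs["input_ids"]).loss.item()
        steps += 1
    return total / steps

# After the same number of training steps on identical data, the two models
# should agree up to numerical noise, e.g.:
# assert abs(validation_loss(parallel_model, val_loader, device)
#            - validation_loss(baseline_model, val_loader, device)) < 1e-2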
Features
- Megatron-style 3D parallelism (see the configuration sketch after this list)
- Sequence parallelism and Mixture of Experts that work in combination with 3D parallelism
- ZeRO-1: Distributed Optimizer
- Highly optimized CUDA kernels ported from Megatron-LM and DeepSpeed
- ...
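For reference, here is a sketch of what a 3D configuration could look like once pipeline parallelism for 🤗 transformers lands. It uses only ParallelContext.from_torch from the example above; the 2 x 2 x 2 sizes are illustrative and assume 8 GPUs.

from pipegoose.distributed import ParallelContext

# Illustrative 3D layout: 2-way tensor x 2-way pipeline x 2-way data parallelism,
# i.e. 8 processes in total (one per GPU).
parallel_context = ParallelContext.from_torch(
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    data_parallel_size=2
)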
Appreciation
- Big thanks to 🤗 Hugging Face for sponsoring this project with GPUs for testing, and to Zach Schrier for the monthly Twitch donations!
- The library's APIs are inspired by OSLO's and ColossalAI's APIs.
Citation
@software{pipegoose,
  title = {{pipegoose: Large-scale 4D parallelism pre-training for `transformers`}},
  author = {},
  url = {https://github.com/xrsrke/pipegoose},
  doi = {},
  month = {},
  year = {2024},
  version = {},
}