Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support Wan2.1 inference #856

Merged
merged 56 commits into from
Feb 28, 2025
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
0d72737
init
zhtmike Feb 26, 2025
fc6f203
migrate dit/t5/vae
zhtmike Feb 26, 2025
c9d84d3
migrate scheduler
zhtmike Feb 26, 2025
9e95dd0
migrate text2video
zhtmike Feb 26, 2025
e271e0d
migrate generate
zhtmike Feb 27, 2025
063f133
fix error
zhtmike Feb 27, 2025
fba688b
fix error 2
zhtmike Feb 27, 2025
423a7b1
fix error 3
zhtmike Feb 27, 2025
2f2f0a6
fix error 4
zhtmike Feb 27, 2025
ab7b51b
migrate image2video
zhtmike Feb 27, 2025
4e699a3
remove syn
zhtmike Feb 27, 2025
58e07ac
add clip
HaoyangLee Feb 27, 2025
4538a2b
migrate clip
zhtmike Feb 27, 2025
ccf62bf
fix i2v bug
HaoyangLee Feb 27, 2025
72e77fd
fix error 5
zhtmike Feb 28, 2025
d523b6b
fix error 6
zhtmike Feb 28, 2025
158c9bc
fix error 7
zhtmike Feb 28, 2025
33f1de9
support fsdp
zhtmike Feb 28, 2025
5abd3a9
add global seed
zhtmike Feb 28, 2025
d3f44b9
support sp
zhtmike Feb 28, 2025
f614003
change args name
zhtmike Feb 28, 2025
376df99
update readme
SamitHuang Feb 28, 2025
acfb69a
update readme
SamitHuang Feb 28, 2025
3b62d5c
Update README.md
SamitHuang Feb 28, 2025
ab9fa57
Update README.md
SamitHuang Feb 28, 2025
6c894a5
update readme
SamitHuang Feb 28, 2025
a7ce053
clean files
zhtmike Feb 28, 2025
1d739f2
fix readme typo
SamitHuang Feb 28, 2025
6fa31cc
fix seed
zhtmike Feb 28, 2025
5721beb
update requirement.txt
zhtmike Feb 28, 2025
a686e26
fix error 8
zhtmike Feb 28, 2025
8e33539
change args name 2
zhtmike Feb 28, 2025
fd07871
update
SamitHuang Feb 28, 2025
718870b
rm files
SamitHuang Feb 28, 2025
2e9cb8d
fix after rm
SamitHuang Feb 28, 2025
538444f
update readme
SamitHuang Feb 28, 2025
4a19db9
rm torchvision transforms
HaoyangLee Feb 28, 2025
2902a70
fix error 9
zhtmike Feb 28, 2025
a366bbe
update readme (#15)
SamitHuang Feb 28, 2025
600809a
add pil2tensor, rm torchvision
HaoyangLee Feb 28, 2025
13a5419
update docstring
zhtmike Feb 28, 2025
ba74e15
small fix
SamitHuang Feb 28, 2025
d285995
fix error 10
zhtmike Feb 28, 2025
6837f3b
fix
HaoyangLee Feb 28, 2025
5388846
Merge pull request #16 from HaoyangLee/wan2_1
HaoyangLee Feb 28, 2025
b061523
add make_grid_ms, save_image_ms
HaoyangLee Feb 28, 2025
0423dc7
fix make_grid_ms
HaoyangLee Feb 28, 2025
f727335
fix save_image_ms
HaoyangLee Feb 28, 2025
0ea8982
rm distributed
SamitHuang Feb 28, 2025
0fc4472
Update README.md
SamitHuang Feb 28, 2025
fb8bc0a
Merge pull request #19 from SamitHuang/wan2_1_pr
HaoyangLee Feb 28, 2025
83bc963
Merge pull request #20 from HaoyangLee/wan2_1
HaoyangLee Feb 28, 2025
938b919
fix linting
HaoyangLee Feb 28, 2025
50ebfdc
Merge pull request #21 from HaoyangLee/wan2_1
HaoyangLee Feb 28, 2025
4d69dcb
fix linting
HaoyangLee Feb 28, 2025
3aadf32
Merge pull request #22 from HaoyangLee/wan2_1
HaoyangLee Feb 28, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
282 changes: 282 additions & 0 deletions examples/wan2_1/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,282 @@
<div align="center">
<h1>🚀 Wan: Open and Advanced Large-Scale Video Generative Models </h1>

</div>

In this repository, we present an efficient MindSpore implementation of [Wan2.1](https://github.com/Wan-Video/Wan2.1). This repository is built on the models and code released by the Alibaba Wan group. We are grateful for their exceptional work and generous contribution to open source.

- 👍 **SOTA Performance**: **Wan2.1** consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
- 👍 **Multiple Tasks**: **Wan2.1** excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.
- 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
- 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.


## Video Demos

The following videos are generated based on MindSpore and Ascend 910*.

- Text-to-Video

https://github.com/user-attachments/assets/f6705d28-7755-447b-a256-6727f66d693b

```text
prompt: A sepia-toned vintage photograph depicting a whimsical bicycle race featuring several dogs wearing goggles and tiny cycling outfits. The canine racers, with determined expressions and blurred motion, pedal miniature bicycles on a dusty road. Spectators in period clothing line the sides, adding to the nostalgic atmosphere. Slightly grainy and blurred, mimicking old photos, with soft side lighting enhancing the warm tones and rustic charm of the scene. 'Bicycle Race' captures this unique moment in a medium shot, focusing on both the racers and the lively crowd.
```

https://github.com/user-attachments/assets/1e1da53a-9112-4fc3-bb8e-b458497c4806

```text
prompt: Film quality, professional quality, rich details. The video begins to show the surface of a pond, and the camera slowly zooms in to a close-up. The water surface begins to bubble, and then a blonde woman is seen coming out of the lotus pond soaked all over, showing the subtle changes in her facial expression, creating a dreamy atmosphere.
```

https://github.com/user-attachments/assets/34e4501f-a207-40bb-bb6c-b162ff6505b0

```text
prompt: Two anthropomorphic cats wearing boxing suits and bright gloves fiercely battled on the boxing ring under the spotlight. Their muscles are tight, displaying the strength and agility of professional boxers. A spotted dog judge stood aside. The animals in the audience around cheered and cheered, adding a lively atmosphere to the competition. The cat's boxing movements are quick and powerful, with its paws tracing blurry trajectories in the air. The screen adopts a dynamic blur effect, close ups, and focuses on the intense confrontation on the boxing ring.
```

https://github.com/user-attachments/assets/aceda253-78a2-4fa5-9edc-83f035c7c2ea

```text
prompt: Sports photography full of dynamism, several motorcycles fiercely compete on the loess flying track, their wheels rolling up the dust in the sky. The motorcyclist is wearing professional racing clothes. The camera uses a high-speed shutter to capture moments, follows from the side and rear, and finally freezes in a close-up of a motorcycle, showcasing its exquisite body lines and powerful mechanical beauty, creating a tense and exciting racing atmosphere. Close up dynamic perspective, perfectly presenting the visual impact of speed and power.
```

- Image-to-Video


https://github.com/user-attachments/assets/d37bf480-595e-4a41-95f8-acbc421b7428

```text
prompt: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.
```


## 🔥 Latest News!!

* Feb 28, 2025: 👋 MindSpore implementation of Wan2.1 is released, supporting text-to-video and image-to-video inference tasks on 1.3B and 14B models.


## 📑 Todo List
- Wan2.1 Text-to-Video
- [x] Single-NPU Inference code of the 14B and 1.3B models
- [x] Multi-NPU Inference code of the 14B models
- [ ] prompt extension
- [ ] Gradio demo
- Wan2.1 Image-to-Video
- [x] Single-NPU Inference code of the 14B model
- [x] Multi-NPU Inference code of the 14B model
- [ ] prompt extension
- [ ] Gradio demo


## Quickstart

### Requirments

The code is tested in the following environments

| mindspore | ascend driver | firmware | cann tookit/kernel |
| :---: | :---: | :---: | :---: |
| 2.5.0 | 24.1.0 |7.35.23 | 8.0.RC3.beta1 |


### Installation
Clone the repo:
```
git clone https://github.com/mindspore-lab/mindone
cd mindone/examples/wan2_1
```

Install dependencies:
```
pip install -r requirements.txt
```

### Model Download

| Models | Download Link | Notes |
| --------------|-------------------------------------------------------------------------------|-------------------------------|
| T2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P
| I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
| I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
| T2V-1.3B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P

> 💡Note: The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.


Download the models using huggingface-cli or modelscope:
```
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./Wan2.1-I2V-14B-480P
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./Wan2.1-I2V-14B-720P
```

Download models using modelscope-cli similarly:
```
pip install modelscope
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```

### Run Text-to-Video Generation

This repository supports two Text-to-Video models (1.3B and 14B) and two resolutions (480P and 720P). The parameters and configurations for these models are as follows:

<table>
<thead>
<tr>
<th rowspan="2">Task</th>
<th colspan="2">Resolution</th>
<th rowspan="2">Model</th>
</tr>
<tr>
<th>480P</th>
<th>720P</th>
</tr>
</thead>
<tbody>
<tr>
<td>t2v-14B</td>
<td style="color: green;">✔️</td>
<td style="color: green;">✔️</td>
<td>Wan2.1-T2V-14B</td>
</tr>
<tr>
<td>t2v-1.3B</td>
<td style="color: green;">✔️</td>
<td style="color: red;">❌</td>
<td>Wan2.1-T2V-1.3B</td>
</tr>
</tbody>
</table>


- Single-NPU inference

```
python generate.py \
--task t2v-14B \
--size 1280*720 \
--ckpt_dir ./Wan2.1-T2V-14B \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```

```
python generate.py \
--task t2v-1.3B \
--size 832*480 \
--ckpt_dir ./Wan2.1-T2V-1.3B \
--sample_shift 8 \
--sample_guide_scale 6 \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```

> 💡Note: If you are using the `T2V-1.3B` model, we recommend setting the parameter `--sample_guide_scale 6`. The `--sample_shift parameter` can be adjusted within the range of 8 to 12 based on the performance.


- Multi-NPU inference

```
msrun --worker_num=2 --local_worker_num=2 generate.py \
--task t2v-14B \
--size 1280*720 \
--ckpt_dir ./Wan2.1-T2V-14B \
--distributed \
--dit_zero3 --t5_zero3 --ulysses_sp \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```

> 💡 We support 720 T2V inference using only 1 card. But using more cards can accelerate the generation process.

### Run Image-to-Video Generation

Similar to Text-to-Video, Image-to-Video supports different resolutions. The specific parameters and their corresponding settings are as follows:

<table>
<thead>
<tr>
<th rowspan="2">Task</th>
<th colspan="2">Resolution</th>
<th rowspan="2">Model</th>
</tr>
<tr>
<th>480P</th>
<th>720P</th>
</tr>
</thead>
<tbody>
<tr>
<td>i2v-14B</td>
<td style="color: green;">❌</td>
<td style="color: green;">✔️</td>
<td>Wan2.1-I2V-14B-720P</td>
</tr>
<tr>
<td>i2v-14B</td>
<td style="color: green;">✔️</td>
<td style="color: red;">❌</td>
<td>Wan2.1-T2V-14B-480P</td>
</tr>
</tbody>
</table>


- Single-NPU inference

```
python generate.py \
--task i2v-14B \
--size 832*480 \
--ckpt_dir ./Wan2.1-I2V-14B-480P \
--image examples/i2v_input.JPG \
--prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

> 💡For the Image-to-Video task, the `size` parameter represents the area of the generated video, with the aspect ratio following that of the original input image.


- Multi-NPU inference

```
msrun --worker_num=2 --local_worker_num=2 generate.py \
--task i2v-14B --size 1280*720 \
--ckpt_dir ./Wan2.1-I2V-14B-720P \
--distributed \
--dit_zero3 --t5_zero3 --ulysses_sp \
--image examples/i2v_input.JPG \
--prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

> 💡At least 2 cards are required to run 720P I2V generation to avoid OOM. 8 cards will accelerate the generation process at most.

## Performance

Experiments are tested on ascend 910* with mindspore 2.5.0 **pynative** mode:

| model | h x w x f | cards | steps | npu peak memory | s/video |
|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
| T2V-1.3B | 832x480x81 | 1 | 50 | 21GB | ~235 |
| T2V-14B | 1280x720x81 | 1 | 50 | 52.2GB | ~4650 |
| I2V-14B | 832x480x81 | 1 | 40 | 50GB | ~1150 |
| I2V-14B | 1280x720x81 | 4 | 40 | 25GB | ~1000 |


## Citation

```
@article{wan2.1,
title = {Wan: Open and Advanced Large-Scale Video Generative Models},
author = {Wan Team},
journal = {},
year = {2025}
}
```

## License Agreement
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the your generate contents, granting you the freedom to use them while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the [license](LICENSE.txt).


## Acknowledgements

We would like to thank the contributors to the [Wan2.1](https://github.com/Wan-Video/Wan2.1), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Qwen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research.
Binary file added examples/wan2_1/examples/i2v_input.JPG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading