MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
Yiren Song, Cheng Liu, and Mike Zheng Shou
Show Lab, National University of Singapore
A Gradio app demo for the Asymmetric LoRA and Recraft models is available on Hugging Face Spaces.
```bash
git clone https://github.com/showlab/MakeAnything.git
cd MakeAnything

conda create -n makeanything python=3.11.10
conda activate makeanything

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade -r requirements.txt

accelerate config
```
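After `accelerate config`, a quick way to confirm the pinned PyTorch build is active (a minimal sketch, assuming the CUDA 12.4 wheels installed above):

```python
# Sanity check that the environment matches the pinned versions.
import torch

print("torch:", torch.__version__)          # expected: 2.5.1+cu124
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```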
You can download the trained checkpoints of Asymmetric LoRA & LoRA for inference. Below are the details of available models:
| Model | Description | Resolution |
|---|---|---|
| asylora_9f_general | Asymmetric LoRA fine-tuned on all 9-frame datasets. Index of `lora_up`: 1: LEGO, 2: Cook, 3: Painting, 4: Icon, 5: Landscape illustration, 6: Portrait, 7: Transformer, 8: Sand art, 9: Illustration, 10: Sketch | 1056×1056 |
| asylora_4f_general | Asymmetric LoRA fine-tuned on all 4-frame datasets. Index of `lora_up`: 1–10 same as 9f; 11: Clay toys, 12: Clay sculpture, 13: ZBrush modeling, 14: Wood sculpture, 15: Ink painting, 16: Pencil sketch, 17: Fabric toys, 18: Oil painting, 19: Jade carving, 20: Line draw, 21: Emoji | 1024×1024 |
The training process relies on a paired dataset consisting of text captions and images. Each dataset folder contains both `.caption` and `.png` files, where the filenames of the caption files correspond directly to the image filenames. Here is an example of an organized dataset:
```
dataset/
├── portrait_001.png
├── portrait_001.caption
├── portrait_002.png
├── portrait_002.caption
├── lego_001.png
├── lego_001.caption
```
The `.caption` files contain a single line of text that serves as the prompt for generating the corresponding image. The prompt must specify the index of the `lora_up` used for that particular training sample in the Asymmetric LoRA. The format is `--lora_up <index>`, where `<index>` is the index of the B matrix in the Asymmetric LoRA, corresponding to the domain used in training. Indices start from 1, not 0.
For example, a .caption file for a portrait painting sequence might look as follows:
```
3*3 of 9 sub-images, step-by-step portrait painting process, 1 girl --lora_up 6
```
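Before training, you may want to verify that every image has a matching caption and that each caption carries a `--lora_up` tag. Below is a minimal sketch (not part of the released scripts; the dataset path is a placeholder):

```python
# Check that each .png has a same-named .caption, and that every
# caption ends with a `--lora_up <index>` tag (index starts at 1).
import re
from pathlib import Path

dataset = Path("/path/to/dataset")           # replace with your dataset folder
pattern = re.compile(r"--lora_up\s+[1-9]\d*$")

for image in sorted(dataset.glob("*.png")):
    caption = image.with_suffix(".caption")
    assert caption.exists(), f"missing caption for {image.name}"
    text = caption.read_text().strip()
    assert pattern.search(text), f"no --lora_up tag in {caption.name}"
print("dataset looks consistent")
```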
Then, you should organize your dataset configuration file, written in `TOML`. Here is an example:
```toml
[general]
enable_bucket = false

[[datasets]]
resolution = 1056
batch_size = 1

[[datasets.subsets]]
image_dir = '/path/to/dataset/'
caption_extension = '.caption'
num_repeats = 1
```
It is recommended to set the batch size to 1 and the resolution to 1024 (4-frame) or 1056 (9-frame).
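If you want to check the configuration programmatically, the sketch below loads it with `tomllib` (bundled with the Python 3.11 environment created above) and verifies the resolution/frame-count pairing; the filename `dataset_config.toml` is just an example:

```python
# Load the dataset config and check the resolution/frame-count pairing.
import tomllib

with open("dataset_config.toml", "rb") as f:   # path is an example
    cfg = tomllib.load(f)

for ds in cfg["datasets"]:
    res = ds["resolution"]
    assert res in (1024, 1056), f"unexpected resolution {res}"
    frames = 4 if res == 1024 else 9
    print(f"resolution {res} -> {frames}-frame sequences, batch_size={ds['batch_size']}")
```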
We have provided a template file for training the Asymmetric LoRA in `scripts/asylora_train.sh`. Simply replace the corresponding paths with yours to start training. Note that `lora_ups_num` in the script is the total number of B matrices used in the Asymmetric LoRA, as specified during training.
```bash
chmod +x scripts/asylora_train.sh
scripts/asylora_train.sh
```
Additionally, if you are using our dataset directly for training, note that the `.caption` files in our released dataset do not specify the `--lora_up <index>` field. You will need to update the `.caption` files to include the appropriate `--lora_up <index>` values before starting training.
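A one-off helper along the following lines can batch-update the released captions; note that the domain-to-index mapping here is hypothetical and must be replaced with the `lora_up` indices you actually train with:

```python
# Append `--lora_up <index>` to each released .caption file, inferring
# the domain from the filename prefix (e.g. "portrait_001" -> "portrait").
from pathlib import Path

DOMAIN_INDEX = {"lego": 1, "cook": 2, "portrait": 6}   # hypothetical; extend as needed
dataset = Path("/path/to/dataset")

for caption in dataset.glob("*.caption"):
    domain = caption.stem.rsplit("_", 1)[0]
    index = DOMAIN_INDEX[domain]
    text = caption.read_text().strip()
    if "--lora_up" not in text:
        caption.write_text(f"{text} --lora_up {index}\n")
```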
We have also provided a template file for running inference with the Asymmetric LoRA in `scripts/asylora_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference. Note that `lora_up_cur` in the script is the index of the B matrix to be used for inference.
```bash
chmod +x scripts/asylora_inference.sh
scripts/asylora_inference.sh
```
You can download the trained checkpoints of Recraft Model for inference. Below are the details of available models:
| Model | Description | Resolution |
|---|---|---|
| recraft_9f_lego | Recraft Model trained on the LEGO dataset. Supports 9-frame generation. | 1056×1056 |
| recraft_9f_portrait | Recraft Model trained on the Portrait dataset. Supports 9-frame generation. | 1056×1056 |
| recraft_9f_sketch | Recraft Model trained on the Sketch dataset. Supports 9-frame generation. | 1056×1056 |
| recraft_4f_wood_sculpture | Recraft Model trained on the Wood sculpture dataset. Supports 4-frame generation. | 1024×1024 |
The second phase trains image-to-sequence generation with the Recraft model. This requires a standard LoRA that can be merged into flux.1 before Recraft training. Therefore, the first step is to decompose the Asymmetric LoRA into the original LoRA format.
To achieve this, either train a standard LoRA directly (see the optional method below) or use the script template provided in `scripts/asylora_split.sh` to split the Asymmetric LoRA. The script extracts the required B matrices from the Asymmetric LoRA model. Specifically, `LORA_UP` in the script specifies the index of the B matrix you wish to extract for use as the original LoRA.
```bash
chmod +x scripts/asylora_split.sh
scripts/asylora_split.sh
```
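Conceptually, the split keeps the shared down-projection (A matrix) and one selected up-projection (B matrix), renaming the latter to the standard LoRA key. The sketch below illustrates the idea with an invented key scheme; the real checkpoint layout is handled by `scripts/asylora_split.sh`:

```python
# Conceptual sketch: Asymmetric LoRA stores one shared A per layer and
# several B_i (one per domain); splitting extracts the pair (A, B_k).
# Key names here are illustrative, not the actual checkpoint format.
import torch

def split_asylora(state_dict: dict, lora_up: int) -> dict:
    """Keep the shared lora_down weights and the selected B matrix."""
    out = {}
    suffix = f".lora_up.{lora_up}.weight"        # hypothetical key scheme
    for key, tensor in state_dict.items():
        if key.endswith(".lora_down.weight"):
            out[key] = tensor                    # shared A matrix: copy as-is
        elif key.endswith(suffix):
            # rename B_k to the standard single-B LoRA key
            out[key.replace(suffix, ".lora_up.weight")] = tensor
    return out

# toy demo with a single layer
sd = {
    "layer.lora_down.weight": torch.randn(16, 128),
    "layer.lora_up.1.weight": torch.randn(64, 16),
    "layer.lora_up.2.weight": torch.randn(64, 16),
}
print(sorted(split_asylora(sd, 2)))  # ['layer.lora_down.weight', 'layer.lora_up.weight']
```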
You can also train a standard LoRA for the Recraft process directly, eliminating the need to decompose the Asymmetric LoRA. In our project, we have included the standard LoRA training code from kohya-ss/sd-scripts: `flux_train_network.py` for training and `flux_minimal_inference.py` for inference. You can refer to the related documentation for guidance on how to train.
Alternatively, other training platforms such as kijai/ComfyUI-FluxTrainer are also viable options. These platforms provide tools to facilitate the training and inference of LoRA models for the Recraft process.
Now that you have obtained a standard LoRA, use our `scripts/lora_merge.sh` template script to merge the LoRA into the flux.1 checkpoint for further Recraft training. Note that the merged model may take up around 50 GB of storage.
```bash
chmod +x scripts/lora_merge.sh
scripts/lora_merge.sh
```
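For intuition, merging a LoRA computes, per adapted layer, W_merged = W + (alpha / rank) · B A. A toy sketch with random tensors (the real script operates on flux.1 checkpoints):

```python
# Toy LoRA merge: fold the low-rank update into the base weight.
import torch

out_dim, in_dim, rank, alpha = 64, 128, 16, 16
W = torch.randn(out_dim, in_dim)        # frozen base weight
A = torch.randn(rank, in_dim)           # LoRA down projection
B = torch.randn(out_dim, rank)          # LoRA up projection

W_merged = W + (alpha / rank) * (B @ A)
print(W_merged.shape)                   # torch.Size([64, 128])
```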
The dataset structure for Recraft training follows the same organization format as the dataset for the Asymmetric LoRA, described in Asymmetric LoRA, 2.1 Settings for dataset. A `TOML` configuration file is also required to organize and configure the dataset. Below is a template for the dataset configuration file:
```toml
[general]
flip_aug = false
color_aug = false
keep_tokens_separator = "|||"
shuffle_caption = false
caption_tag_dropout_rate = 0
caption_extension = ".caption"

[[datasets]]
batch_size = 1
enable_bucket = true
resolution = [1024, 1024]

[[datasets.subsets]]
image_dir = "/path/to/dataset/"
num_repeats = 1
```
Note that for training with 4-frame step sequences, the resolution must be set to `1024`; for training with 9-frame sequences, the resolution should be `1056`.
For the sampling phase of the Recraft training process, you need to prepare two text files: `sample_images.txt` and `sample_prompts.txt`. These files store the sampled condition images and their corresponding prompts, respectively. Below are templates for both files:
`sample_images.txt`:
```
/path/to/image_1.png
/path/to/image_2.png
```
`sample_prompts.txt`:
```
image_1_prompt_content
image_2_prompt_content
```
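The two files are paired line by line: the i-th condition image is sampled with the i-th prompt. A small sketch of how such a pairing can be read:

```python
# Read the condition images and prompts as line-aligned pairs.
with open("sample_images.txt") as fi, open("sample_prompts.txt") as fp:
    pairs = list(zip((l.strip() for l in fi), (l.strip() for l in fp)))

for image_path, prompt in pairs:
    print(image_path, "->", prompt)
```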
We have provided a template file for training the Recraft Model in `scripts/recraft_train.sh`. Simply replace the corresponding paths with yours to start training. Note that `frame_num` in the script must be `4` (for 1024 resolution) or `9` (for 1056 resolution).
```bash
chmod +x scripts/recraft_train.sh
scripts/recraft_train.sh
```
We have also provided a template file for running inference with the Recraft Model in `scripts/recraft_inference.sh`. Once training is done, replace the file paths, fill in your prompt, and run inference.
```bash
chmod +x scripts/recraft_inference.sh
scripts/recraft_inference.sh
```
We have uploaded our datasets to Hugging Face. The datasets include both 4-frame and 9-frame sequence images, covering a total of 21 domains of procedural sequences. For MakeAnything training, each domain consists of 50 sequences, at a resolution of 1024 (4-frame) or 1056 (9-frame). Additionally, we provide an extensive collection of SVG and Sketch datasets for further research and experimentation.
Note that the arrangement of 9-frame sequences follows an S-shape pattern, whereas 4-frame sequences follow a ɔ-shape pattern.
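For reference, the sketch below slices a 1056×1056 composite into 9 frames in S-shape (boustrophedon) order: row 0 left to right, row 1 right to left, row 2 left to right. The analogous 2×2 permutation for the ɔ-shape case is not spelled out here, so only the 9-frame order is shown:

```python
# Split a square 3x3 composite into 9 frames in S-shape reading order.
from PIL import Image

def split_s_shape(path: str, grid: int = 3):
    img = Image.open(path)
    cell = img.width // grid            # 1056 // 3 = 352 px per frame
    frames = []
    for row in range(grid):
        cols = range(grid) if row % 2 == 0 else reversed(range(grid))
        for col in cols:
            box = (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)
            frames.append(img.crop(box))
    return frames                       # frames[0] is step 1, frames[8] is step 9
```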
```bibtex
@inproceedings{Song2025MakeAnythingHD,
  title={MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation},
  author={Yiren Song and Cheng Liu and Mike Zheng Shou},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:276107845}
}
```