MakeAnything

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
Yiren Song, Cheng Liu, and Mike Zheng Shou
Show Lab, National University of Singapore

Quick Start

Gradio app

You can run the Gradio app on Hugging Face Space for either the Asymmetric LoRA or the Recraft Model.

Configuration

1. Environment setup

git clone https://github.com/showlab/MakeAnything.git
cd MakeAnything

conda create -n makeanything python=3.11.10
conda activate makeanything

2. Requirements installation

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade -r requirements.txt

accelerate config
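
Optionally, you can sanity-check the environment before training. A minimal check (not part of the official setup):

# Quick sanity check that the pinned PyTorch build sees your GPU.
import torch

print(torch.__version__)          # expect 2.5.1+cu124
print(torch.cuda.is_available())  # expect True on a CUDA machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))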

Asymmetric LoRA

1. Weights

You can download the trained checkpoints of the Asymmetric LoRA & LoRA for inference. Below are the details of the available models:

| Model | Description | Resolution |
| --- | --- | --- |
| asylora_9f_general | Asymmetric LoRA fine-tuned on all 9-frame datasets. lora_up indices: 1 LEGO, 2 Cook, 3 Painting, 4 Icon, 5 Landscape illustration, 6 Portrait, 7 Transformer, 8 Sand art, 9 Illustration, 10 Sketch | 1056×1056 |
| asylora_4f_general | Asymmetric LoRA fine-tuned on all 4-frame datasets. lora_up indices 1–10 as above, plus: 11 Clay toys, 12 Clay sculpture, 13 ZBrush modeling, 14 Wood sculpture, 15 Ink painting, 16 Pencil sketch, 17 Fabric toys, 18 Oil painting, 19 Jade carving, 20 Line draw, 21 Emoji | 1024×1024 |
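
For example, checkpoints can be fetched with the huggingface_hub client. This is a sketch: the repo id and filename below are placeholders, so substitute the actual ones from our Hugging Face page.

# Sketch: download a checkpoint from Hugging Face.
# The repo id and filename are placeholders; use the actual ones
# listed on the project's Hugging Face page.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="showlab/makeanything",             # placeholder
    filename="asylora_9f_general.safetensors",  # placeholder
)
print(ckpt_path)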

2. Training

2.1 Settings for dataset

The training process relies on a paired dataset of text captions and images. Each dataset folder contains both .caption and .png files, where each caption file's name matches the name of its corresponding image. Here is an example of an organized dataset:

dataset/
├── portrait_001.png
├── portrait_001.caption
├── portrait_002.png
├── portrait_002.caption
├── lego_001.png
├── lego_001.caption

The .caption files contain a single line of text that serves as the prompt for generating the corresponding image. The prompt must specify the index of the lora_up used for that training sample in the Asymmetric LoRA, in the format --lora_up <index>, where <index> is the index of the B matrices in the Asymmetric LoRA and identifies the domain used in training. Indices start from 1, not 0.

For example, a .caption file for a portrait painting sequence might look as follows:

3*3 of 9 sub-images, step-by-step portrait painting process, 1 girl --lora_up 6

Next, organize your dataset configuration file, written in TOML. Here is an example:

[general]
enable_bucket = false

[[datasets]]
resolution = 1056
batch_size = 1

  [[datasets.subsets]]
  image_dir = '/path/to/dataset/'
  caption_extension = '.caption'
  num_repeats = 1

It is recommended to set batch_size to 1 and resolution to 1024 (4-frame) or 1056 (9-frame).

2.2 Start training

We have provided a template file for training the Asymmetric LoRA in scripts/asylora_train.sh. Simply replace the corresponding paths with yours to start training. Note that lora_ups_num in the script sets the total number of B matrices used by the Asymmetric LoRA during training.

chmod +x scripts/asylora_train.sh
scripts/asylora_train.sh

Additionally, if you are using our dataset directly for training, note that the .caption files in our released dataset do not include the --lora_up <index> field. You will need to update the .caption files with the appropriate --lora_up <index> values before starting training.
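
For instance, a minimal script along these lines can batch-append the index, assuming each folder holds captions from a single domain:

# Append "--lora_up <index>" to every .caption file in a folder.
# Assumes one folder per domain; set LORA_UP_INDEX accordingly.
from pathlib import Path

LORA_UP_INDEX = 6  # e.g. 6 = Portrait in asylora_9f_general
for caption in Path("/path/to/dataset").glob("*.caption"):
    text = caption.read_text().strip()
    if "--lora_up" not in text:  # skip files already updated
        caption.write_text(f"{text} --lora_up {LORA_UP_INDEX}\n")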

3. Inference

We have also provided a template file for Asymmetric LoRA inference in scripts/asylora_inference.sh. Once training is done, replace the file paths, fill in your prompt, and run inference. Note that lora_up_cur in the script is the index of the B matrices to use for inference.

chmod +x scripts/asylora_inference.sh
scripts/asylora_inference.sh

Recraft Model

1. Weights

You can download the trained checkpoints of the Recraft Model for inference. Below are the details of the available models:

| Model | Description | Resolution |
| --- | --- | --- |
| recraft_9f_lego | Trained on the LEGO dataset; supports 9-frame generation. | 1056×1056 |
| recraft_9f_portrait | Trained on the Portrait dataset; supports 9-frame generation. | 1056×1056 |
| recraft_9f_sketch | Trained on the Sketch dataset; supports 9-frame generation. | 1056×1056 |
| recraft_4f_wood_sculpture | Trained on the Wood sculpture dataset; supports 4-frame generation. | 1024×1024 |

2. Training

2.1 Obtain standard LoRA

The second phase of training, image-to-sequence generation with the Recraft model, requires a standard LoRA that is merged into flux.1 before Recraft training begins. The first step is therefore to decompose the Asymmetric LoRA into the original LoRA format.

To achieve this, either train a standard LoRA directly (see the optional method below) or use the script template we provide in scripts/asylora_split.sh to split the Asymmetric LoRA. The script extracts the required B matrices from the Asymmetric LoRA model; specifically, LORA_UP in the script specifies the index of the B matrices you wish to extract for use as the original LoRA.

chmod +x scripts/asylora_split.sh
scripts/asylora_split.sh
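
Conceptually, the split keeps the shared A matrices and retains only the selected domain's B matrices. The sketch below illustrates this on a safetensors state dict; the key naming is an assumption for illustration, so use the provided script for real checkpoints.

# Conceptual sketch of extracting one B-matrix set from an Asymmetric
# LoRA checkpoint. The key names ("lora_up.<idx>.") are hypothetical;
# use scripts/asylora_split.sh for real checkpoints.
from safetensors.torch import load_file, save_file

LORA_UP = 6  # index of the B matrices to extract
state = load_file("asylora_9f_general.safetensors")

split = {}
for key, tensor in state.items():
    if f"lora_up.{LORA_UP}." in key:
        # Keep the selected domain's B matrix, renamed to the
        # standard single-up LoRA key.
        split[key.replace(f"lora_up.{LORA_UP}.", "lora_up.")] = tensor
    elif "lora_up." not in key:
        split[key] = tensor  # shared A matrices, alphas, etc.

save_file(split, "standard_lora.safetensors")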

(Optional) Train standard LoRA

You can also train a standard LoRA directly for the Recraft process, eliminating the need to decompose the Asymmetric LoRA. Our project includes the standard LoRA training code from kohya-ss/sd-scripts: flux_train_network.py for training and flux_minimal_inference.py for inference. Refer to the related documentation for guidance on training.

Alternatively, other training platforms such as kijai/ComfyUI-FluxTrainer are also a viable option; they provide tools to facilitate the training and inference of LoRA models for the Recraft process.

2.2 Merge LoRA to flux.1

Now that you have obtained a standard LoRA, use our scripts/lora_merge.sh template script to merge the LoRA into the flux.1 checkpoint for further Recraft training. Note that the merged model may take up around 50 GB of disk space.

chmod +x scripts/lora_merge.sh
scripts/lora_merge.sh
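
For reference, the merge applies the standard LoRA update W' = W + (alpha / r) * B @ A to each targeted weight. A minimal sketch of that arithmetic (shapes are illustrative; the provided script handles flux.1's full key layout):

# Minimal sketch of the LoRA merge arithmetic: W' = W + (alpha / r) * B @ A.
# Real checkpoints contain many such pairs; scripts/lora_merge.sh
# handles the full flux.1 key layout.
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float) -> torch.Tensor:
    rank = A.shape[0]  # A: (r, in_features), B: (out_features, r)
    return W + (alpha / rank) * (B @ A)

W = torch.randn(64, 32)  # base weight (illustrative shape)
A = torch.randn(4, 32)   # LoRA down projection, rank 4
B = torch.randn(64, 4)   # LoRA up projection
W_merged = merge_lora(W, A, B, alpha=4.0)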

2.3 Settings for training

The dataset structure for Recraft training follows the same organization format as the dataset for Asymmetric LoRA, specifically described in Asymmetric LoRA 2.1 Settings for dataset. A TOML configuration file is also required to organize and configure the dataset. Below is a template for the dataset configuration file:

[general]
flip_aug = false
color_aug = false
keep_tokens_separator = "|||"
shuffle_caption = false
caption_tag_dropout_rate = 0
caption_extension = ".caption"

[[datasets]]
batch_size = 1
enable_bucket = true
resolution = [1024, 1024]

[[datasets.subsets]]
image_dir = "/path/to/dataset/"
num_repeats = 1

Note that for training with 4-frame sequences, the resolution must be set to 1024; for 9-frame sequences, the resolution should be 1056.

For the sampling phase of the Recraft training process, we need to organize two text files: sample_images.txt and sample_prompts.txt. These files will store the sampled condition images and their corresponding prompts, respectively. Below are the templates for both files:

sample_images.txt

/path/to/image_1.png
/path/to/image_2.png

sample_prompts.txt

image_1_prompt_content
image_2_prompt_content
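
The two files are matched line by line, so they must contain the same number of entries. A small, optional consistency check:

# Check that sample_images.txt and sample_prompts.txt align line by line.
images = open("sample_images.txt").read().splitlines()
prompts = open("sample_prompts.txt").read().splitlines()
assert len(images) == len(prompts), (
    f"{len(images)} images vs {len(prompts)} prompts -- the files must align"
)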

2.4 Recraft training

We have provided a template file for training the Recraft Model in scripts/recraft_train.sh. Simply replace the corresponding paths with yours to start training. Note that frame_num in the script must be 4 (for 1024 resolution) or 9 (for 1056 resolution).

chmod +x scripts/recraft_train.sh
scripts/recraft_train.sh

3. Inference

We have also provided a template file for Recraft Model inference in scripts/recraft_inference.sh. Once training is done, replace the file paths, fill in your prompt, and run inference.

chmod +x scripts/recraft_inference.sh
scripts/recraft_inference.sh

Datasets

We have uploaded our datasets to Hugging Face. The datasets include both 4-frame and 9-frame sequence images, covering a total of 21 domains of procedural sequences. For MakeAnything training, each domain consists of 50 sequences, at resolutions of either 1024 (4-frame) or 1056 (9-frame). Additionally, we provide an extensive collection of SVG and Sketch datasets for further research and experimentation.

Note that 9-frame sequences are arranged in an S-shape pattern, whereas 4-frame sequences follow a ɔ-shape pattern.
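
For example, slicing a generated 9-frame grid back into ordered steps must follow that S-shape reading order. A sketch, assuming a square grid of equally sized tiles whose first row reads left to right:

# Slice a 3x3 sequence grid into frames in S-shape order:
# row 0 left-to-right, row 1 right-to-left, row 2 left-to-right.
# Assumes equally sized tiles and a first row that reads left to right.
from PIL import Image

def split_s_shape(path: str, n: int = 3) -> list[Image.Image]:
    grid = Image.open(path)
    tile_w, tile_h = grid.width // n, grid.height // n
    frames = []
    for row in range(n):
        cols = range(n) if row % 2 == 0 else reversed(range(n))
        for col in cols:
            box = (col * tile_w, row * tile_h,
                   (col + 1) * tile_w, (row + 1) * tile_h)
            frames.append(grid.crop(box))
    return frames

frames = split_s_shape("portrait_001.png")  # 9 frames in step order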

| Domain | Quantity | Domain | Quantity |
| --- | --- | --- | --- |
| LEGO | 50 | Cook | 50 |
| Painting | 50 | Icon | 50 + 1.4k |
| Landscape Illustration | 50 | Portrait | 50 + 2k |
| Transformer | 50 | Sand Art | 50 |
| Illustration | 50 | Sketch | 50 + 9k |
| Clay Toys | 50 | Clay Sculpture | 50 |
| ZBrush Modeling | 50 | Wood Sculpture | 50 |
| Ink Painting | 50 | Pencil Sketch | 50 |
| Fabric Toys | 50 | Oil Painting | 50 |
| Jade Carving | 50 | Line Draw | 50 |
| Emoji | 50 + 12k | | |

Results

Text-to-Sequence Generation (LoRA & Asymmetric LoRA)

Image-to-Sequence Generation (Recraft Model)

Generalization on Unseen Domains

Citation

@inproceedings{Song2025MakeAnythingHD,
  title={MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation},
  author={Yiren Song and Cheng Liu and Mike Zheng Shou},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:276107845}
}
