Add model splitting feature doc (#1493)
## Describe your changes

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to
update [example
documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md)
in a follow-up PR.

## (Optional) Issue link
jambayk authored Nov 16, 2024
1 parent a1e9ebc commit 40dc359
Showing 4 changed files with 59 additions and 0 deletions.
58 changes: 58 additions & 0 deletions docs/source/features/model_splitting.md
@@ -0,0 +1,58 @@
# Model Splitting
With the advent of Small Language Models (SLMs) such as Microsoft's Phi family of models, it has become possible to deploy powerful language models on edge devices. However, these models are still a few gigabytes in size even after model optimization and compression techniques like quantization. This can be too large to load in a single session on the edge device, whether due to memory or runtime limitations.

Therefore, we need to split the model into multiple components and run inference on them in a cascade. This raises several questions: how to split the model, how many splits to make, and where to make them? In existing implementations we have seen, users load the model graph and take note of the connections between different sections of the model. These connections are then used to modify the model graph and create the split graphs. However, this requires an understanding of the model architecture in the exported graph and is not a scalable approach.

## Approach
Olive automates this process by using the rich model structure available in the PyTorch model to make split decisions and produce optimized ONNX model splits.

Olive provides multiple ways to guide split decisions:
1. For transformer-like models, if the user already knows the number of splits to make, Olive can divide the transformer layers into equal splits. Such a split decision is made at a higher level of the model architecture.
2. The user can provide a cost model (a CSV containing the per-module cost in terms of memory, FLOPs, and parameter count). Olive currently uses the memory requirements of each layer to determine split assignments; we intend to improve the splitting algorithm by also considering each layer's arithmetic intensity. A sketch of this idea appears after this list.
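
To make the memory-based assignment concrete, here is a minimal Python sketch under stated assumptions: it is not Olive's implementation, and the CSV column names (`module`, `num_bytes`) are hypothetical, so consult the pre-generated cost models in the Olive repository for the actual schema.

```python
import csv

def assign_splits(cost_model_path: str, memory_budget_bytes: int) -> dict[str, int]:
    """Greedily pack modules into splits so no split exceeds the memory budget.

    Illustrative sketch only: the column names "module" and "num_bytes" are
    assumptions about the cost model CSV, and Olive's algorithm may differ.
    """
    assignments: dict[str, int] = {}
    split_id, split_bytes = 0, 0
    with open(cost_model_path, newline="") as f:
        for row in csv.DictReader(f):
            num_bytes = int(row["num_bytes"])
            # Open a new split when adding this module would exceed the budget
            # (a single module larger than the budget still gets its own split).
            if split_bytes and split_bytes + num_bytes > memory_budget_bytes:
                split_id += 1
                split_bytes = 0
            assignments[row["module"]] = split_id
            split_bytes += num_bytes
    return assignments

# Example: assign modules with a 2GB budget per split.
# splits = assign_splits("phi-3.5-cost.csv", 2 * 1024**3)
```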

## CLI
Olive provides command line tools that make it easy for the user to optimize and split models. It includes a utility to generate cost models for LLMs from the HuggingFace hub, as well as pre-generated cost models for popular models.

### auto-opt
Olive provides the `auto-opt` command to convert, optimize, and quantize ONNX models. It now also provides options to split the model.

**`num-splits`**

Let's split the model using a user-defined number of splits.

```bash
olive auto-opt -m microsoft/Phi-3.5-mini-instruct --precision fp16 --provider CUDAExecutionProvider --num-splits 2 -o models/phi-nsplit
```

Olive uses the `model_type` of the HuggingFace model and divides the transformer layers equally among the splits. `microsoft/Phi-3.5-mini-instruct` has 32 such layers, so each split is assigned 16 layers in this example. The first split also includes the embedding layer and attention subgraphs, while the final split includes the language modeling head.
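
For intuition, the equal-division logic amounts to something like the following sketch (an illustration of the idea, not Olive's internals):

```python
# Divide transformer layers as evenly as possible among the splits
# (illustrative sketch, not Olive's actual code).
num_layers, num_splits = 32, 2  # e.g. microsoft/Phi-3.5-mini-instruct with --num-splits 2
base, extra = divmod(num_layers, num_splits)
splits, start = [], 0
for i in range(num_splits):
    count = base + (1 if i < extra else 0)  # spread any remainder over the first splits
    splits.append(list(range(start, start + count)))
    start += count
print([len(s) for s in splits])  # [16, 16]
```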

The following shows the final split models:

<img align="center" alt="num_splits" width="500px" src="../images/model_splitting/num_splits.png"><br>


**`cost-model`**

Let's now split the model using a cost model. Refer to the [pre-generated cost models](https://github.com/microsoft/Olive/blob/main/assets/cost_models/Phi-3.5-mini.csv) in the Olive repository for an example of a cost model CSV.

```bash
olive auto-opt -m microsoft/Phi-3.5-mini-instruct --precision fp16 --provider CUDAExecutionProvider --memory 2GB --cost-model phi-3.5-cost.csv -o models/phi-costsplit
```

Olive uses the memory specs of the device and the cost model to automatically choose the required number of splits and make split assignments for each module.

The following shows the final split models:

<img align="center" alt="cost_model" width="400px" src="../images/model_splitting/cost_model.png"><br>

In this example, Olive split the model into four components, each smaller than the maximum target memory of 2GB specified by the user.
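
As a rough sanity check on this number: Phi-3.5-mini has about 3.8 billion parameters, so its fp16 weights alone occupy roughly 3.8B × 2 bytes ≈ 7.6GB, and ceil(7.6GB / 2GB) = 4 splits. Actual split sizes also depend on per-module boundaries, so treat this only as an approximation.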

### generate-cost-model
This tool generates a cost model for HuggingFace transformers models.

```bash
olive generate-cost-model -m microsoft/Phi-3.5-mini-instruct -p fp16 -o phi-3.5-cost.csv
```
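
The generated `phi-3.5-cost.csv` can then be passed to `auto-opt` via `--cost-model`, as in the earlier example.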

## Conclusion
In this document, we introduced how one can use Olive to split models.
Binary files docs/source/images/model_splitting/num_splits.png and docs/source/images/model_splitting/cost_model.png added (not rendered in the diff view).
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -49,6 +49,7 @@ This document introduces Olive and provides some examples to get you started.
features/conversion
features/quantization
features/model_transformations_and_optimizations
features/model_splitting

.. toctree::
:maxdepth: 1
