
Compress TinyLlama model using synthetic data

This example demonstrates how to optimize Large Language Models (LLMs) using the NNCF weight compression API together with synthetic data for the advanced compression algorithms. The example applies 4/8-bit mixed-precision quantization and the Scale Estimation algorithm to the weights of the Linear (fully-connected) layers of the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model. To evaluate the accuracy of the compressed model, we measure the similarity between texts generated by the baseline and compressed models using the WhoWhatBench library.
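
For orientation, the core compression call looks roughly like the minimal sketch below. This is an illustration, not the example's exact code: the ratio and group_size values are assumptions chosen for clarity, and calibration_dataset stands in for the wikitext or synthetic dataset prepared in the steps that follow.

    import nncf

    # Minimal sketch of the weight compression call. Here `model` is an
    # OpenVINO model (ov.Model) and `calibration_dataset` is an nncf.Dataset;
    # the ratio and group_size values below are illustrative assumptions.
    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit weights for most Linear layers
        ratio=0.8,         # remaining share of layers stays in 8-bit (mixed precision)
        group_size=64,
        dataset=calibration_dataset,
        scale_estimation=True,  # enables the Scale Estimation algorithm
    )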

The example includes the following steps:

  • Prepare the wikitext dataset.
  • Prepare the TinyLlama/TinyLlama-1.1B-Chat-v1.0 text-generation model in the OpenVINO representation using Optimum-Intel.
  • Compress the model weights with the NNCF weight compression algorithm, using Scale Estimation and the wikitext dataset.
  • Prepare a synthetic dataset using the nncf.data.generate_text_data method (see the sketch after this list).
  • Compress the model weights with the NNCF weight compression algorithm, using Scale Estimation and the synthetic dataset.
  • Measure the similarity between the two models optimized with the different datasets.
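
The pipeline above can be condensed into the following hedged sketch. The seq_len and dataset_size values, the use of the PyTorch model for text generation, and the simplified transform function are all assumptions of this sketch and may differ from the example's main.py.

    from optimum.intel.openvino import OVModelForCausalLM
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import nncf
    from nncf.data import generate_text_data

    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Generate synthetic calibration texts with the model itself. Using the
    # original PyTorch model for generation is an assumption of this sketch;
    # seq_len and dataset_size are illustrative values.
    hf_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    synthetic_texts = generate_text_data(hf_model, tokenizer, seq_len=128, dataset_size=32)

    # Export the model to the OpenVINO representation for compression.
    ov_model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True)

    # Wrap the texts in an nncf.Dataset. The transform function must map each
    # sample to the model's input dictionary; this version is simplified
    # (real LLM inputs may also require position_ids and similar tensors).
    def transform_fn(text):
        return dict(tokenizer(text, return_tensors="np"))

    calibration_dataset = nncf.Dataset(synthetic_texts, transform_fn)
    # `calibration_dataset` is then passed to nncf.compress_weights(...,
    # scale_estimation=True) as sketched earlier.

For the last step, the similarity between the two compressed models is measured with WhoWhatBench. The snippet below follows WhoWhatBench's documented usage at the time of writing; the Evaluator name and the shape of the score() result are assumptions and may differ between versions.

    import whowhatbench

    # base_model serves as the reference; optimized_model is the model under
    # test. Both names are placeholders for models prepared above.
    evaluator = whowhatbench.Evaluator(base_model=base_model, tokenizer=tokenizer)
    metrics_per_prompt, metrics = evaluator.score(optimized_model)
    print("similarity:", metrics["similarity"][0])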

Install requirements

To use this example:

  • Create a separate Python* environment and activate it: python3 -m venv nncf_env && source nncf_env/bin/activate
  • Install dependencies:
pip install -U pip
pip install -r requirements.txt
pip install ../../../../

Run Example

The example is fully automated. Just run the following command in the prepared Python environment:

python main.py