
Compress TinyLlama model using synthetic data

This example demonstrates how to optimize Large Language Models (LLMs) using the NNCF weight compression API together with synthetic data for the advanced compression algorithms. The example applies 4/8-bit mixed-precision quantization and the Scale Estimation algorithm to the weights of the Linear (fully-connected) layers of the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model. To evaluate the accuracy of the compressed model, we measure the similarity between texts generated by the baseline and compressed models using the WhoWhatBench library.
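
For orientation, the core compression call looks roughly like the minimal sketch below. This is an illustration, not the example's exact code: the ratio and group_size values are assumptions chosen for clarity, and calibration_dataset stands in for the wikitext or synthetic dataset prepared in the steps that follow.

    import nncf

    # Minimal sketch of the weight compression call. Here `model` is an
    # OpenVINO model (ov.Model) and `calibration_dataset` is an nncf.Dataset;
    # the ratio and group_size values below are illustrative assumptions.
    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit weights for most Linear layers
        ratio=0.8,         # remaining share of layers stays in 8-bit (mixed precision)
        group_size=64,
        dataset=calibration_dataset,
        scale_estimation=True,  # enables the Scale Estimation algorithm
    )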

The example includes the following steps:

  • Prepare the wikitext dataset.
  • Prepare the TinyLlama/TinyLlama-1.1B-Chat-v1.0 text-generation model in the OpenVINO representation using Optimum-Intel.
  • Compress the model weights with the NNCF weight compression algorithm, using Scale Estimation and the wikitext dataset.
  • Prepare a synthetic dataset using the nncf.data.generate_text_data method (see the sketch after this list).
  • Compress the model weights with the NNCF weight compression algorithm, using Scale Estimation and the synthetic dataset.
  • Measure the similarity between the two models optimized with the different datasets.
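
The pipeline above can be condensed into the following hedged sketch. The seq_len and dataset_size values, the use of the PyTorch model for text generation, and the simplified transform function are all assumptions of this sketch and may differ from the example's main.py.

    from optimum.intel.openvino import OVModelForCausalLM
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import nncf
    from nncf.data import generate_text_data

    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Generate synthetic calibration texts with the model itself. Using the
    # original PyTorch model for generation is an assumption of this sketch;
    # seq_len and dataset_size are illustrative values.
    hf_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    synthetic_texts = generate_text_data(hf_model, tokenizer, seq_len=128, dataset_size=32)

    # Export the model to the OpenVINO representation for compression.
    ov_model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True)

    # Wrap the texts in an nncf.Dataset. The transform function must map each
    # sample to the model's input dictionary; this version is simplified
    # (real LLM inputs may also require position_ids and similar tensors).
    def transform_fn(text):
        return dict(tokenizer(text, return_tensors="np"))

    calibration_dataset = nncf.Dataset(synthetic_texts, transform_fn)
    # `calibration_dataset` is then passed to nncf.compress_weights(...,
    # scale_estimation=True) as sketched earlier.

For the last step, the similarity between the two compressed models is measured with WhoWhatBench. The snippet below follows WhoWhatBench's documented usage at the time of writing; the Evaluator name and the shape of the score() result are assumptions and may differ between versions.

    import whowhatbench

    # base_model serves as the reference; optimized_model is the model under
    # test. Both names are placeholders for models prepared above.
    evaluator = whowhatbench.Evaluator(base_model=base_model, tokenizer=tokenizer)
    metrics_per_prompt, metrics = evaluator.score(optimized_model)
    print("similarity:", metrics["similarity"][0])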

Install requirements

To use this example:

  • Create a separate Python* environment and activate it: python3 -m venv nncf_env && source nncf_env/bin/activate
  • Install dependencies:
pip install -U pip
pip install -r requirements.txt
pip install ../../../../

Run Example

The example is fully automated. Just run the following command in the prepared Python environment:

python main.py