A calculator that estimates the memory footprint, capacity, and latency of your planned LLM application across different GPU architectures.
Prerequisite: `pip install -r requirements.txt`
The script accepts the following flags (and their abbreviations):
- `num_gpu` (`-g`): Number of GPUs you plan to use for your deployment.
- `prompt_sz` (`-p`): Average size, in tokens, of the input prompts you expect to process.
- `response_sz` (`-r`): Average size, in tokens, of the responses you expect to generate.
- `n_concurrent_req` (`-c`): Number of concurrent requests you anticipate handling.
By adjusting these values you can estimate the performance characteristics of your LLM deployment and make informed decisions about your infrastructure requirements; a sketch of the underlying arithmetic and an example run follow below.
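Under the hood, the memory estimate combines the model weights with the KV cache required for the prompt size, response size, and concurrency you pass in. The snippet below is a minimal sketch of that arithmetic, assuming Llama-3.1-70B served in FP16; the layer count, hidden size, and the simple KV-cache formula (which ignores grouped-query attention) are illustrative assumptions, not code taken from the script.

```python
# Rough sketch of the memory-footprint math -- an illustration, not the script's code.
# Assumed model constants for Llama-3.1-70B in FP16.
N_LAYERS = 80          # transformer layers (assumption)
HIDDEN_SIZE = 8192     # model width (assumption)
N_PARAMS = 70e9        # parameter count
BYTES_PER_PARAM = 2    # FP16

def estimate_memory(prompt_sz, response_sz, n_concurrent_req):
    # KV cache per token: keys + values for every layer (grouped-query attention ignored).
    kv_gib_per_token = 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_PARAM / 1024**3
    weights_gib = N_PARAMS * BYTES_PER_PARAM / 1024**3
    kv_gib = kv_gib_per_token * (prompt_sz + response_sz) * n_concurrent_req
    return kv_gib_per_token, weights_gib + kv_gib

kv_per_token, total = estimate_memory(prompt_sz=4096, response_sz=256, n_concurrent_req=10)
print(f"{kv_per_token:.6f} GiB/token, ~{total:.0f} GiB total")
# -> 0.002441 GiB/token and ~237 GiB total, in the same ballpark as the
#    246.25 GB reported in the example run below.
```

An example run with the same flag values: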
```
$ python LLM_size_pef_calculator.py -g 4 -p 4096 -r 256 -c 10
num_gpu = 4, prompt_size = 4096 tokens, response_size = 256 tokens
n_concurrent_request = 10
******************** Estimate LLM Memory Footprint ********************
| Model         | KV Cache Size per Token  | Memory Footprint |
|---------------+--------------------------+------------------|
| Llama-3.1-70B | 0.002441 GiB/token       | 246.25 GB        |
******************** Estimate LLM Capacity and Latency ********************
| Model         | GPU       | Max # KV Cache Tokens | Prefill Time | TPOT (ms) | TTFT    | E2E Latency | Output Tokens Throughput |
|---------------+-----------+-----------------------+--------------+-----------+---------+-------------+--------------------------|
| Llama-3.1-70B | H100 PCIe | 73728                 | 0.046 ms     | 17.500 ms | 0.064 s | 4.7 s       | 54.82 tokens/sec         |
| Llama-3.1-70B | H100 SXM  | 73728                 | 0.035 ms     | 10.448 ms | 0.046 s | 2.8 s       | 90.80 tokens/sec         |
| Llama-3.1-70B | H100 NVL  | 96665                 | 0.042 ms     | 8.974 ms  | 0.051 s | 2.5 s       | 103.68 tokens/sec        |
| Llama-3.1-70B | H200 SXM  | 173670                | 0.035 ms     | 7.292 ms  | 0.043 s | 2.0 s       | 127.27 tokens/sec        |
| Llama-3.1-70B | H200 NVL  | 173670                | 0.042 ms     | 7.292 ms  | 0.049 s | 2.0 s       | 125.60 tokens/sec        |
```
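The capacity and latency table follows a simple memory-bandwidth model of decoding: each generated token streams the full set of FP16 weights from GPU memory, so time per output token (TPOT) is roughly model size divided by aggregate memory bandwidth. The sketch below illustrates that relationship; the per-GPU bandwidth figures are published specs, the formulas are assumptions rather than the script's actual code, and TTFT is taken from the table instead of being re-derived.

```python
# Rough sketch of the decode-latency math -- an illustration, not the script's code.
# Published HBM bandwidth per GPU, in TB/s (treat as assumptions).
GPU_MEM_BW_TBPS = {
    "H100 PCIe": 2.0,
    "H100 SXM": 3.35,
    "H100 NVL": 3.9,
    "H200 SXM": 4.8,
    "H200 NVL": 4.8,
}
N_PARAMS, BYTES_PER_PARAM = 70e9, 2  # Llama-3.1-70B in FP16

def estimate_decode(gpu, num_gpu, response_sz, ttft_s):
    # Decode is memory-bound: every generated token re-reads the weights from HBM.
    tpot_s = N_PARAMS * BYTES_PER_PARAM / (GPU_MEM_BW_TBPS[gpu] * 1e12 * num_gpu)
    e2e_s = ttft_s + tpot_s * response_sz        # prefill (TTFT) + decode
    throughput = response_sz / e2e_s             # output tokens per second
    return tpot_s * 1e3, e2e_s, throughput

tpot_ms, e2e_s, tput = estimate_decode("H100 SXM", num_gpu=4, response_sz=256, ttft_s=0.046)
print(f"TPOT ~{tpot_ms:.3f} ms, E2E ~{e2e_s:.1f} s, ~{tput:.0f} tokens/sec")
# -> TPOT ~10.448 ms, E2E ~2.7 s, ~94 tokens/sec -- close to the H100 SXM row above.
```

Prefill is compute-bound rather than bandwidth-bound, which is why TTFT is treated as an input here instead of being estimated from the same bandwidth model.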