llama2.rs 🤗

This is a Rust implementation of Llama2 inference on CPU.

The goal is to be as fast as possible.

It has the following features:

Support for 4-bit GPT-Q Quantization
Batched prefill of prompt tokens
SIMD support for fast CPU inference
Memory mapping, loads 70B instantly.
Static size checks for safety
Support for Grouped Query Attention (needed for big Llamas)
Python calling API
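
Grouped Query Attention lets several query heads share a single key/value head, which shrinks the key/value cache for the larger models. A minimal sketch of the head mapping, assuming the common convention that consecutive query heads share one key/value head; the helper name `kv_head_for` is hypothetical and not taken from this repository:

```rust
/// Map a query head index to the key/value head it reads from.
/// Hypothetical helper; assumes `n_heads` is a multiple of `n_kv_heads`.
fn kv_head_for(query_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert!(query_head < n_heads);
    // Number of query heads that share one key/value head.
    let group = n_heads / n_kv_heads;
    query_head / group
}
```

With 64 query heads and 8 key/value heads, query heads 0 through 7 read key/value head 0, heads 8 through 15 read head 1, and so on. When the two counts are equal this reduces to ordinary multi-head attention.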

It runs at up to 1 tok/s on 70B Llama2 and 9 tok/s on 7B Llama2 (on my Intel i9 desktop).

To build, you'll need the nightly toolchain, which is used by default:

> rustup toolchain install nightly # to get nightly
> ulimit -s 10000000 # Increase your stack memory limit.

You can load models from the Hugging Face hub. For example, this creates a version of a 70B quantized model with 4-bit quantization and a group size of 64:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True
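
To make "4 bit quant and 64 sized groups" concrete: each weight is stored as a 4-bit integer, and every run of 64 weights shares one scale and one zero point. The sketch below is illustrative only. It assumes eight 4-bit values packed per `u32` with the lowest bits first and the formula `w = scale * (q - zero)`; the layout that export.py actually writes may differ, and all names here are hypothetical:

```rust
/// Unpack eight 4-bit values from one u32, lowest nibble first (assumed layout).
fn unpack_nibbles(packed: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((packed >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Dequantize packed weights; every `group_size` weights share a scale and a zero point.
fn dequantize(packed: &[u32], scales: &[f32], zeros: &[u8], group_size: usize) -> Vec<f32> {
    let mut weights = Vec::with_capacity(packed.len() * 8);
    for (word_idx, &word) in packed.iter().enumerate() {
        for (j, &q) in unpack_nibbles(word).iter().enumerate() {
            // Index of the group this weight belongs to.
            let group = (word_idx * 8 + j) / group_size;
            weights.push(scales[group] * (q as f32 - zeros[group] as f32));
        }
    }
    weights
}
```

A smaller group size means more scales and zero points to store, but each group's range fits its weights more tightly.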

The library needs to be recompiled to match the model. You can do this with cargo.

To run:

> cargo run --release --features 70B,group_64,quantized -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"
The only thing that I can think of is that the
achieved tok/s: 0.89155835

Honestly, not so bad for running on my GPU machine, significantly faster than llama.c.

Here's a run of 13B quantized:

> cargo run --release --features 13B,group_128,quantized -- -c l13orca.act.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I hope you are having a great day. I am here
achieved tok/s: 5.1588936

Here's a run of 7B quantized:

> cargo run --release --features 7B,group_128,quantized -- -c l7.ack.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I am a newbie here and I am looking for some
achieved tok/s: 9.048136

Python

To run in Python, you first need to compile from the main directory with the python feature enabled.

> cargo build --release --features 7B,group_128,quantized,python
> pip install .

You can then run the following code.

import llama2_rs

def test_llama2_13b_4_128act_can_generate():
    model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
    tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
    random = llama2_rs.Random()
    response = llama2_rs.generate(
        model,
        tokenizer,
        "Tell me zero-cost abstractions in Rust ",
        50,
        random, 
        0.0
    )

Todos

Support fast GPU processing with Triton
Support https://github.com/oobabooga/text-generation-webui
Documentation
Blog Post about the methods for fast gptq
Remove dependency on AutoGPTQ for preloading
Support for safetensors directly.

Configuration

To make the model as fast as possible, the model settings are fixed at compile time, so you need to compile a new version to adapt to other Llama versions. The settings currently live in .cargo/config. The model will fail to load if they disagree with the binary model file being loaded. To turn quantization off, set quant="no".
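
One way to get this kind of compile-time specialization in Rust is to carry the model dimensions as const generics and compare them with the header of the file being loaded. The sketch below illustrates that idea under stated assumptions; `FileHeader`, `Config`, `check`, and `dot` are hypothetical names, and the header layout is invented for the example rather than taken from this repository:

```rust
/// Dimensions read from a model file at run time (hypothetical header layout).
#[derive(Debug, PartialEq)]
struct FileHeader {
    dim: usize,
    n_layers: usize,
    group_size: usize,
}

/// Dimensions baked into the binary at compile time.
struct Config<const DIM: usize, const N_LAYERS: usize, const GROUP: usize>;

impl<const DIM: usize, const N_LAYERS: usize, const GROUP: usize> Config<DIM, N_LAYERS, GROUP> {
    /// Reject a model file whose shape disagrees with the compiled-in shape.
    fn check(header: &FileHeader) -> Result<(), String> {
        let expected = FileHeader { dim: DIM, n_layers: N_LAYERS, group_size: GROUP };
        if *header == expected {
            Ok(())
        } else {
            Err(format!("model mismatch: compiled for {:?}, file has {:?}", expected, header))
        }
    }
}

/// Example instantiation: a 7B-sized model with quantization groups of 128.
type Llama7B = Config<4096, 32, 128>;

/// Fixed-size arrays let the compiler check shapes: mixing lengths does not compile.
fn dot<const N: usize>(a: &[f32; N], b: &[f32; N]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

Calling `Llama7B::check` on a header with different dimensions returns an error instead of reading weights with the wrong shape, and calling `dot` with arrays of different lengths is rejected by the compiler.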

How does it work?

It started as a port of the original code, with extra type information to make it easier to extend.

There are some dependencies:

memmap2 for memory mapping
rayon for parallel computation
clap for command-line args
PyO3 for Python calling
portable_simd for SIMD support
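
Most of the time in CPU inference goes to matrix-vector products, which parallelize naturally over output rows. The sketch below shows that row-splitting pattern. It uses `std::thread::scope` from the standard library as a stand-in for rayon so it runs without dependencies; it illustrates the idea and is not this repository's code, and `matvec_parallel` is a hypothetical name:

```rust
use std::thread;

/// Multiply a row-major `rows x cols` matrix by a vector, splitting rows across threads.
fn matvec_parallel(matrix: &[f32], x: &[f32], rows: usize, n_threads: usize) -> Vec<f32> {
    let cols = x.len();
    assert!(n_threads > 0);
    assert_eq!(matrix.len(), rows * cols);
    let mut out = vec![0.0f32; rows];
    // Rows per worker, rounded up so every row is covered.
    let chunk = ((rows + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for (i, out_chunk) in out.chunks_mut(chunk).enumerate() {
            // Each worker owns a disjoint slice of the output, so no locking is needed.
            s.spawn(move || {
                for (j, o) in out_chunk.iter_mut().enumerate() {
                    let r = i * chunk + j;
                    let row = &matrix[r * cols..(r + 1) * cols];
                    *o = row.iter().zip(x).map(|(a, b)| a * b).sum();
                }
            });
        }
    });
    out
}
```

With rayon, the same split is typically written with a parallel iterator over output chunks, which also handles work stealing across cores.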

Authors

Llama2.rs is written by @srush and @rachtsingh.
