diff --git a/README.md b/README.md
index ce4a97c..4316b7e 100644
--- a/README.md
+++ b/README.md
@@ -19,3 +19,4 @@ Fri Nov 15 13:46:13 UTC 2024
 Mon Dec 2 21:30:54 UTC 2024
 Fri Dec 6 14:34:48 UTC 2024
 Mon Dec 9 18:51:57 UTC 2024
+Tue Jan 28 10:21:33 UTC 2025
diff --git a/howtos/howto_torch2zml/index.html b/howtos/howto_torch2zml/index.html
index 1a2cb06..a6f4370 100644
--- a/howtos/howto_torch2zml/index.html
+++ b/howtos/howto_torch2zml/index.html
@@ -284,7 +284,7 @@

Loading an individual layer

In the

 var ctx = try zml.Context.init();
 defer ctx.deinit();
-const platform = ctx.autoPlatform();
+const platform = ctx.autoPlatform(.{});
 const mlp_weights = try zml.aio.loadModelBuffers(Mlp, mlp_shape, model_weights, allocator, platform);

 zml.testing.testLayer(platform, activations, "model.layers.0.mlp", mlp_shape, mlp_weights, 1e-3);

diff --git a/sources.tar b/sources.tar
index 4f48cf6..b54ed33 100755
Binary files a/sources.tar and b/sources.tar differ
diff --git a/tutorials/getting_started/index.html b/tutorials/getting_started/index.html
index 8530a97..8aefed3 100644
--- a/tutorials/getting_started/index.html
+++ b/tutorials/getting_started/index.html
@@ -157,26 +157,23 @@

Getting Started with ZML

In this tutorial, we will install ZML and run a few models locally.

Prerequisites

First, let's check out the ZML codebase. In a terminal, run:

git clone https://github.com/zml/zml.git
 cd zml/
 

We use Bazel to build ZML and its dependencies. We recommend installing it via bazelisk, a version manager for Bazel.

Install Bazel:

macOS:

    brew install bazelisk
-

Linux:

    curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64'
+

Linux:

    curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64'
     chmod +x /usr/local/bin/bazel
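
To verify the install (optional; this simply prints the Bazel version that bazelisk resolved):

    bazel --version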
 

Run a pre-packaged model

ZML comes with a variety of model examples. See also our reference implementations in the examples folder.

MNIST

The classic handwritten-digit recognition task. The model is tasked with recognizing a handwritten digit from a 28x28-pixel monochrome image. Bazel will download a pre-trained model and the test dataset. The program will load the model, compile it, and classify a randomly picked example from the test dataset.

On the command line:

cd examples
 bazel run -c opt //mnist
-

Llama

Llama is a family of "Large Language Models" trained to generate text based on the beginning of a sentence/book/article. This "beginning" is generally referred to as the "prompt".

TinyLlama, Stories 15M

To start, you can use a small model trained specifically on children's stories. This model was trained by Andrej Karpathy; you can read more about it on his GitHub.

cd examples
-bazel run -c opt //llama:TinyLlama-Stories-15M
-bazel run -c opt //llama:TinyLlama-Stories-15M -- --prompt="Once upon a time, there was a cute little dragon"
-

OpenLLama 3B

cd examples
-bazel run -c opt //llama:OpenLLaMA-3B
-bazel run -c opt //llama:OpenLLaMA-3B -- --prompt="Once upon a time,"
-

Meta Llama 3 8B

This model has restrictions; see here: it requires approval from Meta on Huggingface, which can take a few hours to be granted.

While waiting for approval, you can already generate your Huggingface access token.

Once you've been granted access, you're ready to download a gated model like Meta-Llama-3-8b!

# requires token in $HOME/.cache/huggingface/token, as created by the
+

Llama

Llama is a family of "Large Language Models" trained to generate text based on the beginning of a sentence/book/article. This "beginning" is generally referred to as the "prompt".

Meta Llama 3.1 8B

This model has restrictions; see here. It requires approval from Meta on Huggingface, which can take a few hours to be granted.

While waiting for approval, you can already generate your Huggingface access token.
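
You can provide the token in either of the two ways referenced in the snippet below, for example:

    # interactive login; stores the token in $HOME/.cache/huggingface/token
    huggingface-cli login
    # or, non-interactively, via the environment:
    export HUGGINGFACE_TOKEN=<your token>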

Once you've been granted access, you're ready to download a gated model like Meta-Llama-3.1-8B-Instruct!

# requires token in $HOME/.cache/huggingface/token, as created by the
 # `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
 cd examples
-bazel run -c opt //llama:Meta-Llama-3-8b
-bazel run -c opt //llama:Meta-Llama-3-8b -- --promt="Once upon a time,"
-

Run Tests

bazel test //zml:test
+bazel run -c opt //llama:Llama-3.1-8B-Instruct
+bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="What is the capital of France?"
+

You can also try Llama-3.1-70B-Instruct if you have enough memory.
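
Following the same target naming pattern (assuming the 70B target exists alongside the 8B one):

    bazel run -c opt //llama:Llama-3.1-70B-Instruct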

Meta Llama 3.2 1B

Like the 8B model above, this model also requires approval. See here for access requirements.

cd examples
+bazel run -c opt //llama:Llama-3.2-1B-Instruct
+bazel run -c opt //llama:Llama-3.2-1B-Instruct -- --prompt="What is the capital of France?"
+

For a larger 3.2 model, you can also try Llama-3.2-3B-Instruct.
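
Again following the same pattern (the target name is assumed from the model name):

    bazel run -c opt //llama:Llama-3.2-3B-Instruct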

Run Tests

bazel test //zml:test
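
Standard Bazel flags apply here as well; for example, to show the output of failing tests directly:

    bazel test //zml:test --test_output=errors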
 

Running Models on GPU / TPU

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

    --@zml//runtimes:cuda=true
    --@zml//runtimes:rocm=true
    --@zml//runtimes:tpu=true
    --@zml//runtimes:cpu=false

The last one, which avoids compilation for CPU, cuts down compilation time.

So, to run the Llama 3.2 model from above on your host sporting an NVIDIA GPU, run the following:

cd examples
-bazel run -c opt //llama:OpenLLaMA-3B             \
-          --@zml//runtimes:cuda=true              \
-          -- --prompt="Once upon a time,"
+bazel run -c opt //llama:Llama-3.2-1B-Instruct            \
+          --@zml//runtimes:cuda=true                      \
+          -- --prompt="What is the capital of France?"
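
To additionally skip the CPU compilation mentioned above, you can combine runtime flags; a sketch using the cpu flag from the list above:

    bazel run -c opt //llama:Llama-3.2-1B-Instruct \
              --@zml//runtimes:cuda=true           \
              --@zml//runtimes:cpu=false           \
              -- --prompt="What is the capital of France?"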
 

Where to go next:

In Deploying Models on a Server, we show how you can cross-compile and package for a specific architecture, then deploy and run your model. Alternatively, you can also dockerize your model.

You might also want to check out the examples, read through the documentation, start writing your first model, or read about more high-level ZML concepts.

diff --git a/tutorials/write_first_model/index.html b/tutorials/write_first_model/index.html
index 6bd3eab..fe1c28e 100644
--- a/tutorials/write_first_model/index.html
+++ b/tutorials/write_first_model/index.html
@@ -183,7 +183,7 @@

You see, in ZML, AI models are just structs with a forward function!
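
As a minimal sketch (the field names here are illustrative, chosen to match the weights-and-bias example used later in this tutorial):

    const zml = @import("zml");

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        // Element-wise weight * x + bias.
        pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
            return self.weight.mul(x).add(self.bias);
        }
    };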

There are more things to observe:

Adding a main() function

ZML code is async. Hence, we need to provide an async main function. It works like this:

pub fn main() !void {
     var gpa = std.heap.GeneralPurposeAllocator(.{}){};
     defer _ = gpa.deinit();
-    try asynk.AsyncThread.main(gpa.allocator(), asyncMain, .{});
+    try asynk.AsyncThread.main(gpa.allocator(), asyncMain);
 }
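
asyncMain itself is then an ordinary error-returning function; its signature, as it appears later in this tutorial:

    pub fn asyncMain() !void {
        // ... the rest of the program runs here, on the async thread ...
    }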
 
 
@@ -205,7 +205,7 @@ 

 var context = try zml.Context.init();
 defer context.deinit();
-const platform = context.autoPlatform();
+const platform = context.autoPlatform(.{});
 ...
 }
@@ -221,7 +221,7 @@

 try buffers.put(arena, "bias", zml.HostBuffer.fromArray(&bias));

 // the actual BufferStore
-var bs: zml.aio.BufferStore = .{
+const bs: zml.aio.BufferStore = .{
     .arena = arena_state,
     .buffers = buffers,
 };

@@ -234,7 +234,7 @@

 // The shape of the input tensor, we have to pass in manually.
 var compilation = try asyncc(
     zml.compileModel,
-    .{ allocator, model_shapes, .forward, .{input_shape}, platform },
+    .{ allocator, Layer.forward, model_shapes, .{input_shape}, platform },
 );

 // Produce a bufferized weights struct from the fake BufferStore.

@@ -330,7 +330,7 @@

Running it

With everything in place now, running the

 pub fn main() !void {
     var gpa = std.heap.GeneralPurposeAllocator(.{}){};
     defer _ = gpa.deinit();
-    try asynk.AsyncThread.main(gpa.allocator(), asyncMain, .{});
+    try asynk.AsyncThread.main(gpa.allocator(), asyncMain);
 }

 pub fn asyncMain() !void {

@@ -348,7 +348,7 @@

Running it

With everything in place now, running the

 var context = try zml.Context.init();
 defer context.deinit();
-const platform = context.autoPlatform();
+const platform = context.autoPlatform(.{});

 // Our weights and bias to use
 var weights = [3]f16{ 2.0, 2.0, 2.0 };

@@ -373,10 +373,7 @@

Running it

With everything in place now, running the

 // Start compiling. This uses the inferred shapes from the BufferStore.
 // The shape of the input tensor, we have to pass in manually.
-var compilation = try asyncc(
-    zml.compileModel,
-    .{ allocator, model_shapes, .forward, .{input_shape}, platform },
-);
+var compilation = try asyncc(zml.compileModel, .{ allocator, Layer.forward, model_shapes, .{input_shape}, platform });

 // Produce a bufferized weights struct from the fake BufferStore.
 // This is like the inferred shapes, but with actual values.
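
Once it compiles, you run it like the earlier examples; the target name below is hypothetical and depends on your own BUILD file:

    bazel run -c opt //my_first_model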