Update GitHub Pages
github-actions committed Dec 9, 2024
1 parent 07e511d commit 655b089
Showing 3 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -18,3 +18,4 @@ Wed Sep 18 11:13:08 UTC 2024
Fri Nov 15 13:46:13 UTC 2024
Mon Dec 2 21:30:54 UTC 2024
Fri Dec 6 14:34:48 UTC 2024
Mon Dec 9 18:51:57 UTC 2024
2 changes: 1 addition & 1 deletion learn/concepts/index.html
@@ -154,7 +154,7 @@ <h3 class="centered"></h3>
});
</script>
<!-- <h1 :text="$page.title"></h1> -->
<div id="docs"><h1 id="zml-concepts">ZML Concepts</h1><h2 id="model-lifecycle">Model lifecycle</h2><p>ZML is an inference stack for running Machine Learning (ML) models, particularly Neural Networks (NN).</p><p>The lifecycle of a model is implemented in the following steps:</p><ol><li><p>Open the model file and read the shapes of the weights, but leave the weights on disk.</p></li><li><p>Using the loaded shapes and optional metadata, instantiate a model struct with <code>Tensor</code>s, representing the shape and layout of each layer of the NN.</p></li><li><p>Compile the model struct and its <code>forward</code> function into an accelerator-specific executable. The <code>forward</code> function describes the mathematical operations corresponding to the model inference.</p></li><li><p>Load the model weights from disk onto accelerator memory.</p></li><li><p>Bind the model weights to the executable.</p></li><li><p>Load some user inputs and copy them to the accelerator.</p></li><li><p>Call the executable on the user inputs.</p></li><li><p>Fetch the returned model output from the accelerator into host memory, and finally present it to the user.</p></li><li><p>When all user inputs have been processed, free the executable resources and the associated weights.</p></li></ol><p><strong>Some details:</strong></p><p>Note that the compilation and weight-loading steps are both bottlenecks for your model startup time, but they can be done in parallel.
<strong>ZML provides asynchronous primitives</strong> to make that easy.</p><p>The <strong>compilation can be cached</strong> across runs, and if you're always using the same model architecture with the same shapes, it's possible to bypass it entirely.</p><p>The accelerator is typically a GPU, but it can be another chip, or even the CPU itself, churning vector instructions.</p><h2 id="tensor-bros">Tensor Bros.</h2><p>In ZML, we leverage Zig's static type system to differentiate between a few concepts, so we not only have a <code>Tensor</code> to work with, like other ML frameworks, but also <code>Buffer</code>, <code>HostBuffer</code>, and <code>Shape</code>.</p><p>Let's explain all that.</p><ul><li><p><code>Shape</code>: <em>describes</em> a multi-dimensional array.</p><ul><li><code>Shape.init(.{16}, .f32)</code> represents a vector of 16 floats with 32-bit precision.</li><li><code>Shape.init(.{512, 1024}, .f16)</code> represents a matrix of <code>512*1024</code> floats with 16-bit precision, i.e. a <code>[512][1024]f16</code> array.</li></ul><p>A <code>Shape</code> is only <strong>metadata</strong>; it doesn't point to or own any memory.
The <code>Shape</code> struct can also represent a regular number, aka a scalar: <code>Shape.init(.{}, .i32)</code> represents a 32-bit signed integer.</p></li><li><p><code>HostBuffer</code>: <em>is</em> a multi-dimensional array whose memory is allocated <strong>on the CPU</strong>.</p><ul><li>points to the slice of memory containing the array</li><li>typically owns the underlying memory, but has a flag to remember when it doesn't.</li></ul></li><li><p><code>Buffer</code>: <em>is</em> a multi-dimensional array whose memory is allocated <strong>on an accelerator</strong>.</p><ul><li>contains a handle that the ZML runtime can use to convert it into a physical address, but there is no guarantee this address is visible from the CPU.</li><li>can be created by loading weights from disk directly to the device via <code>zml.aio.loadBuffers</code></li><li>can be created by calling <code>HostBuffer.toDevice(accelerator)</code>.</li></ul></li><li><p><code>Tensor</code>: is a mathematical object representing an intermediate result of a computation.</p><ul><li>is basically a <code>Shape</code> with an attached MLIR value representing the mathematical operation that produced this <code>Tensor</code>.</li></ul></li></ul><h2 id="the-model-struct">The model struct</h2><p>The model struct is the Zig code that describes your Neural Network (NN). Let's look at the following model architecture:</p><p><figure><img src="https://raw.githubusercontent.com/zml/zml.github.io/refs/heads/main/docs-assets/perceptron.png">
<figcaption>Multilayer perceptrons</figcaption></figure></p><p>This is how we can describe it in a Zig struct:</p><pre><code class="zig"><span class="type qualifier">const</span> <span class="variable">Model</span> = <span class="keyword">struct</span> <span class="punctuation bracket">{</span>
<span class="field">input_layer</span><span class="punctuation delimiter">:</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">Tensor</span><span class="punctuation delimiter">,</span>
<span class="field">output_layer</span><span class="punctuation delimiter">:</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">Tensor</span><span class="punctuation delimiter">,</span>
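The nine lifecycle steps described above can be sketched end to end in Zig. This is a hypothetical sketch, not ZML's actual API: only `zml.aio.loadBuffers`, `HostBuffer.toDevice`, and the `forward` convention are named in the text; `openModel`, `compileModel`, `bind`, `call`, and the other helpers are illustrative placeholders and will not compile as-is.

```zig
const zml = @import("zml");

// Hypothetical driver following the nine lifecycle steps.
// Placeholder names are marked; do not take them as ZML's real API.
pub fn main() !void {
    // 1. Open the model file; read only the weight shapes, weights stay on disk.
    var store = try openModel("model.safetensors"); // placeholder

    // 2. Instantiate the model struct of `zml.Tensor`s from those shapes.
    const model = try store.instantiate(Model); // placeholder

    // 3. Compile `Model.forward` into an accelerator-specific executable,
    // 4. while, in parallel, loading the weights onto accelerator memory
    //    (the text names `zml.aio.loadBuffers` for the direct disk-to-device path).
    var exe = try compileModel(model); // placeholder
    var weights = try zml.aio.loadBuffers(&store);

    // 5. Bind the weights to the executable.
    var bound = try exe.bind(weights); // placeholder

    // 6. Copy a user input to the accelerator (`HostBuffer.toDevice`),
    // 7. then call the executable on it.
    const input = try readUserInput(); // placeholder HostBuffer
    const output = try bound.call(input.toDevice(accelerator));

    // 8. Fetch the result back into host memory and present it.
    presentToUser(try output.toHost()); // placeholder

    // 9. Free the executable resources and the associated weights.
    bound.deinit();
}
```

Steps 3 and 4 are issued back to back precisely because, as the text notes, compilation and weight loading are independent bottlenecks that ZML's asynchronous primitives let you overlap.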
Binary file modified sources.tar
Binary file not shown.
