ZML is a production inference stack, purpose-built to decouple AI workloads from proprietary hardware.
Any model, any hardware, one codebase, peak performance.
Models compile directly for NVIDIA, AMD, TPU, and Trainium targets, so the same code reaches peak performance on any of these accelerators. No rewriting.
It is built using the Zig language, MLIR, and Bazel.
We use Bazel to build ZML and its dependencies. The only prerequisite is
Bazel itself, which we recommend installing through bazelisk, a version manager
for Bazel.
Install bazelisk (recommended). On macOS:
brew install bazelisk
On Linux:
curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.28.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel
We have implemented a variety of example models in ZML. See our reference implementations in the examples folder.
MNIST is the classic handwritten digit recognition task. The model classifies
a handwritten digit that has been converted to a 28x28-pixel monochrome image.
Bazel will download a pre-trained model and the test dataset. The program will
load the model, compile it, and classify a randomly picked example from the
test dataset.
On the command line:
bazel run //examples/mnist
The Llama models are gated: access requires approval from Meta on Hugging Face, which can take a few hours to be granted.
Ensure you are authenticated with the Hugging Face CLI:
hf auth login
Alternatively, set the HF_TOKEN environment variable.
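For non-interactive environments, exporting the token directly can be more convenient; a minimal sketch, where hf_your_token_here is a placeholder for your own access token from the Hugging Face settings page:

```shell
# Placeholder token shown for illustration; substitute your own
# Hugging Face access token (they start with "hf_").
export HF_TOKEN=hf_your_token_here
```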
Now, you can run the model like so:
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="What is the capital of France?"
For a larger Llama 3.2 model, you can also try Llama-3.2-3B-Instruct. Like the 1B model above, it requires approval from Meta on Hugging Face.
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.1-8B-Instruct --prompt="What is the capital of France?"
You can also try Llama-3.1-70B-Instruct if you have enough memory.
You can compile models for accelerator runtimes by appending one or more of the following arguments when compiling or running a model:
- NVIDIA CUDA: --@zml//platforms:cuda=true
- AMD ROCm: --@zml//platforms:rocm=true
- Google TPU: --@zml//platforms:tpu=true
- AWS Trainium/Inferentia 2: --@zml//platforms:neuron=true
- Disable CPU: --@zml//platforms:cpu=false
The latter, disabling compilation for CPU, cuts down compilation time.
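If you always target the same accelerator, you can avoid retyping these flags by putting them in a .bazelrc file in your workspace; a sketch, assuming an NVIDIA-only setup:

```
# .bazelrc: always build for CUDA and skip the CPU target.
build --@zml//platforms:cuda=true
build --@zml//platforms:cpu=false
```

Bazel reads .bazelrc automatically on every invocation, so subsequent bazel run commands pick these flags up without further arguments.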
For example, to run the Llama 3.1 8B model from above on a host with an NVIDIA GPU:
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
And on a host with an AMD GPU:
bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.1-8B-Instruct
Same goes for all supported platforms.
To run ZML's unit tests:
bazel test //zml:test
For a taste of the API, here is the MNIST model implementation from the examples folder:

const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn init(store: zml.io.TensorStore.View) Layer {
            return .{
                .weight = store.createTensor("weight", .{ .d_out, .d }, null),
                .bias = store.createTensor("bias", .{.d_out}, null),
            };
        }

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.dot(input, .d).add(self.bias).relu().withTags(.{.d});
        }
    };

    pub fn init(store: zml.io.TensorStore.View) Mnist {
        return .{
            .fc1 = .init(store.withPrefix("fc1")),
            .fc2 = .init(store.withPrefix("fc2")),
        };
    }

    pub fn load(
        self: *const Mnist,
        allocator: std.mem.Allocator,
        io: std.Io,
        platform: *const zml.Platform,
        store: *const zml.io.TensorStore,
        shardings: []const zml.sharding.Sharding,
    ) !zml.Bufferized(Mnist) {
        return zml.io.load(Mnist, self, allocator, io, platform, store, .{
            .shardings = shardings,
            .parallelism = 1,
            .dma_chunks = 1,
            .dma_chunk_size = 16 * 1024 * 1024,
        });
    }

    pub fn unloadBuffers(self: *zml.Bufferized(Mnist)) void {
        self.fc1.weight.deinit();
        self.fc1.bias.deinit();
        self.fc2.weight.deinit();
        self.fc2.bias.deinit();
    }

    /// Just two linear layers with a ReLU activation.
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        var x = input.flatten().convert(.f32).withTags(.{.d});
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = layer.forward(x);
        }
        return x.argMax(0).indices.convert(.u8);
    }
};

You might want to check out more examples, or read through the documentation directly on GitHub.
ZML is licensed under the Apache 2.0 license.