A lightweight CUDA inference engine for Qwen3 models, designed for consumer multi-GPU setups (for example, dual RTX 3080 Ti).
As of March 12, 2026, Ember is focused on a narrow goal: high-performance Qwen inference on consumer NVIDIA GPUs.
Current:
- Stable Qwen3 dense CUDA inference
- Native safetensors loading
- Multi-GPU pipeline-parallel runtime
- Minimal CLI and server path
Next:
- Qwen3.5 hybrid architecture support
- DeltaNet + Gated Attention runtime
- HF-aligned correctness and regression harness
Then:
- Qwen3.5 35B-A3B MoE inference
- Dual-GPU + CPU offload execution path
- Reproducible benchmarks and OpenAI-compatible serving
- Build (example: RTX 3080 Ti, SM=86):

```bash
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
```

By default, Ember builds only the core inference target (`ember`).
To also build optional tests/benchmarks:

```bash
cmake -S . -B build \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DEMBER_BUILD_TESTS=ON \
  -DEMBER_BUILD_BENCHMARKS=ON
```

- Download a model (safetensors format; first run may require `huggingface-cli login`):

```bash
# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./qwen3-0.6b
```

- Run:

```bash
./build/ember -m ./qwen3-0.6b -p "Hello, my name is"
```

English (default):
- Contributing
- Development Guide
- Testing and Regression
- Architecture Overview
- Sampler Deep Dive
- Benchmark Handbook
Chinese translations of the guides above are also available.
- Native CUDA implementation: full control over compute flow, without ggml/llama.cpp.
- Direct safetensors loading: loads HuggingFace format natively, no conversion step.
- Pipeline parallelism: multi-GPU layer split with memory-aware allocation.
- FP16 compute: FP16 weights/activations, GEMM via cuBLAS.
- Custom kernels: RMSNorm, RoPE, Softmax, SiLU, Attention, and more.
```
ember/
|-- apps/ember_cli/   # CLI entry point (main.cpp)
|-- cli/              # CLI argument parsing
|-- core/             # Core abstractions (pure C++, no CUDA dependency)
|-- runtime/          # Scheduling and device mapping runtime logic
|-- formats/          # safetensors/config loaders
|-- backends/cuda/    # CUDA runtime + kernels
|-- benchmarks/       # Performance and interconnect benchmarks
|-- tests/            # Unit tests and smoke tests
|-- scripts/          # CI and alignment scripts
`-- docs/             # Current design and testing docs
```
- CMake 3.18+
- CUDA Toolkit 11.0+ (12.x recommended)
- C++17 compiler
```bash
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release  # RTX 3080 Ti
cmake --build build --parallel
```

Common CUDA architecture values:

| Value | GPU |
|---|---|
| 86 | RTX 3080 / 3090 |
| 89 | RTX 4090 |
| 80 | A100 |
```bash
# Single-GPU inference
./build/ember -m /path/to/qwen3-4b -p "Hello, my name is"

# Dual-GPU inference
./build/ember -m /path/to/qwen3-14b --devices 0,1 -p "Explain quantum computing"

# Interactive mode
./build/ember -m /path/to/qwen3-4b -i
```

| Argument | Description | Default |
|---|---|---|
| `-m, --model` | Model directory (must contain safetensors and `config.json`) | Required |
| `-p, --prompt` | Input prompt | `"Hello, my name is"` |
| `--devices` | GPU device list | `0` |
| `-c, --ctx-size` | Context length | `2048` |
| `-n, --n-predict` | Number of generated tokens | `128` |
| `--temp` | Temperature | `0.7` |
| `--top-p` | Top-P | `0.9` |
| `--top-k` | Top-K | `40` |
| `-i, --interactive` | Interactive mode | `false` |
| `-v, --verbose` | Verbose output | `false` |
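The sampling flags compose in the usual chain: logits are scaled by temperature, candidates are truncated to the top-k highest-probability tokens, then further truncated once cumulative probability reaches top-p, and the next token is drawn from the renormalized remainder. A minimal CPU sketch of that pipeline (illustrative, not Ember's actual sampler):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Temperature + top-k + top-p (nucleus) sampling over raw logits.
int sample(std::vector<float> logits, float temp, int top_k, float top_p,
           std::mt19937& rng) {
    // 1. Temperature scaling: lower temp sharpens the distribution.
    for (float& l : logits) l /= temp;
    // 2. Numerically stable softmax.
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < p.size(); ++i) { p[i] = std::exp(logits[i] - mx); sum += p[i]; }
    for (float& v : p) v /= sum;
    // 3. Token ids sorted by probability, descending.
    std::vector<int> ids(p.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(), [&](int a, int b) { return p[a] > p[b]; });
    // 4. Keep at most top_k tokens, cutting the tail once mass reaches top_p.
    size_t keep = std::min<size_t>(static_cast<size_t>(top_k), ids.size());
    float cum = 0.0f;
    size_t n = 0;
    for (; n < keep; ++n) { cum += p[ids[n]]; if (cum >= top_p) { ++n; break; } }
    // 5. Draw from the renormalized truncated distribution.
    std::uniform_real_distribution<float> dist(0.0f, cum);
    float r = dist(rng), acc = 0.0f;
    for (size_t i = 0; i < n; ++i) { acc += p[ids[i]]; if (r <= acc) return ids[i]; }
    return ids[n - 1];
}
```

With a very low `--temp`, the distribution collapses onto the argmax token, which is a handy way to get near-deterministic output for correctness checks.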
- Qwen3-0.6B
- Qwen3-1.7B
- Qwen3-4B
- Qwen3-8B
- Qwen3-14B (typically needs dual GPUs)
Models should be downloaded from HuggingFace in safetensors format:
```bash
# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-4B --local-dir ./qwen3-4b
```

- Tensor: lightweight view (shape + dtype + data ptr), no ownership.
- Session: inference session state including KV cache.
- IRuntime: backend interface (CUDA now, extensible in future).
- DeviceMap: layer-to-device mapping for pipeline parallelism.
```
Input IDs
     |
     v
Embedding Lookup (GPU 0)
     |
     v
+-----------------------------+
| Layer 0-N (may span GPUs)   |
|  |- Input LayerNorm         |
|  |- QKV Projection          |
|  |- RoPE                    |
|  |- KV Cache Update         |
|  |- Attention (Q@K^T -> V)  |
|  |- O Projection            |
|  |- Residual Add            |
|  |- Post-Attn LayerNorm     |
|  |- MLP (SwiGLU)            |
|  `- Residual Add            |
+-----------------------------+
     |
     v
Final LayerNorm
     |
     v
LM Head -> Logits
     |
     v
Sampling -> Next Token
```
Ember uses layer-wise pipeline parallelism. Example split:
```
GPU 0: Embedding + Layers 0-13
GPU 1: Layers 14-27 + LM Head
```

Hidden states are transferred between stages with `cudaMemcpyPeer`.
Actual throughput depends on hardware and model size.
| Model | Approx. VRAM | Expected Speed |
|---|---|---|
| Qwen3-4B (FP16) | ~8 GB | ~40 tok/s |
| Qwen3-8B (FP16) | ~16 GB | ~25 tok/s |
| Qwen3-14B (FP16) | ~28 GB | ~15 tok/s (dual GPU) |
- M0: Single-GPU inference baseline
- M1: Dual-GPU pipeline parallelism
- M2: Quantization support (INT8/INT4)
- M3: FlashAttention optimization
If you use Ember in a paper or report, cite the Zenodo archive:
Apache-2.0
- llama.cpp - architecture inspiration
- HuggingFace Transformers - model format ecosystem
- Qwen - model weights