
Ember - Qwen3 CUDA Inference Engine

A lightweight CUDA inference engine for Qwen3 models, designed for consumer multi-GPU setups (for example, dual RTX 3080 Ti).

Roadmap

As of March 12, 2026, Ember is focused on a narrow goal: high-performance Qwen inference on consumer NVIDIA GPUs.

Current:

  • Stable Qwen3 dense CUDA inference
  • Native safetensors loading
  • Multi-GPU pipeline-parallel runtime
  • Minimal CLI and server path

Next:

  • Qwen3.5 hybrid architecture support
  • DeltaNet + Gated Attention runtime
  • HF-aligned correctness and regression harness

Then:

  • Qwen3.5 35B-A3B MoE inference
  • Dual-GPU + CPU offload execution path
  • Reproducible benchmarks and OpenAI-compatible serving

Quick Start

  1. Build (example: RTX 3080 Ti, SM=86):
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel

By default, Ember builds only the core inference target (ember). To also build optional tests/benchmarks:

cmake -S . -B build \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DEMBER_BUILD_TESTS=ON \
  -DEMBER_BUILD_BENCHMARKS=ON
  2. Download a model (safetensors format; first run may require huggingface-cli login):
# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./qwen3-0.6b
  3. Run:
./build/ember -m ./qwen3-0.6b -p "Hello, my name is"

Documentation

English (default):

Chinese:

Features

  • Native CUDA implementation: full control over the compute flow, with no dependency on ggml or llama.cpp.
  • Direct safetensors loading: reads the HuggingFace format natively, with no conversion step.
  • Pipeline parallelism: multi-GPU layer split with memory-aware allocation.
  • FP16 compute: FP16 weights/activations, GEMM via cuBLAS.
  • Custom kernels: RMSNorm, RoPE, Softmax, SiLU, Attention, and more.
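Among the custom kernels listed above, RMSNorm is the simplest to illustrate. The CPU reference below is a sketch of the math the CUDA kernel computes; the function name and signature are illustrative, not taken from Ember's headers.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// CPU reference for the per-row normalization the CUDA kernel performs.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    float inv_rms = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * weight[i];
    return y;
}
```

The GPU version parallelizes this per row of the hidden-state matrix, typically with one thread block per row and a warp-level reduction for the sum of squares.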

Project Structure

ember/
|-- apps/ember_cli/         # CLI entry point (main.cpp)
|-- cli/                    # CLI argument parsing
|-- core/                   # Core abstractions (pure C++, no CUDA dependency)
|-- runtime/                # Scheduling and device mapping runtime logic
|-- formats/                # safetensors/config loaders
|-- backends/cuda/          # CUDA runtime + kernels
|-- benchmarks/             # Performance and interconnect benchmarks
|-- tests/                  # Unit tests and smoke tests
|-- scripts/                # CI and alignment scripts
`-- docs/                   # Current design and testing docs

Build

Requirements

  • CMake 3.18+
  • CUDA Toolkit 11.0+ (12.x recommended)
  • C++17 compiler

Compile

cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release  # RTX 3080 Ti
cmake --build build --parallel

Common CUDA architecture values:

  • 86 - RTX 3080/3090
  • 89 - RTX 4090
  • 80 - A100

Usage

Basic Examples

# Single-GPU inference
./build/ember -m /path/to/qwen3-4b -p "Hello, my name is"

# Dual-GPU inference
./build/ember -m /path/to/qwen3-14b --devices 0,1 -p "Explain quantum computing"

# Interactive mode
./build/ember -m /path/to/qwen3-4b -i

CLI Arguments

Argument            Description                                                 Default
-m, --model         Model directory (must contain safetensors and config.json)  Required
-p, --prompt        Input prompt                                                "Hello, my name is"
--devices           GPU device list                                             0
-c, --ctx-size      Context length                                              2048
-n, --n-predict     Number of generated tokens                                  128
--temp              Temperature                                                 0.7
--top-p             Top-P                                                       0.9
--top-k             Top-K                                                       40
-i, --interactive   Interactive mode                                            false
-v, --verbose       Verbose output                                              false
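To make the sampling flags concrete, here is a hedged sketch of how --temp and --top-k style sampling is commonly implemented: scale logits by 1/temperature, keep the k largest, and renormalize with softmax. Names are illustrative, not Ember's internal API.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Returns (token_id, probability) pairs for the k most likely tokens,
// highest probability first. Assumes 0 < k <= logits.size().
std::vector<std::pair<int, float>> top_k_probs(std::vector<float> logits,
                                               float temperature, int k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partially sort indices so the k largest logits come first.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    // Softmax over the surviving logits (max-subtracted for stability).
    float max_logit = logits[idx[0]] / temperature;
    float sum = 0.0f;
    std::vector<std::pair<int, float>> out;
    for (int i : idx) {
        float e = std::exp(logits[i] / temperature - max_logit);
        out.push_back({i, e});
        sum += e;
    }
    for (auto& p : out) p.second /= sum;
    return out;
}
```

A sampler would then draw from this truncated distribution (optionally intersected with the top-p nucleus) rather than taking the argmax.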

Supported Models

  • Qwen3-0.6B
  • Qwen3-1.7B
  • Qwen3-4B
  • Qwen3-8B
  • Qwen3-14B (typically needs dual GPUs)

Models should be downloaded from HuggingFace in safetensors format:

# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-4B --local-dir ./qwen3-4b

Architecture

Core Abstractions

  1. Tensor: lightweight view (shape + dtype + data ptr), no ownership.
  2. Session: inference session state including KV cache.
  3. IRuntime: backend interface (CUDA now, extensible in future).
  4. DeviceMap: layer-to-device mapping for pipeline parallelism.
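The non-owning Tensor abstraction above can be sketched as a plain view struct; field and type names here are illustrative, not copied from Ember's headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DType { kF16, kF32 };

// A lightweight view: shape + dtype + borrowed data pointer, no ownership.
struct TensorView {
    void* data = nullptr;        // borrowed; lifetime managed elsewhere
    std::vector<int64_t> shape;  // e.g. {batch, seq, hidden}
    DType dtype = DType::kF16;

    int64_t numel() const {
        int64_t n = 1;
        for (int64_t d : shape) n *= d;
        return n;
    }
    std::size_t nbytes() const {
        return static_cast<std::size_t>(numel()) *
               (dtype == DType::kF16 ? 2 : 4);
    }
};
```

Because the view owns nothing, the same struct can describe a slice of a memory-mapped safetensors file or a region of GPU memory without copies.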

Compute Flow

Input IDs
    |
    v
Embedding Lookup (GPU 0)
    |
    v
+-----------------------------+
| Layer 0-N (may span GPUs)   |
|   |- Input LayerNorm        |
|   |- QKV Projection         |
|   |- RoPE                   |
|   |- KV Cache Update        |
|   |- Attention (Q@K^T -> V) |
|   |- O Projection           |
|   |- Residual Add           |
|   |- Post-Attn LayerNorm    |
|   |- MLP (SwiGLU)           |
|   `- Residual Add           |
+-----------------------------+
    |
    v
Final LayerNorm
    |
    v
LM Head -> Logits
    |
    v
Sampling -> Next Token

Multi-GPU Strategy

Ember uses layer-wise pipeline parallelism. Example split:

GPU 0: Embedding + Layers 0-13
GPU 1: Layers 14-27 + LM Head

Hidden states are transferred with cudaMemcpyPeer.
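The "memory-aware allocation" behind such a split can be sketched as assigning contiguous layer ranges to GPUs in proportion to free VRAM. This is an illustrative sketch; Ember's actual DeviceMap logic may differ.

```cpp
#include <cstddef>
#include <vector>

// Split num_layers into contiguous per-GPU counts proportional to free VRAM.
// The last GPU absorbs rounding remainders.
std::vector<int> split_layers(int num_layers,
                              const std::vector<double>& free_vram_gb) {
    double total = 0.0;
    for (double g : free_vram_gb) total += g;
    std::vector<int> counts(free_vram_gb.size());
    int assigned = 0;
    for (std::size_t i = 0; i + 1 < free_vram_gb.size(); ++i) {
        counts[i] = static_cast<int>(num_layers * free_vram_gb[i] / total);
        assigned += counts[i];
    }
    counts.back() = num_layers - assigned;
    return counts;  // layers per device, in device order
}
```

For a 28-layer model on two GPUs with equal free memory, this yields the 14/14 split shown above; skewed free memory (e.g. when GPU 0 also holds the embedding table) shifts layers toward the emptier device.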

Performance

Actual throughput depends on hardware and model size.

Model               Approx. VRAM   Expected Speed
Qwen3-4B (FP16)     ~8 GB          ~40 tok/s
Qwen3-8B (FP16)     ~16 GB         ~25 tok/s
Qwen3-14B (FP16)    ~28 GB         ~15 tok/s (dual GPU)
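The VRAM figures above follow from FP16 storing each parameter in 2 bytes, so weights alone cost about 2 GB per billion parameters; the KV cache and activations add on top of that. A back-of-the-envelope helper (illustrative, not part of Ember):

```cpp
// FP16 weight memory: 1e9 params * 2 bytes = 2 GB per billion parameters.
// KV cache and activation memory are not included.
double fp16_weight_gb(double params_billion) {
    return params_billion * 2.0;
}
```

For Qwen3-4B this gives ~8 GB, matching the table; the 14B model's ~28 GB is why it typically needs two consumer GPUs.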

Milestones

  • M0: Single-GPU inference baseline
  • M1: Dual-GPU pipeline parallelism
  • M2: Quantization support (INT8/INT4)
  • M3: FlashAttention optimization

Citation

If you use Ember in a paper or report, cite the Zenodo archive:

License

Apache-2.0

Acknowledgements
