
Ember - Qwen3 CUDA Inference Engine

A lightweight CUDA inference engine for Qwen3 models, designed for consumer multi-GPU setups (for example, dual RTX 3080 Ti).

Roadmap

As of March 12, 2026, Ember is focused on a narrow goal: high-performance Qwen inference on consumer NVIDIA GPUs.

Current:

  • Stable Qwen3 dense CUDA inference
  • Native safetensors loading
  • Multi-GPU pipeline-parallel runtime
  • Minimal CLI and server path

Next:

  • Qwen3.5 hybrid architecture support
  • DeltaNet + Gated Attention runtime
  • HF-aligned correctness and regression harness

Then:

  • Qwen3.5 35B-A3B MoE inference
  • Dual-GPU + CPU offload execution path
  • Reproducible benchmarks and OpenAI-compatible serving

Quick Start

  1. Build (example: RTX 3080 Ti, SM=86):
cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel

By default, Ember builds only the core inference target (ember). To also build optional tests/benchmarks:

cmake -S . -B build \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DEMBER_BUILD_TESTS=ON \
  -DEMBER_BUILD_BENCHMARKS=ON
  2. Download a model (safetensors format; first run may require huggingface-cli login):
# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./qwen3-0.6b
  3. Run:
./build/ember -m ./qwen3-0.6b -p "Hello, my name is"

Documentation

English (default):

Chinese:

Features

  • Native CUDA implementation: full control over the compute flow, with no dependency on ggml or llama.cpp.
  • Direct safetensors loading: reads the HuggingFace format natively, with no conversion step.
  • Pipeline parallelism: multi-GPU layer split with memory-aware allocation.
  • FP16 compute: FP16 weights/activations, GEMM via cuBLAS.
  • Custom kernels: RMSNorm, RoPE, Softmax, SiLU, Attention, and more.
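Among the custom kernels listed above, RMSNorm is the simplest to illustrate. The CPU reference below is a sketch of the math the CUDA kernel computes; the function name and signature are illustrative, not taken from Ember's headers.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// CPU reference for the per-row normalization the CUDA kernel performs.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    float inv_rms = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * weight[i];
    return y;
}
```

The GPU version parallelizes this per row of the hidden-state matrix, typically with one thread block per row and a warp-level reduction for the sum of squares.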

Project Structure

ember/
|-- apps/ember_cli/         # CLI entry point (main.cpp)
|-- cli/                    # CLI argument parsing
|-- core/                   # Core abstractions (pure C++, no CUDA dependency)
|-- runtime/                # Scheduling and device mapping runtime logic
|-- formats/                # safetensors/config loaders
|-- backends/cuda/          # CUDA runtime + kernels
|-- benchmarks/             # Performance and interconnect benchmarks
|-- tests/                  # Unit tests and smoke tests
|-- scripts/                # CI and alignment scripts
`-- docs/                   # Current design and testing docs

Build

Requirements

  • CMake 3.18+
  • CUDA Toolkit 11.0+ (12.x recommended)
  • C++17 compiler

Compile

cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release  # RTX 3080 Ti
cmake --build build --parallel

Common CUDA architecture values:

  • 86 - RTX 3080/3090
  • 89 - RTX 4090
  • 80 - A100

Usage

Basic Examples

# Single-GPU inference
./build/ember -m /path/to/qwen3-4b -p "Hello, my name is"

# Dual-GPU inference
./build/ember -m /path/to/qwen3-14b --devices 0,1 -p "Explain quantum computing"

# Interactive mode
./build/ember -m /path/to/qwen3-4b -i

CLI Arguments

Argument            Description                                                 Default
-m, --model         Model directory (must contain safetensors and config.json)  Required
-p, --prompt        Input prompt                                                "Hello, my name is"
--devices           GPU device list                                             0
-c, --ctx-size      Context length                                              2048
-n, --n-predict     Number of generated tokens                                  128
--temp              Temperature                                                 0.7
--top-p             Top-P                                                       0.9
--top-k             Top-K                                                       40
-i, --interactive   Interactive mode                                            false
-v, --verbose       Verbose output                                              false
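To make the sampling flags concrete, here is a hedged sketch of how --temp and --top-k style sampling is commonly implemented: scale logits by 1/temperature, keep the k largest, and renormalize with softmax. Names are illustrative, not Ember's internal API.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Returns (token_id, probability) pairs for the k most likely tokens,
// highest probability first. Assumes 0 < k <= logits.size().
std::vector<std::pair<int, float>> top_k_probs(std::vector<float> logits,
                                               float temperature, int k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partially sort indices so the k largest logits come first.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    // Softmax over the surviving logits (max-subtracted for stability).
    float max_logit = logits[idx[0]] / temperature;
    float sum = 0.0f;
    std::vector<std::pair<int, float>> out;
    for (int i : idx) {
        float e = std::exp(logits[i] / temperature - max_logit);
        out.push_back({i, e});
        sum += e;
    }
    for (auto& p : out) p.second /= sum;
    return out;
}
```

A sampler would then draw from this truncated distribution (optionally intersected with the top-p nucleus) rather than taking the argmax.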

Supported Models

  • Qwen3-0.6B
  • Qwen3-1.7B
  • Qwen3-4B
  • Qwen3-8B
  • Qwen3-14B (typically needs dual GPUs)

Models should be downloaded from HuggingFace in safetensors format:

# If huggingface-cli is missing: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-4B --local-dir ./qwen3-4b

Architecture

Core Abstractions

  1. Tensor: lightweight view (shape + dtype + data ptr), no ownership.
  2. Session: inference session state including KV cache.
  3. IRuntime: backend interface (CUDA now, extensible in future).
  4. DeviceMap: layer-to-device mapping for pipeline parallelism.
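The non-owning Tensor abstraction above can be sketched as a plain view struct; field and type names here are illustrative, not copied from Ember's headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DType { kF16, kF32 };

// A lightweight view: shape + dtype + borrowed data pointer, no ownership.
struct TensorView {
    void* data = nullptr;        // borrowed; lifetime managed elsewhere
    std::vector<int64_t> shape;  // e.g. {batch, seq, hidden}
    DType dtype = DType::kF16;

    int64_t numel() const {
        int64_t n = 1;
        for (int64_t d : shape) n *= d;
        return n;
    }
    std::size_t nbytes() const {
        return static_cast<std::size_t>(numel()) *
               (dtype == DType::kF16 ? 2 : 4);
    }
};
```

Because the view owns nothing, the same struct can describe a slice of a memory-mapped safetensors file or a region of GPU memory without copies.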

Compute Flow

Input IDs
    |
    v
Embedding Lookup (GPU 0)
    |
    v
+-----------------------------+
| Layer 0-N (may span GPUs)   |
|   |- Input LayerNorm        |
|   |- QKV Projection         |
|   |- RoPE                   |
|   |- KV Cache Update        |
|   |- Attention (Q@K^T -> V) |
|   |- O Projection           |
|   |- Residual Add           |
|   |- Post-Attn LayerNorm    |
|   |- MLP (SwiGLU)           |
|   `- Residual Add           |
+-----------------------------+
    |
    v
Final LayerNorm
    |
    v
LM Head -> Logits
    |
    v
Sampling -> Next Token

Multi-GPU Strategy

Ember uses layer-wise pipeline parallelism. Example split:

GPU 0: Embedding + Layers 0-13
GPU 1: Layers 14-27 + LM Head

Hidden states are transferred with cudaMemcpyPeer.
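The "memory-aware allocation" behind such a split can be sketched as assigning contiguous layer ranges to GPUs in proportion to free VRAM. This is an illustrative sketch; Ember's actual DeviceMap logic may differ.

```cpp
#include <cstddef>
#include <vector>

// Split num_layers into contiguous per-GPU counts proportional to free VRAM.
// The last GPU absorbs rounding remainders.
std::vector<int> split_layers(int num_layers,
                              const std::vector<double>& free_vram_gb) {
    double total = 0.0;
    for (double g : free_vram_gb) total += g;
    std::vector<int> counts(free_vram_gb.size());
    int assigned = 0;
    for (std::size_t i = 0; i + 1 < free_vram_gb.size(); ++i) {
        counts[i] = static_cast<int>(num_layers * free_vram_gb[i] / total);
        assigned += counts[i];
    }
    counts.back() = num_layers - assigned;
    return counts;  // layers per device, in device order
}
```

For a 28-layer model on two GPUs with equal free memory, this yields the 14/14 split shown above; skewed free memory (e.g. when GPU 0 also holds the embedding table) shifts layers toward the emptier device.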

Performance

Actual throughput depends on hardware and model size.

Model               Approx. VRAM   Expected Speed
Qwen3-4B (FP16)     ~8 GB          ~40 tok/s
Qwen3-8B (FP16)     ~16 GB         ~25 tok/s
Qwen3-14B (FP16)    ~28 GB         ~15 tok/s (dual GPU)
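The VRAM figures above follow from FP16 storing each parameter in 2 bytes, so weights alone cost about 2 GB per billion parameters; the KV cache and activations add on top of that. A back-of-the-envelope helper (illustrative, not part of Ember):

```cpp
// FP16 weight memory: 1e9 params * 2 bytes = 2 GB per billion parameters.
// KV cache and activation memory are not included.
double fp16_weight_gb(double params_billion) {
    return params_billion * 2.0;
}
```

For Qwen3-4B this gives ~8 GB, matching the table; the 14B model's ~28 GB is why it typically needs two consumer GPUs.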

Milestones

  • M0: Single-GPU inference baseline
  • M1: Dual-GPU pipeline parallelism
  • M2: Quantization support (INT8/INT4)
  • M3: FlashAttention optimization

Citation

If you use Ember in a paper or report, cite the Zenodo archive:

License

Apache-2.0

Acknowledgements
