Inference of Qwen3.5 models in pure C, for learning purposes.
No PyTorch required. Safetensors loading is done with safetensors-cpp.
Inspired by llama2.c and mamba.c.
For those interested in (or who only learn by) seeing the actual operations on the weights and state at a lower level.
Qwen3.5 combines multi-head attention and linear-attention (GatedDeltaNet) layers.
For fast inference, use other projects such as qwen3.5-triton.
pip install huggingface_hub transformers
python prepare.py Qwen/Qwen3.5-0.8B # download + create tokenizer
make fast
./qwen35 Qwen3.5-0.8B

If there is more than one model with the same name in the local cache, pass the model's full name:
./qwen35 Qwen/Qwen3.5-0.8B
Or pass the path to the folder containing the model:
./qwen35 ./Qwen3.5-0.8B

Use Qwen3.5 dense models from Qwen's Hugging Face organization, or finetunes of them.
Examples:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
- Qwen/Qwen3.5-9B
Many of these repos on the Hub are vision-language models; prepare.py uses text_config when present and exports only the text transformer.
Not supported: MoE checkpoints (e.g. Qwen3.5-35B-A3B, 122B-A10B, 397B-A17B), FP8 / GPTQ / other non-float weight formats.
make # reference
make fast # -Ofast -march=native
make omp # OpenMP (set OMP_NUM_THREADS when running)
make debug
make clean

./qwen35 <model> [options]
# <model> can be a model name (after prepare.py) or a local directory
./qwen35 Qwen3.5-0.8B -i "Hello!"
./qwen35 ./Qwen3.5-0.8B -y "You are a helpful assistant."

WTFPL (Do What The Fuck You Want To Public License)