aha

Lightweight AI Inference Engine — All-in-one Solution for Text, Vision, Speech, and OCR

aha is a high-performance, cross-platform AI inference engine built with Rust and the Candle framework. It runs state-of-the-art AI models directly on your local hardware, with no API keys and no cloud dependencies.

Changelog

v0.2.5 (2026-03-30)

Added LFM2.5-VL-1.6B
Added LFM2-VL-1.6B

v0.2.4 (2026-03-23)

Added LFM2.5-1.2B-Instruct
Added LFM2-1.2B

v0.2.3 (2026-03-18)

Added DeepSeek-OCR-2

2026-03-17

Added the PaddleOCR-VL1.5 model
Fixed a bug in Qwen3.5 position_ids creation
Added CLI parameters:
- gguf_path: Local path to GGUF model weights (required when loading a GGUF model)
- mmproj_path: Local path to mmproj GGUF weights (required for multimodal GGUF loading)
Added qwen3.5-gguf to WhichModel

2026-03-16

Added Qwen3.5 mmproj

View full changelog →

Quick Start

Installation

git clone https://github.com/jhqxxx/aha.git
cd aha
cargo build --release

Optional Features:

# CUDA (NVIDIA GPU acceleration)
cargo build --release --features cuda

# Metal (Apple GPU acceleration for macOS)
cargo build --release --features metal

# Flash Attention (faster inference)
cargo build --release --features cuda,flash-attn

# FFmpeg (multimedia processing)
cargo build --release --features ffmpeg

CLI Quick Reference

# List all supported models
aha list

# Download model only
aha download -m qwen3asr-0.6b

# Download model and start service
aha -m qwen3asr-0.6b

# Run inference directly (without starting service)
aha run -m qwen3asr-0.6b -i "audio.wav"

# Start service only (model already downloaded)
aha serv -m qwen3asr-0.6b -p 10100

Chat

aha serv -m qwen3-0.6b -p 10100

Then use the unified (OpenAI-compatible) API:

curl http://localhost:10100/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
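
The same request can be sent from Rust. The sketch below uses only the standard library and assumes the service from the command above is listening on localhost:10100. The helper names (json_escape, chat_body, http_post) are made up for this illustration and are not part of aha. It prints the raw HTTP response, headers included, so a real client would more likely use an HTTP crate and a JSON parser.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Escape a string so it can be embedded inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the same JSON body as the curl example (single user message, no streaming).
fn chat_body(model: &str, prompt: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"messages\":[{{\"role\":\"user\",\"content\":\"{}\"}}],\"stream\":false}}",
        json_escape(model),
        json_escape(prompt)
    )
}

/// Wrap a JSON body in a plain HTTP/1.1 POST request.
fn http_post(host: &str, path: &str, body: &str) -> String {
    format!(
        "POST {} HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        path,
        host,
        body.len(),
        body
    )
}

fn main() {
    // Assumed address: the port passed to `aha serv -p 10100` above.
    let host = "localhost:10100";
    let body = chat_body("qwen3-0.6b", "Hello!");
    let request = http_post(host, "/chat/completions", &body);

    match TcpStream::connect(host) {
        Ok(mut stream) => {
            let mut response = String::new();
            let sent = stream.write_all(request.as_bytes());
            if sent.is_ok() && stream.read_to_string(&mut response).is_ok() {
                println!("{response}");
            } else {
                eprintln!("request to {host} failed");
            }
        }
        Err(e) => eprintln!("could not reach {host}: {e}"),
    }
}
```

Building the body by hand keeps the example dependency-free; the escaping helper matters as soon as the prompt contains quotes or newlines.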

Supported Models

Category	Models
Text	Qwen3, MiniCPM4, LFM2-1.2B, LFM2.5-1.2B-Instruct
Vision	Qwen2.5-VL, Qwen3-VL, Qwen3.5, LFM2.5-VL-1.6B, LFM2-VL-1.6B
OCR	DeepSeek-OCR, DeepSeek-OCR-2, PaddleOCR-VL, PaddleOCR-VL1.5, Hunyuan-OCR, GLM-OCR
ASR	GLM-ASR-Nano, Fun-ASR-Nano, Qwen3-ASR
Audio	VoxCPM, VoxCPM1.5
Image	RMBG-2.0 (background removal)

Documentation

Document	Description
Getting Started	First steps with aha
Installation	Detailed installation guide
CLI Reference	Command-line interface
API Documentation	Library & REST API
Supported Models	Available AI models
Concepts	Architecture & design
Development	Contributing guide
Changelog	Version history

Why aha?

🚀 High-Performance Inference - Powered by the Candle framework for efficient tensor computation and model inference
🔧 Unified Interface — One tool for text, vision, speech, and OCR
📦 Local-First — All processing runs locally, no data leaves your machine
🎯 Cross-Platform — Works on Linux, macOS, and Windows
⚡ GPU Accelerated - Optional CUDA (NVIDIA) and Metal (Apple) support for faster inference
🛡️ Memory Safe — Built with Rust for reliability
🧠 Attention Optimization - Optional Flash Attention support for faster long-sequence processing

Development

Using aha as a Library

cargo add aha anyhow

// VoxCPM example: synthesize speech from text and save it as a WAV file
use aha::models::voxcpm::generate::VoxCPMGenerate;
use aha::utils::audio_utils::save_wav;
use anyhow::Result;

fn main() -> Result<()> {
    // Path to the locally downloaded VoxCPM weights
    let model_path = "xxx/openbmb/VoxCPM-0.5B/";

    // Load the model from model_path
    let mut voxcpm_generate = VoxCPMGenerate::init(model_path, None, None)?;

    // Generate audio for the input text
    let generate = voxcpm_generate.generate(
        "The sun is shining bright, flowers smile at me, birds say early early early".to_string(),
        None,
        None,
        2,
        100,
        10,
        2.0,
        false,
        6.0,
    )?;

    // Write the generated audio to voxcpm.wav
    let _ = save_wav(&generate, "voxcpm.wav")?;
    Ok(())
}

Extending New Models

Create a new model file in src/models/
Export it in src/models/mod.rs
Add CLI inference support for the model in src/exec/
Add tests and examples in tests/
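
As a rough picture of how these steps fit together, here is a self-contained sketch. Everything in it is hypothetical: the InferenceModel trait, the EchoModel type, and the load_model function are invented for illustration and do not reflect aha's actual interfaces. Check the existing files in src/models/ and src/exec/ for the real conventions.

```rust
/// Hypothetical interface; aha's real model API may look different.
trait InferenceModel {
    fn name(&self) -> &'static str;
    fn run(&mut self, input: &str) -> Result<String, String>;
}

/// Stand-in for a model that would live in its own file under src/models/
/// and be exported from src/models/mod.rs.
struct EchoModel;

impl InferenceModel for EchoModel {
    fn name(&self) -> &'static str {
        "echo-demo"
    }

    fn run(&mut self, input: &str) -> Result<String, String> {
        // A real model would tokenize, run a forward pass, and decode here.
        Ok(format!("echo: {input}"))
    }
}

/// Stand-in for the CLI dispatch in src/exec/: map a model id to a loader.
fn load_model(id: &str) -> Option<Box<dyn InferenceModel>> {
    match id {
        "echo-demo" => Some(Box::new(EchoModel)),
        _ => None,
    }
}

fn main() {
    match load_model("echo-demo") {
        Some(mut model) => match model.run("hello") {
            Ok(out) => println!("{}: {}", model.name(), out),
            Err(e) => eprintln!("inference failed: {e}"),
        },
        None => eprintln!("unknown model id"),
    }
}
```

A test placed in tests/ would then typically load the model by id and assert on its output for a small fixed input.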

Features

High-performance inference via the Candle framework
Multi-modal model support (vision, language, speech)
Clean, easy-to-use API design
Minimal dependencies, compact binaries
Flash Attention support for long sequences
FFmpeg support for multimedia processing

License

Apache-2.0 — See LICENSE for details.

Acknowledgments

Candle - Excellent Rust ML framework
All model authors and contributors

Built with ❤️ by the aha team

We're continuously expanding our model support. Contributions are welcome!

If this project helps you, please consider giving us a ⭐ Star!
