Stars
Voxtral Codec: Combining Semantic VQ and Acoustic FSQ for Ultra-Low Bitrate Speech Generation (Voxtral TTS Backbone)
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Pure C inference of the Mistral Voxtral Realtime 4B speech-to-text model
Unofficial implementation of the mimo-tokenizer training pipeline from "MiMo-Audio: Audio Language Models are Few-Shot Learners"
DFlash: Block Diffusion for Flash Speculative Decoding
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Write scalable load tests in plain Python 🚗💨
A powerful 3B-parameter, LLM-based reinforcement-learning audio-editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…
Training, inference, and testing of the SAC speech codec model.
VibeVoice: Expressive, long-form conversational speech synthesis. (Community fork)
LongCat Audio Tokenizer and Detokenizer
MOSS-Speech is a true speech-to-speech large language model without text guidance.
MiMo-Audio: Audio Language Models are Few-Shot Learners
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Official Repository of Paper: "Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling"
[ICLR2026] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Automatically updates text-to-speech (TTS) papers daily using GitHub Actions (refreshed every 12 hours)
VoiceStar: Robust, Duration-controllable TTS that can Extrapolate