Stars
This project shares technical principles and hands-on experience with large language models (LLM engineering and LLM application deployment).
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
Fast and memory-efficient exact attention
Code for the NeurIPS'24 paper "QuaRot": end-to-end 4-bit inference for large language models.
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
AI Crash Course to help busy builders catch up to the public frontier of AI research in 2 weeks
Development repository for the Triton language and compiler
QLoRA: Efficient Finetuning of Quantized LLMs
[ICCV'25] SSVQ: Unleashing the potential of vector quantization with sign-splitting
CUDA Matrix Multiplication Optimization
Fast CUDA matrix multiplication from scratch
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
A curated list of neural network pruning resources.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Universal LLM Deployment Engine with ML Compilation
Model Compression Toolbox for Large Language Models and Diffusion Models
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
Hackable and optimized Transformers building blocks, supporting a composable construction.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Making large AI models cheaper, faster and more accessible
High-speed downloads from a mirror site using HuggingFace's official download tool.
An Easy-to-Use and High-Performance AI Deployment Framework