ONNX Optimizer

C++ · 802 stars · 102 forks · Updated Apr 2, 2026

Simplify your ONNX model

C++ · 4,311 stars · 421 forks · Updated Apr 2, 2026

This project shares the technical principles behind large language models together with hands-on experience (LLM engineering and production deployment).

HTML · 23,844 stars · 2,745 forks · Updated Mar 12, 2026

FFPA: Extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large head dimensions; 1.8x–3x faster than SDPA EA.

Cuda · 255 stars · 14 forks · Updated Feb 13, 2026

Fast and memory-efficient exact attention

Python · 23,101 stars · 2,574 forks · Updated Apr 2, 2026

Code for the NeurIPS'24 paper QuaRot: end-to-end 4-bit inference for large language models.

Python · 498 stars · 69 forks · Updated Nov 26, 2024

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 336 stars · 30 forks · Updated Jul 2, 2024

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ · 820 stars · 61 forks · Updated Mar 6, 2025

AI crash course to help busy builders catch up to the public frontier of AI research in two weeks

5,952 stars · 858 forks · Updated Feb 23, 2026

Development repository for the Triton language and compiler

MLIR · 18,827 stars · 2,723 forks · Updated Apr 2, 2026

QLoRA: Efficient Finetuning of Quantized LLMs

Jupyter Notebook · 10,864 stars · 870 forks · Updated Jun 10, 2024

[ICCV'25] SSVQ: Unleashing the potential of vector quantization with sign splitting

Python · 9 stars · 1 fork · Updated Jul 30, 2025

CUDA Matrix Multiplication Optimization

Cuda · 264 stars · 25 forks · Updated Jul 19, 2024

Fast CUDA matrix multiplication from scratch

Cuda · 1,117 stars · 170 forks · Updated Sep 2, 2025

State-of-the-art low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) and sparsity; leading model-compression techniques for PyTorch, TensorFlow, and ONNX Runtime

Python · 2,611 stars · 302 forks · Updated Apr 1, 2026

Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".

Python · 876 stars · 119 forks · Updated Aug 20, 2024

A curated list of neural network pruning resources.

2,491 stars · 332 forks · Updated Apr 4, 2024

LeetCUDA: Modern CUDA learning notes with PyTorch for beginners; 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA.

Cuda · 10,092 stars · 1,022 forks · Updated Mar 23, 2026

Universal LLM Deployment Engine with ML Compilation

Python · 22,302 stars · 1,979 forks · Updated Apr 2, 2026

Model-compression toolbox for large language models and diffusion models

Python · 772 stars · 89 forks · Updated Aug 14, 2025

MySelf

24 stars · 4 forks · Updated Jan 19, 2026

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

Python · 696 stars · 76 forks · Updated Apr 1, 2026

Model Quantization Benchmark

Python · 862 stars · 142 forks · Updated Apr 20, 2025

Offline quantization tools for deployment.

Python · 144 stars · 19 forks · Updated Dec 28, 2023

Hackable and optimized Transformers building blocks, supporting composable construction.

Python · 10,398 stars · 775 forks · Updated Mar 30, 2026

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python · 41,966 stars · 4,772 forks · Updated Apr 2, 2026

Making large AI models cheaper, faster, and more accessible

Python · 41,372 stars · 4,523 forks · Updated Mar 30, 2026

High-speed downloads from mirror sites using Hugging Face's official download tool.

Python · 1,309 stars · 115 forks · Updated Oct 12, 2024

An easy-to-use and high-performance AI deployment framework

C++ · 1,777 stars · 212 forks · Updated Mar 28, 2026