- taiwan
-
10:20
(UTC -12:00) - https://www.linkedin.com/in/wadehuang811/
Starred repositories
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
[ECCV 2022]Code for paper "DaViT: Dual Attention Vision Transformer"
yolort is a runtime stack for yolov5 on specialized accelerators such as tensorrt, libtorch, onnxruntime, tvm and ncnn.
Introduction to Parallel Programming class code
AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Model Compression Toolbox for Large Language Models and Diffusion Models
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through w…
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Flash Attention in ~100 lines of CUDA (forward pass only)
Kolosal AI is an OpenSource and Lightweight alternative to LM Studio to run LLMs 100% offline on your device.
AutoMQ is a diskless Kafka® on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds. Single-digit ms latency. Multi-AZ Availability.
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Efficient vision foundation models for high-resolution generation and perception.
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
An OCR API service implementation using PaddleOCR and FastAPI
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Some common CUDA kernel implementations (Not the fastest).
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
Safety helmet wearing detect dataset, with pretrained model
Base on YOLOv5 Head Person Helmet Detection on Construction Sites,基于目标检测工地安全帽和禁入危险区域识别系统,🚀😆附 YOLOv5 训练自己的数据集超详细教程🚀😆2021.3新增可视化界面❗❗
Read-only mirror of https://gitlab.gnome.org/GNOME/glib
NVIDIA DeepStream SDK 8.0 / 7.1 / 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models