- Beijing University of Posts and Telecommunications
- Beijing, China
- https://shigangli.github.io/
- @shigang_li
Stars
FlashSparse significantly reduces computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swap-and-Transpose mapping strategy. FlashSparse is accepted by…
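For reference, a minimal SciPy/NumPy sketch of the two kernels involved (SpMM: sparse × dense; SDDMM: a dense × dense product sampled at the nonzeros of a sparse matrix). This only defines the operations being accelerated, not the Swap-and-Transpose mapping itself:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
S = sp.random(64, 64, density=0.1, format="csr", random_state=0)  # unstructured sparse
X = rng.standard_normal((64, 32))
Y = rng.standard_normal((64, 32))

spmm = S @ X                 # SpMM: sparse @ dense -> dense, shape (64, 32)
sddmm = S.multiply(X @ Y.T)  # SDDMM: dense @ dense, kept only at S's nonzeros
```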
DLRover: An Automatic Distributed Deep Learning System
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.
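A minimal usage sketch of BackPACK's documented API, using the per-sample-gradient extension BatchGrad as one example of a quantity other than the gradient:

```python
import torch
from backpack import backpack, extend
from backpack.extensions import BatchGrad

model = extend(torch.nn.Linear(10, 1))   # wrap modules so BackPACK can hook them
lossfunc = extend(torch.nn.MSELoss())

X, y = torch.randn(8, 10), torch.randn(8, 1)
loss = lossfunc(model(X), y)
with backpack(BatchGrad()):              # request per-sample gradients
    loss.backward()

for p in model.parameters():
    print(p.grad_batch.shape)            # one gradient per sample: (8, ...)
```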
C++/MPI proxies for distributed training of deep neural networks.
Ok-Topk is a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (less than 6k communication volume, which is asymptotically optimal) with th…
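To illustrate the underlying idea, here is a naive top-k sparsification with an allgather exchange in mpi4py. This is a hypothetical sketch of the general technique, not Ok-Topk's actual allreduce, whose whole point is to avoid this version's communication cost:

```python
import numpy as np
from mpi4py import MPI

def naive_topk_allreduce(grad: np.ndarray, k: int, comm=MPI.COMM_WORLD) -> np.ndarray:
    # Keep only the k largest-magnitude entries of the local gradient.
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    # Exchange everyone's sparse (index, value) pairs; O(k * P) volume per rank,
    # which is exactly what Ok-Topk's allreduce improves on.
    all_idx = np.concatenate(comm.allgather(idx))
    all_vals = np.concatenate(comm.allgather(grad[idx]))
    out = np.zeros_like(grad)
    np.add.at(out, all_idx, all_vals)   # sum duplicate indices across ranks
    return out / comm.Get_size()        # averaged sparse-gradient update
```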
Training and serving large-scale neural networks with auto parallelization.
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
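As background, a generic sketch of the uniform symmetric quantization that integer-only inference builds on (not I-BERT's actual integer-only polynomial kernels for GELU, Softmax, and LayerNorm):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, n_bits: int = 8):
    """Approximate x as scale * q, with q an integer tensor."""
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(x)
print(np.abs(x - scale * q).max())               # quantization error
```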
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
A cross-platform Sparse Matrix Vector Multiplication (SpMV) framework for many-core architectures (GPUs and Xeon Phi).
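The textbook CSR SpMV such kernels optimize, written out as a plain-Python reference (the framework itself targets GPU and Xeon Phi implementations):

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x, with A stored in CSR as (row_ptr, col_idx, vals)."""
    y = np.zeros(len(row_ptr) - 1, dtype=vals.dtype)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]      # nonzeros of row i
        y[i] = vals[lo:hi] @ x[col_idx[lo:hi]]
    return y
```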
Cache-oblivious MPI all-to-all communications based on Morton order
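A sketch of the 2-D Morton (Z-order) encoding the traversal is based on; scheduling the MPI exchanges in this order is what yields the cache-oblivious behavior, and is not shown here:

```python
def morton_encode(x: int, y: int) -> int:
    """Interleave the bits of (x, y) into a single Morton code."""
    code = 0
    for i in range(32):
        code |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return code

assert morton_encode(3, 5) == 0b100111           # x=0b011, y=0b101
```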
WAGMA-SGD is a decentralized asynchronous SGD algorithm based on wait-avoiding group model averaging. Synchronization is relaxed by making the collectives externally triggerable; namely, a collective can…
Research and development for optimizing transformers
Deep Learning for Post-Processing Ensemble Weather Forecasts
Eager-SGD is a decentralized asynchronous SGD algorithm. It utilizes novel partial collective operations to accumulate the gradients across all the processes.
Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
A Deep Learning Meta-Framework and HPC Benchmarking Library
CSR-based SpGEMM on NVIDIA and AMD GPUs.
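The operation itself, as a SciPy CPU reference (the repository provides the corresponding GPU kernels):

```python
import scipy.sparse as sp

A = sp.random(1024, 1024, density=0.01, format="csr", random_state=0)
B = sp.random(1024, 1024, density=0.01, format="csr", random_state=1)

C = A @ B      # SpGEMM: sparse @ sparse -> sparse, CSR in and out
print(C.nnz)   # output sparsity pattern is data-dependent, unlike SpMM
```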