Research Interests
I am interested in efficient machine learning and systems, with experience at the intersection of both fields. I aim to use low-rank approximation and compression algorithms to accelerate machine learning models, especially large language models (LLMs). I also design efficient systems, such as inference and fine-tuning schedulers, to accelerate the training and inference process.
Published Papers
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
arXiv
Hao Kang*, Qingru Zhang*, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
We propose a novel compression algorithm for the KV cache in large language model inference. It achieves a near-lossless compression ratio, roughly 2x inference speedup, and 2x peak memory savings.
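The core idea can be illustrated with a short sketch: quantize the KV entries to a few bits and approximate the quantization residual with a low-rank factorization. This is a conceptual illustration rather than the released GEAR code; the bit-width, rank, and tensor shapes below are arbitrary choices for demonstration.
```python
# Conceptual sketch (not the official GEAR implementation): compress one
# key/value matrix by (1) uniform low-bit quantization and (2) a low-rank
# approximation of the quantization residual. Bits/rank are assumptions.
import torch

def compress_kv(x: torch.Tensor, bits: int = 4, rank: int = 4):
    """Quantize x to `bits` bits and capture the residual with a rank-`rank` factorization."""
    # Per-tensor uniform quantization.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    q = torch.round((x - lo) / scale).clamp(0, 2**bits - 1)
    dequant = q * scale + lo

    # Low-rank approximation of the residual via truncated SVD.
    residual = x - dequant
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] * S[:rank]   # (tokens, rank)
    R = Vh[:rank, :]             # (rank, head_dim)
    return q.to(torch.uint8), scale, lo, L, R

def decompress_kv(q, scale, lo, L, R):
    # Dequantize and add back the low-rank residual correction.
    return q.float() * scale + lo + L @ R

if __name__ == "__main__":
    k = torch.randn(128, 64)     # toy KV slice: 128 tokens, head_dim 64
    packed = compress_kv(k)
    err = (decompress_kv(*packed) - k).abs().mean()
    print(f"mean reconstruction error: {err:.4f}")
```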
KV Cache Optimizations for Large Language Model Inference
Under review at MLSys 2024
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
ICML 2023
Yu Yang, Hao Kang, Baharan Mirzasoleiman
We design a coreset selection algorithm based on submodular functions and mini-batch SGD that distills a small subset from a large dataset; a model trained on the small subset achieves performance similar to one trained on the full dataset.
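For intuition, the sketch below shows a generic greedy facility-location selection of a coreset from feature vectors. It is an illustrative example of submodular subset selection, not the paper's exact algorithm; the cosine-similarity objective, feature dimensions, and coreset size are all assumptions.
```python
# Illustrative sketch of greedy facility-location coreset selection; a generic
# submodular-maximization example, not the paper's actual method.
import numpy as np

def greedy_coreset(features: np.ndarray, k: int) -> list[int]:
    """Greedily pick k points that maximize facility-location coverage of `features`."""
    # Pairwise cosine similarities, shifted to be non-negative.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T + 1.0

    selected: list[int] = []
    coverage = np.zeros(len(features))   # best similarity to the current coreset
    for _ in range(k):
        # Total coverage if each candidate were added, minus current coverage.
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf        # do not re-select chosen points
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected

if __name__ == "__main__":
    X = np.random.randn(1000, 32)        # stand-in for per-example features/gradients
    idx = greedy_coreset(X, k=50)
    print(f"selected {len(idx)} examples, e.g. {idx[:5]}")
```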
Research Projects and Tools
torchanalyse.
A model profiling tool built on TVM and Maestro (thanks for the help, Abhi!). It profiles a model and reports the FLOPs, memory usage, and latency of each layer, as well as the FLOPs of each operator; a sketch of the per-layer profiling idea follows below.
More Information
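The per-layer profiling idea can be sketched with plain PyTorch forward hooks. This is not the torchanalyse API (which builds on TVM and Maestro); it is only an illustration of counting FLOPs and timing each Linear layer, with the FLOP formula and layer filter chosen for the example.
```python
# Generic illustration of per-layer FLOP/latency profiling with forward hooks;
# this is a sketch of the idea, not the torchanalyse interface.
import time
import torch
import torch.nn as nn

def profile_linear_layers(model: nn.Module, x: torch.Tensor) -> dict:
    stats = {}

    def pre_hook(name):
        def hook(module, inputs):
            stats[name] = {"start": time.perf_counter()}   # mark layer entry time
        return hook

    def post_hook(name):
        def hook(module, inputs, output):
            entry = stats[name]
            entry["latency_ms"] = (time.perf_counter() - entry.pop("start")) * 1e3
            # 2 * in_features * out_features FLOPs per output row (multiply + add).
            rows = inputs[0].numel() // module.in_features
            entry["flops"] = 2 * rows * module.in_features * module.out_features
        return hook

    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_pre_hook(pre_hook(name)))
            handles.append(module.register_forward_hook(post_hook(name)))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return stats

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
    for name, s in profile_linear_layers(net, torch.randn(8, 256)).items():
        print(f"{name}: {s['flops']/1e6:.2f} MFLOPs, {s['latency_ms']:.3f} ms")
```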
Epipe: Efficient Pipeline Parallelism with Compression Algorithms.
A research project based on GPipe and low-rank approximation that reduces the bandwidth needed to transfer activations during cloud-based training; a compression sketch follows below.
More Information
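The compression idea can be sketched as follows: factor each activation into two thin matrices via truncated SVD and transmit the factors instead of the full tensor. This is an illustrative sketch, not Epipe's implementation; the rank and the (batch, features) shape are assumptions, and real activations compress far better than the random tensor used here.
```python
# Conceptual sketch of low-rank activation compression for pipeline parallelism;
# the rank and interfaces are illustrative, not the project's actual code.
import torch

def compress_activation(act: torch.Tensor, rank: int = 8):
    """Factor a (batch, features) activation into two thin matrices before sending."""
    U, S, Vh = torch.linalg.svd(act, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (batch, rank)
    B = Vh[:rank, :]             # (rank, features)
    return A, B                  # transmit A and B instead of the full activation

def decompress_activation(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    return A @ B                 # reconstructed on the next pipeline stage

if __name__ == "__main__":
    act = torch.randn(64, 1024)  # one micro-batch of activations (toy example)
    A, B = compress_activation(act, rank=8)
    ratio = act.numel() / (A.numel() + B.numel())
    err = (decompress_activation(A, B) - act).norm() / act.norm()
    print(f"compression ratio ~{ratio:.1f}x, relative error {err:.3f}")
```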
THOP: PyTorch-OpCounter.
A third-party Python library that counts the FLOPs of models (PyTorch, JIT, ONNX). It already has over 4k stars on GitHub! I wrote the ONNX counter; a usage example follows below.
More Information
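A typical usage, mirroring the library's standard PyTorch example (the ONNX counting path has its own interface and is not shown here):
```python
# Count MACs and parameters of a PyTorch model with THOP.
import torch
import torchvision.models as models
from thop import profile, clever_format

model = models.resnet18()
dummy_input = torch.randn(1, 3, 224, 224)

# profile runs a forward pass and accumulates per-op counts.
macs, params = profile(model, inputs=(dummy_input,))
macs, params = clever_format([macs, params], "%.3f")
print(f"MACs: {macs}, Params: {params}")
```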