|
Hao Kang (康浩)
I am a PhD student at the Georgia Institute of Technology, advised by Prof. Tushar Krishna. Currently, I am a visiting student at MIT, where I work with Prof. Song Han on efficient machine learning for edge devices and collaborate closely with Han Cai.
Prior to GT, I was fortunate to work with Prof. Baharan Mirzasoleiman at UCLA on efficient machine learning from massive datasets. I received my B.Eng. in Computer Science from Zhejiang University in 2023.
Email  / 
CV  / 
Github
|
|
|
Research Interests
I work on machine learning systems. In my first two years (2023-2025), I focused on efficient quantization and sparsity for LLMs to reduce serving costs, and I learned to write CUDA kernels through these projects. I am now working on agentic LLM efficiency and system-algorithm co-design for LLM architectures. I firmly believe that future AI systems will be adaptive and learnable, driven by feedback from the environment, the algorithm, and the workload.
|
|
Selected Published Papers
|
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
NeurIPS 2025 Spotlight
Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Win Fast or Lose Slow is the first paper to show that latency-sensitive environments naturally induce an accuracy–latency Pareto frontier for LLM agents, and to study how to jointly optimize model latency and capability under these real-time constraints.
|
TurboAttention: Efficient Attention Approximation For High Throughputs LLMs
MLSys 2025
Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan
TurboAttention integrates two core innovations—FlashQ for headwise KV-cache and activation quantization, and SAS for dequantization-free softmax—achieving 1.2–1.8× faster attention, >4.4× KV-cache reduction, and up to 2.37× higher throughput than FP16 while surpassing prior compression and quantization methods.
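To make "head-wise KV-cache quantization" concrete, here is a minimal, generic sketch in plain PyTorch (not the FlashQ kernel described in the paper): symmetric INT8 round-to-nearest quantization with one scale per attention head, assuming a [batch, heads, seq_len, head_dim] layout.

    import torch

    def headwise_int8_quant(kv: torch.Tensor):
        """Quantize a KV-cache tensor to INT8 with one symmetric scale per head."""
        # Per-head max magnitude, reduced over batch, sequence, and channel dims.
        amax = kv.float().abs().amax(dim=(0, 2, 3), keepdim=True)   # shape [1, H, 1, 1]
        scale = amax.clamp(min=1e-6) / 127.0
        q = torch.clamp(torch.round(kv.float() / scale), -127, 127).to(torch.int8)
        return q, scale

    def headwise_int8_dequant(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
        """Recover an approximate KV tensor from INT8 values and per-head scales."""
        return (q.float() * scale).to(dtype)

    if __name__ == "__main__":
        kv = torch.randn(1, 8, 128, 64, dtype=torch.float16)   # [B, H, S, D], example sizes
        q, scale = headwise_int8_quant(kv)
        kv_hat = headwise_int8_dequant(q, scale)
        print(q.dtype, tuple(scale.shape), (kv_hat - kv).abs().max().item())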
|
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
NeurIPS 2024 ENLSP Best Paper Candidate
Hao Kang*, Qingru Zhang*, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
We propose a novel KV-cache compression algorithm for large language model inference. It achieves near-lossless accuracy while delivering about a 2x speedup and a 2x reduction in peak memory at inference time.
|
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
ICML 2023
Yu Yang, Hao Kang, Baharan Mirzasoleiman
We design a coreset selection algorithm based on submodular functions and batch SGD that distills a small subset from a large dataset. A model trained on this small subset achieves performance similar to one trained on the full dataset.
|
|
Research Project and Tools
|
torchanalyse.
A model profiling tool based on TVM and MAESTRO (thanks for the help, Abhi!). It profiles a model and reports the FLOPs, memory usage, and latency of each layer, as well as the FLOPs of each individual operator.
More Information
|
Epipe: Efficient Pipeline Parallelism with Compression Algorithms.
A research project based on GPipe and low-rank approximation that reduces the bandwidth needed for activation transfer during cloud-based training.
More Information
|
THOP: PyTorch-OpCounter.
A third-party Python library that counts the FLOPs of models (PyTorch, JIT, ONNX). It already has 4k stars! I wrote the ONNX counter.
More Information
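A quick usage sketch of THOP's profile API (the ResNet-18 model and input shape below are just an illustrative example and assume torchvision is installed):

    import torch
    from torchvision.models import resnet18
    from thop import profile

    model = resnet18()                                    # any nn.Module works here
    dummy_input = torch.randn(1, 3, 224, 224)             # example input shape
    macs, params = profile(model, inputs=(dummy_input,))  # returns MACs and parameter count
    print(f"MACs: {macs:.3e}, Params: {params:.3e}")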
|
|