ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
The first heterogeneous agentic inference system, improving throughput by up to 4.0x in large-scale agentic serving and RL training.
LLM Agents · Algorithm / System Co-Design
I aim to build LLM agent systems and algorithms that evolve with their environments.
About
I am a PhD student at the Georgia Institute of Technology, advised by Prof. Tushar Krishna. I am currently a visiting student at MIT, advised by Prof. Song Han, and work closely with Han Cai.
Earlier in my PhD, I focused on efficient quantization and sparsity for LLMs to reduce serving cost, learning CUDA kernel development through those projects. Now I work on agentic LLM efficiency and system-algorithm co-design. I believe future AI systems will be adaptive, learning from environment, algorithm, and workload feedback.
Prior to Georgia Tech, I worked with Prof. Baharan Mirzasoleiman at UCLA on efficient machine learning from massive datasets. I received my B.Eng. in Computer Science from Zhejiang University in 2023.
Selected Papers
The first heterogeneous agentic inference system, improving throughput by up to 4.0x in large-scale agentic serving and RL training.
Shows that latency-sensitive environments induce an accuracy-latency Pareto frontier for LLM agents, and studies how to jointly optimize model latency and capability.
Combines FlashQ for headwise KV-cache and activation quantization with SAS for dequantization-free softmax, achieving 1.2-1.8x faster attention and up to 2.37x higher throughput than FP16.
Proposes a KV-cache compression recipe for large language model inference that is near-lossless while delivering a 2x speedup and 2x peak-memory savings.
Designs a data distillation algorithm based on submodular functions and batch SGD to train strong models from smaller distilled datasets.
Projects
A TVM- and Maestro-based model profiling tool for FLOPs, memory usage, and layer-level latency analysis.
Efficient pipeline parallelism with compression algorithms, built on GPipe and low-rank approximation to reduce activation-transfer bandwidth.
A PyTorch, JIT, and ONNX FLOPs counter. I contributed the ONNX counter to the project.