ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
The first heterogeneous agentic inference system, improving throughput by up to 4.0x in large-scale agentic serving and RL training.
LLM Agents · Algorithm / System Co-Design
I aim to build LLM agent systems and algorithms that evolve with their environments.
About
I am a PhD student at the Georgia Institute of Technology, advised by Prof. Tushar Krishna. I am currently a visiting student at MIT, advised by Prof. Song Han, and work closely with Han Cai.
Earlier in my PhD, I focused on efficient quantization and sparsity for LLMs to reduce serving cost, learning CUDA kernel development through those projects. Now I work on agentic LLM efficiency and system-algorithm co-design. I believe future AI systems will be adaptive, learning from environment, algorithm, and workload feedback.
Prior to Georgia Tech, I worked with Prof. Baharan Mirzasoleiman at UCLA on efficient machine learning from massive datasets. I received my B.Eng. in Computer Science from Zhejiang University in 2023.
Selected Papers
The first heterogeneous agentic inference system, improving throughput by up to 4.0x in large-scale agentic serving and RL training.
Shows that latency-sensitive environments induce an accuracy-latency Pareto frontier for LLM agents, and studies how to jointly optimize model latency and capability.
Combines FlashQ for headwise KV-cache and activation quantization with SAS for dequantization-free softmax, achieving 1.2-1.8x faster attention and up to 2.37x higher throughput than FP16.
Proposes a KV-cache compression recipe for large language model inference that is near-lossless while delivering a 2x speedup and 2x peak-memory savings.
Designs a data distillation algorithm based on submodular functions and batch SGD to train strong models from smaller distilled datasets.
Projects
A TVM- and Maestro-based model profiling tool for FLOPs, memory usage, and layer-level latency analysis.
Efficient pipeline parallelism with compression algorithms, built on GPipe and low-rank approximation to reduce activation-transfer bandwidth.
A PyTorch, JIT, and ONNX FLOPs counter. I contributed the ONNX counter to the project.