Publications | Xudong Liao

2026

EuroSys

Learn-to-Probe: Achieving Signal Distinguishability in Learning-based Congestion Control

Han Tian, Wenbo Li, Junxue Zhang, Xudong Liao, Decang Sun, Donghui Chen, Bin Huang, Wenxue Li, Yong Wang, and Kai Chen

In Proceedings of the 21st ACM European Conference on Computer Systems (EuroSys 2026), 2026
EuroSys

MFS: An Efficient Model Family Serving System for LLMs

Yunxuan Zhang, Hao Wang, Han Tian, Liu Yang, Xudong Liao, Wenxue Li, Ping Yin, Bowen Liu, and Kai Chen

In Proceedings of the 21st ACM European Conference on Computer Systems (EuroSys 2026), 2026

2025

SIGCOMM

MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training

Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, and Kai Chen

In Proceedings of the 2025 ACM SIGCOMM Conference (SIGCOMM 2025), 2025

Mixture-of-Experts (MoE) models outperform conventional models by selectively activating different subnets, named experts, on a per-token basis. This gated computation generates dynamic communication patterns that cannot be determined beforehand, challenging existing GPU interconnects, which remain static during distributed training. In this paper, we advocate for a first-of-its-kind system, called mFabric, that unlocks topology reconfiguration during distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has strong locality, alleviating the requirement of global reconfiguration. Based on this, we design and implement a regionally reconfigurable high-bandwidth domain on top of existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We have built a fully functional mFabric prototype with commodity hardware and a customized collective communication runtime that trains state-of-the-art MoE models with in-training topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that mFabric delivers performance comparable to a non-blocking fat-tree fabric while boosting the training cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2x–1.5x and 1.9x–2.3x at 100 Gbps and 400 Gbps link bandwidths, respectively.
SIGCOMM

Coflow Scheduling for LLM Training

Xinchen Wan, Xinyu Yang, Kaiqiang Xu, Xudong Liao, Yilun Jin, Yijun Sun, Zhenghang Ren, Han Tian, and Kai Chen

In Proceedings of the 2025 ACM SIGCOMM Conference (SIGCOMM 2025 (Short)), 2025

ATC

Towards Optimal Rack-scale μs-level CPU Scheduling through In-Network Workload Shaping

Xudong Liao, Han Tian, Xinchen Wan, Chaoliang Zeng, Hao Wang, Junxue Zhang, Mengyu Ma, Guyue Liu, and Kai Chen

In 2025 USENIX Annual Technical Conference (ATC 2025), 2025

OSDI

Enabling Efficient GPU Communication over Multiple NICs with FuseLink

Zhenghang Ren, Yuxuan Li, Zilong Wang, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Xudong Liao, Yijun Sun, Bowen Liu, Han Tian, Junxue Zhang, Mingfei Wang, Zhizhen Zhong, Guyue Liu, Ying Zhang, and Kai Chen

In Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2025), 2025

EuroSys

Achieving Fairness Generalizability for Learning-based Congestion Control with Jury

Han Tian, Xudong Liao, Decang Sun, Chaoliang Zeng, Yilun Jin, Junxue Zhang, Xinchen Wan, Zilong Wang, Yong Wang, and Kai Chen

In Proceedings of the 20th ACM European Conference on Computer Systems (EuroSys 2025), 2025

INFOCOM

A Generic and Efficient Communication Framework for Message-level In-Network Computing

Xinchen Wan, Luyang Li, Han Tian, Xudong Liao, Xinyang Huang, Chaoliang Zeng, Zilong Wang, Xinyu Yang, Ke Cheng, Qingsong Ning, Guyue Liu, Layong Luo, and Kai Chen

In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM 2025), 2025

ASPLOS

Design and Operation of Shared Machine Learning Clusters on Campus

Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, and Kai Chen

In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2025), 2025

2024

EuroSys

Astraea: Towards Fair and Efficient Learning-based Congestion Control

Xudong Liao*, Han Tian*, Chaoliang Zeng, Xinchen Wan, and Kai Chen

In Proceedings of the 19th ACM European Conference on Computer Systems (EuroSys 2024), 2024

Recent years have witnessed a plethora of learning-based solutions for congestion control (CC) that demonstrate better performance than traditional TCP schemes. However, they fail to provide consistently good convergence properties, including fairness, fast convergence and stability, due to the mismatch between their objective functions and these properties. Despite being intuitive, integrating these properties into existing learning-based CC is challenging, because: 1) their training environments are designed for the performance optimization of a single flow but are incapable of cooperative multi-flow optimization, and 2) there is no directly measurable metric to represent these properties in the training objective function. We present Astraea, a new learning-based congestion control scheme that ensures fast convergence to fairness with stability. At the heart of Astraea is a multi-agent deep reinforcement learning framework that explicitly optimizes these convergence properties during the training process by enabling the learning of an interactive policy between multiple competing flows, while maintaining high performance. We further build a faithful multi-flow environment that emulates the competing behaviors of concurrent flows, explicitly expressing convergence properties to enable their optimization during training. We have fully implemented Astraea, and our comprehensive experiments show that Astraea can quickly converge to the fairness point and exhibits better stability than its counterparts. For example, Astraea achieves near-optimal bandwidth sharing (i.e., fairness) when multiple flows compete for the same bottleneck, and delivers up to 8.4x faster convergence speed and 2.8x smaller throughput deviation, while achieving comparable or even better performance than prior solutions.
NSDI

Accelerating Neural Recommendation Training with Embedding Scheduling

Chaoliang Zeng*, Xudong Liao*, Xiaodian Cheng, Han Tian, Xinchen Wan, Hao Wang, and Kai Chen

In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 2024), 2024

Deep learning recommendation models (DLRM) are extensively adopted to support many online services. Typical DLRM training frameworks adopt the parameter server (PS) in CPU servers to maintain memory-intensive embedding tables, and leverage GPU workers with embedding cache to accelerate compute-intensive neural network computation and enable fast embedding lookups. However, such distributed systems suffer from significant communication overhead caused by the embedding transmissions between workers and PS. Prior work reduces the number of cache embedding transmissions by compromising model accuracy, including oversampling hot embeddings or applying staleness-tolerant updates. This paper reveals that many such transmissions can be avoided given the predictable and infrequent nature of in-cache embedding accesses in distributed training. Based on this observation, we explore a new direction to accelerate distributed DLRM training without compromising model accuracy, i.e., embedding scheduling, whose core idea is to proactively determine "where embeddings should be trained" and "which embeddings should be synchronized" to increase the cache hit rate and decrease unnecessary updates, thus achieving low communication overhead. To realize this idea, we design Herald, a real-time embedding scheduler consisting of two main components: an adaptive location-aware inputs allocator to determine where embeddings should be trained and an optimal communication plan generator to determine which embeddings should be synchronized. Our experiments with real-world workloads show that Herald reduces embedding transmissions by 48%-89%, leading to up to 2.11x and up to 1.61x better performance with TCP and RDMA, respectively, over 100 Gbps Ethernet for end-to-end DLRM training.

2023

SIGMOD

Scalable and Efficient Full-Graph GNN Training for Large Graphs

Xinchen Wan, Kaiqiang Xu, Xudong Liao, Yilun Jin, Kai Chen, and Xin Jin

In Proceedings of the ACM on Management of Data (SIGMOD 2023), 2023

Graph Neural Networks (GNNs) have emerged as powerful tools to capture structural information from graph-structured data, achieving state-of-the-art performance on applications such as recommendation, knowledge graphs, and search. Graphs in these domains typically contain hundreds of millions of nodes and billions of edges. However, previous GNN systems demonstrate poor scalability because large and interleaved computation dependencies in GNN training cause significant overhead in current parallelization methods. We present G3, a distributed system that can efficiently train GNNs over billion-edge graphs at scale. G3 introduces GNN hybrid parallelism, which synthesizes three dimensions of parallelism to scale out GNN training by sharing intermediate results peer-to-peer at fine granularity, eliminating the layer-wise barriers for global collective communication or neighbor replications seen in prior works. G3 leverages locality-aware iterative partitioning and multi-level pipeline scheduling to exploit acceleration opportunities by distributing balanced workloads among workers and overlapping computation with communication in both inter-layer and intra-layer training processes. We show via a prototype implementation and comprehensive experiments that G3 can achieve as much as 2.24x speedup in a 16-node cluster, and better final accuracy than prior works.
TON

Efficient DRL-Based Congestion Control With Ultra-Low Overhead

Han Tian*, Xudong Liao*, Chaoliang Zeng, Decang Sun, Junxue Zhang, and Kai Chen

IEEE/ACM Transactions on Networking, 2023

2022

CoNEXT

Spine: An Efficient DRL-Based Congestion Control with Ultra-Low Overhead

Han Tian*, Xudong Liao*, Chaoliang Zeng, Junxue Zhang, and Kai Chen

In Proceedings of the 18th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT 2022), 2022

Previous congestion control (CC) algorithms based on deep reinforcement learning (DRL) directly adjust the flow sending rate to respond to dynamic bandwidth changes, resulting in high inference overhead. Such overhead may consume considerable CPU resources and hurt datapath performance. In this paper, we present Spine, a hierarchical congestion control algorithm that fully utilizes the performance gain from deep reinforcement learning but with ultra-low overhead. At its heart, Spine decouples the congestion control task into two subtasks at different timescales and handles them with different components: i) a lightweight CC executor that performs fine-grained control in response to dynamic bandwidth changes, and ii) an RL agent that works at a coarse-grained level and generates control sub-policies for the CC executor. Such a two-level control architecture can provide fine-grained DRL-based control with a low model inference overhead. Real-world experiments and emulations show that Spine achieves consistently high performance across various network conditions with an ultra-low control overhead, reduced by at least 80% compared to its DRL-based counterparts and similar to that of classic CC schemes such as Cubic.
EuroSys

Multi-Objective Congestion Control

Yiqing Ma, Han Tian, Xudong Liao, Junxue Zhang, Weiyan Wang, Kai Chen, and Xin Jin

In Proceedings of the 17th European Conference on Computer Systems (EuroSys 2022), 2022

Decades of research on Internet congestion control (CC) have produced a plethora of algorithms that optimize for different performance objectives. Applications face the challenge of choosing the most suitable algorithm based on their needs, and it takes tremendous effort and expertise to customize CC algorithms when new demands emerge. In this paper, we explore a basic question: can we design a single CC algorithm to satisfy different objectives? We propose MOCC, the first multi-objective congestion control algorithm that attempts to address this question. The core of MOCC is a novel multi-objective reinforcement learning framework for CC that automatically learns the correlations between different application requirements and the corresponding optimal control policies. Under this framework, MOCC further applies transfer learning to transfer the knowledge from past experience to new applications, quickly adapting itself to a new objective even if it is unforeseen. We provide both user-space and kernel-space implementations of MOCC. Real-world Internet experiments and extensive simulations show that MOCC supports multiple objectives well, competing with or outperforming the best existing CC algorithms on each individual objective, and quickly adapts to new application objectives in 288 seconds (14.2x faster than prior work) without compromising old ones.

2021

ArXiv

Tacc: A full-stack cloud computing infrastructure for machine learning tasks

Kaiqiang Xu, Xinchen Wan, Hao Wang, Zhenghang Ren, Xudong Liao, Decang Sun, Chaoliang Zeng, and Kai Chen

arXiv preprint arXiv:2110.01556, 2021

Book

Datacenter Traffic Optimization with Deep Reinforcement Learning

Li Chen, Justinas Lingys, Kai Chen, and Xudong Liao

2021
