Amin Vahdat

Amin Vahdat

Amin Vahdat is a Fellow and Chief Technologist for AI Infrastructure at Google, where his team is responsible for delivering industry-leading infrastructure which spans custom silicon, data centers, network, and supply chain and operations. This infrastructure serves Alphabet, Google and the world, and Artificial Intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google.

Before joining Google, Amin was the Science Applications International Corporation (SAIC) Professor of Computer Science and Engineering at UC San Diego (UCSD). He received his doctorate from the University of California Berkeley in computer science, and is a Fellow of the Association for Computing Machinery (ACM).

Amin has been recognized with a number of awards, including the National Science Foundation (NSF) CAREER award, the UC Berkeley Distinguished EECS Alumni Award, the Alfred P. Sloan Fellowship, the Association for Computing Machinery's SIGCOMM Networking Systems Award, and the Duke University David and Janet Vaughn Teaching Award. Amin was awarded the SIGCOMM lifetime achievement award for his contributions to data center and wide area networks. He was inducted into the National Academy of Engineering in 2023 for his contributions to the design and implementation of datacenter and planet-scale networks that power cloud computer systems.

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
Managing and Securing Google's Fleet of Multi-Node Servers
Richard Hanley
Havard Skinnemoen
Andrés Lagar-Cavilla
Michael Wong
Jon McCune
Jeff Andersen
Kishan Prasad
Patrick Leis
Shiva Rao
Chris Koch
Jad Baydoun
Anna Sapek
Communications of the ACM, 69:3 (2026), pp. 82 - 92
Preview abstract Server hardware and software co-design for a secure, efficient cloud. View details
Preview abstract Datacenter network hotspots, defined as links with persistently high utilization, can lead to performance bottlenecks.In this work, we study hotspots in Google’s datacenter networks. We find that these hotspots occur most frequently at ToR switches and can persist for hours. They are caused mainly by bandwidth demand-supply imbalance, largely due to high demand from network-intensive services, or demand exceeding available bandwidth when compute/storage upgrades outpace ToR bandwidth upgrades. Compounding this issue is bandwidth-independent task/data placement by data-center compute and storage schedulers. We quantify the performance impact of hotspots, and find that they can degrade the end-to-end latency of some distributed applications by over 2× relative to low utilization levels. Finally, we describe simple improvements we deployed. In our cluster scheduler, adding hotspot-aware task placement reduced the number of hot ToRs by 90%; in our distributed file system, adding hotspot-aware data placement reduced p95 network latency by more than 50%. While congestion control, load balancing, and traffic engineering can efficiently utilize paths for a fixed placement, we find hotspot-aware placement – placing tasks and data under ToRs with higher available bandwidth – is crucial for achieving consistently good performance. View details
Fathom: Understanding Datacenter Application Network Performance
Junhua Yan
Mubashir Adnan Qureshi
Soheil Hassas Yeganeh
Van Jacobson
Yousuk Seung
Proceedings of ACM SIGCOMM 2023
Preview abstract We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services. View details
Change Management in Physical Network Lifecycle Automation
Virginia Beauregard
Kevin Grant
Angus Griffith
Jahangir Hasan
Chen Huang
Quan Leng
Jiayao Li
Alexander Lin
Zhoutao Liu
Ahmed Mansy
Bill Martinusen
Nikil Mehta
Andrew Narver
Anshul Nigham
Melanie Obenberger
Sean Smith
Kurt Steinkraus
Sheng Sun
Edward Thiele
Proc. 2023 USENIX Annual Technical Conference (USENIX ATC 23)
Preview abstract Automated management of a physical network's lifecycle is critical for large networks. At Google, we manage network design, construction, evolution, and management via multiple automated systems. In our experience, one of the primary challenges is to reliably and efficiently manage change in this domain -- additions of new hardware and connectivity, planning and sequencing of topology mutations, introduction of new architectures, new software systems and fixes to old ones, etc. We especially have learned the importance of supporting multiple kinds of change in parallel without conflicts or mistakes (which cause outages) while also maintaining parallelism between different teams and between different processes. We now know that this requires automated support. This paper describes some of our network lifecycle goals, the automation we have developed to meet those goals, and the change-management challenges we encountered. We then discuss in detail our approaches to several specific kinds of change management: (1) managing conflicts between multiple operations on the same network; (2) managing conflicts between operations spanning the boundaries between networks; (3) managing representational changes in the models that drive our automated systems. These approaches combine both novel software systems and software-engineering practices. While this paper reports on our experience with large-scale datacenter network infrastructures, we are also applying the same tools and practices in several adjacent domains, such as the management of WAN systems, of machines, and of datacenter physical designs. Our approaches are likely to be useful at smaller scales, too. View details
CAPA: An Architecture For Operating Cluster Networks With High Availability
Bingzhe Liu
Mukarram Tariq
Omid Alipourfard
Rich Alimi
Deepak Arulkannan
Virginia Beauregard
Patrick Conner
Brighten Godfrey
Xander Lin
Mayur Patel
Joon Ong
Amr Sabaa
Alex Smirnov
Manish Verma
Prerepa Viswanadham
Google, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
Preview abstract Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation“ layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with a 82% reduction in cumulative duration of incidents normalized to fleet size over five years View details
Improving Network Availability with Protective ReRoute
Abdul Kabbani
Van Jacobson
Jim Winget
Brad Morrey
Uma Parthavi Moravapalle
Steven Knight
SIGCOMM 2023
Preview abstract We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63--84%. This is the equivalent of adding 0.4--0.8 "nines'" of availability. View details
Understanding Host Interconnect Congestion
Gautam Kumar
Khaled Elmeleegy
Masoud Moshref
Rachit Agarwal
Saksham Agarwal
Sylvia Ratnasamy
Association for Computing Machinery, New York, NY, USA (2022), 198–204
Preview abstract We present evidence and characterization of host congestion in production clusters: adoption of high-bandwidth access links leading to emergence of bottlenecks within the host interconnect (NIC-to-CPU data path). We demonstrate that contention on existing IO memory management units and/or the memory subsystem can significantly reduce the available NIC-to-CPU bandwidth, resulting in hundreds of microseconds of queueing delays and eventual packet drops at hosts (even when running a state-of-the-art congestion control protocol that accounts for CPU-induced host congestion). We also discuss implications of host interconnect congestion to design of future host architecture, network stacks and network protocols. View details
Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
Joon Ong
Arjun Singh
Mukarram Tariq
Rui Wang
Jianan Zhang
Virginia Beauregard
Patrick Conner
Rishi Kapoor
Stephen Kratzer
Nanfang Li
Hong Liu
Karthik Nagaraj
Jason Ornstein
Samir Sawhney
Ryohei Urata
Lorenzo Vicisano
Kevin Yasumura
Shidong Zhang
Junlan Zhou
Proceedings of ACM SIGCOMM 2022
Preview abstract We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, 30% reduction in capex, 41% reduction in power, incremental deployment and technology refresh all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: A datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves similar throughput as Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes direct path from source to destination aggregation blocks, while the remaining transits one additional block, achieving an average blocklevel path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch panel based interconnect. View details
Aquila: A unified, low-latency fabric for datacenter networks
Hema Hariharan
Eric Lance
Moray Mclaren
Stephen Wang
Zhehua Wu
Sunghwan Yoo
Raghuraman Balasubramanian
Prashant Chandra
Michael Cutforth
Peter James Cuy
David Decotigny
Rakesh Gautam
Rick Roy
Zuowei Shen
Ming Tan
Ye Tang
Monica C Wong-Chan
Joe Zbiciak
Aquila: A unified, low-latency fabric for datacenter networks (2022)
Preview abstract Datacenter workloads have evolved from the data intensive, loosely-coupled workloads of the past decade to more tightly coupled ones, wherein ultra-low latency communication is essential for resource disaggregation over the network and to enable emerging programming models. We introduce Aquila, an experimental datacenter network fabric built with ultra-low latency support as a first-class design goal, while also supporting traditional datacenter traffic. Aquila uses a new Layer 2 cell-based protocol, GNet, an integrated switch, and a custom ASIC with low-latency Remote Memory Access (RMA) capabilities co-designed with GNet. We demonstrate that Aquila is able to achieve under 40 μs tail fabric Round Trip Time (RTT) for IP traffic and sub-10 μs RMA execution time across hundreds of host machines, even in the presence of background throughput-oriented IP traffic. This translates to more than 5x reduction in tail latency for a production quality key-value store running on a prototype Aquila network. View details
Preview abstract A modern datacenter hosts thousands of services with a mix of latency-sensitive, throughput-intensive, and best-effort traffic with high degrees of fan-out and fan-in patterns. Maintaining low tail latency under high overload conditions is difficult, especially for latency-sensitive (LS) RPCs. In this paper, we consider the challenging case of providing service-level objectives (SLO) to LS RPCs when there are unpredictable surges in demand. We present Aequitas, a distributed sender-driven admission control scheme that is anchored on the key conceptual insight: Weighted-Fair Quality of Service (QoS) queues, found in standard NICs and switches, can be used to guarantee RPC level latency SLOs by a judicious selection of QoS weights and traffic-mix across QoS queues. Aequitas installs cluster-wide RPC latency SLOs by mapping LS RPCs to higher weight QoS queues, and coping with overloads by adaptively apportioning LS RPCs amongst QoS queues based on measured completion times for each queue. When the network demand spikes unexpectedly to 25× of provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-art congestion control at the 99.9th-p and admits 15× more RPCs meeting SLO target compared to pFabric when RPC sizes are not aligned with priorities. View details
×