Pulling Back the Curtain on Google’s Network Infrastructure

We adopted a set of principles to organize our networks that is now the primary driver for networking research and industrial innovation, Software Defined Networking (SDN). We observed that we could arrange emerging merchant switch silicon around a Clos topology to scale to the bandwidth requirements of a data center building. The topology of all five generations of our data center networks follow the blueprint below. Unfortunately, this meant that we would potentially require 10,000+ individual switching elements. Even if we could overcome the scalability challenges of existing network protocols, managing and configuring such a vast number of switching elements would be impossible.

So, our next observation was that we knew the exact configuration of the network we were trying to build. Every switch would play a well-defined role in a larger-ensemble. That is, from the perspective of a datacenter building, we wished to build a logical building-scale network. For network management, we built a single configuration for the entire network and pushed the single global configuration to all switches. Each switch would simply pull out its role in the larger whole. Finally, we transformed routing from a pair-wise, fully distributed (but somewhat slow and high overhead) scheme to a logically-centralized scheme under the control of a single dynamically-elected master as illustrated in the figure below from our paper.

Our SIGCOMM paper on Jupiter is one of four Google-authored papers being presented at that conference. Taken together, these papers show the benefits of embedding research within production teams, reflecting both collaborations with PhD students carrying out extended research efforts with Google engineers during internships as well as key insights from deployed production systems:

  • Our work on Bandwidth Enforcer shows how we can allocate wide area bandwidth among tens of thousands of individual applications based on centrally configured policy, substantially improving network utilization while simultaneously isolating services from one another.
  • Condor addresses the challenges of designing data center network topologies. Network designers can specify constraints for data center networks; Condor efficiently generates candidate network designs that meet these constraints, and evaluates these candidates against a variety of target metrics.
  • Congestion control in datacenter networks is challenging because of tiny buffers and very small round trip times. TIMELY shows how to manage datacenter bandwidth allocation while maintaining highly responsive and low latency network roundtrips in the data center.

These efforts reflect the latest in a long series of substantial Google contributions in networking. We are excited about being increasingly open about results of our research work: to solicit feedback on our approach, to influence future research and development directions so that we can benefit from community-wide efforts to improve networking, and to attract the next-generation of great networking thinkers and builders to Google. Our focus on Google Cloud Platform further increases the importance of being open about our infrastructure. Since the same network powering Google infrastructure for a decade is also the underpinnings of our Cloud Platform, all developers can leverage the network to build highly robust, manageable, and globally scalable services.