Jul 07, 2023

The Landlords of AI

Chris Sharp is CTO of Digital Realty

Why modularity will be critical for data centers and the AI economy

The sudden emergence of large-scale commercial AI over the past year, especially new generative AI applications such as ChatGPT, has pushed a new set of technical requirements onto the data center facilities where these applications reside. The infrastructure that supports them will draw more power, chew through more data, and use more bandwidth than ever before, all within facilities that may have been built 20 years ago. These facilities now need to adapt to support what may, in some cases, be an order-of-magnitude increase in power draw per rack.


The only way to achieve this is with a modular design.

Data centers may seem like highly static entities. They’re typically enormous brick-and-mortar buildings with row after row of generators and other equipment outside, all designed carefully to keep the facility operating under everything from typical day-to-day conditions to a total blackout of the electrical grid, without interruption. However, the modern data center is anything but static; many facilities are designed from the beginning to be highly modular, and a given data center floor may be adapted for changes in network topology, airflow considerations and physical redundancy several times a year if required. What drives this need, and how is it fulfilled?

The widespread emergence of AI deployments in the data center shows how quickly customer requirements can change. Where only last year a data center operator may have been able to plan on an average power draw of 10 kilowatts per rack of customer equipment, the need for increasingly large blocks of 25-, 50-, or even 100-kilowatt racks at different places across that same data center facility is here and will only continue to grow. With a traditional static design, this can create many problems in terms of performance, maintenance, and redundancy.
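To make the scale of that shift concrete, here is a minimal arithmetic sketch. The 10-kilowatt planning average and the 50- and 100-kilowatt densities come from the text; the 20-rack row and its particular mix of racks are hypothetical, chosen only for illustration.

```python
# Planning average cited in the text; everything else below is illustrative.
PLANNED_KW_PER_RACK = 10


def row_load_kw(rack_kw):
    """Total power draw of a row of racks, in kilowatts."""
    return sum(rack_kw)


def overload_ratio(rack_kw, planned_kw_per_rack=PLANNED_KW_PER_RACK):
    """Actual row load divided by the load the row was planned for."""
    planned_kw = planned_kw_per_rack * len(rack_kw)
    return row_load_kw(rack_kw) / planned_kw


# Hypothetical row: 16 conventional racks plus four dense AI racks.
mixed_row = [10] * 16 + [50, 50, 100, 100]

print(f"Row load: {row_load_kw(mixed_row)} kW")
print(f"Multiple of planned load: {overload_ratio(mixed_row):.1f}x")
```

In this made-up row, four dense racks out of 20 are enough to more than double the load the row was planned for, and that extra load is concentrated at one end rather than spread evenly.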

Firstly, such dense racks often require more network bandwidth to operate at their highest level of efficiency. This is often overlooked, and a customer will be very unhappy if they deploy such a dense rack (or 10, or 100 of them) and then can’t get the bandwidth that they require.

Secondly, an uneven increase in power draw across the floor of a data center can often stress a cooling system that was not designed to accommodate these types of hot spots. A dense rack on one end of a row in the data center could easily lead to increased temperatures at the other end.
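A rough back-of-the-envelope sketch shows why dense racks strain an air-cooled floor. It uses the standard sensible-heat relation (airflow = heat / (air density × specific heat × temperature rise)) and assumes that essentially all electrical power drawn by a rack ends up as heat. The air properties are approximate textbook values, and the 12-kelvin temperature rise across the rack is an assumption, not a figure from the article.

```python
AIR_DENSITY = 1.2         # kg/m^3, approximate for air near room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate


def required_airflow_m3s(rack_kw, delta_t_k):
    """Volumetric airflow (m^3/s) needed to remove rack_kw of heat
    with an inlet-to-outlet air temperature rise of delta_t_k kelvin."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    heat_w = rack_kw * 1000.0
    return heat_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_k)


# Assumed 12 K temperature rise across each rack.
for kw in (10, 25, 50, 100):
    print(f"{kw:>3} kW rack: {required_airflow_m3s(kw, 12):.2f} m^3/s")
```

Under these assumptions the required airflow scales linearly with rack power, so a 100-kilowatt rack needs ten times the air of a 10-kilowatt rack in the same footprint. A cooling system sized for an even floor has little room to absorb that.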

Finally, resiliency and redundancy measures are based on where specific electrical loads are across the facility and how they are distributed. If a very dense cluster of equipment is added in one area, static designs may not be able to ensure that it is covered by enough reliable generator capacity.

For the AI customer, each of these concerns is significant: the consequences range from being unable to run AI equipment at its full performance potential to incurring unwanted downtime during a power outage or other stress on the local electrical grid. By using a highly adaptable modular design framework, these issues can be addressed in data centers of any age.

For one, spaces can be repurposed, or designed in from the beginning of the facility, as additional network rooms, allowing more network circuits, switches, and routers to be installed to boost network bandwidth to the customer over time. Additionally, a modular method of designing and deploying overhead cable trays allows the data center operator to physically bring that connectivity to the customer, a step that is often overlooked in static, inflexible designs. Some AI-enabling technologies such as InfiniBand can use large, heavy cabling, which is only feasible to install with a modular approach if real performance and operational issues are to be avoided down the line.

Understanding the true state of cooling across a facility through the use of computational fluid dynamics (CFD) gives the data center operator the means to identify trapped airflow, unintended airflow patterns that may result in suboptimal cooling, and places where spare air capacity exists that can be used to cool dense, hot AI deployments. Many data center facilities can also be modular enough to be upgraded from an air-only cooling configuration to a hybrid setup where air and liquid cooling (both air-assisted liquid cooling, or AALC, and direct liquid cooling, or DLC) are available on an as-needed basis, allowing AI deployments to take place as part of an existing data center floor or larger suite.

With a modular power configuration – where the data center is conceptualized as a series of blocks each with its own supporting power, backup and cooling infrastructure – core components can be sized and deployed appropriately based on the customer deployment in relatively small increments to ensure that as deployments are added to a space, even if they differ wildly in power consumption, they can be supported at the expected level of resiliency.
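The incremental sizing described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any operator's actual design: the 250-kilowatt module size, the N+1 redundancy scheme, and the example loads are all hypothetical.

```python
import math


def modules_needed(load_kw, module_kw, spares=1):
    """Number of equal-sized power modules for a block that must carry
    load_kw while keeping `spares` modules in reserve (N + spares)."""
    if load_kw < 0 or module_kw <= 0 or spares < 0:
        raise ValueError("load_kw and spares must be >= 0, module_kw > 0")
    return math.ceil(load_kw / module_kw) + spares


def survives_single_failure(load_kw, module_kw, installed):
    """True if the block still covers load_kw after losing one module."""
    return (installed - 1) * module_kw >= load_kw


MODULE_KW = 250  # hypothetical module size

# A block originally sized for a 200 kW deployment...
original = modules_needed(200, MODULE_KW)
# ...then asked to carry 460 kW after dense racks are added.
upgraded = modules_needed(460, MODULE_KW)

print(f"Modules for 200 kW: {original}")
print(f"Modules for 460 kW: {upgraded}")
print("Original block covers 460 kW after one failure:",
      survives_single_failure(460, MODULE_KW, original))
print("Upgraded block covers 460 kW after one failure:",
      survives_single_failure(460, MODULE_KW, upgraded))
```

The point of the sketch is that, with small, uniform increments, restoring the expected resiliency for a denser deployment means adding a module to the affected block rather than redesigning the facility's power plant.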

These are just a few examples of how a modular approach to data center design helps to ensure that AI deployments, even at very high rack densities, can be supported in a highly performant, robust, and cost-effective fashion in an existing data center facility. Modular designs will be the difference between being able to support current and future generations of AI deployments in existing sites and needing to build new ones.