AI DC – Renaissance and New Thinking Required. Article 4 of 5
Article 4: The Operations Workforce the AI DC Needs
James Soh

This article speaks most directly to operational leaders. The implications run through to C-level leadership and the operations workforce.
There is a boundary in almost every data center operations team, Southeast Asia included.
It runs along the edge of the data hall. On one side, the facilities team. Power, cooling, physical infrastructure, building management systems. On the other side, the compute. Servers, networking, storage, software. Two domains, two teams, two sets of responsibilities, two sets of escalation paths.
In a traditional data center, that boundary made operational sense. The facility was the product. The compute was the tenant's problem. The operations team maintained the shell and the systems that served it. What ran inside the racks was outside their scope and outside their training.
In an AI data center, that boundary is a hindrance to operations excellence.
This is not a minor inefficiency or an organisational preference to be revisited at the next headcount review. It is a structural liability that will produce financial, operational, and reputational failures if it is not addressed before the facility goes live.
Article 2 described the knowledge boundary that contracted across the industry during the x86 era and named the cost of that contraction at the C-level and project leadership layer. Article 3 addressed it for design and construction professionals. The same contracted boundary runs all the way through to the operations floor. And on the operations floor, the consequences of that contraction are immediate, physical, and measured in minutes.
This article is addressed primarily to operational leaders. You are the people with the authority to redefine where the boundary sits. The C-level sets the mandate and the investment. The operations workforce delivers the execution. But the operational leader is the one who decides what competence means in their organisation. That decision has never mattered more than it does right now.
The VAX Cluster Operator
Let me go back to where this series began.
When I administered a DEC VAX cluster, my job did not stop at the computer room door. It started there. The machine was my responsibility. The operating system. The scheduler. The batch queues that managed long compute jobs. Three operators on shift maintained a live view of every critical batch job: month-end closes, quarterly report runs, payroll cycles.
Missing a window was not a system event. It was a business failure. The interactive sessions serving users who needed near-immediate response were my responsibility too. So were the interconnects between nodes, the behaviour of the system under full load versus partial load, the thermal characteristics of the hardware, and what happened when a component failed and how the system compensated. That was not a one-person operation. The operators, the systems administrators, the DEC field service team, and the facilities engineers all worked from the same shared picture of the machine. When DEC’s remote service link flagged an anomaly, their engineer was on site before we raised a fault. Everyone acted on the same information.
No one waited for someone else’s domain to declare a problem first.
I was not a facilities engineer. But I understood the environment the machine needed to survive and perform. I understood that a cooling problem was a compute problem. I understood that a power event had specific and predictable consequences for the workload.
I understood the machine as a system, not as a black box sitting on raised floor.
That discipline (the operator who knows the machine) is what the AI data center needs to rediscover. Not because the technology is the same. It is not. But because the principle is the same. The facility and the compute are one integrated system. Operating them as separate domains introduces a gap that failure will find.
That integrated discipline delivered 99.98 percent system availability through the 1990s. The AI data center demands the same from its operations team, at a power density and performance dependency that leaves no margin for the domain gap.
What Happens When the CDU Fails
Let me make this concrete.
A coolant distribution unit (CDU) serving a Vera Rubin NVL72 rack develops a fault. Coolant flow to the rack drops below threshold. GPU junction temperatures begin to climb.
In a traditional operations model, the CDU fault is a facilities event. It is logged, escalated to the mechanical team, and a work order is raised. The severity assessment is based on facilities criteria. How quickly can the unit be serviced? Is there redundancy in the cooling circuit? What is the impact on the data hall environment?
The compute dimension is absent from that assessment because the facilities team does not have visibility into it. They do not know that the rack is running a training job consuming 200 kilowatts continuously. They do not know that GPU thermal throttling begins within minutes of coolant flow degradation. They do not know that the training job, if interrupted at this point, loses hours of completed work and must restart from the last checkpoint. They do not know that the customer SLA has a direct financial penalty clause triggered by compute downtime.
By the time the compute team is in the room, the damage is done.
This is not a hypothetical scenario. It is a predictable failure mode in any AI DC operations model that treats the facility and the compute as separate domains. The CDU fault is not a facilities event. It is a compute outage with a facilities cause. That distinction determines how fast the right people respond and with what information.
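To make that distinction concrete, here is a minimal sketch of what an integrated severity assessment could look like. Everything in it is illustrative: the field names, the thresholds, and the idea of querying a scheduler for rack context are assumptions made for the example, not a description of any real DCIM or scheduler API.

```python
from dataclasses import dataclass

# All names and thresholds below are illustrative assumptions, not drawn from
# any real DCIM or scheduler. The point is the shape of the decision, not the values.

@dataclass
class CduFault:
    rack_id: str
    coolant_flow_pct: float       # percent of nominal flow
    redundant_unit_available: bool

@dataclass
class ComputeContext:
    rack_power_kw: float          # current sustained draw
    job_type: str                 # "training", "inference", "idle"
    minutes_since_checkpoint: float
    sla_penalty_on_downtime: bool

def assess_severity(fault: CduFault, ctx: ComputeContext) -> str:
    """Classify a CDU fault with the compute dimension included."""
    # Facilities-only view: degraded flow with a redundant unit looks routine.
    facilities_view = (
        "routine" if fault.coolant_flow_pct >= 80 and fault.redundant_unit_available
        else "urgent"
    )

    # Compute view: a loaded training rack near nameplate power has minutes of
    # margin, and an interruption discards work back to the last checkpoint.
    loaded_training = ctx.job_type == "training" and ctx.rack_power_kw > 150
    expensive_interrupt = ctx.minutes_since_checkpoint > 30 or ctx.sla_penalty_on_downtime

    if loaded_training and fault.coolant_flow_pct < 80:
        return "critical: page facilities and compute on-call together"
    if loaded_training and expensive_interrupt:
        return "urgent: compute team in the loop from the first minute"
    return facilities_view
```

The facilities-only branch and the combined branch can reach opposite conclusions from the same fault, which is the entire point: the severity of a CDU event is a property of the whole system, not of the cooling circuit alone.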
Running at Max Changes Everything
Traditional data center operations were designed around headroom. The facility ran well below its thermal and power limits most of the time. When something went wrong, the thermal mass of underloaded systems bought time. A partial cooling failure in a cloud data center gave the operations team minutes, sometimes longer, to assess and respond before compute performance was affected.
An AI data center running training workloads operates at or near nameplate power continuously. There is no thermal headroom to absorb a fault. The margin between normal operating condition and thermal damage to hardware is measured in minutes, not hours.
NVIDIA engineered dynamic power management into the Vera Rubin platform through DSX Max-Q precisely because power swings during training workloads are severe enough to stress facility power distribution systems. The chip vendor is solving facility-level problems in silicon and firmware because the integration between compute and facility is that tight.
The operations team that does not understand what is happening inside the rack cannot interpret what the facility monitoring systems are telling them in the context of compute risk. They are reading instruments without understanding the machine the instruments are attached to.
The Operational Leader as Protagonist
This is fixable. But it requires operational leaders to make an explicit decision.
The decision is not about technology. It is about scope. The operational leader must decide that the knowledge boundary of their team includes the compute layer. Not at the depth of a GPU engineer. But at the depth of operational awareness. What is running. At what load. What the thermal and power implications are. What a facility fault means for the workload. What the recovery sequence looks like when both systems are involved.
That decision has organisational consequences. Job descriptions need to change. Training programmes need to include compute fundamentals. Escalation paths need to include both the facilities team and the compute team from the first moment of a significant fault. Monitoring and DCIM systems need to present facility data in the context of compute impact, not just facility status.
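What presenting facility data in the context of compute impact might mean in practice is sketched below. Every function and field name is an assumption for illustration rather than a reference to a real DCIM or cluster scheduler; the plumbing will differ by site.

```python
# Illustrative sketch only. scheduler_lookup and power_meter_lookup stand in
# for whatever integrations the site actually has; the field names are assumptions.

def enrich_facility_alert(alert: dict, scheduler_lookup, power_meter_lookup) -> dict:
    """Attach compute impact to a raw facility alert before it is routed."""
    rack = alert["rack_id"]
    job = scheduler_lookup(rack)        # what is running on this rack right now
    draw_kw = power_meter_lookup(rack)  # current sustained rack power draw

    alert["compute_impact"] = {
        "job_name": job.get("name", "none"),
        "job_type": job.get("type", "idle"),
        "rack_power_kw": draw_kw,
        "restart_cost_minutes": job.get("minutes_since_checkpoint", 0),
        "sla_exposure": job.get("sla_penalty_on_downtime", False),
    }

    # Escalation includes both teams from the first moment of a significant fault.
    alert["notify"] = ["facilities-oncall"]
    if alert["compute_impact"]["job_type"] != "idle":
        alert["notify"].append("compute-oncall")
    return alert
```

However the integration is built, the organisational requirement it expresses is the same: the person deciding how to respond sees the facility fault and the compute exposure in one view.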
C-level leadership must fund and mandate this. The operational leader cannot bridge the domain gap without headcount, training budget, and organisational authority to redefine scope. An AI DC operations team resourced and structured like a traditional colocation operations team is not fit for purpose. Recognising that is a C-level responsibility.
The Professional Opportunity
I want to be direct with the operations workforce reading this.
The AI data center is not a threat to your professional relevance. It is the most significant expansion of your professional domain in the history of the industry.
The engineer who develops genuine operational knowledge of both the facility layer and the compute layer, who can read a CDU fault in the context of GPU thermal risk, who understands the relationship between power distribution events and training job integrity, who can communicate meaningfully with both the mechanical team and the compute team during an incident, is the most valuable person in an AI data center.
That person is exactly who most DC operators need, especially in Southeast Asia, and especially the Neocloud operators. The operations workforce that builds this capability now, before the AI DC buildout in the region reaches full maturity, will define the professional standard for the next decade.
The VAX cluster was the AI system of its day. A tightly integrated machine that demanded a whole team to keep it running: the operator on shift who knew the scheduler, the systems administrator who understood the workload, the DEC field service engineer who arrived before the fault was raised, the facilities engineer who kept the environment stable. No one owned a boundary. Everyone owned the outcome. The Vera Rubin NVL72 demands exactly the same. The operations workforce that understands what the whole system depends on to run smoothly is the one that keeps it running.
Next: Article 5 -- Full Circle: The Machine Is the Building Again


