
Data Center Operations Excellence (Part 2): One Misstep Triggered Big Consequences

  • Writer: datacenterprimerja
  • Feb 27
  • 4 min read

James Soh. First published on 12th of August, 2025

In high‑availability data center environments, we put in place multiple layers of safeguards — policies, designated safe‑work procedures, redundant power design, and monitoring — so that one small misstep should not cause downtime. But sometimes, those layers align in exactly the wrong way.


In Part 1, the incident did not cause an outage. In Part 2’s example, it does, and the trigger comes from IT operations: across the industry, there are perhaps more incidents caused by IT operations than by the data center operator’s own personnel. By the way, I have moved “Data Center Operations Excellence Part 2: When Supervision Isn’t Enough” to Part 3 instead. I cannot always make the data center facilities and operations team the target; IT operations is fair game too.


What happens when you have safeguards in place and proper procedures, and still experience an incident? We’ll explore an IT operations service scenario through the Swiss cheese model of how incidents occur, and examine the factors that undermine experienced personnel during seemingly routine operations.


The Safe Rack Policy

Back when I was part of the network management team for a large organization’s data center, an outage incident was shared with all staff — as a clear reminder that documented procedures exist for a reason, and must be followed without exception.


An IT operations engineer from our outsourced provider entered the data hall for authorised work to load server OS patches onto rack‑mounted production servers.


The proper procedure, documented and briefed: Use the designated “safe rack” to plug in a notebook. This rack is kept separate from production PDUs so there’s no risk of affecting live systems when powering temporary equipment.


What happened instead: He bypassed that step, unlocked a production rack that had the servers he was to work on, and plugged his notebook charger into an outlet on the in‑rack power strip.


He didn’t check the rack, and so didn’t notice that one of the production server’s dual power supplies had been mistakenly connected to that same power strip — a cabling error made by someone else, either much earlier or sometime after the initial install. We will never know.


Then his notebook charger shorted. The result? The power strip’s fuse opened, and that one server went down an hour before the pre-announced maintenance window for the server OS patch. Because it lost power without a proper OS shutdown, data was corrupted, and recovery took real time and effort. The IT outsource service provider had to pay a penalty charge, but the damage had wider repercussions.


How the Safeguards Lined Up Wrong — the “Swiss Cheese” Effect

  • Procedural – The safe rack policy was bypassed. The procedure could also have been strengthened: upload OS patches remotely, with personnel working in pairs.

  • Design – Both of the server’s PSUs were connected to the same power strip, removing the power-path redundancy.

  • Physical – No outlet covers or lockouts on the production PDU sockets.

  • Monitoring – Had power utilization in that rack been checked per strip, the unusually uneven A/B load split could have given early warning that both of the server’s PSUs were on the same strip (see the sketch after this list).

  • Resource – Designate a remote PC/terminal for the OS patch upload procedure, and staff it in pairs: one person performs the work, the other checks.


When the “holes” in multiple layers of defense line up, a single action can have a direct path to service impact.
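
The monitoring layer above lends itself to a concrete check. Below is a minimal sketch in Python; the rack ID, wattage readings, and 80% threshold are assumptions for illustration, not values from the incident. It compares the load reported on a rack’s A-side and B-side strips and flags a heavily skewed split, one cheap early-warning signal for the kind of mis-cabling behind this incident.

```python
# Hypothetical sketch: rack ID, readings, and threshold are illustrative.
# With dual-PSU servers cabled correctly to independent A/B strips, each
# side should carry a roughly similar share of the rack load; a heavily
# skewed split hints that both PSUs of some server may share one strip.

def check_feed_balance(rack_id: str, a_feed_watts: float, b_feed_watts: float,
                       imbalance_threshold: float = 0.8) -> bool:
    """Return True and print an alert if the A/B load split looks suspicious."""
    total = a_feed_watts + b_feed_watts
    if total == 0:
        return False  # rack unpowered or sensors offline; investigate separately
    a_share = a_feed_watts / total
    if a_share > imbalance_threshold or a_share < (1 - imbalance_threshold):
        print(f"ALERT {rack_id}: feed imbalance A={a_feed_watts:.0f} W, "
              f"B={b_feed_watts:.0f} W - verify A/B cabling in this rack")
        return True
    return False

# Example: nearly all load sits on feed A, the pattern this incident would show
check_feed_balance("DH1-R42", a_feed_watts=3100.0, b_feed_watts=150.0)
```

A simple per-rack check like this will not prove where each PSU is plugged in, but it is cheap to run across every rack and narrows down which cabinets deserve a physical inspection.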


This Pattern Isn’t Unique

Variations of this have occurred across the industry:

  • Breaker trips from devices plugged into overloaded or incorrect circuits.

  • Servers or network equipment lost power because “redundant” A/B feeds came from the same source.

  • Wrong data cables disconnected because of mislabeling.

  • Foreign objects — even paper — left in racks, blocking airflow and leading to overheating.


Risk Mitigation That Works

  • Enforce SOPs — and explain the why behind them to get team buy‑in. 

  • Audit redundancy — test and prove that A/B feeds are independent. Don’t assume. 

  • Fit physical safeguards — such as socket covers or lockouts, especially for production PDUs. 

  • Use intelligent PDUs with active monitoring — for load, surge, and anomaly alerts (a minimal sketch follows this list).

  • Keep documentation and labeling accurate — update immediately after any change.

  • Do pre‑ and post‑work verification checks — every single time.
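
To make the intelligent-PDU point concrete, here is a minimal polling-loop sketch in Python. Everything in it is an assumption for illustration: read_outlet_amps() stands in for whatever interface your PDU actually exposes (an SNMP OID, a REST endpoint, a Modbus register), and the outlet names and thresholds are placeholders, not vendor values.

```python
# Hypothetical monitoring-loop sketch for intelligent PDUs. Outlet IDs and
# thresholds below are illustrative; read_outlet_amps() must be wired to
# your PDU's real management interface (SNMP, REST, Modbus, etc.).
import time

OUTLETS = ["R42-A-01", "R42-A-02", "R42-B-01", "R42-B-02"]  # placeholder IDs
MAX_AMPS = 10.0     # per-outlet alarm threshold (assumed rating)
SPIKE_DELTA = 3.0   # sudden change between polls worth a warning

def read_outlet_amps(outlet_id: str) -> float:
    # Stand-in: replace with a real query to the PDU's management interface.
    raise NotImplementedError

def poll_once(previous: dict) -> dict:
    """Read every outlet once; alert on overload, warn on sudden load jumps."""
    current = {}
    for outlet in OUTLETS:
        amps = read_outlet_amps(outlet)
        current[outlet] = amps
        if amps > MAX_AMPS:
            print(f"ALERT {outlet}: {amps:.1f} A exceeds {MAX_AMPS} A limit")
        elif outlet in previous and abs(amps - previous[outlet]) > SPIKE_DELTA:
            print(f"WARN {outlet}: load jumped {previous[outlet]:.1f} A -> {amps:.1f} A")
    return current

def monitor(poll_seconds: int = 60) -> None:
    readings: dict = {}
    while True:
        readings = poll_once(readings)
        time.sleep(poll_seconds)
```

A sudden jump on an outlet that should be idle, like a notebook charger plugged into a production strip, is exactly the kind of anomaly a loop like this would surface.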


Strengthen the Layers

Beyond these fixes:

  • Build a defence‑in‑depth approach combining procedural, technical, and physical safety layers.

  • Run regular training and refreshers — especially on power redundancy and safe‑rack use.

  • Conduct post‑incident reviews without blame — to find and fix systemic weaknesses.

  • Carry out simulation drills — so the team can respond quickly if a power event occurs.


The Takeaway

Operational excellence isn’t just about reacting after outages. It’s about building everyday habits, system design, and procedural discipline so that no single mistake can bring services down.


Whether it’s an act of curiosity in front of a locked rack (Part 1) or the convenience of bypassing a designated safe rack (Part 2), small deviations can ripple quickly in high‑stakes environments. And if you search the Internet, you will find that a DNS entry error or a network route change error has caused major Internet service problems. Could those, too, be examples of bypassing some of the controls mentioned above? Food for thought.

What’s one small, daily safeguard in your operations that has quietly saved you from a bigger problem?

