Data Center Operations Excellence Part 3: When Buddy System Failed
- datacenterprimerja
- Feb 27
James Soh. First published on 19th of August, 2025
The Best Laid Plans
Two facility operations engineers accompanied one of two service engineers tasked with routine maintenance on a power panel that operates the UPS. The other service engineer worked unattended in another part of the UPS room. They followed protocol: supervision was in place, procedures were documented, and the team was experienced.
Then it happened. The tall service engineer’s sleeve caught the UPS bypass switch, a handlebar-type switch with no protective cover, which good practice would have called for. Instantly, client data hall equipment lost power. Recovery was fast, but the disruption’s impact was real.
"We did everything right," the manager reflected. "We had supervision, qualified personnel, and established procedures. But we still had an incident."
This scenario shows the evolution needed in operations thinking: supervision alone isn’t enough. True operational excellence moves beyond oversight to full risk management that accounts for human factors, environment, and failure modes.
Beyond Basic Supervision
The incident reveals how compliance with procedures can still produce failure. Earlier focus was on knowledge and supervision. Now excellence requires systematic risk management addressing:
Human factors in design
Failure mode analysis
Multi-layered safeguards
Environmental resilience
The Five Dimensions Under Stress
The UPS event impacted:
Availability: equipment lost power despite fast recovery
Security: disruptions triggered resets and monitoring gaps
Integrity: risk of data corruption and inconsistency
Confidentiality: restart vulnerabilities
Energy Efficiency: emergency mode bypassed normal efficiency
A single sleeve contact endangered all dimensions simultaneously, showing how failures propagate systemically.
Human Factors
The Good Work Space Problem
This highlights anthropometric variability: people differ in height and reach. The work space was constrained and the switch was unguarded, so a tall engineer’s sleeve caught it. Two design layers, clearance and switch protection, were both overlooked.
Best practices include:
Access and work space survey
Adequate clearances
Movement protocols
Recessed or guarded switches
Communication and briefing
The Routine Numbness Trap
Experienced staff, repeatedly performing familiar tasks, risk cognitive lapses:
Attention fade
Assumptions of no danger
Complacency
Automatic, unmonitored actions
This degrades multiple defensive layers simultaneously.
Broken Buddy System and Supervision
One engineer was supervised; the other was not, fracturing oversight:
Divided attention weakens supervision
Efficiency drives cut safety corners
Full, uninterrupted monitoring is crucial
Advanced Risk Management
Swiss Cheese Model
The incident illustrates aligned holes in protective layers:
Unprotected switch
Split supervision
No ergonomic consideration
Constrained workspace
Mitigations require independent, layered defenses: physical barriers, procedural checks, technical interlocks, organizational culture.
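A rough way to see why independent layers matter is to multiply their failure probabilities. The sketch below assumes the layers fail independently, and every number and layer name in it is a hypothetical illustration, not a measurement from this incident.

```python
from math import prod


def breach_probability(layer_failure_probs):
    """Chance that every layer fails on the same task, assuming independence."""
    return prod(layer_failure_probs)


# Hypothetical per-task failure probabilities for four defensive layers.
layers = {
    "switch_guard": 0.05,
    "continuous_buddy_supervision": 0.10,
    "workspace_clearance": 0.10,
    "method_statement_check": 0.05,
}

all_layers = breach_probability(layers.values())

# A missing layer always "fails", so its probability becomes 1.0.
no_guard = breach_probability(
    1.0 if name == "switch_guard" else p for name, p in layers.items()
)

print(f"all four layers in place: {all_layers:.6f}")
print(f"switch guard missing:     {no_guard:.6f}")
```

With these assumed figures, removing a single layer makes the combined failure about twenty times more likely. If layers share a common cause, such as one engineer's attention covering two of them, the independence assumption breaks and the real risk is higher than this product suggests.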
Failure Mode & Environmental Resilience
Analysis should find:
Single failure points
Common cause vulnerabilities
Predictive human error models
Rapid recovery plans
Improved environment: better signage, guarded controls, and workspace design.
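One common way to rank the findings of such an analysis is the risk priority number used in FMEA: severity times occurrence times detection, each scored from 1 to 10. The article does not say this method was used here, so treat the following as a minimal sketch; the failure modes and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 = negligible, 10 = data hall loses power
    occurrence: int  # 1 = rare, 10 = frequent
    detection: int   # 1 = almost always caught, 10 = almost never caught

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means address it sooner."""
        return self.severity * self.occurrence * self.detection


# Hypothetical scores, for illustration only.
modes = [
    FailureMode("Service engineer working unsupervised", 7, 5, 6),
    FailureMode("Unguarded bypass switch bumped", 9, 4, 8),
    FailureMode("Step skipped in maintenance checklist", 9, 3, 7),
]

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

The ranking is only as good as the scores, which is why the scoring session should include the operations staff who actually work in the room.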
Building Excellence
Five pillars:
Design for Operations
Predictive risk methods
Adaptive procedures
Continuous learning
Technology-enabled awareness
Practical Guidelines
When the electrical consultant or design engineer specifies switches and breakers, prefer protective covers and dual-confirmation switches wherever possible. Design the work space with enough room for free movement.
Strict method statements requiring continuous buddy supervision
Anti-complacency briefings and communication
Full job closure protocols
Avoid divided supervision; prioritize resource allocation
Workspace and control improvements
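The buddy and job-closure guidelines above can be sketched as a gated checklist: a step only closes when two different people confirm it, and steps cannot be skipped. This is a hypothetical illustration; the class names, step wording, and person labels are assumptions, not taken from any real method statement or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str
    confirmed_by: set = field(default_factory=set)


class MethodStatement:
    """Steps close strictly in order, each confirmed by two different people."""

    def __init__(self, descriptions):
        self.steps = [Step(d) for d in descriptions]
        self.current = 0  # index of the step awaiting confirmation

    def confirm(self, person):
        """Record a confirmation; return True once the current step closes."""
        if self.closed:
            raise RuntimeError("all steps are already closed")
        step = self.steps[self.current]
        step.confirmed_by.add(person)  # a repeat by the same person adds nothing
        if len(step.confirmed_by) >= 2:
            self.current += 1
            return True
        return False

    @property
    def closed(self):
        return self.current == len(self.steps)


ms = MethodStatement([
    "Set the UPS under maintenance to de-link the synchronous tie",
    "Measure voltage",
    "Restore normal operation and close the job",
])
first = ms.confirm("technician_a")   # one confirmation: step stays open
repeat = ms.confirm("technician_a")  # same person again: still open
second = ms.confirm("trainee_b")     # second person: step closes
```

The point of the design is that solo work cannot advance the job: a single person confirming twice leaves the step open, so the measurement step stays locked until the buddy has checked the prerequisite.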
Cultural Evolution
Move from compliance to a culture of excellence emphasizing:
Safe operations as everyone’s concern, including the design, procurement, and project delivery teams
Proactive risk identification
Root cause analysis
Continuous improvement
Leadership alignment
Lessons Learned
A mistake made early on is very hard to overcome later, except through extreme care, a pre-work survey, and a full buddy system that never breaks into solo work.
Routine work can cause oversight
Familiarity breeds hidden risk
Training must target complacency
External partner integration is essential
Documentation supports oversight accountability
Integrating Dimensions
Safety and performance require managing design, availability, security, integrity, confidentiality, and efficiency in tandem.
Technology as Force Multiplier
Use monitoring, communication, decision support, and knowledge management to augment human capacity.
Another Story
Incident Summary: The Hidden Risks of Fatigue and Routine
During a routine maintenance check on a 3+1 UPS system, an experienced technician (A) and his new trainee (B) faced a critical moment. It was early afternoon, but both were fatigued after starting work at 7 a.m. In the rush to complete their checklist, they missed a crucial step: setting the UPS under maintenance to de-link the synchronous tie with the other units.
While measuring voltage with a multimeter, technician A’s unsteady hand caused the multimeter leads to touch two copper lugs simultaneously. The resulting short circuit triggered the UPS to switch into bypass mode and shut down, cascading panic signals through the synchronizing cabling to the other UPS units, which also went into bypass and then shut down. The sudden electrical surge tripped the upstream breakers, cutting power to the data hall’s supply A.
Thankfully, both technicians wore protective rubber boots, and the electrical current traveled through the multimeter leads rather than through their bodies, preventing injury.
Summary
Apologies for the long article. The topic spans many dimensions, from early prevention through design for safe operations to multiple layers of operations resources (the buddy system) and procedures that minimize risk. Operations teams are sometimes overworked compensating for things that should have been thought of, and easily implemented, during the design and specification stage of the data center MEP infrastructure. Even so, operations must stay diligent and avoid falling into routine numbness, supported by culture and technology.
