
Data Center Operations Excellence Part 3: When the Buddy System Failed

  • Writer: datacenterprimerja
  • Feb 27

James Soh. First published on 19th of August, 2025

The Best Laid Plans

Two facility operations engineers accompanied one of two service engineers tasked with routine maintenance on a power panel that operates the UPS. The other service engineer worked unattended in another part of the UPS room. They followed protocol: supervision was in place, procedures were documented, and the team was experienced.

Then it happened. The tall service engineer’s sleeve caught the UPS bypass switch, a handlebar-type switch with no protective cover, although fitting one would have been good practice. Instantly, client data hall equipment lost power. Recovery was fast, but the disruption’s impact was real.


"We did everything right," the manager reflected. "We had supervision, qualified personnel, and established procedures. But we still had an incident."


This scenario shows the evolution needed in operations thinking: supervision alone isn’t enough. True operational excellence moves beyond oversight to full risk management that accounts for human factors, environment, and failure modes.


Beyond Basic Supervision

The incident reveals how compliance with procedures can still produce failure. Earlier focus was on knowledge and supervision. Now excellence requires systematic risk management addressing:

  • Human factors in design

  • Failure mode analysis

  • Multi-layered safeguards

  • Environmental resilience


The Five Dimensions Under Stress

The UPS event impacted:

  • Availability: equipment lost power despite fast recovery

  • Security: disruptions triggered resets and monitoring gaps

  • Integrity: risk of data corruption and inconsistency

  • Confidentiality: restart vulnerabilities

  • Energy Efficiency: emergency mode bypassed normal efficiency


A single sleeve contact endangered all dimensions simultaneously, showing how failures propagate systemically.


Human Factors

The Good Work Space Problem

This highlights anthropometric variability: people differ in height and reach. The work space was narrow and the switch was unprotected, so a tall engineer’s sleeve caught it. Two layers of design protection were missing, and both gaps had been overlooked.


Best practices include:

  • Access and work space survey

  • Adequate clearances

  • Movement protocols

  • Recessed or guarded switches

  • Communication and briefing


The Routine Numbness Trap

Experienced staff, repeatedly performing familiar tasks, risk cognitive lapses:

  • Attention fade

  • Assumptions of no danger

  • Complacency

  • Automatic, unmonitored actions

This degrades multiple defensive layers simultaneously.


Broken Buddy System and Supervision

One engineer was supervised; the other was not, fracturing oversight:

  • Divided attention weakens supervision

  • Efficiency drives cut safety corners

  • Full, uninterrupted monitoring is crucial
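
The supervision gap described above can be checked mechanically before work starts. The following is a minimal sketch, assuming a job plan is recorded as a list of worker-to-supervisor assignments; the `Assignment` class, the function names, and the engineer labels are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    worker: str         # service engineer doing the hands-on work
    supervisors: tuple  # facility engineers assigned to watch this worker


def unsupervised_workers(assignments):
    """Return workers who have no supervisor assigned at all."""
    return [a.worker for a in assignments if not a.supervisors]


def shared_supervisors(assignments):
    """Return supervisors assigned to more than one worker (divided attention)."""
    seen = {}
    for a in assignments:
        for s in a.supervisors:
            seen.setdefault(s, []).append(a.worker)
    return {s: workers for s, workers in seen.items() if len(workers) > 1}


# Staffing as described in the incident: two facility engineers with one
# service engineer, nobody with the other. Labels are hypothetical.
incident = [
    Assignment("service_engineer_1", ("facility_engineer_1", "facility_engineer_2")),
    Assignment("service_engineer_2", ()),
]

print(unsupervised_workers(incident))  # ['service_engineer_2']
print(shared_supervisors(incident))    # {}
```

A check like this only confirms the plan on paper. It cannot confirm that a supervisor's attention stays on the worker for the whole task.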


Advanced Risk Management

Swiss Cheese Model

The incident illustrates aligned holes in protective layers:

  • Unprotected switch

  • Split supervision

  • No ergonomic consideration

  • Constrained workspace


Mitigations require independent, layered defenses: physical barriers, procedural checks, technical interlocks, organizational culture.
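
To see why independent layers matter, here is a rough numerical sketch. It assumes each layer fails independently, so the chance of a hazard passing through every layer is the product of the individual failure probabilities. The probabilities are made-up illustrative values, not measurements from the incident, and `breach_probability` is a hypothetical helper.

```python
from math import prod


def breach_probability(layer_failure_probs):
    """Chance that a hazard passes every layer, assuming independent layers."""
    return prod(layer_failure_probs)


# Illustrative values only: switch guard, supervision, workspace design.
with_guard = [0.01, 0.1, 0.1]

# An unprotected switch is a layer that always fails (probability 1.0).
without_guard = [1.0, 0.1, 0.1]

print(breach_probability(with_guard))
print(breach_probability(without_guard))
```

With these assumed numbers, removing the switch guard makes a breach about one hundred times more likely. Real layers are rarely fully independent: a common cause such as fatigue or a cramped room can weaken several layers at once, so a product like this is an optimistic lower bound.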


Failure Mode & Environmental Resilience

Analysis should find:

  • Single failure points

  • Common cause vulnerabilities

  • Predictive human error models

  • Rapid recovery plans
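
As an illustration of the first item, single failure points can be found by listing every supply path to a load and checking which components appear in all of them. This is a minimal sketch under that assumption; the topology and component names below are hypothetical and do not describe the site in the incident.

```python
def single_points_of_failure(paths):
    """Return components present in every supply path to a load.

    Losing any one of these components cuts all listed paths at once.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


# Hypothetical topology: two UPS units feeding one load through a shared panel.
paths_to_load = [
    ["utility", "ups_a", "bypass_panel", "pdu_1"],
    ["utility", "ups_b", "bypass_panel", "pdu_1"],
]

print(sorted(single_points_of_failure(paths_to_load)))
# ['bypass_panel', 'pdu_1', 'utility']
```

The result is only as good as the path list: a path that is missing from the input can make a component look more critical than it is.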


The environment should also be improved through better signage, guarded controls, and workspace design.

Building Excellence

Five pillars:


  1. Design for Operations

  2. Predictive risk methods

  3. Adaptive procedures

  4. Continuous learning

  5. Technology-enabled awareness


Practical Guidelines

  • When the electrical consultant or design engineer specifies switches and breakers, they should prefer protective covers and dual-confirmation switches wherever possible, and design the work space with enough room for free movement.

  • Strict method statements requiring continuous buddy supervision

  • Anti-complacency briefings and communication

  • Full job closure protocols

  • Avoid divided supervision; prioritize resource allocation

  • Workspace and control improvements


Cultural Evolution

Move from compliance to a culture of excellence emphasizing:

  • Shared responsibility for safe operations across everyone involved, including design, procurement, and project delivery teams

  • Proactive risk identification

  • Root cause analysis

  • Continuous improvement

  • Leadership alignment


Lessons Learned

  • An early design mistake is very hard to overcome later; doing so takes extreme care, a pre-work survey, and a full buddy system that never breaks into solo work

  • Routine work can cause oversight

  • Familiarity breeds hidden risk

  • Training must target complacency

  • External partner integration is essential

  • Documentation supports oversight accountability


Integrating Dimensions

Safety and performance require managing design, availability, security, integrity, confidentiality, and efficiency in tandem.


Technology as Force Multiplier

Use monitoring, communication, decision support, and knowledge management to augment human capacity.


Another Story

Incident Summary: The Hidden Risks of Fatigue and Routine

During a routine maintenance check on a 3+1 UPS system, an experienced technician (A) and his new trainee (B) faced a critical moment. It was early afternoon, but both were fatigued after starting work at 7 a.m. In the rush to complete their checklist, they missed a crucial step: configuring the UPS under maintenance so that its synchronization tie with the other units was de-linked.


While technician A was measuring voltage with a multimeter, his unsteady hand let the test leads touch two copper lugs at the same time. The resulting short circuit caused the UPS to switch into bypass mode and shut down. Panic signals cascaded through the synchronizing cabling to the other UPS units, which also went into bypass and then shut down. The sudden electrical surge tripped the upstream breakers, cutting power to the data hall’s supply A.


Thankfully, both technicians wore protective rubber boots, and the electrical current traveled through the multimeter leads rather than through their bodies, preventing injury.
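
The missed isolation step in this story is the kind of omission a hard checklist gate can catch. The sketch below assumes a simple rule: live measurement is not authorized until every required step has been confirmed. The step names, `ChecklistIncomplete`, and `authorize_measurement` are hypothetical and are not taken from any real method statement.

```python
class ChecklistIncomplete(Exception):
    """Raised when work is attempted before all required steps are confirmed."""


# Hypothetical required steps for measuring on one unit of a parallel UPS system.
REQUIRED_STEPS = (
    "brief_team",
    "isolate_ups_from_sync_tie",
    "confirm_isolation_with_buddy",
)


def missing_steps(confirmed, required=REQUIRED_STEPS):
    """Return required steps that have not been confirmed, in checklist order."""
    return [step for step in required if step not in confirmed]


def authorize_measurement(confirmed):
    """Allow work only when nothing is missing; otherwise stop with the reason."""
    missing = missing_steps(confirmed)
    if missing:
        raise ChecklistIncomplete("Do not proceed; missing: " + ", ".join(missing))
    return True


# The situation in the story: the team was briefed but isolation was skipped.
print(missing_steps({"brief_team"}))
# ['isolate_ups_from_sync_tie', 'confirm_isolation_with_buddy']
```

A gate like this works only if it cannot be bypassed under time pressure, which is why it belongs alongside the buddy system and not in place of it.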


Summary

Apologies for the long article. This topic touches many dimensions, from early prevention through design for safe operations to the use of multiple layers of operations resources (the buddy system) and procedures to minimize risk. Operations teams are sometimes overworked compensating for issues that should have been anticipated and could easily have been addressed early, during the design and specification stage of the data center MEP infrastructure. Even so, operations must stay diligent and avoid falling into routine numbness, supported by culture and technology.
