Infrastructure is Only as Reliable as the People Who Operate It

datacenterprimerja
Feb 27

James Soh. First published on 17th of October, 2025.

A data center can have redundant power systems, N+1 cooling, and state-of-the-art monitoring—yet still fail spectacularly because an operator skipped a log entry, missed an escalation, or handed off a shift without proper documentation.

It was 2:47 AM when Chen, four months into his first operations role, noticed water beneath the floor grates during his rounds in Data Hall 3A. Just a few drops, but directly between customer racks and three power distribution units.

His first instinct? Keep moving. "Probably just condensation. Someone else will catch it." The ticket queue was full, and he didn't want to be "that guy" who over-escalated. He almost walked past it.

Back at the Operations Room, his senior partner asked the routine question: "Any anomalies in your round?" Chen hesitated. Then admitted: "There were a few drops of water near the PDUs in Hall 3A. Probably nothing, but..."

"Show me. Now." Within minutes, they'd cordoned off the area and traced it to a loosening coolant line fitting. What seemed like "just a few drops" would have become a major leak within hours, potentially taking client IT and network equipment offline.

That's operational excellence: seeing something wrong, reporting it even when uncertain, and having a culture where "probably nothing" is never good enough. The story is adapted from a real event that I witnessed.

I've spent years working with data center operations teams, and I've learned this fundamental truth:

Technology alone doesn't create reliability. People do.

The question isn't whether your infrastructure is robust. The question is whether your people have the competence, discipline, and culture to operate it flawlessly, day after day, shift after shift.

The Hidden Cost of Operational Gaps

Consider these scenarios that play out in data centers worldwide:

A minor coolant leak goes unreported during rounds because "someone else will catch it." Hours later, it cascades into a thermal event affecting multiple racks.
An operator receives a Smart Hands request but doesn't verify the exact rack location. They work on the wrong equipment, triggering a client outage and contractual penalties.
During shift handover, an ongoing alarm is mentioned verbally but not documented. The incoming team forgets to follow up, and a small issue becomes a critical failure.

Each incident shares a common thread: not a technology failure, but a human systems failure. Gaps in training, discipline, communication, or culture.

A Four-Stage Development Journey

Transforming newcomers into operations leaders who prevent these failures requires a structured approach. I've developed a framework built on four progressive stages:

Foundation (Understanding Your Environment)

New operators must master the physical landscape (Data Halls, Meet-Me Rooms, Operations Centers) and understand why each zone exists. More importantly, they need to grasp the business impact of their actions.

Every safe entry, thorough check, and clear log entry doesn't just protect equipment. It protects client trust, company reputation, and career advancement. When operators understand this connection, they approach work differently.

Key outcomes: Navigate safely, understand escalation paths, recognize your role in the bigger picture.

Development (Building Capabilities)

This is where discipline becomes habit. Operators learn:

Shift handovers that leave nothing to chance: overlapping shifts, joint physical reviews, standardized documentation
Communication protocols using structured methods like SBAR (Situation, Background, Assessment, Recommendation)
Documentation rigor where every action is logged immediately, objectively, and completely
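As a rough illustration of the SBAR method named above, here is a minimal Python sketch of an escalation message that cannot be sent with a part left blank. The class name, field names, and sample wording are my own assumptions for illustration (the sample text is loosely based on the Hall 3A story), not a standard or a specific tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SbarReport:
    """One escalation message in SBAR order. All names here are illustrative."""
    situation: str       # what is happening right now
    background: str      # relevant context
    assessment: str      # what the reporter thinks it means
    recommendation: str  # what the reporter is asking for

    def missing_parts(self) -> list[str]:
        """Return the names of any SBAR parts left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Format the report as four labelled lines, refusing incomplete reports."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"incomplete SBAR report, missing: {', '.join(missing)}")
        return "\n".join(
            f"{f.name.capitalize()}: {getattr(self, f.name).strip()}"
            for f in fields(self)
        )


# Hypothetical report wording, loosely based on the Hall 3A story.
report = SbarReport(
    situation="Water drops under the floor grates in Data Hall 3A, near three PDUs.",
    background="Found at 02:47 during routine rounds. Source not yet identified.",
    assessment="Possibly condensation, but a coolant leak cannot be ruled out.",
    recommendation="Senior operator to inspect now; cordon off the area meanwhile.",
)
print(report.render())
```

The point of the structure is that "probably nothing" has to be written down as an assessment next to the facts, where a senior operator can challenge it.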

I cannot overstate the importance of shift handovers. In my experience, a 15-minute overlap with proper documentation prevents more outages than any redundant system you can install.

Key outcomes: Execute flawless handovers, document accurately, communicate clearly.

Practice (Daily Excellence)

Competence comes from repetition under guidance. Operators develop fluency in:

Environmental rounds that catch anomalies before they cascade
Scenario handling for water leaks, UPS alarms, and emergency responses
Smart Hands execution where precision and documentation build client trust

The best operators don't just follow procedures. They internalize the reasoning behind them. They practice "what if" scenarios during routine rounds, mentally rehearsing responses to potential incidents.

Key outcomes: Perform rounds confidently, respond to incidents correctly, serve clients professionally.

Integration (Becoming the Backbone)

Senior operators and shift leads become force multipliers through:

Mentorship that transfers not just skills but judgment and culture
Incident leadership that coordinates complex responses
Continuous improvement that strengthens the entire team

This is where individuals become the foundation of organizational resilience.

Key outcomes: Lead incidents, mentor others, drive improvement, advance careers.

The Three Pillars of Operational Excellence

Throughout this developmental journey, three interrelated capabilities must be cultivated simultaneously:

1. Technical Competence

Systematic skill development from facility basics through complex Smart Hands execution and incident response. This includes understanding power systems, cooling infrastructure, network topology, fire suppression, and monitoring systems.

But technical knowledge alone is insufficient. An operator who knows everything about UPS systems but doesn't follow proper escalation procedures is still a liability.

2. Operational Discipline

Rigorous adherence to procedures, documentation standards, safety protocols, and quality execution. This means:

Never delaying log entries until end of shift
Always using zone-appropriate PPE
Following escalation chains without skipping steps
Documenting observable facts, not vague assessments like "looked fine"

Discipline is what separates world-class operations from mediocre ones. It's the invisible force that prevents service failures and builds team trust.
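Two of the rules above, logging without delay and recording observable facts, lend themselves to a simple automated check. The sketch below is only an assumption of how such a check might look: the phrase list, the 10-minute limit, and the function name are all hypothetical and would need to match your own logging standard.

```python
from datetime import datetime, timedelta

# Hypothetical phrases that state an opinion instead of an observation.
VAGUE_PHRASES = ("looked fine", "seems ok", "probably nothing", "as usual")

# Hypothetical limit on how long after an action its log entry may be written.
MAX_LOGGING_DELAY = timedelta(minutes=10)


def review_log_entry(text: str, action_time: datetime, logged_time: datetime) -> list[str]:
    """Return the problems found in one log entry (an empty list means none)."""
    problems = []
    lowered = text.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: '{phrase}'")
    if logged_time - action_time > MAX_LOGGING_DELAY:
        problems.append("entry logged too long after the action")
    return problems


# Arbitrary example timestamp for the moment the check was performed.
checked_at = datetime(2025, 1, 1, 2, 47)

vague = review_log_entry("Hall 3A looked fine", checked_at, checked_at + timedelta(hours=5))
factual = review_log_entry(
    "Hall 3A: water drops under floor grates near three PDUs, area cordoned off",
    checked_at,
    checked_at + timedelta(minutes=3),
)
print(vague)    # two problems: vague wording and a late entry
print(factual)  # no problems
```

A check like this does not replace a reviewer. It only catches the most obvious lapses early enough for the operator to correct them on the same shift.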

3. Human Excellence

Effective communication, mentorship, teamwork, and continuous improvement. This is the cultural dimension that transforms groups of individuals into resilient, high-performing teams.

Human excellence means:

Encouraging questions without penalty
Using incidents as learning opportunities, not blame sessions
Sharing knowledge across shifts and roles
Empowering operators to halt unsafe work
Recognizing that today's newcomer is tomorrow's mentor

Why These Pillars Are Inseparable

Here's the critical insight: these three capabilities must develop together.

Technical skills without discipline lead to inconsistent, unreliable results. You might execute brilliantly one day and create an outage the next.

Discipline without human excellence creates rigid, unresponsive operations that can't adapt to novel situations or learn from mistakes.

Excellence in collaboration without technical competence simply can't sustain uptime when real incidents occur.

It's the integration of all three that creates operational resilience.

The Business Impact Connection

Operations leaders sometimes struggle to get management buy-in for comprehensive training programs. Here's how to frame the business case:

Client Trust & Retention: Vigilant, disciplined operators keep client operations running smoothly. Happy clients renew contracts and provide referrals.

Equipment Longevity & Cost Predictability: Consistent application of best practices extends equipment life by years and reduces unplanned outages, enabling confident financial forecasting.

Company Reputation & Market Position: Your operational track record differentiates you in competitive markets. Word travels fast when you're reliable, and when you're not.

Talent Development & Retention: Operators who receive structured development and mentorship are more engaged, perform better, and stay longer, reducing costly turnover.

Every dollar invested in operational excellence returns multiples through these channels.

Career Pathways Beyond Operational Mastery

For operators wondering "what's next?", the pathways are diverse:

Shift Lead roles requiring advanced technical competency and incident leadership
Senior positions and mentorship where you guide the next generation
Technical specialization in power systems, cooling, automation, or compliance
Supervisory and management trajectories with strategic responsibilities
Cross-functional projects working with engineering, IT, and business teams

Success in any of these paths requires the same foundation: combining deep technical knowledge with professional soft skills and cultural stewardship.

The Mentorship Multiplier

If I could emphasize one element that accelerates this entire framework, it's mentorship.

Great mentors don't just teach procedures. They model discipline, demonstrate judgment under pressure, and transfer culture. They ask questions that develop critical thinking:

"What's your next step?"

"Why did you choose that approach?"

"What would you do differently?"

When mentorship is embedded into daily operations (not just formal training sessions), the entire team's capability multiplies. Newcomers progress faster, errors decrease, and a culture of continuous learning takes root.

The strongest data center operations teams I've encountered all share this trait: a deep commitment to developing others.

Putting It Into Practice

If you're leading a data center operations team, here are concrete steps to implement this framework:

Audit your current state: Where are the gaps? Is it technical training? Documentation discipline? Mentorship culture?
Establish baseline standards: Define what "good" looks like for each developmental stage and capability pillar.
Create structured onboarding: Don't leave newcomer development to chance. Build a progression from shadowing through supervised practice to independence.
Formalize handover protocols: Implement overlapping shifts, standardized documentation, and joint physical reviews.
Embed mentorship: Make it a core expectation, not an extra duty. Rotate responsibilities so all senior staff develop teaching skills.
Use incidents as learning: Every event—successful or not—is a teaching opportunity. Conduct blame-free post-mortems focused on system improvement.
Measure and recognize: Track progress through the developmental stages. Celebrate operators who demonstrate excellence across all three pillars.
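To make the handover step above concrete, here is a minimal sketch of a sign-off check that blocks a handover while any active alarm exists only as a verbal mention, or while the joint physical review is still pending. The record fields, alarm IDs, and function name are hypothetical; a real implementation would read active alarms from your monitoring system.

```python
from dataclasses import dataclass, field


@dataclass
class HandoverRecord:
    """Written handover for one shift change. Field names are illustrative."""
    outgoing_operator: str
    incoming_operator: str
    documented_alarm_ids: set[str] = field(default_factory=set)
    joint_walkthrough_done: bool = False


def handover_gaps(record: HandoverRecord, active_alarm_ids: set[str]) -> list[str]:
    """List the reasons this handover should not be signed off yet."""
    gaps = []
    # Any alarm that is active but absent from the written record is a gap.
    for alarm_id in sorted(active_alarm_ids - record.documented_alarm_ids):
        gaps.append(f"active alarm {alarm_id} is not in the written handover")
    if not record.joint_walkthrough_done:
        gaps.append("joint physical review has not been completed")
    return gaps


# Hypothetical shift change: one of two active alarms was only mentioned verbally.
record = HandoverRecord(
    outgoing_operator="night shift",
    incoming_operator="day shift",
    documented_alarm_ids={"ALM-101"},
)
gaps = handover_gaps(record, active_alarm_ids={"ALM-101", "ALM-102"})
print(gaps)
```

Comparing the written record against the live alarm list is the design choice that matters here: it removes the dependence on anyone remembering what was said at the door.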

The Bottom Line

Infrastructure is only as reliable as the people who operate it. Technology and systems are enablers; the people who run them need diligence, and they are sometimes the last line of defense against major failures.

Every action (safe entry, thorough check, clear log, timely escalation) protects equipment, builds client trust, and advances careers.

The journey from newcomer to integral team member is only the beginning. Real and meaningful impact comes from achieving mastery yourself, then helping others do the same to secure the future of your data center.

The three pillars (Technical Competence, Operational Discipline, and Human Excellence) aren't just a training framework. They're a philosophy for building operations teams that don't just react to incidents but prevent them, don't just meet SLAs but exceed them, and don't just operate infrastructure but protect the business that depends on it.

What's been your experience with operational excellence in data centers? What challenges have you faced in developing operations teams? I'd love to hear your perspectives in the comments.

This article is adapted from Chapter 9 of my upcoming book on Data Center Operations Management. For more insights on building resilient operations teams, follow me here on LinkedIn.