In 2003, a Google engineer named Ben Treynor Sloss was handed a team of seven software engineers and told to keep Google’s production systems running. His approach — treating operations as a software engineering problem — would eventually reshape an entire industry. But in the two decades that followed, the world changed beneath our feet: monoliths shattered into microservices, on-prem servers migrated to ephemeral cloud infrastructure, and the sheer complexity of modern distributed systems outpaced any human team’s ability to reason about them in real time. Now, we are entering a new era where AI agents don’t just assist operations; they drive them.

This is the story of how we got here, told through four ages.


Age I: The Era of Heroic Operations (Pre-2003)

Before SRE had a name, there were sysadmins. They carried pagers, wrote shell scripts, maintained wikis full of runbooks, and were the human control plane of every production system.

This era had defining characteristics:

  • Tribal knowledge was the primary operational currency — and liability. When the senior sysadmin left, so did half the team’s ability to recover from outages.
  • Runbooks lived in wikis (or worse, in people’s heads). Each incident response was a manual, human-driven process that depended entirely on who was on call.
  • Operations and development were separate organisations. Developers threw code over the wall; ops teams caught it and kept it running. The incentives were misaligned — developers wanted to ship fast, operators wanted stability.
  • Scaling was linear. More systems meant more operators. There was no abstraction layer between human effort and system complexity.

The model worked — until it didn’t. As systems grew from dozens of servers to thousands, the heroic-operator model hit a hard ceiling. Google was among the first to feel this constraint acutely: by the early 2000s, their infrastructure was growing faster than any operations team could scale.

The fundamental problem was not a shortage of talented operators. It was that operations was treated as a craft, not an engineering discipline. Crafts don’t scale. Engineering does.


Age II: Software Engineering Drives Operations (2003–2017)

Ben Treynor Sloss’s insight was deceptively simple: “SRE is what happens when you ask a software engineer to design an operations team.” That single sentence encoded a paradigm shift.

The Core Principles

Google’s SRE framework, formally published in the Site Reliability Engineering book in 2016, introduced concepts that are now foundational:

Service Level Objectives (SLOs) as contracts. Instead of vague commitments to “keep things up,” SRE demanded quantifiable reliability targets. An SLO of 99.95% availability isn’t a goal — it’s a contract that defines how much unreliability is acceptable. This was revolutionary because it made reliability negotiable rather than absolute.

Error budgets as a governance mechanism. If your service has a 99.95% SLO, you have a 0.05% error budget per period. Spend it on risky deployments; save it by being conservative. This elegantly resolved the dev/ops tension: both teams shared the same budget.
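The arithmetic behind an error budget is simple enough to sketch. A minimal helper (the function name and the 30-day window are illustrative, not from any particular SRE tool) converts an availability SLO into allowed downtime per period:

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO per period."""
    return (1.0 - slo) * period_days * 24 * 60

# A 99.95% SLO over a 30-day window permits roughly 21.6 minutes of downtime.
budget = error_budget_minutes(0.9995)

# After a 12-minute incident, about 9.6 minutes of budget remain, so risky
# deployments are still permitted under the shared-budget rule.
remaining = budget - 12.0
can_deploy = remaining > 0
```

This is the sense in which the budget is "spent": every minute of unavailability draws it down, and both dev and ops watch the same number.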

Toil elimination as a first-class objective. SRE defined “toil” as manual, repetitive, automatable work that scales linearly with system size. The mandate was clear: if a human does it repeatedly, automate it. Google set a target that SRE teams should spend no more than 50% of their time on toil.

Software engineering as the solution to operational problems. SREs didn’t just respond to incidents — they wrote software to prevent them. Monitoring systems, deployment pipelines, capacity planning tools, and automated remediation scripts were all engineering artifacts, maintained with the same rigour as production code.

The Adoption Wave

By 2016, Google had over 1,000 SREs. The publication of the SRE book catalysed industry-wide adoption. Netflix, LinkedIn, Twitter, Dropbox, and hundreds of other companies established SRE practices. The DORA (DevOps Research and Assessment) programme, led by Dr. Nicole Forsgren, provided empirical evidence that these practices correlated with organisational performance.

Key tooling emerged: Prometheus for monitoring (inspired by Google’s Borgmon), Kubernetes for orchestration (evolved from Borg), and a growing ecosystem of open-source reliability tools under the CNCF umbrella.

The Limits

Yet even Google’s model had inherent constraints:

  • Runbooks were still authored and maintained by humans. Automation reduced toil but didn’t eliminate the need for human judgement in novel situations.
  • Incident response remained human-driven. Monitoring could detect problems and page engineers, but diagnosis, decision-making, and remediation required people.
  • The 50% toil target was aspirational. Many SRE teams reported spending 60–80% of their time on toil, particularly during periods of rapid growth.
  • SRE didn’t scale to small teams. The full SRE model assumed dedicated reliability engineers — a luxury that startups and smaller companies couldn’t afford. This gap gave rise to Platform Engineering — an approach that packaged SRE best practices into self-service “Golden Paths,” making reliability accessible without requiring every team to hire dedicated SREs. Platform Engineering became the bridge between the SRE ideal and the reality of constrained engineering organisations.

The discipline had transformed operations from a craft into an engineering practice. But the next question was inevitable: could machines do the engineering?


Age III: AI Assists the Engineer (2017–2024)

The third age arrived not with a manifesto but with a gradual accumulation of machine learning capabilities applied to operational data.

The AIOps Promise

Gartner coined the term “AIOps” (Artificial Intelligence for IT Operations) in 2016, and by 2018, a wave of startups and platform features began delivering on parts of the vision:

Anomaly detection. Systems like Moogsoft, BigPanda, and Datadog’s Watchdog applied statistical and ML models to time-series metrics, identifying deviations that static thresholds would miss. Instead of alerting on “CPU > 90%,” these systems could learn normal patterns and alert on unexpected behaviour — a subtle but significant improvement.
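The gap between a static threshold and a learned baseline can be illustrated with a rolling z-score, a deliberately simplified stand-in for the statistical models these platforms actually use:

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag points that deviate sharply from the trailing window's baseline.

    A static rule like "CPU > 90%" misses a jump from 20% to 60%; a z-score
    against recent history catches it.
    """
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# CPU hovering near 20% then jumping to 60%: it never crosses a 90% static
# threshold, but it is a clear anomaly against its own recent behaviour.
cpu = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 60]
```

Production systems replace the z-score with seasonal decomposition, forecasting models, or learned embeddings, but the principle is the same: the baseline is the metric's own history, not a hand-picked constant.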

Alert correlation and noise reduction. A single infrastructure failure might generate hundreds of alerts across monitoring systems. AIOps platforms could group related alerts into incidents, reducing the cognitive load on on-call engineers. Moogsoft (later acquired by Dell for APEX AIOps) pioneered this with their correlation engine, clustering alerts by topology, time, and textual similarity.
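A toy version of that correlation logic, assuming only two of the signals mentioned (arrival time and textual similarity; real engines also use topology), might look like:

```python
def correlate_alerts(alerts, window_s=120, min_overlap=0.3):
    """Group (timestamp, message) alerts into incidents.

    Alerts that arrive within window_s seconds of an incident's last alert
    and share enough message tokens are folded into that incident.
    """
    incidents = []
    for ts, msg in sorted(alerts):
        tokens = set(msg.lower().split())
        for inc in incidents:
            overlap = len(tokens & inc["tokens"]) / max(len(tokens), 1)
            if ts - inc["last_ts"] <= window_s and overlap >= min_overlap:
                inc["alerts"].append(msg)
                inc["tokens"] |= tokens
                inc["last_ts"] = ts
                break
        else:
            incidents.append({"alerts": [msg], "tokens": tokens, "last_ts": ts})
    return incidents

# Three alerts about db-1 within 45 seconds collapse into one incident;
# an unrelated disk alert 15 minutes later becomes a second incident.
alerts = [
    (0,   "db-1 connection timeout"),
    (30,  "db-1 connection refused"),
    (45,  "api latency high db-1 timeout"),
    (900, "disk full on log-7"),
]
```

The on-call engineer sees two incidents instead of four pages, which is the entire value proposition of this product category.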

Root cause suggestion. Platforms began offering probable root cause analysis by correlating deployment events, configuration changes, and infrastructure telemetry with incident timing. This didn’t replace human investigation but gave engineers a starting point.

Predictive capacity planning. ML models trained on historical usage patterns could forecast resource needs, enabling proactive scaling before demand hit.

What Worked

AI-assisted SRE delivered measurable improvements:

  • MTTD (Mean Time to Detect) decreased significantly. Anomaly detection caught issues minutes before threshold-based alerts would have fired.
  • Alert noise dropped by 90%+ in well-configured deployments. Engineers could focus on real problems instead of drowning in false positives.
  • Postmortem analysis improved. Automated timeline construction and change correlation made root cause analysis faster and more thorough.

What Didn’t

The fundamental limitation of Age III was the recommendation gap: AI could identify problems and suggest actions, but a human still had to decide and execute.

  • No autonomous action. AIOps platforms surfaced insights but stopped short of remediation. The human was still in the loop for every operational decision.
  • Pattern matching, not reasoning. ML models excelled at recognising patterns they’d seen before but struggled with novel failure modes. They couldn’t reason about why a system was failing in a new way.
  • Siloed intelligence. Each tool (monitoring, logging, tracing, deployment) had its own AI features, but they didn’t compose into a coherent operational intelligence. The human brain remained the only integration layer.
  • Static playbooks with ML triggers. The most advanced automation was still “if ML model says X, execute predefined script Y.” The scripts themselves were authored and maintained by humans — and this was the hidden cost of Age III. You could automate the trigger, but someone still had to write, test, version, and update the remediation script. As infrastructure evolved, playbooks rotted silently. Age IV promises to not just trigger the action, but to generate the action itself.
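The "ML trigger, human-authored script" pattern in the last bullet reduces to a lookup table. This sketch is illustrative (the labels and script paths are invented), but it shows where the hidden cost lives:

```python
# Age III automation: the model picks the label; a human-authored script
# table supplies the action. Nothing here generates a new remediation.
PLAYBOOKS = {
    "memory_leak":   "scripts/restart_service.sh",
    "disk_pressure": "scripts/rotate_logs.sh",
    "conn_storm":    "scripts/recycle_pool.sh",
}

def remediate(ml_label: str) -> str:
    """Map a model's classification to a predefined script, or escalate."""
    script = PLAYBOOKS.get(ml_label)
    if script is None:
        return "page_human"  # novel failure mode: no playbook exists
    return script            # the script itself was written, and must be
                             # maintained, by a human
```

Every entry in that table is a maintenance obligation, and every novel failure mode falls through to a page. Age IV's claim is that the right-hand side of the table can be generated rather than pre-authored.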

By 2024, the industry had reached an inflection point. The tools were generating excellent insights, but humans remained the bottleneck: not for lack of skill, but because human decision-making doesn’t scale at the speed of modern distributed systems.


Age IV: AI Drives Reliability (2025–Present)

The fourth age is defined by a fundamental inversion: the AI is no longer assisting the human; the human is supervising the AI.

What Changed

Three converging forces made Agentic SRE possible in 2025:

Large Language Models gained reasoning capability. Models like GPT-4, Claude, and Gemini demonstrated the ability to reason over complex, multi-step problems — not just pattern-match. For SRE, this meant an AI could read a set of logs, correlate them with deployment history, form a hypothesis about root cause, and propose a specific remediation. Chain-of-thought reasoning transformed AI from a classifier into a diagnostician.

Tool-use and agent frameworks matured. LangChain, CrewAI, AutoGen, and cloud-native agent frameworks gave LLMs the ability to execute actions — query APIs, run commands, read dashboards, modify configurations. The recommendation gap closed because agents could act on their own conclusions.
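The shape these frameworks share is an observe-reason-act loop. The following is a minimal sketch with a hard-coded stub standing in for the LLM, and invented tool names; real frameworks wrap this same loop around an actual model call:

```python
# Tools the agent is allowed to call. Both are stubs: a real deployment
# would query a metrics API and invoke a deployment system.
TOOLS = {
    "get_error_rate": lambda: 0.12,                  # stubbed telemetry query
    "rollback":       lambda: "rolled back to v41",  # stubbed remediation
}

def stub_model(observations):
    """Stand-in for an LLM choosing the next tool call from context."""
    if "error_rate" not in observations:
        return ("get_error_rate", "error_rate")      # first, gather evidence
    if observations["error_rate"] > 0.05 and "action" not in observations:
        return ("rollback", "action")                # then, act on it
    return (None, None)                              # done

def run_agent(max_steps=5):
    """Observe-reason-act loop: call tools until the model stops."""
    observations, trace = {}, []
    for _ in range(max_steps):
        tool, key = stub_model(observations)
        if tool is None:
            break
        observations[key] = TOOLS[tool]()
        trace.append(tool)
    return observations, trace
```

The point of the sketch is the closed loop: the same component that forms the conclusion ("error rate is above threshold") also executes the action, which is exactly the recommendation gap closing.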

Hardware caught up. Inference costs dropped dramatically. Specialised AI chips made it feasible to run reasoning-heavy agents continuously, not just on-demand. Running an agent that monitors, reasons, and acts 24/7 became economically viable for enterprises.

The Agentic SRE Model

In the agentic model, the operational loop transforms:

  • Detect. Age III (AI-assisted): ML anomaly detection alerts a human. Age IV (agentic): the agent detects and immediately begins investigating.
  • Triage. Age III: a human reads the alert and assesses severity. Age IV: the agent classifies severity from telemetry context.
  • Diagnose. Age III: a human investigates with AI suggestions. Age IV: the agent reasons over logs, metrics, traces, and change history.
  • Decide. Age III: a human selects the remediation. Age IV: the agent selects an action based on policy and confidence.
  • Act. Age III: a human executes (or triggers a script). Age IV: the agent executes the remediation within guardrails.
  • Verify. Age III: a human checks whether the fix worked. Age IV: the agent validates against SLOs and rolls back if it didn’t.
  • Learn. Age III: a human writes the postmortem. Age IV: the agent updates its knowledge base and refines future responses.

What do “guardrails” actually look like in practice? They are explicit, scoped permissions — not vague safety promises. For example: an agent may restart a failing pod but cannot drop a database table. An agent may scale a service up to a predefined cost ceiling ($50/hour in additional compute) but cannot modify IAM roles. An agent may roll back a deployment to the previous known-good version but cannot push a new release. These boundaries are codified as policies, enforced by the platform, and auditable — turning trust from a handshake into an engineering contract.
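Codified, those same limits reduce to a default-deny policy check. This is a toy sketch (the action names and the $50/hour cap mirror the examples above and are purely illustrative, not any vendor's policy language):

```python
# Toy policy engine: every agent action is checked against explicit,
# auditable scopes before execution. Anything not listed is denied.
POLICY = {
    "restart_pod":         {"allowed": True},
    "scale_service":       {"allowed": True, "max_extra_cost_per_hour": 50.0},
    "rollback_deployment": {"allowed": True},   # to last known-good only
    "drop_table":          {"allowed": False},
    "modify_iam_role":     {"allowed": False},
    "push_new_release":    {"allowed": False},
}

def authorize(action: str, **params) -> bool:
    """Return True only if the action is in policy and within its limits."""
    rule = POLICY.get(action, {"allowed": False})  # default deny
    if not rule["allowed"]:
        return False
    cap = rule.get("max_extra_cost_per_hour")
    if cap is not None and params.get("extra_cost_per_hour", 0) > cap:
        return False
    return True
```

The design choice that matters is the default: an unlisted action is denied, so the agent's capabilities grow only when a human explicitly widens the policy, and every widening leaves an audit trail.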

The human role shifts from operator to supervisor:

  • Defining policies and guardrails (what agents are allowed to do)
  • Setting SLOs and error budgets (what “good” looks like)
  • Reviewing agent decisions (audit and oversight)
  • Handling escalations (novel situations beyond agent confidence)
  • Designing systems (architecture that agents can operate)

Early Movers

The industry moved faster than most predicted:

Microsoft’s Azure SRE Agent (announced at Ignite 2025) is a customer-facing product integrated into the Azure portal, performing automated incident investigation and remediation for Azure customers’ workloads. Microsoft built it on the back of years of internal SRE automation for Azure’s own infrastructure, now productised for general use.

Amazon Q Developer expanded from a code assistant into an operational agent, available to AWS customers, capable of investigating infrastructure issues, analysing CloudWatch data, and suggesting (and in some cases executing) remediation steps directly within the AWS console.

PagerDuty’s Agentic AI moved beyond alert routing to include autonomous triage, context gathering, and suggested remediation workflows — aiming to reduce the human steps between page and resolution.

NeuBird AI, named a 2025 Gartner Cool Vendor, deployed AI SRE agents for incident resolution across healthcare, banking, and retail — sectors where reliability directly impacts human wellbeing and financial outcomes.

Shoreline.io pioneered the model of pre-authored remediation actions (“Op Packs”) orchestrated by an intelligent agent that could match incidents to appropriate responses and execute them autonomously.

The Critical Question

Every previous age of reliability engineering was defined by what it automated. Age I automated nothing. Age II automated deployment and monitoring. Age III automated detection and suggestion. Age IV automates decision-making and action.

This raises the defining question of our era: when an autonomous agent makes a decision that breaks production, who is responsible?

This isn’t a hypothetical. It’s the question that every organisation adopting Agentic SRE must answer before they give an agent access to production. The answer will shape not just technology choices but organisational structures, compliance frameworks, and the very nature of the SRE role.


What This Series Will Cover

This is the first post in a 12-part series: “Agentic SRE: When AI Takes the Pager.”

Over the coming days, we’ll explore:

  • The full landscape of SRE domains and their readiness for agentic automation
  • The current state of the art, challenges, and emerging trends
  • Local vs. remote agent deployment topologies
  • Deep dives into autonomous incident response, change management, chaos engineering, disaster recovery, and security
  • Architecture patterns for building production-grade, long-running SRE agents
  • The human factor — how SRE teams must evolve
  • And where this is all heading

Every post will be grounded in current research and documented industry practice. We’ll cite sources, show real systems, and be honest about what works, what doesn’t, and what’s still hype.

The pager is ringing. The question is: who answers?


References

  1. Beyer, B., Jones, C., Petoff, J., & Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. https://sre.google/sre-book/table-of-contents/

  2. Beyer, B., Murphy, N.R., Rensin, D.K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook. O’Reilly Media. https://sre.google/workbook/table-of-contents/

  3. Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.

  4. DORA Team. (2025). State of DevOps Report 2025. Google Cloud. https://dora.dev

  5. Gartner. (2025). Hype Cycle for AIOps and Observability. Gartner Research.

  6. Unite.AI. (2026). “Agentic SRE: How Self-Healing Infrastructure Is Redefining Enterprise AIOps in 2026.” https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/

  7. DevOps.com. (2025). “Agentic AI in Observability Platforms: Empowering Autonomous SRE.” https://devops.com/agentic-ai-in-observability-platforms-empowering-autonomous-sre/

  8. NeuBird AI. (2026). “NeuBird AI Experiences Rapid Adoption of its AI SRE Agent for Incident Resolution.” BusinessWire. https://www.businesswire.com/news/home/20260204450140/en/

  9. BigPanda. (2025). “Agentic ITOps: The Evolution of AIOps.” https://www.bigpanda.io/blog/agentic_itops_aiops_evolution/

  10. Sloss, B.T. (2016). “Introduction” in Site Reliability Engineering. Google. https://sre.google/sre-book/introduction/