When Netflix introduced Chaos Monkey over a decade ago, the premise was radically simple: randomly terminate instances in production to force engineers to build resilient systems. It was blunt, effective, and terrified everyone who wasn’t Netflix.
Over time, chaos engineering matured. We moved from random destruction to controlled experiments. Tools like Gremlin, Chaos Mesh, and LitmusChaos allowed SREs to precisely target blast radii—injecting latency into a specific microservice or dropping packets between two zones. But even with these tools, chaos engineering remained a high-friction activity. It required an SRE to hypothesize a failure mode, write the experiment code, schedule a “game day,” run it manually, and analyze the results.
This friction confined chaos engineering to an elite few. Most organizations still treat it as a special quarterly event, not a continuous practice.
Enter Agentic SRE.
In 2026, we are witnessing the shift from manual chaos engineering to Autonomous Chaos Engineering. Agents don’t just execute experiments; they design them. They analyze your architecture, hypothesize weaknesses, write the experiment code, execute it with a “hand on the brake,” and even draft the fix.
From Chaos Monkey to Chaos Mind
The evolution of chaos engineering can be mapped in four stages:
- Random Chaos (2010–2015): The “Monkey” era. Random termination. Blunt force trauma to infrastructure.
- Controlled Chaos (2016–2020): The “Experiment” era. Manual hypotheses. Game days. Precise fault injection tools (Gremlin, Chaos Mesh).
- Automated Chaos (2021–2024): The “Pipeline” era. Chaos as part of CI/CD. Running predefined experiments on every deploy.
- Autonomous Chaos (2025+): The “Agent” era. Agents generate hypotheses based on topology and incident history. Continuous background experimentation.
The key difference in the agentic era is intent. An agent doesn’t just break things; it investigates resilience. It builds a mental model of your system’s failure modes and actively tries to disprove your reliability claims.
The Agentic Chaos Loop
An autonomous chaos agent operates in a continuous loop, far faster and more consistently than any human team could manage.
1. Hypothesis Generation: “What if the payment gateway is slow?”
In the past, an SRE had to stare at an architecture diagram and ask, “What happens if Redis goes down?”
Today, agents do this. By ingesting your infrastructure-as-code (Terraform/Pulumi), service mesh topology (Istio/Linkerd), and incident history, agents can identify critical dependencies and propose experiments.
For example, the May 2025 integration of AWS Fault Injection Service (FIS) with Amazon Bedrock lets engineers generate experiments using natural language. You can describe a scenario: “Simulate a partial brownout of the checkout-service dependency where 5% of requests add 200ms latency.” The agent translates this intent into a precise FIS experiment template [1][4].
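To make that translation concrete, here is a sketch of the kind of FIS experiment template an agent might emit for the brownout prompt above. The ARNs, tags, and account IDs are placeholders, and the exact fields the Bedrock integration produces may differ; this only illustrates the standard template shape (targets, actions, stop conditions, role).

```python
import json

# Illustrative FIS experiment template for the natural-language scenario:
# "5% of checkout-service requests gain 200ms latency". All resource
# identifiers below are placeholders, not real ARNs.
template = {
    "description": "Brownout: +200ms latency on 5% of checkout-service hosts",
    "targets": {
        "checkout-hosts": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "checkout-service"},
            "selectionMode": "PERCENT(5)",  # blast radius: 5% of matching instances
        }
    },
    "actions": {
        "inject-latency": {
            # FIS injects network latency via an SSM document run on the targets.
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
                "documentParameters": json.dumps(
                    {"DelayMilliseconds": "200", "DurationSeconds": "300"}
                ),
                "duration": "PT5M",
            },
            "targets": {"Instances": "checkout-hosts"},
        }
    },
    "stopConditions": [
        # Kill the experiment the moment this CloudWatch alarm fires.
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate",
        }
    ],
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
}
```

The stop condition is the critical line: the agent’s natural-language intent is only safe to run because the abort trigger is baked into the template itself.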
More advanced agents, like the research-grade ChaosEater (Kikuta et al., Nov 2025), decompose this process. One sub-agent analyzes the system topology to find weak points (e.g., a single-point-of-failure database), while another sub-agent drafts the hypothesis: “If DB-primary fails, failover should happen in <30 seconds.” [3].
2. Experiment Design: Writing the Code
Once the hypothesis is set, the agent writes the experiment definition. This solves the “blank page problem” of chaos engineering.
Instead of an SRE fumbling with YAML syntax for Chaos Mesh or LitmusChaos, the agent generates the valid configuration. It defines:
- Target: Which pods/VMs to impact.
- Action: Network partition, CPU stress, pod kill, I/O delay.
- Steady State: The metric (e.g., “HTTP 200 rate > 99%”) that defines “normal.”
- Stop Condition: The specific alert that should kill the experiment immediately.
This automation lowers the barrier to entry significantly. An SRE can review and approve a generated experiment in minutes, rather than spending hours writing it from scratch.
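A generated experiment bundle might look like the following: a Chaos Mesh NetworkChaos manifest covering the Target and Action, plus agent-side metadata for the Steady State and Stop Condition (which Chaos Mesh itself does not model). Service names, thresholds, and the alert name are illustrative.

```python
# One generated experiment: the Chaos Mesh manifest plus the guardrails the
# agent enforces around it. All names and thresholds are placeholders.
experiment = {
    "manifest": {  # Chaos Mesh v1alpha1 NetworkChaos shape
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": "cart-latency-200ms", "namespace": "chaos-testing"},
        "spec": {
            "action": "delay",
            "mode": "fixed-percent",
            "value": "5",  # Target: 5% of matching pods
            "selector": {"labelSelectors": {"app": "cart-service"}},
            "delay": {"latency": "200ms"},  # Action: inject 200ms of delay
            "duration": "5m",
        },
    },
    # Steady State: what "normal" means while the fault is active.
    "steady_state": {"metric": "http_200_rate", "operator": ">", "threshold": 0.99},
    # Stop Condition: the alert that aborts the experiment immediately.
    "stop_condition": {"alert": "CartErrorRateHigh"},
}
```

An SRE reviewing this sees the entire contract in one artifact: what breaks, how much, what “healthy” means, and what pulls the plug.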
3. Execution with “Hand on the Brake”
The scariest part of chaos engineering is the execution. What if I break production and can’t stop it?
Autonomous agents excel here because they watch metrics with millisecond precision. This is the “Hand on the Brake” pattern.
During an experiment, the agent monitors the Blast Radius. If the “Order Success Rate” drops below 99.5%, or if latency on a critical non-targeted service spikes, the agent triggers an immediate rollback.
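The pattern itself is simple enough to sketch. In this minimal version, `read_metric` stands in for a real metrics client (Prometheus, CloudWatch) and `rollback` for the experiment’s abort hook; both are assumptions, not a real API.

```python
import time

def guarded_run(read_metric, rollback, duration_s=300, poll_s=1.0,
                min_success_rate=0.995):
    """Run an experiment with a hand on the brake: poll the steady-state
    metric and abort the instant it degrades below the floor."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if read_metric("order_success_rate") < min_success_rate:
            rollback()  # brake: stop the fault injection immediately
            return "aborted"
        time.sleep(poll_s)
    return "completed"  # experiment ran its full duration safely

# Demo with a fake metric source that is already degraded:
outcome = guarded_run(read_metric=lambda name: 0.98, rollback=lambda: None,
                      duration_s=1, poll_s=0.1)
print(outcome)  # → aborted
```

The point is the tight poll interval: an agent evaluating this condition every second (or faster) reacts before a human has even seen the dashboard move.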
Krkn-AI (Red Hat, Oct 2025) exemplifies this feedback-driven approach. It automates chaos testing for Kubernetes but continuously listens to the cluster’s health metrics. If the “chaos” starts to look like an “outage,” Krkn-AI halts the experiment faster than a human operator could hit a kill switch [2].
4. Learning & Remediation
The loop doesn’t end with the experiment. The agent analyzes the results.
- Did the system recover?
- Did the failover happen within the SLO?
- Did the alerts fire?
In 2025, we are seeing agents that don’t just report failure—they propose fixes. If a timeout caused a cascade, the agent might suggest a Pull Request to increase the retry backoff in the client library.
This closes the loop: Observe -> Hypothesize -> Break -> Analyze -> Fix.
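The whole loop can be sketched as a skeleton agent. Every stage here is a stub; a real implementation would back these with an LLM planner, a fault-injection tool, and an observability client, all of which are assumptions in this sketch.

```python
# Skeleton of the agentic chaos loop: Observe -> Hypothesize -> Break ->
# Analyze -> Fix. Stage bodies are illustrative stubs.
class ChaosAgent:
    def __init__(self, topology, incident_history):
        self.topology = topology          # Observe: ingested architecture
        self.history = incident_history   # Observe: past incidents

    def hypothesize(self):
        # e.g. rank single points of failure from topology + incident history
        return {"target": "db-primary", "claim_failover_s": 30}

    def design(self, hypothesis):
        return {"fault": "pod-kill", "target": hypothesis["target"],
                "steady_state": "http_200_rate > 0.99"}

    def execute(self, experiment):
        # Break: run with a stop condition armed ("hand on the brake")
        return {"recovered": True, "failover_s": 12}

    def analyze(self, result, hypothesis):
        # Analyze: did reality beat the claimed SLO? If not, draft a fix.
        return result["recovered"] and \
            result["failover_s"] < hypothesis["claim_failover_s"]

    def run_once(self):
        h = self.hypothesize()
        result = self.execute(self.design(h))
        return self.analyze(result, h)  # False would trigger remediation

agent = ChaosAgent(topology={}, incident_history=[])
print(agent.run_once())  # → True
```

The structural point is that `run_once` is cheap to repeat: the same loop a human team runs quarterly, an agent runs continuously.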
Real-World Implementations (2025–2026)
The industry is moving quickly to adopt these patterns.
AWS Fault Injection Service (FIS) & Bedrock
Amazon’s integration of generative AI into FIS has been a game-changer. By allowing natural language definition of experiments, they’ve democratized access. Furthermore, new scenarios launched in late 2025 allow for simulating complex “grey failures”—partial disruptions across Availability Zones that are notoriously hard to script manually [1][5].
Microsoft Azure Chaos Studio
Azure’s Chaos Studio has expanded to include “agent-based faults” that can disrupt resources from within, not just at the control plane. This allows for deeper simulations, such as specific process crashes or memory leaks within a VM, orchestrated by the central chaos agent [6].
Krkn-AI
Red Hat’s Krkn-AI represents the open-source leading edge. It’s built for Kubernetes and focuses on “feedback-driven” chaos. It learns from previous experiments to tune future ones, effectively “fuzzing” your infrastructure to find the breaking points that a human wouldn’t think to test [2].
Safety: How to Trust the Chaos Agent?
Giving an AI agent permission to break production sounds insane. And it would be, without strict guardrails.
1. The Blast Radius Budget: Just as we have Error Budgets, we now have Chaos Budgets. An agent is allocated a specific “budget” of impact it can cause (e.g., “You can affect 0.1% of traffic”). If it exceeds this, it is locked out.
2. The Approval Gate: For novel experiments (new target, new fault type), the agent must request human approval. It presents the plan: “I want to inject 50ms latency into the cart service. Predicted impact: minimal. Rollback trigger: error rate > 1%.” The SRE clicks “Approve,” and the agent executes.
3. Distinct Control Plane: The chaos agent should run on a separate control plane. If the network goes down, the agent must fail open (stop the chaos), not fail closed (leave the chaos running forever).
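The Blast Radius Budget is the easiest guardrail to make mechanical. Here is a minimal sketch of the accounting: the agent spends impact from a fixed allocation and is refused once it is exhausted. The units (fraction of traffic) and limits are illustrative.

```python
# Sketch of a "chaos budget": experiments spend impact from a fixed
# allocation, mirroring how error budgets gate risky deploys.
class ChaosBudget:
    def __init__(self, max_traffic_fraction=0.001):  # 0.1% of traffic
        self.remaining = max_traffic_fraction

    def request(self, traffic_fraction):
        """Approve an experiment only if it fits in the remaining budget."""
        if traffic_fraction > self.remaining:
            return False  # locked out until the budget resets
        self.remaining -= traffic_fraction
        return True

budget = ChaosBudget()
print(budget.request(0.0005))  # → True  (fits in the 0.1% allocation)
print(budget.request(0.0008))  # → False (exceeds what is left)
```

In practice the budget would reset on a schedule and be tracked per service, but even this toy version makes the lockout rule auditable rather than a matter of agent self-restraint.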
The Future: Continuous Background Radiation
We are moving toward a world where chaos engineering isn’t an event; it’s background radiation.
Imagine a system where small, controlled faults are constantly being injected by an agent, 24/7.
- A pod is killed here.
- A network link is throttled there.
- A database failover is triggered during lunch.
This constant, low-level stress acts like an immune system for your infrastructure. It ensures that redundancy actually works. It ensures that timeouts are actually tuned. And because it’s managed by an agent that watches SLOs like a hawk, it remains safe.
In this future, “perfect stability” isn’t the goal. Resilience to constant change is. And the only way to achieve that scale of resilience is to have a machine constantly testing it.
References
- [1] Amazon Web Services. (2025, May 13). “Chaos engineering made clear: Generate AWS FIS experiments using natural language through Amazon Bedrock”. AWS Blog.
- [2] Red Hat Developer. (2025, Oct 21). “Krkn-AI: A feedback-driven approach to chaos engineering”.
- [3] Kikuta, et al. (2025, Nov 11). “ChaosEater: Advanced agentic workflows for scalable, end-to-end fully automated CE”. Emergent Mind / Journal of Cloud Systems.
- [4] Amazon Web Services. (2025). “Resilience Testing Tools - AWS Fault Injection Service”.
- [5] Daily AWS. (2025, Nov 12). “AWS Fault Injection Service (FIS) launches new test scenarios for partial failures”.
- [6] Microsoft. (2025). “Azure Chaos Studio - Chaos engineering experimentation”. Microsoft Azure Product Documentation.