Durable SRE Investigations: Putting HolmesGPT into production
Martez Killens
Solutions Engineer
Alice Gibbons
Customer Success Lead
What does it take to put a real AI agent into production on Diagrid Catalyst? This is the second webinar in our series on building practical agents on Catalyst, and this one is the SRE version: an agent your on-call team can actually run.
We start the session with a Durable SRE Investigator. It is a Chainlit chat UI in front of HolmesGPT that triages live incidents. Ask it "auth-service is in CrashLoopBackOff, what's going on?" and HolmesGPT reasons across Kubernetes, ArgoCD, Prometheus, Grafana, GitHub, and MongoDB to pick the right tools and assemble an answer. Because every tool call is recorded with its inputs and outputs, on-call can scroll back through any past investigation, inspect what each tool returned, and re-run any individual step against the current state of your systems. That is what turns the agent from a one-shot answer engine into something on-call can interrogate and trust during a real incident.
We will also share what we learned building it. Why we picked HolmesGPT and Chainlit so we were not writing the SRE brain or the chat UI from scratch. How feeding conversation history back into each query turned the agent from a question-answering box into something on-call can iterate with. And where small choices, like how skills match a question and which toolsets we expose, changed how usable the agent felt in practice.
Durability is what makes any of this credible. Most SRE agent demos do not survive a pod restart. Wrapping the agent in a Catalyst workflow persists every LLM call and every tool invocation as a step, so the investigation picks up where it left off if anything dies mid-run.
We close with an agent you can run yourself: an Incident Docs Agent that reads runbooks and postmortems from Azure Blob or S3 and surfaces the right precedent when a new incident comes in. The repo is yours to clone, deploy on Catalyst, and take further.
Who this is for
- Platform and SRE teams running Diagrid Catalyst (or evaluating it) for agentic AI
- Engineers building production AI agents who need durability past the happy path
- AIOps and observability teams evaluating agentic AI for incident response
- HolmesGPT users looking to put it into production beyond ad-hoc CLI use
- Architects deciding between bespoke agent infrastructure and a managed runtime
Demo details
Live, end-to-end. Open the Chainlit UI, ask "auth-service is in CrashLoopBackOff, what's going on?", and watch HolmesGPT call kubectl, ArgoCD, Prometheus, and GitHub MCP in sequence, with each tool invocation streamed in as a durable step. The headline moment: /replay <instance_id> <seq> re-runs any individual tool call from a past investigation directly against current state, with no LLM call, no workflow restart, and no token spend. The same primitives that make the workflow crash-safe make every step independently inspectable and re-runnable.
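To make the replay idea concrete, here is a minimal sketch of how a recorded tool call could be re-run without involving the LLM. Only the `/replay <instance_id> <seq>` command shape comes from the description above. The `ToolCall` record, the `history` and `tools` dictionaries, and the returned comparison are assumptions made for illustration, not the demo's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation from a past investigation."""
    tool: str
    inputs: dict[str, Any]
    output: Any


def parse_replay(command: str) -> tuple[str, int]:
    """Split '/replay <instance_id> <seq>' into its two arguments."""
    parts = command.split()
    if len(parts) != 3 or parts[0] != "/replay":
        raise ValueError("usage: /replay <instance_id> <seq>")
    return parts[1], int(parts[2])


def replay(
    command: str,
    history: dict[tuple[str, int], ToolCall],
    tools: dict[str, Callable[..., Any]],
) -> dict[str, Any]:
    """Re-run one recorded tool call with its recorded inputs against current state."""
    instance_id, seq = parse_replay(command)
    recorded = history[(instance_id, seq)]
    # Only the tool runs here: no LLM call and no workflow restart.
    current = tools[recorded.tool](**recorded.inputs)
    return {
        "recorded": recorded.output,
        "current": current,
        "changed": current != recorded.output,
    }
```

Because the recorded inputs are enough to call the tool again, on-call can compare what a step returned during the incident with what it returns now.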


