What does it take to put a real AI agent into production on Diagrid Catalyst? This second webinar in our series on building practical agents on Catalyst is the SRE version: an agent your on-call team can actually run.

We start the session with a Durable SRE Investigator. It is a Chainlit chat UI in front of HolmesGPT that triages live incidents. Ask it "auth-service is in CrashLoopBackOff, what's going on?" and HolmesGPT reasons across Kubernetes, ArgoCD, Prometheus, Grafana, GitHub, and MongoDB to pick the right tools and assemble an answer. Because every tool call is recorded with its inputs and outputs, on-call can scroll back through any past investigation, inspect what each tool returned, and re-run any individual step against the current state of your systems. That is what turns the agent from a one-shot answer engine into something on-call can interrogate and trust during a real incident.
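To make the "every tool call is recorded with its inputs and outputs" idea concrete, here is a minimal sketch in Python. It is not the HolmesGPT or Catalyst API. The names `ToolCallRecord`, `Investigation`, and `pod_status` are hypothetical, and the log lives in memory only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCallRecord:
    """One recorded tool call: which tool ran, with what inputs, and what it returned."""
    seq: int
    tool: str
    inputs: dict[str, Any]
    output: Any


@dataclass
class Investigation:
    """Hypothetical in-memory log of the tool calls made during one investigation."""
    records: list[ToolCallRecord] = field(default_factory=list)

    def call(self, tool_name: str, fn: Callable[..., Any], /, **inputs: Any) -> Any:
        # Run the tool, then append a record so the call can be inspected later.
        output = fn(**inputs)
        self.records.append(
            ToolCallRecord(seq=len(self.records), tool=tool_name, inputs=inputs, output=output)
        )
        return output


def pod_status(name: str) -> str:
    # Stand-in for a real Kubernetes lookup (assumption for illustration only).
    return "CrashLoopBackOff"


if __name__ == "__main__":
    inv = Investigation()
    inv.call("pod_status", pod_status, name="auth-service")
    for record in inv.records:
        print(record)
```

Because each record keeps the tool name and its inputs alongside the output, a past step can later be inspected or invoked again with the same arguments.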

We will also share what we learned building it. Why we picked HolmesGPT and Chainlit so we were not writing the SRE brain or the chat UI from scratch. How feeding conversation history back into each query turned the agent from a question-answering box into something on-call can iterate with. And where small choices, like how skills match a question and which toolsets we expose, changed how usable the agent felt in practice.
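As a rough illustration of feeding conversation history back into each query, the sketch below prepends recent turns to the new question before it would be sent to the agent. The `Conversation` class, the `max_turns` cap, and the prompt layout are assumptions made for this example, not how HolmesGPT or Chainlit structure history.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    question: str
    answer: str


@dataclass
class Conversation:
    """Hypothetical history buffer; max_turns is an assumed cap on how much context is replayed."""
    turns: list[Turn] = field(default_factory=list)
    max_turns: int = 5

    def build_prompt(self, question: str) -> str:
        # Keep only the most recent turns so the prompt does not grow without bound.
        recent = self.turns[-self.max_turns:] if self.max_turns > 0 else []
        lines: list[str] = []
        for turn in recent:
            lines.append(f"User: {turn.question}")
            lines.append(f"Assistant: {turn.answer}")
        lines.append(f"User: {question}")
        return "\n".join(lines)

    def record(self, question: str, answer: str) -> None:
        # Store the finished exchange so the next query can refer back to it.
        self.turns.append(Turn(question, answer))


if __name__ == "__main__":
    conv = Conversation()
    conv.record("auth-service is in CrashLoopBackOff, what's going on?",
                "The container is being OOMKilled.")
    print(conv.build_prompt("What changed in the last deploy?"))
```

With the earlier exchange included, a follow-up such as "What changed in the last deploy?" can be answered in the context of the incident already under discussion.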

Durability is what makes any of this credible. Most SRE agent demos do not survive a pod restart. Wrapping the agent in a Catalyst workflow persists every LLM call and every tool invocation as a step, so the investigation picks up where it left off if anything dies mid-run.
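The general idea behind step-level durability can be sketched as a journal that persists each step's result and skips completed steps after a restart. This is a simplified illustration, not the Catalyst workflow API: `StepJournal`, `investigate`, the step names, and the JSON file are all hypothetical, and real workflow engines handle concurrency, retries, and storage very differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class StepJournal:
    """Hypothetical journal: persists each step's result to a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load results from an earlier run, if one left a journal behind.
        self.results: dict[str, Any] = json.loads(path.read_text()) if path.exists() else {}

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.results:
            # Step already completed before the restart: reuse the saved result.
            return self.results[name]
        result = fn()
        self.results[name] = result
        self.path.write_text(json.dumps(self.results))
        return result


def investigate(journal: StepJournal, calls: list[str]) -> dict[str, Any]:
    # Stand-in tools (assumptions); `calls` tracks which ones actually executed.
    def list_pods() -> list[str]:
        calls.append("list_pods")
        return ["auth-service-abc"]

    def fetch_logs() -> str:
        calls.append("fetch_logs")
        return "OOMKilled"

    pods = journal.run_step("list_pods", list_pods)
    logs = journal.run_step("fetch_logs", fetch_logs)
    return {"pods": pods, "logs": logs}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "journal.json"
        # Simulate a first run that completed one step and then died.
        StepJournal(path).run_step("list_pods", lambda: ["auth-service-abc"])
        calls: list[str] = []
        print(investigate(StepJournal(path), calls))
        print(calls)  # only the step that had not completed is executed
```

The second run loads the journal from disk, so the step that finished before the simulated crash is not executed again.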

We close with an agent you can run yourself: an Incident Docs Agent that reads runbooks and postmortems from Azure Blob or S3 and surfaces the right precedent when a new incident comes in. The repo is yours to clone, deploy on Catalyst, and take further.

Who this is for

Platform and SRE teams running Diagrid Catalyst (or evaluating it) for agentic AI
Engineers building production AI agents who need durability past the happy path
AIOps and observability teams evaluating agentic AI for incident response
HolmesGPT users looking to put it into production beyond ad-hoc CLI use
Architects deciding between bespoke agent infrastructure and a managed runtime

Demo details

Live, end-to-end. Open the Chainlit UI, ask "auth-service is in CrashLoopBackOff, what's going on?", and watch HolmesGPT call kubectl, ArgoCD, Prometheus, and GitHub MCP in sequence, with each tool invocation streamed in as a durable step. The headline moment: /replay <instance_id> <seq> re-runs any individual tool call from a past investigation directly against current state, with no LLM, no workflow restart, and no token spend. The same primitives that make the workflow crash-safe make every step independently inspectable and re-runnable.
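One way to picture what a command like /replay <instance_id> <seq> has to do is sketched below: parse the two arguments, look up the recorded call, and invoke the same tool with the recorded inputs, without involving an LLM. Only the command shape comes from the demo description. `ToolCall`, `parse_replay`, `replay`, the in-memory `history` dict, and the `pod_status` tool are hypothetical and are not the demo's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    """A recorded tool call from a past investigation (hypothetical shape)."""
    seq: int
    tool: str
    inputs: dict[str, Any]
    output: Any


def parse_replay(command: str) -> tuple[str, int]:
    # Expected shape: /replay <instance_id> <seq>
    parts = command.split()
    if len(parts) != 3 or parts[0] != "/replay":
        raise ValueError("usage: /replay <instance_id> <seq>")
    return parts[1], int(parts[2])


def replay(
    command: str,
    history: dict[str, list[ToolCall]],
    tools: dict[str, Callable[..., Any]],
) -> dict[str, Any]:
    instance_id, seq = parse_replay(command)
    for call in history.get(instance_id, []):
        if call.seq == seq:
            # Re-invoke the same tool with the recorded inputs against current state.
            current = tools[call.tool](**call.inputs)
            return {"recorded": call.output, "current": current}
    raise KeyError(f"no step {seq} recorded for instance {instance_id}")


if __name__ == "__main__":
    # Assumed data: one past investigation with a single recorded step.
    history = {
        "inv-1": [ToolCall(0, "pod_status", {"name": "auth-service"}, "CrashLoopBackOff")]
    }
    tools = {"pod_status": lambda name: "Running"}  # stand-in for a live lookup
    print(replay("/replay inv-1 0", history, tools))
```

Returning the recorded output next to the fresh one lets on-call see at a glance whether the system has changed since the original investigation.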

Durable SRE Investigations: Putting HolmesGPT into production
