Learn

This section explains how RunWhen works, what problems it solves, and links to the detailed documentation and hands-on tutorials that will help you get productive.

What is RunWhen?

RunWhen is an AI SRE platform that automates troubleshooting and remediation for Kubernetes and cloud environments. It connects AI Assistants to your infrastructure through a library of automated diagnostic and remediation tasks, so your team can investigate incidents in natural language instead of memorizing kubectl commands and runbooks.

What it replaces:

Manual runbooks and tribal knowledge
Waiting for the one person who knows the system
Copy-pasting kubectl commands from Slack threads
Starting incident triage from scratch every time

What it provides:

Reduce MTTR — AI-powered root cause analysis backed by automated diagnostic tasks
Scale expertise — Encode troubleshooting knowledge into reusable tasks available to all teams
Automate triage — Run diagnostics immediately when alerts fire, before humans get involved
Developer self-service — Give developers the ability to investigate issues in their own environments without waiting for SRE escalation

How RunWhen Works

RunWhen is different from traditional observability tools. Instead of just collecting metrics and logs, RunWhen runs automated tasks that actively investigate your environment and produce structured insights designed for AI consumption.

Background Mode: Building Production Insights

RunWhen continuously runs tasks in the background to understand your environment:

Health Monitoring — Check pod status, deployment health, service availability
Discovery — Identify resources, map dependencies, catalog configurations
Baseline Tasks — Establish normal behavior, detect anomalies, track trends

The results are structured production insights rather than raw metrics. Because the data is already organized for LLM consumption, the AI Assistant uses fewer tokens and gives more accurate responses.
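To make the difference between raw data and a structured insight concrete, here is a minimal Python sketch. It is not RunWhen's actual data model: the `Insight` class, its field names, and the `summarize_pods` helper are all hypothetical, chosen only to show how many raw per-pod records can be condensed into one compact finding.

```python
from collections import Counter
from dataclasses import asdict, dataclass, field
import json


@dataclass
class Insight:
    """Hypothetical structured finding; field names are illustrative only."""
    resource: str
    severity: str  # "info" or "warning" in this sketch
    summary: str
    details: dict = field(default_factory=dict)


def summarize_pods(namespace: str, pods: list[dict]) -> Insight:
    """Condense raw per-pod records into a single compact finding."""
    phases = Counter(p["phase"] for p in pods)
    # Simplification: any pod that is not Running counts as unhealthy.
    unhealthy = sorted(p["name"] for p in pods if p["phase"] != "Running")
    running = len(pods) - len(unhealthy)
    return Insight(
        resource=f"namespace/{namespace}",
        severity="warning" if unhealthy else "info",
        summary=f"{running}/{len(pods)} pods running in {namespace}",
        details={"phases": dict(phases), "unhealthy": unhealthy},
    )


# Made-up input standing in for what a health-monitoring task might collect.
raw_pods = [
    {"name": "api-0", "phase": "Running"},
    {"name": "api-1", "phase": "Running"},
    {"name": "worker-0", "phase": "Pending"},
]

insight = summarize_pods("checkout", raw_pods)
print(json.dumps(asdict(insight), indent=2))
```

The point of the sketch is the shape of the output: one short summary plus a small details object, instead of every raw record being passed to the model.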

Interactive Mode: Workspace Chat

When you ask a question or an alert fires, the AI Assistant combines background insights with new targeted diagnostics:

You ask a question (or an alert fires)
Assistant checks existing production insights from background tasks
Assistant runs new diagnostic tasks targeted at your question
AI analyzes everything — background insights + new diagnostics
You get an actionable answer — root cause, remediation steps, or what to investigate next

The AI Assistant may surface existing Issues that have already been identified, or it may suggest and run new Tasks to gather more information. Both paths are part of the normal workflow.
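The flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the `investigate` function, the `Finding` shape, and the keyword matching used to pick relevant insights and diagnostics are all hypothetical stand-ins for whatever selection logic the AI Assistant actually uses.

```python
from typing import Callable

Finding = dict  # hypothetical shape: {"topic": str, "source": str, "note": str}


def investigate(
    question: str,
    background: list[Finding],
    diagnostics: dict[str, Callable[[], Finding]],
) -> dict:
    """Combine existing insights with fresh diagnostics for one question."""
    # Naive keyword matching; an assumption made only for this sketch.
    words = set(question.lower().replace("?", "").split())

    # Check insights that background tasks have already produced.
    existing = [f for f in background if f["topic"] in words]

    # Run only the diagnostics that match the question.
    fresh = [run() for topic, run in diagnostics.items() if topic in words]

    # Combine both sources. A real assistant would pass these findings
    # to an LLM for analysis; here they are simply returned.
    findings = existing + fresh
    return {
        "question": question,
        "findings": findings,
        "next_step": "review findings" if findings else "broaden the search",
    }


background = [
    {"topic": "checkout", "source": "background", "note": "2/3 pods running"},
]
diagnostics = {
    "checkout": lambda: {
        "topic": "checkout",
        "source": "diagnostic",
        "note": "worker-0 is Pending",
    },
    "billing": lambda: {
        "topic": "billing",
        "source": "diagnostic",
        "note": "no findings",
    },
}

answer = investigate("Why is checkout slow?", background, diagnostics)
for finding in answer["findings"]:
    print(finding["source"], "-", finding["note"])
```

Note that the `billing` diagnostic never runs, because nothing in the question points at it. That mirrors the idea of targeted diagnostics layered on top of insights that already exist.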

Key Concepts

Each concept below has its own page in this section.

Foundations — the durable structure of a workspace:

Concept	What it is
Workspaces	The tenancy boundary — users, SLXs, AI Assistants, secrets, and configuration live inside one.
SLXs	Service Level Expectations — the core unit of operational knowledge. Defines what to monitor, how to measure it, and what to do when something goes wrong.
Tasks & CodeBundles	The automated scripts (Python, Bash, SQL, REST) that run in your environment to collect insights or take action.
Secrets	Credentials managed at the workspace level, stored in Vault, never readable via the API.

The reasoning agent:

Concept	What it is
AI Assistants	The AI agents users interact with. Each has its own confidence thresholds, permissions, and assignments.

Workspace Studio — the four surfaces your team authors to make assistants behave well in your environment:

Concept	What it is
Rules	Standing guidance that shapes how an assistant interprets findings — de-noise, re-prioritize, reframe.
Commands	Named, reusable investigation procedures. Invoked on demand via slash command, or fired automatically on a cron schedule with email/Slack delivery.
Knowledge	Short, retrievable notes that supply narrative facts telemetry cannot — ownership, architecture, history.
Workflows	Event-driven automations — when an external event fires (alert, webhook, SLO violation), start an investigation and route the result.

Operational state — what the platform produces and tracks:

Concept	What it is
Issues & RunSessions	Issues are findings raised by tasks; RunSessions are recorded investigations. They reinforce each other: investigations cite issues, and issues auto-resolve when the underlying condition clears.
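The auto-resolve behavior can be pictured as a reconciliation step that runs after each task execution. The sketch below is a guess at the general pattern, not RunWhen's implementation: the `Issue` class, the `reconcile` function, and the issue key format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical issue record; fields are illustrative only."""
    key: str
    title: str
    status: str = "open"


def reconcile(issues: dict[str, Issue], failing: dict[str, str]) -> None:
    """Update issues in place from the conditions a task run found failing.

    `failing` maps a condition key to a human-readable title.
    """
    # Raise (or reopen) an issue for every condition that is failing now.
    for key, title in failing.items():
        if key not in issues or issues[key].status == "resolved":
            issues[key] = Issue(key, title)
    # Resolve any issue whose condition no longer appears.
    for key, issue in issues.items():
        if key not in failing:
            issue.status = "resolved"


issues: dict[str, Issue] = {}

# First task run: the condition is failing, so an issue is raised.
reconcile(issues, {"checkout/pending-pod": "worker-0 is Pending"})
status_while_failing = issues["checkout/pending-pod"].status

# Later task run: the condition has cleared, so the issue resolves itself.
reconcile(issues, {})
status_after_clear = issues["checkout/pending-pod"].status

print(status_while_failing, "->", status_after_clear)  # open -> resolved
```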

For a complete glossary, see Terms and Concepts.

Why teams adopt RunWhen

Three patterns repeat across teams that get value quickly. They are described here as ideas; for the journeys themselves, see Common User Journeys.

Self-service over escalation. When developers can interrogate their own services in natural language, fewer questions land in the platform team’s queue.
Fewer cold starts on incidents. Automated diagnostics run before a human is paged, so on-call engineers begin with structured findings instead of an empty terminal.
Knowledge that outlives staff turnover. Tasks encode automation; Rules encode interpretation; Knowledge notes encode the narrative facts that telemetry cannot capture.

Where to next

If you want to…	Go to
See the product in action	Live demos — Sandbox scenarios with ready prompts
Learn how to use the UI	Use — Workspace Chat, Studio, journeys
Set up your own workspace	Configure — agents, integrations, Slack
Look up a specific term	Terms and Concepts