Learn

This section explains how RunWhen works and what problems it solves, and it links to the detailed documentation and hands-on tutorials that will help you get productive.

What is RunWhen?

RunWhen is an AI SRE platform that automates troubleshooting and remediation for Kubernetes and cloud environments. It connects AI Assistants to your infrastructure through a library of automated diagnostic and remediation tasks, so your team can investigate incidents in natural language instead of memorizing kubectl commands and runbooks.

What it replaces:

  • Manual runbooks and tribal knowledge
  • Waiting for the one person who knows the system
  • Copy-pasting kubectl commands from Slack threads
  • Starting incident triage from scratch every time

What it provides:

  • Reduce MTTR — AI-powered root cause analysis backed by automated diagnostic tasks
  • Scale expertise — Encode troubleshooting knowledge into reusable tasks available to all teams
  • Automate triage — Run diagnostics immediately when alerts fire, before humans get involved
  • Developer self-service — Give developers the ability to investigate issues in their own environments without waiting for SRE escalation

How RunWhen Works

RunWhen is different from traditional observability tools. Instead of just collecting metrics and logs, RunWhen runs automated tasks that actively investigate your environment and produce structured insights designed for AI consumption.

Background Mode: Building Production Insights

RunWhen continuously runs tasks in the background to understand your environment:

  • Health Monitoring — Check pod status, deployment health, service availability
  • Discovery — Identify resources, map dependencies, catalog configurations
  • Baseline Tasks — Establish normal behavior, detect anomalies, track trends

The results are structured production insights rather than raw metrics. Because the data is already shaped for LLM consumption, the AI Assistant consumes fewer tokens and can respond more accurately.
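To make the contrast with raw metrics concrete, here is a minimal Python sketch. The `Insight` shape, its field names, and the `summarize_pods` helper are assumptions made for illustration; they are not RunWhen's actual schema or API. The sketch collapses per-pod status records into one compact finding, which is the kind of pre-structured input the paragraph above describes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Insight:
    """Hypothetical structured finding; not RunWhen's real schema."""
    resource: str
    severity: str
    summary: str
    next_steps: list[str]


def summarize_pods(deployment: str, pods: list[dict]) -> Insight:
    """Collapse raw per-pod status records into one structured insight."""
    unhealthy = [p["name"] for p in pods if p["phase"] != "Running"]
    if not unhealthy:
        return Insight(deployment, "info", f"All {len(pods)} pods are running.", [])
    return Insight(
        resource=deployment,
        severity="warning",
        summary=f"{len(unhealthy)} of {len(pods)} pods are not running: "
                f"{', '.join(unhealthy)}",
        next_steps=["Inspect recent events for the unhealthy pods"],
    )


# Illustrative raw input: one record per pod, as a collector might return it.
raw_pods = [
    {"name": "cart-7d9f-abcde", "phase": "Running", "restarts": 0},
    {"name": "cart-7d9f-fghij", "phase": "Pending", "restarts": 0},
    {"name": "cart-7d9f-klmno", "phase": "Running", "restarts": 2},
]

insight = summarize_pods("cart", raw_pods)
print(json.dumps(asdict(insight), indent=2))
```

The point of the design is that the model receives one short, labeled record per finding instead of every raw data point behind it.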

Interactive Mode: Workspace Chat

When you ask a question or an alert fires, the AI Assistant combines background insights with new targeted diagnostics:

  1. You ask a question (or an alert fires)
  2. Assistant checks existing production insights from background tasks
  3. Assistant runs new diagnostic tasks targeted at your question
  4. AI analyzes everything — background insights + new diagnostics
  5. You get an actionable answer — root cause, remediation steps, or what to investigate next

The AI Assistant may surface existing Issues that have already been identified, or it may suggest and run new Tasks to gather more information. Both paths are part of the normal workflow.
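The five steps above can be sketched in a few lines of Python. Everything here is hypothetical: `answer_question`, the dictionary-based findings, and `simple_analyze` (a stand-in for the AI analysis step) only mirror the flow described in the text and do not represent RunWhen's implementation or API.

```python
from typing import Callable

# A finding is a plain dict with "resource" and "summary" keys (illustrative).
Finding = dict[str, str]


def answer_question(
    question: str,
    background_insights: list[Finding],
    diagnostic_tasks: dict[str, Callable[[], Finding]],
    analyze: Callable[[str, list[Finding]], str],
) -> str:
    text = question.lower()
    # Step 2: reuse background insights whose resource is named in the question.
    relevant = [i for i in background_insights if i["resource"].lower() in text]
    # Step 3: run only the diagnostic tasks targeted at a named resource.
    fresh = [
        task() for resource, task in diagnostic_tasks.items()
        if resource.lower() in text
    ]
    # Step 4: analyze background insights and new diagnostics together.
    return analyze(question, relevant + fresh)


def simple_analyze(question: str, findings: list[Finding]) -> str:
    """Stand-in for the AI analysis step: join the findings into one answer."""
    if not findings:
        return "No findings; consider running broader diagnostics."
    return "; ".join(f"{f['resource']}: {f['summary']}" for f in findings)


background = [
    {"resource": "cart", "summary": "1 of 3 pods pending"},
    {"resource": "checkout", "summary": "healthy"},
]
tasks = {
    "cart": lambda: {"resource": "cart",
                     "summary": "pending pod lacks a schedulable node"},
}

# Steps 1 and 5: a question goes in, an answer comes out.
print(answer_question("Why is cart slow?", background, tasks, simple_analyze))
```

Matching resources by substring is a deliberate simplification; it stands in for whatever relevance logic a real assistant would apply.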

Key Concepts

Each concept below has its own page in this section.

Foundations — the durable structure of a workspace:

Concept | What it is
Workspaces | The tenancy boundary: users, SLXs, AI Assistants, secrets, and configuration live inside one.
SLXs | Service Level Expectations: the core unit of operational knowledge. Defines what to monitor, how to measure it, and what to do when something goes wrong.
Tasks & CodeBundles | The automated scripts (Python, Bash, SQL, REST) that run in your environment to collect insights or take action.
Secrets | Credentials managed at the workspace level, stored in Vault, never readable via the API.

The reasoning agent:

Concept | What it is
AI Assistants | The AI agents users interact with. Each has its own confidence thresholds, permissions, and assignments.

Workspace Studio — the four surfaces your team authors to make assistants behave well in your environment:

Concept | What it is
Rules | Standing guidance that shapes how an assistant interprets findings: de-noise, re-prioritize, reframe.
Commands | Named, reusable investigation procedures. Invoked on-demand via slash command, or fired automatically on a cron schedule with email/Slack delivery.
Knowledge | Short, retrievable notes that supply narrative facts telemetry cannot: ownership, architecture, history.
Workflows | Event-driven automations: when an external event fires (alert, webhook, SLO violation), start an investigation and route the result.

Operational state — what the platform produces and tracks:

Concept | What it is
Issues & RunSessions | Issues are findings raised by tasks; RunSessions are recorded investigations. They reinforce each other: investigations cite issues, and issues auto-resolve when the underlying condition clears.
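As a rough illustration of auto-resolution, the sketch below reconciles a store of issues against the findings from the latest task run. The `Issue` record, its fields, and the `reconcile` function are hypothetical and assume a simple key-per-condition model; the platform's actual lifecycle may differ.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical issue record; field names are illustrative."""
    key: str
    title: str
    status: str = "open"


def reconcile(issues: dict[str, Issue], current_findings: dict[str, str]) -> dict[str, Issue]:
    """Open issues for new findings and resolve issues whose condition cleared."""
    for key, title in current_findings.items():
        # Raise a new issue, or reopen one that was previously resolved.
        if key not in issues or issues[key].status == "resolved":
            issues[key] = Issue(key, title)
    for key, issue in issues.items():
        # The condition no longer appears in the latest run: mark it resolved.
        if key not in current_findings:
            issue.status = "resolved"
    return issues


issues: dict[str, Issue] = {}
reconcile(issues, {"cart/pending-pod": "Pod pending in cart"})  # task raises a finding
reconcile(issues, {})                                           # next run: condition cleared
print(issues["cart/pending-pod"].status)
```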

For a complete glossary, see Terms and Concepts.

Why teams adopt RunWhen

Three patterns repeat across teams that get value quickly. They are framed as ideas — for the journeys themselves, see Common User Journeys.

  • Self-service over escalation. When developers can interrogate their own services in natural language, fewer questions land in the platform team’s queue.
  • Fewer cold starts on incidents. Automated diagnostics run before a human is paged, so on-call engineers begin with structured findings instead of an empty terminal.
  • Knowledge that outlives staff turnover. Tasks encode automation; Rules encode interpretation; Knowledge notes encode the narrative facts that telemetry cannot capture.

Where to next

If you want to… | Go to
See the product in action | Live demos: Sandbox scenarios with ready prompts
Learn how to use the UI | Use: Workspace Chat, Studio, journeys
Set up your own workspace | Configure: agents, integrations, Slack
Look up a specific term | Terms and Concepts