Architecture & Overview

Conceptual Overview

RunWhen is a system for troubleshooting investigations and remediations that uses Agentic AI backed by thousands of automations (“Tasks”). Tasks are pre-configured and installed in your environment to collect data and (where appropriate) make changes.

For example, the problem may be “the cartservice is down” or “postgres instance 3as41 is slow.” The system responds by running Tasks like “check cartservice logs for errors” or “run postgres diagnostics for dev-db-5s2” and summarizing their results, or checking to see if Tasks similar to these have been run recently.

For most teams, about 95% of Tasks come from the public library; the remaining 5% are private Tasks that wrap existing bash, Python, SQL, REST calls, and similar tooling.

Software Components: Conceptual Overview

There are three major components of the software:

A Backend (RunWhen SaaS or self-hosted*)
A Local Agent (typically one for each major security zone, e.g. dev/test/prod or prod-1-a, prod-1-b, etc.)
An LLM endpoint (RunWhen provided or bring-your-own)

The Backend is responsible for the Agentic loops and supporting platform services. These take a user query, an alert, a ticket, a chat message, etc. and convert it into a set of Tasks to run or historical Task output to retrieve.

The Local Agent is responsible for running Tasks. It is installed as a set of pods in one of the customer's Kubernetes clusters. Most users have one Local Agent instance for dev and one for prod, or one per security zone. Details on this are below.

Installation Requirements

Installation requirements depend on the type of installation (see more below):

Requirements for a hybrid deployment that uses the Local Agent and the RunWhen SaaS Backend can be found here
Requirements for a deployment that uses the Local Agent and “self-hosted” Backend in your own AWS, GCP or Azure VPCs can be found here

Local Agent In Detail

The Local Agent is a set of pods, installed via a Helm chart into a RunWhen namespace on a customer's Kubernetes cluster. More details and locations of each of these images can be found here.

There are three required container images and four more that are typical of most installations.

The Workspace Builder pod creates and sends configuration data needed for Tasks to the backend system. See "Workspace Building" below.
The Task Runner pod receives instructions on when to run various Tasks, maintaining a local queue of Task requests. See "Runtime" below.
Each Task request is delegated to a Worker pod. Worker pods are built off the same base container image, but the final image varies slightly. See "CodeCollection Building" below for details. A default RunWhen installation includes container images corresponding to:
1. rw-cli-codecollection - Tasks emulating specific CLI interactions for Kubernetes and major cloud providers
2. rw-public-codecollection - Tasks that integrate with numerous OSS packages and DevOps tools
3. rw-workspace-utils - Various utility Tasks involving the RunWhen platform itself
4. rw-generic-codecollection - Customizable wrappers for existing/private CLI commands, bash, python, etc.
The OpenTelemetry Collector pod is responsible for health monitoring of the rest of the Local Agent.
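
The relationship between the Task Runner's local queue and the Worker pods can be illustrated with a minimal sketch. This is not the RunWhen implementation: the TaskRequest fields, the TaskRunner class, the worker callables, and the "unknown-codecollection" name are all hypothetical stand-ins, and a real Worker is a pod rather than a Python function.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """A hypothetical Task request; the field names are illustrative only."""
    title: str
    codecollection: str  # which Worker image should run this Task


class TaskRunner:
    """Keeps a local FIFO queue and hands each request to a matching Worker."""

    def __init__(self, workers):
        # Maps a CodeCollection name to a callable standing in for a Worker pod.
        self.workers = workers
        self.queue = deque()

    def submit(self, request):
        self.queue.append(request)

    def drain(self):
        """Run queued requests in arrival order and collect their results."""
        results = []
        while self.queue:
            request = self.queue.popleft()
            worker = self.workers.get(request.codecollection)
            if worker is None:
                # No Worker image is installed for this CodeCollection.
                results.append((request.title, "no worker for " + request.codecollection))
                continue
            results.append((request.title, worker(request)))
        return results


def cli_worker(request):
    # Stand-in for a Worker pod built from rw-cli-codecollection.
    return "ran via rw-cli-codecollection"


runner = TaskRunner({"rw-cli-codecollection": cli_worker})
runner.submit(TaskRequest("check cartservice logs for errors", "rw-cli-codecollection"))
runner.submit(TaskRequest("run postgres diagnostics", "unknown-codecollection"))
results = runner.drain()
```

The point of the sketch is the delegation step: the queue only orders requests, and the CodeCollection named on each request decides which Worker image handles it.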

Background: Build Time and Run Time

To deliver the behavior described above, the system is built around three major processes: Workspace Building, CodeCollection Building, and Runtime.

Workspace Building:

Workspace building is the process of creating the configuration required for the automated Tasks from RunWhen libraries that are allowed to run in your environment. It involves a process on the Local Agent that uses read-only credentials provided to discover infrastructure, platform and application resources. It matches those resources with Tasks that the team has approved.

Example: A Workspace Builder pod is given a read-only Kubernetes service account with access to the online-boutique namespace, along with approval to use the RunWhen rw-cli read-only CodeCollection, which contains the “Inspect Kubernetes Warning Events for ${NAMESPACE}” Task. The Workspace Builder pod generates a configuration file, similar to the JSON snippet below, and sends it to the Backend.

{
  "taskTitle": "Inspect Kubernetes Warning Events for ${NAMESPACE}",
  "configProvided": { "NAMESPACE_NAME": "online-boutique" }
}
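
The matching step described above can be sketched roughly as follows. This is an illustration under assumptions, not the Workspace Builder's actual logic: the resource dictionaries, the "resource_kind" field, and the build_workspace_config function are hypothetical, and only the output shape (taskTitle, configProvided, NAMESPACE_NAME) is taken from the snippet above.

```python
import json

# Hypothetical list of Tasks the team has approved, each tagged with the
# kind of resource it applies to. The "resource_kind" field is an assumption.
APPROVED_TASKS = [
    {
        "title": "Inspect Kubernetes Warning Events for ${NAMESPACE}",
        "resource_kind": "namespace",
    },
]

# Hypothetical result of read-only discovery against the cluster.
DISCOVERED = [
    {"kind": "namespace", "name": "online-boutique"},
    {"kind": "deployment", "name": "cartservice"},  # no approved Task matches this
]


def build_workspace_config(discovered, approved_tasks):
    """Pair each discovered resource with every approved Task of the same kind."""
    entries = []
    for resource in discovered:
        for task in approved_tasks:
            if task["resource_kind"] != resource["kind"]:
                continue
            entries.append({
                "taskTitle": task["title"],
                "configProvided": {"NAMESPACE_NAME": resource["name"]},
            })
    return entries


config = build_workspace_config(DISCOVERED, APPROVED_TASKS)
payload = json.dumps(config, indent=2)  # what would be sent to the Backend
```

Only resources that match an approved Task produce configuration, which is why the cartservice deployment in this example yields no entry.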

CodeCollection (Library) Building:

When new Tasks are added to either public or private libraries (source code), the source code is analyzed and built into container images using the RunWhen CI/CD pipeline. For public (open source) libraries, the pipeline also uses LLMs to generate thousands of embeddings describing the situations where each Task is relevant, and adds them to the RunWhen knowledge base. The pipeline produces two outputs: container images stored in the GCP Artifact Registry, and the public and private Knowledge Bases that power the RunWhen backend system.

The list of container images used by the default CodeCollections is available on request.

Putting It Together: Runtime

At runtime, the system takes REST calls (from the RunWhen Web UI, Slack, etc.) containing natural language and converts them into a vector search against a highly specialized, tenant-specific vector database. The result is a Task configuration, built during Workspace Building, that can be executed on a Worker container image built during CodeCollection Building. When the Local Agent executes this Task, metadata is sent back to the backend systems so they can determine whether further Tasks are required.
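
The query-to-Task lookup can be illustrated with a toy example. This sketch substitutes a bag-of-words vector and cosine similarity for the LLM-generated embeddings and vector database described above, so it shows only the general shape of the search; the embed, cosine, and search functions and the TASK_CONFIGS list are hypothetical.

```python
import math
from collections import Counter

# Hypothetical Task configurations produced during Workspace Building.
TASK_CONFIGS = [
    {"taskTitle": "check cartservice logs for errors"},
    {"taskTitle": "run postgres diagnostics for dev-db-5s2"},
]


def embed(text):
    """Toy embedding: word counts (a stand-in for real LLM embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, task_configs, top_k=1):
    """Return up to top_k Task configurations most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(cfg["taskTitle"])), cfg) for cfg in task_configs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop configurations that share nothing with the query.
    return [cfg for score, cfg in scored[:top_k] if score > 0]


cart_hits = search("the cartservice is down", TASK_CONFIGS)
db_hits = search("postgres instance 3as41 is slow", TASK_CONFIGS)
```

Here the two example problems from the overview each resolve to the corresponding Task configuration, which the Local Agent would then execute on a Worker.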