Architecture & Overview

Conceptual Overview

RunWhen is a system that takes natural language about an operational problem and runs pre-approved, automated Tasks.

For example, the problem may be “the cartservice is down” or “postgres instance 3as41 is slow.” The system responds by running Tasks like “check cartservice logs for errors” or “run postgres diagnostics for dev-db-5s2.”

About 95% of most teams' Tasks come from the public library, while the remaining 5% are private Tasks that wrap existing bash, python, SQL, REST, etc.

Software Components: Conceptual Overview

There are two major components of the software: a Backend component (multi-tenant SaaS, single-tenant SaaS or self-hosted*) and a Local Agent component named "RunWhen Local."

The Local Agent, which is responsible for running Tasks, is installed as a set of pods running in one of the customer's Kubernetes clusters. Most users have one Local Agent instance for dev and one for prod, or one per security zone. Details on this are below.

The Backend is responsible for converting natural language into a list of Tasks to run, and handling the metadata after each Task completes. It does this with a tenant-specific vector database (shown below) that handles incoming queries with a very specific form of vector search to find the right Task.

A conceptual overview of the system at runtime is shown below.

Local Agent In Detail

The Local Agent is a set of pods, installed via a Helm chart into a RunWhen namespace on a customer's Kubernetes cluster. More details and locations of each of these images can be found here.

There are three required container images and four more that are typical of most installations.

The Workspace Builder pod creates and sends configuration data needed for Tasks to the backend system. See "Workspace Building" below.
The Task Runner pod receives instructions on when to run various Tasks, maintaining a local queue of Task requests. See "Runtime" below.
Each Task request is delegated to a Worker pod. Worker pods are built off the same base container image, but the final image varies slightly. See "CodeCollection Building" below for details. A default RunWhen installation includes container images corresponding to:
1. rw-cli-codecollection - Tasks emulating specific CLI interactions for Kubernetes and major cloud providers
2. rw-public-codecollection - Tasks that integrate with numerous OSS packages and DevOps tools
3. rw-workspace-utils - Various utility Tasks involving the RunWhen platform itself
4. rw-generic-codecollection - Customizable wrappers for existing/private CLI commands, bash, python, etc.
The OpenTelemetry Collector pod is responsible for the health monitoring of the rest of the Local Agent.
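
The hand-off from the Task Runner's local queue to Worker pods can be pictured with a small sketch. This is an illustration only, not RunWhen's implementation: `TaskRequest`, `drain`, and `cli_worker` are hypothetical names, and a real Worker is a pod rather than a Python function.

```python
import queue
from dataclasses import dataclass


@dataclass
class TaskRequest:
    task_title: str
    code_collection: str  # names the Worker image family that should run the Task


def drain(request_queue, workers):
    """Pop every queued request and delegate it to the Worker for its CodeCollection."""
    results = []
    while True:
        try:
            request = request_queue.get_nowait()
        except queue.Empty:
            break  # queue is empty, nothing left to delegate
        worker = workers[request.code_collection]
        results.append(worker(request))
    return results


def cli_worker(request):
    # Hypothetical stand-in for a Worker pod built from rw-cli-codecollection.
    return f"rw-cli-codecollection ran: {request.task_title}"


q = queue.Queue()
q.put(TaskRequest("check cartservice logs for errors", "rw-cli-codecollection"))
results = drain(q, {"rw-cli-codecollection": cli_worker})
print(results)
```

Keying the worker lookup on the CodeCollection name mirrors the point above that Worker images differ slightly per CodeCollection.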

Background: Build Time and Run Time

To turn a natural language problem into executed Tasks, the system is built around three major processes: Workspace Building, CodeCollection Building and Runtime.

Workspace Building:

Workspace building is the process of creating the configuration required for the automated Tasks from RunWhen libraries that are allowed to run in your environment. It involves a process on the Local Agent that uses the read-only credentials it has been given to discover infrastructure, platform and application resources. It then matches those resources with Tasks that the team has approved.

Example: A Workspace Builder pod is given a read-only Kubernetes service account with access to the online-boutique namespace and the approval to use the RunWhen rw-cli read-only CodeCollection, which contains the “Inspect Kubernetes Warning Events for ${NAMESPACE}” task. The Workspace Builder pod generates a configuration file, similar to the JSON snippet below, and sends it to the Backend.


{
  "taskTitle": "Inspect Kubernetes Warning Events for ${NAMESPACE}",
  "configProvided": { "NAMESPACE_NAME": "online-boutique" }
}
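
The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `resource_kind` field, the `build_workspace` function, and the shape of the discovered-resource records are hypothetical and are not taken from the Workspace Builder itself. Only the output fields follow the snippet above.

```python
import json

# Tasks the team has approved, tagged with the kind of resource they apply to.
# The "resource_kind" field is a hypothetical simplification.
approved_tasks = [
    {"title": "Inspect Kubernetes Warning Events for ${NAMESPACE}",
     "resource_kind": "namespace"},
]

# Resources found with the read-only credentials (hypothetical record shape).
discovered = [
    {"kind": "namespace", "name": "online-boutique"},
    {"kind": "bucket", "name": "asset-store"},  # no approved Task matches this one
]


def build_workspace(resources, tasks):
    """Pair each discovered resource with every approved Task for its kind."""
    configs = []
    for resource in resources:
        for task in tasks:
            if task["resource_kind"] == resource["kind"]:
                configs.append({
                    "taskTitle": task["title"],
                    "configProvided": {"NAMESPACE_NAME": resource["name"]},
                })
    return configs


configs = build_workspace(discovered, approved_tasks)
print(json.dumps(configs, indent=2))
```

Resources with no approved Task produce no configuration, so nothing outside the approved set can be scheduled later.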

CodeCollection (Library) Building:

When new Tasks are added to either public or private libraries (source code), the source code is analyzed and built into container images using the RunWhen CI/CD pipeline. For public libraries (open source), the pipeline also uses LLMs to generate thousands of embeddings describing the situations where each Task is relevant, adding them to the RunWhen knowledge base. The result is both container images stored in the GCP Artifact Registry and the public and private Knowledge Bases that power the RunWhen backend system.

The list of container images used by the default CodeCollections is available on request.

Putting It Together: Runtime

At runtime, the system takes REST calls (RunWhen Web UI, Slack, etc.) with natural language and converts them into a vector search in a highly specialized, tenant-specific vector database. The result is a Task configuration built during Workspace Building time that can be executed on a Worker container image built during CodeCollection Building time. When the Local Agent executes this Task, metadata is sent back to the backend systems to see if further Tasks are required.
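
As a rough picture of the lookup step, the sketch below ranks Tasks by cosine similarity between a query vector and per-Task vectors. It is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, and `task_index` and `find_task` are hypothetical names. The actual RunWhen vector database and search method are not described in enough detail here to reproduce.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Each Task is indexed by text describing when it is relevant (hypothetical data).
task_index = {
    "check cartservice logs for errors": embed("cartservice is down errors logs crash"),
    "run postgres diagnostics": embed("postgres instance is slow database latency"),
}


def find_task(query):
    """Return the title of the Task whose vector is closest to the query."""
    query_vector = embed(query)
    return max(task_index, key=lambda title: cosine(query_vector, task_index[title]))


print(find_task("the cartservice is down"))
print(find_task("postgres instance 3as41 is slow"))
```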

Data Protection

Data Security Framework

Please refer to the following links for both a conceptual view and a detailed view of the RunWhen data security framework:

Conceptual view - https://www.runwhen.com/security-trust
Detailed view - Security / Compliance

Encryption

Encryption: How is the data encrypted during transmission between the agent and the SaaS platform? Is TLS 1.2 or higher used? Are strong cipher suites employed? Is there an option for end-to-end encryption?

All communications from the agent to the SaaS platform use HTTPS and are initiated by the agent or supporting services.
All communications from agent to platform use either golang or python standard libraries, leveraging the openssl libraries built into the underlying container image (based on Debian Bookworm; OpenSSL 3.0.2 as of the time of this writing).
All communications use either TLS 1.2 or 1.3 with strong ciphers.
mTLS certificates are retrieved from the RunWhen Backend upon first connection of the Local Agent’s Runner Service. They are rotated periodically, along with an encrypted token containing metadata used to reconnect after a disconnection or credential rotation.
All communication between Local Agent services within the customer's environment uses temporary certificates and encryption keys where necessary.
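
For readers who want to see what a "TLS 1.2 or higher, with a client certificate" policy looks like in code, here is a minimal sketch using Python's standard `ssl` module. It is not the Local Agent's code. The function name `make_client_context` and the certificate file paths are hypothetical.

```python
import ssl


def make_client_context(cert_file=None, key_file=None):
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables server certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        # Presenting a client certificate is what makes the connection mutual TLS.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


ctx = make_client_context()
print(ctx.minimum_version, ctx.verify_mode)
```

A context built this way can be passed to standard-library HTTPS clients, for example through the `context` argument of `urllib.request.urlopen`.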

Frequently Asked Questions

Data Protection / Connection

Authentication

Authentication: How is the agent authenticated to the SaaS platform? Are strong API keys, certificates, or other secure authentication mechanisms used? Are these credentials protected at rest on the agent host?

To connect the Local Agent to the SaaS Platform on initial installation, a user must log in to the SaaS platform and create a connection token, which expires after 1 hour by default. This token must be copied from the platform UI and pasted into either i) a Helm values file used to provision the Local Agent, or ii) the Local Agent UI. This ensures the initial token is never sent “over the wire” between RunWhen and the Local Agent and has a short expiration period.
Once an initial connection has been established, the RunWhen platform generates a certificate (via cert-manager) that is sent to the Local Agent for future authentication via mTLS. This certificate is stored in a Kubernetes Secret object that is local to the Local Agent. In addition to the certificate, an encrypted token containing metadata about the Local Agent is stored to reconnect to the RunWhen backend after the initial connection.
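
The short-lived connection token can be illustrated with a small sketch. The token format, `issue_token`, and `is_valid` are hypothetical. Only the 1 hour default comes from the description above.

```python
import secrets
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=1)  # default expiration described above


def issue_token(now, ttl=DEFAULT_TTL):
    """Create a random token value together with its expiry time."""
    return {"value": secrets.token_urlsafe(32), "expires_at": now + ttl}


def is_valid(token, now):
    """A token is accepted only strictly before its expiry time."""
    return now < token["expires_at"]


issued_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
token = issue_token(issued_at)
print(is_valid(token, issued_at + timedelta(minutes=59)))
print(is_valid(token, issued_at + timedelta(minutes=61)))
```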

Integrity

Integrity: How is the integrity of the data ensured during transmission? Are checksums or other methods used to prevent tampering?

The platform relies on the underlying openssl libraries and the golang/python implementations of https for all communications. TLS authenticates every record it carries, so data altered in transit is detected and rejected by default.

Data Minimization

Data Minimization: Does the agent only collect and transmit the necessary metrics? Can the scope of collected data be configured to minimize sensitive information?

The Local Agent collects only metrics related to Task execution, i.e. areas within the scope of the Local Agent itself.
It does send Task metadata and output to the SaaS platform. The Task output can be configured to be retained locally and/or sent to customer-specified cloud buckets (the latter is on the roadmap).

Data Transport

Transport Protocol: Is HTTPS used for communication, ensuring encrypted and authenticated communication?

HTTPS is used for all communication.

Data Protection

Data at Rest (on the Agent Host):

Credential Storage: How are the agent's credentials (API keys, tokens, etc.) stored on the on-premise server? Are they encrypted at rest using industry-standard methods? Avoid storing credentials in plain text.

As noted above, these credentials are stored in Kubernetes Secret objects that are local to the Local Agent. The Agent uses Kubernetes API mechanisms to fetch them, so it is abstracted away from the actual storage and internal transmission. It is up to the customer’s implementation of Secret object storage whether these are encrypted at rest.
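
The reason encryption at rest depends on the cluster is that the `data` values of a Kubernetes Secret are base64-encoded, which is an encoding and not encryption. The sketch below shows this with a hand-built manifest. The Secret name and the token value are hypothetical.

```python
import base64

# Hypothetical Secret manifest, as it would appear when read back from the API.
secret_manifest = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "runner-credentials"},  # hypothetical name
    "data": {"token": base64.b64encode(b"example-token").decode("ascii")},
}


def decode_secret_data(manifest):
    """Decode every value under 'data'; no key is needed because base64 is not encryption."""
    return {key: base64.b64decode(value) for key, value in manifest["data"].items()}


decoded = decode_secret_data(secret_manifest)
print(decoded)
```

Anyone who can read the stored object can recover the value, which is why protection at rest has to come from how the cluster stores its Secrets.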

Data Caching (on the Agent Host):

Data Caching: Does the agent cache any data locally before sending it to the SaaS platform? If so, how is this data protected? Is there an option to disable local caching? If caching is necessary, is the cached data encrypted?

The Task metadata and output are not cached. They exist in memory and in a tmp file in the Local Agent very briefly before being uploaded to the RunWhen backend and (in the case of output) local storage or customer cloud buckets. The tmp files are deleted either after a successful upload (immediate), on the execution of the next Task (seconds later) or on a restart of the container (in case of any longer duration failure).
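
The first two cleanup paths above (delete on successful upload, otherwise delete when the next Task runs) can be sketched as follows. `OutputStager` and its methods are hypothetical names, and the upload step is a plain callable standing in for the real transfer.

```python
import os
import tempfile


class OutputStager:
    """Stage Task output in a tmp file and remove it as soon as it is uploaded."""

    def __init__(self):
        self.pending = []  # tmp files left behind by failed uploads

    def run(self, output, upload):
        # Sweep any leftovers from earlier Tasks before staging new output.
        for stale_path in self.pending:
            if os.path.exists(stale_path):
                os.remove(stale_path)
        self.pending.clear()

        fd, path = tempfile.mkstemp(prefix="task-output-")
        with os.fdopen(fd, "w") as handle:
            handle.write(output)
        self.pending.append(path)

        upload(path)  # if this raises, the file stays until the next run
        os.remove(path)  # successful upload: delete immediately
        self.pending.remove(path)
        return path


uploaded = []
stager = OutputStager()
path = stager.run("warning events: none", uploaded.append)
print(os.path.exists(path))
```

The third path, a container restart, needs no code in this sketch because the container's tmp storage does not survive the restart.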

Agent Updates:

How are agent updates handled? Are they signed and verified to prevent malicious updates? Is the update process secure?

The Local Agent’s container image and related CodeCollection library images are stored on the Google Artifact Registry, which supports Cosign and Binary Authorization. The Local Agent delegates to the customer cluster's image management (either direct or via a proxy like Artifactory) when fetching the images.

Access Control:

Access Control: Who has access to the server where the agent is installed? Are appropriate access control measures in place to restrict access to authorized personnel only? Principle of least privilege should be followed.

The installation of the Local Agent is handled via Helm charts for Kubernetes clusters. Access to the cluster where it runs relies on the customer's existing processes in this area.

Secure Configuration:

Secure Configuration: Are there secure configuration options for the agent? Can unnecessary services or features be disabled? Is there a process for hardening the agent's environment?

All of the Agent’s configuration options are exposed via the Helm chart on install. Given the large enterprise nature of RunWhen’s install base, most defaults are aligned with security-sensitive environments. (The exceptions concern the use of non-transparent proxies and non-standard root certificate handling, found in a small number of environments.)