Terms and Concepts

Industry Terms & Concepts

SRE

Site Reliability Engineering (SRE) principles and practices are built into the RunWhen Platform. For example, the platform lets teams and organizations share Service Level Indicators, Service Level Objectives, and automation code (TaskSets) in a common language.

GitOps

GitOps practices power not only RunWhen's infrastructure but also many components within the platform itself. Many of the features within the RunWhen Platform are versioned using Git, and patterns around commits and merging are present throughout most workflows.

RobotFramework

RobotFramework is an industry-leading open source automation framework used for creating powerful and flexible automation solutions. Rather than raising barriers to entry by creating a domain-specific language, RunWhen leverages the extensibility and interoperability provided by RobotFramework to achieve high-quality, consistent code within our CodeBundles.

RunWhen Terms and Concepts

RunWhen Platform

RunWhen Platform refers to the commercial SaaS offering that provides Digital Assistants that are capable of running Tasks autonomously, interpreting the output, and identifying Issues that require human attention.

RunWhen Local

RunWhen Local refers to the open source tool that discovers Cloud and Kubernetes resources, matches them with open source troubleshooting tasks (see CodeBundles), and provides a copy-and-pasteable Troubleshooting Cheat Sheet. RunWhen Local is also used as an onboarding tool into RunWhen Platform.

RunWhen Authors

RunWhen Authors are software engineers who write and maintain technology-specific troubleshooting commands in their own CodeCollections.

Workspace

A Workspace is a concept in the RunWhen Platform that represents a tenant. Each Workspace consists of a Map, SLXs, and Workspace configurations (such as Workflows, Secrets, RBAC, and so on).

ServiceLevelX (SLX)

An SLX is an encapsulation of a Service Level Indicator (SLI), a Service Level Objective (SLO), and a TaskSet within the RunWhen Platform. It is also sometimes referred to as "a point on the map".

Service Level Indicator (SLI)

A Service Level Indicator (SLI) provides a way to measure reliability levels of services within your system. Within RunWhen, an SLI is a CodeBundle that results in a metric being pushed into the RunWhen Platform.

Service Level Objective (SLO)

A Service Level Objective (SLO) defines a target "healthy state" for a service. For example, an SLO for a website might be that it responds with an HTTP 200 response code in less than 2 seconds, 99.9% of the time.

In the RunWhen Platform, an SLO queries the SLI for this value, performs the SLO measurement, and generates alerts when the SLO is not being achieved. The RunWhen Platform's internal data model is the OpenSLO Spec, combined with a set of alerting rules that follow Google's multi-window, multi-burn-rate approach.
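As a rough illustration of the multi-window, multi-burn-rate idea, the sketch below computes a burn rate for the 99.9% example above and alerts only when both a long and a short window exceed a threshold. The function names are hypothetical, and the threshold of 14.4 is the paging value suggested in Google's SRE Workbook for a 1-hour window against a 30-day budget; none of this is taken from RunWhen's actual alerting rules.

```python
SLO_TARGET = 0.999  # 99.9% target from the website example above


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """Return how fast the error budget is being spent.

    A burn rate of 1.0 means errors arrive at exactly the rate the SLO allows.
    """
    error_budget = 1 - slo_target
    return error_ratio / error_budget


def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    threshold: float = 14.4,  # assumption: SRE Workbook value, not a RunWhen setting
    slo_target: float = SLO_TARGET,
) -> bool:
    """Alert only when BOTH windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )
```

With these assumptions, a 2% error ratio in both windows triggers an alert, while a 2% error ratio in the long window that has already dropped to 0.05% in the short window does not.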

Alert

Alerts are generated when an SLO is not being met. Alerts can automatically trigger a RunSession.

CodeBundle

CodeBundles are the engine that powers the integrations between your systems and the RunWhen Platform. A CodeBundle is written in RobotFramework and contains Tasks, each Task representing some form of health check (SLI) or troubleshooting task. For example:

SLI Task: Count the number of HTTP 500s seen by the load balancer.
Troubleshooting Task: Fetch the logs from the load balancer, filter by error code and endpoint, and add all entries with HTTP 500 to a report.
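The difference between the two Task types can be sketched in plain Python. This is not a CodeBundle (those are written in RobotFramework) and it does not use any RunWhen library; the log format, sample data, and function names are assumptions made up for illustration. The SLI Task reduces the logs to a single number, while the troubleshooting Task produces a human-readable report.

```python
from collections import Counter

# Hypothetical, already-parsed load balancer log entries.
SAMPLE_LOG = [
    {"status": 200, "endpoint": "/cart"},
    {"status": 500, "endpoint": "/cart"},
    {"status": 500, "endpoint": "/checkout"},
    {"status": 404, "endpoint": "/missing"},
]


def count_http_500s(entries: list) -> int:
    """SLI-style Task: return one metric value."""
    return sum(1 for entry in entries if entry["status"] == 500)


def report_http_500s(entries: list) -> str:
    """Troubleshooting-style Task: filter by error code and endpoint, build a report."""
    by_endpoint = Counter(
        entry["endpoint"] for entry in entries if entry["status"] == 500
    )
    lines = [
        f"{endpoint}: {count} HTTP 500 response(s)"
        for endpoint, count in sorted(by_endpoint.items())
    ]
    return "\n".join(lines)


print(count_http_500s(SAMPLE_LOG))
print(report_http_500s(SAMPLE_LOG))
```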

RunWhen maintains public and open source CodeCollections that contain CodeBundles for many industry-standard tools (such as Kubernetes, GCP, Azure, AWS, and so on).

CodeCollection

A CodeCollection is a Git repository that contains CodeBundles.

RunWhen maintains the following public and open source CodeCollections:

Task

A Task is an individual step in a CodeBundle that executes some code and provides a result. Tasks can be used to generate a metric for a Service Level Indicator (SLI) or as part of a TaskSet to perform troubleshooting, reporting, or remediation tasks.

TaskSet

A TaskSet is the collection of Tasks within a CodeBundle that help users or Digital Assistants troubleshoot a specific technical component. A TaskSet is attached to an SLX, and each Task can be executed from the RunWhen Platform Map.

An example of a TaskSet could be:

CodeCollection Name: rw-cli-codecollection
CodeBundle Name: FluxCD Helm Health
TaskSet Tasks:
- List all available FluxCD Helmreleases in Namespace
- Fetch Installed FluxCD Helmrelease Versions in Namespace
- Fetch Mismatched FluxCD HelmRelease Version in Namespace
- Fetch FluxCD HelmRelease Error Messages in Namespace
- Check for Available Helm Chart Updates in Namespace

RunSession

A RunSession (also known as a "Troubleshooting Session") is a term used in the RunWhen Platform that encapsulates three things: a trigger that starts the session (such as a user query or an SLO Alert), a series of Tasks to execute and their results (known as a RunRequest), and the Report that contains all details about the RunSession.

RunRequest

A RunRequest belongs to a RunSession and is created each time a group of Tasks is selected to be executed. The RunRequest ID contains the information related to which Tasks were executed, when they started and stopped, and the result of their execution.

Report

A Report is the detailed output of all RunRequest data that belongs to a specific RunSession.
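The relationship between a RunSession, its RunRequests, and the Report can be pictured with a small data model. The sketch below is only a mental model: every class and field name is hypothetical and does not reflect the RunWhen Platform's real schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TaskResult:
    task_name: str
    output: str


@dataclass
class RunRequest:
    """One group of Tasks selected for execution within a RunSession."""
    request_id: str
    started_at: datetime
    stopped_at: datetime
    results: list = field(default_factory=list)  # list of TaskResult


@dataclass
class RunSession:
    """A trigger plus every RunRequest created during the session."""
    trigger: str  # for example "user query" or "SLO alert"
    run_requests: list = field(default_factory=list)  # list of RunRequest

    def report(self) -> str:
        """Flatten all RunRequest data into one plain-text Report."""
        lines = [f"Trigger: {self.trigger}"]
        for request in self.run_requests:
            lines.append(
                f"RunRequest {request.request_id} "
                f"({request.started_at:%H:%M:%S} to {request.stopped_at:%H:%M:%S})"
            )
            for result in request.results:
                lines.append(f"  {result.task_name}: {result.output}")
        return "\n".join(lines)
```

In this model, a session that runs two groups of Tasks holds two RunRequest entries, and the Report is derived from all of them rather than stored separately.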

Issue

Created by CodeBundle Authors, Issues are raised from specific Task output and signal to the RunWhen Platform that something is wrong and requires further investigation. Issues most often include:

Severity: 1-4 (1 = Critical / Major, 4 = Informational)
Title: A text string to display on the RunWhen Platform Map that describes the Issue.
Details: A string and command output that identifies why the Issue was raised.
Reproduce Hint: A string or copy of the command that generated the output which raised the Issue.
Next Steps: The suggested next step to perform when this Issue is raised.
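The fields listed above map naturally onto a small record type. The following sketch shows one way to model an Issue in Python; the class and field names are assumptions chosen to mirror the list, not the data structure or keyword that RunWhen CodeBundles actually use.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    severity: int        # 1 = Critical / Major ... 4 = Informational
    title: str           # short text shown on the Map
    details: str         # why the Issue was raised, including command output
    reproduce_hint: str  # command (or description) that produced the output
    next_steps: str      # suggested next step when this Issue is raised

    def __post_init__(self) -> None:
        # Reject severities outside the documented 1-4 range.
        if not 1 <= self.severity <= 4:
            raise ValueError(f"severity must be between 1 and 4, got {self.severity}")


# Hypothetical example values, for illustration only.
issue = Issue(
    severity=2,
    title="Load balancer is returning HTTP 500 responses",
    details="2 requests returned HTTP 500 in the sampled log entries",
    reproduce_hint="Re-run the Task that fetches and filters the load balancer logs",
    next_steps="Inspect the logs of the backend service behind the failing endpoints",
)
```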

Next Steps

Next Steps are created by CodeBundle Authors to indicate to the RunWhen Platform User, or to Digital Assistants, what steps should be taken next to continue diagnosing the Issue that was raised. Next Steps will often look like another Task, signalling to the Digital Assistants to search for helpful Tasks to run autonomously, or like specific instructions to be escalated to the service owner.

Escalation

An Escalation is an activity that can be performed on an Issue in the RunWhen Platform. Clicking the Escalation button on an Issue leaves a comment for the service owner so that they can review the Issue Details and address the Issue accordingly.

Workflow

A Workflow in the RunWhen Platform is an automated set of actions that can take place when triggered. For example, if an SLO alert fires, a RunSession can be automatically started with a group of Tasks, and a Slack message can be sent with the URL of the RunSession.
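Conceptually, a Workflow is a mapping from a trigger to an ordered list of actions. The sketch below illustrates that idea for the SLO alert example; the event shape, the action functions, and the `WORKFLOWS` table are all hypothetical, and the actions only return a description of what they would do instead of calling any real RunWhen or Slack API.

```python
def start_run_session(event: dict) -> str:
    # Placeholder: a real action would start a RunSession with a group of Tasks.
    return f"started RunSession for {event['slx']}"


def send_slack_message(event: dict) -> str:
    # Placeholder: a real action would post the RunSession URL to Slack.
    return f"sent Slack message with RunSession URL for {event['slx']}"


# Hypothetical trigger-to-actions table: one Workflow per event type.
WORKFLOWS = {
    "slo_alert": [start_run_session, send_slack_message],
}


def handle_event(event: dict) -> list:
    """Run every action registered for the event's type, in order."""
    actions = WORKFLOWS.get(event["type"], [])
    return [action(event) for action in actions]


print(handle_event({"type": "slo_alert", "slx": "cart-service"}))
```

An event type with no registered Workflow simply produces no actions.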

Digital Assistant

A Digital Assistant (such as Eager Edgar, Cautious Cathy, or Admin Abby) is capable of suggesting Tasks from a user query (such as "Help Troubleshoot the Cart Service"), automatically executing all suggested tasks, and autonomously interpreting any Issues and running additional Tasks that best match the Next Steps.