Business Continuity And Operational Resilience
RunWhen is committed to maintaining operational continuity and platform availability in the event of a disruption. As a provider of AI-driven Site Reliability Engineering tools for enterprise cloud environments, RunWhen recognizes the importance of maintaining resilient systems and support operations for our customers, including during unexpected incidents.
This document outlines our current framework, procedures, and practices relating to business continuity, disaster recovery, and operational resilience.
Business Continuity Framework
RunWhen maintains a Business Continuity and Operations Resilience Framework designed to:
Protect platform availability and customer access during incidents
Ensure continuity of key operations (engineering, infrastructure, support)
Support timely recovery from service interruptions or facility disruptions
Minimize potential customer impact due to internal or external events
The framework includes playbooks for incident triage, escalation, resource failover, and communication. It is reviewed by management at least annually.
Tactical Response & Resumption Plans
RunWhen maintains internal Business Continuity Playbooks that cover:
Crisis Management: Roles and responsibilities in the event of material system failure, data center outages, or major service incidents
Incident Management: Triage procedures, communication escalation paths, and on-call protocols
Disaster Recovery: Infrastructure-as-code recovery procedures, source control redundancy, and configuration backup practices
Business Resumption: Remote access workflows, distributed operations, and tooling continuity for critical staff
Please note that closely related Security Incident Procedures are publicly documented separately.
Critical services such as infrastructure automation, access control, logging, and alert routing are hosted across redundant cloud zones. Engineering and support teams are fully remote-enabled and geographically distributed.
Roles and responsibilities during a service disruption are defined in RunWhen’s Incident Response playbooks. These designate functional leads for engineering, customer success, and communications. In the event of a critical service interruption, responsibility for recovery coordination and stakeholder communication falls to Engineering Leadership and the CEO. Distributed infrastructure reduces reliance on any one physical data center location and supports rapid reassignment of workloads in the event of physical outages.
Assessing Plan Adequacy
RunWhen conducts internal reviews of business continuity preparedness on an annual basis. These reviews include:
Assessment of system recovery procedures
Review of incident resolution logs and postmortems
Identification of bottlenecks or gaps in coverage
Documentation of areas for improvement and follow-up actions
Reviews are led by engineering leadership in collaboration with infrastructure and security stakeholders.
Continuity Exercises & Lessons Learned
RunWhen performs daily “chaos testing” across all non-production environments, injecting unscheduled outages into all Kubernetes-based infrastructure twice per day. Requiring all development teams to work through these outages while new code is still in its early stages is the most effective exercise our team has found for maintaining strong reliability foundations.
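As a rough illustration of the twice-daily schedule described above, the following sketch picks two random outage start times for a given day and only allows non-production environments as targets. The environment names and the functions `is_eligible` and `pick_outage_times` are hypothetical and are not taken from RunWhen's actual tooling; the outage injection itself is out of scope here.

```python
import random
from datetime import date, datetime, time, timedelta
from typing import List, Optional

# Hypothetical environment names; production is deliberately absent.
NON_PRODUCTION = {"dev", "test", "staging"}


def is_eligible(environment: str) -> bool:
    """Return True only for environments where outages may be injected."""
    return environment in NON_PRODUCTION


def pick_outage_times(
    day: date, count: int = 2, rng: Optional[random.Random] = None
) -> List[datetime]:
    """Pick `count` distinct, unannounced outage start times within `day`."""
    rng = rng or random.Random()
    # Sample distinct minutes of the day so two outages never start together.
    minutes = rng.sample(range(24 * 60), count)
    midnight = datetime.combine(day, time.min)
    return sorted(midnight + timedelta(minutes=m) for m in minutes)


if __name__ == "__main__":
    for env in ("dev", "staging", "production"):
        if is_eligible(env):
            print(env, [t.isoformat() for t in pick_outage_times(date.today())])
        else:
            print(env, "skipped (not a non-production environment)")
```

Choosing the times at random, rather than on a fixed timetable, is what keeps the outages unscheduled from the point of view of the teams working through them.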
Findings from each exercise are shared with company leadership. Improvements and updates are tracked and prioritized based on exercise results and real-world incident learnings.
Contact & Documentation
RunWhen’s Security and Compliance Team maintains responsibility for business continuity readiness and documentation. Additional details, tabletop reports, or internal playbooks may be shared under NDA for enterprise customer reviews.
For more information, contact:
security-and-compliance@runwhen.com