About This Tutorial Series
This series walks you through investigating and resolving three different types of failures in the Sandbox's Online Boutique application. Each scenario covers a failure category that platform and application teams encounter every day — and teaches you how to use RunWhen to diagnose and fix it.
No Kubernetes expertise is required: you describe the problem, and the AI Assistant does the investigation.
Time required: 45-60 minutes for all three, or 15-20 minutes each.
The Application
The Online Boutique is Google's microservices demo — an 11-service e-commerce app where users browse products, add items to a cart, and check out. It's deployed across three namespaces simulating a standard promotion pipeline:
| Environment | Namespace | What's Broken |
|---|---|---|
| Development | | A code change broke a service; it's crashing on startup |
| Testing | | A configuration error has services pointing to the wrong endpoints |
| Production | | The Redis database backing the cart has hit its resource limits |
Each environment is managed via GitOps (Flux) from the demo-sandbox-online-boutique repository.
The Three Scenarios
Scenario 1: Crashing Code Deploy (Dev)
The problem: A developer merged a code change to the dev branch. Since the merge, one of the microservices crashes repeatedly on startup and the frontend returns errors.
What you'll investigate:
- Pods stuck in CrashLoopBackOff
- Container logs showing application errors
- Tracing the crash to a code-level bug
What you'll learn: This is the most straightforward failure category — the pod is clearly broken, and the logs tell you exactly what went wrong. You'll learn the core investigate → diagnose → fix workflow that the other scenarios build on.
Difficulty: Beginner | Time: 15-20 minutes
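
As a rough illustration of the pod-status check listed in this scenario, the sketch below scans pod status data for containers waiting in CrashLoopBackOff. It assumes the input has the shape of `kubectl get pods -o json` output. The function name `find_crashlooping` and the sample pod names are hypothetical; they are not taken from the Sandbox or from RunWhen's own diagnostic tasks.

```python
from typing import Any


def find_crashlooping(pod_list: dict[str, Any]) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers in CrashLoopBackOff."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A crash-looping container sits in the "waiting" state between restarts.
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return hits


# Hypothetical sample data: pod and container names are made up for illustration.
SAMPLE_CRASH_PODS = {
    "items": [
        {
            "metadata": {"name": "example-service-abc"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "server",
                        "restartCount": 7,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "frontend-xyz"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "server",
                        "restartCount": 0,
                        "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
                    }
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    for pod, container, restarts in find_crashlooping(SAMPLE_CRASH_PODS):
        print(f"{pod}/{container}: CrashLoopBackOff after {restarts} restarts")
```

A high restart count alongside this state is usually the cue to read the container's previous logs, which is where the application error itself appears.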
Scenario 2: Misconfigured Service (Test)
The problem: The test environment was promoted from a recent build. All pods show Running — nothing is crashing — but the checkout flow is broken. Users can browse products but get errors when completing a purchase.
What you'll investigate:
- Why pods can be "Running" while the app is still broken
- Service-to-service communication failures in logs
- Environment variables and ConfigMaps with wrong values
What you'll learn: Configuration issues are sneakier than code errors. Everything looks healthy on the surface. You'll learn to look beyond pod status into runtime behavior, and trace functional failures back to specific config values.
Difficulty: Intermediate | Time: 15-20 minutes
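
One way to picture the configuration check in this scenario is the sketch below, which looks for address-style environment variables that name a Service not present in the namespace. It assumes a Deployment manifest in the usual Kubernetes JSON shape and assumes that address variables end in `_ADDR`. The function name, the variable names, and the sample values are hypothetical and do not come from the Sandbox.

```python
from typing import Any


def suspicious_addresses(
    deployment: dict[str, Any], known_services: set[str]
) -> list[tuple[str, str, str]]:
    """Return (container, env_name, value) for *_ADDR values whose host is not a known Service."""
    findings = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            name = env["name"]
            value = env.get("value", "")  # entries using valueFrom have no literal value
            if not name.endswith("_ADDR") or not value:
                continue
            host = value.rsplit(":", 1)[0]      # drop a trailing ":port" if present
            short_name = host.split(".", 1)[0]  # "svc.ns.svc.cluster.local" -> "svc"
            if short_name not in known_services:
                findings.append((container["name"], name, value))
    return findings


# Hypothetical sample data: the misspelled host stands in for a wrong endpoint.
SAMPLE_DEPLOYMENT = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "server",
                        "env": [
                            {"name": "PAYMENT_SERVICE_ADDR", "value": "paymentservice-typo:50051"},
                            {"name": "CART_SERVICE_ADDR", "value": "cartservice:7070"},
                            {"name": "PORT", "value": "5050"},
                        ],
                    }
                ]
            }
        }
    }
}

if __name__ == "__main__":
    services = {"cartservice", "paymentservice"}
    for container, env_name, value in suspicious_addresses(SAMPLE_DEPLOYMENT, services):
        print(f"{container}: {env_name}={value} does not match a known Service")
```

This only catches a wrong host name. A correct host with a wrong port, or a value supplied through a ConfigMap reference, would need a separate check.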
Scenario 3: Database Connection Failure (Prod)
The problem: The production environment is partially degraded. The frontend loads, users can browse products, but adding items to cart fails. No code changes were deployed recently.
What you'll investigate:
- Partial application failures where some features work and others don't
- The dependency chain from cart service → Redis → infrastructure resources
- Resource pressure, OOMKill events, and readiness probe failures
What you'll learn: Infrastructure issues are the hardest to diagnose because the symptom is multiple layers removed from the root cause. The cart service logs errors, but the cart service code is fine — the problem is in the Redis instance it depends on. You'll learn to follow the dependency chain.
Difficulty: Intermediate | Time: 15-20 minutes
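
To make the OOMKill and readiness signals in this scenario concrete, the sketch below reports containers whose current or previous termination reason is `OOMKilled`, together with their readiness flag. It assumes input shaped like `kubectl get pods -o json` output. The function name and the sample pod are hypothetical and are not the Sandbox's actual Redis workload.

```python
from typing import Any


def find_oom_killed(pod_list: dict[str, Any]) -> list[tuple[str, str, bool]]:
    """Return (pod, container, ready) for containers terminated with reason OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # Check the current state and the previous one: after a restart,
            # the OOM kill is recorded under lastState.
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated") or {}
                if terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, cs["name"], bool(cs.get("ready", False))))
                    break
    return hits


# Hypothetical sample data: names and counts are made up for illustration.
SAMPLE_OOM_PODS = {
    "items": [
        {
            "metadata": {"name": "redis-example-0"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "redis",
                        "ready": False,
                        "restartCount": 4,
                        "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
                        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                    }
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    for pod, container, ready in find_oom_killed(SAMPLE_OOM_PODS):
        print(f"{pod}/{container}: OOMKilled, ready={ready}")
```

A hit on the dependency rather than on the cart service itself is the hint that the errors in the cart service logs are a downstream symptom.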
Comparing the Scenarios
| | Scenario 1: Code | Scenario 2: Config | Scenario 3: Infrastructure |
|---|---|---|---|
| Symptom | Pod crashing | Feature errors, pods running | Partial degradation |
| Pod status | CrashLoopBackOff | Running | Running (dependency Not Ready) |
| Error location | Startup logs | Runtime logs | Dependency logs |
| Root cause | Bug in code | Wrong config value | Resource limits |
| Fix | Rollback or fix code | Fix ConfigMap / env var | Increase resources |
| Detection difficulty | Easy | Moderate | Hard |
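
The comparison above can be read as a first-pass triage rule. The sketch below encodes only those three rows and nothing more; real incidents overlap (a bad config value can also crash a pod), so treat the result as a starting hypothesis. The `Observation` type and `likely_category` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    pod_crashlooping: bool      # the affected pod is in CrashLoopBackOff
    dependency_not_ready: bool  # a dependency of the failing feature is Not Ready
    feature_errors: bool        # users see errors while the pods report Running


def likely_category(obs: Observation) -> str:
    """Map observed symptoms to the failure category suggested by the comparison table."""
    if obs.pod_crashlooping:
        return "code"            # Scenario 1: look at startup logs
    if obs.dependency_not_ready:
        return "infrastructure"  # Scenario 3: look at dependency logs and resources
    if obs.feature_errors:
        return "config"          # Scenario 2: look at runtime logs and config values
    return "unknown"


if __name__ == "__main__":
    obs = Observation(pod_crashlooping=False, dependency_not_ready=True, feature_errors=True)
    print(likely_category(obs))
```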
Skills You'll Build
| Skill | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Navigate the workspace map | ✅ | ✅ | ✅ |
| Ask AI Assistants in natural language | ✅ | ✅ | ✅ |
| Run and interpret diagnostic tasks | ✅ | ✅ | ✅ |
| Read pod logs to find errors | ✅ | | |
| Inspect configurations and environment variables | | ✅ | |
| Trace service dependency chains | | ✅ | ✅ |
| Investigate infrastructure-layer issues | | | ✅ |
| Verify fixes end-to-end | | | ✅ |
Where to Start
- New to the platform? Start with Scenario 1 (Dev). It teaches the core workflow.
- Comfortable with basics? Jump to Scenario 2 (Test) for a more nuanced investigation.
- Want a challenge? Try Scenario 3 (Prod) for a multi-layer infrastructure problem.
- Short on time? Try the quick-start prompts on the Sandbox Tutorials page; no tutorial needed.