Microservices Troubleshooting

About This Tutorial Series

This series walks you through investigating and resolving three different types of failures in the Sandbox’s Online Boutique application. Each scenario covers a failure category that platform and application teams encounter every day — and teaches you how to use RunWhen’s Workspace Chat and AI Assistants to diagnose and fix it.

No Kubernetes expertise required. You describe the problem; the AI Assistant does the investigation.

Time required: 45-60 minutes for all three, or 15-20 minutes each.


The Application

The Online Boutique is Google’s microservices demo — an 11-service e-commerce app where users browse products, add items to a cart, and check out. It’s deployed across three namespaces simulating a standard promotion pipeline:

| Environment | Namespace | What's Broken |
| --- | --- | --- |
| Development | online-boutique-dev | A code change introduced a nil pointer bug — the checkout service crashes on startup |
| Testing | online-boutique-test | A misconfigured service address causes checkout to fail — pods are running but the payment endpoint is wrong |
| Production | online-boutique-prod | The checkout service hits a database connection pool limit under load, causing order failures |

Each environment is managed via GitOps (Flux) from the demo-sandbox-online-boutique repository.


The Three Scenarios

Scenario 1: Crashing Code Deploy (Dev)

The problem: A developer merged a code change to the dev branch. Since the merge, the checkout service crashes repeatedly on startup and the application returns errors when users try to place orders.

What you’ll investigate:

  • Pods stuck in CrashLoopBackOff
  • Container logs showing a nil pointer dereference panic
  • Tracing the crash to a code-level bug

What you’ll learn: This is the most straightforward failure category — the pod is clearly broken, and the logs tell you exactly what went wrong. You’ll learn the core investigate -> diagnose -> fix workflow that the other scenarios build on.
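The failure class in this scenario can be sketched in a few lines of Go. This is an illustrative reproduction, not the Boutique's actual checkout code — the types and function names are assumptions. In a real pod the panic goes unrecovered, the process exits, and Kubernetes restarts it into CrashLoopBackOff; here the panic is recovered so the message can be printed:

```go
package main

import "fmt"

// PaymentInfo and Order are illustrative types, not the Boutique's
// actual checkout code.
type PaymentInfo struct{ Token string }

type Order struct {
	Payment *PaymentInfo // never initialized, so it stays nil
}

// chargeCard recovers from the panic so the message can be shown; in a
// real service the unrecovered panic is what lands in the container
// logs and drives the pod into CrashLoopBackOff.
func chargeCard(o *Order) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	_ = o.Payment.Token // the bug: dereferencing a nil pointer
	return nil
}

func main() {
	// Prints: panic: runtime error: invalid memory address or nil pointer dereference
	fmt.Println(chargeCard(&Order{}))
}
```

The "invalid memory address or nil pointer dereference" line is the signature to look for in the pod logs — it points directly at a code-level bug rather than a configuration or infrastructure problem.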

Difficulty: Beginner | Time: 15-20 minutes

See: Scenario 1: Crashing Code Deploy (Dev)


Scenario 2: Misconfigured Service (Test)

The problem: The test environment was promoted from a recent build. All pods show Running status — nothing is crashing — but the checkout flow is broken. Users can browse products and add items to the cart, but completing a purchase fails because the checkout service is configured with an incorrect payment service address.

What you’ll investigate:

  • Why pods can be “Running” but the application is not working correctly
  • Service-to-service communication failures in logs
  • Environment variables with incorrect values
  • The difference between “running” and “healthy”

What you’ll learn: Configuration issues are sneakier than code errors. Everything looks healthy on the surface — pods are Running, deployments show available replicas. You’ll learn to look beyond pod status into application logs and environment variable configuration to find the real issue.

Difficulty: Intermediate | Time: 15-20 minutes

See: Scenario 2: Misconfigured Service (Test)


Scenario 3: Database Connection Failure (Prod)

The problem: The production environment is partially degraded. The frontend loads and users can browse products, but placing an order fails intermittently. The checkout service logs a fatal database connection pool error and panics — but the checkout code itself is fine.

What you’ll investigate:

  • Partial application failures where some features work and others don’t
  • Tracing a checkout failure back through the service to a database connection pool limit
  • The difference between an application bug and infrastructure-layer resource exhaustion
  • Why the checkout service is in CrashLoopBackOff even though the code is correct

What you’ll learn: Infrastructure issues are the hardest to diagnose because the symptom is removed from the root cause. The checkout service panics, but the checkout code is fine — the problem is database connection pool exhaustion under load. You’ll learn to follow the dependency chain from user-facing symptom to infrastructure root cause.

Difficulty: Intermediate | Time: 15-20 minutes

See: Scenario 3: Database Connection Failure (Prod)


Comparing the Scenarios

| | Scenario 1: Code | Scenario 2: Config | Scenario 3: Infrastructure |
| --- | --- | --- | --- |
| Symptom | Pod crashing | Feature errors, pods running | Intermittent checkout failures |
| Pod status | CrashLoopBackOff | Running | CrashLoopBackOff (under load) |
| Error location | Startup logs | Runtime logs (DNS failure) | Runtime logs (connection pool error) |
| Root cause | Nil pointer bug in code | Wrong service address in env var | Database connection pool exhaustion |
| Fix | Roll back or fix the code | Correct the environment variable | Increase pool size or scale resources |
| Detection difficulty | Easy | Moderate | Hard |

Skills You’ll Build

| Skill | Scenario 1 | Scenario 2 | Scenario 3 |
| --- | --- | --- | --- |
| Ask AI Assistants in natural language | Yes | Yes | Yes |
| Run and interpret diagnostic tasks | Yes | Yes | Yes |
| Review Issues surfaced by the platform | Yes | Yes | Yes |
| Read pod logs to find errors | Yes | | |
| Inspect environment variables and configs | | Yes | |
| Trace service dependency chains | | | Yes |
| Investigate infrastructure-layer issues | | | Yes |

Where to Start

  • New to the platform? Start with Scenario 1 (Dev). It teaches the core workflow.
  • Comfortable with basics? Jump to Scenario 2 (Test) for a more nuanced investigation.
  • Want a challenge? Try Scenario 3 (Prod) for a multi-layer infrastructure problem.
  • Short on time? Try the quick-start prompts on the Sandbox Tutorials page — no tutorial needed.