Overview
The Online Boutique production environment is partially degraded. The frontend loads and users can browse products, but placing an order fails. The checkout service is logging a fatal database error. This scenario requires tracing the user-facing symptom back through the checkout service to an exhausted database connection pool.
What you'll learn:
- Investigate partial application failures where some features work and others don't
- Trace errors from a service back to its database dependency
- Understand database connection pool exhaustion and what causes it
- Differentiate between application errors and infrastructure-layer failures
- Work through a multi-layer investigation
Difficulty: Intermediate
Time required: 15-20 minutes
The Problem
What's happening: User browses products (Frontend ✅) → User places an order → Frontend calls Checkout Service (⚠️) → Checkout Service tries to connect to the database → Database connection pool is full (100/100) → Checkout Service returns FATAL error → Frontend shows order failure to user.
What you know:
- The online-boutique-prod environment was stable until recently
- Users can browse products and add items to cart
- Placing an order fails with an error
- No code changes were deployed recently
What you need to find out:
- Which service is actually failing?
- Is it the checkout service itself, or something it depends on?
- What's causing the failure at the infrastructure level?
Step 1: Ask About the Prod Environment in Workspace Chat
- Open the Sandbox workspace and go to Workspace Chat
- Ask Eager Edgar about the prod environment — for example: "Check the health of online-boutique-prod" or "Are there any issues in the production namespace?"
- Eager Edgar will check the environment and surface issues across pods, deployments, and events
This scenario is more complex because the checkout service pod is running — it's not crashing. The failure only occurs when it tries to process an order and reaches for the database.
Step 2: Describe the Symptom to Eager Edgar
Start by describing what users are experiencing. Don't try to guess the root cause — let the AI Assistant investigate.
Sample Prompts
Prompt Examples:
- "Users can't complete checkout in online-boutique-prod"
- "Orders are failing in the production namespace"
- "online-boutique-prod is partially down — browsing works but checkout doesn't"
- "Check the health of online-boutique-prod"
What Eager Edgar Does
The interaction flow:
- You say: "Checkout is failing in online-boutique-prod"
- Edgar surfaces existing insights and suggests targeted diagnostics
- Tasks check pods, logs, database connectivity, events, and resource usage
- Edgar presents a multi-layer analysis
Eager Edgar will suggest tasks such as:
- Check Pod Status — Review all pods in the namespace
- Inspect Checkout Service Logs — Look for error details
- Check Database Connectivity — Verify the data layer is reachable
- Check Service Endpoints — Verify service discovery
- Check Resource Usage — Review CPU, memory, and connection metrics
- Check Recent Events — Look for cluster-level warnings
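If you'd like to run the equivalent checks by hand alongside Edgar, a few kubectl commands cover roughly the same ground. This is a sketch using the resource names from this scenario; kubectl top also assumes metrics-server is installed in the cluster.

```bash
# Pod status across the namespace
kubectl get pods -n online-boutique-prod

# Checkout service logs (error details)
kubectl logs deployment/checkoutservice -n online-boutique-prod --tail=100

# Service endpoints (service discovery)
kubectl get endpoints -n online-boutique-prod

# Resource usage (requires metrics-server)
kubectl top pods -n online-boutique-prod

# Recent cluster events, newest last
kubectl get events -n online-boutique-prod --sort-by=.lastTimestamp
```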
Step 3: Review the Findings
Eager Edgar runs the tasks and presents structured results. This investigation reveals a clear trail.
Expected Findings
Finding 1: Checkout Service Logs Show a Fatal Database Error
🔴 checkoutservice logs:
FATAL: Database connection pool exhausted - unable to process order...
Connection timeout after 30s. Active connections: 100/100.
This is the key finding. The checkout service is running, but every time it tries to process an order it cannot acquire a database connection — the pool is completely saturated.
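To isolate this error from the rest of the log output yourself, a targeted grep works; this assumes the Deployment is named checkoutservice as shown in the findings.

```bash
# Pull only the connection-pool errors from the last 30 minutes
kubectl logs deployment/checkoutservice -n online-boutique-prod --since=30m | grep -i "connection pool"
```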
Finding 2: Checkout Service Pod Is Running
✅ Pod 'checkoutservice-6b8d5c9f7-q4m2x' status: Running
Ready: 1/1
Restarts: 0
The pod itself is healthy. This isn't a crash or a code bug — the service starts and runs fine, but fails under load when it needs the database.
Finding 3: Database Pod Under Pressure
⚠️ Database pod resource status:
Active connections: 100 (at configured maximum)
Waiting queries: 47
Connection wait time: averaging 28s
CPU usage: 92%
Memory usage: high
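If the database is PostgreSQL (as the fixes later in this tutorial assume), you can confirm these numbers directly from pg_stat_activity. A sketch, assuming the database pod is named checkout-db-0 and a postgres superuser is available; adjust both to your environment.

```bash
# Count connections by state and compare against the configured ceiling
kubectl exec -n online-boutique-prod checkout-db-0 -- \
  psql -U postgres -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# The configured maximum the pool is hitting
kubectl exec -n online-boutique-prod checkout-db-0 -- \
  psql -U postgres -c "SHOW max_connections;"
```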
Finding 4: Warning Events
🟡 Events in online-boutique-prod:
- checkoutservice: "Connection pool wait timeout exceeded"
- Repeated checkout failures logged over the past 30 minutes
Step 4: Trace the Dependency Chain
Investigation layers (top to bottom):
- Layer 1 — User-Facing Symptom: Placing an order fails
- Layer 2 — Service Layer: Checkout service logs FATAL: Database connection pool exhausted
- Layer 3 — Data Layer: Database has 100/100 active connections, queries queuing
- Layer 4 — Infrastructure: Connection pool limit is too low for current traffic, or connections are leaking
Each layer points to the one below it. The user sees "order failed," but the root cause is a database connection pool that can't keep up.
Following the Trail
This is the key lesson of this scenario — the user-facing symptom (checkout fails) is two layers removed from the root cause (database connection pool exhaustion).
| Layer | What You See | What It Means |
|---|---|---|
| User Experience | "Place Order" fails | Something downstream is broken |
| Checkout Service | FATAL: Database connection pool exhausted | The service code is fine — it can't get a database connection |
| Database | 100/100 connections active, queries waiting | The database is overwhelmed or connections aren't being released |
| Infrastructure | Pool limit too low, possible connection leak, or under-provisioned database | The configuration or resources don't match the current load |
Root Cause
The database backing the checkout service has hit its connection pool limit. All 100 connections are in use, and new requests wait until they time out after 30 seconds. This could be caused by:
- Connection pool limit too low: Traffic has grown beyond what the configured pool size can handle
- Connection leak: The application is opening connections but not closing them properly, gradually exhausting the pool (see the query sketch below)
- Slow queries: Long-running queries hold connections open, preventing others from being served
- Under-provisioned database: The database itself doesn't have enough CPU or memory to process queries fast enough, causing connections to back up
The fix needs to address the database layer, not the checkout service code.
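To tell these causes apart before picking a fix, pg_stat_activity is again useful. A sketch under the same assumptions as earlier (PostgreSQL, pod checkout-db-0, postgres user): many sessions stuck in "idle in transaction" point to a leak, while a handful of very long-running active queries point to slow queries.

```bash
# Long-running statements: slow queries hold connections open
kubectl exec -n online-boutique-prod checkout-db-0 -- psql -U postgres -c \
  "SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
   FROM pg_stat_activity WHERE state <> 'idle'
   ORDER BY runtime DESC NULLS LAST LIMIT 10;"

# Sessions grabbed but never released: a symptom of a connection leak
kubectl exec -n online-boutique-prod checkout-db-0 -- psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';"
```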
Step 5: Ask for Remediation Guidance
Sample Follow-Up Prompts
Prompt Examples:
- "How do I fix the database connection pool issue in online-boutique-prod?"
- "The checkout service has exhausted its database connections — what should I do?"
- "Increase the database connection pool limit for checkoutservice"
- "What's the recommended fix for connection pool exhaustion?"
The Fix
Depending on the exact root cause, Eager Edgar may suggest one or more of these approaches:
Fix A: Increase the Connection Pool Size
If traffic has grown and the pool limit is simply too low, increase it. This is typically configured via an environment variable or ConfigMap on the checkout service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkoutservice
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: server
          env:
            - name: DB_MAX_CONNECTIONS
              value: "200"   # Increased from 100
            - name: DB_POOL_TIMEOUT
              value: "60s"   # Increased from 30s
Fix B: Scale Up the Database
If the database itself is under-provisioned (high CPU, slow queries), increase its resources:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: checkout-db
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: postgres
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"    # Increased from 512Mi
              cpu: "1000m"     # Increased from 250m
Fix C: Add a Connection Pooler (PgBouncer)
For a more robust solution, deploy a connection pooler in front of the database to manage connections efficiently:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: pgbouncer
          image: pgbouncer/pgbouncer:latest
          env:
            - name: MAX_CLIENT_CONN
              value: "500"
            - name: DEFAULT_POOL_SIZE
              value: "50"
            - name: POOL_MODE
              value: "transaction"
How to apply:
kubectl apply -f <fix>.yaml -n online-boutique-prod
After applying, restart the checkout service to pick up the new configuration:
kubectl rollout restart deployment/checkoutservice -n online-boutique-prod
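You can watch the rollout finish before moving on to verification:

```bash
# Blocks until the new pods are ready or the timeout expires
kubectl rollout status deployment/checkoutservice -n online-boutique-prod --timeout=120s
```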
Step 6: Verify the Fix
Verification Steps
This scenario requires verifying at every layer of the dependency chain — not just the component you fixed.
Verification chain: Is the database healthy? → Are connections below the limit? → Is checkout processing orders? → End-to-End OK ✅
Ask Eager Edgar to Verify
Prompt Examples:
- "Is the database connection pool healthy in online-boutique-prod?"
- "Can checkoutservice process orders now?"
- "Check the full health of online-boutique-prod"
- "Are there still connection timeout errors?"
Success Criteria
✅ Database active connections well below maximum
✅ No connection timeout errors in checkout service logs
✅ Checkout service processing orders successfully
✅ No queued/waiting database queries
✅ Order flow completes from frontend
✅ No warning or error events in the namespace
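To spot-check these criteria by hand, the following sketch uses the same assumed resource names as earlier (checkoutservice Deployment, checkout-db-0 database pod, postgres user).

```bash
# No new FATAL or timeout errors in the last few minutes
kubectl logs deployment/checkoutservice -n online-boutique-prod --since=10m \
  | grep -iE "fatal|timeout" || echo "no errors found"

# Active connection count should now sit well below the configured maximum
kubectl exec -n online-boutique-prod checkout-db-0 -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# No fresh warning or error events in the namespace
kubectl get events -n online-boutique-prod --sort-by=.lastTimestamp | tail -n 20
```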
What You Learned
Key Takeaways
- Follow the dependency chain. User-facing symptoms are often multiple layers removed from the root cause. Checkout failures led to the database, the database led to connection pool limits.
- "Partially broken" often means a dependency is saturated. When some features work (browsing) and others don't (checkout), the broken features likely share a dependency that's under pressure.
- Running pods can still fail under load. The checkout service wasn't crashing — it was healthy until it needed a database connection and couldn't get one. Load-dependent failures are harder to catch than crashes.
- Connection pool exhaustion is one of the most common production issues. It looks like an application error (FATAL: unable to process order) but the fix is at the infrastructure or configuration level — pool sizing, database resources, or adding a connection pooler.
- AI Assistants investigate breadth-first. Eager Edgar checked pods, logs, database connectivity, events, and resources in parallel — the same investigation would take a human much longer moving layer by layer.
Troubleshooting Pattern: Infrastructure / Data Layer Issues
The pattern: Feature broken but some services OK → Check logs of affected service → Find database/dependency error → Investigate dependency health and resource usage → Capacity or configuration issue? → If yes: Increase pool size / scale resources. If no: Check for connection leaks / slow queries.
Comparing All Three Scenarios
| | Scenario 1: Code | Scenario 2: Config | Scenario 3: Infrastructure |
|---|---|---|---|
| Symptom | Pod crashing | Feature errors, pod running | Checkout fails under load |
| Pod status | CrashLoopBackOff | Running | Running |
| Error location | Startup logs | Runtime logs | Runtime logs (database connection error) |
| Root cause | Bug in code | Wrong config value | Database connection pool exhausted |
| Fix location | App source code | ConfigMap / env var | Pool config / database resources |
| Investigation depth | Single pod | Service + config | Multi-layer trace |
| Detection difficulty | Easy | Moderate | Hard |
Tutorial Complete
Congratulations! You've completed all three Online Boutique troubleshooting scenarios. You now know how to:
- ✅ Use workspace chat to investigate issues
- ✅ Describe problems in natural language to AI Assistants
- ✅ Run diagnostic tasks and interpret structured results
- ✅ Identify root causes across code, configuration, and infrastructure failures
- ✅ Implement fixes and verify at every layer
- ✅ Recognize common failure patterns in Kubernetes applications
What's Next?
Apply these skills to your own environment:
- Set up RunWhen for your cluster (see the Quick Start Guide)
- Explore the task library with 3,400+ automated diagnostics
- Learn about Engineering Assistants and how to customize them
Deepen your knowledge:
- Review How RunWhen Works — Understand production insights and background tasks
- Try the Kubernetes Resource Tuning Tutorial — Another hands-on scenario
- Join the community Slack to discuss what you've learned