
Scenario 3: Database Connection Failure (Prod)

Overview

The Online Boutique production environment is partially degraded. The frontend loads and users can browse products, but placing an order fails. The checkout service is logging a fatal database error. This scenario requires tracing the user-facing symptom back through the checkout service to an exhausted database connection pool.

What you'll learn:

  • Investigate partial application failures where some features work and others don't

  • Trace errors from a service back to its database dependency

  • Understand database connection pool exhaustion and what causes it

  • Differentiate between application errors and infrastructure-layer failures

  • Work through a multi-layer investigation

Difficulty: Intermediate
Time required: 15-20 minutes


The Problem

What's happening: User browses products (Frontend ✅) → User places an order → Frontend calls Checkout Service (⚠️) → Checkout Service tries to connect to the database → Database connection pool is full (100/100) → Checkout Service returns FATAL error → Frontend shows order failure to user.

What you know:

  • The online-boutique-prod environment was stable until recently

  • Users can browse products and add items to cart

  • Placing an order fails with an error

  • No code changes were deployed recently

What you need to find out:

  • Which service is actually failing?

  • Is it the checkout service itself, or something it depends on?

  • What's causing the failure at the infrastructure level?


Step 1: Ask About the Prod Environment in Workspace Chat

  1. Open the Sandbox workspace and go to Workspace Chat

  2. Ask Eager Edgar about the prod environment — for example: "Check the health of online-boutique-prod" or "Are there any issues in the production namespace?"

  3. Eager Edgar will check the environment and surface issues across pods, deployments, and events

This scenario is more complex because the checkout service pod is running — it's not crashing. The failure only occurs when it tries to process an order and reaches for the database.


Step 2: Describe the Symptom to Eager Edgar

Start by describing what users are experiencing. Don't try to guess the root cause — let the AI Assistant investigate.

Sample Prompts

Prompt Examples:

  • "Users can't complete checkout in online-boutique-prod"

  • "Orders are failing in the production namespace"

  • "online-boutique-prod is partially down — browsing works but checkout doesn't"

  • "Check the health of online-boutique-prod"

What Eager Edgar Does

The interaction flow:

  1. You say: "Checkout is failing in online-boutique-prod"

  2. Edgar surfaces existing insights and suggests targeted diagnostics

  3. Tasks check pods, logs, database connectivity, events, and resource usage

  4. Edgar presents a multi-layer analysis

Eager Edgar will suggest tasks such as:

  • Check Pod Status — Review all pods in the namespace

  • Inspect Checkout Service Logs — Look for error details

  • Check Database Connectivity — Verify the data layer is reachable

  • Check Service Endpoints — Verify service discovery

  • Check Resource Usage — Review CPU, memory, and connection metrics

  • Check Recent Events — Look for cluster-level warnings
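
If you want to cross-check Eager Edgar's diagnostics manually, the same signals are available through kubectl. A minimal sketch, assuming you have kubectl access to the cluster (the namespace and workload names come from this scenario):

# List all pods in the namespace and their status
kubectl get pods -n online-boutique-prod

# Tail the checkout service logs and look for database errors
kubectl logs deployment/checkoutservice -n online-boutique-prod --tail=100 | grep -i "fatal\|connection"

# Review recent events, newest last
kubectl get events -n online-boutique-prod --sort-by=.lastTimestamp

# Check CPU and memory usage (requires metrics-server)
kubectl top pods -n online-boutique-prod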


Step 3: Review the Findings

Eager Edgar runs the tasks and presents structured results. This investigation reveals a clear trail.

Expected Findings

Finding 1: Checkout Service Logs Show a Fatal Database Error

🔴 checkoutservice logs:
   FATAL: Database connection pool exhausted - unable to process order...
   Connection timeout after 30s. Active connections: 100/100.

This is the key finding. The checkout service is running, but every time it tries to process an order it cannot acquire a database connection — the pool is completely saturated.

Finding 2: Checkout Service Pod Is Running

✅ Pod 'checkoutservice-6b8d5c9f7-q4m2x' status: Running
   Ready: 1/1
   Restarts: 0

The pod itself is healthy. This isn't a crash or a code bug — the service starts and runs fine, but fails under load when it needs the database.

Finding 3: Database Pod Under Pressure

⚠️ Database pod resource status:
   Active connections: 100 (at configured maximum)
   Waiting queries: 47
   Connection wait time: averaging 28s
   
   CPU usage: 92%
   Memory usage: high
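
If the data layer is PostgreSQL (the fixes later in this scenario assume it is), you can confirm the saturation from the database itself. A hedged sketch: the pod name checkout-db-0 and the postgres user are illustrative, not values surfaced by the scenario.

# Count connections as seen by Postgres itself
kubectl exec -n online-boutique-prod checkout-db-0 -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Show the server-side connection limit for comparison
kubectl exec -n online-boutique-prod checkout-db-0 -- \
  psql -U postgres -c "SHOW max_connections;"

If the first number matches the second, the pool is saturated at the server, not just in the checkout service's client configuration.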

Finding 4: Warning Events

🟡 Events in online-boutique-prod:
   - checkoutservice: "Connection pool wait timeout exceeded"
   - Repeated checkout failures logged over the past 30 minutes

Step 4: Trace the Dependency Chain

Investigation layers (top to bottom):

  • Layer 1 — User-Facing Symptom: Placing an order fails

  • Layer 2 — Service Layer: Checkout service logs FATAL: Database connection pool exhausted

  • Layer 3 — Data Layer: Database has 100/100 active connections, queries queuing

  • Layer 4 — Infrastructure: Connection pool limit is too low for current traffic, or connections are leaking

Each layer points to the one below it. The user sees "order failed," but the root cause is a database connection pool that can't keep up.

Following the Trail

This is the key lesson of this scenario — the user-facing symptom (checkout fails) is two layers removed from the root cause (database connection pool exhaustion).

Layer            | What You See                                                                 | What It Means
User Experience  | "Place Order" fails                                                          | Something downstream is broken
Checkout Service | FATAL: Database connection pool exhausted... Active connections: 100/100    | The service code is fine — it can't get a database connection
Database         | 100/100 connections active, queries waiting                                  | The database is overwhelmed or connections aren't being released
Infrastructure   | Pool limit too low, possible connection leak, or under-provisioned database | The configuration or resources don't match the current load

Root Cause

The database backing the checkout service has hit its connection pool limit. All 100 connections are in use, and new requests wait until they time out after 30 seconds. This could be caused by:

  • Connection pool limit too low: Traffic has grown beyond what the configured pool size can handle

  • Connection leak: The application is opening connections but not closing them properly, gradually exhausting the pool

  • Slow queries: Long-running queries hold connections open, preventing others from being served

  • Under-provisioned database: The database itself doesn't have enough CPU or memory to process queries fast enough, causing connections to back up

The fix needs to address the database layer, not the checkout service code.
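
Which of these causes you are facing changes the fix, and the database can usually tell you which one it is. A hedged example, again assuming PostgreSQL and the illustrative pod name checkout-db-0: many connections stuck in 'idle in transaction' point to a leak, while long-running 'active' queries point to slow queries.

# Break down connections by state; a pile of 'idle in transaction' suggests a leak
kubectl exec -n online-boutique-prod checkout-db-0 -- psql -U postgres -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"

# List queries that have held a connection for more than 30 seconds (slow-query suspects)
kubectl exec -n online-boutique-prod checkout-db-0 -- psql -U postgres -c \
  "SELECT pid, now() - query_start AS runtime, left(query, 60) AS query
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '30 seconds'
   ORDER BY runtime DESC;"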


Step 5: Ask for Remediation Guidance

Sample Follow-Up Prompts

Prompt Examples:

  • "How do I fix the database connection pool issue in online-boutique-prod?"

  • "The checkout service has exhausted its database connections — what should I do?"

  • "Increase the database connection pool limit for checkoutservice"

  • "What's the recommended fix for connection pool exhaustion?"

The Fix

Depending on the exact root cause, Eager Edgar may suggest one or more of these approaches:

Fix A: Increase the Connection Pool Size

If traffic has grown and the pool limit is simply too low, increase it. This is typically configured via an environment variable or ConfigMap on the checkout service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkoutservice
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: server
          env:
            - name: DB_MAX_CONNECTIONS
              value: "200"          # Increased from 100
            - name: DB_POOL_TIMEOUT
              value: "60s"          # Increased from 30s

Fix B: Scale Up the Database

If the database itself is under-provisioned (high CPU, slow queries), increase its resources:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: checkout-db
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: postgres
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"         # Increased from 512Mi
              cpu: "1000m"          # Increased from 250m

Fix C: Add a Connection Pooler (PgBouncer)

For a more robust solution, deploy a connection pooler in front of the database to manage connections efficiently:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: online-boutique-prod
spec:
  template:
    spec:
      containers:
        - name: pgbouncer
          image: pgbouncer/pgbouncer:latest
          env:
            - name: MAX_CLIENT_CONN
              value: "500"
            - name: DEFAULT_POOL_SIZE
              value: "50"
            - name: POOL_MODE
              value: "transaction"

How to apply:

kubectl apply -f <fix>.yaml -n online-boutique-prod

After applying, restart the checkout service to pick up the new configuration:

kubectl rollout restart deployment/checkoutservice -n online-boutique-prod
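
You can watch the rollout complete before moving on to verification:

kubectl rollout status deployment/checkoutservice -n online-boutique-prod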

Step 6: Verify the Fix

Verification Steps

This scenario requires verifying at every layer of the dependency chain — not just the component you fixed.

Verification chain: Is the database healthy? → Are connections below the limit? → Is checkout processing orders? → End-to-End OK ✅

Ask Eager Edgar to Verify

Prompt Examples:

  • "Is the database connection pool healthy in online-boutique-prod?"

  • "Can checkoutservice process orders now?"

  • "Check the full health of online-boutique-prod"

  • "Are there still connection timeout errors?"

Success Criteria

✅ Database active connections well below maximum
✅ No connection timeout errors in checkout service logs
✅ Checkout service processing orders successfully
✅ No queued/waiting database queries
✅ Order flow completes from frontend
✅ No warning or error events in the namespace
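
If you want to confirm these criteria outside of chat, hedged kubectl equivalents (same assumptions as earlier) look like this:

# Pods still healthy after the change
kubectl get pods -n online-boutique-prod

# No fresh connection errors in the checkout service logs
kubectl logs deployment/checkoutservice -n online-boutique-prod --since=10m | grep -i "fatal\|timeout" || echo "No connection errors found"

# No new warning events in the namespace
kubectl get events -n online-boutique-prod --field-selector type=Warning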

What You Learned

Key Takeaways

  1. Follow the dependency chain. User-facing symptoms are often multiple layers removed from the root cause. Checkout failures led to the database, the database led to connection pool limits.

  2. "Partially broken" often means a dependency is saturated. When some features work (browsing) and others don't (checkout), the broken features likely share a dependency that's under pressure.

  3. Running pods can still fail under load. The checkout service wasn't crashing — it was healthy until it needed a database connection and couldn't get one. Load-dependent failures are harder to catch than crashes.

  4. Connection pool exhaustion is one of the most common production issues. It looks like an application error (FATAL: unable to process order) but the fix is at the infrastructure or configuration level — pool sizing, database resources, or adding a connection pooler.

  5. AI Assistants investigate breadth-first. Eager Edgar checked pods, logs, database connectivity, events, and resources in parallel — the same investigation would take a human much longer moving layer by layer.

Troubleshooting Pattern: Infrastructure / Data Layer Issues

The pattern: Feature broken but some services OK → Check logs of affected service → Find database/dependency error → Investigate dependency health and resource usage → Capacity or configuration issue? → If yes: Increase pool size / scale resources. If no: Check for connection leaks / slow queries.

Comparing All Three Scenarios


                     | Scenario 1: Code | Scenario 2: Config          | Scenario 3: Infrastructure
Symptom              | Pod crashing     | Feature errors, pod running | Checkout fails under load
Pod status           | CrashLoopBackOff | Running                     | Running
Error location       | Startup logs     | Runtime logs                | Runtime logs (database connection error)
Root cause           | Bug in code      | Wrong config value          | Database connection pool exhausted
Fix location         | App source code  | ConfigMap / env var         | Pool config / database resources
Investigation depth  | Single pod       | Service + config            | Multi-layer trace
Detection difficulty | Easy             | Moderate                    | Hard


Tutorial Complete

Congratulations! You've completed all three Online Boutique troubleshooting scenarios. You now know how to:

  • ✅ Use workspace chat to investigate issues

  • ✅ Describe problems in natural language to AI Assistants

  • ✅ Run diagnostic tasks and interpret structured results

  • ✅ Identify root causes across code, configuration, and infrastructure failures

  • ✅ Implement fixes and verify at every layer

  • ✅ Recognize common failure patterns in Kubernetes applications

What's Next?

Apply these skills to your own environment:

  1. Set up RunWhen for your cluster (see the Quick Start Guide)

  2. Explore the task library with 3,400+ automated diagnostics

  3. Learn about Engineering Assistants and how to customize them

Deepen your knowledge:

  • Review How RunWhen Works — Understand production insights and background tasks

  • Try the Kubernetes Resource Tuning Tutorial — Another hands-on scenario

  • Join the community Slack to discuss what you've learned