Kubernetes: Application Storage Troubleshooting

This scenario shows how an application developer with no Kubernetes training can quickly diagnose and remediate an application issue caused by full storage volumes.

This tutorial relies heavily on GIFs, which may take time to load in your browser.

Scenario Overview

The acme-fitness application is deployed in the dedicated RunWhen Workspace named Sandbox. It has a single public URL, runs in a Google GKE cluster, is deployed from a git repository, and comprises application and database container images.

Application URL: http://acme-fitness.sandbox.runwhen.com/
Application Git Repository: https://github.com/runwhen-contrib/demo-sandbox-acme-fitness
If you would like to try placing an order in the application, log in with the following credentials:
- username: walter
- password: vmware1!
- This is a sample application; please do not enter any personal information into the sample order.

If you click on the application URL above and try to place an order, you will be met with the following notice:

This tutorial will walk you through:

Engaging with Engineering Assistant Eager Edgar
Asking Eager Edgar for suggestions on what to run when we can't place orders
Running Eager Edgar's suggested tasks
Reviewing the task results, warnings, and suggested next steps
Identifying and remediating the root cause of the application failure

https://youtu.be/-G9AIMmoAV0

Getting Started in the RunWhen Platform

Upon logging into the platform, you will be shown a list of Workspaces that are accessible to you. If this is your first time, you will see some public Workspaces - these are created for demonstration and exploration purposes.

Selecting a Workspace

For this scenario, please select the following workspace:

Sandbox

Upon clicking the workspace, you will be dropped into the workspace map - an interactive method of searching and navigating across the resources in the workspace.

Asking Eager Edgar to Help with Orders Failing

Since the main issue right now is that the acme-fitness application appears unable to process orders, ask Eager Edgar for a list of recommended troubleshooting tasks.

Select the Command Bar, and
- Type acme-fitness to search for the appropriate application group, and select the group
Select the Command Bar, and
- Type in a statement such as "Orders are failing", or "Check pod health", or "Check for storage"
- There are now two options:
  - Select RUN ALL to allow Eager Edgar to troubleshoot the issue on your behalf (see the video posted above, or try it yourself)
  - Select TROUBLESHOOT to review the suggested tasks, and
    - Select all tasks that you wish to run
    - Create a RunSession name
    - Click Run

Reviewing Issues & Running More Tasks

In this next step, review the Issues generated by that task. Notice that there are a few Issues, such as:

Deployment order in Namespace acme-fitness is unavailable
Deployment order in acme-fitness is generating error logs.

Each of these Issues will list some Suggested Next Steps. Select ASK on some of these suggestions and run the top suggestions that Eager Edgar provides.
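
For readers curious about what an "unavailable" Deployment means underneath the Issue title, here is a minimal Python sketch. It is not RunWhen's task implementation. It assumes the Deployment's JSON was fetched separately (for example with `kubectl get deployment order -n acme-fitness -o json`); the `sample` document and the `deployment_is_available` helper are hypothetical and only illustrate comparing `spec.replicas` with `status.availableReplicas`.

```python
import json


def deployment_is_available(deployment: dict) -> bool:
    """Return True when every desired replica is reported as available."""
    # spec.replicas defaults to 1 when it is not set explicitly.
    desired = deployment.get("spec", {}).get("replicas", 1)
    # availableReplicas is omitted from the status when it is zero.
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return available >= desired


# Hypothetical, trimmed output of:
#   kubectl get deployment order -n acme-fitness -o json
sample = json.loads("""
{
  "metadata": {"name": "order", "namespace": "acme-fitness"},
  "spec": {"replicas": 1},
  "status": {"replicas": 1, "unavailableReplicas": 1}
}
""")

if __name__ == "__main__":
    name = sample["metadata"]["name"]
    state = "available" if deployment_is_available(sample) else "unavailable"
    print(f"Deployment {name} is {state}")
```

With the sample above, no replicas are listed as available, so the helper reports the order Deployment as unavailable. This matches the kind of condition the first Issue describes.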

Identifying and Remediating the Root Cause

After running a number of tasks, review the Issues tab (which highlights each issue, sorted by severity):

Notice that there is an Issue reporting that PVC Storage Utilization is at 100% in acme-fitness
Review the Suggested Next Steps, and ask Eager Edgar if any tasks match the suggestion of Expand Persistent Volume Claim acmefit-catalog-data in Namespace acme-fitness to 2Gi
Run the top suggested task
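
To make the 100% utilization figure and the 2Gi suggestion more concrete, the arithmetic can be sketched in Python. This is an illustration under stated assumptions, not the logic of the RunWhen task: the current claim size of 1Gi and the used-byte count are hypothetical, and `parse_quantity` only handles whole-number binary suffixes (Ki, Mi, Gi, Ti), which is a small subset of the Kubernetes quantity format. The patch body uses the standard PVC field `spec.resources.requests.storage`.

```python
import json

# Binary suffixes used by Kubernetes resource quantities (subset only).
_BINARY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(quantity: str) -> int:
    """Convert a whole-number quantity such as '2Gi' into bytes."""
    for suffix, factor in _BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def utilization_percent(used_bytes: int, capacity: str) -> float:
    """Percentage of the claim's capacity that is in use."""
    return 100.0 * used_bytes / parse_quantity(capacity)


def expansion_patch(new_size: str) -> str:
    """Build a JSON merge patch that raises the PVC storage request."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Assumed values: a 1Gi claim that is completely full.
    print(utilization_percent(1024**3, "1Gi"))
    print(expansion_patch("2Gi"))
```

A patch like this could in principle be applied by hand with `kubectl patch pvc acmefit-catalog-data -n acme-fitness --type merge -p '<patch>'`, provided the StorageClass allows volume expansion. In a namespace managed through GitOps, a direct change may be overwritten by the next sync, so changing the manifest in Git is usually the more durable route.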

Identifying and Remediating the Root Cause

When the task completes, review the Suggested Next Steps for:
- Pull Requests for manifest changes are open and in need of review for namespace recipes
- Visit the URL of the Pull Requests that were opened
- Click the Escalate icon on the Issue to notify the service owner that the PR requires approval

Reviewing the Report Output

At any time throughout the troubleshooting process it is possible to continue running tasks, ask for more suggestions, or review the output of the report. The report will continually highlight issues that might require additional investigation.

Looking at the report history:

A query such as Orders are failing led to a number of suggestions in the acme-fitness group
Tasks were run across many of the microservices components, looking for issues that might relate to the query
Issues were generated, indicating that:
- The orders microservice was not available/running
- Log errors indicated that there were storage capacity issues with orders
- The storage mounts (Kubernetes Persistent Volume Claims) were investigated and in need of expansion
Tasks were suggested that matched an available GitOps Remediation Task, such as:
- Expand Persistent Volume Claims in Namespace `acme-fitness`
GitHub Pull Requests were opened that would fix the root cause of the resource constraints:
- [RunWhen] - GitOps Manifest Updates for PersistentVolumeClaim-postgredb
- [RunWhen] - GitOps Manifest Updates for PersistentVolumeClaim-acmefit-catalog-data
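
As a rough idea of the kind of change such a pull request might carry, the sketch below rewrites the storage request in a PersistentVolumeClaim manifest. Everything here is an assumption for illustration: the manifest text, its 1Gi starting size, and the `set_storage_request` helper are hypothetical, and the actual files and tooling behind the RunWhen pull requests may look quite different. The helper uses a simple line-based replacement to stay within the standard library; a real tool would more likely parse the YAML.

```python
import re

# Hypothetical PVC manifest; the real file in the Git repository may differ.
MANIFEST = """\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: acmefit-catalog-data
  namespace: acme-fitness
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
"""


def set_storage_request(manifest: str, new_size: str) -> str:
    """Replace the value of the first 'storage:' key, keeping indentation."""
    updated, count = re.subn(
        r"^([ \t]*storage:[ \t]*)\S+[ \t]*$",
        lambda match: match.group(1) + new_size,
        manifest,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no 'storage:' key found in manifest")
    return updated


if __name__ == "__main__":
    print(set_storage_request(MANIFEST, "2Gi"))
```

Only the `storage:` line changes; the rest of the manifest, including indentation, is left untouched, which keeps the resulting diff small and easy to review.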