Kubernetes: Application Storage Troubleshooting

This scenario shows how an application developer with no Kubernetes training can quickly diagnose and remediate an application issue that is caused by storage volumes that are full.

This tutorial relies heavily on GIFs, which may take time to load in your browser.

Scenario Overview

The acme-fitness application is deployed in the dedicated RunWhen Workspace named Sandbox. It has a single public URL, runs in a Google Kubernetes Engine (GKE) cluster, is deployed from a git repository, and comprises application and database container images.

If you click on the application URL above and try to place an order, you will be met with the following notice:

This tutorial will walk you through:

  • Engaging with Engineering Assistant Eager Edgar

  • Asking Eager Edgar for suggestions on what to run when we can't place orders

  • Running Eager Edgar's suggested tasks

  • Reviewing the task results, warnings, and suggested next steps

  • Identifying and remediating the root cause of the application failure

https://youtu.be/-G9AIMmoAV0

Getting Started in the RunWhen Platform

Upon logging into the platform, you will be shown a list of Workspaces that are accessible to you. If this is your first time, you will see some public Workspaces - these are created for demonstration and exploration purposes.

Listing of Workspaces

Selecting a Workspace

For this scenario, please select the following workspace:

  • Sandbox

Upon clicking the workspace, you will be dropped into the workspace map - an interactive method of searching and navigating across the resources in the workspace.

Viewing the "Sandbox" Workspace Map

Asking Eager Edgar to Help with Orders Failing

Since the main issue is that the acme-fitness application appears unable to process orders, ask Eager Edgar for a list of recommended troubleshooting tasks.

  • Select the Command Bar, and

    • Type acme-fitness to search for the appropriate application group, and select the group

  • Select the Command Bar, and

    • Type in a statement such as "Orders are failing", or "Check pod health", or "Check for storage"

    • There are now two options:

      • Select RUN ALL to allow Eager Edgar to troubleshoot the issue on your behalf (See the posted video above, or click this yourself)

      • Select TROUBLESHOOT to review the suggested tasks, and

        • Select all tasks that you wish to run

        • Create a RunSession name

        • Click Run

Creating a RunSession

Reviewing Issues & Running More Tasks

In this step, review the Issues generated by the tasks you just ran. Notice that there are a few Issues, such as:

  • Deployment order in Namespace acme-fitness is unavailable

  • Deployment order in acme-fitness is generating error logs.

Each of these Issues will list some Suggested Next Steps. Select ASK on some of these suggestions and run the top suggestions that Eager Edgar provides.

Reviewing Issues & Running More Tasks

Reviewing Issues and Running More Tasks
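An Issue such as "Deployment order in Namespace acme-fitness is unavailable" can be reasoned about in terms of the Deployment object itself. The following is a minimal sketch, not the logic RunWhen's tasks actually use (the tutorial does not describe that). It assumes a Deployment object shaped like the output of `kubectl get deployment order -n acme-fitness -o json`; the `sample` data and the function name `deployment_is_unavailable` are hypothetical.

```python
def deployment_is_unavailable(deployment: dict) -> bool:
    """Return True when fewer replicas are available than desired."""
    # spec.replicas defaults to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    # status.availableReplicas is omitted when no replica is available.
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return available < desired


# Hypothetical object for illustration only; values are not taken from the tutorial.
sample = {
    "metadata": {"name": "order", "namespace": "acme-fitness"},
    "spec": {"replicas": 1},
    "status": {"replicas": 1, "unavailableReplicas": 1},
}

print(deployment_is_unavailable(sample))  # prints True
```

An unavailable Deployment is a symptom rather than a cause, which is why the Suggested Next Steps lead on to log and storage checks.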

Identifying and Remediating The Root Cause

After running a number of tasks, review the Issues tab (which highlights each issue, sorted by severity):

  • Notice the Issue reporting that PVC Storage Utilization is at 100% in acme-fitness

  • Review the Suggested Next Steps, and Ask Eager Edgar if any tasks match the suggestion of Expand Persistent Volume Claim acmefit-catalog-data in Namespace acme-fitness to 2Gi

  • Run the top suggested task

Identifying and Remediating the Root Cause
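To make the "100% utilization" Issue and the "expand to 2Gi" suggestion concrete, here is a rough sketch of the arithmetic and of the change an expansion amounts to in Kubernetes API terms. It is illustrative only: the current claim size of `1Gi` is an assumption (the tutorial does not state it), the helper names are hypothetical, and `parse_quantity` handles only plain integers and a few binary suffixes rather than the full Kubernetes quantity format.

```python
import json

# Binary (power-of-two) suffixes commonly used for storage quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '2Gi' to bytes (limited subset of the format)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def utilization_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Percentage of the volume's capacity that is in use."""
    return 100.0 * used_bytes / capacity_bytes


def expansion_patch(new_size: str) -> dict:
    """Body of a merge patch that raises a PVC's storage request."""
    return {"spec": {"resources": {"requests": {"storage": new_size}}}}


# Assumed values for illustration: a full 1Gi volume, expanded to 2Gi.
capacity = parse_quantity("1Gi")
print(utilization_percent(capacity, capacity))  # prints 100.0
print(json.dumps(expansion_patch("2Gi")))
```

A patch body like this could be applied directly with `kubectl patch pvc acmefit-catalog-data -n acme-fitness --type merge -p '<json>'`, but in this tutorial the change is proposed through a Pull Request instead. In either case, expansion only succeeds if the claim's StorageClass has `allowVolumeExpansion: true`.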

  • When the task completes, review the Suggested Next Steps for:

    • Pull Requests for manifest changes are open and in need of review for namespace recipes

    • Visit the URL of the Pull Requests that were opened

    • Click the Escalate icon on the Issue to notify the service owner that the PR requires approval

Reviewing the Proposed Fix
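When reviewing the Pull Request, the main thing to confirm is that the only change is the storage request on the affected claim. The sketch below shows one way to check that with Python's standard `difflib` module. The manifest text is hypothetical: the tutorial does not show the real file, so the access mode and the original size of `1Gi` are assumptions, as is the helper name `changed_lines`.

```python
import difflib

# Hypothetical manifest; only the claim name and namespace come from the tutorial.
OLD_MANIFEST = """\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: acmefit-catalog-data
  namespace: acme-fitness
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
"""

# The proposed fix: same manifest with a larger storage request.
NEW_MANIFEST = OLD_MANIFEST.replace("storage: 1Gi", "storage: 2Gi")


def changed_lines(old: str, new: str) -> list:
    """Return (sign, text) pairs for every line removed or added."""
    changes = []
    for line in difflib.ndiff(old.splitlines(), new.splitlines()):
        # ndiff prefixes removed lines with "- " and added lines with "+ ";
        # unchanged lines ("  ") and hint lines ("? ") are skipped.
        if line.startswith("- ") or line.startswith("+ "):
            changes.append((line[0], line[2:].strip()))
    return changes


for sign, text in changed_lines(OLD_MANIFEST, NEW_MANIFEST):
    print(sign, text)
```

If the list contains anything other than the old and new `storage:` lines, the Pull Request deserves a closer look before approval.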

Reviewing the Report Output

At any time throughout the troubleshooting process it is possible to continue running tasks, ask for more suggestions, or review the output of the report. The report will continually highlight issues that might require additional investigation.

Viewing the RunSession Report

Looking at the report history:

  • A query such as Orders are failing led to a number of suggestions in the acme-fitness group

  • Tasks were run across many of the microservices components, looking for issues that might relate to the query

  • Issues were generated, indicating that:

    • The orders microservice was not available/running

    • Log errors indicated that there were storage capacity issues with orders

    • The storage mounts (Kubernetes Persistent Volume Claims) were investigated and found to need expansion

  • Tasks were suggested that matched an available GitOps Remediation Task, such as:

    • Expand Persistent Volume Claims in Namespace `acme-fitness`

  • GitHub Pull Requests were opened that would fix the root cause of the resource constraints:
