Kubernetes: Identify and Fix Configuration Issues

This scenario shows how an application developer with no Kubernetes training can identify the root cause of, and remediate, a subtle configuration issue that is causing a microservice component to fail.

This tutorial relies heavily on GIFs, which may take time to load in your browser.

Scenario Overview

The online-boutique application has its own dedicated RunWhen Workspace (named Online Boutique). It has a single public URL, runs in a Google GKE cluster, and is composed of many microservices.

Application URL: https://online-boutique.sandbox.runwhen.com

If you click on the link above, you will immediately see 500 errors, but it might not be obvious why.

This tutorial will walk you through:

Engaging with Engineering Assistant Eager Edgar
Asking Eager Edgar for suggestions on what to run when our application is crashing, and
- Having Eager Edgar run all suggested tasks, and
- Following up on any issues
Reviewing the task results, warnings, and suggested next steps
Identifying the root cause of the application failure

Getting Started in the RunWhen Platform

Upon logging into the platform, you will be shown a list of Workspaces that are accessible to you. If this is your first time, you will see some Public workspaces - these are created for demonstration and exploration purposes.

Selecting a Workspace

For this scenario, please select the following workspace:

Online Boutique

Upon clicking the workspace, you will be dropped into the workspace map - an interactive method of searching and navigating across the resources in the workspace.

In this map, you can pan & zoom, or use the search bar to find resources.

Asking Eager Edgar to Help

The application is failing and users are seeing the error "could not retrieve cart", so ask Eager Edgar for a list of recommended troubleshooting tasks. An existing tutorial can help get you started:

Select the Command Bar ('/'), and
- Type "could not retrieve cart" and let Eager Edgar suggest some tasks
Select Run All
- While this is running, Eager Edgar will respond to issues and search for better tasks, opening and closing issues along the way

Reviewing RunSession Results

RunSessions will have three pieces of information:

Suggestions: A summary of the most important issues to focus on, along with a grouping of similar or related issues
Timeline: The journey of who ran what, and when
Issues: All Issues generated during the RunSession, sorted by status (open/closed) and severity (major/minor)

Reviewing Suggestions

Starting with Suggestions, we can review what Eager Edgar considers the most important issue to come out of the troubleshooting session. In this case, he's determined:

Many of the issues generated are symptoms of the same root cause
There is an open Pull Request (a remediation task) that is likely related to the failing service
The owner of the service should be contacted to approve the Pull Request

Reviewing the Timeline

Reviewing the timeline will highlight:

What started the RunSession (e.g. a Search Query)
What tasks were run, when, and by whom
Any issues related to a specific task

Reviewing Issues

A complete list of issues can be reviewed, along with their state (open/closed), severity (major/minor), and additional task details, such as:

Suggested Next Steps - Additional steps that might help (though the Engineering Assistant may have already performed this action)
Details - A snippet of the log detail, with a link to the issue details in the Full Report
Comments - Any comments from the Engineering Assistant or a colleague related to the issue

Reviewing the Report Output

At any time during the troubleshooting process, you can continue running tasks, ask for more suggestions, or review the output of the report. The report contains the complete and detailed output from each task that was run in the RunSession.

The Summary lists each task and issue related to any service or point on the map. It can be sorted by severity or number of open issues.

The Full Report contains the verbose output of each and every task, issues identified, and links to the debug logs of the task execution.

Looking at the report history:

The starting point was a search query taken from the failing application "could not retrieve cart":
- A total of 20 tasks were suggested and run from that query
- An additional 7 tasks were run automatically based on the issues that came up
Tasks were run that confirmed the cartservice Deployment had no ready pods.
The cartservice Deployment was investigated, and it was determined that the manifest had an incorrect readiness probe configuration.
A task was suggested to Remediate the Readiness and Liveness Probe for Deployments in Namespace, which matched to a GitOps Remediation Task that opened a GitHub Pull Request with the required fix.
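
To make the last few findings more concrete, here is a minimal Python sketch of the two kinds of checks described above: confirming that a Deployment reports no ready pods, and looking for a probe that points at a port the container does not declare. This is not the code behind the RunWhen tasks, and the tutorial does not say what exactly was wrong with the cartservice readiness probe; a port mismatch is assumed here purely for illustration. The function names, the container name, and the port numbers in the sample manifest are all hypothetical. The manifest is represented as a plain dictionary, as it would look after parsing the Deployment's YAML or JSON.

```python
from typing import Any


def ready_replicas(deployment: dict[str, Any]) -> int:
    """Return the number of ready replicas reported in the Deployment status."""
    # Kubernetes omits status.readyReplicas when no pods are ready,
    # so a missing field is treated as zero.
    return deployment.get("status", {}).get("readyReplicas", 0)


def probe_port_mismatches(deployment: dict[str, Any]) -> list[str]:
    """Flag probes whose target port is not among the container's declared ports.

    This is only a heuristic: declaring containerPort is informational in
    Kubernetes, so a probe may legitimately target an undeclared port.
    """
    findings = []
    pod_spec = deployment["spec"]["template"]["spec"]
    for container in pod_spec.get("containers", []):
        ports = container.get("ports", [])
        declared_numbers = {p["containerPort"] for p in ports}
        declared_names = {p["name"] for p in ports if "name" in p}
        for probe_key in ("readinessProbe", "livenessProbe"):
            probe = container.get(probe_key)
            if not probe:
                continue
            # Probe handlers that carry a port; exec probes have none.
            for handler in ("httpGet", "tcpSocket", "grpc"):
                port = probe.get(handler, {}).get("port")
                if port is None:
                    continue
                # httpGet and tcpSocket ports may be a number or a port name.
                known = declared_names if isinstance(port, str) else declared_numbers
                if port not in known:
                    findings.append(
                        f"{container['name']}: {probe_key} targets port "
                        f"{port!r}, which is not declared in ports"
                    )
    return findings


# Hypothetical manifest: the container name and port numbers are made up
# for this example and are not taken from the tutorial.
example = {
    "metadata": {"name": "cartservice"},
    "spec": {"template": {"spec": {"containers": [{
        "name": "server",
        "ports": [{"containerPort": 7070}],
        "readinessProbe": {"grpc": {"port": 7071}},
        "livenessProbe": {"grpc": {"port": 7070}},
    }]}}},
    "status": {"replicas": 1},
}

if __name__ == "__main__":
    print("ready replicas:", ready_replicas(example))
    for finding in probe_port_mismatches(example):
        print(finding)
```

A readiness probe that can never succeed keeps every pod out of the ready state, which is consistent with the "no ready pods" finding earlier in the report. In this scenario the fix itself is delivered through the GitOps Pull Request rather than by editing the running Deployment directly.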