Rotating a TLS Cert? Remember to Move Your Grafana Datasource Versions Forward

Apr 27, 2026 · 7 min read · Shea Stewart

A TLS and false-positive alert story.

TL;DR

When you rotate a TLS cert that a Grafana provisioned datasource depends on, also bump the version field in the datasource YAML. Grafana caches the datasource config, TLS material included, in its PostgreSQL data_source table, and only re-reads the on-disk file when version changes. Skip the bump and Grafana silently keeps using the old cached cert, even after a restart. Equivalent fix without a YAML edit: look the datasource up at /api/datasources/name/<datasource>, then PUT the new cert to /api/datasources/uid/<uid> (Grafana’s update endpoint is keyed by uid, not by name).

If you came here for that, you can stop. If you’d rather skip the story and see the long-term fix, jump to the daily task that catches this drift before the alert fires. Otherwise read on — the rest is the false-positive alert that surfaced this, and how the agent walked us back to the real cause.

Prologue

Twice a year you move your clocks forward (or back), and most of the time you remember. The times you forget, you find out the hard way. Grafana provisioned datasources have an equivalent ritual: after you rotate the TLS cert behind one, you also have to nudge the integer in the datasource YAML’s version field forward. The cert file on disk updates, the rotation log is clean — but Grafana itself won’t reload the new cert until that one number ticks up.

We forgot. The signal arrived as a noisy alert in our Grafana alerts channel — nominally about container OOM events, with Value: [no value] on it and an x509 expiry buried in the annotation. Instead of pulling a couple of us into the thread to debate, we handed it straight to the RunWhen Agent. What follows is what it surfaced, and the daily task we wrote afterwards so the next person on rotation doesn’t chase it from a misleading symptom.

Act I — “Is this valid?”

[Image: Slack thread on a Container OOM alert. Abid asks if it's valid; Shea hands it to @RunWhen Watcher, which surfaces an x509 expiry buried in the alert annotation.]

The alert headline pointed at four containers in three namespaces and a memory-pressure story that, taken at face value, would have eaten the morning. The agent read the annotation block before the headline:

tls: failed to verify certificate: x509: certificate has expired
or is not yet valid: current time 2026-04-26T08:02:44Z is after
2026-04-22T22:15:34Z

That single line reframes the rest of the alert. Grafana can’t query Mimir’s Prometheus API, so the alert rule evaluates with no data and fires with grafana_state_reason = Error. The headline isn’t a finding — it’s a query failure dressed up as one.
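For readers who want to automate the same first move, here is a minimal Python sketch that pulls the expiry out of an annotation like the one above. It assumes the error text has the shape quoted in this post, with second-precision UTC timestamps; the function names are ours for illustration and are not part of any RunWhen or Grafana API.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches the tail of the x509 expiry error as quoted in the annotation.
_EXPIRY_RE = re.compile(
    r"x509: certificate has expired or is not yet valid: "
    r"current time (\S+) is after (\S+)"
)


def _parse_utc(stamp: str) -> datetime:
    # Assumes timestamps shaped like 2026-04-26T08:02:44Z (no fractional seconds).
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def expired_for(annotation: str) -> Optional[timedelta]:
    """Return how long the cert had been expired, or None if no x509 expiry is found."""
    flat = " ".join(annotation.split())  # undo any line wrapping in the annotation
    match = _EXPIRY_RE.search(flat)
    if match is None:
        return None
    now, not_after = (_parse_utc(group) for group in match.groups())
    return now - not_after


annotation = """tls: failed to verify certificate: x509: certificate has expired
or is not yet valid: current time 2026-04-26T08:02:44Z is after
2026-04-22T22:15:34Z"""

print(expired_for(annotation))  # 3 days, 9:47:10
```

A non-None result is a strong hint that the alert is a query failure rather than a finding, which is the call the agent made here.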

The agent also flagged a separate, real problem in the same sweep: a high-severity issue with hundreds of occurrences where grafana-alloy-metrics is exceeding Mimir’s 500k per-user series limit on one of the tenants. Worth fixing — unrelated to the alert in this thread.

Act II — Run the cert checks from the thread

[Image: @RunWhen Watcher proposes two mimir cert tasks; Shea triggers them from the Slack message and the task results render back into the thread.]

The agent recommended two existing tasks against the mimir namespace: “Find Unhealthy Certificates” and “Find Failed Certificate Requests and Identify Issues”. We triggered both from the thread.

Both came back clean: Unready Certificates: [], no failed cert requests. Every cert-manager-managed leaf certificate in mimir was Ready. The cert in the alert annotation isn’t a leaf cert and isn’t in the namespace those tasks scanned.

The failure is somewhere between cert-manager and the thing actually doing the TLS handshake — Grafana, in the shared cluster, querying Mimir over the public endpoint in another cluster.

Act III — The CA, not the leaves

[Image: Mid-thread, Shea identifies the CA cert in the shared cluster as what expired; the RunWhen Agent confirms with action items.]

The cert-manager-managed leaf certs in mimir rotate frequently and were all green. The CA that signs them, in the shared cluster, has a 3-month TTL. Different rotation schedule, different surface. The CA expired on 2026-04-22T22:15:34Z — exactly the timestamp in the alert annotation — and Grafana in the shared cluster, which trusts that CA in the datasource bundle, immediately stopped being able to verify the chain Mimir presented.

The alert wasn’t lying about when. It was lying about what — a container-level headline pointing at a CA cert two clusters away.

“I rotated and restarted, and it’s still broken”

[Image: The RunWhen Agent explains that Grafana persists datasource TLS material in Postgres; Shea identifies the provisioning version field as the gate, and the agent confirms the silent no-op behaviour on restart.]

Rotating the CA on disk and restarting Grafana didn’t fix it. The pod came up clean, ca.crt on the volume showed the new expiry, and Grafana kept reporting the old one — a date that no longer existed in any file we could find.

Grafana stores provisioned datasource TLS config in its PostgreSQL data_source table (json_data and secure_json_data). On startup it reads from the database, not from the provisioning YAML. The file on disk is consulted only when the provisioner decides it’s new — and the provisioner’s definition of “new” is: the integer in the version field changed.

Change the cert content but leave version: 1? Silent no-op on restart. Grafana keeps using the cached tlsCACert from secure_json_data, and the same expiry date keeps showing up in error logs after every rollout.

Bumping version (or looking the datasource up at GET /api/datasources/name/<datasource> and pushing the new cert via PUT /api/datasources/uid/<uid>) is what unsticks it.
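The version bump itself is small enough to script. The sketch below is one way to do it in Python, not the code our task runs: it edits the YAML as text so comments and layout survive, and it assumes each version key sits on its own line. The datasource name and the ca.crt path in the sample are hypothetical.

```python
import re

# Matches a `version: <int>` key on its own line, with an optional trailing comment.
# `apiVersion:` does not match because the key must start with "version".
_VERSION_RE = re.compile(
    r"^([ \t]*version:[ \t]*)(\d+)([ \t]*(?:#.*)?)$", re.MULTILINE
)


def bump_versions(yaml_text: str) -> str:
    """Increment every integer `version:` value by one and leave the rest untouched."""

    def _increment(match: "re.Match[str]") -> str:
        return f"{match.group(1)}{int(match.group(2)) + 1}{match.group(3)}"

    return _VERSION_RE.sub(_increment, yaml_text)


# Hypothetical provisioning file; only the shape matters here.
SAMPLE = """\
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    version: 1
    jsonData:
      tlsAuthWithCACert: true
    secureJsonData:
      tlsCACert: $__file{/etc/grafana/tls/ca.crt}
"""

print(bump_versions(SAMPLE))
```

Working on text rather than a parsed document keeps the diff in the PR down to the one integer. The trade-off is that a key written inline as `- version: 1` would be missed, and any other key literally named version would be bumped too, so review the diff before merging.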

The whole arc, in four steps:

Misleading alert fires → actually a TLS cert expiry
CA cert rotated → Grafana still broken (DB cache)
Grafana restarted → still broken (provisioning version check)
version bumped → finally picks up the new cert

Three “I fixed it but it’s still broken” layers before the real fix landed. None of them visible from the alert text.

Act IV — Closing the loop

The fix on the day was manual: rotate the cert, bump the integer, restart the pod. What we owed ourselves was a daily check that would have caught it before the alert ever fired. cert-manager doesn’t refresh ca.crt on a leaf Secret just because the upstream CA changed, so the only place the drift is visible is at the source — in the destination cluster, beside the workload that issued it.

We wrote one task per environment that runs on a daily cron and acts once a cert is within 30 days of expiry. Each task reads the client cert in its destination cluster and its mirror in the shared cluster, and checks whether ca.crt or tls.crt is close enough to expiry to act. If so, it forces a re-issue at the source, mirrors the new material back to the shared cluster, and opens a PR that bumps the affected datasource versions.
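The decision at the heart of that task can be sketched in a few lines of Python. This is an illustration under assumptions, not the committed task: it takes notAfter strings in the form `openssl x509 -enddate` prints (minus the `notAfter=` prefix), and the leaf expiry and the "now" in the example are made up.

```python
import ssl
from datetime import datetime, timedelta, timezone
from typing import Dict, List

THRESHOLD = timedelta(days=30)  # the buffer the daily task triggers on


def not_after_utc(cert_time: str) -> datetime:
    """Parse a notAfter string such as 'Apr 22 22:15:34 2026 GMT' into an aware datetime."""
    seconds = ssl.cert_time_to_seconds(cert_time)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expiring(not_afters: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the cert names that expire within THRESHOLD of now, or already have."""
    return sorted(name for name, exp in not_afters.items() if exp - now <= THRESHOLD)


certs = {
    "ca.crt": not_after_utc("Apr 22 22:15:34 2026 GMT"),   # the CA expiry from the alert
    "tls.crt": not_after_utc("Jul 20 00:00:00 2026 GMT"),  # hypothetical leaf expiry
}
now = datetime(2026, 3, 30, tzinfo=timezone.utc)  # hypothetical run date

print(expiring(certs, now))  # ['ca.crt']
```

Run on that date, the check would have flagged the CA a little over three weeks before the alert fired, while the leaf stayed quiet.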

We didn’t sit down and hand-write the script. The requirements went to a coding agent connected to this workspace via the RunWhen Platform MCP server — the same tool we covered in From Laptop to Production Ops in One Prompt. The agent authored the Python, validated it against the contract, ran it against live infrastructure to confirm the cert mirror and the PR open cleanly, and committed it as one task per environment. The prompt was roughly:

Build a daily check, one task per destination cluster, that
compares the Grafana client TLS cert in the destination's mimir
namespace against its mirror in the shared cluster's grafana
namespace. If ca.crt or tls.crt is within 30 days of expiring,
force a re-issue at the source, mirror the fresh material into
shared, and open a PR that bumps the version integer on every
Grafana datasource referencing the cert. Use a GitHub App for the
PR so no PAT lives in the workspace secrets.
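The comparison that prompt asks for can be as simple as fingerprinting ca.crt on both sides. Below is a hedged Python sketch of that step. It assumes each input is the JSON you would get from `kubectl get secret <name> -o json`; the function names and the fake Secrets are ours, for illustration only.

```python
import base64
import hashlib
import json


def ca_fingerprint(secret_json: str) -> str:
    """SHA-256 of the decoded ca.crt field of a Kubernetes Secret given as JSON."""
    secret = json.loads(secret_json)
    pem = base64.b64decode(secret["data"]["ca.crt"])
    return hashlib.sha256(pem).hexdigest()


def mirror_is_stale(destination_secret_json: str, mirror_secret_json: str) -> bool:
    """True when the mirrored ca.crt no longer matches the destination's ca.crt."""
    return ca_fingerprint(destination_secret_json) != ca_fingerprint(mirror_secret_json)


def _fake_secret(ca_pem: bytes) -> str:
    # Illustration helper: a Secret trimmed down to the one field read above.
    return json.dumps({"data": {"ca.crt": base64.b64encode(ca_pem).decode()}})


destination = _fake_secret(b"-----BEGIN CERTIFICATE-----\nnew CA\n")
mirror = _fake_secret(b"-----BEGIN CERTIFICATE-----\nold CA\n")

print(mirror_is_stale(destination, mirror))  # True
```

A byte-level hash is deliberately strict: a trailing newline difference counts as drift. That errs toward an unnecessary mirror rather than a missed one.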

The MCP server is what gives the agent the same view of the workspace the rest of our operational tooling has — so it can author against it, test against it, and commit into it without us hand-rolling auth, secret plumbing, or runbook scaffolding.

[Image: Five-step pipeline. Detect expiring cert in destination cluster; delete the Secret to force re-issue; mirror new material into shared; bump datasource versions and open a PR; Grafana picks up the new cert on the next pod restart, well inside the 30-day buffer.]

One non-obvious detail: force re-issue by deleting the Secret, not the Certificate. Deleting the cert-manager Certificate CR loses the race against Flux re-applying it; cert-manager sees the still-valid leaf Secret it owned and adopts it again, leaving the stale ca.crt in place. Deleting the Secret leaves the Certificate intact but with nothing to reference, and cert-manager issues a fresh leaf from the current CA chain. That’s the path we used to pick up a rotated CA on a leaf that still has time on it.
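In script form, that detail comes down to which resource kind goes on the kubectl command line. Here is a small Python sketch, defaulting to a dry run so it is safe to paste; the context, namespace, and Secret name in the example are hypothetical placeholders, not our real ones.

```python
import subprocess
from typing import List


def delete_secret_argv(context: str, namespace: str, secret_name: str) -> List[str]:
    """Build the kubectl command that deletes the leaf Secret, not the Certificate."""
    return [
        "kubectl", "--context", context, "--namespace", namespace,
        "delete", "secret", secret_name,
    ]


def force_reissue(context: str, namespace: str, secret_name: str,
                  dry_run: bool = True) -> List[str]:
    """Return the command; run it only when dry_run is False."""
    argv = delete_secret_argv(context, namespace, secret_name)
    if not dry_run:
        # cert-manager should then issue a fresh leaf for the still-present Certificate.
        subprocess.run(argv, check=True)
    return argv


# Hypothetical names, shown as a dry run.
print(" ".join(force_reissue("destination-cluster", "mimir", "grafana-client-tls")))
```

Keeping the argv builder separate from the call that executes it makes the destructive step easy to log and review before anything is deleted.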

After the PR merges, Grafana still has to be restarted to pick up the new cert. We rely on the natural restart cadence (node preemptions, GKE auto-upgrades, unrelated Deployment reconciles), which lands well inside the 30-day buffer the task triggers on.

The result is that the workflow shows up in the same place every other piece of operational automation in this workspace lives — askable, runnable, and unsurprising. Asking the assistant for it returns the tasks themselves:

[Image: Workspace Chat with Eager Edgar selected, answering "What tasks can be run to rotate grafana tls certificates?" by listing four runnable Grafana TLS Sync tasks with checkboxes and a Run Selected Tasks button.]

Selecting one and running it returns the result inline:

[Image: Workspace Chat with Eager Edgar selected, summarizing the output from running one of the tasks.]

Three layers of “I fixed it but it’s still broken” is the price we paid for not having this task. The next time the CA rotates we’ll see the PR before we see the alert — and the lesson, instead of living in a Slack thread someone has to remember to read, runs every morning. That’s the version of operational learning we want: not a wiki page, not a Slack pin, but a task that fires whether anyone is watching or not.

If your stack has a similar gotcha — a piece of provisioned config that needs a nudge after a rotation, a credential whose drift only shows up two layers downstream, a manual step that everyone agrees should be automated “next quarter” — the prompt above is a reasonable starting point. Point an MCP-connected agent at your workspace, describe the check, and let the same surface that triages your alerts author the fix.