
Hep SRE AI

It was 03:00 A.M. I was asleep, wandering through lovely green fields somewhere else; I didn't even know what time it was… After a while, my phone started ringing with a random number from our alerting system.

image7

Then your eyes open, all the lovely green fields turn into iTerm2 and Slack screens, and a new adventure is about to start. You connect to the VPN and check the alert history, runbooks, release pins, recent changes, and so on. Within the first 30 minutes you figure out the cause of yet another routine incident, then return to Slack and properly type “Hey, the incident has been resolved.”

Then, in the daily meeting, everybody calls you the sleepless “HERO” of midnight.

In summary, modern SRE teams don’t suffer from a lack of data. They suffer from too much context, scattered everywhere.

  • Metrics live in Prometheus.
  • Alerts arrive via AlertManager.
  • Logs sit somewhere else.
  • Runbooks are half-updated.
  • Slack is noisy.
  • Dashboards multiply like rabbits.

On-call today is not about fixing systems.

It’s about context assembly under pressure, while half-awake, switching between tools, tabs, and mental models. This is where most incidents lose time.

The first 15 minutes:

Every experienced SRE knows this rule: The first 15 minutes decide whether an incident is boring or catastrophic.

In those first minutes, engineers try to answer the same questions over and over again:

  • What exactly is broken?

  • Is this new or recurring?

  • Which components are involved?

  • What changed recently?

  • Where should I look first?

Answering these usually requires:

  • multiple kubectl commands, k9s screens, etc.

  • dashboard hopping

  • log digging

  • Slack archaeology

  • and a lot of guesswork

None of this is hard. It’s just slow, noisy, and error-prone when you’re tired.

This time window is critical for understanding the alerts, connecting the dots, tracing the entire flow, and building relationships between the alert details.

At the very least, you can decide whether to wake someone on the team or to ask the developers who are around. You check all the tabs, try to fix the issue, answer the calls and Slack messages, and try to collect proper information from the system. Here is the overall anatomy of an alert-resolution procedure.

 

HEPSRE AI - Mini Model

HEPSRE AI – Mini compresses the first 15 minutes of an incident into a single, opinionated starting point.

Instead of manually collecting signals, it automatically gathers:

  • Kubernetes resource state

  • pod and application logs

  • related alerts and events

  • recent changes and metadata

…and turns them into a structured summary using LLMs.
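
To make the idea concrete, here is a minimal sketch of that gathering step, assuming the official Kubernetes Python client; it only illustrates the approach and is not the actual HEPSRE AI - Mini implementation:

# A rough sketch of the signal-gathering step (not the actual HEPSRE AI - Mini code).
# Assumes the official "kubernetes" Python client and a local kubeconfig.
from kubernetes import client, config

def gather_pod_context(pod: str, namespace: str) -> dict:
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()

    # Kubernetes resource state
    pod_obj = v1.read_namespaced_pod(pod, namespace)

    # Pod and application logs (only the tail, to keep the LLM prompt small)
    logs = v1.read_namespaced_pod_log(pod, namespace, tail_lines=200)

    # Related events: restarts, OOMKills, failed probes, scheduling problems, ...
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod}"
    )

    return {
        "phase": pod_obj.status.phase,
        "restarts": [cs.restart_count for cs in (pod_obj.status.container_statuses or [])],
        "logs": logs,
        "events": [e.message for e in events.items],
    }

The collected dictionary is then flattened into a prompt, and the LLM turns it into the structured summary mentioned above.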

 

The goal is not to replace engineers or make decisions for them.

The goal is simpler:

  • reduce blind context switching

  • surface likely causes early

  • provide actionable next steps

  • and let engineers start debugging with clarity, not panic

HEPSRE AI – Mini acts as a context layer between your alerting system and the human on-call engineer.

It removes noise, not responsibility.

To save time in this 15-minute window and to comment on all the evidence from the system, we developed a new tool called HEPSRE AI - Mini. It is, basically, the mini version of our SRE AI systems.

This tool lets you easily integrate with AlertManager from the kube-prometheus-stack and visualize the findings via a simple UI. Apart from that, you can use it straight from your terminal as a kubectl plugin.
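
As a quick reminder of how kubectl plugins work: kubectl resolves "kubectl hepsre" to any executable named kubectl-hepsre on your PATH. A hypothetical entry point for such a plugin, written here as a Python sketch rather than the project's actual source, could look roughly like this:

#!/usr/bin/env python3
# Hypothetical entry point, saved as an executable named kubectl-hepsre on PATH
# so that "kubectl hepsre ..." resolves to it.
import argparse

from hepsre_sketch import gather_pod_context, summarize_with_llm  # hypothetical module holding the other sketches

def main() -> None:
    parser = argparse.ArgumentParser(prog="kubectl hepsre")
    # Single-dash long options, matching the invocation shown in the next section
    parser.add_argument("-pod", required=True, help="Pod to analyze")
    parser.add_argument("-namespace", default="default", help="Namespace of the pod")
    args = parser.parse_args()

    context = gather_pod_context(args.pod, args.namespace)
    print(summarize_with_llm(context))

if __name__ == "__main__":
    main()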

 

Show Time:

Let’s take a look at how you can save time with HEPSRE AI - Mini.

Suppose your production system has turned into alerting hell:


Now you have to collect plenty of events, logs, and other details from your cluster and combine them in your knowledge base.

$ kubectl hepsre -pod nginx-pod -namespace default

The kubectl hepsre plugin collects all of these details and comments on them via Anthropic's Claude Sonnet 4.5 model.
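
Conceptually, the “comment” step is a single model call over the gathered context. Here is a minimal sketch using the Anthropic Python SDK; the prompt and the model identifier are illustrative assumptions, not necessarily what the plugin ships with:

# Sketch of the summarization step, assuming the "anthropic" Python SDK and an
# ANTHROPIC_API_KEY in the environment. The prompt and model ID are illustrative.
import anthropic

def summarize_with_llm(context: dict) -> str:
    llm = anthropic.Anthropic()
    prompt = (
        "You are an SRE assistant. Given the Kubernetes context below, summarize "
        "what is broken, the most likely root cause, and concrete next steps.\n\n"
        f"{context}"
    )
    response = llm.messages.create(
        model="claude-sonnet-4-5",  # assumed identifier for Claude Sonnet 4.5
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text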

Here is a sample output from your terminal.

It fetches all of the relevant outputs…

image1

…and then provides a detailed answer for you.

image4

After it collects all the details from the Kubernetes cluster, you can see its recommendations in the following section, like this:

image5

It also fetches logs and helps you debug application-level issues.

Apart from the kubectl plugin, HEPSRE AI - Mini also provides a daemon-based mode. This mode allows you to integrate the LLM directly with AlertManager.

Here is an overview of the AlertManager integration.

image3

This approach lets the LLM start analyzing alerts even before you wake up and reduces context switching during on-call.
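
Under the hood, this kind of integration is essentially an AlertManager webhook receiver. Below is a rough sketch of such a daemon, assuming Flask and the standard AlertManager webhook payload (alerts delivered under the "alerts" key); gather_pod_context and summarize_with_llm come from the earlier sketches, store_finding is a hypothetical persistence helper shown in the next sketch, and the actual HEPSRE AI - Mini daemon may be structured differently:

# Rough sketch of a webhook receiver for AlertManager, assuming Flask.
# AlertManager would be configured with a webhook receiver pointing at /alerts.
from flask import Flask, request

from hepsre_sketch import gather_pod_context, summarize_with_llm, store_finding  # hypothetical helpers from the other sketches

app = Flask(__name__)

@app.post("/alerts")
def handle_alerts():
    payload = request.get_json(force=True)
    # The standard AlertManager webhook payload delivers firing/resolved alerts
    # under the "alerts" key, each carrying labels and annotations.
    for alert in payload.get("alerts", []):
        pod = alert["labels"].get("pod")
        namespace = alert["labels"].get("namespace", "default")
        if pod:
            context = gather_pod_context(pod, namespace)
            finding = summarize_with_llm(context)
            store_finding(alert, finding)  # persisted for the UI and alert history
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)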

This project also provides a basic way to track the alert history and the findings produced by the LLM models.
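
Keeping that history does not need anything fancy. Here is a minimal sketch of the hypothetical store_finding helper used by the daemon sketch above, backed by SQLite; the real project may use a completely different storage backend:

# Minimal sketch of persisting alerts and LLM findings, assuming SQLite.
# The real project may store incident history differently; this only illustrates the idea.
import json
import sqlite3
from datetime import datetime, timezone

DB_PATH = "hepsre_history.db"

def store_finding(alert: dict, finding: str) -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents (ts TEXT, alertname TEXT, labels TEXT, finding TEXT)"
    )
    conn.execute(
        "INSERT INTO incidents VALUES (?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            alert["labels"].get("alertname", "unknown"),
            json.dumps(alert["labels"]),
            finding,
        ),
    )
    conn.commit()
    conn.close()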

Here is an overview of the HEPSRE AI - Mini project.

image2

You can also check the details of all the incidents:

image9

You can check the details of each incident and reuse this information for post-mortem write-ups, your knowledge base, or maybe the next incident :D

Conclusion:

This blog post walks through the details of the HEPSRE AI - Mini project. As HEPAPI, we would like to share our experience with fellow engineers, make their lives easier, and keep the alerting channels a little quieter. This project is not a replacement for an engineer, but it might clear the dust and snow off the road ahead. It is a simple gift to the OSS universe.

May the force be with you <3