⎈ A Hands-On Guide: Setting Up Prometheus and AlertManager in Kubernetes with Custom Alerts 🛠️
⇢ Understanding Prometheus & AlertManager Setup in Kubernetes with Custom Rules: A Comprehensive Guide
Monitoring your Kubernetes cluster is crucial for maintaining the health and performance of your applications. In this guide, we’ll walk through setting up Prometheus and Alertmanager using Helm and configuring custom alert rules to monitor your cluster effectively. If you haven’t already, I recommend checking out my previous blog post on Kubernetes monitoring using Prometheus and Grafana for a comprehensive overview of setting up Prometheus and Grafana.
Prerequisites
Before we start, ensure you have the following:
- A running Kubernetes cluster.
- Helm installed on your local machine.
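You can quickly confirm both prerequisites from your terminal, for example:
kubectl get nodes
helm version
If both commands succeed, you are ready to proceed.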
Step 1: Install Prometheus and Alertmanager using Helm
We’ll use the kube-prometheus-stack Helm chart from the Prometheus community. This chart includes Prometheus, Alertmanager, and Grafana, along with several pre-configured dashboards and alerting rules.
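If you have not added the prometheus-community chart repository yet, add it and update your local chart index (the standard commands for this repository):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update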
First, create a custom-values.yaml file to specify our custom configurations:
# custom-values.yaml
prometheus:
  service:
    type: NodePort
grafana:
  service:
    type: NodePort
alertmanager:
  service:
    type: NodePort
Next, install the kube-prometheus-stack using Helm:
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f custom-values.yaml
This command will deploy Prometheus, Alertmanager, and Grafana to your cluster with the services exposed as NodePort.
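Before opening any UIs, it is worth confirming that everything started. Assuming you installed into the current (default) namespace, as in the command above:
kubectl get pods
You should see pods for the Prometheus operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and the node exporter, all eventually reaching the Running state.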
Step 2: Verifying the Setup
To verify that Prometheus and Alertmanager are running correctly, you can access their web UIs. Since we exposed their services as NodePort, you can either use kubectl port-forward to access them locally or use the external IP of a cluster node together with the NodePort of the respective service.
For Prometheus:
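A minimal port-forward example, assuming the default service name created for the release name kube-prometheus-stack (run kubectl get svc to confirm the exact names in your cluster):
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090
Then open http://localhost:9090 in your browser.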
For Alertmanager:
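Similarly, assuming the default service name:
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
Then open http://localhost:9093.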
For Grafana:
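And for Grafana, whose service listens on port 80:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80
Then open http://localhost:3000. Unless you overrode it in your values, the chart's default Grafana login is admin / prom-operator.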
Access the default alerting rules:
To see the default rules and their current state, navigate to the Alerts section of the Prometheus UI:
Here we can see that three alerts are in the Firing state; these firing alerts are sent to Alertmanager, where they can be viewed and managed in its UI:
Step 3: Configuring Custom Alert Rules
The steps above show the default alerts that ship with the chart, already wired into Prometheus and Alertmanager. Now, let’s add custom alert rules to monitor our Kubernetes cluster. We’ll create a PrometheusRule manifest to define these alerts.
Create a file named custom-alert-rules.yaml with the following content:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    release: kube-prometheus-stack
  name: kube-pod-not-ready
spec:
  groups:
  - name: my-pod-demo-rules
    rules:
    - alert: KubernetesPodNotHealthy
      expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesDaemonsetRolloutStuck
      expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
        description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: ContainerHighCpuUtilization
      expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High CPU utilization (instance {{ $labels.instance }})
        description: "Container CPU utilization is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: ContainerHighMemoryUsage
      expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High Memory usage (instance {{ $labels.instance }})
        description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesContainerOomKiller
      expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
        description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesPodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Apply the manifest to your Kubernetes cluster:
kubectl apply -f custom-alert-rules.yaml
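You can confirm that the rule object was created (in the current namespace, since the manifest does not specify one):
kubectl get prometheusrule kube-pod-not-ready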
Once the PrometheusRule is created, check the newly created alerts in the Prometheus UI. Note that the release: kube-prometheus-stack label on the manifest matters: by default, the Prometheus instance deployed by the chart only discovers PrometheusRule objects whose labels match its rule selector.
That’s it! We have successfully added our custom alert rules; once any of them start firing, they will also show up in Alertmanager.
Step 4: Testing the Custom Rules
To ensure our custom alert rules are working correctly, we’ll simulate a failure by creating a pod with an incorrect image tag. This will help us verify that the alerts are triggered and properly reported in Alertmanager. The KubernetesPodNotHealthy alert is the one that should fire for this scenario.
- Create a Pod with an Invalid Image
This will simulate a failure by using an incorrect image tag:
kubectl run nginx-pod --image=nginx:lates3
Note: The correct tag is latest, so lates3 is intentionally incorrect to cause the pod to fail.
- Verify the Pod Status
Check the status of the pod to confirm that it is failing:
kubectl get pods nginx-pod
NAME        READY   STATUS             RESTARTS   AGE
nginx-pod   0/1     ImagePullBackOff   0          5m35s
You should see the pod in an ErrImagePull or ImagePullBackOff state. You can also describe the pod for more details:
kubectl describe pod nginx-pod
This will provide information about why the pod is failing.
- Check for Alerts in Alertmanager
Since you have set up custom alert rules, the failing pod should trigger an alert. Open the Alertmanager UI and look for alerts related to pod failures; the KubernetesPodNotHealthy alert you configured should appear there once its for duration (1 minute) has elapsed.
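You can also check from the Prometheus side: Prometheus exposes the state of every alert through its built-in ALERTS metric, so a query like the following in the Prometheus expression browser should return a series for the failing pod once the rule is pending or firing:
ALERTS{alertname="KubernetesPodNotHealthy"}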
This process ensures that your custom alerting rules are working correctly and that you are notified when a pod fails.
Step 5: Understanding Custom Alert Rules
To better understand how to create and customize alert rules, let’s break down one of the alert rules defined in our custom-alert-rules.yaml. We'll use the KubernetesPodNotHealthy alert as an example:
- alert: KubernetesPodNotHealthy
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Alert Fields Breakdown
- alert: The name of the alert (KubernetesPodNotHealthy).
- expr: The Prometheus expression to evaluate. This alert triggers if any pod in a Pending, Unknown, or Failed state is detected.
- for: The duration for which the condition must remain true before the alert fires (1m, i.e. 1 minute).
- labels: Additional labels to categorize the alert. In this case, we label it with a severity of critical.
- annotations: Descriptive information about the alert. These fields provide context when the alert is triggered:
  - summary: A brief description of the alert (Kubernetes Pod not healthy (instance {{ $labels.instance }})).
  - description: A detailed description that includes dynamic values from the alert labels (Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}).
These fields help to provide clarity and context when an alert is triggered, making it easier to diagnose and respond to issues in your cluster.
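For example, when the nginx-pod from Step 4 fails in the default namespace, the description annotation would render roughly as follows (exact label formatting may vary):
Pod default/nginx-pod has been in a non-running state for longer than 1 minute.
VALUE = 1
LABELS = map[namespace:default pod:nginx-pod]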
For more examples of custom Prometheus alert rules, you can refer to this Awesome Prometheus Alerts repository.
Step 6: Cleanup
If you want to remove Prometheus, Alertmanager, and Grafana from your Kubernetes cluster, you can do so with the following commands:
- Uninstall the Helm Chart:
helm uninstall kube-prometheus-stack
- Verify Resources Are Deleted:
Check that the Prometheus, AlertManager, and Grafana resources have been removed:
kubectl get all -l release=kube-prometheus-stack
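Also note that the test pod and the custom PrometheusRule created earlier are not part of the Helm release, so remove them separately (names as used in this guide):
kubectl delete pod nginx-pod
kubectl delete -f custom-alert-rules.yaml
Helm also leaves the chart's CRDs (such as prometheusrules.monitoring.coreos.com) in place on uninstall; delete them manually only if nothing else in the cluster depends on them.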
Conclusion
In this guide, we have successfully set up Prometheus and Alertmanager in a Kubernetes cluster using Helm and configured custom alert rules to monitor the cluster’s health. We also explored the components of an alert rule to better understand how they work. This setup provides a robust monitoring solution that can be further extended and customized to suit your needs. For more examples of custom Prometheus alert rules, you can refer to this Awesome Prometheus Alerts repository.