Tuesday, 18 February 2025

A Production-Ready Guide: Mastering Metrics Collection with Prometheus and Grafana

In modern infrastructure and application management, visibility into system performance is non-negotiable. Prometheus and Grafana form a powerhouse duo for metrics collection, storage, and visualization. This guide provides a production-ready walkthrough of setting up Prometheus and Grafana, complete with security best practices, service management, and advanced configurations. Whether you’re monitoring a single server or a distributed system, this guide will equip you with actionable steps to build a robust monitoring stack.

Table of Contents

  1. Understanding the Tools

    • What is Prometheus?
    • What is Grafana?
    • How They Work Together
  2. Setting Up Prometheus

    • Installing Prometheus
    • Configuring Scrape Targets
    • Running as a Systemd Service
  3. Monitoring Host Metrics with Node Exporter

    • Installing Node Exporter
    • Systemd Service Setup
    • Verifying Metrics
  4. Installing Grafana Securely

    • Adding the Grafana Repository
    • HTTPS Configuration with Let’s Encrypt
    • Enabling Authentication
  5. Integrating Prometheus with Grafana

    • Adding Prometheus as a Data Source
    • Building a Dashboard: CPU, Memory, Disk, and Network
  6. Advanced Configurations

    • Setting Up Alerts in Grafana
    • Remote Storage with Thanos
    • Dynamic Service Discovery
  7. Security Best Practices

    • Securing Prometheus with Reverse Proxies
    • Grafana User Permissions
  8. Troubleshooting Common Issues

    • Firewall and Port Conflicts
    • Permission Errors

1. Understanding the Tools

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It pulls metrics from targets at configured intervals, stores them in a time-series database, and allows querying via PromQL. Its multi-dimensional data model (metric names plus key-value labels) suits dynamic environments such as Kubernetes.

What is Grafana?

Grafana is a visualization platform that turns metrics into interactive dashboards. It supports 50+ data sources, including Prometheus, and offers features like alerts, annotations, and plugins.

How They Work Together

Prometheus collects and stores metrics, while Grafana queries and visualizes them. Together, they provide end-to-end monitoring for infrastructure, applications, and services.

2. Setting Up Prometheus

Step 1: Download and Install

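# Check https://prometheus.io/download/ for the latest release; v2.47.0 is used here.
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz

tar xvfz prometheus-2.47.0.linux-amd64.tar.gz
# The systemd unit in Step 3 expects the files under /opt/prometheus.
sudo mv prometheus-2.47.0.linux-amd64 /opt/prometheus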

Step 2: Configure Scrape Targets

Edit prometheus.yml to monitor Node Exporter:

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
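
Before restarting Prometheus, it can help to validate the file. The sketch below writes the same scrape job to a scratch directory and checks it with promtool, which ships in the Prometheus release tarball. The scratch location, the CONFIG_FILE variable, and the 15s scrape_interval are illustrative assumptions, not values from this guide.

```shell
# Write a minimal config to a scratch directory (hypothetical location, not /opt/prometheus).
WORK_DIR=$(mktemp -d)
CONFIG_FILE="$WORK_DIR/prometheus.yml"

cat > "$CONFIG_FILE" <<'EOF'
global:
  scrape_interval: 15s   # assumed interval; tune for your environment

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
EOF

# Validate the file if promtool is on PATH; otherwise just report where it was written.
if command -v promtool >/dev/null 2>&1; then
  promtool check config "$CONFIG_FILE"
else
  echo "promtool not found; skipping validation of $CONFIG_FILE"
fi
```

Once the file passes, copy it to the path given in --config.file.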

Step 3: Run as a Systemd Service

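Create /etc/systemd/system/prometheus.service. The unit runs as a dedicated prometheus user, which must already exist and own /opt/prometheus (see Permission Errors in Section 8):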

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
Restart=always

[Install]
WantedBy=multi-user.target

Reload and start:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus

Step 4: Verify

Open http://localhost:9090/targets. Each target should show an UP state; the node target stays DOWN until Node Exporter is running (Section 3).
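
The same check can be scripted against the Prometheus HTTP API, whose /api/v1/targets endpoint reports a health field for every active target. The helper below is a rough sketch: count_target_health is a made-up name, and it counts matches with grep rather than parsing the JSON properly, which is good enough for a quick smoke test but not for anything stricter.

```shell
# Count healthy and unhealthy targets in a /api/v1/targets response read from stdin.
# The body runs in a subshell so the temporary variables do not leak.
count_target_health() (
  body=$(cat)
  up=$(printf '%s' "$body" | grep -o '"health":"up"' | wc -l)
  down=$(printf '%s' "$body" | grep -o '"health":"down"' | wc -l)
  # $(( )) strips any padding that wc adds on some platforms.
  echo "up=$((up)) down=$((down))"
)

# Typical use against a local Prometheus (not run here):
#   curl -s http://localhost:9090/api/v1/targets | count_target_health
```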

3. Monitoring Host Metrics with Node Exporter

Step 1: Install Node Exporter

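wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz

tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
# The systemd unit in Step 2 expects the binary under /opt/node_exporter.
sudo mv node_exporter-1.6.1.linux-amd64 /opt/node_exporter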

Step 2: Systemd Service Setup

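Create /etc/systemd/system/node_exporter.service. The unit runs as a node_exporter user, which must exist first (for example, sudo useradd --no-create-home --shell /bin/false node_exporter):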

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/opt/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

Reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Step 3: Verify Metrics

Visit http://localhost:9100/metrics to see raw metrics (e.g., node_cpu_seconds_total).
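
The /metrics page is long, so filtering it on the command line is often quicker than scrolling. Here is a small sketch of a filter for the Prometheus text exposition format; metric_samples is a hypothetical helper name, and it assumes the metric name contains no regex metacharacters.

```shell
# Print the sample lines for one metric from text exposition input on stdin.
# Anchoring on the name followed by "{" or a space skips the "# HELP" / "# TYPE"
# comment lines and metrics that merely share a prefix.
metric_samples() {
  grep -E "^$1(\{| )"
}

# Typical use against a local Node Exporter (not run here):
#   curl -s http://localhost:9100/metrics | metric_samples node_cpu_seconds_total
```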

4. Installing Grafana Securely

Step 1: Add Grafana Repository

sudo apt-get install -y apt-transport-https curl gnupg
# Import the signing key first so apt can verify the repository.
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://apt.grafana.com/gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana

Step 2: Enable HTTPS with Let’s Encrypt

Install Certbot:

sudo apt-get install certbot python3-certbot-nginx

Obtain a certificate:

sudo certbot --nginx -d grafana.yourdomain.com

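Update the [server] section of /etc/grafana/grafana.ini so Grafana serves the certificate itself. The grafana user must be able to read both files, and /etc/letsencrypt/live is restricted to root by default: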

[server]
protocol = https
cert_file = /etc/letsencrypt/live/grafana.yourdomain.com/fullchain.pem
cert_key = /etc/letsencrypt/live/grafana.yourdomain.com/privkey.pem

Step 3: Configure Authentication

In Grafana’s UI:

  1. Navigate to Admin > Authentication > Generic OAuth.
  2. Set up OAuth2 with Google/GitHub or enable built-in login.

5. Integrating Prometheus with Grafana

Step 1: Add Prometheus as a Data Source

  1. In Grafana, go to Configuration > Data Sources > Add data source (Connections > Data sources in Grafana 10 and later).
  2. Select Prometheus.
  3. Set URL to http://localhost:9090 (or http://prometheus:9090 if using DNS).

Step 2: Create a Dashboard

CPU Usage:

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage:

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Disk Usage:

100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})

Network Traffic:

sum(rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])) by (device)
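
To make the arithmetic behind the CPU and memory queries concrete, the sketch below applies the same formulas to hand-picked numbers. The function names and sample values are invented for illustration, and the CPU case covers a single core; Prometheus evaluates rate() itself, with counter-reset handling and extrapolation that this sketch ignores.

```shell
# CPU usage (%) for one core from two readings of its idle counter.
# Arguments: first idle reading (s), last idle reading (s), seconds between readings.
cpu_usage_percent() {
  awk -v idle_first="$1" -v idle_last="$2" -v window="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((idle_last - idle_first) / window) * 100 }'
}

# Memory usage (%) from MemAvailable and MemTotal, both in bytes.
memory_usage_percent() {
  awk -v avail="$1" -v total="$2" \
    'BEGIN { printf "%.1f\n", 100 * (1 - avail / total) }'
}

# Idle counter grew by 270 s over a 300 s window: the core was idle 90% of the time.
cpu_usage_percent 1000 1270 300             # prints 10.0

# 1 GB available out of 4 GB total.
memory_usage_percent 1000000000 4000000000  # prints 75.0
```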

6. Advanced Configurations

Alerts in Grafana

  1. Edit a panel and navigate to the Alert tab.
  2. Set conditions (e.g., CPU Usage > 90%).
  3. Configure notifications via email, Slack, or PagerDuty.

Remote Storage with Thanos

To extend Prometheus’ retention:

  1. Deploy Thanos Sidecar alongside Prometheus.
  2. Configure object storage (e.g., AWS S3) in a bucket configuration file and pass it to the sidecar with --objstore.config-file:
type: S3
config:
  bucket: "thanos-storage"
  endpoint: "s3.amazonaws.com"
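
As a rough sketch of how the pieces fit together, the script below writes a bucket file in the flat type/config layout that Thanos reads through --objstore.config-file, then prints a matching sidecar command instead of running it. The scratch path and BUCKET_FILE variable are assumptions, the TSDB path and Prometheus URL are taken from earlier sections, and credentials are deliberately left out (supply them through your usual AWS mechanism).

```shell
# Write the bucket configuration to a scratch directory (hypothetical location).
WORK_DIR=$(mktemp -d)
BUCKET_FILE="$WORK_DIR/bucket.yml"

cat > "$BUCKET_FILE" <<'EOF'
type: S3
config:
  bucket: "thanos-storage"
  endpoint: "s3.amazonaws.com"
EOF

# Print, rather than execute, the sidecar invocation that would use this file.
echo thanos sidecar \
  --tsdb.path=/opt/prometheus/data \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file="$BUCKET_FILE"
```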

Dynamic Service Discovery

Use Kubernetes SD or Consul in prometheus.yml:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

7. Security Best Practices

Securing Prometheus

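  1. Use Nginx as a reverse proxy with HTTPS and basic auth. Bind Prometheus to loopback (--web.listen-address=127.0.0.1:9090) so Nginx is the only public entry point and the two do not compete for port 9090:
server {
    listen 443 ssl;
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
    }
}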
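
The auth_basic_user_file referenced above needs at least one user entry. The helper below is one possible way to produce it, assuming the openssl command-line tool is installed; htpasswd_line is a hypothetical name, and the htpasswd utility from apache2-utils is a common alternative.

```shell
# Emit one "user:hash" line using the APR1 (Apache MD5) scheme, which Nginx accepts.
# The password is read from stdin so it does not show up in the process list.
htpasswd_line() (
  user=$1
  hash=$(openssl passwd -apr1 -stdin) || exit 1
  printf '%s:%s\n' "$user" "$hash"
)

# Typical use (not run here); PROM_PASSWORD is a placeholder variable:
#   printf '%s\n' "$PROM_PASSWORD" | htpasswd_line admin | sudo tee /etc/nginx/.htpasswd
```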

Grafana Permissions

  • Use role-based access control (RBAC) to limit dashboard access.
  • Disable anonymous login in grafana.ini.
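
Anonymous access is controlled by the enabled key in the [auth.anonymous] section of grafana.ini. One way to audit it is a small check like the following; anonymous_enabled is a made-up helper, and the path in the usage comment assumes the Debian package layout.

```shell
# Exit 0 if the given ini file has "enabled = true" under [auth.anonymous], 1 otherwise.
# Commented-out lines (starting with ";" or "#") are ignored because the pattern
# requires the key to be the first token on the line.
anonymous_enabled() {
  awk '
    /^[ \t]*\[/ { section = $0; gsub(/[ \t]/, "", section) }
    section == "[auth.anonymous]" && /^[ \t]*enabled[ \t]*=[ \t]*true/ { found = 1 }
    END { exit !found }
  ' "$1"
}

# Typical use (not run here):
#   if anonymous_enabled /etc/grafana/grafana.ini; then
#     echo "anonymous access is ON; set enabled = false under [auth.anonymous]"
#   fi
```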

8. Troubleshooting

Firewall Issues

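Ensure the ports that other hosts need to reach are open. Restrict 9090 and 9100 to trusted sources where possible, since Prometheus and Node Exporter serve unauthenticated HTTP by default: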

sudo ufw allow 9090/tcp  # Prometheus
sudo ufw allow 3000/tcp  # Grafana
sudo ufw allow 9100/tcp  # Node Exporter

Permission Errors

Run services under dedicated users:

sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus

You’ve now built a secure, scalable monitoring stack with Prometheus and Grafana. From basic metric collection to advanced alerting and dynamic discovery, this setup is ready for production workloads. To deepen your expertise:

  • Explore Loki, Grafana Labs' log aggregation system, as an additional data source for logs.
  • Integrate Alertmanager for centralized alert routing.
  • Experiment with custom exporters for application-specific metrics.

With these tools, you’re equipped to turn raw metrics into actionable insights, ensuring system reliability and performance.

 
