# 12 — Monitoring and Alerting
---
## Monitoring stack
```
Prometheus    ← Scrapes metrics (pods, nodes, app)
Grafana       ← Visual dashboards
Loki          ← Log aggregation (NestJS pino)
Alertmanager  ← Alert delivery (email, Slack)
Uptime Kuma   ← External HTTP monitoring (health checks)
```
---
## Installing kube-prometheus-stack
The most complete stack, deployed with Helm:
```bash
# Add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 65.3.1 \
  --set grafana.adminPassword="<STRONG_PASSWORD>" \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=2Gi \
  --set grafana.persistence.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=2Gi \
  --set prometheusOperator.resources.requests.cpu=50m \
  --set prometheusOperator.resources.requests.memory=128Mi \
  --set prometheus.prometheusSpec.resources.requests.cpu=100m \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set grafana.resources.requests.cpu=50m \
  --set grafana.resources.requests.memory=128Mi

# Wait until everything is Running
kubectl rollout status deployment/prometheus-grafana -n monitoring --timeout=300s
kubectl get pods -n monitoring
```
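Before wiring up the Ingress below, you can sanity-check Grafana through a port-forward (the `prometheus-grafana` service name follows from the Helm release name used above):
```bash
# Temporary local access to Grafana: http://localhost:3000, user "admin"
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```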
---
## Exposing Grafana via Ingress
```bash
cat > /tmp/grafana-ingress.yaml << 'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
    # Restrict to team IPs
    traefik.ingress.kubernetes.io/router.middlewares: "monitoring-ipwhitelist@kubernetescrd"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - monitoring.xpeditis.com
      secretName: monitoring-tls
  rules:
    - host: monitoring.xpeditis.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana
                port:
                  number: 80
---
# IP whitelist for Grafana (your team only)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ipwhitelist
  namespace: monitoring
spec:
  ipWhiteList:
    sourceRange:
      - "<YOUR_IP>/32"
      - "10.0.0.0/16"  # Hetzner internal network
EOF
kubectl apply -f /tmp/grafana-ingress.yaml
```
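A quick way to verify the whitelist, assuming DNS for monitoring.xpeditis.com already points at your load balancer:
```bash
# From a whitelisted IP: expect HTTP 200 (Grafana login page)
# From anywhere else: Traefik should answer 403 Forbidden
curl -sI https://monitoring.xpeditis.com | head -1
```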
---
## Installing Loki (log aggregation)
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=5Gi \
  --set loki.persistence.storageClassName=hcloud-volumes \
  --set promtail.enabled=true \
  --set loki.config.limits_config.retention_period=7d \
  --set grafana.enabled=false  # We reuse the Grafana already installed

# Add Loki as a datasource in Grafana:
# Grafana → Data Sources → Add → Loki
# URL: http://loki:3100
```
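Instead of clicking through the UI, the datasource can be provisioned declaratively: by default, the Grafana deployed by kube-prometheus-stack runs a sidecar that loads any ConfigMap labeled `grafana_datasource: "1"`. A minimal sketch:
```bash
cat > /tmp/loki-datasource.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # picked up by the Grafana sidecar
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        access: proxy
EOF
kubectl apply -f /tmp/loki-datasource.yaml
```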
---
## Configuring alerts
### Xpeditis-specific alerts
```bash
cat > /tmp/xpeditis-alerts.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xpeditis-alerts
  namespace: xpeditis-prod
  labels:
    release: prometheus
spec:
  groups:
    - name: xpeditis.backend
      interval: 30s
      rules:
        # Backend down
        - alert: XpeditisBackendDown
          expr: up{job="xpeditis-backend"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Xpeditis backend unavailable"
            description: "No backend pod has responded for 1 minute."
        # Too few replicas
        - alert: XpeditisBackendLowReplicas
          expr: kube_deployment_status_replicas_available{deployment="xpeditis-backend",namespace="xpeditis-prod"} < 1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 1 backend replica available"
        # High CPU (autoscaling likely to trigger)
        - alert: XpeditisHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              namespace="xpeditis-prod",
              container="backend"
            }[5m])) by (pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on pod {{ $labels.pod }}"
            description: "CPU usage > 80% for 5 minutes."
        # High memory
        - alert: XpeditisHighMemory
          expr: |
            container_memory_usage_bytes{
              namespace="xpeditis-prod",
              container="backend"
            } / container_spec_memory_limit_bytes{
              namespace="xpeditis-prod",
              container="backend"
            } > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on pod {{ $labels.pod }}"
        # High HTTP error rate
        - alert: XpeditisHighErrorRate
          expr: |
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*",
              code=~"5.."
            }[5m])) /
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*"
            }[5m])) > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "5xx error rate > 5% on the backend API"
        # Pods in CrashLoopBackOff
        - alert: XpeditisPodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="xpeditis-prod"
            }[1h]) > 5
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting too often"
    # No database group: Neon is managed, so no direct DB alerts (add rules only if self-hosted)
    - name: xpeditis.redis
      rules:
        # Redis memory high
        - alert: RedisHighMemory
          expr: |
            redis_memory_used_bytes /
            redis_memory_max_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis is using > 85% of its memory"
EOF
kubectl apply -f /tmp/xpeditis-alerts.yaml
```
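To confirm Prometheus actually loaded the rules (the `prometheus-operated` service is created automatically by the operator):
```bash
# The PrometheusRule must exist and carry the "release: prometheus" label
kubectl get prometheusrule xpeditis-alerts -n xpeditis-prod
# Then check Status → Rules in the Prometheus UI
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
```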
### Alertmanager configuration (Slack)
```bash
cat > /tmp/alertmanager-config.yaml << 'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: '<SLACK_WEBHOOK_URL>'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'slack-critical'
        - match:
            severity: warning
          receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#xpeditis-monitoring'
            icon_url: https://avatars.githubusercontent.com/u/3380462
            title: '{{ template "slack.default.title" . }}'
            text: '{{ template "slack.default.text" . }}'
            send_resolved: true
      - name: 'slack-critical'
        slack_configs:
          - channel: '#xpeditis-alerts-critiques'
            color: 'danger'
            title: '🚨 CRITICAL ALERT: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
            send_resolved: true
EOF
kubectl apply -f /tmp/alertmanager-config.yaml
```
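To test the Slack pipeline end to end, you can inject a synthetic alert through the Alertmanager v2 API (the `alertmanager-operated` service is created automatically by the operator):
```bash
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring &
# Fire a fake warning; it should show up in #xpeditis-monitoring
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","namespace":"xpeditis-prod"},"annotations":{"summary":"Test alert, please ignore"}}]'
```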
---
## Recommended Grafana dashboards
Import these dashboards from grafana.com (enter the ID in Grafana → Import):
| Dashboard | ID | Purpose |
|---|---|---|
| Kubernetes Cluster Overview | 6417 | Cluster-wide overview |
| Kubernetes Deployments | 8588 | Deployment details |
| Node Exporter Full | 1860 | Node system metrics |
| Loki & Promtail | 12611 | Aggregated logs |
| Traefik 2 | 4475 | Ingress/request metrics |
```bash
# In Grafana (https://monitoring.xpeditis.com)
# → + → Import
# → Enter the ID and click "Load"
# → Select the Prometheus datasource
# → Import
```
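Dashboards can also be provisioned declaratively: the same Grafana sidecar mechanism loads ConfigMaps labeled `grafana_dashboard: "1"` (a kube-prometheus-stack default). A sketch using grafana.com's download endpoint, assuming it still serves this URL shape:
```bash
# Fetch the Node Exporter Full dashboard JSON and ship it as a ConfigMap
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o /tmp/node-exporter-full.json
kubectl create configmap node-exporter-dashboard -n monitoring \
  --from-file=/tmp/node-exporter-full.json
kubectl label configmap node-exporter-dashboard -n monitoring grafana_dashboard=1
```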
---
## Uptime Kuma (external monitoring)
Uptime Kuma probes your endpoints through their public URLs, independently of Prometheus (it runs in the cluster, but its checks traverse the external ingress path):
```bash
# Deploy Uptime Kuma in the cluster
cat > /tmp/uptime-kuma.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: uptime-kuma
  template:
    metadata:
      labels:
        app: uptime-kuma
    spec:
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: uptime-kuma-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uptime-kuma-pvc
  namespace: monitoring
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: hcloud-volumes
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  selector:
    app: uptime-kuma
  ports:
    - port: 3001
      targetPort: 3001
EOF
kubectl apply -f /tmp/uptime-kuma.yaml
```
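To reach the Uptime Kuma UI, add an Ingress mirroring the Grafana one; the `status.xpeditis.com` host below is a placeholder to adapt, and the manifest reuses the same IP whitelist middleware:
```bash
cat > /tmp/uptime-kuma-ingress.yaml << 'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: uptime-kuma
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.middlewares: "monitoring-ipwhitelist@kubernetescrd"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - status.xpeditis.com
      secretName: uptime-kuma-tls
  rules:
    - host: status.xpeditis.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: uptime-kuma
                port:
                  number: 3001
EOF
kubectl apply -f /tmp/uptime-kuma-ingress.yaml
```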
Monitors to configure in Uptime Kuma (for the login monitor, see the sketch after this table):
| Monitor | URL | Interval |
|---|---|---|
| API Health | `https://api.xpeditis.com/api/v1/health` | 1 min |
| Frontend | `https://app.xpeditis.com/` | 1 min |
| API Login | `POST https://api.xpeditis.com/api/v1/auth/login` | 5 min |
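The login monitor checks liveness of the auth endpoint, not authentication: with a dummy payload it should return a 4xx, never a 5xx, so configure Uptime Kuma's accepted status codes accordingly. A rough curl equivalent (the payload shape is an assumption; adapt it to your auth DTO):
```bash
# Expect 400/401 (endpoint alive), never a 5xx (endpoint broken)
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST https://api.xpeditis.com/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"probe@example.com","password":"invalid"}'
```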
---
## Quick monitoring commands
```bash
# Top pods by CPU/RAM consumption
kubectl top pods -n xpeditis-prod --sort-by=cpu

# Recent events in the namespace
kubectl get events -n xpeditis-prod --sort-by='.lastTimestamp' | tail -20

# Live backend logs (all pods at once; requires stern)
stern xpeditis-backend -n xpeditis-prod

# Error logs only
kubectl logs -l app=xpeditis-backend -n xpeditis-prod --since=1h | grep -i error

# HPA status
kubectl get hpa -n xpeditis-prod

# Node metrics
kubectl top nodes
```
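For ad-hoc PromQL without opening Grafana, you can hit the Prometheus HTTP API directly (requires `jq` for the formatting; `prometheus-operated` is the operator-managed service):
```bash
# Port-forward Prometheus, then hit its query API
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring &
# Which scrape targets are up?
curl -s 'http://localhost:9090/api/v1/query?query=up' \
  | jq '.data.result[] | {job: .metric.job, up: .value[1]}'
```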