# 12 — Monitoring and Alerting

---

## Monitoring stack

```
Prometheus     ← scrapes metrics (pods, nodes, app)
Grafana        ← visual dashboards
Loki           ← log aggregation (NestJS pino)
Alertmanager   ← alert delivery (email, Slack)
Uptime Kuma    ← external HTTP monitoring (health checks)
```
---

## Installing kube-prometheus-stack

The most complete stack, deployed with Helm:

```bash
# Add the repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 65.3.1 \
  --set grafana.adminPassword="<STRONG_PASSWORD>" \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=2Gi \
  --set grafana.persistence.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=2Gi \
  --set prometheusOperator.resources.requests.cpu=50m \
  --set prometheusOperator.resources.requests.memory=128Mi \
  --set prometheus.prometheusSpec.resources.requests.cpu=100m \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set grafana.resources.requests.cpu=50m \
  --set grafana.resources.requests.memory=128Mi

# Wait until everything is Running
kubectl rollout status deployment/prometheus-grafana -n monitoring --timeout=300s
kubectl get pods -n monitoring
```
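A long `--set` list is hard to review and easy to mistype; the same settings can be kept in a versioned values file instead. A sketch mirroring the flags above (the file path is arbitrary):

```shell
cat > /tmp/monitoring-values.yaml << 'EOF'
grafana:
  adminPassword: "<STRONG_PASSWORD>"
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: hcloud-volumes
  resources:
    requests: {cpu: 50m, memory: 128Mi}
prometheus:
  prometheusSpec:
    retention: 7d
    resources:
      requests: {cpu: 100m, memory: 512Mi}
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes
          resources: {requests: {storage: 10Gi}}
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes
          resources: {requests: {storage: 2Gi}}
prometheusOperator:
  resources:
    requests: {cpu: 50m, memory: 128Mi}
EOF

# upgrade --install is idempotent: it installs on first run, upgrades afterwards
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --version 65.3.1 -f /tmp/monitoring-values.yaml
```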

---

## Exposing Grafana via Ingress

```bash
cat > /tmp/grafana-ingress.yaml << 'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
    # Restrict access to your team's IPs
    traefik.ingress.kubernetes.io/router.middlewares: "monitoring-ipwhitelist@kubernetescrd"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - monitoring.xpeditis.com
      secretName: monitoring-tls
  rules:
    - host: monitoring.xpeditis.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana
                port:
                  number: 80
---
# IP whitelist for Grafana (your team only)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ipwhitelist
  namespace: monitoring
spec:
  ipWhiteList:
    sourceRange:
      - "<YOUR_IP>/32"
      - "10.0.0.0/16"  # Hetzner internal network
EOF

kubectl apply -f /tmp/grafana-ingress.yaml
```
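Once applied, it is worth confirming that the certificate was issued and the whitelist behaves as expected. A quick check sketch (run the curl from a whitelisted IP):

```shell
# The Certificate should reach READY=True once the ACME challenge completes
kubectl get certificate -n monitoring monitoring-tls

# From a whitelisted IP: typically a redirect to the Grafana login page
curl -sI https://monitoring.xpeditis.com | head -5

# From any other IP, Traefik should answer 403 Forbidden
```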

---

## Installing Loki (log aggregation)

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# grafana.enabled=false: reuse the Grafana instance installed above
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=5Gi \
  --set loki.persistence.storageClassName=hcloud-volumes \
  --set promtail.enabled=true \
  --set loki.config.limits_config.retention_period=7d \
  --set grafana.enabled=false

# Add Loki as a datasource in Grafana
# Grafana → Data Sources → Add → Loki
# URL: http://loki:3100
```
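With the datasource in place, logs can be queried from Grafana Explore. A few starter LogQL queries as a sketch; the `app` label value and the pino field names (`level`, `res.statusCode` from pino-http, which Loki's `json` parser flattens to `res_statusCode`) are assumptions about how the backend is labelled and logs:

```logql
# All backend logs
{namespace="xpeditis-prod", app="xpeditis-backend"}

# Error-level pino entries only (pino encodes level 50 = error)
{namespace="xpeditis-prod", app="xpeditis-backend"} | json | level >= 50

# 5xx responses logged by the HTTP layer
{namespace="xpeditis-prod", app="xpeditis-backend"} | json | res_statusCode >= 500
```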

---

## Alert configuration

### Xpeditis-specific alerts

```bash
cat > /tmp/xpeditis-alerts.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xpeditis-alerts
  namespace: xpeditis-prod
  labels:
    release: prometheus
spec:
  groups:
    - name: xpeditis.backend
      interval: 30s
      rules:
        # Backend down
        - alert: XpeditisBackendDown
          expr: up{job="xpeditis-backend"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Xpeditis backend unavailable"
            description: "No backend pod has responded for 1 minute."

        # Too few replicas
        - alert: XpeditisBackendLowReplicas
          expr: kube_deployment_status_replicas_available{deployment="xpeditis-backend",namespace="xpeditis-prod"} < 1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 1 backend replica available"

        # High CPU (autoscaling likely to trigger)
        - alert: XpeditisHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              namespace="xpeditis-prod",
              container="backend"
            }[5m])) by (pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on pod {{ $labels.pod }}"
            description: "CPU usage > 80% for 5 minutes."

        # High memory
        - alert: XpeditisHighMemory
          expr: |
            container_memory_usage_bytes{
              namespace="xpeditis-prod",
              container="backend"
            } / container_spec_memory_limit_bytes{
              namespace="xpeditis-prod",
              container="backend"
            } > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on pod {{ $labels.pod }}"

        # High HTTP error rate
        - alert: XpeditisHighErrorRate
          expr: |
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*",
              code=~"5.."
            }[5m])) /
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*"
            }[5m])) > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "5xx error rate > 5% on the backend API"

        # Pods in CrashLoopBackOff
        - alert: XpeditisPodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="xpeditis-prod"
            }[1h]) > 5
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting too often"

    # No xpeditis.database group: Neon is managed, so there is nothing to
    # alert on directly; add a group only if you self-host PostgreSQL.
    # (An empty rules list would fail PrometheusRule validation.)

    - name: xpeditis.redis
      rules:
        # Redis high memory
        - alert: RedisHighMemory
          expr: |
            redis_memory_used_bytes /
            redis_memory_max_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis is using > 85% of its memory"
EOF

kubectl apply -f /tmp/xpeditis-alerts.yaml
```
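The `up{job="xpeditis-backend"}` series only exists if Prometheus actually scrapes the backend, which with kube-prometheus-stack is usually wired up through a ServiceMonitor. A sketch, assuming the backend Service is labelled `app: xpeditis-backend` and exposes Prometheus metrics at `/metrics` on a port named `http` (both are assumptions about your Service; note the resulting `job` label defaults to the Service name):

```shell
cat > /tmp/backend-servicemonitor.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: xpeditis-backend
  namespace: xpeditis-prod
  labels:
    release: prometheus   # so the operator deployed above picks it up
spec:
  selector:
    matchLabels:
      app: xpeditis-backend   # assumption: label on the backend Service
  endpoints:
    - port: http              # assumption: named port on the Service
      path: /metrics
      interval: 30s
EOF

kubectl apply -f /tmp/backend-servicemonitor.yaml
```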

### Alertmanager configuration (Slack)

```bash
cat > /tmp/alertmanager-config.yaml << 'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: '<SLACK_WEBHOOK_URL>'

    route:
      group_by: ['alertname', 'namespace']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'slack-critical'
        - match:
            severity: warning
          receiver: 'slack-notifications'

    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#xpeditis-monitoring'
            icon_url: https://avatars.githubusercontent.com/u/3380462
            title: '{{ template "slack.default.title" . }}'
            text: '{{ template "slack.default.text" . }}'
            send_resolved: true

      - name: 'slack-critical'
        slack_configs:
          - channel: '#xpeditis-alerts-critiques'
            color: 'danger'
            title: '🚨 CRITICAL ALERT: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
            send_resolved: true
EOF

kubectl apply -f /tmp/alertmanager-config.yaml
```
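To verify routing and the Slack webhook end to end, you can fire a synthetic alert at the Alertmanager API (a sketch; the alert name is made up, and the service name follows from the `prometheus` release name used above):

```shell
# Reach Alertmanager locally
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring &
sleep 2

# Fire a fake warning alert; it should land in #xpeditis-monitoring
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning", "namespace": "xpeditis-prod"},
    "annotations": {"summary": "Test alert, please ignore"}
  }]'
```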

---

## Recommended Grafana dashboards

Import these dashboards from grafana.com (enter the ID in Grafana → Import):

| Dashboard | ID | Purpose |
|---|---|---|
| Kubernetes Cluster Overview | 6417 | Cluster-wide overview |
| Kubernetes Deployments | 8588 | Deployment details |
| Node Exporter Full | 1860 | Node-level system metrics |
| Loki & Promtail | 12611 | Aggregated logs |
| Traefik 2 | 4475 | Ingress/request metrics |

```bash
# In Grafana (https://monitoring.xpeditis.com)
# → + → Import
# → Enter the ID and click "Load"
# → Select the Prometheus datasource
# → Import
```
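Dashboards can also be provisioned declaratively: the chart's Grafana sidecar loads any ConfigMap carrying the `grafana_dashboard` label. A minimal sketch (the dashboard JSON here is a placeholder, not a real dashboard):

```shell
cat > /tmp/dashboard-cm.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: xpeditis-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana sidecar
data:
  xpeditis.json: |
    { "title": "Xpeditis overview", "panels": [] }
EOF

kubectl apply -f /tmp/dashboard-cm.yaml
```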

---

## Uptime Kuma (external monitoring)

Uptime Kuma probes your endpoints through their public URLs, independently of Prometheus (it runs inside the cluster here, but its checks go through the external ingress path):

```bash
# Deploy Uptime Kuma in the cluster
cat > /tmp/uptime-kuma.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  replicas: 1
  # Recreate: the ReadWriteOnce volume cannot be attached to two pods
  # at once, so a rolling update would get stuck
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: uptime-kuma
  template:
    metadata:
      labels:
        app: uptime-kuma
    spec:
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: uptime-kuma-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uptime-kuma-pvc
  namespace: monitoring
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: hcloud-volumes
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  selector:
    app: uptime-kuma
  ports:
    - port: 3001
      targetPort: 3001
EOF

kubectl apply -f /tmp/uptime-kuma.yaml
```

Monitors to configure in Uptime Kuma:

| Monitor | URL | Interval |
|---|---|---|
| API Health | `https://api.xpeditis.com/api/v1/health` | 1 min |
| Frontend | `https://app.xpeditis.com/` | 1 min |
| API Login | `POST https://api.xpeditis.com/api/v1/auth/login` | 5 min |
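The Service above is only reachable in-cluster; to open the Uptime Kuma UI you can reuse the IP-whitelist middleware defined for Grafana. A sketch, where the hostname `status.xpeditis.com` is an assumption (pick whatever subdomain you actually own):

```shell
cat > /tmp/uptime-kuma-ingress.yaml << 'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: uptime-kuma
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
    # Same team-only whitelist as Grafana
    traefik.ingress.kubernetes.io/router.middlewares: "monitoring-ipwhitelist@kubernetescrd"
spec:
  ingressClassName: traefik
  tls:
    - hosts: [status.xpeditis.com]   # assumption: hypothetical hostname
      secretName: uptime-kuma-tls
  rules:
    - host: status.xpeditis.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: uptime-kuma
                port:
                  number: 3001
EOF

kubectl apply -f /tmp/uptime-kuma-ingress.yaml
```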

---

## Quick monitoring commands

```bash
# Top pods by CPU/RAM consumption
kubectl top pods -n xpeditis-prod --sort-by=cpu

# Recent events in the namespace
kubectl get events -n xpeditis-prod --sort-by='.lastTimestamp' | tail -20

# Live backend logs (all pods)
stern xpeditis-backend -n xpeditis-prod

# Error logs only
kubectl logs -l app=xpeditis-backend -n xpeditis-prod --since=1h | grep -i error

# HPA status
kubectl get hpa -n xpeditis-prod

# Node metrics
kubectl top nodes
```
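When the Ingress is unreachable, or during an incident, the UIs can still be reached directly with port-forwards (service names follow from the Helm release names used earlier in this page):

```shell
# Grafana on http://localhost:3000
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

# Prometheus on http://localhost:9090
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

# Alertmanager on http://localhost:9093
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring
```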