# 12 — Monitoring and alerts

---

## Monitoring stack

```
Prometheus   ← Metrics scraping (pods, nodes, app)
Grafana      ← Visual dashboards
Loki         ← Log aggregation (NestJS pino)
Alertmanager ← Alert delivery (email, Slack)
Uptime Kuma  ← External HTTP monitoring (health checks)
```

---

## Installing kube-prometheus-stack

The most complete stack, deployed with Helm:

```bash
# Add the repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 65.3.1 \
  --set grafana.adminPassword="" \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=2Gi \
  --set grafana.persistence.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=hcloud-volumes \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=2Gi \
  --set prometheusOperator.resources.requests.cpu=50m \
  --set prometheusOperator.resources.requests.memory=128Mi \
  --set prometheus.prometheusSpec.resources.requests.cpu=100m \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set grafana.resources.requests.cpu=50m \
  --set grafana.resources.requests.memory=128Mi

# Wait for everything to be Running
kubectl rollout status deployment/prometheus-grafana -n monitoring --timeout=300s
kubectl get pods -n monitoring
```

---

## Exposing Grafana via an Ingress

```bash
cat > /tmp/grafana-ingress.yaml << 'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
    traefik.ingress.kubernetes.io/router.tls: "true"
    # Restrict access to the team's IPs
    traefik.ingress.kubernetes.io/router.middlewares: "monitoring-ipwhitelist@kubernetescrd"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - monitoring.xpeditis.com
      secretName: monitoring-tls
  rules:
    - host: monitoring.xpeditis.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana
                port:
                  number: 80
---
# IP whitelist for Grafana (your team only)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ipwhitelist
  namespace: monitoring
spec:
  ipWhiteList:
    sourceRange:
      - "/32"
      - "10.0.0.0/16" # Hetzner internal network
EOF

kubectl apply -f /tmp/grafana-ingress.yaml
```

---

## Installing Loki (log aggregation)

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=5Gi \
  --set loki.persistence.storageClassName=hcloud-volumes \
  --set promtail.enabled=true \
  --set loki.config.limits_config.retention_period=7d \
  --set grafana.enabled=false # We reuse the Grafana installed above

# Add Loki as a datasource in Grafana
# Grafana → Data Sources → Add → Loki
# URL: http://loki:3100
```
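Rather than clicking through the UI, the datasource can also be provisioned declaratively: the Grafana deployed by kube-prometheus-stack runs a sidecar that loads any ConfigMap carrying the `grafana_datasource` label (the chart default). A minimal sketch, assuming those sidecar defaults are unchanged:

```bash
cat > /tmp/loki-datasource.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1" # label watched by the Grafana sidecar
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
EOF

kubectl apply -f /tmp/loki-datasource.yaml
```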
---

## Alert configuration

### Xpeditis-specific alerts

```bash
cat > /tmp/xpeditis-alerts.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xpeditis-alerts
  namespace: xpeditis-prod
  labels:
    release: prometheus
spec:
  groups:
    - name: xpeditis.backend
      interval: 30s
      rules:
        # Backend down
        - alert: XpeditisBackendDown
          expr: up{job="xpeditis-backend"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Xpeditis backend unavailable"
            description: "No backend pod has responded for 1 minute."

        # Too few replicas
        - alert: XpeditisBackendLowReplicas
          expr: kube_deployment_status_replicas_available{deployment="xpeditis-backend",namespace="xpeditis-prod"} < 1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 1 backend replica available"

        # High CPU (autoscaling likely to kick in)
        - alert: XpeditisHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              namespace="xpeditis-prod", container="backend"
            }[5m])) by (pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on pod {{ $labels.pod }}"
            description: "CPU usage > 80% for 5 minutes."

        # High memory
        - alert: XpeditisHighMemory
          expr: |
            container_memory_usage_bytes{
              namespace="xpeditis-prod", container="backend"
            }
            /
            container_spec_memory_limit_bytes{
              namespace="xpeditis-prod", container="backend"
            } > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on pod {{ $labels.pod }}"

        # High HTTP error rate
        - alert: XpeditisHighErrorRate
          expr: |
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*", code=~"5.."
            }[5m]))
            /
            sum(rate(traefik_service_requests_total{
              service=~"xpeditis-prod-xpeditis-backend.*"
            }[5m])) > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "5xx error rate > 5% on the backend API"

        # Pods in CrashLoopBackOff
        - alert: XpeditisPodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="xpeditis-prod"
            }[1h]) > 5
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting too often"

    - name: xpeditis.database
      # No direct alerts on Neon (managed); only relevant if self-hosted
      rules: []

    - name: xpeditis.redis
      rules:
        # Redis high memory usage
        - alert: RedisHighMemory
          expr: |
            redis_memory_used_bytes / redis_memory_max_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis is using > 85% of its memory"
EOF

kubectl apply -f /tmp/xpeditis-alerts.yaml
```
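After applying, verify that Prometheus actually loaded the rules; the `release: prometheus` label above is what kube-prometheus-stack's default rule selector matches. A quick check, assuming the default Service name the chart creates for a release named `prometheus`:

```bash
# Port-forward the Prometheus UI (run in a second terminal)
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

# The xpeditis.* groups should show up under http://localhost:9090/rules,
# or via the API:
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"Xpeditis[^"]*"'
```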
### Alertmanager configuration (Slack)

```bash
cat > /tmp/alertmanager-config.yaml << 'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: ''
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'slack-critical'
        - match:
            severity: warning
          receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#xpeditis-monitoring'
            icon_url: https://avatars.githubusercontent.com/u/3380462
            title: '{{ template "slack.default.title" . }}'
            text: '{{ template "slack.default.text" . }}'
            send_resolved: true
      - name: 'slack-critical'
        slack_configs:
          - channel: '#xpeditis-alerts-critiques'
            color: 'danger'
            title: '🚨 CRITICAL ALERT: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
            send_resolved: true
EOF

kubectl apply -f /tmp/alertmanager-config.yaml
```

---

## Recommended Grafana dashboards

Import these dashboards from grafana.com (enter the ID in Grafana → Import):

| Dashboard | ID | Purpose |
|---|---|---|
| Kubernetes Cluster Overview | 6417 | Cluster-wide overview |
| Kubernetes Deployments | 8588 | Deployment details |
| Node Exporter Full | 1860 | Node system metrics |
| Loki & Promtail | 12611 | Aggregated logs |
| Traefik 2 | 4475 | Ingress/request metrics |

```bash
# In Grafana (https://monitoring.xpeditis.com)
# → + → Import
# → Enter the ID and click "Load"
# → Select the Prometheus datasource
# → Import
```

---

## Uptime Kuma (external monitoring)

Uptime Kuma probes your endpoints through their public URLs, independently of the Prometheus stack:

```bash
# Deploy Uptime Kuma in the cluster
cat > /tmp/uptime-kuma.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate # RWO volume: the old pod must release it before a new one starts
  selector:
    matchLabels:
      app: uptime-kuma
  template:
    metadata:
      labels:
        app: uptime-kuma
    spec:
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: uptime-kuma-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uptime-kuma-pvc
  namespace: monitoring
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: hcloud-volumes
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  selector:
    app: uptime-kuma
  ports:
    - port: 3001
      targetPort: 3001
EOF

kubectl apply -f /tmp/uptime-kuma.yaml
```

Monitors to configure in Uptime Kuma:

| Monitor | URL | Interval |
|---|---|---|
| API Health | `https://api.xpeditis.com/api/v1/health` | 1 min |
| Frontend | `https://app.xpeditis.com/` | 1 min |
| API Login | `POST https://api.xpeditis.com/api/v1/auth/login` | 5 min |

---

## Quick monitoring commands

```bash
# Top pods by CPU/RAM consumption
kubectl top pods -n xpeditis-prod --sort-by=cpu

# Recent events in the namespace
kubectl get events -n xpeditis-prod --sort-by='.lastTimestamp' | tail -20

# Backend logs in real time (all pods)
stern xpeditis-backend -n xpeditis-prod

# Error logs only
kubectl logs -l app=xpeditis-backend -n xpeditis-prod --since=1h | grep -i error

# HPA status
kubectl get hpa -n xpeditis-prod

# Node metrics
kubectl top nodes
```
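To see what is currently firing without opening a browser, you can also hit the Alertmanager API directly. A sketch, assuming the default Service name kube-prometheus-stack creates for a release named `prometheus`, and `jq` installed locally:

```bash
# Port-forward Alertmanager (run in a second terminal)
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 -n monitoring

# List the names of currently active alerts
curl -s 'http://localhost:9093/api/v2/alerts?active=true' | jq -r '.[].labels.alertname'
```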