We all love Prometheus. It scrapes metrics, fires alerts, and helps us sleep at night. But here's a painful truth most engineers realize at 3 AM: your monitoring system can fail, and you won't know about it until the real outage happens.
Before we dive into code, let’s address the obvious question: Why would I voluntarily break my monitoring?
What happens when your Prometheus server runs out of memory? What if a metric scrape takes 30 seconds because a target is thrashing? What if your alerting rules become corrupt? Answering those questions deliberately, rather than during an incident, is the point of Prometheus Chaos Edition (PCE).
Despite its dramatic name, Prometheus Chaos Edition is not an official Prometheus release. It is a concept (and an accompanying set of scripts and containers) popularized by the Prometheus community and by chaos experiments built around tools like kube-prometheus-stack.
A good first experiment is making every scrape slow. With Chaos Mesh installed, a NetworkChaos resource injects latency into the Prometheus server pod's traffic:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: prometheus-slow-scrape
spec:
  action: delay
  mode: all
  selector:
    pods:
      prometheus-ns:
        - prometheus-server-0
  delay:
    latency: "3s"
    correlation: "100"
    jitter: "1s"
  duration: "5m"
```
Apply with `kubectl apply -f chaos.yaml`. For the next five minutes, Prometheus sees all of its outbound scrape requests delayed by roughly three seconds, plus up to a second of jitter.
One of the most insidious PCE experiments is injecting malformed OpenMetrics data. A thin Flask wrapper can corrupt a fraction of an exporter's responses:

```python
# malicious_exporter.py
from flask import Flask, Response
import random

app = Flask(__name__)

def real_metrics():
    # Stand-in for the real exporter output this shim would normally
    # proxy (a stub added to keep the snippet runnable).
    return "demo_requests_total 42\n"

@app.route('/metrics')
def metrics():
    # 20% of the time, return syntactically invalid exposition data
    if random.random() < 0.2:
        return "malformed_metric{ invalid syntax", 200
    return Response(real_metrics(), mimetype='text/plain')

if __name__ == "__main__":
    app.run(port=8000)
```
The result? A telemetry system that survives real network partitions, overloaded exporters, and misconfigured rules. And a team that actually knows how to debug their monitoring stack under pressure.