Fighting Slow and Flaky CI/CD Pipelines Starts with Observability

Treat your CI/CD like you treat Prod

  • Did all runs fail on the same step?
  • Did all runs fail for the same reason?
  • Did the failure occur only in a specific branch?
  • Did the failure occur on a specific machine?
  • Which steps fail the most?
  • What’s the normal run time, so that outliers can be identified?

DevDays Europe 2022: How We Gained Observability Into Our CI/CD Pipeline by Dotan Horovits
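
The questions listed above map onto simple aggregations over per-run records. As a minimal sketch, assuming each run is summarized as a dictionary with the hypothetical fields `branch`, `machine`, `failed_step` and `duration_s` (the names and sample values are illustrative and not taken from any particular tool), they could be answered along these lines:

```python
from collections import Counter
from statistics import median

# Hypothetical run summaries; field names and values are illustrative only.
runs = [
    {"branch": "main", "machine": "10.0.0.1", "failed_step": None, "duration_s": 300},
    {"branch": "main", "machine": "10.0.0.2", "failed_step": "integration-tests", "duration_s": 310},
    {"branch": "feature-x", "machine": "10.0.0.2", "failed_step": "integration-tests", "duration_s": 295},
    {"branch": "main", "machine": "10.0.0.1", "failed_step": None, "duration_s": 305},
    {"branch": "main", "machine": "10.0.0.2", "failed_step": "build", "duration_s": 900},
]


def failure_counts(runs, field):
    """Count failed runs grouped by the given field (step, branch or machine)."""
    return Counter(r[field] for r in runs if r["failed_step"] is not None)


def duration_outliers(runs, factor=2.0):
    """Return runs that took more than `factor` times the median duration."""
    typical = median(r["duration_s"] for r in runs)
    return [r for r in runs if r["duration_s"] > factor * typical]


print(failure_counts(runs, "failed_step").most_common())  # which steps fail the most
print(failure_counts(runs, "machine").most_common())      # is one machine involved more often
print(duration_outliers(runs))                            # runs far above the normal run time
```

The median is used as the "normal" run time here because a single very slow run does not shift it much. The factor of 2.0 is an arbitrary starting point, not a recommendation.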

CI/CD health monitoring with the ELK Stack or OpenSearch

Step 1: Collect Data on CI/CD Pipeline Run

  • branch
  • commit SHA
  • machine IP
  • run type (scheduled, triggered by merge/push)
  • failed step
  • step duration
  • build number
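
One way to capture the fields above is to assemble a single JSON document at the end of each run. The sketch below assumes a Jenkins job where the environment variables `BUILD_NUMBER`, `GIT_BRANCH` and `GIT_COMMIT` are available (the latter two are set by the Git plugin). The machine IP, run type, failed step and step durations are passed in by the caller, and the document field names, including the derived `status` field, are hypothetical.

```python
import json
import os


def build_run_summary(env, machine_ip, run_type, failed_step, step_durations):
    """Assemble one JSON-serializable summary document for a pipeline run."""
    return {
        "branch": env.get("GIT_BRANCH"),
        "commit_sha": env.get("GIT_COMMIT"),
        "build_number": env.get("BUILD_NUMBER"),
        "machine_ip": machine_ip,
        "run_type": run_type,                 # e.g. "scheduled" or "push"
        "failed_step": failed_step,           # None when the run succeeded
        "step_durations_s": step_durations,   # mapping of step name to seconds
        "status": "failure" if failed_step else "success",  # derived, hypothetical field
    }


if __name__ == "__main__":
    # Illustrative values; in a real job these would come from the pipeline itself.
    summary = build_run_summary(
        os.environ, "10.0.0.2", "push", None, {"build": 42.0, "test": 118.5}
    )
    print(json.dumps(summary, indent=2))
```

Variables that are not set simply come through as `null` in the JSON, so the same script can be tried outside Jenkins.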

Step 2: Index & Store Data in Elasticsearch (or OpenSearch)
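
As a rough illustration of this step, run summaries can be sent to the `_bulk` endpoint, which both Elasticsearch and OpenSearch expose and which takes newline-delimited JSON: an action line followed by the document itself. The sketch below uses only the standard library. The index name `ci-runs`, the sample document and the assumption of an unauthenticated cluster are all hypothetical.

```python
import json
import urllib.request


def to_bulk_payload(index_name, docs):
    """Render documents in the newline-delimited format expected by _bulk."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))                                  # document source line
    return "\n".join(lines) + "\n"  # the payload must end with a newline


def send_bulk(base_url, payload):
    """POST the payload to <base_url>/_bulk. Assumes no authentication is required."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build (but do not send) a payload for one hypothetical run summary.
payload = to_bulk_payload("ci-runs", [{"build_number": "17", "status": "failure"}])
print(payload, end="")
```

In practice a log shipper or an official client library would often handle this. The point here is only the shape of the data that ends up in the index.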

Step 3: Visualize with Kibana (or OpenSearch) Dashboards

Step 4: Report & Alert

CI/CD environment monitoring with Prometheus

Step 1: Collect Metrics with Telegraf

  1. Enable the Prometheus metrics plugin from the Jenkins Web UI.
    Go to Manage Jenkins > Manage Plugins > Available, select Prometheus metrics, and install it.
  2. Install Telegraf with the Prometheus input plugin to scrape the Jenkins servers (the [[inputs.prometheus]] section in the Telegraf configuration).
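
Before pointing Telegraf at Jenkins, it can help to confirm that the plugin's endpoint actually serves metrics. The sketch below is not part of the Telegraf setup itself. It is a simplified parser for the Prometheus text exposition format, with a helper that fetches a URL. The `/prometheus/` path in the docstring is assumed to be the plugin's default and should be checked against your configuration; the host name and the sample metric names are hypothetical.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: comment lines are skipped, labels are kept as part of the
    series key, and optional timestamps are ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labeled series: everything up to the final "}" is the series key.
            series, rest = line.rsplit("}", 1)
            series += "}"
        else:
            series, rest = line.split(None, 1)
        samples[series] = float(rest.split()[0])
    return samples


def fetch_metrics(url):
    """Fetch and parse an endpoint, e.g. http://jenkins.example.com:8080/prometheus/ (hypothetical)."""
    with urllib.request.urlopen(url) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical sample in the exposition format; metric names are illustrative.
sample = """\
# HELP example_queue_size Number of queued builds
# TYPE example_queue_size gauge
example_queue_size 3.0
example_executors{node="agent-1",state="busy"} 2.0
"""
print(parse_metrics(sample))
```

If `fetch_metrics` returns a non-empty dictionary for the Jenkins URL, the same URL is a reasonable candidate for the Telegraf input section.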

Step 2: Store Metrics in Prometheus

Step 3: Visualize with Grafana style dashboards

  • System metrics: such as CPU, memory, disk usage and load of the machines or compute instances on which your CI/CD pipeline runs.
  • Container metrics: such as container CPU, memory, I/O, network (inbound, outbound) and disk usage, as well as the container status (running/stopped/paused) per Jenkins machine.
  • JVM metrics: such as thread count, heap memory usage, and garbage collection duration. This is relevant for Jenkins because it is a Java-based program. Other CI/CD tools may require monitoring of their respective runtime environments.

System metrics for Jenkins environment
Container metrics for Jenkins environment
Jenkins metrics for nodes, queues, jobs and executors

CI/CD pipeline performance monitoring with Jaeger and Distributed Tracing

Step 1: Collect with OpenTelemetry

  1. Install OpenTelemetry Collector
  2. Install Jenkins OpenTelemetry plugin in Jenkins UI
  3. Configure the Jenkins plugin to send to the OpenTelemetry Collector endpoint in OTLP/gRPC protocol
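
With this setup the Jenkins plugin emits the spans itself, so no tracing code needs to be written. To make the underlying data model concrete, here is a minimal sketch in plain Python (deliberately not using the OpenTelemetry SDK) of a pipeline run represented as a root span with one child span per stage. The `Span` class, the stage names and the timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """A simplified span: one timed operation within a trace."""
    span_id: str
    name: str
    start_s: float
    end_s: float
    parent_id: Optional[str] = None

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


# Hypothetical pipeline run: a root span for the run, one child span per stage.
trace = [
    Span("1", "pipeline-run", 0.0, 200.0),
    Span("2", "checkout", 0.0, 10.0, parent_id="1"),
    Span("3", "build", 10.0, 70.0, parent_id="1"),
    Span("4", "test", 70.0, 200.0, parent_id="1"),
]


def children(trace, parent_id):
    """Return the direct child spans of a span, ordered by start time."""
    return sorted((s for s in trace if s.parent_id == parent_id), key=lambda s: s.start_s)


def slowest_child(trace, parent_id):
    """Return the child span with the longest duration."""
    return max(children(trace, parent_id), key=lambda s: s.duration_s)


def render_timeline(trace, root_id, width=40):
    """Render child spans as text bars, loosely mimicking a timeline view."""
    root = next(s for s in trace if s.span_id == root_id)
    scale = width / root.duration_s
    rows = []
    for s in children(trace, root_id):
        offset = int((s.start_s - root.start_s) * scale)
        length = max(1, int(s.duration_s * scale))
        rows.append(f"{s.name:<10}{' ' * offset}{'#' * length}")
    return "\n".join(rows)


print(render_timeline(trace, "1"))
print("slowest stage:", slowest_child(trace, "1").name)
```

Seeing each stage as a bar on a shared time axis is what makes it easy to spot which step dominates the run time and which steps run sequentially when they could run in parallel.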

Step 2: Store Data in Jaeger backend

Step 3: Visualize with Jaeger UI

Jenkins pipeline run visualized as a trace in the Timeline View in Jaeger UI




Dotan Horovits

Technology evangelist, innovation enthusiast, Startup Nation resident, proud father.