Fighting Slow and Flaky CI/CD Pipelines Starts with Observability

Jun 6, 2022

We all practice monitoring and observability in our Production environment. That’s how we know that our system runs well and that our environment is stable, and, when an issue occurs, how we find the root cause and remediate it quickly and efficiently. This helps reduce the Mean Time to Recovery, which is a crucial metric for software teams.

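In this guide I’ll cover the following:

Treat your CI/CD like you treat Prod
CI/CD health monitoring with the ELK Stack or OpenSearch
CI/CD environment monitoring with Prometheus
CI/CD pipeline performance monitoring with Jaeger and Distributed Tracing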

Treat your CI/CD like you treat Prod

Observability for CI/CD pipelines is often the stepchild, with a far less established practice. Lack of CI/CD observability results in an unnecessarily long cycle time, or Lead Time for Changes, another crucial metric that measures how long it takes a commit to reach production. It means your bug fixes, enhancements and new features are rolled out late. Now imagine the frustration of the users waiting for them, of the business that wants to launch them, not to mention the other developers who want to run their own pipelines and get stuck in the queue. Add to that the unfriendly experience of the Developer on Duty (DoD) who needs to handle failed pipelines during their shift. You get the picture.

Some CI/CD tools provide observability capabilities out of the box. In my company we use Jenkins and have explored its capabilities and plugins in that area. Jenkins lets you drill into an individual run and see how that run went. But that’s often not enough when you wish to monitor aggregated information from all of a pipeline’s runs, across all branches and machines, with your own filters and time ranges, to really understand the patterns. We found basic aggregate questions tricky or cumbersome to answer, such as:

Did all runs fail on the same step?
Did all runs fail for the same reason?
Did the failure occur only in a specific branch?
Did the failure occur on a specific machine?
Which steps fail the most?
What’s the normal run time, so that outliers can be identified?

If you’ve also exhausted the built-in observability capabilities of your CI/CD tool, it’s time to set up proper observability, with a dedicated monitoring and observability setup just like the one you have for your Production environment. In this article I’ll show how to achieve observability into your CI/CD pipeline in four steps. I’ll use Jenkins as the reference tool, as many know this popular open source project, and as we’ve used it extensively in my company. But even if you’re using other tools, you’ll find much of this applicable.

As you’ll see, it takes four simple steps to gain observability into your CI/CD pipeline: Collect → Store → Visualize → Report & Alert.

Let’s see how to carry out these steps for different types of observability. As an open source enthusiast, I’ll demonstrate with the popular open source stack, but the same principles can be implemented with other equivalent tools of your choice.

Devoxx UK 2023: How We Gained Observability Into Our CI/CD Pipeline by Dotan Horovits

CI/CD health monitoring with the ELK Stack or OpenSearch

The ELK Stack has long been a popular open source stack for log analytics, and many have mastered the art of Kibana dashboarding, so I’ll use it for the CI/CD health monitoring. Note that since 2021 Elasticsearch and Kibana are no longer open source, but you can use their open source fork, OpenSearch, to achieve the same under the Apache 2.0 license.

Step 1: Collect Data on CI/CD Pipeline Run

In the ‘Collect’ phase you instrument your pipeline to capture all the relevant information. Here’s useful information to consider, though you should determine what’s relevant for you:

the branch
commit SHA
machine IP
run type (scheduled, triggered by merge/push)
failed step
step duration
build number

You can capture this information as environment variables or any other transient state that works for you. As you’ll see next, the persistence piece will be addressed in the “Store” phase.
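As a minimal sketch of the “Collect” phase, assuming the pipeline can call a small Python helper: the function below gathers the fields listed above into one dictionary. BRANCH_NAME, GIT_COMMIT, NODE_NAME and BUILD_NUMBER are environment variables that Jenkins commonly sets (depending on job type and plugins). MACHINE_IP and RUN_TYPE are hypothetical variables that your pipeline would have to export itself, and the dictionary keys are illustrative names, not a required schema.

```python
import os
import time


def collect_run_data(env=None, started_at=None, failed_step=None, step_durations=None):
    """Gather the run fields into one flat dictionary (transient state only)."""
    env = os.environ if env is None else env
    duration = None
    if started_at is not None:
        # Total run time so far, in seconds, measured from a timestamp
        # taken at the start of the pipeline.
        duration = round(time.time() - started_at, 3)
    return {
        "branch": env.get("BRANCH_NAME", "unknown"),
        "commit_sha": env.get("GIT_COMMIT", "unknown"),
        "node_name": env.get("NODE_NAME", "unknown"),
        "machine_ip": env.get("MACHINE_IP", "unknown"),  # hypothetical variable
        "run_type": env.get("RUN_TYPE", "unknown"),      # hypothetical variable
        "build_number": int(env.get("BUILD_NUMBER", "0")),
        "failed_step": failed_step,
        "step_durations_sec": dict(step_durations or {}),
        "total_duration_sec": duration,
    }
```

Nothing is persisted here. The dictionary only lives for the duration of the run, which matches the idea of keeping transient state until the “Store” phase.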

Step 2: Index & Store Data in Elasticsearch (or OpenSearch)

After capturing all the data throughout the pipeline run, it’s time to persist it in the “Store” step. Here we can create a stage at the end of the pipeline (or extend an existing concluding stage), in which we create a JSON document with all this data and write it to your Elasticsearch or OpenSearch cluster, or to a managed service such as Logz.io Log Management (disclaimer: I work at Logz.io).
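Here is a rough sketch of what that concluding stage could do, assuming a cluster reachable over plain HTTP without authentication. It serializes the run data to JSON and posts it to the document indexing endpoint (POST /<index>/_doc), which both Elasticsearch and OpenSearch expose. The base URL, the index name jenkins-runs and the function names are assumptions for illustration. A real setup would add authentication and TLS.

```python
import json
import urllib.request


def build_index_request(doc, base_url="http://localhost:9200", index="jenkins-runs"):
    """Build (but do not send) the HTTP request that indexes one run document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def store_run(doc, **kwargs):
    """Send the document to the cluster and return the HTTP status code."""
    request = build_index_request(doc, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting the request construction from the network call keeps the last pipeline stage easy to test without a live cluster.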

I should note that Jenkins (and other tools) offer some persistence capabilities. However, Jenkins keeps this data on the Jenkins machines, which burdens them and can impact the critical path of the CI/CD pipeline. You can use Jenkins plugins such as the discard-old-build plugin to clean data from these machines at the end of the run, but that’s quite limited. In comparison, with Elasticsearch or similar services we persist historical data under our own control: we decide on the retention period of the data, and we do that off the Jenkins server so as not to load the critical path. As we’ll see next, Elasticsearch also offers a more powerful search experience.

Step 3: Visualize with Kibana (or OpenSearch) Dashboards

Once all the pipeline run data is stored in Elasticsearch, it’s easy to build Kibana dashboards and visualizations to meet our observability needs (remember the example aggregate questions we listed above?).

Let’s look at some useful visualizations you should consider.

To check how stable our pipeline is, visualize success and failure rates, whether overall or in a specific time window.

To find problematic steps, visualize the failures segmented by pipeline step, whether overall or in a specific time window.

Many times, pipeline runs fail not because of bugs in the released code but because of problematic machines. To detect such problematic build machines, visualize failures segmented by machine. Problematic machines will stand out as spikes, and in these cases it’s easier to kill the problematic machine, let auto-scaling spin up a new one and start clean, before wasting time digging into the released code.

To detect problematic pipeline steps, visualize duration per step in an aggregated fashion, across pipeline runs, across branches and machines.

These are basic visualizations, but you should adapt and add to them according to your needs, environment and investigation process. You may even need several dashboards for different personas with different monitoring needs or areas of responsibility.
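Behind a “failures segmented by machine” or “failures segmented by step” chart there is typically a terms aggregation. As a sketch, the helper below builds such a query body in the Elasticsearch/OpenSearch query DSL. The field names (status, machine_ip, failed_step, @timestamp) and the value FAILURE are assumptions about how the run documents were indexed, and the segmenting field is assumed to be mapped as a keyword.

```python
def failures_by_field(field, since="now-7d", status_field="status",
                      failed_value="FAILURE", size=10):
    """Build a query body that counts failed runs per value of `field`."""
    return {
        "size": 0,  # return only the aggregation buckets, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {status_field: failed_value}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "failures": {"terms": {"field": field, "size": size}}
        },
    }


# Example: the ten machines with the most failed runs in the last week.
per_machine_query = failures_by_field("machine_ip")
# Example: the steps that failed most often in the last 24 hours.
per_step_query = failures_by_field("failed_step", since="now-24h")
```

The same body can be sent to the _search endpoint of the index, or recreated visually in Kibana or OpenSearch Dashboards as a bar chart split by that field.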

Step 4: Report & Alert

Once the data is in Elasticsearch, define reports and alerts on top of that data to automate as much as possible. For example, the DoD (developer on duty) should receive a daily start-of-day report in Slack, to make sure nothing happened during the night that calls for his or her urgent attention.

Define triggered alerts for the things that are critical to your SLOs (service level objectives). Alerts can be defined using any of the data fields collected in the “Collect” step, and can be complex conditions such as “if the sum of failures goes above X or the average duration goes above Y, dispatch an alert”. Essentially, anything you can state as a Lucene query in Kibana, you can also automate as an alert. We’ve built this alerting mechanism on top of Elasticsearch and OpenSearch as part of our Log Management service, and you can use other alerting mechanisms as well.
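To make the example condition concrete, here is a small sketch of the rule “failures above X or average duration above Y” evaluated over a batch of run documents. This is only an illustration of the logic, not how any particular alerting product implements it. The keys status and duration_sec are assumed field names.

```python
from statistics import mean


def should_alert(runs, max_failures, max_avg_duration_sec):
    """Return True when either threshold of the example rule is crossed."""
    failures = sum(1 for run in runs if run["status"] == "FAILURE")
    avg_duration = mean(run["duration_sec"] for run in runs) if runs else 0.0
    return failures > max_failures or avg_duration > max_avg_duration_sec
```

In practice the runs would come from a query over a time window (for example, the last hour), and a True result would trigger a notification to the channel of your choice.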

CI/CD environment monitoring with Prometheus

Many pipelines fail not because of the released code but because of the CI/CD environment. It could be a memory leak or high CPU usage in one of the machines or containers, or even in the JVM itself (in the case of Jenkins, which is Java-based). It could happen due to improper cleanup of previously run tests or tasks, or for many other reasons.

It’s vital to be able to discern whether a run failed because of the code or because of the environment. Identifying an environment problem can save us from wasting time looking for bugs in the released code. We’ve started addressing that in our Kibana dashboard above by monitoring failure rate per machine in an aggregated fashion. Now let’s take it to the next level. This calls for observability into the CI/CD environment.

Let’s see how to monitor metrics from the Jenkins servers and the environment, following the same flow.

Step 1: Collect Metrics with Telegraf

I recommend using Telegraf for collecting metrics from Jenkins. Telegraf is an open source project (MIT license) with a rich suite of plugins (I’ll mention a few useful ones here, you can find the full list in Telegraf’s plugin directory).

In the “Collect” stage, we need to collect metrics from Jenkins in a Prometheus format as follows:

Install the Prometheus metrics plugin from the Jenkins Web UI (go to Manage Jenkins > Manage Plugins > Available, select Prometheus metrics and install).
Install Telegraf with the Prometheus input plugin to scrape the Jenkins servers ([[inputs.prometheus]] section in the Telegraf configuration).

Now your Jenkins exposes its metrics and Telegraf is collecting them.
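To get a feel for what is being scraped, here is a deliberately simplified sketch that reads samples from the Prometheus text exposition format, where each sample is a line of the form name{labels} value with an optional timestamp. It only illustrates the format. It is not how Telegraf parses it, it keeps labels as a raw string, and the metric names in the sample are hypothetical rather than actual Jenkins metric names.

```python
def parse_exposition(text):
    """Parse Prometheus text-format lines into (name, raw_labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels_raw, tail = rest.rsplit("}", 1)
        else:
            name, _, tail = line.partition(" ")
            labels_raw = ""
        value = float(tail.split()[0])  # an optional timestamp may follow the value
        samples.append((name.strip(), labels_raw, value))
    return samples


# Hypothetical sample of a scraped payload.
SAMPLE = """\
# HELP ci_queue_size Items waiting in the build queue
# TYPE ci_queue_size gauge
ci_queue_size 4
ci_job_failures_total{job="build app",branch="main"} 2
"""

for sample in parse_exposition(SAMPLE):
    print(sample)
```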

* Note: As OpenTelemetry reaches GA for metrics in 2022, it may become the new standard, so it’s best to keep an eye on that option as well.

Step 2: Store Metrics in Prometheus

Prometheus is the gold standard for monitoring and I’ll follow that path for our backend, whether it’s your own instance of Prometheus or a Prometheus-compatible solution such as Logz.io Infrastructure Monitoring.

For the “Store” step, let’s configure Telegraf to send the metrics to the backend. You can do that in one of two ways, depending on your backend and architecture of choice:

Pull mode: in this mode, Telegraf exposes a /metrics endpoint in the OpenMetrics standard format for scraping. This mode is good for the classic Prometheus backend, which scrapes all endpoints (i.e. reads in pull mode). For this mode, configure Telegraf with the Prometheus Client output plugin ([[outputs.prometheus_client]] section in the Telegraf configuration).

Push mode: in this mode, Telegraf remote-writes the metrics it collects downstream to the Prometheus-compatible backend at a designated endpoint. For this mode, configure Telegraf with the HTTP output plugin. Make sure to set data_format = "prometheusremotewrite" to be Prometheus compatible.

Step 3: Visualize with Grafana style dashboards

Once the data is stored in Prometheus, it’s easy to build Grafana style dashboards on top of it (or any other visualization you prefer on top of Prometheus).

As I mentioned, the purpose is to visualize the Jenkins environment. You should look into:

System metrics: such as CPU, memory, disk usage and load of the machines or compute instances on which your CI/CD pipeline runs.
Container metrics: such as the container CPU, memory, I/O, network (inbound, outbound) and disk usage behavior, as well as the container status (running/stopped/paused) by Jenkins machine.
JVM metrics: such as thread count, heap memory usage, and garbage collection duration. This is relevant for Jenkins, which is a Java-based program. Other CI/CD tools may require monitoring of their respective runtime environment.

Container metrics for Jenkins environment

In addition to system metrics, Jenkins exposes different metrics on the Jenkins nodes, queues, jobs and executors. Monitoring those metrics can tell you if your Jenkins queue sizes suddenly increase or if you have a rise in the number of queues in status “blocked” or “stuck”. Similarly, you can see how many Jenkins jobs were executed over time or the job duration to identify trends, or how many jobs are in status “failure” or “unstable”. Here’s a list of useful metrics that we visualized in our Jenkins dashboard (if you use Logz.io you can install the dashboard in one click and try it out yourself).

Jenkins metrics for nodes, queues, jobs and executors

It’s easy to create alerts on top of the Prometheus API, whether using Alertmanager, Grafana alerts, Logz.io’s alerting mechanism or another service of your choice.

CI/CD pipeline performance monitoring with Jaeger and Distributed Tracing

We saw above how the Kibana dashboard can show an aggregated view of step duration. But what if we want to investigate the performance of specific pipeline runs? That’s what distributed tracing is for!

Let’s see how to visualize Jenkins jobs and pipeline executions as distributed traces, following the same 4-step flow.

Step 1: Collect with OpenTelemetry

OpenTelemetry is the emerging standard for collecting observability data. At the time of writing, OpenTelemetry is generally available (GA) for distributed tracing, and it’s recommended to use it to stay future-proof and vendor-agnostic. Note that OpenTelemetry is set to cover metric and log data as well.

Here’s what it takes to collect trace data from Jenkins with OpenTelemetry:

Install OpenTelemetry Collector
Install Jenkins OpenTelemetry plugin in Jenkins UI
Configure the Jenkins plugin to send to the OpenTelemetry Collector endpoint over the OTLP/gRPC protocol

Step 2: Store Data in Jaeger backend

Now we need to configure the OpenTelemetry Collector to send the trace data to Jaeger. Whether you run your own Jaeger or use a managed service such as Logz.io Distributed Tracing, all you need is to set the right exporter and endpoint in the exporters: section of the collector’s YAML configuration file. See this guide for more information on the OpenTelemetry Collector and its available suite of exporters.

Step 3: Visualize with Jaeger UI

The “Visualize” step is pretty straightforward with Jaeger UI. As soon as your trace data comes in, you’ll see the built-in views populated. The most useful one is the Timeline View, which presents the pipeline’s run as a Gantt chart:

Jenkins pipeline run visualized as a trace in the Timeline View in Jaeger UI

On the left-hand side you can see the indented list representing the call sequence of the steps in a nested form. On the right-hand side you can see the Gantt chart, which visually shows the duration of each step, as well as which steps ran in parallel or sequentially.

When your pipelines take too long to run, it’s easy to use this view to analyze where most of the time is being spent and how to optimize things, whether by shortening a specific step’s duration, making sequential steps run concurrently, optimizing thread pools or making similar performance improvements.
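The same “where is the time going” question can also be asked of the raw span data. As a minimal sketch, assuming each span has been reduced to a dictionary with a step name and start and end times in seconds (an illustrative shape, not Jaeger’s actual data model), the helper below ranks steps by total time spent.

```python
def slowest_steps(spans, top=3):
    """Sum the duration per step name and return the `top` slowest steps."""
    durations = {}
    for span in spans:
        elapsed = span["end"] - span["start"]
        durations[span["name"]] = durations.get(span["name"], 0.0) + elapsed
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical spans from one pipeline run.
spans = [
    {"name": "checkout", "start": 0, "end": 5},
    {"name": "build", "start": 5, "end": 65},
    {"name": "test", "start": 65, "end": 95},
]
print(slowest_steps(spans, top=2))  # [('build', 60.0), ('test', 30.0)]
```

Steps at the top of this ranking are the first candidates for shortening or for running in parallel with their neighbors.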

The trace data is similar to structured log data, and you can create alerts on it just as easily. In many cases this can be done via the Elasticsearch API, and in the case of Logz.io it’s the same alerting mechanism.

Summary

Your CI/CD is as critical as your Prod, so give it the same treatment. The same way you use observability to monitor Prod, use it to monitor your CI/CD environment. Preferably even reuse the same observability stack, so you don’t have to reinvent the wheel.

Investing in good CI/CD observability will pay off with a significant improvement in your Lead Time for Changes, effectively shortening the cycle time it takes a commit to reach production.

It takes 4 steps to gain observability:
Collect → Store → Visualize → Report & Alert.

Start with instrumenting your pipeline to get events, state, metrics and traces. Then store and visualize according to the data type. I’ve demonstrated here the popular open source stack with Elasticsearch/OpenSearch for logs and events, Prometheus and Grafana for metrics, and Jaeger for traces, but if you use something else in your production, just reuse what you know and have. Then set alerts and reports over the data to automate as much as possible.

I originally published this article on the Logz.io blog.