DevOps Monitoring Dashboard Design Guide by Horovits

Unreadable Metrics: Why You Can’t Find Anything in Your Monitoring Dashboards

A Guide to Effective Dashboard Design for DevOps and SRE

Understand your dashboard’s user persona and use case

Start by understanding the persona of the user who will use the dashboard. After all, you’re not building the dashboard for yourself (unless you are). Adapt your dashboard to the familiarity and knowledge level of that persona.

Utilize the right data visualizations

The best dashboards use the right visualizations to convey the most important information. For example, a graph might be a better choice than a table for displaying trends over time, while a pie chart might be a better choice than a line chart for comparing proportions between different groups. Try to use the visualization that best conveys your data.

Single-number counts of nodes and pods, with a trend-over-time graph below to elaborate. Source: Logz.io
A line chart of latency over time, with the latest latency value overlaid. Source: Logz.io

Create a clean layout with an intuitive flow

A clean and uncluttered layout is essential for effective dashboard design. Avoid overcrowding the dashboard with too much information and ensure that the design is easy to read and understand.

Top view of service performance metrics, then drill down into individual operations. Source: Logz.io

Keep the layout consistent

Once you define a layout that works well, keep it consistent, both within the same dashboard and across dashboards. Seeing similar information presented in different ways in different places can be very confusing. We all know that green is good and red is bad, right? You wouldn’t want to see blue signifying bad anywhere. Consistent color coding is one aspect, but consistency also applies to using the same graph type for each data type, horizontal vs. vertical alignment, and so forth.

Spike correlation is easy to see when panels share the same width and time frame. Source: Logz.io
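One way to keep color coding consistent is to define it once and have every dashboard definition read from the same place. The sketch below is a minimal illustration of that idea; the `STATUS_COLORS` mapping and the `color_for` helper are hypothetical names, not part of any particular tool.

```python
# Hypothetical shared palette: defined once, reused by every dashboard.
STATUS_COLORS = {
    "ok": "green",        # green is good
    "warning": "yellow",
    "error": "red",       # red is bad
}


def color_for(status: str) -> str:
    """Return the shared color for a status, failing loudly on unknown names."""
    try:
        return STATUS_COLORS[status]
    except KeyError:
        raise ValueError(
            f"unknown status {status!r}; add it to STATUS_COLORS"
        ) from None
```

Failing on an unknown status, rather than falling back to a default color, keeps an ad hoc color from quietly slipping into a single dashboard.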

Correlate between different dashboards and views

Correlation doesn’t end with panels and visualizations within the same dashboard. You may need to correlate a certain metric with metrics found on other dashboards, drill down into a specialized dashboard to carry on the investigation, or even correlate with other types of telemetry such as logs or traces. Take the time to link to the related dashboard, so that the transition keeps the existing investigative context. Some tools have built-in support for such links, such as Dashboard and Panel data links in Grafana open source, or Logz.io’s telemetry correlation, which carry the search context (the time frame, filters, etc.) over to the next dashboard. If you can’t incorporate links natively, add notes to the dashboard as a static panel, with relevant guidelines, markdown and links.
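When links have to be assembled by hand, the main point is to carry the time frame and filters along in the URL. The sketch below shows one way to do that; the `build_dashboard_link` helper, the example host, and the dashboard path are all hypothetical. The `from`, `to`, and `var-<name>` query parameters follow the URL convention used by Grafana dashboards, so treat them as an assumption and check what your own tool expects.

```python
from urllib.parse import urlencode


def build_dashboard_link(base_url, dashboard_path, time_from, time_to, filters=None):
    """Build a link to another dashboard that keeps the time frame and filters.

    `filters` maps a dashboard variable name to the value to carry over.
    """
    params = [("from", time_from), ("to", time_to)]
    # Sort the filters so the same context always produces the same URL.
    for name, value in sorted((filters or {}).items()):
        params.append((f"var-{name}", value))
    return f"{base_url.rstrip('/')}/{dashboard_path.lstrip('/')}?{urlencode(params)}"


# Hypothetical host, dashboard path, and variable name.
link = build_dashboard_link(
    "https://grafana.example.com",
    "/d/abc123/service-latency",
    "now-6h",
    "now",
    {"service": "checkout"},
)
```

A link built this way can be placed in a static panel, so that whoever follows it lands on the next dashboard with the same time frame and service filter already applied.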

Annotate thresholds, alerts and events on the graph

Does your metric have known limits, goals or threshold values to be aware of? Visualize them as horizontal markers on the graph. Classic examples are maximum memory or disk capacity, or the error budget. You can also annotate reference numbers such as last week’s average. Similarly, annotate thresholds of alerts that have been defined on that metric or panel. In the following example, you can see the warning threshold annotated in yellow at 50% utilization, and error threshold annotated in red at 80% utilization:

A line graph with a warning threshold (yellow) and an error threshold (red). Source: Logz.io
Deployment marker on top of the graphs. Source: Logz.io
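The warning and error thresholds from the example above (50% and 80% utilization) can be kept as data, so that the markers on the graph and the alert rules read the same numbers. This is a minimal sketch under that assumption; the function names and the shape of the step list are illustrative, loosely modeled on the base-plus-steps threshold lists some dashboard tools use, and are not an exact schema for any of them.

```python
WARNING_PCT = 50.0  # warning threshold from the example above (yellow)
ERROR_PCT = 80.0    # error threshold from the example above (red)


def threshold_steps(warning=WARNING_PCT, error=ERROR_PCT):
    """Return threshold markers as an ordered list of color steps.

    The first step has no value and acts as the base color below `warning`.
    """
    if not 0 <= warning < error:
        raise ValueError("expected 0 <= warning < error")
    return [
        {"color": "green", "value": None},
        {"color": "yellow", "value": warning},
        {"color": "red", "value": error},
    ]


def classify(utilization_pct, warning=WARNING_PCT, error=ERROR_PCT):
    """Map a utilization percentage to 'ok', 'warning', or 'error'."""
    if utilization_pct >= error:
        return "error"
    if utilization_pct >= warning:
        return "warning"
    return "ok"
```

Keeping the numbers in one place means that a change to an alert threshold also moves the horizontal marker on the graph, so the two stay in step.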

Overlay values on the same graph only when it makes sense

As engineers, we want to be efficient and overlay data on the same graph to ease correlation. But this oftentimes goes against the principle of simplicity and easy UX. If you have 100 instances, plotting their CPU utilization on a single graph can be overwhelming. Avoid plotting too many values on a single graph, to make sure it’s manageable. One useful pattern here is to plot only the “10 highest” CPU consumers.

Kafka consumer lag and latency metrics plotted together with strong correlation. Source: Logz.io
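The “10 highest” pattern mentioned above amounts to ranking the series by their current value and keeping only the top few before plotting. Here is a minimal sketch, assuming the latest CPU utilization per instance is already available as a dictionary; the `top_consumers` name and the input shape are hypothetical.

```python
import heapq


def top_consumers(cpu_by_instance, n=10):
    """Return the n (instance, cpu) pairs with the highest CPU, highest first."""
    return heapq.nlargest(n, cpu_by_instance.items(), key=lambda item: item[1])


# Hypothetical data: 100 instances with made-up utilization values.
usage = {f"node-{i}": float(i) for i in range(100)}
top = top_consumers(usage)  # only these 10 series would be plotted
```

Many query languages offer an equivalent top-N operator, which is usually preferable because it filters the series before they ever reach the dashboard.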

Endnote

It’s not enough to have all the data you need. It also needs to be accessible and usable, so that people can actually understand what’s going on and investigate issues. When designing a dashboard, keep the user experience in mind. Make sure the data is presented in a way that allows users to quickly and easily understand what’s going on at a glance. This means ensuring that the dashboard is free of data overload, and that it uses consistent conventions and visual cues to make the data more accessible.


Technology evangelist, innovation enthusiast, Startup Nation resident, proud father. https://twitter.com/horovits http://linkedin.com/in/horovits/
