Cracking Performance Issues in Microservices with Distributed Tracing

Dotan Horovits (@horovits)
2 min read · Nov 12, 2022


Microservices architecture is the new norm for building products these days. An application made up of hundreds of independent services enables teams to work independently and accelerate development. However, such highly distributed applications are also harder to monitor.

When hundreds of services are traversed to satisfy a single request, it becomes difficult to investigate system issues, such as a customer request that returns a failure code or one that suddenly becomes very slow to respond.

While logs are an established tool for root cause analysis, they are harder to use in a microservices architecture than in a monolith. There can be hundreds or thousands of services involved, each handling a high volume of requests per second.

With log entries scattered across numerous log files, how can you determine which ones are relevant, or put them together according to the execution flow?

Distributed Tracing Fundamentals

This has given rise to the new discipline of Distributed Tracing. Google released a monumental research paper in 2010 following their experience building Dapper, their own in-house Distributed Tracing system. According to OpenTracing.org:

“Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”

With Distributed Tracing, our application reports tracing data for each service and operation that is invoked as part of the request execution. These data units, called spans, are collected by the analytics backend, ordered by causality, and then visualized, typically as a Gantt chart. In the example below, we can see a trace that starts with the HTTP GET /dispatch operation invoked on the frontend service and then flows through a series of services and operations to fulfill the request.

Timeline view visualizing a distributed trace in Jaeger UI. Source: Logz.io Distributed Tracing

In this example, it’s easy to see where most of the time is spent and to spot potential performance inefficiencies, such as a series of sequential calls that, if run concurrently, could reduce the overall request latency.
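The span model and the sequential-call observation above can be sketched in a few lines of Python. This is a minimal illustration, not how Jaeger or any other tracing backend is implemented: the `Span` fields, the helper functions, the downstream service names, and all timings are hypothetical. Only the frontend's HTTP GET /dispatch operation comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work reported by a service (hypothetical, simplified fields)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms


def order_by_causality(spans):
    """Return (depth, span) pairs: parents before children, siblings by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    ordered = []

    def visit(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            ordered.append((depth, span))
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return ordered


def render_timeline(spans, ms_per_char=10):
    """Render a rough text Gantt chart, one row per span."""
    rows = []
    for depth, span in order_by_causality(spans):
        label = f"{'  ' * depth}{span.service}: {span.operation}"
        bar = " " * (span.start_ms // ms_per_char)
        bar += "#" * max(1, span.duration_ms // ms_per_char)
        rows.append(f"{label:<32}|{bar}")
    return "\n".join(rows)


def potential_saving_ms(spans, parent_id):
    """Estimate time saved if the direct children of parent_id ran concurrently.

    Assumes the child calls are independent of each other, which a real
    system would have to verify before parallelizing them.
    """
    kids = [s for s in spans if s.parent_id == parent_id]
    if not kids:
        return 0
    window = max(s.end_ms for s in kids) - min(s.start_ms for s in kids)
    longest = max(s.duration_ms for s in kids)
    return max(0, window - longest)


# Made-up trace: spans arrive out of order, as they might from separate services.
example_trace = [
    Span("d", "a", "route", "compute", 200, 90),
    Span("b", "a", "customer", "lookup", 10, 80),
    Span("a", None, "frontend", "HTTP GET /dispatch", 0, 300),
    Span("c", "a", "driver", "find_nearest", 100, 90),
]

if __name__ == "__main__":
    print(render_timeline(example_trace))
    print("Potential saving (ms):", potential_saving_ms(example_trace, "a"))
```

With the made-up timings above, the three downstream calls run one after another inside a 280 ms window while the longest of them takes 90 ms, so the sketch estimates a saving of 190 ms if they could run concurrently. A real tracing backend derives the same kind of insight from the parent and child relationships and timestamps carried by each span.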

To learn more about this topic, please read my full article published in DevPro Journal, and watch my recent talk at ContainerDays 2022 conference. If you’re looking to take your distributed tracing practice to the next level, check out Logz.io.


Dotan Horovits (@horovits)

Technology evangelist, CNCF Ambassador, open source enthusiast, DevOps aficionado. Found @horovits everywhere