Shopify’s Journey to Planet-Scale Observability
Shopify operates its e-commerce platform at massive scale, running thousands of services and processing billions of events per second. In a previous blog post, I shared how Shopify runs Platform Engineering to meet this scale. But what about observability?
To tackle the challenges of observability at this scale, Shopify built Observe, an in-house observability stack that makes use of open-source tools and specifications. In doing so, they replaced an older vendor-based system in an awe-inspiring migration project. But why build their own stack? Which open-source tools did they use? How did they shape the user experience to their needs?
I sat down with Elijah McPherson, the engineering director who was brought on board to lead Shopify’s observability overhaul, to hear his story about this monumental undertaking on the latest episode of OpenObservability Talks.
Why rebuild the observability stack in-house?
“At Shopify’s scale, no off-the-shelf solution could meet our reliability, performance, and cost needs while giving us full control over our data and how we operate,” said Elijah. The sheer scale of Shopify’s operations made commercial solutions prohibitively expensive and, at times, operationally restrictive. The engineering team needed a system that could ingest massive volumes of logs, metrics, and traces without compromising performance or inflating costs.
For example, Shopify’s previous third-party observability solution was projected to cost tens of millions of dollars annually as data ingestion continued to grow. By building in-house, they reduced storage costs by over 80%, while improving query performance and system reliability.
Tailoring observability for organizational processes
Observability is not just about collecting table-stakes data; it needs to fit within an organization’s workflow. “We designed Observe to work the way Shopify engineers think and operate, rather than forcing them into a vendor’s way of doing things,” Elijah explained.
This meant creating integrations with Shopify’s internal incident response processes and ensuring that every alert, dashboard, and query was aligned with how teams troubleshoot production issues. A key innovation was their automatic dependency mapping, which reduced mean-time-to-resolution (MTTR) by helping engineers pinpoint failing services more quickly.
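To illustrate how dependency mapping can narrow down a failing service, here is a minimal sketch that derives caller-to-callee edges from parent/child span links and flags the deepest failing services. It is not Shopify's implementation: the span fields (`span_id`, `parent_id`, `service`, `error`) and both function names are assumptions made for this example.

```python
from collections import defaultdict


def build_dependency_map(spans):
    """Derive caller -> callee service edges from parent/child span links.

    Each span is a dict with hypothetical keys: span_id, parent_id, service.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = span.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or parent not present in this batch
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # ignore spans internal to one service
            edges[caller].add(callee)
    return {caller: sorted(callees) for caller, callees in edges.items()}


def suspect_services(spans, dep_map):
    """Return failing services none of whose direct dependencies are failing.

    These are the likeliest origins of an error that propagated upstream.
    """
    failing = {s["service"] for s in spans if s.get("error")}
    return sorted(
        svc for svc in failing
        if not any(dep in failing for dep in dep_map.get(svc, []))
    )


# Made-up trace: storefront -> checkout -> payments-db, all reporting errors.
example_spans = [
    {"span_id": "a", "parent_id": None, "service": "storefront", "error": True},
    {"span_id": "b", "parent_id": "a", "service": "checkout", "error": True},
    {"span_id": "c", "parent_id": "b", "service": "payments-db", "error": True},
    {"span_id": "d", "parent_id": "b", "service": "checkout", "error": False},
]
example_map = build_dependency_map(example_spans)
print(example_map)
print(suspect_services(example_spans, example_map))
```

In this made-up trace all three services report errors, but only the one at the bottom of the call chain is flagged, which is the kind of shortcut that can cut time spent chasing symptoms upstream.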
The open source stack behind Shopify’s observability
Shopify’s observability stack is built on open source technologies, including StatsD and Prometheus for metrics, Loki for logs, Tempo for distributed tracing, Grafana for visualization, and ClickHouse and Apache Parquet for the backend analytics storage.
“We believe in open source — not just using it, but contributing back. Every challenge we solved, we tried to upstream where possible,” Elijah emphasized.
While Grafana served as the default visualization layer, Shopify needed custom enhancements. “We built a set of plugins that let us combine business and operational data in ways that standard Grafana couldn’t,” Elijah noted.
One example is a custom panel that overlays customer conversion rates with infrastructure performance metrics. This allowed teams to instantly see if slow database queries were impacting checkout rates, leading to faster incident resolution.
The functionality of Shopify’s observability platform
The new observability platform, codenamed Observe, centralizes logs, metrics, traces, and profiling data. It lets engineers view the different telemetry types in one UI with a unified experience, and correlate events across different domains efficiently.
“One of the biggest improvements we made was allowing real-time correlation between traces and business metrics, helping teams understand not just system health, but customer impact,” said Elijah. This feature was crucial in identifying performance bottlenecks and prioritizing fixes based on actual user experience.
At a recent internal conference, Shopify held in-depth talks about these and other technical aspects, including:
- Shopify Observe product overview
- Planet-Scale Metrics Ingestion
- Logging Ingestion Pipeline, backed by Vector, ClickHouse, and a custom Logging API
- Long-Term Metrics Storage, supporting distributed query with unbounded retention using Google Cloud Storage, Thanos, and Parquet
- Scaling Metrics Infrastructure Dynamically to meet demand and optimize resource allocation and performance
- Network Monitoring with eBPF, Vector, and ClickHouse to manage millions of events per second
- Insights on OpenTelemetry’s role, challenges, and future at Shopify
- Profiling Massive Ruby Apps
Check out the recording on Elijah’s post.
Adopting open standards
Open standards were a guideline for Shopify. “Lock-in is a risk we refuse to take. Open standards ensure that we remain in control of our data,” Elijah explained. Shopify heavily invested in OpenTelemetry for its tracing pipeline.
By moving to OpenTelemetry, Shopify achieved seamless compatibility across its services and removed proprietary agents that were previously causing 15–20% performance overhead on high-throughput services.
On the metrics side, Shopify aligned with the standards provided by OpenMetrics, the Prometheus metrics exposition format, and StatsD.
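For readers unfamiliar with these metric formats, the following minimal sketch renders a single sample in the Prometheus text exposition format and in the plain StatsD line format. The helper names and the metric names are hypothetical and chosen only for illustration; real services would normally use a client library rather than formatting lines by hand.

```python
def _escape_label_value(value):
    """Escape backslash, double quote, and newline, as the Prometheus text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def prometheus_line(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def statsd_line(name, value, metric_type):
    """Render one sample as: name:value|type, where type is c (counter), g (gauge), or ms (timer)."""
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported StatsD metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


# Hypothetical metric names, for illustration only.
print(prometheus_line("checkout_requests_total", {"status": "200", "region": "us"}, 1027))
print(statsd_line("checkout.requests", 1, "c"))
```

Because both formats are plain text and openly specified, any compatible agent or backend can consume them, which is the practical meaning of avoiding lock-in at the telemetry layer.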
The importance of product sense in internal developer platforms
“An internal observability platform is a product, and it needs product thinking,” Elijah emphasized. This meant treating engineers as customers, gathering feedback, iterating on UX, and ensuring adoption.
As a former product manager for developer platforms, I couldn’t agree more. Too often I see engineering organizations assume that a platform meant for internal engineers needs no product owner. Nothing could be further from the truth, as Platform Engineering taught us so well.
One critical success factor was reducing friction in query writing. “We built a natural-language-based query assistant that auto-suggests PromQL and LogQL queries based on common troubleshooting patterns,” Elijah said. The assistant saves engineers 30–40% of their time when diagnosing issues.
Observability into business health
Observability at Shopify goes beyond infrastructure. “We provide dashboards that correlate request latency with checkout abandonment rates, so engineering teams can prioritize fixes that drive revenue,” Elijah shared.
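As a rough illustration of what correlating a system metric with a business metric can mean in practice, the sketch below computes the Pearson correlation between two aligned time series. The data points are made up for the example, and the function is a generic statistical helper, not a description of how Observe computes its dashboards.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up samples, one per time bucket: p95 request latency in milliseconds
# and the share of checkouts abandoned in that same bucket.
latency_ms = [120, 135, 180, 240, 410, 390]
abandonment_rate = [0.21, 0.22, 0.25, 0.29, 0.41, 0.38]

r = pearson(latency_ms, abandonment_rate)
print(f"latency vs. abandonment correlation: {r:.3f}")
```

A coefficient close to 1 suggests the two series move together, which is a prompt for investigation rather than proof that latency causes abandonment.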
This shift helped Shopify recover millions in lost revenue by identifying performance bottlenecks before they impacted customers at scale. Elijah also shared examples from the marketing domain about measuring the success of campaigns and conversion rates, and mapping them to the RED (rate, errors, duration) metrics of the respective microservices.
Other examples relate to product management, such as merchant experience monitoring. By linking system metrics to business KPIs, Shopify ensures engineering decisions are driven by customer impact. This shows the power of SaaS observability, and how it can serve your product managers, marketing, and company leadership.
How to run a migration project for a live production platform
Migrating observability systems in a high-scale production environment is risky. Shopify followed a dual-write approach: “We ran both the old and new stack in parallel, ensuring parity before switching over,” Elijah described.
This gradual migration helped them catch data discrepancies early, avoiding major incidents. The team also used canary deployments to verify performance improvements before rolling out changes company-wide.
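A parity check during a dual-write period can be sketched as follows. This is a minimal example under stated assumptions, not Shopify's tooling: it assumes both stacks can export a snapshot of series values as a plain dictionary, and the function names and the 1% tolerance are hypothetical.

```python
import math


def parity_report(old, new, rel_tol=0.01):
    """Compare per-series values exported from the old and new stacks.

    old and new map a series name to its value over the same time window.
    """
    missing = sorted(set(old) - set(new))  # present in old stack only
    extra = sorted(set(new) - set(old))    # present in new stack only
    mismatched = [
        key for key in sorted(set(old) & set(new))
        if not math.isclose(old[key], new[key], rel_tol=rel_tol, abs_tol=1e-9)
    ]
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


def ready_to_cut_over(report):
    """Parity holds only when no series is missing, extra, or mismatched."""
    return not any(report.values())


# Made-up snapshots from the two stacks.
old_stack = {"checkout.requests": 1000.0, "checkout.errors": 12.0, "cart.requests": 500.0}
new_stack = {"checkout.requests": 1004.0, "checkout.errors": 15.0, "search.requests": 80.0}

report = parity_report(old_stack, new_stack)
print(report)
print("ready to cut over:", ready_to_cut_over(report))
```

Running a comparison like this on a schedule throughout the parallel period surfaces discrepancies while the old stack is still the source of truth.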
My own advice is to use these migration projects as compelling events to do data cleanup, remove unused and broken dashboards, and revamp and align the metadata enrichment: the sort of things you usually don’t get to in the day-to-day, but can do as part of a system-wide migration.
Endnote: should your organization build in-house observability?
Building an in-house observability stack isn’t for everyone. “If you’re operating at Shopify’s scale, where cost, performance, and control become existential concerns, then it makes sense. But for most teams, managed solutions are more than enough,” Elijah advised.
Shopify’s journey is a testament to how observability, when designed around an organization’s needs, can be a strategic advantage rather than just an operational necessity.
Want to learn more? Check out the OpenObservability Talks episode: Shopify’s Journey to Planet-Scale Observability.