Scaling Platform Engineering: Shopify’s Blueprint
Platform Engineering is a hot topic these days. We saw the hype around it in 2023, and I expect to see it become production-grade as we move into 2024. I wanted to look into this topic and learn from those who have already implemented it at scale: the e-commerce hyperscaler Shopify.
In a recent episode of OpenObservability Talks, I had the pleasure of hosting Aparna Subramanian, the Director of Production Engineering at Shopify. Previously, Aparna was Director of Engineering at VMware where she was a founding member of Tanzu on vSphere, a Kubernetes platform for the hybrid cloud.
E-Commerce Scale at Shopify
We kicked off our conversation by peering into the sheer magnitude of Shopify’s operations. Aparna painted a vivid picture of the colossal scale at which Shopify operates, especially during peak events like Black Friday and Cyber Monday.
“During this past Black Friday, Cyber Monday [2023], our application servers were handling 58 million requests per minute, and the database was handling 19 million queries per second. If we look at our streaming infrastructure, that was about 29 million messages per second of stream processing,” Aparna shared, setting the stage for the immense infrastructure that the Platform Engineering team needs to master at Shopify.
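To put these figures on a common footing, here is a small back-of-the-envelope conversion of the quoted numbers to per-second rates. The three input values come from the quote above; the variable names are mine, and the queries-per-request ratio is only a rough illustration that assumes the peaks coincided, not a figure Shopify reported.

```python
# Figures quoted by Aparna for Black Friday / Cyber Monday 2023.
REQUESTS_PER_MINUTE = 58_000_000
DB_QUERIES_PER_SECOND = 19_000_000
STREAM_MESSAGES_PER_SECOND = 29_000_000


def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


requests_per_second = per_second(REQUESTS_PER_MINUTE)

# Rough ratio only: assumes the request peak and the query peak
# happened at the same time, which the quote does not state.
queries_per_request = DB_QUERIES_PER_SECOND / requests_per_second

print(f"App servers: ~{requests_per_second:,.0f} requests/second")
print(f"Database:    ~{queries_per_request:.1f} queries per request (rough)")
print(f"Streaming:   {STREAM_MESSAGES_PER_SECOND:,} messages/second")
```

In other words, the application servers were handling just under a million requests every second.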
Evolution of Platform Engineering at Shopify
Our discussion swiftly shifted to the genesis of Platform Engineering at Shopify. Aparna took us back to 2016, when Shopify faced the challenge of multiple teams deploying to production in different ways. The team realized that adopting DevOps had transferred ops ownership to the developers without giving them the right tools or the time to work on these problems, and that realization led to the birth of Platform Engineering at Shopify.
“Shopify decided to take the approach of Platform Engineering. And have a platform where all of these tools are built, custom-built for our business, custom-built for our developers, and there’s one unified way of deploying things to production,” Aparna explained, emphasizing the need for a unified and efficient deployment strategy.
Today Platform Engineering at Shopify is structured in a layered model. There is an infrastructure group, which contains the data platform, observability platform, stateful systems, and streaming platform. There is also the production platform, the bottom layer that supports all of these platforms, which in turn support the application developers. This enables Shopify to move fast at scale. Aparna gave a sense of the team’s velocity: “We ship about 1,000 PRs a day. And the application itself gets deployed to production 107 times a day.”
Kubernetes: The Backbone of Shopify’s Platform
Shopify’s journey into the cloud-native landscape took a pivotal turn with the adoption of Kubernetes, and Aparna shed light on how Kubernetes serves as the backbone of their operations, with about 400 Kubernetes clusters running across the fleet. “So at Shopify, everything runs on Kubernetes, our stateless workloads, our applications, and the stateful workloads, all of our databases,” she stated.
What stood out was the concept of a “platform of platforms.” Even though everything is unified under Kubernetes, Shopify’s infrastructure is layered with specialized platform teams owning and managing different aspects, such as the database platform, streaming platform, and observability platform.
Observability Shared Between Platform and Application
One of the key success factors highlighted by Aparna is the clear division of responsibilities between application developers and platform engineers. At Shopify, everyone is responsible for monitoring and being on call, but application developers are responsible for the application portion and the platform engineers are responsible for the platform and the infrastructure.
“And when there’s an issue, we all come together to troubleshoot and figure out what’s going on in the application,” Aparna remarked, highlighting how collaboration, combined with distinct roles, keeps troubleshooting streamlined when issues arise.
Internal Developer Platform at Shopify
At Shopify, as in many other places I see, there is a product approach to Platform Engineering. The Platform Engineering team develops a product for the company’s internal developer community, and from the application developers’ side, everything is self-service.
From the observability side, Aparna shared that “Platform Engineering provides all of the tools for monitoring production. Alerting, observability, dashboards, there’s resiliency, there’s on-call incident management teams.”
While the platform teams own and manage their respective platforms, application developers own their application code, and are responsible for shipping it through the release cycle and deploying it to production.
Balancing Flexibility and Abstraction in the Platform
One central challenge I keep encountering in Platform Engineering teams is the topic of balancing flexibility and abstraction, and I was curious how Shopify addressed that. Aparna admitted it’s a work in progress, and that they “started with a more abstracted, a layer on top of Kubernetes, and that actually did not work very well in the past.”
From this experience, they realized they can’t hide Kubernetes from the application developers. The current balance relies on meaningful defaults that work for the majority of developers, along with a manifest that power users can manipulate.
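The "defaults plus an editable manifest" idea can be sketched as a recursive merge, where anything a developer leaves out falls back to the platform's default. This is a minimal illustration and not Shopify's actual tooling: the field names, the default values, and the `merge_manifest` helper are all hypothetical.

```python
from copy import deepcopy

# Hypothetical platform defaults; the real fields and values were not
# discussed in the episode.
PLATFORM_DEFAULTS = {
    "replicas": 3,
    "resources": {"cpu": "500m", "memory": "512Mi"},
    "autoscaling": {"enabled": True, "max_replicas": 10},
}


def merge_manifest(defaults: dict, overrides: dict) -> dict:
    """Return a new manifest where overrides win and defaults fill the gaps.

    Nested dictionaries are merged key by key, so a power user can change
    a single setting without restating the whole section.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_manifest(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Most developers supply nothing and get the defaults as-is.
typical = merge_manifest(PLATFORM_DEFAULTS, {})

# A power user overrides only the memory setting.
power_user = merge_manifest(PLATFORM_DEFAULTS, {"resources": {"memory": "2Gi"}})
print(power_user["resources"])
```

The design choice here is that the override is sparse: the developer states only what differs, and Kubernetes-level detail stays visible and editable rather than hidden behind an abstraction.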
Aparna said that the Platform team focuses on providing a “golden path” while encouraging developers to challenge and contribute to the evolving platform. “We embrace change, so we want the dev teams to actually push the boundaries.”
Handling Peaks: Black Friday, Cyber Monday, and More
With the holiday season just over, our conversation naturally drifted towards how Shopify prepares for peak events. Aparna provided insights into the meticulous planning involved, from capacity estimation to resiliency testing. She shared that Shopify disables the default automatic scaling for these events and instead relies on over-provisioned capacity sized to meet the extreme volumes experienced during Black Friday and Cyber Monday.
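As a rough illustration of what over-provisioning ahead of a peak involves, the sketch below sizes a fixed replica count from an estimated peak plus a safety margin. The episode did not describe Shopify's actual capacity formula, so the function, its parameters, and the example numbers are all assumptions made for illustration.

```python
def replicas_for_peak(peak_rps: int, rps_per_replica: int,
                      headroom_percent: int = 30) -> int:
    """Estimate a fixed replica count for a forecast peak.

    peak_rps: forecast peak requests per second (hypothetical input).
    rps_per_replica: load one replica can sustain (hypothetical input).
    headroom_percent: extra margin on top of the forecast.
    Integer arithmetic is used so the rounding is exact.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if peak_rps < 0 or headroom_percent < 0:
        raise ValueError("peak_rps and headroom_percent must be non-negative")
    needed = peak_rps * (100 + headroom_percent)
    capacity = rps_per_replica * 100
    # Ceiling division: always round up to a whole replica.
    return -(-needed // capacity)


# Illustrative numbers only, not Shopify's.
print(replicas_for_peak(peak_rps=1000, rps_per_replica=50, headroom_percent=20))
```

With autoscaling disabled, a number like this would be set in advance and held for the duration of the event, trading some idle capacity for predictability.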
Beyond these major events, there are also flash sales that an individual merchant can launch, which can generate peak traffic. Having an enterprise merchant notify Shopify in advance can help prepare for such a peak. But, as Aparna says, “as a platform that supports millions of merchants, we don’t have a way for everybody to tell us,” so for the most part handling these peaks is fully automated.
Want to learn more? Check out the OpenObservability Talks episode: Scaling Platform Engineering: Shopify’s Blueprint