Retrospect on the AWS Outage and Resilient Cloud-Based Architecture
Key takeaways:
- A single availability zone (AZ) outage can take down your service
- Multi-AZ strategy may not be enough
- Patterns, guidelines and best practices for disaster recovery
Yesterday's AWS outage in US-EAST-1 took down many services. Again.
Yesterday Amazon reported an outage in a single availability zone (AZ) in its US-EAST-1 region — AWS’s largest region.
You may think, “one availability zone, how serious could that be?” Well, it apparently took down the services of many sites, including Slack and Hulu.
This is not new. In fact, it’s at least the 3rd AWS outage this month alone. Let’s face it, we’ve been struggling with similar outages for years. Perhaps it’s a good time to revisit common practices for business continuity and disaster recovery (BCDR). Here’s something I wrote a while back, after another major AWS outage in the same US-EAST-1 region, which is quite relevant these days:
Business Continuity and Disaster Recovery: Guidelines and Best Practices
Design for failure and Chaos Engineering
The first and most fundamental principle in building a robust architecture is to design for failure. As SmugMug states:
… we designed for failure from day one. Any of our instances, or any group of instances in an AZ [Availability Zone], can be “shot in the head” and our system will recover …
This principle should be applied throughout the design, development, deployment and maintenance stages of the system. SimpleGeo describes an excellent working practice:
… At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we’re coding, and we joke about it at lunch …
Some companies have even embedded Chaos Engineering practices into their work procedures, deliberately simulating random failures. Netflix pioneered this approach with its Chaos Monkey service, built to get its engineering team used to a “constant level of failure in the cloud”.
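As a rough illustration of the idea (not Netflix’s actual implementation), here is a minimal Chaos-Monkey-style sketch that terminates one random EC2 instance out of a group explicitly opted in for chaos testing. The tag name and region are assumptions made for the example:

```python
import random
import boto3

# Minimal Chaos-Monkey-style sketch: terminate one random instance from a
# group that explicitly opted in via a tag. The tag key/value and region
# are illustrative assumptions, not part of Netflix's actual tool.
ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Chaos test: terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```

The point of running something like this regularly is that recovery stops being a rare emergency procedure and becomes a routine, tested code path.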
Stateless and autonomous services
If possible, divide your business logic into stateless services to allow easy fail-over and scalability. Netflix explains the fail-over benefits:
… if a server fails it’s not a big deal. In the failure case requests can be routed to another service instance and we can automatically spin up a new node to replace it …
Twilio aggregates their stateless services into homogeneous pools, which provide both fail-over and elasticity:
… The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load …
To contain the ripple effect of a failure, make the services well-encapsulated, as SmugMug states:
… Make your system divided into well-encapsulated components that can fail individually without failing the entire system …
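To make the retry-on-another-instance idea concrete, here is a minimal sketch of an upstream caller working against a pool of interchangeable, stateless instances. The host names and endpoint path are made up for illustration:

```python
import requests

# Hypothetical pool of interchangeable, stateless service instances.
RECORDING_POOL = [
    "http://rec-1.internal:8080",
    "http://rec-2.internal:8080",
    "http://rec-3.internal:8080",
]

def call_stateless_pool(path, timeout=2.0):
    """Try each instance in the pool until one answers; because the
    service keeps no state, any instance can serve the request."""
    last_error = None
    for base_url in RECORDING_POOL:
        try:
            resp = requests.get(base_url + path, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # instance failed or timed out; try the next one
    raise RuntimeError(f"all instances in the pool failed: {last_error}")

# Usage: call_stateless_pool("/recordings/123")
```

Because the instances are stateless and homogeneous, growing or shrinking the pool is just a matter of changing the instance list, which is what gives Twilio both fail-over and elasticity.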
Redundant hot copies spread across zones
By replicating your data to other zones, you insulate your service from zone-wide failure. As Netflix explains:
… we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby …
Twilio also emphasizes configuring timeouts and retries to avoid delays when failing over to another copy:
… By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services …
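Here is a sketch of what that can look like from the client side, with the zone names and replica addresses invented for the example: read from the hot copy in the local zone first, and on a failure or a quick timeout fall over to a copy in another zone.

```python
import requests

# Hypothetical hot copies of the same data set, one replica per zone.
REPLICAS_BY_ZONE = {
    "us-east-1a": "http://data-a.internal:9000",
    "us-east-1b": "http://data-b.internal:9000",
    "us-east-1c": "http://data-c.internal:9000",
}

def read_with_zone_failover(key, local_zone="us-east-1a", timeout=0.5):
    """Prefer the replica in our own zone; on failure or a quick timeout,
    retry against the hot copies in the remaining zones."""
    ordered = [local_zone] + [z for z in REPLICAS_BY_ZONE if z != local_zone]
    for zone in ordered:
        try:
            resp = requests.get(f"{REPLICAS_BY_ZONE[zone]}/get/{key}",
                                timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # this zone's copy is unreachable; try another zone
    raise RuntimeError(f"no zone could serve key {key!r}")
```

The aggressive timeout is what keeps a zone-wide failure from turning into a latency problem for your users while the failover happens.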
Spread across several public cloud vendors and/or private cloud
Most IT organizations avoid depending on a single ISP by keeping another ISP as a backup. Even Amazon uses this strategy internally, ensuring the high availability of its cloud with a primary and a backup network. Similarly, you should avoid depending on a single cloud vendor by keeping another vendor as a backup. This holds true even if the vendor provides a certain level of resilience, as we saw with Amazon’s multi-AZ failure in the recent outage.
Many of the companies that survived AWS outages owe it to using their own datacenters (a hybrid cloud model), to using other vendors (a multi-cloud model), or to using AWS’s US West region. SmugMug, for instance, kept their critical data in their own datacenter:
… the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all — it all still lives in our own datacenters, where we can provide predictable performance …
and also recommends that you “spread across many providers”, while admitting that
… This is getting more and more difficult as AWS continues to put distance between themselves and their competitors …
When considering using different AWS regions for resilience, it’s interesting to note Amazon’s statement about the effort required on the application side to work with multiple regions, which makes you wonder whether it’s really that much easier than working with a different vendor altogether:
… if you want to move data between Regions, you need to do it via your applications as we don’t replicate any data between Regions on our users’ behalf. You also need to use a separate set of APIs to manage each Region. Regions provide users with a powerful availability building block, but it requires effort on the part of application builders to take advantage of this isolation …
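In practice, application-level replication across regions can be as simple in principle as writing every object once per regional endpoint. A minimal sketch, with the bucket names assumed for illustration:

```python
import boto3

# Application-level cross-region replication sketch: the application itself
# writes each object to a bucket in every region, since (per Amazon's
# statement above) data is not replicated between regions on the user's
# behalf. Bucket names are illustrative assumptions.
REGIONAL_BUCKETS = {
    "us-east-1": "myapp-data-us-east-1",
    "us-west-2": "myapp-data-us-west-2",
}

def put_everywhere(key, body):
    for region, bucket in REGIONAL_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        s3.put_object(Bucket=bucket, Key=key, Body=body)

# Usage: put_everywhere("orders/42.json", b'{"id": 42}')
```

Once your application already owns the replication logic, pointing the second endpoint at a different vendor’s object store instead of a second AWS region is not a dramatically bigger step.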
Automation and Monitoring
Automation is key. Your application needs to automatically pick up alerts on system events, and it should be able to react to those alerts automatically. As SimpleGeo’s architect states:
… Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated …
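As a rough sketch of “automate everything”, here is what automated replacement of a failed node might look like. The health-check URL, AMI and instance type are placeholders, not a real setup:

```python
import boto3
import requests

ec2 = boto3.client("ec2", region_name="us-east-1")

def ensure_healthy(instance_id, health_url,
                   ami="ami-12345678", instance_type="m1.small"):
    """If a node fails its health check, terminate it and automatically
    spin up a replacement. AMI, instance type and URL are placeholders."""
    try:
        requests.get(health_url, timeout=2).raise_for_status()
        return instance_id  # node is healthy, nothing to do
    except requests.RequestException:
        ec2.terminate_instances(InstanceIds=[instance_id])
        replacement = ec2.run_instances(ImageId=ami,
                                        InstanceType=instance_type,
                                        MinCount=1, MaxCount=1)
        return replacement["Instances"][0]["InstanceId"]
```

The same pattern (detect, discard, re-create) extends to backups, restores, configuration and deployments, which is exactly SimpleGeo’s point.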
Interesting to see that even Netflix that took pride in surviving the failure back in 2012, admitted that the manual responses their engineers used this time will not work in the future, as they grow to a
… worldwide service with dozens of availability zones, even with top engineers we simply won’t be able to scale our responses manually …
Detailed alerting mechanisms are also essential for manual control of the system, as Bizo states:
… we have our own internal alarms and dashboards that give us up to the minute metrics such as request rate, cpu utilization etc. …
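Bizo’s alarms and dashboards are internal, but as an illustrative stand-in, a comparable “up to the minute” alarm on CPU utilization could be defined with CloudWatch like this (alarm name, instance id, threshold and SNS topic are assumptions):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Illustrative stand-in for an internal alarm: notify the on-call topic when
# average CPU utilization of one instance stays above 80% for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="frontend-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:on-call"],
)
```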
Avoiding ACID services and leveraging NoSQL solutions
The CTO of SimpleGeo recommended avoiding reliance on ACID services, as they inhibit the distributed nature of the cloud. To achieve that, Twilio recommended that you “relax consistency requirements”. Netflix implemented this by
… leveraging NoSQL solutions wherever possible to take advantage of the added availability and durability that they provide, even though it meant sacrificing some consistency guarantees …
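To see what that trade-off looks like in code, one widely used NoSQL option, DynamoDB, exposes it as a single flag on the read (this is an illustrative example, not Netflix’s actual stack; table and key names are made up):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Relaxed consistency: DynamoDB reads are eventually consistent by default,
# trading a possibly slightly stale answer for better availability.
item = dynamodb.get_item(
    TableName="user_profiles",
    Key={"user_id": {"S": "alice"}},
    ConsistentRead=False,  # eventually consistent (the default)
)

# A strongly consistent read gives back the consistency guarantee,
# at the cost of some of that availability headroom.
item_strong = dynamodb.get_item(
    TableName="user_profiles",
    Key={"user_id": {"S": "alice"}},
    ConsistentRead=True,
)
```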
Load Balancing
Use dynamic balancing across instances, regardless of zone. When traffic is balanced equally by zone, as Amazon’s Elastic Load Balancer (ELB) does, a zone failure can bring down the system.
… Netflix uses its own software load balancing service that does balance across instances evenly, independent of which zone they are in. Services using middle tier load balancing are able to handle uneven zone capacity with no intervention …
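A minimal sketch of the difference, with the fleet invented for illustration: split traffic evenly across zones and a degraded zone’s lone survivor gets overloaded; split it evenly across instances and load follows actual capacity.

```python
import itertools
import random

# Hypothetical fleet: zone "us-east-1b" has lost most of its instances.
INSTANCES = {
    "us-east-1a": ["a1", "a2", "a3", "a4"],
    "us-east-1b": ["b1"],            # degraded zone
    "us-east-1c": ["c1", "c2", "c3", "c4"],
}

def pick_per_zone():
    """ELB-style equal split per zone: pick a zone first, then an instance.
    Instance b1 ends up carrying a third of all traffic."""
    zone = random.choice(list(INSTANCES))
    return random.choice(INSTANCES[zone])

all_instances = list(itertools.chain.from_iterable(INSTANCES.values()))
round_robin = itertools.cycle(all_instances)

def pick_per_instance():
    """Zone-independent: balance evenly across instances, so the degraded
    zone's single survivor gets 1/9 of the traffic instead of 1/3."""
    return next(round_robin)
```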
Conclusion
Recent AWS outages serve as an important lesson to the IT world, and an important milestone in our maturity in using the cloud. The most important thing to do now is to learn from the mistakes of those who went down with AWS, as well as from the success of those who survived it, and to come up with proper methodology, patterns, guidelines and best practices for doing it right, so that Skynet will not take down humanity.