The summary: AWS had a single-zone failure in its US East region. We initiated our plan to evacuate the failed zone and continued to deliver services to our clients from the non-affected AWS zones in US East. I am very proud of our infrastructure team and how they jumped on this and made sure we continued to deliver service.
We build, plan and practice for this stuff; it doesn't just magically happen. We went through a drill for this very scenario a week before the AWS service degradation. Also, as I have mentioned in a previous blog post, if you are too attached to an EC2 instance, you are probably doing something wrong.
The thing about product infrastructure and security is this: no one really appreciates it until something goes wrong. As a CTO, it's part of my job to make sure we can deliver product to our customers who run their businesses on our platform. I have to champion this all the time - "Yes, it is worth our time and effort to build a robust and secure platform, not just add features to the product".
We have some principles that govern how we deliver services on AWS, which helped us in this last service degradation.
- Front end server EC2 instances are stateless. If one goes down, there is another to take its place and customers aren't affected.
- Front end server EC2 instances are templates. They don't store configuration or application code. When a front end server starts up, it is told what it is and finds its configuration and application code in a pre-configured S3 bucket; it then copies these locally and starts up. This allows us to run as many of these as we need, in multiple availability zones, and bring them up and down without issues. We can also modify the config and code in one place, and the front end servers pick up the new versions easily.
- Front end services run in multiple availability zones fronted by Elastic Load Balancers (ELBs). If a zone goes down, the ELB moves traffic to the non-affected zones. If the ELBs have trouble, we can bypass them and shard traffic directly to the instances by reconfiguring DNS (we had to do this for the last service degradation).
- Back end databases are stateful (of course), but they are actively replicated to multiple availability zones. When one zone goes offline, we fail over to the non-affected zone (again, we had to do this for the last service degradation). For paranoia, we back up the databases to Rackspace.
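To make the first three principles a bit more concrete, here is a minimal sketch in Python. It is not our actual tooling. The names `bootstrap`, `healthy_targets`, and `Fetch` are made up for illustration, the `fetch` callable is a stand-in for whatever downloads objects from the S3 bucket, and the zone names and addresses in the usage example are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stand-in for an S3 download: maps an object key to its bytes.
# A real server would call an S3 SDK here; a plain callable keeps the sketch
# runnable without AWS credentials.
Fetch = Callable[[str], bytes]


def bootstrap(role: str, keys: Iterable[str], fetch: Fetch, dest: Path) -> List[Path]:
    """Copy the config and code for `role` from the shared store to local disk.

    The instance itself holds nothing: it is told its role at start-up and
    pulls everything it needs, so any number of identical copies can be started.
    """
    written = []
    for key in keys:
        target = dest / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(f"{role}/{key}"))
        written.append(target)
    return written


def healthy_targets(instances: Dict[str, str], failed_zones: Set[str]) -> List[str]:
    """Return instance addresses outside the failed zones.

    `instances` maps an address to its availability zone. The result is the
    list that DNS would be pointed at when bypassing the load balancers.
    """
    return sorted(addr for addr, zone in instances.items() if zone not in failed_zones)


if __name__ == "__main__":
    # Placeholder "bucket" contents and fleet layout, purely illustrative.
    bucket = {"frontend/app.conf": b"port=8080\n"}
    fleet = {"192.0.2.10": "us-east-1a", "192.0.2.11": "us-east-1b"}
    with tempfile.TemporaryDirectory() as tmp:
        print(bootstrap("frontend", ["app.conf"], bucket.__getitem__, Path(tmp)))
    print(healthy_targets(fleet, {"us-east-1a"}))
```

The point of the design is that nothing in `bootstrap` depends on which machine it runs on, which is what makes evacuating a zone a matter of starting more copies elsewhere and re-pointing traffic.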
There is a lot more stuff we do, but these principles helped us withstand this last set of issues. We will continue to refine and work on product delivery. For example, we are interested in having our services replicated across regions (East to West coast, for example). There are tools to do this or we can roll our own. We have toyed with this, but need to dedicate more time to it.
Security and service delivery are a process, not a destination. We learned a lot and continue to refine and get better, but the moral is that you have to think about this and dedicate time to it - and never really feel that you are "done".