Al Issa's Blog: April 2011

Yes, the outage was very bad for Amazon Web Services. Essentially, it was a multi-zone failure in the AWS US East region that was not supposed to happen. Amazon claimed that each availability zone in a region was independent of the others, and that failures in one zone would not affect another. Last weekend's failure showed otherwise: failures spanned availability zones. That was bad.

Luckily, none of our production systems were affected. I don’t know why; luck of the draw. Our staging systems went offline, and eventually we got tired of waiting for them to recover, so we copied the EBS volumes and spun up new systems with the data from the old ones. Everything worked fine.

Am I happy with the failure? No. Especially since EC2 makes it impossible to migrate running instances to other zones, or to other regions for that matter. Also, the lack of transparency about how these availability zones work is very disturbing. It’s hard to take their word for it until they show how the zones work and why this won’t happen again.

Are we sticking with Amazon Web Services? Yes, but with some changes to our strategy going forward. Cloud computing is still the cheapest way to bring up services, platforms and infrastructure without a lot of upfront costs. Our company could not be where it is now without AWS. I figure that this failure will make Amazon Web Services better, and increase transparency. This is not unlike the massive Internet service failures that were very common in the late ’90s. As technology and platforms matured, and as we learned how to engineer better and stronger services, those outages became much rarer. Web sites and web-based services are more robust, and handle large volumes of traffic much better than they did 10 years ago. Similarly, cloud computing will only get better and better. The cost efficiencies alone are too great for it to go away.

What changes will we make to how we use Amazon Web Services?

We will be duplicating our services across AWS regions (US West as well as US East), not relying on availability zones within a single region to take care of us.

We will be moving our system backups not only to S3 but also to an AWS competitor (Rackspace).

I am going to run our team through a fire-drill. How do we recover our services in another AWS region in case of a disaster? How do we recover our services somewhere else completely?
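A fire-drill like this can start on paper. Here is a minimal sketch, in Python, of a check that flags any service that runs in only one region or keeps its backups with only one provider. Everything in it is an assumption for illustration: the `Service` record, the `drill_gaps` function, the inventory entries, and the region and provider labels are hypothetical, not a description of our actual setup or of any AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    regions: set = field(default_factory=set)           # regions where the service runs
    backup_providers: set = field(default_factory=set)  # providers holding its backups


def drill_gaps(service, min_regions=2, min_providers=2):
    """Return the reasons this service would not survive losing one region or provider."""
    gaps = []
    if len(service.regions) < min_regions:
        gaps.append(
            f"{service.name}: runs in {len(service.regions)} region(s), need {min_regions}"
        )
    if len(service.backup_providers) < min_providers:
        gaps.append(
            f"{service.name}: backups with {len(service.backup_providers)} provider(s), "
            f"need {min_providers}"
        )
    return gaps


# Made-up inventory: one service that meets the new strategy, one that does not.
inventory = [
    Service("web", {"us-east-1", "us-west-1"}, {"s3", "rackspace"}),
    Service("staging", {"us-east-1"}, {"s3"}),
]

for svc in inventory:
    for gap in drill_gaps(svc):
        print(gap)
```

A check like this only says where the gaps are. The drill itself still means actually restoring a service in the other region, or at the other provider, and timing how long it takes.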

At the end of the day, for our company to leave AWS would require a large capital commitment to hardware and infrastructure we are not prepared or really able to take on. Besides, I don’t want to manage another large set of IT infrastructure anymore. It sucks to do that. I got spoiled by letting AWS do that for me, and frankly I like it that way. With some refinement to the way we do things, I am just happy to keep paying AWS to do that for me, and that allows me to concentrate on delivering a cool and solid application.

Monday, April 25, 2011

No need to Panic: Why we are sticking with Amazon Web Services.
