Wednesday, June 27, 2012

Cloud-based Infrastructure: If you want to be available, be transient

So, there was an outage a couple of weeks ago in the Amazon Web Services east coast region - a power problem affecting one of their zones. They have four (or now five) zones on the east coast, and apparently the others were unaffected.

What was amazing to me was not that there was an outage, but that so many users on the AWS forums posted messages that essentially said "Help! I can't connect to my instance!". Well, yes, the zone was offline for a few hours, so instances running in that zone were offline. But if you rely that much on a single instance being available, trouble is coming your way.

Cloud-based infrastructure needs to be thought of as transient - that is, like saying, "I like you but I am not committed to you".  If an instance or service is down, you should be able to shrug your shoulders and move on.

At BuildLinks, we are very heavy users of AWS. We don't own a single piece of hardware (except for laptops, a couple of hubs and wireless routers). We are all "in" on AWS, but we work hard to design our systems so that we are not committed to a particular zone or instance (heck, we even back up our databases to RackSpace outside of AWS, just to be paranoid). If an instance goes offline, there is another to take its place. AWS provides some services for doing this (like ELB), and some you have to build yourself. For example, we use multi-zone web server instances fronted by ELB, but we had to string together database replication ourselves - our database is actively replicated to another zone on the east coast and also to the AWS west coast region. We also back up our database to S3 and RackSpace every few minutes.
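To make the "there is another to take its place" idea concrete, here is a minimal sketch in Python. It is not our actual failover code, just an illustration: the `Replica` class, the `pick_healthy` function, the replica names and the zone labels are all made up for the example, and the health check is a stand-in for whatever probe you would really use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Replica:
    name: str  # hypothetical label, not a real host name
    zone: str  # illustrative zone label


def pick_healthy(replicas: Iterable[Replica],
                 is_healthy: Callable[[Replica], bool]) -> Optional[Replica]:
    """Return the first replica that passes the health check, or None."""
    for replica in replicas:
        try:
            if is_healthy(replica):
                return replica
        except Exception:
            # A probe that blows up is treated like an unhealthy replica.
            continue
    return None


# Preference order: primary zone, then another zone, then another region.
REPLICAS = [
    Replica("db-primary", "us-east-1a"),
    Replica("db-standby", "us-east-1b"),
    Replica("db-remote", "us-west-1a"),
]


def zone_is_up(down_zones):
    """Build a fake health check that fails for replicas in the given zones."""
    return lambda replica: replica.zone not in down_zones


if __name__ == "__main__":
    # Simulate the primary zone going offline.
    chosen = pick_healthy(REPLICAS, zone_is_up({"us-east-1a"}))
    print(chosen.name if chosen else "nothing available")
```

The point of the sketch is that the caller never cares which instance it gets, only that one of them answers. If every zone in the list is down, `pick_healthy` returns `None` and you are into restore-from-backup territory.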

Yes, we could experience an outage if an entire region goes offline, ELB fails or something similar happens, but we would not be completely wiped out, given the replication and redundancy we took time to design and build, and continue to extend with every product release. Cloud-based infrastructure reliability needs to be an ongoing concern.

If you want to use cloud-based infrastructure, you need to take time to design and iterate your services for robustness - not just throw instances up with the assumption that they will be there forever. Doing so will also keep your blood pressure down and help you sleep. :)

If you have questions, drop me a line.  I am glad to share.

Thursday, June 7, 2012

Peopleware

I believe that people who are learning and growing at work are going to be more productive and positive. To that end, the BuildLinks technology leadership team is reading Peopleware, by Tom DeMarco and Tim Lister. It's a classic book on technology team management. Tom DeMarco has written several great books and articles and I am a big fan.

I read Peopleware when it first came out years ago. Re-reading it now has reminded me how deeply it has influenced my thinking on technology management. I don't agree with everything in the book, but over my career I have found that most of the authors' insights are spot on.