Some Amazon-hosted web sites experienced a 40-minute outage the morning of 12/28/2008 (9:30am PST to 10:10am PST) when network connectivity to one of Amazon’s “EC2 Availability Zones” dropped.  About 40 minutes into the outage, and just before it was fixed, Amazon posted a note on their dashboard page.

“One of our EC2 Availability Zone is experiencing reduced connectivity to the Internet at present. We are shifting traffic to another provider and expecting fully restored connectivity shortly.”

A couple of observations:

  1. This frustrating kind of outage, while usually infrequent, can happen at any datacenter.  This one gets put in the “this better not happen very often” mental bucket.
  2. I am disappointed with how long it took Amazon to post any kind of announcement.  It took me more than a few minutes to truly convince myself that it was Amazon’s issue and not a burp in our own operations.  I checked one other site, userscripts.org, that I know is hosted in EC2, and that site didn’t seem to be affected.  Apparently it is hosted in a different “EC2 Availability Zone”.
  3. I’m pleased with how our own monitoring of the website worked.  Most of our monitoring is done with Nagios from within the Amazon Cloud.  Since the issue was connectivity TO the cloud, that set of monitoring didn’t notice anything wrong.  From its point of view, everything was operating normally.  Somewhat surprisingly, the monitoring from RightScale, the service we use to help us host our site on EC2, also reported everything as normal.  Fortunately, we have a second set of monitoring hosted on a Linode to monitor the health of the main Nagios monitor.  This guy alarmed as it was supposed to (~$20/month well spent) and at 9:30am PST, I knew our site was down (though there wasn’t much I could do about it).

And a note or two:

  1. With infinite time and funds we’d probably have the site load balanced between multiple clouds.  That will happen some day.
  2. Always have a second set of monitoring completely independent from your main site operations.  It can just be a heartbeat monitor but it is worth the time and money to set up.
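
As a rough illustration of note 2, here is a minimal sketch of a heartbeat check that could run from a machine outside the main hosting environment. It is not our actual Linode setup; the function names, the two-failure threshold, and the use of a plain HTTP probe are all assumptions made for the example.

```python
import urllib.request
from typing import Tuple


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Probe the site from the outside; any network error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False


def heartbeat_step(is_up: bool, consecutive_failures: int,
                   max_failures: int = 2) -> Tuple[int, bool]:
    """Return (new failure count, whether to raise an alarm).

    Hypothetical policy: alarm only after `max_failures` misses in a row,
    so a single dropped probe does not page anyone.
    """
    if is_up:
        return 0, False
    consecutive_failures += 1
    return consecutive_failures, consecutive_failures >= max_failures
```

A cron job on the independent host would call `site_is_up()` every few minutes, feed the result to `heartbeat_step()`, and send mail when the second value comes back true.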

Like most websites, we at FanSnap make heavy use of MySQL as a database engine.   Despite its popularity, it definitely has its quirks.  Full disclosure: I am by no means a DBA, though I often get to play one at work.  Last week we implemented some much needed monitoring to alarm if some expected database activity didn’t happen.  Basically, if the most recent value in a timestamp field called updated_at is more than 2 hours old, our Nagios monitoring system will send out an alert email.  Seemed like a simple query to me:

select now() - updated_at from the_table order by updated_at desc limit 1

and as long as “now() - updated_at” was less than 7200, Nagios wouldn’t complain.

Since updated_at could be updated at fairly random intervals, simple tests directly against the database seemed to prove the above query worked.  Well… Looking at a slice of Nagios monitoring results showed something odd:
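
The threshold logic on the Nagios side can be sketched roughly as follows. This is not our actual plugin: the function name and messages are made up for illustration, and only the exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_update_age(age_seconds, critical=7200):
    """Map the age of the newest updated_at value to a Nagios status.

    `age_seconds` is whatever the SQL query returned; None stands in for
    "the query returned no rows" (an assumption for this sketch).
    """
    if age_seconds is None:
        return UNKNOWN, "UNKNOWN - no rows returned"
    if age_seconds < critical:
        return OK, f"OK - last update {age_seconds}s ago"
    return CRITICAL, f"CRITICAL - last update {age_seconds}s ago"
```

A real plugin would print the message and call `sys.exit()` with the status code. Note that the check is only as good as the number fed into it.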

Time of Monitor   TimeDiff 
========================
12/2/08 9:38       118
12/2/08 9:43       618
12/2/08 9:48       231
12/2/08 9:53       731
12/2/08 9:58       418
12/2/08 10:06      5199
12/2/08 10:11      88
12/2/08 10:16      588

Within the 5 minutes between the monitoring checks at 9:48 and 9:53 the difference went up 500 seconds.  Huh?  There aren’t 100 seconds in a minute.  It wasn’t a fluke, since the same 500 seconds showed up in other intervals, such as between 10:11 and 10:16.  Even more pronounced is the jump in seconds between 9:58 and 10:06.  It seems that if there were no updates over an hour boundary the number would jump significantly.   After reading a little documentation and realizing I was only interested in the number of seconds, I converted both values to seconds using unix_timestamp():

select unix_timestamp() - unix_timestamp(updated_at) from the_table order by updated_at desc limit 1

This worked much better. I scratched my head a bit and turned to FanSnap’s senior engineering team and queried “Does MySQL know how to do date arithmetic?” I was quickly admonished for not using DATE_SUB() and my pleas referencing pg 489 of the MySQL Reference Manual where it says, and I quote,

“In MySQL Version 3.23, you can use + and - instead of DATE_ADD() and DATE_SUB() if the expression on the right side is a date or datetime column.”

were dismissed as Noobish.  What made this harder to understand was the SQL query didn’t fail with any errors.  It went along its merry way returning results that almost looked like they made sense.

Can any MySQL experts enlighten me?
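
For what it is worth, the pattern in the table is consistent with how MySQL evaluates a DATETIME in a numeric context: the value is treated as the number YYYYMMDDHHMMSS, so a plain minus subtracts two such numbers rather than two instants in time. The Python sketch below mimics that behavior. The specific timestamps are hypothetical, picked only because they reproduce numbers from the table; the actual update times are not known.

```python
from datetime import datetime


def as_mysql_number(dt: datetime) -> int:
    """Mimic a DATETIME used in numeric context: YYYYMMDDHHMMSS."""
    return int(dt.strftime("%Y%m%d%H%M%S"))


def naive_diff(now: datetime, updated_at: datetime) -> int:
    """What `now() - updated_at` effectively computes."""
    return as_mysql_number(now) - as_mysql_number(updated_at)


def real_diff(now: datetime, updated_at: datetime) -> int:
    """Elapsed seconds, as the unix_timestamp() version computes."""
    return int((now - updated_at).total_seconds())


# Hypothetical update at 09:46:00, checked at 09:48:31 and again at 09:53:31.
updated = datetime(2008, 12, 2, 9, 46, 0)
print(naive_diff(datetime(2008, 12, 2, 9, 48, 31), updated))  # 231
print(naive_diff(datetime(2008, 12, 2, 9, 53, 31), updated))  # 731
print(real_diff(datetime(2008, 12, 2, 9, 53, 31), updated))   # 451

# Hypothetical update at 09:54:32, checked across the hour boundary at 10:06:31.
updated = datetime(2008, 12, 2, 9, 54, 32)
print(naive_diff(datetime(2008, 12, 2, 10, 6, 31), updated))  # 5199
print(real_diff(datetime(2008, 12, 2, 10, 6, 31), updated))   # 719
```

Each elapsed minute adds 100 to the naive difference, which is where the 500-per-5-minutes comes from, and crossing an hour boundary adds a further 4000. Besides unix_timestamp(), MySQL 5.0 and later also provide TIMESTAMPDIFF(SECOND, updated_at, NOW()) for this kind of calculation.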

MySQL, Peace and Happiness,
Mike

About eight weeks ago, after two-plus years of service, my tenure at Flock, a social web browser built on Mozilla’s Firefox technology, was unfortunately and unexpectedly terminated for “corporate restructuring” reasons.  Fortunately I have been given an opportunity to work with a very experienced engineering team at FanSnap, an event ticket search engine that has just launched its first beta.

At this gig I have experienced the future of web service hosting and it is Cloud Computing.  The term has taken on a few definitions over time, but specifically I mean renting virtual machine time via a service like Amazon Web Services. More than the ability to scale rapidly as demand goes up (and scale back if demand goes down) at an unbelievably cheap price, it is the structure it forces you to adhere to that unleashes its power.  The temporal nature of these virtual machines (they can come and go at any time) creates an environment where the cost of not following system administration best practices is high enough to make you do it.  How many times have you tweaked a configuration script...

What does this mean?

  1. The configuration of record is NOT what is currently running on your systems. Rather, it is the scripts, packages, and configurations that are stored in a repository and deployed to build your systems that are the Truth.
  2. Back up your data.  You can’t be paranoid enough with traditional data center hosting about the safety, integrity and availability of your data.  Being in The Cloud only heightens this.

The Source Of Truth
Doing the work up front to have #1 in place rapidly pays for itself.  With the Truth as a set of configurations and packages, we are able to spin up servers and have them in service in 15 to 30 minutes.  I’m looking forward to implementing auto-scaling on our site that will automatically put servers into production based on certain load conditions.

Cloud computing also creates a bit of a mind twist for us old timers. Say you want to do a code update on your servers. Traditionally you would take your systems out of service (in some strategy that hopefully does not make your website go dark), update the code and put them back into service.  With virtual machines, you can spin up new ones, validate them and, if all is OK, flip (or migrate, depending on your code update strategy) the traffic over to these new machines.  Finally, decommission the old servers when they are no longer serving traffic. This also provides a very quick rollback path if needed.   Just back out the change to the load balancer that flipped the traffic to the new pool of servers and the old code is back in production.
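
The flip-and-rollback flow can be sketched as a toy model. The `LoadBalancer` class, its method names, and the `healthy` callback are all hypothetical stand-ins for whatever load balancer and validation step are really in use; the point is only that the old pool stays around until the new one has proven itself.

```python
class LoadBalancer:
    """Toy model of a load balancer that routes to one pool at a time."""

    def __init__(self, active_pool):
        self.active_pool = list(active_pool)
        self.previous_pool = []

    def flip(self, new_pool, healthy):
        """Send traffic to new_pool only if every server in it validates."""
        if not new_pool or not all(healthy(server) for server in new_pool):
            return False  # leave the old servers in service
        self.previous_pool, self.active_pool = self.active_pool, list(new_pool)
        return True

    def rollback(self):
        """Back out the last flip; the old code is back in production."""
        if not self.previous_pool:
            return False
        self.active_pool, self.previous_pool = self.previous_pool, self.active_pool
        return True


lb = LoadBalancer(["old-1", "old-2"])
lb.flip(["new-1", "new-2"], healthy=lambda server: True)
print(lb.active_pool)  # ['new-1', 'new-2']
lb.rollback()
print(lb.active_pool)  # ['old-1', 'old-2']
```

The old servers would be decommissioned only once a rollback is no longer wanted, which in this model means clearing `previous_pool`.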

We’ve also spun up high-powered VMs for short-term, CPU-intensive tasks and spun them back down when we were done.  Instead of spending $5K on a machine, we can rent it for a few dollars for a few hours.

It’s The Data, Stupid
While having the ability for machines to come and go at will can make you giddy, making sure your data doesn’t disappear down a virtual black hole can keep you up at night.  What we’ve implemented, and this will likely evolve over time, is two MySQL master/slave pairs (a primary and a standby), all kept in sync with MySQL Replication, in different “Amazon Availability Zones”.  The standby pair also regularly snapshots its LVM volume to the S3 repository.  Finally, in case of a complete Amazon meltdown, we use MySQL replication to create a copy of our database off the cloud and onto a physical(ish) server.

Log files from the web and app tiers are also sync’d to S3 for future processing and debugging.

Cloud Computing is here.  There are still valid concerns about stability and availability (then again, I’ve been hosted in datacenters such as Rackspace that have gone dark as well).  Monitoring is a pain in the arse since a server that was here one day may not exist the next.  Nevertheless, it is clearly the future of web service hosting.  The economies and associated processes are just too compelling.  We’re probably paying $.25 on the dollar of a fully managed solution such as Rackspace (I’m not dissing Rackspace.  I found their service to be top notch when I used them).  It will be interesting to watch as the technology and market evolve over the next couple of years.  I expect the trend of providing computing power as just another utility to become more fully realized over time as more AWS-like services enter the market and the tools to use these services continue to mature.

I took an interesting stroll down memory lane searching for past classmates (both college and high school) on Facebook.  Several folks from college have already responded (I went to Carnegie Mellon which is a pretty technical school so not surprising).  It will be interesting to see who from high school responds.  As a point of reference, I went to high school long before Al Gore invented the internet.

Folks of my advancing years didn’t have the opportunity to augment, or initiate new, relationships with social networking sites, but these sites are certainly providing a way to reconnect to past ones.  This will be fun.

Blogged with the Flock Browser

One of my all-time favorite people-watching experiences is to go food shopping at the Safeway on Mother’s Day morning.  Every dad in the neighborhood has dragged their uncooperative progeny along in the hopeful attempt to salvage a Mother’s Day breakfast.  It is a sight to behold. Kids screaming, fathers panicking, helium balloons being let loose in the store… Even if you’re not a parent this is worth checking out.   I, of course, participate and, like anything I do, it is always more satisfying to do it 100%.  So, I take a few extra steps to ensure I get the full experience.

  1. Give the kids a little sugar before heading to the market
  2. Only bring one toy and expect them to share
  3. Promise to get them the shopping cart with the cool car attached knowing that it will already be taken
  4. Walk slowly down the candy aisle several times without buying anything
  5. Check and recheck shopping list while muttering “I know I’m forgetting something”

Anyways, from me to all the Dads out there doing their best to survive an extremely hostile and unfamiliar situation (imagine landing in Bosnia under heavy gunfire like Hillary Clinton did),  Father’s Day is almost here.  Remember to clear your bubbles in combat conditions.

Let’s assume the GDP (Gross Domestic Product) metrics are an accurate assessment of the economy’s activity (they’re not, and neither were the now out-of-favor GNP numbers, but that’s a blog post for another time). The Reuters article, like many others today,  Growth surprises but consumers stressed – Yahoo! News: leads off with “A buildup in inventories kept the economy afloat in the first quarter…”  The fact that GDP grew at 0.6% in Q1 because of an inventory buildup is probably more troubling than if GDP had shrunk.  What is likely happening is that the economy is decelerating quicker than businesses are able to react.  Fortunately this article, in paragraph three, states this possibility.

“Some economists said the report suggested the U.S. economy was on a bit firmer ground than had been thought, but others braced for worse times ahead as businesses ratchet back production further to try to sell off inventories”

While I’m not in favor of overly negative press accounts of economic conditions, as there is some truth to “talking ourselves into a recession”, it is disingenuous to try to pass these numbers off as better than expected, which seems to be the trend of the day.

I’m not a big believer in Karma but I got a dash of it last Friday.  For some cosmic reason in my life I’ve been a finder of wallets and purses and have always dutifully returned them directly to the person or some trusted authority.  I actually had a threefer in college where three days in a row I ran across someone’s left-behind wallet. The last one happened a couple of months ago outside the Ikeda’s market off of Route 80 on the way back from a Tahoe ski trip, where I flagged down an SUV just as it was pulling out of the parking lot.

The cosmic payback came my way on Friday when I dropped my wallet (first time in my entire wallet-carrying life) outside the Redwood City Town Hall (I think it is the town hall) just outside the Theater district.  I must have dropped it fishing something out of my pocket while I was eating lunch at the outdoor seating area there.  I didn’t realize it was gone until that evening as I packed up to go home from work.  I’d like to say I calmly got in my car to retrace my steps but there was no calm about it.  Yet, as I approached the seating area I saw a young couple looking as if they had found a wallet, and by the look on my face as I walked towards them they knew they had found the owner.  Much gratification and relief flowed profusely and incoherently from me.

So, to the young couple playing cards in the courtyard there, I owe you one.  You have my business card.  Contact me anytime!

Thx,
Mike
