Some Amazon-hosted web sites experienced a 40-minute outage the morning of 12/28/2008 (9:30am PST to 10:10am PST) as network connectivity to one of Amazon’s “EC2 Availability Zones” dropped.  About 40 minutes into the outage, and just before it was fixed, Amazon posted a note on their dashboard page:

“One of our EC2 Availability Zone is experiencing reduced connectivity to the Internet at present. We are shifting traffic to another provider and expecting fully restored connectivity shortly.”

A couple of observations:

  1. This frustrating kind of outage, while usually infrequent, can happen at any datacenter.  This one gets put in the “this better not happen very often” mental bucket.
  2. I am disappointed with how long it took Amazon to post any kind of announcement.  It took me more than a few minutes to truly convince myself that it was Amazon’s issue and not a burp in our own operations.  I checked one other site, userscripts.org, that I know is hosted in EC2, and that site didn’t seem to be affected.  Apparently they are hosted in a different “EC2 Availability Zone”.
  3. I’m pleased with how our own monitoring of the website worked.  Most of our monitoring is done with Nagios from within the Amazon cloud.  Since the issue was connectivity TO the cloud, that set of monitoring didn’t notice anything wrong.  From its point of view, everything was operating normally.  Somewhat surprisingly, the monitoring from RightScale, the service we use to help us host our site on EC2, also reported everything as normal.  Fortunately, we have a second set of monitoring hosted on a Linode to monitor the health of the main Nagios monitor.  This guy alarmed as it was supposed to (~$20/month well spent) and at 9:30am PST I knew our site was down (though there wasn’t much I could do about it).

And a note or two:

  1. With infinite time and funds we’d probably have the site load balanced between multiple clouds.  That will happen some day.
  2. Always have a second set of monitoring completely independent from your main site operations.  It can just be a heartbeat monitor but it is worth the time and money to set up.
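For what it’s worth, a heartbeat check of the kind described in the second note doesn’t need to be fancy.  Here is a minimal sketch in Python, not what we actually run on the Linode.  The function name site_is_up and the opener parameter are my own inventions for illustration; the only real library call is urllib.request.urlopen.

```python
import urllib.request


def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 400 in time.

    `opener` is a hypothetical hook so the check can be exercised without a
    network; by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so an unreachable site or an error response lands here.
        return False
```

Run from cron on a box that lives outside the cloud, and have it send mail whenever site_is_up returns False.  The important part is where it runs, not how clever it is.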

Like most websites, we at FanSnap make heavy use of MySQL as a database engine.  Despite its popularity, it definitely has its quirks.  Full disclosure: I am by no means a DBA, though I often get to play one at work.  Last week we implemented some much needed monitoring to alarm if some expected database activity didn’t happen.  Basically, if the most recent value of a timestamp field called updated_at is more than 2 hours old, our Nagios monitoring system will send out an alert email.  Seemed like a simple query to me:

select now() - updated_at from the_table order by updated_at desc limit 1

and as long as “now() - updated_at” was less than 7200, Nagios wouldn’t complain.

Since updated_at could be updated at fairly random intervals, simple tests directly against the database seemed to prove the above query worked.  Well… Looking at a slice of Nagios monitoring results showed something odd:

Time of Monitor   TimeDiff
========================
12/2/08 9:38       118
12/2/08 9:43       618
12/2/08 9:48       231
12/2/08 9:53       731
12/2/08 9:58       418
12/2/08 10:06      5199
12/2/08 10:11      88
12/2/08 10:16      588

Within the 5 minutes between monitoring checks at 9:48 and 9:53 the difference went up 500 seconds.  Huh?  There aren’t 100 seconds in a minute.  It wasn’t a fluke, since the same 500 seconds showed up in other intervals, such as between 10:11 and 10:16.  Even more pronounced is the jump between 9:58 and 10:06.  It seems that if there were no updates across an hour boundary the number would jump significantly.  After reading a little documentation and realizing I was only interested in the number of seconds, I converted both values to seconds using unix_timestamp():

select unix_timestamp() - unix_timestamp(updated_at) from the_table order by updated_at desc limit 1

This worked much better. I scratched my head a bit and turned to FanSnap’s senior engineering team and queried “Does MySQL know how to do date arithmetic?” I was quickly admonished for not using DATE_SUB() and my pleas referencing pg 489 of the MySQL Reference Manual where it says, and I quote,

“In MySQL Version 3.23, you can use + and - instead of DATE_ADD() and DATE_SUB() if the expression on the right side is a date or datetime column.”

were dismissed as Noobish.  What made this harder to understand was the SQL query didn’t fail with any errors.  It went along its merry way returning results that almost looked like they made sense.

Can any MySQL experts enlighten me?
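A likely explanation, as far as I can tell: when a datetime is used in plain arithmetic, MySQL treats it as a number of the form YYYYMMDDHHMMSS, so “now() - updated_at” subtracts two packed decimal numbers rather than counting seconds.  The sketch below mimics that in Python (not SQL).  The helper names and the timestamps are hypothetical, picked only to show the effect, and are not taken from the Nagios log above.

```python
from datetime import datetime


def mysql_numeric(dt):
    """Pack a datetime as YYYYMMDDHHMMSS, the way MySQL shows it as a number."""
    return int(dt.strftime("%Y%m%d%H%M%S"))


def naive_diff(now, updated_at):
    """Mimic `now() - updated_at`: subtract the two packed numbers."""
    return mysql_numeric(now) - mysql_numeric(updated_at)


def real_diff(now, updated_at):
    """Mimic `unix_timestamp() - unix_timestamp(updated_at)`: elapsed seconds."""
    return int((now - updated_at).total_seconds())


# Hypothetical last update, checked once before and once after the hour turns.
updated_at = datetime(2008, 12, 2, 9, 56, 0)
for now in (datetime(2008, 12, 2, 9, 58, 0), datetime(2008, 12, 2, 10, 6, 0)):
    print(now.time(), naive_diff(now, updated_at), real_diff(now, updated_at))
```

Two minutes after the update the packed subtraction gives 200 where the real answer is 120 seconds, and ten minutes after (across the hour boundary) it gives 5000 where the real answer is 600.  That is the same flavor of 500-per-five-minutes and hour-boundary jump as in the table, which would also explain why the query never raised an error.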

MySQL, Peace and Happiness,
Mike

About eight weeks ago, after two-plus years of service, my tenure at Flock, a social web browser built on Mozilla’s Firefox technology, was unfortunately and unexpectedly terminated for “corporate restructuring” reasons.  Fortunately I have been given an opportunity to work with a very experienced engineering team at FanSnap, an event ticket search engine that has just launched its first beta.

At this gig I have experienced the future of web service hosting and it is Cloud Computing.  The term has taken on a few definitions over time, but specifically I mean renting virtual machine time via a service like Amazon Web Services.  More than the ability to scale rapidly as demand goes up (and scale back if demand goes down) at an unbelievably cheap price, it is the structure it forces you to adhere to that unleashes its power.  The transient nature of these virtual machines (they can come and go at any time) creates an environment where the cost of not following system administration best practices is high enough to make you do it.  How many times have you tweaked a configuration script...

What does this mean?

  1. The configuration of record is NOT what is currently running on your systems.  Rather, it is the scripts, packages, and configurations that are stored in a repository and deployed to build your systems that are the Truth.
  2. Back up your data.  You can’t be paranoid enough with traditional data center hosting about the safety, integrity and availability of your data.  Being in The Cloud only heightens this.

The Source Of Truth
Doing the work up front to have #1 in place rapidly pays for itself.  With the Truth as a set of configurations and packages, we are able to spin up servers and have them in service in 15 to 30 minutes.  I’m looking forward to implementing auto-scaling on our site that will automatically put servers into production based on certain load conditions.

Cloud computing also creates a bit of a mind twist for us old timers.  Say you want to do a code update on your servers.  Traditionally you would take your systems out of service (in some strategy that hopefully does not make your website go dark), update the code and put them back into service.  With virtual machines, you can spin up new ones, validate them and, if all is OK, flip (or migrate, depending on your code update strategy) the traffic over to these new machines.  Finally, decommission the old servers when they are no longer serving traffic.  This also provides a very quick rollback path if needed.  Just back out the change to the load balancer that flipped the traffic to the new pool of servers and the old code is back in production.
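The flip-and-rollback idea above can be sketched in a few lines of Python.  This is a toy model under my own assumptions, not our actual tooling: the LoadBalancer class, its method names and the server names are all hypothetical, and a real load balancer would be driven through its own configuration or API.

```python
class LoadBalancer:
    """Toy stand-in for a load balancer that sends traffic to one pool at a time."""

    def __init__(self, pool):
        self.active = list(pool)   # servers currently taking traffic
        self.previous = []         # the pool we can roll back to

    def flip(self, new_pool, is_healthy):
        """Send traffic to new_pool, but only if every new server validates."""
        unhealthy = [s for s in new_pool if not is_healthy(s)]
        if unhealthy:
            raise RuntimeError(f"refusing to flip, failed validation: {unhealthy}")
        self.previous, self.active = self.active, list(new_pool)

    def rollback(self):
        """Back out the last flip so the old pool serves traffic again."""
        if not self.previous:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, self.active


lb = LoadBalancer(["old-1", "old-2"])
lb.flip(["new-1", "new-2"], is_healthy=lambda server: True)
print(lb.active)   # the new pool is in production
lb.rollback()
print(lb.active)   # the old pool is back
```

The point of keeping the old pool around until the new one has proven itself is that the rollback is just the same pointer swap in reverse.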

We’ve also spun up high powered VMs for short term, CPU intensive tasks and spun them back down when we were done.  Instead of spending $5K on a machine, we can rent it for a few dollars for a few hours.

It’s The Data, Stupid
While having the ability for machines to come and go at will can make you giddy, making sure your data doesn’t disappear down a virtual black hole can keep you up at night.  What we’ve implemented, and this will likely evolve over time, is two MySQL master/slave pairs (a primary and a standby), all kept in sync with MySQL replication, in different “Amazon Availability Zones”.  The standby pair also regularly snapshots its LVM volume to the S3 repository.  Finally, in case of a complete Amazon meltdown, we use MySQL replication to create a copy of our database off the cloud and onto a physical(ish) server.

Log files from the web and app tiers are also sync’d to S3 for future processing and debugging.

Cloud Computing is here.  There are still valid concerns about stability and availability (then again, I’ve been hosted in datacenters such as Rackspace that have gone dark as well).  Monitoring is a pain in the arse since a server that was here one day may not exist the next.  Nevertheless, it is clearly the future of web service hosting.  The economies and associated processes are just too compelling.  We’re probably paying $.25 on the dollar of a fully managed solution such as Rackspace (I’m not dissing Rackspace.  I found their service to be top notch when I used them).  It will be interesting to watch as the technology and market evolve over the next couple of years.  I expect the trend of providing computing power as just another utility to become more fully realized over time as more AWS-like services enter the market and the tools to use these services continue to mature.

I took an interesting stroll down memory lane searching for past classmates (both college and high school) on Facebook.  Several folks from college have already responded (I went to Carnegie Mellon which is a pretty technical school so not surprising).  It will be interesting to see who from high school responds.  As a point of reference, I went to high school long before Al Gore invented the internet.

Folks of my advancing years didn’t have the opportunity to augment, or initiate new, relationships with social networking sites, but these sites are certainly providing a way to reconnect to past ones.  This will be fun.

Blogged with the Flock Browser

One of my all-time favorite people-watching experiences is to go food shopping at the Safeway on Mother’s Day morning.  Every dad in the neighborhood has dragged their uncooperative progeny along in the hopeful attempt to salvage a Mother’s Day breakfast.  It is a sight to behold.  Kids screaming, fathers panicking, helium balloons being let loose in the store… Even if you’re not a parent this is worth checking out.  I, of course, participate and, like anything I do, it is always more satisfying to do it 100%.  So, I take a few extra steps to ensure I get the full experience.

  1. Give the kids a little sugar before heading to the market
  2. Only bring one toy and expect them to share
  3. Promise to get them the shopping cart with the cool car attached knowing that it will already be taken
  4. Walk slowly down the candy aisle several times without buying anything
  5. Check and recheck shopping list while muttering “I know I’m forgetting something”

Anyways, from me to all the Dads out there doing their best to survive an extremely hostile and unfamiliar situation (imagine landing in Bosnia under heavy gunfire like Hillary Clinton did), Father’s Day is almost here.  Remember to clear your bubbles in combat conditions.

Let’s assume the GDP (Gross Domestic Product) metrics are an accurate assessment of the economy’s activity (they’re not, and neither were the now out-of-favor GNP numbers, but that’s a blog post for another time).  The Reuters article “Growth surprises but consumers stressed” on Yahoo! News, like many others today, leads off with “A buildup in inventories kept the economy afloat in the first quarter…”  The fact that GDP grew at 0.6% in Q1 because of an inventory buildup is probably more troubling than if GDP had shrunk.  What is likely happening is that the economy is decelerating quicker than businesses are able to react.  Fortunately this article, in paragraph three, states this possibility:

“Some economists said the report suggested the U.S. economy was on a bit firmer ground than had been thought, but others braced for worse times ahead as businesses ratchet back production further to try to sell off inventories.”

While I’m not in favor of overly negative press accounts of economic conditions, as there is some truth to “talking ourselves into a recession”, it is disingenuous to try and pass these numbers off as better than expected, which seems to be the trend of the day.

I’m not a big believer in Karma but I got a dash of it last Friday.  For some cosmic reason I’ve been a finder of wallets and purses all my life and have always dutifully returned them directly to the person or some trusted authority.  I actually had a threefer in college where three days in a row I ran across someone’s left-behind wallet.  The last one happened a couple of months ago outside the Ikeda’s market off of Route 80 on the way back from a Tahoe ski trip, where I flagged down an SUV just as it was pulling out of the parking lot.

The cosmic payback came my way on Friday when I dropped my wallet (first time in my entire wallet-carrying life) outside the Redwood City Town Hall (I think it is the town hall) just outside the Theater district.  I must have dropped it fishing something out of my pocket while I was eating lunch at the outdoor seating area there.  I didn’t realize it was gone until that evening as I packed up to go home from work.  I’d like to say I calmly got in my car to retrace my steps but there was no calm about it.  Yet, as I approached the seating area I saw a young couple who looked as if they had found a wallet, and by the look on my face as I walked towards them they knew they had found the owner.  Much gratification and relief flowed profusely and incoherently from me.

So, to the young couple playing cards in the courtyard there, I owe you one.  You have my business card.  Contact me anytime!

Thx,
Mike

Statistically Significant

September 20, 2007

If there is one thing that just burns me, it is when someone says with conviction that a number is statistically significant when none of the formal statistical evidence gathering has been done.  So… a primer on Statistical Significance (from someone who isn’t that statistically bent): its importance, why being Statistically Significant isn’t always significant, and why I nearly go into a blind rage when someone uses the term out of context.

I like Wikipedia’s definition so I quote it here:

‘In statistics, a result is called significant if it is unlikely to have occurred by chance. “A statistically significant difference” simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important or significant in the usual sense of the word.’

Or, in layman’s terms, something is statistically significant if whatever you did (or whatever external event might have happened) made a difference (this is where testing, usually A-B testing, comes into play) and it probably (I intentionally use the word “probably” as this is where the term “confidence interval” comes into play) wasn’t just blind luck.

Why is knowing whether something is statistically significant important?  Simply put, it matters when you need to know if an event or a trend is the result of random variation or not.

Without going into the mathematics, three important factors contribute to a popular method of statistical significance testing (in this case the t-test) when comparing two independent data sets.  These are the mean, or average, the number of data points (which determines what statisticians, for some reason, call the degrees of freedom) and the spread of these data points, or standard deviation.

For example, let’s take the average life expectancy of two sets of rats.  One set gets a bowl of Cheerios every day and the other doesn’t.  Even if the average of one set is noticeably (note, not significantly) larger than the other, a large standard deviation in each set may indicate that the difference in the average is not significant. [Probably should put some sample data in here.]
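To make that concrete, here is a small sketch in Python with completely made-up numbers (life spans in days, invented for illustration and not real rat data).  It computes the t statistic in its Welch form, which is one common variant of the two-sample t-test and doesn’t assume the two sets have the same variance.  Turning the statistic into a p-value needs the t distribution, which I leave out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical life spans in days; both pairs differ by 10 days on average.
spread_cheerios = [700, 760, 820, 880, 940]
spread_control = [690, 750, 810, 870, 930]
tight_cheerios = [818, 819, 820, 821, 822]
tight_control = [808, 809, 810, 811, 812]

print(round(welch_t(spread_cheerios, spread_control), 2))
print(round(welch_t(tight_cheerios, tight_control), 2))
```

The averages differ by the same 10 days in both comparisons, but the widely spread sets give a t statistic of about 0.17 while the tightly clustered sets give 10.  Same difference in the mean, very different evidence.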

What if there were a million rats in each sample and a bowl of Cheerios every morning increased their life expectancy by 2.4 seconds?  While possibly statistically significant … it really doesn’t make a difference.  This is when statistical significance doesn’t mean real world significance.

So why does it just frost me when someone says that something is statistically significant without doing the proof?  Because humans are wired to not deal with randomness very well.  We are wired to try to find patterns in randomness that don’t exist.  It seems 1/2 the world believes that magic patterns emerge from their iPod song shuffle (“Dude… what are the chances of Dylan’s ‘A hard rain is gonna fall’ being followed by the Grateful Dead doing ‘Here Comes the Rain’ followed by CCR doing ‘Who’ll Stop the Rain’.  There is NO WAY that is random!  I’ve got 1000 songs on my iPod…. blah.. blah.. blah..”).  It is probably a survival trait we developed along our evolutionary path, but it is a bias or tendency that, at best, doesn’t translate well into today’s reality and, at worst, makes for disastrous decision making.

OK… I’ve been writing this off and on for a week.  There are tons of things I didn’t touch on (hypothesis testing, Type 1 errors, Type 2 errors, p-values, etc.) and I’m sure there are people much more familiar than me with these concepts.  If anyone ever read this blog, I’m sure I’d get some, hopefully constructive, feedback.  Maybe I’ll follow this up with more details in the future.

I have to admit that even after over a year working at Flock I’m still addicted to the media bar.  Our 0.9 version adds a bunch of new functionality but my favorite is being able to save Flickr and YouTube search queries and get notified when there are updates.  Just type some search terms into the Media bar search.  I’ve nostalgically starred queries for places I’ve lived in (Allston, MA and Vestal, NY) as well as where I live now.

My two favorite queries at the moment that I’ve got starred are “Ultimate Frisbee” (a sport I played at a fairly competitive level when I lived in Boston.  At least I thought I was good until I attended an open practice with DOG) on Flickr and “Guitar Instructions” on YouTube (an instrument I’m completely talentless on).

The “Ultimate Frisbee” query has yielded fantastic images almost every day.  An awesome layout here:

Game face on here:

Ultimate in India!

If only I were 20 years younger and 20 lbs lighter.

There are also a lot of folks willing to share their guitar knowledge on YouTube, which is awesome.  There is no shortage of how-to videos for “Stairway to Heaven” :-)

Enjoy!

Well… I was really looking forward to doing this year’s bike to work day.  Not so much as an eco-fascist but it’s nice to get a little stretch in before work.   Last year I rode from San Jose to San Mateo which is about 40 miles on a mountain bike.  That was a pretty good poke.

This year I only had to go to Mountain View.  About 5 miles from my destination some asshole opened his parked car door into the bike lane and sent me flying into the street.  Stuntmen couldn’t have timed it better.  One thing I noticed is that in movies this usually sends the victim end over end over the door.  What happened to me was ramming the right side of my chest into the corner of the door and getting sent sprawling perpendicular to the car.

Who was this asshole?  I don’t know.  I was kind of shaken up.  I was back on the bike after letting my head clear a bit for 15 minutes, but I probably should have made sure his car was alright (and if it was, punched out the window of his fancy BMW).

Inventory of damage
1. My right chest and back are throbbing and it hurts to breathe deeply.  Coughing is excruciating.  Here is where the point of the car door caught me in the chest.

2. Lost a piece of skin off my right thumb.

3. My right elbow is scraped up pretty good.

4. My left pinky is sore

5. My company’s $3000 Macbook Pro is dinged pretty good

6. My glasses are bent

I like riding my bike, though I don’t do it very often, but I’m going to stick to the trails where it is safe (I used to ride my bike through the streets of Boston all the time but this just hurts too much). I was going to bitch about the hour it took to get my bike ready…


Mike
