Some Amazon hosted web sites experienced a 40 minute outage the morning of 12/28/2008 (9:30am PST to 10:10am PST) as network connectivity to one of Amazon’s “EC2 availability Zone” dropped. About 40 minutes into the outage, and just before it was fixed, Amazon posted a note on their dashboard page.
A couple of observations:
- This frustrating kind of outage, while usually infrequent, can happen at any datacenter. This one get’s put in the “this better not happen very often” mental bucket.
- I am disappointed with how long it took Amazon to post any kind of announcement. It took me more than a few minutes to truly convince myself that it was Amazon’s issue and not a burp in our own operations. I checked one other site, userscripts.org, that I know hosted in EC2 and that site didn’t seem to be affected. Apparently they are hosted in a different “EC2 availability zone”.
- I’m pleased with how our own monitoring of the website worked. Most of our monitoring is done with Nagios from within the Amazon Cloud. Since the issue was connectivity TO the cloud, that set of monitoring didn’t notice anything wrong. According to it’s point of view, everything was operating normally. Sort of surprisingly, RightScale’s, the service we use to help us host our site on EC2, monitoring also reported everything as normal. Fortunately, we have a second set of monitoring hosted on a Linode to monitor the health of the main Nagios monitor. This guy alarmed as it was suppose to (~$20/month well spent) and at 9:30am PST, I knew our site was down (though there wasn’t much I could do about it).
And a note or two:
- With infinite time and funds we’d probably have the site load balanced between multiple clouds. That will happen some day.
- Always have a second set of monitoring completely independent from your main site operations. It can just be a heartbeat monitor but it is worth the time and money to set up.
MySQL Date Arithmetic amongst Other Oddities
December 2, 2008
Like most websites, we at FanSnap make heavy use of MySQL as a database engine. Despite it’s popularity, it definitely has it’s quirks. Full disclosure, I am by no means a DBA though I often get to play one at work. Last we week we implemented some much needed monitoring to alarm if some expected database activity didn’t happen. Basically, if a timestamp field called updated_at doesn’t have a value less than 2 hours old, our Nagios monitoring system will send out an alert email. Seemed like a simple query to me:
select now() - updated_at from the_table order by updated_at desc limit 1
and as long as “now() – updated_at” was less then 7200, Nagios wouldn’t complain.
Since updated_at could be updated a fairly random intervals, simple tests directly against the database seemed to prove the above query worked. Well… Looking at slice of Nagios monitoring results showed something odd:
Time of Monitor TimeDiff ======================== 12/2/08 9:38 118 12/2/08 9:43 618 12/2/08 9:48 231 12/2/08 9:53 731 12/2/08 9:58 418 12/2/08 10:06 5199 12/2/08 10:11 88 12/2/08 10:16 588
Within the 5 minutes between monitoring checks at 9:48 and 9:53 the difference went up 500 seconds. Huh? There aren’t 100 seconds in an minute. It wasn’t a fluke since it the same 500 seconds would show up in other intervals such as at 10:11 and 10:16. Even more prenounced is the jump in seconds between 9:58 and 10:06. It seems if there were no updates over an hour boundry the number would jump significantly. Reading a litle documentation and realizing I was only interested in the number of seconds I converted both numbers to seconds using unix_timesteamp():
select unix_timestamp() - unix_timestamp(updated_at) from the_table order by updated_at desc limit 1
This worked much better. I scratched my head a bit and turned to FanSnap’s senior engineering team and queried “Does MySQL know how to do date arithmetic?” I was quickly admonished for not using DATE_SUB() and my pleas referencing pg 489 of the MySQL Reference Manual where it says, and I quote,
“In MySQL Version 3.23, you can use + and – instead of DATE_ADD() and DATE_SUB() if the expression on the right side is a date or datetime column.”
were dismissed as Noobish. What made this harder to understand was the SQL query didn’t fail with any errors. It went along its merry way returning results that almost looked like they made sense.
Can any MySQL experts enlighten me?
MySQL, Peace and Happiness,
Mike
