19 September 2013

Solving the Uptime Problem Part One

As a DBA, I have to worry about uptime from two angles.  The first is one of raw survival.  I want to be the first to know when an instance decides to shuffle off this mortal coil and join the choir invisible.  I hate being told my servers are down; I absolutely want to know first.  And perhaps less critical but still important is being able to give KPI figures to the management.  So...how can we say what our SQL environment "uptime" is?

Our infrastructure chaps have all sorts of tools...SolarWinds Orion, SCOM, etc. that might be able to give us -server- uptime.  But they can provide that metric to management on its own.  What we are responsible for is a layer deeper: instance uptime.  When SQL Server is online, servicing queries.  If the server is turned on but the SQL service has shot craps, that is a SQL outage.  So we can't just do something simple where we log ping response or anything like that.

I've tried solving this in a handful of ways.  The first attempt was a painful and probably misguided debacle where I tried hacking the back-end of our existing SQL Server monitoring solution, Quest Spotlight.  If anyone out there has monkeyed with Spotlight to any extent they know of which I speak; the back-end databases are all but useless.  There was no possibility of using its history to accurately come up with a picture of uptime.  That said, I did build a fragile little alerting job based on the back end; the outage alerts from Spotlight fire immediately and often are false positives due to network connectivity issues and the like.  I wanted to be able to get an email that I trusted that indicated a hard outage, so I wrote a job that queried the back end for active outage alarms over 15 minutes, and emailed my personal email.  To this day this job is still occasionally saving my bacon.  When no one else is paying attention this job will catch an overlooked outage.

But the uptime calculation problem remained.  I grudgingly started to realize that what I wanted was going to have to be hand built, and so a few abortive attempts to build an SSIS package began.  I started with a fixed list of servers/instances in a table in a DBA utility database with an SSIS package that ran for-each container (for each server name) with a test connection (SELECT 1) to the server, logging in a table whether the instance was up or down.  Lots of strange error handling had to be done to accomplish this.

I had the following issues with this I needed to address:
  1. The fixed list of servers was clunky and sure to be left unmaintained at some point.  I'd much rather use a more actively maintained list of current servers in our environment (we have just shy of 100 instances right now).
  2. The For-Each container was a painfully iterative process...each one done one at a time, and if any failures were noted, it slowed the process by 10-20 seconds for the timeout to occur.
  3. Logging an up or down of every single instance every few minutes is logging tons of data for what should ultimately be assumed; why not just log outages?
In my next blog entry I'll delve into the minutiae of what I actually did to solve this problem.  Til next time...

No comments: