20 September 2013

Solving the Uptime Problem Part Two

My last entry detailed our need for a SQL Server-specific uptime monitoring solution, both for outage alerts and for uptime reporting to management.  There I discussed why we needed it and my earlier attempts to put something together; today I'll detail the technical components.

Table Structure
The back end of this solution is extremely simple.  On an instance we use for DBA monitoring utilities (which also happens to house our Central Management Server repository), I have a small database called DBAUtil, and within it this solution needs just one table, named OutageLog.  OutageLog has an identity column primary key, a ServerName varchar column that stores the hostname\instance of the server, a StartTime column, and a nullable EndTime column.  StartTime has a default constraint of GETDATE(), so an inserted record picks up the current date and time, and to facilitate updates the clustered index is on ServerName.
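For reference, the table looks something like this (column names, types, and lengths are approximations of what I described, not a verbatim copy of my DDL):

```sql
-- Sketch of OutageLog as described; exact names/types are approximations.
CREATE TABLE dbo.OutageLog (
    OutageLogID INT IDENTITY(1,1) NOT NULL,
    ServerName  VARCHAR(128)      NOT NULL,   -- hostname\instance
    StartTime   DATETIME          NOT NULL
        CONSTRAINT DF_OutageLog_StartTime DEFAULT (GETDATE()),
    EndTime     DATETIME          NULL,       -- NULL = outage still open
    CONSTRAINT PK_OutageLog PRIMARY KEY NONCLUSTERED (OutageLogID)
);

-- Clustered on ServerName to make the open/close outage lookups cheap.
CREATE CLUSTERED INDEX CIX_OutageLog_ServerName ON dbo.OutageLog (ServerName);
```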

Populating a List of Instances
One of the main problems I detailed in the last entry was having to hard-code yet another list of monitored servers, which would be assured to fall into unmaintained hell mere seconds after implementing.  What can I say, I don't trust -myself- to keep all these disparate lists maintained, much less any future successor.  In my mind, we have one place that is always kept accurate...our Registered Server list in our Central Management Server.  I kept hunting for a way to use the CMS querying functionality in a batch process, but came up short.  Ultimately I used a combination of CMS and our DTSX package idea...if we could find the back end of CMS and pull out a list of servers, we'd be on the right track.  Our CMS structure has an "ALL" folder that contains subfolders for each environment (Dev, Test, Model Office, Prod, and a few others), and these contain all currently live SQL Server instances.  So I needed a way to pull in not just the contents of one folder, but all servers in a folder tree structure.  Hearkening back to my days designing Bill of Materials reports for a manufacturing company, I knew the tool I wanted...the recursive CTE.  Here is the query I used.  I include two variables, a root folder name from which to start the recursion, and an "omit" value so that I could easily separate Prod (root folder Prod, omit folder null) and non-Prod (root folder ALL, omit folder Prod).
WITH RecursiveTree AS (
    -- Anchor: the folder to start the recursion from
    SELECT server_group_id
    FROM msdb.dbo.sysmanagement_shared_server_groups_internal
    WHERE name = @RootFolderName
    UNION ALL
    -- Recursive member: walk down into all child folders
    SELECT Child.server_group_id
    FROM msdb.dbo.sysmanagement_shared_server_groups_internal AS Child
    JOIN RecursiveTree AS Parent ON Child.parent_id = Parent.server_group_id
)
SELECT a.server_name AS Name
FROM msdb.dbo.sysmanagement_shared_registered_servers_internal a
INNER JOIN msdb.dbo.sysmanagement_shared_server_groups_internal b
    ON a.server_group_id = b.server_group_id
WHERE a.server_group_id IN (SELECT server_group_id FROM RecursiveTree)
-- A NULL @OmitFolderName omits nothing
AND (b.name != @OmitFolderName OR @OmitFolderName IS NULL)
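The two runs I described map to parameter values like so.  In my case these are supplied as SSIS variables, but shown here as plain T-SQL declarations for illustration:

```sql
-- Prod run: recurse from the Prod folder, omit nothing
DECLARE @RootFolderName sysname = 'Prod';
DECLARE @OmitFolderName sysname = NULL;

-- Non-prod run: recurse from ALL, omit the Prod folder
-- DECLARE @RootFolderName sysname = 'ALL';
-- DECLARE @OmitFolderName sysname = 'Prod';
```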
This package is the parent package of the process:

Basically, after the aforementioned parameters are passed in, the CMS tables in msdb are queried using the CTE query above, and the list of servers is stored in a variable.  That list then feeds a Foreach loop, iterating once per server in the list.  I tried several possible solutions for the inside of this container.  First was the Execute Package task, but I was unhappy with its serial execution (even when the out-of-process option is used), so I started trying to use DTEXEC instead.  At first the Execute Process task also seemed to run DTEXEC serially, until I realized I could pass CMD.exe as the executable, with DTEXEC and its whole string of arguments (including the server name to pass to the child package) as the arguments; at that point everything started running in a quasi-parallel mode.  Having almost 100 instances makes it important for us to parallelize this as much as possible; the workload is light, but the slow timeouts would kill us if operating in serial.
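The Execute Process task configuration ends up shaped roughly like this.  The package path and variable name here are illustrative placeholders, not my actual values:

```
Executable: C:\Windows\System32\cmd.exe
Arguments:  /C start "" dtexec /F "C:\SSIS\OutageWatch_Child.dtsx"
            /Set \Package.Variables[User::ServerName].Properties[Value];"SERVERA\INST1"
```

The `start ""` is the trick: cmd.exe launches dtexec in its own process and returns immediately, so the Foreach loop can move on to the next server while the previous check is still running.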

Then on to the child package, which accepts a server name as its input parameter.

There is an ADO.NET connection manager that uses the server name variable, via an expression, as its connection; this is how we dynamically build a connection to the target server.  The first step is literally nothing more than running "SELECT 1" against the server.  I allow more than one error on this step because if it fails, I don't want the entire package to fail.  I have a success and a failure constraint coming off this test task.  If the test fails, we go to the "Log Outage If Not Already" task, which hits DBAUtil.dbo.OutageLog with the following query:
IF NOT EXISTS (SELECT 1 FROM OutageLog
    WHERE EndTime IS NULL
    AND ServerName = @ServerName)
INSERT INTO OutageLog (ServerName) VALUES (@ServerName)
Essentially this looks to see if there is already an open outage for this server...if there is not one, then as the discoverer it must be the first to log this outage, so it inserts an outage record.
If, however, the server responds to the SELECT 1 query, we go to the Close Active Outages task:
IF EXISTS (SELECT 1 FROM OutageLog
    WHERE EndTime IS NULL
    AND ServerName = @ServerName)
UPDATE OutageLog SET EndTime = GETDATE()
WHERE EndTime IS NULL AND ServerName = @ServerName
This similarly checks for the existence of an open outage; if one exists, now that the server is responding, it closes the outage by updating that record's EndTime value.  Otherwise it does nothing; hopefully this is the most common logical path!

Watching the Watcher
As an extra failsafe I have another copy of the DBAUtil database and the DBAX-OutageWatch_CheckStatus package on a similar monitoring server across the wire in our offsite facility.  I have this package checking for and logging outages of one hardcoded server: the one that runs the packages above.  This way, we can get alerts on a failure of the monitoring server itself.  It functions the same as the above, but instead of having the parent package query CMS, I just push a fixed server name straight into the child package.

About those Alerts...
The jobs that launch these packages are separated into prod and non-prod, as I alluded to earlier.  This allows me to set up slightly different logging frequencies, with staggered run times, and customized alerting.  The second job step does precisely that...alerting:
--Give the job some time to ensure the other job has
--completed its updates.
WAITFOR DELAY '00:00:10';

--Variable to place Body text in for notification email
DECLARE @Body VARCHAR(MAX);

--Generate Body text based on servers still down
--(null if no servers are down)
SELECT @Body = COALESCE(@Body, '')
    + 'Server ' + a.ServerName + ' is down for '
    + CAST(DATEDIFF(mi, a.StartTime, GETDATE()) AS VARCHAR(10))
    + ' minutes, since ' + CAST(a.StartTime AS VARCHAR(50))
    + '.' + CHAR(13)
FROM DBAUtil.dbo.OutageLog a
INNER JOIN msdb.dbo.sysmanagement_shared_registered_servers_internal b
    ON a.ServerName = b.name
--Outage is still active
WHERE a.EndTime IS NULL
--Outage has been for 5 minutes at least
AND DATEDIFF(mi, a.StartTime, GETDATE()) >= 5
--Servers are in the prod folder in CMS
AND b.server_group_id = 6;

--If servers are down send an email with the @Body text
IF @Body IS NOT NULL
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'WhatevertheProfileIs',
    @recipients = 'whoevertheoncallchumpis@yourcompany.com',
    @subject = 'Prod Server Down, Wake UP!',
    @body = @Body;
Note that this is my third iteration of this script after some testing.  I got a few incidents where I received an email with no results, back when I was doing an IF EXISTS that selected for an active outage and then a sp_send_dbmail query that ran the same select.  Our logging process is asynchronous, and sp_send_dbmail puts its work onto its own queue and is not guaranteed to operate within the bounds of one transaction.  So it is possible, and indeed happened in this case, that an outage was logged when the script decided to send the mail, but had cleared by the time the mail query actually ran...a phantom-read type issue.  I resolved this by placing a 10 second delay to allow extra room for outage-clearing updates to occur, and by generating the body of the email into a variable first and passing that variable into the email alert, eliminating the chance of differing results from two instances of the same query.
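For contrast, the earlier race-prone shape looked roughly like this (reconstructed for illustration, not my exact script):

```sql
-- Race-prone: two separate reads of OutageLog.
IF EXISTS (SELECT 1 FROM DBAUtil.dbo.OutageLog
           WHERE EndTime IS NULL
           AND DATEDIFF(mi, StartTime, GETDATE()) >= 5)
BEGIN
    -- @query is re-run later, on Database Mail's own queue; if the
    -- outage clears in between, the email arrives with no results.
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'WhatevertheProfileIs',
        @recipients = 'whoevertheoncallchumpis@yourcompany.com',
        @subject = 'Prod Server Down, Wake UP!',
        @query = 'SELECT ServerName, StartTime
                  FROM DBAUtil.dbo.OutageLog WHERE EndTime IS NULL';
END
```

The fix amounts to reading the table exactly once and sending whatever that single read saw.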

So that's my system in a nutshell, in its infancy.  Plenty of work to go hardening it up I'm sure.  Maybe in a future post I'll detail out my Uptime RDL magic I did this week using Reporting Services.  It was fun to get back into SSRS after a long hiatus!  Fit like an old glove.  Or like OJ Simpson's bloody glove...well, whatever the case...

19 September 2013

Solving the Uptime Problem Part One

As a DBA, I have to worry about uptime from two angles.  The first is one of raw survival.  I want to be the first to know when an instance decides to shuffle off this mortal coil and join the choir invisible.  I hate being told my servers are down; I absolutely want to know first.  And perhaps less critical but still important is being able to give KPI figures to the management.  So...how can we say what our SQL environment "uptime" is?

Our infrastructure chaps have all sorts of tools...SolarWinds Orion, SCOM, etc....that can give us -server- uptime, and they can provide that metric to management on their own.  What we are responsible for is a layer deeper: instance uptime.  When SQL Server is online, servicing queries.  If the server is turned on but the SQL service has shot craps, that is a SQL outage.  So we can't just do something simple like logging ping responses.

I've tried solving this in a handful of ways.  The first attempt was a painful and probably misguided debacle where I tried hacking the back-end of our existing SQL Server monitoring solution, Quest Spotlight.  If anyone out there has monkeyed with Spotlight to any extent they know of which I speak; the back-end databases are all but useless.  There was no possibility of using its history to accurately come up with a picture of uptime.  That said, I did build a fragile little alerting job based on the back end; the outage alerts from Spotlight fire immediately and often are false positives due to network connectivity issues and the like.  I wanted to be able to get an email that I trusted that indicated a hard outage, so I wrote a job that queried the back end for active outage alarms over 15 minutes, and emailed my personal email.  To this day this job is still occasionally saving my bacon.  When no one else is paying attention this job will catch an overlooked outage.

But the uptime calculation problem remained.  I grudgingly started to realize that what I wanted was going to have to be hand built, and so began a few abortive attempts at an SSIS package.  I started with a fixed list of servers/instances in a table in a DBA utility database, and a package that ran a Foreach container (one iteration per server name) with a test connection (SELECT 1) to each server, logging in a table whether the instance was up or down.  Lots of strange error handling had to be done to accomplish this.

I had the following issues with this I needed to address:
  1. The fixed list of servers was clunky and sure to be left unmaintained at some point.  I'd much rather use a more actively maintained list of current servers in our environment (we have just shy of 100 instances right now).
  2. The Foreach container was a painfully iterative process...each server tested one at a time, and any failure slowed the process by 10-20 seconds while waiting for the timeout.
  3. Logging an up or down of every single instance every few minutes is logging tons of data for what should ultimately be assumed; why not just log outages?
In my next blog entry I'll delve into the minutiae of what I actually did to solve this problem.  Til next time...