20 September 2013

Solving the Uptime Problem Part Two

My last entry detailed our need for a SQL Server-specific uptime monitoring solution, both for outage alerts and for uptime reporting to management. There I discussed why we needed it and my earlier attempts to put something together; today I'll detail the technical components.

Table Structure
The back end of this solution is extremely simple. On an instance we use for DBA monitoring utilities, which also happens to house our Central Management Server repository, I have a small database I called DBAUtil, and within it one table is needed for this solution, named OutageLog.  OutageLog has an identity column primary key, a ServerName varchar column that stores the hostname\instance of the server, a StartTime column, and a nullable EndTime column.  StartTime has a default constraint of GETDATE() so an inserted record picks up the current date and time, and to facilitate updates the clustered index is on ServerName.
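If you want to build along at home, a minimal sketch of that table looks something like this (the column sizes and constraint names here are placeholders; only the shape matters):
CREATE TABLE dbo.OutageLog (
OutageLogID INT IDENTITY(1,1) NOT NULL
CONSTRAINT PK_OutageLog PRIMARY KEY NONCLUSTERED,
--hostname\instance of the monitored server
ServerName VARCHAR(128) NOT NULL,
--defaults to the moment the outage record is inserted
StartTime DATETIME NOT NULL
CONSTRAINT DF_OutageLog_StartTime DEFAULT (GETDATE()),
--stays NULL while the outage is still open
EndTime DATETIME NULL
)
--clustered on ServerName to speed the frequent per-server checks and updates
CREATE CLUSTERED INDEX CIX_OutageLog_ServerName ON dbo.OutageLog (ServerName)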

Populating a List of Instances
One of the main problems I detailed in the last entry was having to hard-code yet another list of monitored servers, which would be assured to fall into unmaintained hell mere seconds after implementation.  What can I say, I don't trust -myself- to keep all these disparate lists maintained, much less any future successor.  In my mind, we have one place that is always kept accurate...our Registered Servers list in our Central Management Server.  I kept hunting for a way to use the CMS querying functionality in a batch process, but I came up short.  Ultimately, though, I used a combination of CMS and our DTSX package idea...if we could find the back end of CMS and pull out a list of servers, we'd be on the right track.  Our CMS structure has an "ALL" folder that contains subfolders for each environment (Dev, Test, Model Office, Prod, and a few others), and these contain all currently live SQL Server instances.  So I needed a way to pull in not just the contents of one folder, but all servers in a folder tree structure.  Hearkening back to my days designing Bill of Materials reports for a manufacturing company, I knew the tool I wanted...the recursive CTE.  Here is the query I used.  It takes two variables: a root folder name from which to start the recursion, and an "omit" value so that I could easily separate Prod (root folder Prod, omit folder NULL) and non-Prod (root folder ALL, omit folder Prod).
--Parameter sets: Prod is @RootFolderName = 'Prod', @OmitFolderName = NULL;
--non-Prod is @RootFolderName = 'ALL', @OmitFolderName = 'Prod'
WITH RecursiveTree AS (
--Anchor: the group id of the root folder
SELECT server_group_id
FROM msdb.dbo.sysmanagement_shared_server_groups_internal
WHERE name = @RootFolderName
UNION ALL
--Recurse into every child folder of a folder already in the tree
SELECT Child.server_group_id
FROM msdb.dbo.sysmanagement_shared_server_groups_internal AS Child
JOIN RecursiveTree AS Parent ON Child.parent_id = Parent.server_group_id
)
SELECT a.server_name AS Name
FROM msdb.dbo.sysmanagement_shared_registered_servers_internal a
INNER JOIN msdb.dbo.sysmanagement_shared_server_groups_internal b
ON a.server_group_id = b.server_group_id
WHERE a.server_group_id IN (SELECT server_group_id FROM RecursiveTree)
--NULL means "omit nothing"; without the IS NULL check, b.name != NULL
--evaluates to UNKNOWN and would filter out every row
AND (@OmitFolderName IS NULL OR b.name != @OmitFolderName)
DBAX-OutageWatch_Initialize.dtsx
This package is the parent package of the process:

(screenshot of the package's control flow)

Basically, after passing in the aforementioned parameters, the msdb CMS tables are queried using the CTE above, and the list of servers is stored in a variable.  That feeds a Foreach loop, which iterates over each value in the list.  I tried several possible solutions for the inside of this container.  First was the Execute Package task, but I was unhappy with its serial execution (even when the out-of-process option is used), so I started trying DTEXEC instead.  The Execute Process task also seemed to run DTEXEC serially, until I realized I could pass CMD.exe as the executable, with DTEXEC and its whole string of arguments (including the server name to pass to the child package) as the arguments; at that point everything started running in a quasi-parallel mode.  Having almost 100 instances makes it important for us to parallelize this as much as possible; the workload is light, but the slow connection timeouts would kill us if operating in serial.
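To give a flavor, the Execute Process task ends up launching something shaped like the line below; the package path, variable name, and server name are placeholders, and cmd's "start" is one way to fire the child package off without waiting on it:
cmd.exe /c start "" dtexec /F "D:\SSIS\DBAX-OutageWatch_CheckStatus.dtsx" /Set "\Package.Variables[User::ServerName].Properties[Value];SERVER01\INST1"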

DBAX-OutageWatch_CheckStatus.dtsx
Then on to the child package, which accepts a server name as its input parameter.

(screenshot of the package's control flow)

There is an ADO.NET connection manager that uses the server name variable, via an expression, as its connection; this is how we dynamically build a connection to the target server (a sketch of that expression is at the end of this section).  The first step is literally nothing more than running "SELECT 1" against the server.  I allow more than one error on this step because if it fails, I don't want the entire package to fail.  I have a success and a failure constraint coming off this testing task.  If the test fails, we go to the "Log Outage If Not Already" task, which hits DBAUtil.dbo.OutageLog with the following query:
IF NOT EXISTS
(SELECT 1 FROM OutageLog WHERE EndTime IS NULL
AND ServerName = @ServerName)
BEGIN
INSERT INTO OutageLog (ServerName) VALUES (@ServerName)
END
Essentially this looks to see if there is an open outage for this server...if there is not one, then as the discoverer, this run must be the first to log the outage, and it inserts an outage record.
If, however, the server responds to the SELECT 1 query, we go to the Close Active Outages task:
IF EXISTS
(SELECT 1 FROM OutageLog WHERE EndTime IS NULL
AND ServerName = @ServerName)
BEGIN
UPDATE OutageLog SET EndTime = GETDATE()
WHERE EndTime IS NULL AND ServerName = @ServerName
END
This similarly checks for the existence of an open outage; if one exists, now that the server is responding, it closes the outage by updating that record's EndTime value.  Otherwise it does nothing...hopefully the most common logical path!
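As for the dynamic connection I mentioned, it comes from putting an expression on the connection manager's ConnectionString property, along these lines (User::ServerName is my stand-in for whatever the variable is actually named):
"Data Source=" + @[User::ServerName] + ";Initial Catalog=master;Integrated Security=SSPI;"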

Watching the Watcher
As an extra failsafe, I have another copy of the DBAUtil database and the DBAX-OutageWatch_CheckStatus package on a similar monitoring server across the wire in our offsite facility.  There, the package runs to check for and log outages of one hardcoded server: the one that runs the above packages.  This way, we can get alerts on a failure of the monitoring server itself.  It functions the same as the above, but instead of having the parent package query CMS, I just push a fixed server name straight into the child package.
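In practice that just means the job over there calls the child package directly with the monitoring server's name baked in; again, the path and names are placeholders:
dtexec /F "D:\SSIS\DBAX-OutageWatch_CheckStatus.dtsx" /Set "\Package.Variables[User::ServerName].Properties[Value];MONITORSRV01"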

About those Alerts...
The jobs that launch these packages are separated into prod and non-prod, as I alluded to earlier.  This allows me to set up slightly different logging frequencies, staggered run times, and customized alerting.  The second job step does precisely that...alerting:
--Give the job some time to ensure the other job has
--completed its updates.
WAITFOR DELAY '00:00:10';
--Variable to place Body text in for notification email
DECLARE @Body VARCHAR(MAX)
--Generate Body text based on servers still down
--(null if no servers are down)
SELECT @Body = COALESCE(@Body, '')
+ 'Server ' + a.ServerName + ' is down for '
+ CAST(DATEDIFF(mi, a.StartTime, GETDATE()) AS VARCHAR(50))
+ ' minutes, since ' + CAST(a.StartTime AS VARCHAR(50))
+ '.' + CHAR(13)
FROM DBAUtil.dbo.OutageLog a
INNER JOIN
msdb.dbo.sysmanagement_shared_registered_servers_internal b
ON a.ServerName = b.name
--Outage is still active
WHERE a.EndTime IS NULL
--Outage has been for 5 minutes at least
AND DATEDIFF(mi, a.StartTime, GETDATE()) >= 5 
--Servers are in the prod folder in CMS
AND b.server_group_id = 6 
--If servers are down send an email with the @Body text
IF (@Body IS NOT NULL)
BEGIN
EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'WhatevertheProfileIs',
@recipients = 'whoevertheoncallchumpis@yourcompany.com',
@subject = 'Prod Server Down, Wake UP!',
@body = @Body
END
Note that this is my third iteration of this script after some testing.  With the earlier versions I got a few incidents where I received an email with no results: back then I did an IF EXISTS check for an active outage, and then had sp_send_dbmail run a @query that performed the same select.  Because our logging process is asynchronous, and because sp_send_dbmail queues its work and is not guaranteed to operate within the bounds of one transaction, it is possible (and indeed happened) that the IF EXISTS saw an active outage, but by the time the mail's query ran, the outage had been cleared...a phantom read type issue.  I resolved this by placing a 10 second delay to allow extra room for outage-clearing updates to occur, and by generating the body of the email into a variable first, then passing that variable into the email alert, eliminating the chance of two instances of the same query seeing different results.
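Reconstructed for illustration (not the exact code), the old race-prone shape looked roughly like this:
--IF EXISTS and @query each read OutageLog at a different moment
IF EXISTS (SELECT 1 FROM DBAUtil.dbo.OutageLog WHERE EndTime IS NULL)
BEGIN
EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'WhatevertheProfileIs',
@recipients = 'whoevertheoncallchumpis@yourcompany.com',
@subject = 'Prod Server Down, Wake UP!',
--this query runs later, in Database Mail's own session, so it can
--see a different (already cleared) set of outages than IF EXISTS did
@query = 'SELECT ServerName, StartTime FROM DBAUtil.dbo.OutageLog WHERE EndTime IS NULL'
END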

So that's my system in a nutshell, in its infancy.  Plenty of work to go in hardening it up, I'm sure.  Maybe in a future post I'll detail the Uptime RDL magic I did this week using Reporting Services.  It was fun to get back into SSRS after a long hiatus!  It fit like an old glove.  Or like OJ Simpson's bloody glove...well, whatever the case...

