24 October 2012

Becoming Slightly More Normal

Today's idea for the blog concerns inching towards normalcy...that is, the three main normal forms of database design.  I always get a little fuzzy in the head when I think about the precise official definitions of the three, and generally examples are the best way I have of comprehending normalization.

So let's start out with a sample database.  As this is my blog, and the apparently absent readership has put up no protestations, I shall choose Hindustani classical music as my subject: a database of ragas.

Raga
Name          | Thaat   | Mode    | Performance1       | Performance2
Yaman         | Kalyan  | Lydian  | Ustad Imrat Khan   | Hariprasad Chaurasia
Darbari       | Asavari | Aeolian | Ustad Vilayat Khan | Dagar Brothers
Chandni Kedar | Kalyan  | Lydian  | Ustad Vilayat Khan | Ustad Amir Khan

So we can see from this that we've got a record of each raga, listing what thaat each is classified into, what Western mode that thaat is equivalent to, and a couple of noteworthy example performances of each raga.  The main problem here, the one that prevents us from achieving the first normal form (1NF), is the repeating group of performance columns.  We could have issues if there were more or fewer than two performances to record for a raga, for instance.  So to get to first normal form, we combine the performance columns into one column, since they store essentially the same information, and repeat each raga's row once per performance.

Raga
Name          | Thaat   | Mode    | Performance
Yaman         | Kalyan  | Lydian  | Ustad Imrat Khan
Yaman         | Kalyan  | Lydian  | Hariprasad Chaurasia
Darbari       | Asavari | Aeolian | Ustad Vilayat Khan
Darbari       | Asavari | Aeolian | Dagar Brothers
Chandni Kedar | Kalyan  | Lydian  | Ustad Vilayat Khan
Chandni Kedar | Kalyan  | Lydian  | Ustad Amir Khan

Alright, so now we have it in first normal form. However, we've doubled our row count, and we're storing a lot of duplicate information in our table. The key is now the combination of Name and Performance, and Thaat and Mode depend on only the Name part of it. Since the relationship between a raga and a performance is one-to-many, we need to split the performances out onto their own table in order to achieve the second normal form. A foreign key (between Performance.Raga and Raga.Name) can be used to connect the data across tables.
Raga
Name          | Thaat   | Mode
Yaman         | Kalyan  | Lydian
Darbari       | Asavari | Aeolian
Chandni Kedar | Kalyan  | Lydian


Performance
Raga          | Performer
Yaman         | Ustad Imrat Khan
Yaman         | Hariprasad Chaurasia
Darbari       | Ustad Vilayat Khan
Darbari       | Dagar Brothers
Chandni Kedar | Ustad Vilayat Khan
Chandni Kedar | Ustad Amir Khan

Having achieved second normal form (2NF), we note one additional point of redundancy. Fields that depend on the primary key (in our case, the raga itself) only transitively, by way of some other field, must be "offshored". Note that Mode is essentially derived from Thaat; it isn't really an attribute of the raga itself, it is more a further description of the thaat quality. So one more time:
Raga
Name          | Thaat
Yaman         | Kalyan
Darbari       | Asavari
Chandni Kedar | Kalyan


Performance
Raga          | Performer
Yaman         | Ustad Imrat Khan
Yaman         | Hariprasad Chaurasia
Darbari       | Ustad Vilayat Khan
Darbari       | Dagar Brothers
Chandni Kedar | Ustad Vilayat Khan
Chandni Kedar | Ustad Amir Khan

Thaat
Name    | Mode
Kalyan  | Lydian
Asavari | Aeolian
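
Just for the record, here's a sketch of this final design in T-SQL; the data types are my own guesses, purely for illustration:

CREATE TABLE Thaat (
    Name varchar(50) NOT NULL PRIMARY KEY,
    Mode varchar(50) NOT NULL
);

CREATE TABLE Raga (
    Name  varchar(50) NOT NULL PRIMARY KEY,
    Thaat varchar(50) NOT NULL REFERENCES Thaat (Name)
);

CREATE TABLE Performance (
    Raga      varchar(50)  NOT NULL REFERENCES Raga (Name),
    Performer varchar(100) NOT NULL
);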

Hey presto, third normal form (3NF) in all its grandeur. It's a very clean and proper way to design databases, but working in the real world, I think it's worth remembering something stated by Karen Lopez at the KC SQL User Group last August: "Normalization is a description, not an objective" (paraphrased). There are times when third normal form is ideal, times when denormalized data is the way to go. The classic IT/DBA mantra applies..."It Depends".

23 October 2012

Locks, Blocks, and Deadlocks

The concepts of locking, blocking, and deadlocking are simple on the surface, with quite a bit of complexity lurking in the details beneath.  But as a parent of two small children, I see these concepts modelled on a regular basis.

Locking:
First, the query requests a lock on a resource.  So my son grabs a toy to play with.  He has acquired a lock on that toy, a marker of sorts saying "I'm playing with this right now".  I won't painfully try to thrash out the details of isolation levels and shared vs. exclusive locks, as that would basically cause an already taxed analogy to implode.

Blocking:
Then my daughter looks at her brother's toy and thinks, I'd like to play with that toy.  Too bad...a lock has been issued, and your request to play with that toy is denied.  Blocking may end if my son lays the toy aside, thus releasing the lock on the toy.

Deadlocking:
While my son still has a "lock" held on the toy, and is blocking my daughter's request for it, my daughter picks up another toy, now holding a lock on that item.  Then my son decides he wants that toy, too, and requests it.  We now have both kids, each holding a lock on a toy, wanting the other's toy, and blocking the other's request for it.  Neither will relinquish the toy they have until they get the other toy and can finish their playing.  Deadlocking is this sort of stalemate that happens in the database/playroom when two queries get trapped in a kind of ill-fated interdependency, and the database engine/parents have to step in and choose a "deadlock victim" whose query gets rolled back so that at least one of the two can be satisfied.
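
If you'd rather watch this happen in SSMS than in the playroom, here's a little repro sketch (the ToyBox table is made up), run across two separate query windows:

CREATE TABLE ToyBox (ToyID int PRIMARY KEY, Owner varchar(20));
INSERT INTO ToyBox (ToyID, Owner) VALUES (1, NULL), (2, NULL);

-- Window 1 (my son):
BEGIN TRAN;
UPDATE ToyBox SET Owner = 'Son' WHERE ToyID = 1;       -- locks toy 1

-- Window 2 (my daughter):
BEGIN TRAN;
UPDATE ToyBox SET Owner = 'Daughter' WHERE ToyID = 2;  -- locks toy 2
UPDATE ToyBox SET Owner = 'Daughter' WHERE ToyID = 1;  -- blocked by window 1

-- Back in window 1:
UPDATE ToyBox SET Owner = 'Son' WHERE ToyID = 2;       -- deadlock!
-- The engine picks a deadlock victim and rolls its transaction back (error 1205).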

At this rate, if I keep on torturing analogies like this I shall be receiving letters from Amnesty International.

19 October 2012

Windowing Functions Brainbender

An offhand tweet from Adam Machanic of sp_WhoIsActive fame had me toying with a seemingly simple T-SQL problem:

Imagine you have a table of sales orders.  The only fields we'll concern ourselves with are OrderDate and OrderID.  OrderDate obviously can have duplicate values, and OrderID generally should be unique.  Imagine you want to generate a query with the table records, but with an additional field that shows the next date on which a sales order exists.  For instance, if you have five orders that took place on September 1st, and the next orders in the system were on September 5th, then the NextDate value for those five September 1st orders would all be September 5th.
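
If you want to play along at home, here's a minimal version of such a table; the names and dates are invented to match the queries below:

CREATE TABLE SalesOrder (
    OrderID   int  NOT NULL PRIMARY KEY,
    OrderDate date NOT NULL
);

INSERT INTO SalesOrder (OrderID, OrderDate) VALUES
    (1, '2012-09-01'), (2, '2012-09-01'), (3, '2012-09-01'),
    (4, '2012-09-01'), (5, '2012-09-01'), (6, '2012-09-05'),
    (7, '2012-09-05'), (8, '2012-09-12');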

My first (successful) attempt at this looked like this:

SELECT
    SalesOrder.OrderID,
    SalesOrder.OrderDate,
    datequery.NextDate AS NextDate
FROM SalesOrder
INNER JOIN
    (
    -- For each distinct order date, LEAD() grabs the next distinct date
    SELECT a.OrdDate,
        LEAD(a.OrdDate) OVER (ORDER BY a.OrdDate) AS NextDate
    FROM (SELECT DISTINCT OrderDate AS OrdDate FROM SalesOrder) AS a
    ) AS datequery
    ON datequery.OrdDate = SalesOrder.OrderDate

Essentially what is happening with this query is that I have a two-level nested query.  Working from the inside out: first I select all the distinct OrderDate values into a set, thus removing the duplicates.  Then I use the LEAD() windowing function to find the next date in that set for every given date.  Finally, I join that back to a general query of the table, to get the NextDate value for any given row.  It works...but it's perhaps not very efficient.  So, attempt number two:

SELECT
    a.OrderID,
    a.OrderDate,
    (
    -- The earliest OrderDate later than this row's date
    SELECT MIN(b.OrderDate)
    FROM SalesOrder AS b
    WHERE b.OrderDate > a.OrderDate
    ) AS NextDate
FROM SalesOrder AS a

In this version above, I found a much simpler way of producing the intended results without using windowing functions at all.  NextDate is provided through a correlated subquery that selects the minimum OrderDate greater than the current row's OrderDate.

Lastly, one more version:

WITH offsetCTE (OrderID, offset) AS
    (
    -- Reverse row number within each date group: the last order on a given
    -- date gets 1, the first gets the count of orders for that date
    SELECT c.OrderID,
        ROW_NUMBER() OVER (PARTITION BY c.OrderDate ORDER BY c.OrderID DESC) AS offset
    FROM SalesOrder AS c
    )
SELECT
    a.OrderID,
    a.OrderDate,
    -- Skip ahead just far enough to land on the first row of the next date
    LEAD(a.OrderDate, offsetCTE.offset) OVER (ORDER BY a.OrderDate, a.OrderID) AS NextDate
FROM SalesOrder AS a
INNER JOIN offsetCTE ON a.OrderID = offsetCTE.OrderID
ORDER BY a.OrderDate, a.OrderID ASC


In this case, I create a common table expression designed to calculate the number of rows between a given row and the next row that contains a different date value.  Using the ROW_NUMBER() function, partitioning by date and ordering by OrderID descending, I get a reverse row count for each group of orders per date.  Then I use that CTE in a query with the LEAD() function again, but this time I pass the offset I calculated as the number of rows to skip ahead, via the optional second parameter of LEAD().

More complicated, admittedly, and with all the permutations I've tried so far, I'm not sure which of these would qualify as the most efficient solution.  A fun exercise regardless, and my thanks to Adam for offering some guidance and prompting as I noodled through it.

11 October 2012

The Browser Service

Yesterday a developer friend (of whom Steve Ballmer thinks so highly) was bemoaning his company's recent loss of their Server Monkey: his app was all of a sudden unable to connect to their SQL Server, even though the server was up and seemed to be operational.

Once I learned he was connecting to a named instance, I had him check the SQL Server Browser service.  That was one of those questions that showed up on just about every practice test for the MCITP exams, so it was drilled into my head...sure enough, it was disabled.  If you're connecting to named instances (unless you provide the port number as well), you'll need the Browser service running to make sure your connection lands at the appropriate spot.
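
For example, with a made-up server, instance name, and port:

Connecting by instance name (the Browser service resolves the instance to its port):
sqlcmd -S APPSERVER\SALES

Connecting with an explicit port, no Browser service needed:
sqlcmd -S APPSERVER,50123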

So, a very simple tip today, but one I was pleased to see make it out of the academic side and into practical usage for me.

the browser service
is essential to connect
to named instances

08 October 2012

Detangling Nested Queries

As a one-time "accidental DBA", I both witnessed and committed a number of bad or inadvisable practices on my SQL Server.  We were a small shop, and early on the furthest thing from my mind was performance or efficiency.  There was one egregious infraction that's worth a mention here...

Views are definitely handy.  They can also promote and exacerbate developer laziness.  We had a view set up that would join order line and order header records, for a massive "SELECT *" style sales history output, with just about all columns.  Then someone came along and wrote a view using that view as a source table.  Then someone else came along and wrote an SSRS report based on that second view.  So here we had a highly focused end query returning a small amount of data from a view pulling data from another huge view, pulling data from the source tables.  I've done a bit of testing and I'm not convinced that it creates as much of a performance problem as it would seem; it appears that SQL Server expands the view definitions and optimizes the final query as a whole, doing the logical heavy lifting of making it efficient even though the code is not.  I'd be interested to know otherwise though!
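
A contrived sketch of the pattern, with all the table and view names invented:

CREATE TABLE OrderHeader (OrderID int PRIMARY KEY, OrderDate date, CustomerID int);
CREATE TABLE OrderLine (OrderID int, ItemNumber varchar(20), Quantity int);
GO

-- View 1: the big everything-joined "sales history" view
CREATE VIEW vSalesHistory AS
SELECT h.OrderID, h.OrderDate, h.CustomerID, l.ItemNumber, l.Quantity
FROM OrderHeader AS h
INNER JOIN OrderLine AS l ON l.OrderID = h.OrderID;
GO

-- View 2: someone builds on top of view 1...
CREATE VIEW vRecentSales AS
SELECT * FROM vSalesHistory WHERE OrderDate >= '2012-01-01';
GO

-- ...and the report query sits three layers above the actual tables
SELECT OrderID, OrderDate, ItemNumber, Quantity
FROM vRecentSales
WHERE CustomerID = 42;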

But the main problem is one of manageability.  With all these unnecessary dependencies, we end up with a needlessly complex amount of code...two view definitions and a query, where a single query would have sufficed.  Any change to a view definition can break the views and queries down the chain, in ways that aren't at all obvious.  So, one day when the office was quiet enough, I set about detangling these messes and rewriting the report queries to go straight to the tables and pull only the information they needed (removing the "SELECT *"...a bad habit that afflicted me in my early days, I confess).  Now when someone examined the report query, they could understand exactly what data it was pulling, without having to parse through the chain of view definitions it was retrieving data from.  This is especially useful when the end user questions data quality; in my case, I wasn't the report author, so I didn't have a full understanding of what the query was gathering.  If we write report queries as simply as possible and avoid the temptation to just tack them onto an existing view with some additional filtering, it will help the next guy in the chain, believe me.
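
Using the made-up views above, the detangled version of that report query goes straight to the tables and names exactly the columns it needs:

SELECT h.OrderID, h.OrderDate, l.ItemNumber, l.Quantity
FROM OrderHeader AS h
INNER JOIN OrderLine AS l ON l.OrderID = h.OrderID
WHERE h.CustomerID = 42
  AND h.OrderDate >= '2012-01-01';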

"The simplest thing that works", a good mantra for query writing.  Happy query detangling.

02 October 2012

Plotting My Learning Goals

So one of the things I've a bit of an ardent fondness for is developing (often obscure) skills and expertise.  I went through a period, let's call it a hangover, after school where I worked a rather monotonous job and was minimally concerned with sharpening and updating my technical skillset, but in the past couple of years I've moved back towards genuine curiosity and interest in excelling in a technical sphere rather than just treading water.

My core technical interest is SQL Server; many years ago I was handed a SQL Server 2000 (then later, 2005) box and given basic duties...back it up, keep it running...normal small-shop sysadmin duties.  Our databases at the time were out-of-the-box MS Dynamics applications, so very little involvement on my side was required.  As time went on, management would get increasingly frustrated with the limited capacity of the embedded reporting tools, and I got more and more chances to write and flesh out reports in SQL Server Reporting Services.  Since then, my interest has deepened, and the best parts of that job and my new one have been working with SQL in a DBA or reporting capacity.

So, over the past year, with a handful of books (particularly Tom Carpenter's excellent guide) and a lot of curious poking around with both evaluation and, er, -cough- production servers, I shored up my knowledge and was able to pass both tests for the MCITP in SQL Server 2008 certification fairly handily.  Does we rest on our laurelses, precioussss?  No we does not, gollum, gollum!

So, the next goal is to pass the two upgrade exams for the MCSA in SQL Server 2012.  I have a couple of 2012 evals running, and whenever I'm playing with a SQL box I try to work with those in particular, to aid in getting up to par with the new features, but at work we're still running 2008, as most folks probably are.  So I'm still preparing, and thinking about what book to procure, if any, to get ready for the certification exams.  But I'm planning on the first quarter of 2013, at least.

Another book I've just finished is Don Jones' "Learn Windows PowerShell in a Month of Lunches", which is a great format for an introductory course.  I'm no PowerShell guru, but it has helped fill in some of the mental blanks from my previous encounters with the language, and I feel much more confident behind a PS prompt than I used to.

From the library I found this Kimball Group book on data warehousing and BI, which was recommended to me by my boss, who went through Ralph Kimball's training and knew him rather well.  I've got precious little desire to work in depth in an OLAP/MDX/data warehousing environment, but perhaps part of that attitude is a fear of the unknown, so this is what I'm reading presently.

Queued up right behind that is an intro to Oracle for SQL Server DBAs.  As Wodehouse's Jeeves said (or was it a Stephen Fry improvisation?), it is well to know what tune the Devil is playing.

And of course, my studies continue aggressively in my lessons with Ustad Imrat Khan-saheb.  He is teaching me a lovely Alhaiya Bilawal, currently.  Here is his surbahar alap: