19 October 2012

Windowing Functions Brainbender

An offhand tweet from Adam Machanic of sp_WhoIsActive fame had me toying with a seemingly simple T-SQL problem:

Imagine you have a table of sales orders.  The only fields we'll concern ourselves with is OrderDate and OrderID.  OrderDate obviously can have duplicate values, and OrderID generally should be unique.  Imagine you want to generate a query with the table records, but have an additional field that shows the next date that exists with a sales order.  For instance, if you have five orders that took place on September 1st, and the next orders in the system were on September 5th; the NextDate value for those five orders on September 1st would all be September 5th.

My first (successful) attempt at this looked like this:

SELECT
SalesOrder.OrderID,    
SalesOrder.OrderDate,
datequery.NextDate as NextDate
   
FROM SalesOrder
INNER JOIN
  (
   SELECT TOP 100 PERCENT a.OrdDate,
    lead(a.OrdDate) over (order by a.OrdDate) as NextDate
    FROM (select distinct OrderDate as OrdDate FROM SalesOrder ) as a
    order by a.OrdDate
  ) as datequery
ON datequery.OrdDate = SalesOrder.OrderDate

Essentially what is happening wiht this query is that I have a two level nested query.  Working from the inside out, first I select all the distinct OrderDate values into a set, thus removing the duplicates.  Then I use the LEAD() windowing function to find the next row value from that set for every given date.  Then finally I join that to a general query of the table, to get the NextDate value for any given row.  It works...but perhaps not very efficient.

SELECT
 a.OrderID,
 a.OrderDate,
 (
 SELECT (MIN(b.OrderDate))
 FROM SalesOrder AS b
 WHERE b.OrderDate > a.OrderDate
 ) AS NextDate
FROM SalesOrder AS a

In this version above, I found a much simpler way of providing the intended results without using windowing functions at all.  NextDate is provided through a subquery by selecting the minimum value of OrderDate that is greater than the current date.

Lastly, one more version:

WITH offsetCTE(SalesOrderID, offset) AS
 (
 SELECT TOP 100 PERCENT c.OrderID,    
 ROW_NUMBER() OVER (PARTITION BY c.OrderDate ORDER BY c.OrderID DESC) AS offset
 FROM SalesOrder AS c
 ORDER BY c.OrderID ASC
 )
SELECT
 a.OrderID,
 a.OrderDate,
 LEAD(a.OrderDate, offsetCTE.offset) OVER (ORDER BY a.OrderDate, a.OrderID) AS NextDate
FROM SalesOrder AS a
INNER JOIN offsetCTE ON a.OrderID = offsetCTE.OrderID
ORDER BY a.OrderDate, a.OrderID ASC


In this case, I create a common table expression designed to calculate the number of rows from one row to the next row that contains a different date value.  Using the ROW_NUMBER() function, and partitioning by date (ordering by OrderID descending), I get a reverse row count number for each group of orders per date.  Then I use that CTE in a query where I use the LEAD() function again, but this time, take the offset value I calculated to be the number of rows to skip ahead, the optional second parameter of LEAD(). 

More complicated, admittedly, and with all the permutations I've done so far, I'm not sure which of these would qualify as the most efficient solution.  Fun exercise regardless and my thanks to Adam for offering some guidance and prompting as I noodled through it.

No comments: