17 October 2014

Clustered Columnstore Indexes Care and Feeding

Allow me to briefly resuscitate this blog for a quick technical post. I mean to blog about our recent trip to Chicago which was mostly for SQLskills Immersion training, something I highly recommend, and also a bit of a personal vacation, but before I get to that, a bit about a new feature in SQL Server 2014, clustered columnstore indexes.

We are using these...yes, in production, even...for fact tables in a data warehouse. The columnar storage is very different for those of us used to the traditional row-based storage models. By storing in columns you can have very quick scan times for targeted queries (SELECT * is going to be painful, however, and thus must be avoided). Also, the columnstore compression inherent in the design is incredible, one table we got down to 6% its original size. More pages fit in cache, better use of I/O...all in all, very powerful feature.

We tried doing some dynamic partitioning of the clustered columnstore index to decrease I/Os by partition elimination. In the end, the gains were minimal enough that they don't warrant the upkeep and complexity of adding partitioning to it, but it remains an interesting option for the future.

Anyway, there are plenty of limitations, too. Try adding a nonclustered index to one of these structures...nope, not allowed. A clustered columnstore index is the exclusive index for any table it is created on. OK, how about we go look at statistics for the index?

Very light histogram, isn't it?

Also, don't bother with sys.dm_db_index_physical_stats, you won't get any output. It is specifically noted as incompatible with clustered columnstore indexes.

So what can you do with clustered columnstore indexes? Well, for one thing, you can fragment them up quite quickly. I'm still in process of learning about how they work, with the columnstore and the deltastore, but one particular query which gives you an idea of how many records in the index are marked for deletion (thus bloating your index with dead data) is the following:

SELECT
OBJECT_NAME(object_id) AS TableName,
SUM(deleted_rows) * 1.0 / SUM(total_rows) AS PercentFrag,
SUM(deleted_rows) as delrows,
SUM(total_rows) as totalrows
FROM sys.column_store_row_groups
GROUP BY OBJECT_NAME(object_id)

I noted our fact tables in production, which undergo fairly regular and heavy DML, are approaching 50% deleted records! Our DML performance had been flagging somewhat in the past week and this gives us something to quantify it with. The ALTER INDEX ... REBUILD command is our only hope here, there exists a version of ALTER INDEX ... REORGANIZE but it doesn't actually remove fragmentation. Note that the REBUILD is not an online operation, so you will need to construct a maintenance window of some sort, which should be feasible in a data warehousing environment, the only environment where these types of indexes warrant consideration, anyway. For more reading: Using Clustered Columnstore Indexes

Lastly, while in general I have been frustrated in my attempts to "crack the nut" of clustered columnstore indexes and peer inside, I discovered this morning that SQL Server is happy to let me run sys.dm_db_database_page_allocations, an undocumented DMF that replaced DBCC IND in 2012. This allows you to list out all pages in the columnstore index, which then (I assume, I've yet to try this morning) can be cracked open with DBCC PAGE. Happy index dissecting!