13 November 2018

I Discovered a Bug...That Started the Product Team Crying

It's a pretty rare thing for we footsoldier DBAs to discover a true bug in Microsoft's code. This one, however, stunned me, a bit, if only because it was located not in an obscure corner of the SQL Server product, such as the bug we found several years ago in Change Data Capture, but in the basic Query Optimizer, and it resulted in false truncation errors, a logical flaw. It is basic and quite reproducible.

There are several things about this that shocked me:
  1. I encountered this bug over several months in two completely separate pieces of code, unrelated to each other at all.
  2. The bug is reproducible in SQL Server 2012, 2014, 2016, and 2017 (v11-v14).
  3. In light of the above, no one had noticed or reported this! Apparently Microsoft was unaware of the bug, and despite me running into it twice in two completely different contexts, no one else had? Just seems odd and not what I would expect given the extremely wide distribution of the product.
So what is it? Basically, if you are inserting into a table from a select statement that joins two or more tables, the join order in the execution plan can have an unintended logical result on your query. Say one table contains some values that would cause truncation errors if inserted...but the other table is inner joined, which would restrict the values on the first table to only those which are of valid length...you'll still get truncation errors depending on the join order in the execution plan, assumedly due to validation happening at a point prior to the join.

Let's reproduce! No, *ahem*, that is not intended as a strikingly unromantic pickup line, what I meant was, let's reproduce the problem in code, first the setup, of two input tables and an output table. Note the input table A has two values, both of which map to the shorter length values of input table B (col1 = 1,2):

USE tempdb
CREATE TABLE InputTblA (col1 INT, col2 CHAR(10));
CREATE TABLE InputTblB (col1 INT, col2 CHAR(10));
CREATE TABLE OutputTbl (col1 INT, col2 CHAR(5));

INSERT INTO InputTblA VALUES (1,'value'),(2,'value');
INSERT INTO InputTblB VALUES
(1,'short'),(1,'short'),(1,'short'),(1,'short'),
(2,'short'),(2,'short'),(2,'short'),(2,'short'),
(3,'longer'),(3,'longer'),(3,'longer'),(3,'longer');

So, now that we have those tables set up, let's do our insert.

INSERT INTO OutputTbl
SELECT a.col1, b.col2
FROM InputTblA a
INNER JOIN InputTblB b ON a.col1 = b.col1

We get, in this case, the expected and valid "(8 row(s) affected)". Looking at the execution plan, we see that it starts with InputTblA and then hash joins to InputTblB.



Normally SQL Server does this right, it's just when the table order gets just so that the bug shows up...so we must force it, using FORCE ORDER.

INSERT INTO OutputTbl
SELECT a.col1, b.col2
FROM InputTblB b
INNER JOIN InputTblA a ON a.col1 = b.col1
OPTION (FORCE ORDER)

Msg 8152, Level 16, State 14, Line 26
String or binary data would be truncated.
The statement has been terminated.

And here's the plan...we can see it scanned InputTblB and ran a compute scalar before joining to TblA which had the constraining data:



That's all fairly easy to duplicate and I went a step further, utilizing the latest feature of SQL 2019 backported to SQL 2017 CU12, the advanced truncation error message 2628 that actually calls out truncated columns and values. Once patched to CU12 I repeated my test with trace flag 460 enabled, and the error message was more descriptive:

Msg 2628, Level 16, State 1, Line 26
String or binary data would be truncated in table 'tempdb.dbo.OutputTbl', column 'col2'. Truncated value: 'longe'.
The statement has been terminated.

It lists the truncated value that we know by definition is impossible, with that query and data, to be in the result set.

Having successfully hunted a Microsoft SQL Server Query Processor bug, I was pleased to notch my tag and cape the hide for my taxidermist, who will mount it for me for my living room wall.  The product team at Microsoft has acknowledged to me (via the Premier Support rep I've been dealing with) that this is in fact a defect and they will be fixing it in a subsequent release (hopefully backported via CU or SP).

No comments: