04 June 2013

"You keep writing that query; I do not think it returns what you think it returns..."

So one of our BAs had a 4 hour long "insert into" query running today that completely pegged the CPUs on their box. The simplified version of the SELECT query involved:

SELECT *
FROM theFirstTable
WHERE value1 IN
( SELECT DISTINCT value1
FROM reallyTerrificallyLargeTable
WHERE value1 NOT IN
( SELECT value1 FROM anotherReallyBigTable ) )


So basically, the apparent logic is they want all the records for the first table, when value1 (an ID field) exists in this other spectacularly large table, referenced by subquery, but doesn't exist in another table (referenced with a further nested subquery).

However, the query doesn't work that way. I first got an inkling of this when I realized (while trying to rewrite the script for performance optimization) that there was no value1 column on anotherReallyBigTable (it was something like [Value One], we'll say). But amazingly to me at the time, it doesn't throw a syntax error saying invalid column!

Then we sorted out what was going on. Because the nested subquery is capable of referencing its calling query, the value1 from the last subquery is actually just the value1 from the first subquery. If the query author had named the tables (ie., "anotherReallyBigTable as a" and then referencing a.value1) the logic would make sense and the values returned would be accurate. As it is, it looks suspiciously like getting a set of values with the condition that that set of values does not exist in that same set of values is very likely to give you a result set of GOOSE-EGG. Probably not what the author had in mind, unless they just enjoy letting the CPUs getting a bit of exercise...