Semi Joins, anti-joins and Nulls in Sql Server


Sql Joins are table operators(binary operations in Relational Algebra) used to combine columns from one or more tables. The expression(predicate) that define the columns which are used to join the tables is called Join Predicate. The result of a join is a set (relational database implementation of a set).
ANSI standard recognises five types of joins: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join and Cross Join.
Joins are typically used to retrieve data from the normalised tables in a relation, e.g. one-to-many, zero-one-to-many, etc.,usually with an equality predicate between  primary and  foreign key columns.
One of the most complex tasks for the Query Optimiser is “join ordering” i.e. finding the optimal join sequence when constructing the execution plan(a query requesting data from n tables requires n-1 joins)

Semi join is one of a few operators in relational algebra that does not have representation in Tsql language. Some of the  “missing” operators are:

  • Semi join
  • Anti-join (anti semi join)
  • Natural join
  • Division

Semi join is a type of join whose result-set contains only the columns from one of the “semi-joined” tables. Each row from the first table(left table  if Left Semi Join) will be returned maximum once, if matched in the second table. The duplicate rows from the first table will be returned, if matched once in the second table. A distinct row from the first table will be returned no matter how many times matched in a second table.
Below is the pseudo-code representation of the above statement.

SemiJoinsPCode

Anti semi join will do the exact opposite. The join will select any rows from the first table that do not have at least one matching row in the second table.

Sql Server engine has three physical(showplan) operators that can be used to perform semi join/anti semi join logical operations, when recognised by the  Query Optimiser.

The table below maps the physical operators and the semi join algorithms that they support.
PhysicalOpSemiJoins
*The full list of the Sql Server showplan operators can be found here.

There are a number of scenarios when Query Optimiser decides to implement a semi join algorithm to optimise query request. Typically, the logical operations that represents semi joins are: IN, NOT IN, EXISTS, NOT EXISTS. The EXCEPT and INTERCEPT set operators may use the same physical, Semi Join operators to perform different logical operations.

The following examples illustrates a few of these scenarios.

Set up the test environment:

Tab1AndTab2
Left Semi Join

The Left Semi Join operator returns each row from the first (top table in the execution plan) input where there is at least one matching row in the second(bottom) input.

LeftSemiJoin
Left Anti Semi Join

The Left Anti Semi Join operator returns each row from the first (top) input when there are no matching rows in the second (bottom) input.

LeftAntiSemiJoin

Both of the queries use the same physical operator – Loop Join( Left Anti Semi Join) to perform different logical operations. The second query performs a logical Left Anti Semi Join whereas the first query performs an operation based on the Difference of Sets  (Set A – SetB) operation.
Set operators EXCEPT, INTERSECT, UNION treats NULL values as equal(non distinct) whereas operator EXISTS evaluates NULL=NULL as FALSE(even if the 3VL result is UNKNOWN). More about the behavior here.
Also, it is worth to mention that all set operators (except  the multi-set operator UNION ALL) remove duplicates. 

Right Semi Join

The Right Semi Join operator returns each row from the second (bottom) input when there is a matching row in the first (top) input.

RightSemiJoin

NOTE: The INTERSECT set operator treats NULL values as equal and therefore a NULL value appears in the result set(as being one of the  common values in tab1 and tab2) .

Right Anti Semi Join

The Right Anti Semi Join operator outputs each row from the second (bottom) input when a matching row in the first (top) input does not exist.

RightAntiSemiJoin

Anti semi Joins, NOT IN and NULLs

Again, things gets “special” when it comes to work with NULLs.
If we execute the Left Anti Semi Join query again, but this time using NOT IN instead of EXISTS, we will get the empty result-set. This is one of the common logical errors that can cause unexpected results.

The image below shows the query execution plans, predicate property of the significant physical operators and the result sets.

AntiJoinNot_IN_NULL1

Just as a reminder, the Nested Loop algorithm compares each row from the OUTER table (tab1 in the example) to each row  from the INNER table (tab2). Only the rows that satisfy the join predicate will be returned.

The query result sets  should contain all rows from the first (top) input when there are no matching rows in the second (bottom) input. The NULLs are treated as distinct. The results should be the same but there are not. Why?

The answer lies in the way the operator NOT IN was implemented.

Query1 (NOT IN)
From the Nested Loop’s Predicate property we can see that the operator uses a sequence of OR logical expressions to implement the request.

t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.ProductName = t2.Name

To make the tab1 table rows qualify for the output, the results of all the expressions in the predicate must evaluate to FALSE.

The expression t2.Name IS NULL will always evaluate to TRUE for every iteration, resulting in the empty result-set. This is because of a NULL value in the tab2.Name column.

This is important to know to be able to avoid possible logical errors. One way to prevent the behavior is to set NOT NULL column property on the tab2.Name column. The other way is to use  ISNULL() function in the query to prevent NULL expressions.

Query2 (“Correlated” NOT IN)
The query use the same execution plan as NOT IN, except now the Nested Loop’s Predicate property adds an extra AND expression to the sequence of OR expressions.

t2.Name = t1.ProductName
AND t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.Name = t1.ProductName

The same logic applies as with the first query. The predicate(as a list of the logical expressions) must evaluate to FALSE for the tab1 rows  to qualify for the output. As before, at least one of the expressions on the left side of the AND operator will always evaluate to TRUE (t2.Name IS NULL is TRUE for every iteration). However, the additional  expression (the one on the right side of the AND operator) may evaluate to FALSE making the whole predicate FALSE and allowing the qualified rows to be returned.

The query returns the correct result.

Query3 (“Hard-coded” NOT IN)
From the lack of the Constant Scan Predicate property in the execution plan, we can conclude that the Query Optimiser has decided to return an empty result set knowing that at least one of the “hard coded, NOT IN values” is NULL.
To be able to get the predicate property and analyse its logic, we can run the same query, but this time without the NULL value in the NOT IN list  i. e.

The Constant Scan operator has the Predicate property now.

LeftAntiJoinNOTIN_HC

The predicate is constructed of n not-equal expressions (n is a number of distinct values in the NOT IN list, in this case 4) and n-1 AND operators.

t1.ProductName <>’EPROM’
AND t1.ProductName <>’Flash Memory’
AND t1.ProductName <>’Hard disk’
AND 1.ProductName <>’SSD disk’
— AND 1.ProductName <>NULL –if we had NULL here, the predicate would evaluate to UNKNOWN  for every row resulting in an empty result set.

There are a couple of interesting points here. The first one is the result-set.

The query excludes t1.ProductName  = NULL from the result set. The reason for excluding the NULL  is because of the way “hard coded NOT IN” was implemented.
The NULL values from the left table will be excluded, whereas the NULL value(s) in the NOT IN list will cause an empty result set.
In this example, the algorithm excludes tab1.ProductName NULL value by evaluating predicate as UNKNOWN i.e

NULL <>’EPROM’ (1)
AND NULL <>’Flash Memory’   (2)
AND NULL<>’Hard disk’  (3)
AND NULL<>’SSD disk’  (4)

NULL (1) AND NULL (2) AND NULL (3) AND NULL (4)  is  NULL

It is good to know the differences in the results to prevent possible logical errors.

The other interesting thing is the number of expressions in the predicate that directly depend on the number of hard coded literals. The question is: Is there a maximum number of literals that can be processed? The number is related to a maximum length of a string containing Sql Statements(batch size). The Maximum Capacity Specifications for SQL Server can be found here. The another reason may be running out of stack size(stack overflow) causing the errors 8623 and/or 8632.

Query4(NOT EXISTS) 
The NOT EXIST operator works straightforward here. All rows from tabt1 that does not match the rows from the tab2 (where the correlated predicate t1.ProductName = t2.Name evaluates to FALSE)  will be returned. NULLs are treated as distinct.
The query returns the correct result.
Conclusion

Semi join is a join that has no representation in tsql. The result of the operation contains only columns from one of the tables. All returned rows from the first table must be matched at least once in the second table. The rows from the first table will be returned only once even if the second table contains more than one match. Semi joins are usually implemented using IN or EXISTS operators. Any of the three physical operators that SQL Server uses to perform joins can be used to implement Semi Joins. The set operators EXCEPT and INTERSECT may use the same Semi Join physical operators to perform set operations.
Logical Anti semi join operation may be implemented using NOT IN or NOT EXISTS operators. The NOT IN operator may cause the unexpected results when it operates on the tables that contain NULL values in the joined columns.

Thanks for reading.

Dean Mincic

 

 

 

Leave a Reply

Your email address will not be published.