Tag Archives: semi-joins

Semi Joins, anti-joins and Nulls in Sql Server


Summary

SQL Joins are table operators(binary operations in Relational Algebra) used to combine columns from one or more tables. The expression(predicate) that defines the columns which are used to join the tables is called Join Predicate. The result of a join is a set (relational database implementation of a set).
ANSI standard recognizes five types of joins: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, and Cross Join.
Joins are typically used to retrieve data from the normalized tables in a relation, e.g. one-to-many, zero-one-to-many, etc., usually with an equality predicate between primary and foreign key columns.
One of the most complex tasks for the Query Optimiser is “join ordering” i.e. finding the optimal join sequence when constructing the execution plan(a query requesting data from n tables requires n-1 joins)

Semi-join is one of a few operators in relational algebra that does not have representation in Tsql language. Some of the  “missing” operators are:

  • Semi join
  • Anti-join (anti-semi-join)
  • Natural join
  • Division

Semi-join is a type of join whose result set contains only the columns from one of the “semi-joined” tables. Each row from the first table(left table if Left Semi Join) will be returned a maximum of once if matched in the second table. The duplicate rows from the first table will be returned if matched once in the second table. A distinct row from the first table will be returned no matter how many times matched in a second table.
Below is the pseudo-code representation of the above statement.

SemiJoinsPCode

Anti-semi-join will do the exact opposite. The join will select any rows from the first table that do not have at least one matching row in the second table.

SQL Server engine has three physical(showplan) operators that can be used to perform semi-join/anti-semi-join logical operations when recognized by the  Query Optimiser.

The table below maps the physical operators and the semi-join algorithms that they support.
PhysicalOpSemiJoins
*The full list of the SQL Server showplan operators can be found here.

There are a number of scenarios when Query Optimiser decides to implement a semi-join algorithm to optimize query requests. Typically, the logical operations that represent semi-joins are: IN, NOT IN, EXISTS, NOT EXISTS. The EXCEPT and INTERSECT set operators may use the same physical, Semi Join operators to perform different logical operations.

The following examples illustrate a few of these scenarios.

Set up the test environment:

CREATE DATABASE testSemiJoins
GO
USE testSemiJoins
GO

IF EXISTS (SELECT 1 FROM sys.tables t WHERE name = 'tab1')
    DROP TABLE dbo.tab1;
GO
CREATE TABLE tab1  (ProductTypeId INT,ProductName VARCHAR(100))
GO

IF EXISTS (SELECT 1 FROM sys.tables t WHERE name = 'tab2')
    DROP TABLE dbo.tab2;
GO
CREATE TABLE tab2  (ProductId INT,Name VARCHAR(100),StorageSizeGB SMALLINT)
GO
--insert some rows
INSERT INTO tab1
    SELECT tab.id,tab.name
    FROM (VALUES (1,'Hard disk')
                ,(1,'Hard disk')
                ,(2,'SSD disk')
                ,(3,'Flash Memory')
                ,(4,'EPROM')
                ,(5,NULL)
                ,(6,'Floppy Disk')
                ,(7,'RAM Memory')
                ,(7,'RAM Memory')
                ) tab(id,name)
GO
INSERT INTO tab2
    SELECT tab.id,tab.name,tab.sizeGb
    FROM (VALUES (1,'Hard disk',250)
                ,(1,'SSD disk',120)
                ,(2,'SSD disk',240)
                ,(3,'Flash Memory',8)
                ,(4,NULL,NULL)
                ,(5,'EPROM',0)
                ) tab(id,name,sizeGb)
GO
Tab1AndTab2

Left Semi Join

The Left Semi Join operator returns each row from the first (top table in the execution plan) input where there is at least one matching row in the second(bottom) input.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName IN (SELECT t2.Name FROM dbo.tab2 t2)
--or using a correlated query (same execution plan)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName IN (SELECT t2.Name FROM dbo.tab2 t2 where t2.Name = t1.ProductName)
LeftSemiJoin

Left Anti Semi Join

The Left Anti Semi Join operator returns each row from the first (top) input when there are no matching rows in the second (bottom) input.

SELECT ProductName FROM tab1
EXCEPT
SELECT Name FROM tab2
--or
SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
--WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2)

LeftAntiSemiJoin

Both of the queries use the same physical operator – Loop Join( Left Anti Semi Join) to perform different logical operations. The second query performs a logical Left Anti Semi Join whereas the first query performs an operation based on the Difference of Sets  (Set A – SetB) operation.
Set operators EXCEPT, INTERSECT, and UNION treat NULL values as equal(non-distinct) whereas operator EXISTS evaluates NULL=NULL as FALSE(even if the 3VL result is UNKNOWN). More about the behavior here.
Also, it is worth mentioning that all set operators (except the multi-set operator UNION ALL) remove duplicates. 

Right Semi Join

The Right Semi Join operator returns each row from the second (bottom) input when there is a matching row in the first (top) input.

SELECT * 
FROM tab1 t1
WHERE  EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
OPTION(HASH JOIN)
--or based on the Intersection Of Sets operation
SELECT ProductName FROM tab1
INTERSECT
SELECT Name FROM tab2 
OPTION(HASH JOIN)
-- When building the hash table, the hash match join chooses the table with fewer rows. 
-- In this example, that is the tab2 table.

RightSemiJoin

NOTE: The INTERSECT set operator treats NULL values as equal and therefore a NULL value appears in the result set(as being one of the  common values in tab1 and tab2) .

Right Anti Semi Join

The Right Anti Semi Join operator outputs each row from the second (bottom) input when a matching row in the first (top) input does not exist.

SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
OPTION(MERGE JOIN)

RightAntiSemiJoin

Anti semi Joins, NOT IN, and NULLs

Again, things get “special” when it comes to working with NULLs.
If we execute the Left Anti Semi Join query again, but this time using NOT IN instead of EXISTS, we will get the empty result set. This is one of the common logical errors that can cause unexpected results.

--NOT IN (does not work)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2)

--correlated NOT IN (works ok)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2 WHERE t2.Name = t1.ProductName)

--hardcoded NOT IN (does not work)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN ('Hard disk','SSD disk','SSD disk','Flash Memory',NULL,'EPROM')

--NOT EXISTS (works ok)
SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)

The image below shows the query execution plans, the predicate property of the significant physical operators, and the result sets.

AntiJoinNot_IN_NULL1

Just as a reminder, the Nested Loop algorithm compares each row from the OUTER table (tab1 in the example) to each row from the INNER table (tab2). Only the rows that satisfy the join predicate will be returned.

The query result sets should contain all rows from the first (top) input when there are no matching rows in the second (bottom) input. The NULLs are treated as distinct. The results should be the same but there are not. Why?

The answer lies in the way the operator NOT IN was implemented.

Query1 (NOT IN)
From the Nested Loop’s Predicate property we can see that the operator uses a sequence of OR logical expressions to implement the request.

t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.ProductName = t2.Name

To make the tab1 table rows qualify for the output, the results of all the expressions in the predicate must evaluate to FALSE.

The expression t2.Name IS NULL will always evaluate to TRUE for every iteration, resulting in the empty result-set. This is because of a NULL value in the tab2.Name column.

This is important to know to be able to avoid possible logical errors. One way to prevent the behavior is to set NOT NULL column property on the tab2.Name column. The other way is to use  ISNULL() function in the query to prevent NULL expressions.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT ISNULL(t2.Name,'') FROM dbo.tab2 t2)

--first replace all NULLs with known values, e.g '' and then add NOT NULL column property or a CHECK constraint
UPDATE tab2
    SET Name = ''
WHERE Name IS NULL
GO

ALTER TABLE tab2
  ALTER COLUMN Name VARCHAR(100) NOT NULL
GO
--or
ALTER TABLE tab2
    WITH CHECK 
    ADD CONSTRAINT CK_NameNotNull CHECK(Name IS NOT NULL)
GO

Query2 (“Correlated” NOT IN)
The query uses the same execution plan as NOT IN, except now the Nested Loop’s Predicate property adds an extra AND expression to the sequence of OR expressions.

t2.Name = t1.ProductName
AND t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.Name = t1.ProductName

The same logic applies as with the first query. The predicate(as a list of the logical expressions) must evaluate to FALSE for the tab1 rows to qualify for the output. As before, at least one of the expressions on the left side of the AND operator will always evaluate to TRUE (t2.Name IS NULL is TRUE for every iteration). However, the additional expression (the one on the right side of the AND operator) may evaluate to FALSE making the whole predicate FALSE and allowing the qualified rows to be returned.

The query returns the correct result.

Query3 (“Hard-coded” NOT IN)
From the lack of the Constant Scan Predicate property in the execution plan, we can conclude that the Query Optimiser has decided to return an empty result set knowing that at least one of the “hard-coded, NOT IN values” is NULL.
To get the predicate property and analyze its logic, we can run the same query, but this time without the NULL value in the NOT IN list  i. e.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN ('Hard disk','SSD disk','SSD disk','Flash Memory','EPROM')

The Constant Scan operator has the Predicate property now.

LeftAntiJoinNOTIN_HC

The predicate is constructed of n not-equal expressions (n is a number of distinct values in the NOT IN list, in this case 4) and n-1 AND operators.

t1.ProductName <>’EPROM’
AND t1.ProductName <>’Flash Memory’
AND t1.ProductName <>’Hard disk’
AND 1.ProductName <>’SSD disk’
— AND 1.ProductName <>NULL –if we had NULL here, the predicate would evaluate to UNKNOWN  for every row resulting in an empty result set.

There are a couple of interesting points here. The first one is the result set.

The query excludes t1.ProductName  = NULL from the result set. The reason for excluding the NULL  is because of the way “hard coded NOT IN” was implemented.
The NULL values from the left table will be excluded, whereas the NULL value(s) in the NOT IN list will cause an empty result set.
In this example, the algorithm excludes tab1.ProductName NULL value by evaluating predicate as UNKNOWN i.e

NULL <>’EPROM’ (1)
AND NULL <>’Flash Memory’   (2)
AND NULL<>’Hard disk’  (3)
AND NULL<>’SSD disk’  (4)

NULL (1) AND NULL (2) AND NULL (3) AND NULL (4)  is  NULL

It is good to know the differences in the results to prevent possible logical errors.

The other interesting thing is the number of expressions in the predicate that directly depend on the number of hard-coded literals. The question is: Is there a maximum number of literals that can be processed? The number is related to a maximum string length containing SQL Statements(batch size). The Maximum Capacity Specifications for SQL Server can be found here. Another reason may be running out of stack size(stack overflow) causing the errors 8623 and/or 8632

Query4(NOT EXISTS) 
The NOT EXIST operator works straightforwardly here. All rows from tabt1 that do not match the rows from tab2 (where the correlated predicate t1.ProductName = t2.Name evaluates to FALSE)  will be returned. NULLs are treated as distinct.
The query returns the correct result.

Conclusion

Semi-join is a join that has no representation in tsql. The result of the operation contains only columns from one of the tables. All returned rows from the first table must be matched at least once in the second table. The rows from the first table will be returned only once even if the second table contains more than one match. Semi-joins are usually implemented using IN or EXISTS operators. Any of the three physical operators that SQL Server uses to perform joins can be used to implement Semi Joins. The set operators EXCEPT and INTERSECT may use the same Semi Join physical operators to perform set operations.
Logical Anti semi join operation may be implemented using NOT IN or NOT EXISTS operators. The NOT IN operator may cause unexpected results when it operates on the tables that contain NULL values in the joined columns.

Thanks for reading.

Dean Mincic