Archive for September, 2013

The effect of predicates in outer joins

Inner joins in SQL are well understood. You relate two tables on a given set of columns, and where matching values appear in both tables the rows from both tables are joined. Which columns you choose to return is up to you. The tables are of equal precedence in the join.

Outer joins are different in that one table is said to be dominant and every row from that table is returned from the join process. The significance of this rule is the subject of this posting. Where the other, subordinate, table has matching values on the joined columns then the rows are returned as they would be in an inner join. Where no matching values are found in the subordinate table, the rows in the dominant table are returned and the subordinate table column values are returned as null.

Now, I imagine most people familiar with SQL will be broadly familiar with all of that. To illustrate the point of this posting, I need a couple of simple tables, each with one row.

create table tb1
( colint int
, coldate date
);

create table tb2
( colchar char(1)
, colint int
);
insert into tb1 values (1,'01/01/2013');
insert into tb2 values ('a',1);

The above syntax will work for most database engines; I’ve tested it with SQL Server 2012 and DB2 V9.7.

The data types are arbitrary. To keep the example simple I’ve not added any referential constraints or indexes. Feel free to add constraints if you wish; it makes no difference to the results.

Let’s start with a simple inner join:

select *
from tb1
inner join tb2
on tb1.colint = tb2.colint

As you would expect, the results are:

colint  coldate     colchar  colint
1       2013-01-01  a        1

So far so good, so now let’s try as an outer join with tb1 as the dominant table:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint

As we would expect, the results look the same as those of the inner join.

colint  coldate     colchar  colint
1       2013-01-01  a        1

Now let’s apply a predicate in a WHERE clause:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint
where tb1.coldate is null

This time we get no rows returned. This is important. The outer join returned a row from tb1 as it must to satisfy the rule that in an outer join all the rows from the dominant table are returned, but the WHERE clause is applied after the join has completed. Thus the one row that was returned by the join is filtered out by the WHERE clause.
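The behaviour is easy to reproduce with any engine; sqlite3 (built into Python) is not one of the engines the post tests, but it implements the same standard semantics. A minimal sketch of the WHERE-clause case:

```python
import sqlite3

# recreate the post's two one-row tables in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("create table tb1 (colint int, coldate date)")
con.execute("create table tb2 (colchar char(1), colint int)")
con.execute("insert into tb1 values (1, '2013-01-01')")
con.execute("insert into tb2 values ('a', 1)")

# the WHERE clause runs after the outer join has completed,
# so the one joined row is filtered out afterwards
rows = con.execute(
    "select * from tb1 left join tb2 on tb1.colint = tb2.colint "
    "where tb1.coldate is null"
).fetchall()
print(rows)  # -> [] : no rows survive the post-join filter
```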

Now for the more interesting case, where we apply the predicate in the join clause:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint
and tb1.coldate is null

The result this time is:

colint  coldate     colchar  colint
1       2013-01-01  NULL     NULL

So what’s happening here? On the face of it, we have created a row that doesn’t exist and one that contradicts the predicate.

The database engine has performed the outer join, and it has filtered out the one row that was returned. Crucially, that row was filtered out as part of the join process, because that is where the predicate was applied: as part of the join.

So where has the row come from? It comes from the definition of an outer join: all rows from the dominant table must be returned. To avoid the contradiction that the join, and the predicate applied within it, filtered out our one row, the row from the dominant table is returned without reference to the join condition. The subordinate table columns are set to null because this part of the processing has nothing to do with that table.

Of course, what looks even stranger is that the query applies a predicate saying tb1.coldate is null, yet we have a non-null value for tb1.coldate in our result set.
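The same sqlite3 sketch confirms the ON-clause case: the predicate fails as part of the join, so the dominant row comes back with the subordinate columns nulled out.

```python
import sqlite3

# same two one-row tables as in the post, in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("create table tb1 (colint int, coldate date)")
con.execute("create table tb2 (colchar char(1), colint int)")
con.execute("insert into tb1 values (1, '2013-01-01')")
con.execute("insert into tb2 values ('a', 1)")

# predicate inside the ON clause: evaluated as part of the join itself,
# so no tb2 row matches, but the tb1 row must still be returned
rows = con.execute(
    "select * from tb1 left join tb2 "
    "on tb1.colint = tb2.colint and tb1.coldate is null"
).fetchall()
print(rows)  # -> [(1, '2013-01-01', None, None)] : dominant row kept, tb2 columns null
```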

Both queries are logically and syntactically valid. Both have valid occasions for use, but it’s a good idea to know which you want and what the differences are.


Referentially valid table listing from SQL Server

A regular task when working with test and development databases is loading test data into tables where referential constraints are present. Whilst you can often drop or suspend the constraints whilst populating tables, it’s often easier to work out a referentially valid order so that parent tables are populated before their children. This usually only applies to relatively small volumes of data, because larger volumes are generally loaded via utilities which bypass the SQL processing engine and so ignore the referential constraints. In that case the constraints usually need to be rechecked to reactivate them before the tables can be used. Such utilities also usually avoid logging the rows added, so a backup is normally required as well.

For small volumes of data, populating tables with regular insert statements is generally easier, but it does require tables to be populated in a referentially valid order. This is something I do just about every day, copying fairly small amounts of test data around for the development team. Years ago I wrote, for DB2, a Perl script that extracts from the catalog a list of the tables in a given schema in a referentially valid order. What I didn’t have was an equivalent for SQL Server, and a recent job needed such a list. The data model only consisted of about 120 tables, but that was enough to prompt me to write the following function.

create function riorder(@schema nvarchar(128))
returns @riorderTabs table
( RILevel   smallint not null
, tablename nvarchar(128) not null
)
as
begin
-- This function returns a table of RILevel and tablename for the specified schema.
-- RILevel starts at 0 for tables with no RI parents, and increases by 1 for each
-- 'generation' of tables.
-- It is possible to have "referential loops", i.e. tabA is a parent of tabB, tabB
-- is a parent of tabC, tabC is a parent of tabA. This is unresolvable.
DECLARE @schemaTabs table (tbname nvarchar(128));
DECLARE @level smallint;
DECLARE @tabsAtCurrentLevel table (tbname nvarchar(128));

-- start by getting a list of all the tables in the schema
insert into @schemaTabs
select TABLE_NAME
from information_schema.tables
where TABLE_SCHEMA = @schema
and TABLE_TYPE = 'BASE TABLE';

-- @level tracks the current RI level, we start at 0 for tables with no RI parent
set @level = 0;

-- We iterate through the levels until either
-- there are no tables left in @schemaTabs
-- or we find no tables at the current level (an
-- unresolvable referential loop)

declare @candidateTab as nvarchar(128);
declare candidateTabs_cursor cursor local forward_only read_only for
select tbname from @schemaTabs;

declare @parentTablesFound as smallint;
declare @tablesAtCurrentLevel smallint;

-- set this to a sensible value so we drop into the main loop
select @tablesAtCurrentLevel = count(*) from @schemaTabs;
while @tablesAtCurrentLevel > 0
begin
   set @tablesAtCurrentLevel = 0;

   -- now we iterate over each table in @schemaTabs
   -- and find which have no RI parent table in the @schemaTabs
   -- list of tables
   open candidateTabs_cursor;
   fetch next from candidateTabs_cursor
   into @candidateTab;
   while @@FETCH_STATUS = 0
   begin
      -- We are only interested in whether @candidateTab has parent tables
      -- in @schemaTabs, not what they are
      select @parentTablesFound = count(*)
      from information_schema.referential_constraints constr
           -- get the parent table
           inner join information_schema.table_constraints prnt
              on  constr.UNIQUE_CONSTRAINT_CATALOG = prnt.CONSTRAINT_CATALOG
              and constr.UNIQUE_CONSTRAINT_SCHEMA  = prnt.CONSTRAINT_SCHEMA
              and constr.UNIQUE_CONSTRAINT_NAME    = prnt.CONSTRAINT_NAME
           -- get the child table
           inner join information_schema.table_constraints chld
              on  constr.CONSTRAINT_CATALOG = chld.CONSTRAINT_CATALOG
              and constr.CONSTRAINT_SCHEMA  = chld.CONSTRAINT_SCHEMA
              and constr.CONSTRAINT_NAME    = chld.CONSTRAINT_NAME
           inner join @schemaTabs scTabs
              on scTabs.tbname = prnt.TABLE_NAME
      where chld.TABLE_NAME = @candidateTab
      and chld.CONSTRAINT_SCHEMA = @schema;

      if @parentTablesFound = 0
      begin
         -- no parent tables in @schemaTabs so we note this table name.
         -- We don't remove it from @schemaTabs yet because if we found
         -- RI children of this table later in this loop we would think
         -- these were at the current RI level
         insert into @tabsAtCurrentLevel
         values (@candidateTab);
         set @tablesAtCurrentLevel = @tablesAtCurrentLevel + 1;
      end;
      -- fetch next table
      fetch next from candidateTabs_cursor
      into @candidateTab;
   end;
   close candidateTabs_cursor;
   -- Update the output list of tables for the current level
   insert into @riorderTabs select @level, tbname from @tabsAtCurrentLevel;
   -- Now we can remove the tables we just found
   -- from @schemaTabs
   delete from @schemaTabs where tbname in (select tbname from @tabsAtCurrentLevel);
   -- and clear down @tabsAtCurrentLevel
   delete from @tabsAtCurrentLevel;
   set @level = @level + 1;
end;
deallocate candidateTabs_cursor;
return;
end

Using the function is simply a matter of passing it the schema of the tables you are interested in:

select * from riorder('<schema>') order by RILevel, tablename

Because the first column is the ‘level’, you can get a reversed order (which is a valid order for deleting data) by using the DESC keyword in the ORDER BY clause.

select * from riorder('<schema>') order by RILevel desc, tablename

This script can fail to provide a complete answer if your data model contains a “referential loop”, i.e. table_A is a parent of table_B, table_B is a parent of table_C and table_C is a parent of table_A. This can happen perfectly legitimately but it’s generally a source of pain for your hardworking DBA trying to set up test data. If the above script finds this situation it recognises the loop and stops trying to go any further.
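The level-assignment logic, including the loop detection, is independent of T-SQL. A hypothetical Python sketch of the same algorithm (the `ri_order` function and its input shape are mine, not part of the original script): each table maps to the set of its RI parents, and a table is placed at the current level once none of its parents remain unplaced.

```python
def ri_order(parents):
    """parents: dict mapping each table name to the set of tables it references.
    Returns a dict of table -> RI level; tables caught in a referential loop
    are simply absent from the result, mirroring the T-SQL function."""
    remaining = set(parents)
    levels, level = {}, 0
    while remaining:
        # tables whose every parent has already been placed (or is outside the set)
        current = {t for t in remaining if not (parents[t] & remaining)}
        if not current:
            # referential loop: no table can be placed, so stop trying
            break
        for t in sorted(current):
            levels[t] = level
        remaining -= current
        level += 1
    return levels

# a simple parent -> child -> grandchild chain
demo = {"tab_a": set(), "tab_b": {"tab_a"}, "tab_c": {"tab_b"}}
print(ri_order(demo))  # -> {'tab_a': 0, 'tab_b': 1, 'tab_c': 2}
```

Note that a self-referencing table never qualifies as parent-free here, so it too is reported as unresolvable, just as in the SQL version.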