Archive for the ‘SQL Server’ Category

The effect of predicates in outer joins

Inner joins in SQL are well understood. You relate two tables on a given set of columns and where you get matching values on both tables the rows from both tables are joined. What columns you choose to return are up to you. The tables are of equal precedence in the join.

Outer joins are different in that one table is said to be dominant and every row from that table is returned from the join process. The significance of this rule is the subject of this posting. Where the other, subordinate, table has matching values on the joined columns then the rows are returned as they would be in an inner join. Where no matching values are found in the subordinate table, the rows in the dominant table are returned and the subordinate table column values are returned as null.

Now, I imagine most people familiar with SQL will be broadly familiar with all of that. To illustrate the point of this posting, I need a couple of simple tables, each with one row.

create table tb1
( colint int
, coldate date

create table tb2
( colchar char(1)
, colint int

insert into tb1 values (1,'01/01/2013');
insert into tb2 values ('a',1);

The above syntax will work for most database engines, I’ve tested with SQL Server 2012 and DB2 V9.7.

The data types are arbitrary. I’ve not added any referential constraints or indexes to keep the example simple. Feel free to add constraints if you wish, it makes no difference to the results.

Let’s start with a simple inner join:

select *
from tb1
inner join tb2
on tb1.colint = tb2.colint

As you would expect, the results are:

colint coldate colchar colint
1 2013-01-01 a 1

So far so good, so now let’s try as an outer join with tb1 as the dominant table:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint

As we would expect, the results looks like the same as the inner join.

colint coldate colchar colint
1 2013-01-01 a 1

Now let’s apply a predicate in a WHERE clause:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint
where tb1.coldate is null

This time we get no rows returned. This is important. The outer join returned a row from tb1 as it must to satisfy the rule that in an outer join all the rows from the dominant table are returned, but the WHERE clause is applied after the join has completed. Thus the one row that was returned by the join is filtered out by the WHERE clause.

Now for the more interesting case, where we apply the predicate in the join clause:

select *
from tb1
left join tb2
on tb1.colint = tb2.colint
and tb1.coldate is null

The result this time is:

colint coldate colchar colint
1 2013-01-01 NULL NULL

So what’s happening here? On the face of it we have created a row that doesn’t exist and one that contradicts the predicate.

The database engine has performed the outer join and it has filtered out the one row that was returned and crucially that row was filtered out as part of the join process because that’s where the predicate was applied – as part of the join.

So where has the row come from? It’s come from the definition of an outer join, that the all rows from the dominant table are returned. The join process must return every row from the dominant table. To avoid the contradiction that the join and the predicate applied within the join filtered out our one row, the row from the dominant table is returned without reference to the join or the join predicate. The subordinate table columns are set to null because this part of the processing has nothing to do with that table.

Of course what looks even stranger is that we have applied in the query a predicate that says tb1.coldate is null but we actually have a none null value for tb1.coldate in our result set.

Both queries are logically and syntactically valid. Both have valid occasions for use, but it’s a good idea to know which you want and what the differences are.


Referentially valid table listing from SQL Server

A regular task when working with test and development databases is loading test data into tables where referential constraints are present. Whilst you can often drop or suspend the constraints whilst populating tables it’s often just easier to work out a referentially valid order so that parent tables get populated before their children. This usually only applies to relatively small volumes of data because larger volumes are generally loaded via utilities which don’t go through the SQL processing engine and so ignore the referential constraints. In this case the constraints usually need to be rechecked to reactivate them before the tables can be used. The process also usually avoids logging the rows added so a backup is normally required as well.

For small volumes of data, populating tables with regular insert statements is generally easier but it does require tables to be populated in a referentially valid order.  This is something I do just about every day, copying fairly small amounts of test data around for the development team.  Years ago I wrote for DB2 a Perl script that extracts from the catalog a list of the tables in a given schema in a referentially valid order. What I didn’t have was an equivalent for SQL Server and a recent job I had needed such a list. The data model only consisted of about 120 tables but that was enough for me to write the following  function.

alter function riorder(@schema nvarchar(128))
returns @riorderTabs table
( RILevel     smallint not null
, tablename nvarchar(128) not null
-- This function returns a table of RILevel and tablename for the specified schema
-- RILevel starts at 0 for tables with no RI Parents, and increases by 1 for each
-- 'generation' of tables.
-- It is possible to have "Referential loops", i.e. tabA is a parent of tabB, tabB
-- is a parent of tabC, tabC is a parent of tabA. This is unresolvable
DECLARE @schemaTabs table (tbname nvarchar(128));
DECLARE @level smallint;
DECLARE @tabsAtCurrentLevel table (tbname nvarchar(128));

-- start by getting al list of all the tables in the schema
insert into @schemaTabs
from information_schema.tables
where TABLE_SCHEMA = @schema
and table_type = 'BASE TABLE';

-- @level tracks the current RI level, we start at 0 for tables with no RI parent
set @level = 0;

-- We iterate through the levels until either
-- there are no tables left in @schemaTabs
-- or we find no tables at the current level (an
-- unresolvable referential loop)

declare @candidateTab as nvarchar(128);
declare candidateTabs_cursor cursor forward_only read_only for
select tbname from @schemaTabs;

declare @parentTablesFound as smallint;
declare @tablesAtCurrentLevel smallint;

-- set this to a sensible value so we drop into the main loop
select @tablesAtCurrentLevel = count(*) from @schemaTabs;
while @tablesAtCurrentLevel > 0
   set @tablesAtCurrentLevel = 0;

   -- now we iterate for each table @schemaTabs
   -- and find which have no RI parent table in the @schemaTabs
   -- list of tables
   open candidateTabs_cursor;
   fetch next from candidateTabs_cursor
   into @candidateTab;
   while @@FETCH_STATUS = 0
      -- We are only interested in whether @candidateTab has parent tables
      -- in @schemaTabs, not what they are
       select @parentTablesFound = count(*)
         -- get the parent table
             and constr.UNIQUE_CONSTRAINT_NAME = prnt.CONSTRAINT_NAME
               -- get the child table
               on constr.CONSTRAINT_CATALOG = chld.CONSTRAINT_CATALOG
               and constr.CONSTRAINT_SCHEMA = chld.CONSTRAINT_SCHEMA
               and  constr.CONSTRAINT_NAME = chld.CONSTRAINT_NAME
                     inner join @schemaTabs scTabs
                         on scTabs.tbname = prnt.TABLE_NAME
           where chld.TABLE_NAME = @candidateTab
           and chld.CONSTRAINT_SCHEMA = @schema;

           if @parentTablesFound = 0
              -- no parent tables in @schemaTabs so we note this table name.
                  -- We don't remove it from @schemaTabs yet because if we found
                  -- RI children of this table later in this loop we would think
                  -- these were at the current RI level
                  insert into @tabsAtCurrentLevel
                  values (@candidateTab);
                  set @tablesAtCurrentLevel = @tablesAtCurrentLevel + 1;
           -- fetch next table
           fetch next from candidateTabs_cursor
       into @candidateTab;

   close candidateTabs_cursor;
   -- Update the output list of tables for the current level
   insert into @riorderTabs select @level, tbname from @tabsAtCurrentLevel;
   -- Now we can remove the tables we just found
   -- from @schemaTabs
   delete from @schemaTabs where tbname in (select tbname from @tabsAtCurrentLevel);
   -- and clear down @tabsAtCurrentLevel
   delete from @tabsAtCurrentLevel;
   set @level = @level + 1;


Using the function is simply a matter of passing it the schema of the tables you are interested in:

select * from riorder('<schema>') order by RILevel, tablename

Because the first column is the ‘level’ you can get a reversed order (which is the valid order for deleting data) by using the DESCENDING  word in an ORDER BY clause.

select * from riorder('<schema>') order by RILevel, tablename order by RILevel descending

This script can fail to provide a complete answer if your data model contains a “referential loop”, i.e. table_A is a parent of table_B, table_B is a parent of table_C and table_C is a parent of table_A. This can happen perfectly legitimately but it’s generally a source of pain for your hardworking DBA trying to set up test data. If the above script finds this situation it recognises the loop and stops trying to go any further.

DB2 V9.7 on RHEL 5 as a data source for a Linked Server in SQL Server 2012

(Aside – it’s been far too long since I last posted anything here. As I’m doing more R&D at the moment I’ll try to post a bit more).

I recently wanted to connect DB2 V9.7 running on RHEL 5 to SQL Server 2012 using SQL Server’s linked server technology. At one time I used Linked Servers to connect to a wide range of data sources, but I hadn’t used them for a while. Thinking the world would have moved on a bit, I looked around the Web to see what the current thinking was but largely drew a blank. The only postings I could find seemed to be from people who described connecting to DB2 as difficult but they didn’t elaborate.

I’d be the first to admit that the process is not straightforward, but it’s not made easier by a lot of what is posted on the Web, which is often misleading at best or just wrong. Most references detail connecting to DB2 running on iSeries or z/OS via DB2 Connect. For DB2 running on Linux or Windows you don’t use DB2 Connect, you connect directly to the DB2 instance so the process is a little different.

I will assume the reader is familiar with Linked Servers in general. If not, have a play. I would suggest you start with something simple, like SQL Server talking to SQL Server. This combination is simple to set up and is reliable. I will also assume you know enough about DB2 to be able to catalog a node and database on the Windows server.

To connect to DB2, there are three OLE DB providers that you can use without paying for third party solutions. These are:

The Microsoft OLEDB Provider for DB2.

The IBM OLEDB Provider for DB2.

The Microsoft OLEDB Provider for ODBC using an ODBC connection to DB2.

From painful experience, if you want to use a third party OLEDB provider, insist on getting a free trial before paying for it. I recall some were no better, and indeed were sometimes worse, than the free versions.

The Microsoft OLEDB Provider for DB2 (DB2OLEDB).

This is installed from the SQL Server Feature Pack. It’s not part of a normal SQL Server installation. Quite simply, I can not get this to work. The installation looks OK, but I can not work out a combination of parameters to get a connection from SQL Server to DB2. The examples I can find all look like they are trying to talk to DB2 Connect. Possibly that’s all this Provider can do?

The IBM OLEDB Provider for DB2 (IBMDADB2).

This appears to work. It does need some effort to set it up but it is probably your best option.

Begin by installing the DB2 Runtime Client (RTC) on the server running SQL Server. The Runtime Client is the minimum DB2 client and just includes the network libraries and the basic command line interface. If you’re looking for the download, at the moment, it seems to be called “IBM Data Server Runtime Client for InfoSphere Federation Server V9.7 Multiplatform Multilingual eAssembly”. I promise I’m not making that up, I couldn’t if I’d tried.

Once the RTC is installed, catalog the source DB2 node and database. As the RTC has no graphical interface, you’ll need to use the db2 command line processor. Make sure this is working before you go any further. Be sure you can connect to your database and can select some data back. Both the OLE DB provider and, if you use it, the ODBC connections use your local database directory. Of course what this means is that both the OLE DB provider and the ODBC driver are just thin layers on top of the DB2 connection layer. This is good news as the DB2 connection via the RTC is reliable.

Next, start up SQL Server Management Studio and you should see in the Linked Servers/Providers list IBMDADB2.<your SQL Server instance>.

This next bit is vital – right click the provider and open the properties list. Make sure the “Allow inprocess” is ticked. Not only will this not work if you don’t have that ticked, it will actually crash the SQL Server service when you try to use the provider.

(Just as an aside, if you have SQL Server in a clustered environment, this would be a really good way to test fail over. Finding relatively harmless ways of crashing a database instance to test fail over isn’t always easy.)

Now you create your Linked Server. This is very simple because all the connectivity details are encapsulated in the DB2 database catalog entry you created earlier. All you need to specify for the linked server is

  • The Provider (from the drop down list).
  • Your database name as the Data Source.
  • Some user credentials in the Linked Server Security tab.

To check all is well, click on the linked server, expand Catalog/default/Tables/System Tables and you should see the DB2 catalog tables for your chosen database. In case you’re not so familiar with the DB2 system catalog, there’s no Master database in DB2. Each database has its own set of catalog tables and is completely isolated from the other databases in the instance (think SQL Server 2012 Contained databases).

Queries to this form of Linked Server use a 4 part name in the same way that Linked Servers to SQL Server do. For example:

select * from <Linked Server>.<database>.SYSCAT.TABLES

Be careful with the object name case. Object names are case sensitive. I know in the native DB2 client object names are case insensitive, but through a linked server they are. If you want to avoid 4 part names, create a local view and access the table through that.

So this works for me. As always with linked servers, if you’re planning on using them a lot, test them to death. In particular test the full range of data types that you want to process. Data type translations in Linked Servers are notoriously unreliable, often caused by poor metadata from the OLE DB Provider.


The good news about this type of Linked Server is, when joining tables on the DB2 instance, you can see SQL Server pushing the join down to DB2. The easiest way to test this is to check what SQL statements are being executed in the DB2 instance. I’d recommend doing this on a quiet system so you can pick out your statement easily.

On the DB2 Instance, if the Statement Monitor is off, connect to your source database and start a Statement Monitor for your connection:

$ db2 update monitor switches using statement on

Get a snapshot baseline of the prepared statements and write the snapshot to a file.

$ db2 get snapshot for dynamic sql on <database> > snp.1

Run your query, and get a second snapshot of prepared statements written to a different file.

$ db2 get snapshot for dynamic sql on <database> > snp.2

Now you can compare the two snapshots (I tend to just use diff) to find the statements that were run. Ideally you’ll be quick enough that the second snapshot only differs from the first as a result of your statement.

From a simple join such as:

select *

from <LinkedServer>.<Database>.<Schema>.TB1 t1

inner join <LinkedServer>.<Database>.<Schema>.TB2 t2

on t1.col1 = t2.col2

where t1.col2 like ‘XXX%’

The snapshot comparison shows 3 statements have been executed. The first 2 are of this form:

select * from “<database>”.”<Schema>”.”TB1″ FETCH FIRST 1 ROWS ONLY

select * from “<database>”.”<Schema>”.”TB2″ FETCH FIRST 1 ROWS ONLY

The third statement is a transformed version of your statement. It looks as if SQL Server makes a simple query against each table in your query to return returning one row to get the result set metadata. Then it prepares and runs your actual query. The single row, initial queries should execute very quickly and only add a very small overhead.

If you are using Linked Servers I would very strongly recommend that you check what is being passed down to the source databases. Against database engines such as DB2, Oracle or mySQL it’s simple enough to do and the time spent might well be a great deal less than the time you spend waiting for poorly optimised statements to return results.

The Microsoft OLEDB Provider for ODBC using an ODBC connection to DB2.

ODBC was supposed to have been superseded so long ago that it seems bizarre to be writing about it, but for completeness here we go. This works in part. As with the IBM OLE DB Provider, you need to start by installing the DB2 Runtime Client and again you should catalog the remote node and database that you are interested in.

Next use the IBM DB2 ODBC Driver associated with your local DB2 client instance to create an ODBC System DSN. Use the Test Connection option to make sure you’re getting through to the DB2 instance.

The final stage is to create the linked server using OLE DB Provider for ODBC (MSDASQL). Here the only parameters you need to specify are:

  • The Provider (from the drop down list).
  • Your ODBC DSN as the Data Source.
  • Some user credentials in the Linked Server Security tab.

The only way I can get this to work is by using the openquery function, e.g.

select * from openquery(<ODBC DSN>,’select count(*) from SYSCAT.TABLES’)

Any attempt at using the 4 part name syntax fails with:

Msg 7313, Level 16, State 1, Line 1

An invalid schema or catalog was specified for the provider “MSDASQL” for linked server “<ODBC DSN>”.