(By Jamie Thomson)
A common requirement when building a data warehouse is to be able to get all rows from a staging table where the business key is not in the dimension table. For example, I may want to get all rows from my STG_DATE table where the DateID is not in DIM_DATE.DateID.
There are 2 ways to do this in conventional SQL.
-- Method #1 INSERT DIM_DATE SELECT s.* FROM STG_DATE s LEFT OUTER JOIN DIM_DATE d ON s.DateID = d.DateID WHERE d.DateID IS NULL -- Method #2 INSERT DIM_DATE SELECT s.* FROM STG_DATE s WHERE DateID NOT IN ( SELECT DISTINCT DateID FROM DIM_DATE d )
There are many cases where using conventional T-SQL may not be an option in achieving this. Perhaps the data is on different servers. Or perhaps STG_DATE isn’t even a table; it may be a text file. In these cases you may have to use DTS to achieve your required results.
There are 2 methods of doing this in DTS 2005. The first method is analogous to the first SQL statement above whereas the second method builds on some of the new functionality in DTS 2005.
We’re going to need a source and destination table for demonstration purposes. Let’s create them now. For simplicity we are going to place the tables into the same database.
IF EXISTS (SELECT * FROM sys.tables WHERE NAME = 'STG_DATE') DROP TABLE STG_DATE CREATE TABLE STG_DATE ( DateID INT NOT NULL PRIMARY KEY , TheDate DATETIME NOT NULL , DayOfMonth INT NOT NULL , DayOfYear INT NOT NULL , DayOfWeek INT NOT NULL , DayName VARCHAR(9) NOT NULL , WeekOfYear INT NOT NULL , WeekName VARCHAR(7) NOT NULL , MonthOfYear INT NOT NULL , MonthName VARCHAR(9) NOT NULL , Quarter INT NOT NULL , QuarterName CHAR(7) NOT NULL , Year INT NOT NULL , Hour INT NOT NULL , Minute INT NOT NULL , Second INT NOT NULL , IsWeekday BIT NOT NULL , DayOfQuarter INT NOT NULL ) IF EXISTS (SELECT * FROM sys.tables WHERE NAME = 'DIM_DATE') DROP TABLE DIM_DATE CREATE TABLE DIM_DATE ( DateID INT NOT NULL PRIMARY KEY , TheDate DATETIME NOT NULL , DayOfMonth INT NOT NULL , DayOfYear INT NOT NULL , DayOfWeek INT NOT NULL , DayName VARCHAR(9) NOT NULL , WeekOfYear INT NOT NULL , WeekName VARCHAR(7) NOT NULL , MonthOfYear INT NOT NULL , MonthName VARCHAR(9) NOT NULL , Quarter INT NOT NULL , QuarterName CHAR(7) NOT NULL , Year INT NOT NULL , Hour INT NOT NULL , Minute INT NOT NULL , Second INT NOT NULL , IsWeekday BIT NOT NULL , DayOfQuarter INT NOT NULL )
And we’re going to need some data in STG_DATE. If you don’t edit it the following script will create a data-set from 1st Jan 1900 to 31st Dec 2050 which results in 55152 rows.
SET DATEFIRST 1 DECLARE @startdate DATETIME DECLARE @enddate DATETIME DECLARE @date DATETIME DECLARE @id INT SET @startdate = '1900-01-01' --Change these to SET @enddate = '2050-12-31' --whatever you want SET @id = 0 SET @date = DATEADD(dd, @id, @startdate) WHILE @date <= @enddate BEGIN INSERT INTO STG_DATE VALUES (@id --DateID , @date --TheDate , DATEPART(dd, @date) --DayOfMonth , DATEPART(dy, @date) --DayOfYear , DATEPART(dw, @date) --DayOfWeek , DATENAME(dw, @date) --DayName , DATEPART(ww, @date) --WeekOfYear , 'Week ' + RIGHT('0' + DATENAME(ww, @date), 2) --WeekName , DATEPART(mm, @date) --MonthOfYear , DATENAME(mm, @date) --MonthName , DATEPART(qq, @date) --Quarter , 'Q' + DATENAME(qq, @date) + ' ' + DATENAME(yy, @date) --QuarterName , DATEPART(yy, @date) --Year , DATEPART(hh, @date) --Hour , DATEPART(mi, @date) --Minute , DATEPART(ss, @date) --Second , CASE WHEN DATEPART(dw, @date) IN (6,7) THEN 0 ELSE 1 END --IsWeekday , (DATEDIFF(DAY, DATEADD(qq, DATEDIFF(qq,0,@date) ,0) ,@ate) + 1) --DayOfQuarter ) SET @id = @id + 1 SET @date = DATEADD(dd, @id, @startdate) END
Our aim is quite simply to get all the data from STG_DATE into DIM_DATE using DTS 2005.
Method 1
This method is an implementation of the LEFT OUTER JOIN with an IS NULL clause SQL statement that you see at the top of this article.
Let’s have a look at the data flow.
The STG_DATE source and DIM_DATE source OLE DB Source adapters point to the 2 tables we have just created. I am assuming that you are familiar with OLE DB Source adapters and know how to set up connections to the two tables. The Order by DateID* Sort transformations do exactly what they say on the tin.
Now we are going to join the data sourced from the STG_DATE and DIM_DATE tables.
- Drag a Merge Join transformation onto the designer and rename it Left Outer Join on DateID
- Drag the green connector from Order by DateID STG_DATE to Left Outer Join on DateID
- The Input Output Selection dialog will appear. From the “Input:” combo box select “Merge Join Left Input”. We are modeling a SQL Left Outer Join hence it is important that the correct data flow is applied to the correct Left Outer Join on DateID input
- Double click Left Outer Join on DateID to enter the Merge Join Editor
- Change the Join Type to “Left Outer Join”
- Select all the check boxes next to the columns in the left input. This will add them to the output data flow from Left Outer Join on DateID
- Select the DateID column from the right input and give it an alias of DIM_DATE_DateID. This will add the DateID column to the output data flow from Left Outer Join on DateID
Your Merge Join Editor should now look something like the following.
The Conditional Split transformation is used to split a data flow into multiple data flows based on the state of the data. In this case we are going to use it to identify all the data that has a NULL value in the DIM_DATE_DateID column.
- Drag a Conditional Split transformation onto the data flow designer
- Rename the Conditional Split transformation as Get New Dates
- Drag the output connector from Left Outer Join on DateID to Get New Dates
- Double click Get New Dates to view the Conditional Split Editor
- Expand Columns in the top left treeview and drag the DIM_DATE_DateID column into first row in the bottom half of the editor. This will enable us to build a condition on this column that allows us to define what data goes into this output
- Change the Condition to “ISNULL([DIM_DATE_DateID])”. This will ensure that only rows that have the value NULL in the DIM_DATE_DateID column will be included in this output
- Change the Output Name to “New Rows”
Your Conditional Split Editor should now look like this.
Finally, we need to put the data somewhere at the end of the data flow.
- Drag an OLE DB Destination adapter onto the designer
- Rename it DIM_DATE destination
- Drag the output connector from Get New Dates Conditional Split to DIM_DATE destination
- The Input Output Selection dialog will appear. From the “Output:” combo box select “New Rows”
- Double click DIM_DATE destination
- On the Connection tab select the DIM_DATE table
- Click on the mappings tab. The correct mappings will be created automatically by joining fields with identical names
And that’s it! Executing this data flow will result in all 55152 rows being copied from STG_DATE to DIM_DATE.
Method 2
This method uses the new DTS Lookup transformation. Before running this you will have to empty the destination table that you have just populated with method 1.
- Execute TRUNCATE TABLE DIM_DATE in SQL Server management Studio
Let’s have a look at the data flow.
The STG_DATE OLE DB Source adapter points to the STG_DATE table that we created earlier. Again I will assume you can configure this yourself.
Now drag a Lookup transformation onto your data flow designer.
- Rename the Lookup as Lookup DIM_DATE
- Drag the green connector from STG_DATE to Lookup DIM_DATE
- Double click Lookup DIM_DATE to open the Lookup Editor
- In the “Connection” combo box select the connection that points to the DIM_DATE table
- From the “Use a table or a view” combo box select the DIM_DATE table. This is specifying that we are going to use the DIM_DATE table as a lookup
- Click the Columns tab
- Mappings between the input columns and the lookup columns will have been created automatically. Delete all mappings except the one between the DateID columns. This ensures that we will use values from the lookup table where the values in the DateID columns are equal
The Lookup Editor Columns tab should look like the following.
The error output from Lookup DIM_DATE will contain all the rows for which there was no matching lookup record in the lookup table. In other words, it will contain all the rows that are not already in the lookup table. This is not an error per se but it achieves our aim.
Drag a new OLE DB Destination adapter onto the data flow designer
- Rename the adapter DIM_DATE
- Drag the red Error output from Lookup DIM_DATE to DIM_DATE
- The “Configure Error Output” dialog will be displayed. In the Error combo box change the selection to “Redirect Row” and click OK. This will ensure that no error is raised when rows get added to the Error output
- Double click DIM_DATE to display the OLE DB Destination Editor
- Configure DIM_DATE to point to the appropriate table and click on the mappings tab. The mappings will be configured automatically.
And that is it! Again, executing this data flow will result in all 55152 rows being copied from STG_DATE to DIM_DATE.
Conclusion
There are 2 methods that you can use to accomplish this requirement. Method 2 is quicker to build although some people may still choose to use the first method as they may be less happy in using the error output for something for which it was not designed. Experiment to see which method works best for your data set.