Type 2 SCD with SQL Merge

I was going through some notes I had from previous projects and came across a sample script for created a Type 2 Slow Changing Dimension (SCD) in a database or data warehouse.
A Type 2 SCD is one where new records are added, but old ones are marked as archived and then a new row with the change is inserted. This allows for a complete historical trail of the row’s changes in detail. Sometimes this can be overkill, but in some cases it is required. The client was doing medical claims processing, so we were required to keep a full history of key information.

In the past (read: pre-2008 versions of SQL) this used to have to be done with a bunch of little code and triggers to make sure the old data was kept.
On one of my recent projects I decided I wanted to use the new MERGE statement to create this same effect, with a LOT less code to maintain.

What actually made this possible is the MERGE statement’s output option. The merge statement actually captured all the changes it makes and tags them in an output table that you can reference. This feature let me capture information that was in the merge portion and pass it through to another statement.

Ok, enough blather, here’s the code sample:

INSERT  INTO Customer_Master
        SELECT  Source_Cust_ID ,
                First_Name ,
                Last_Name ,
                Eff_Date ,
                End_Date ,
                Current_Flag
        FROM    ( MERGE Customer_Master CM
                    USING Customer_Source CS
                    ON ( CM.Source_Cust_ID = CS.Source_Cust_ID )
                    WHEN NOT MATCHED 
                        THEN
		INSERT           VALUES
                ( CS.Source_Cust_ID ,
                  CS.First_Name ,
                  CS.Last_Name ,
                  CONVERT(CHAR(10), GETDATE() - 1, 101) ,
                  '12/31/2199' ,
                  'y'
                )
                    WHEN MATCHED AND CM.Current_Flag = 'y'
                        AND ( CM.Last_Name <> CS.Last_Name )
                        THEN
		 UPDATE          SET
                CM.Current_Flag = 'n' ,
                CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101)
                    OUTPUT
                        $Action Action_Out ,
                        CS.Source_Cust_ID ,
                        CS.First_Name ,
                        CS.Last_Name ,
                        CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date ,
                        '12/31/2199' End_Date ,
                        'y' Current_Flag) AS MERGE_OUT
        WHERE   MERGE_OUT.Action_Out = 'UPDATE';

I know, it’s a little jumbled, but let’s break it into its basic parts:

First, the main insert statement:

INSERT  INTO Customer_Master
SELECT  FIELDS
FROM    (** SQL CODE **)  AS MERGE_OUT
WHERE   MERGE_OUT.Action_Out = 'UPDATE';

That should seem pretty familiar – just a simple insert statement from a source that contains a field (action_out) we can filter on.

Now for the more tricky part, the merge statement:

MERGE Customer_Master CM
                    USING Customer_Source CS
                    ON ( CM.Source_Cust_ID = CS.Source_Cust_ID )
                    WHEN NOT MATCHED 
                        THEN
		INSERT           VALUES
                ( CS.Source_Cust_ID ,
                  CS.First_Name ,
                  CS.Last_Name ,
                  CONVERT(CHAR(10), GETDATE() - 1, 101) ,
                  '12/31/2199' ,
                  'y'
                )
                    WHEN MATCHED AND CM.Current_Flag = 'y'
                        AND ( CM.Last_Name <> CS.Last_Name )
                        THEN
		 UPDATE          SET
                CM.Current_Flag = 'n' ,
                CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101)
                    OUTPUT
                        $Action Action_Out ,
                        CS.Source_Cust_ID ,
                        CS.First_Name ,
                        CS.Last_Name ,
                        CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date ,
                        '12/31/2199' End_Date ,
                        'y' Current_Flag

OK, still a little confusing so I’m going to describe what’s going on in plain English but first a quick note:
The current_flag field was added to the SCD so that an index could be created on it for real time reporting rather than spending the time searching for the current record through the detailed history. This is one of those cases that knowing what a customer really wants, regardless of what they ask for can make for simple performance tuning.

Now for the logic of the “Joe on the street” description of the merge statement:

MERGE Customer_Master CM
                    USING Customer_Source CS
                    ON ( CM.Source_Cust_ID = CS.Source_Cust_ID )

I am going to attempt to merge data into the Customer Master table from the Customer Source table where the ID values in each table match.

           WHEN NOT MATCHED 
                        THEN
		INSERT           VALUES
                ( CS.Source_Cust_ID ,
                  CS.First_Name ,
                  CS.Last_Name ,
                  CONVERT(CHAR(10), GETDATE() - 1, 101) ,
                  '12/31/2199' ,
                  'y'
                )

If I do not find a match in the customer source table I am going to insert a row with the values I require and move to the next record.

    WHEN MATCHED AND CM.Current_Flag = 'y'
                        AND ( CM.Last_Name &lt;&gt; CS.Last_Name )
                        THEN
		 UPDATE          SET
                CM.Current_Flag = 'n' ,
                CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101)

If I do find a match, and the current_flag is ‘y’ and the last names do not match, then I want to update the Master table with the new values and then move to the next record.
Every other match/no match scenario in this sample, I don’t need so there are no further checks done.

 OUTPUT
                        $Action Action_Out ,
                        CS.Source_Cust_ID ,
                        CS.First_Name ,
                        CS.Last_Name ,
                        CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date ,
                        '12/31/2199' End_Date ,
                        'y' Current_Flag;

When I’m done with the merge, I want to access the internal table and pass certain values out to any other statements. Note the $Action field in the output, this is what tells us the action the merge statement performed and will be what is used on the outer insert statement to perform the new record insert.

So putting it all together you get this:

INSERT INTO Customer_Master
SELECT 
	Source_Cust_ID, 
	First_Name, 
	Last_Name, 
	Eff_Date, 
	End_Date, 
	Current_Flag
FROM
	( MERGE Customer_Master CM
		USING Customer_Source CS
	ON (CM.Source_Cust_ID = CS.Source_Cust_ID)
	WHEN NOT MATCHED THEN
		INSERT VALUES (CS.Source_Cust_ID, CS.First_Name, CS.Last_Name,
		convert(char(10), getdate()-1, 101), '12/31/2199', 'y')
	WHEN MATCHED AND CM.Current_Flag = 'y' AND (CM.Last_Name <> CS.Last_Name ) THEN
		 UPDATE SET CM.Current_Flag = 'n', CM.End_date = convert(char(10), getdate()-2, 101)

	OUTPUT $Action Action_Out, 
			CS.Source_Cust_ID, 
			CS.First_Name, 
			CS.Last_Name,
			convert(char(10), getdate()-1, 101) Eff_Date, 
			'12/31/2199' End_Date, 
			'y'Current_Flag) 
AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';

Hopefully this will help you see just how powerful the merge statement in SQL Server can be.
By using this process we streamlined not only our code base for the data warehouse being built, but also the performance of maintaining a Type 2 SCD.

 

Helpful links:

SQL Merge Statement: http://technet.microsoft.com/en-us/library/bb510625.aspx

SCD Types: http://www.folkstalk.com/2012/03/slowly-changing-dimensions-scd-types.html