I was going through some notes I had from previous projects and came across a sample script for created a Type 2 Slow Changing Dimension (SCD) in a database or data warehouse.
A Type 2 SCD is one where new records are added, but old ones are marked as archived and then a new row with the change is inserted. This allows for a complete historical trail of the row’s changes in detail. Sometimes this can be overkill, but in some cases it is required. The client was doing medical claims processing, so we were required to keep a full history of key information.
In the past (read: pre-2008 versions of SQL) this used to have to be done with a bunch of little code and triggers to make sure the old data was kept.
On one of my recent projects I decided I wanted to use the new MERGE statement to create this same effect, with a LOT less code to maintain.
What actually made this possible is the MERGE statement’s output option. The merge statement actually captured all the changes it makes and tags them in an output table that you can reference. This feature let me capture information that was in the merge portion and pass it through to another statement.
Ok, enough blather, here’s the code sample:
INSERT INTO Customer_Master SELECT Source_Cust_ID , First_Name , Last_Name , Eff_Date , End_Date , Current_Flag FROM ( MERGE Customer_Master CM USING Customer_Source CS ON ( CM.Source_Cust_ID = CS.Source_Cust_ID ) WHEN NOT MATCHED THEN INSERT VALUES ( CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) , '12/31/2199' , 'y' ) WHEN MATCHED AND CM.Current_Flag = 'y' AND ( CM.Last_Name <> CS.Last_Name ) THEN UPDATE SET CM.Current_Flag = 'n' , CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101) OUTPUT $Action Action_Out , CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date , '12/31/2199' End_Date , 'y' Current_Flag) AS MERGE_OUT WHERE MERGE_OUT.Action_Out = 'UPDATE';
I know, it’s a little jumbled, but let’s break it into its basic parts:
First, the main insert statement:
INSERT INTO Customer_Master SELECT FIELDS FROM (** SQL CODE **) AS MERGE_OUT WHERE MERGE_OUT.Action_Out = 'UPDATE';
That should seem pretty familiar – just a simple insert statement from a source that contains a field (action_out) we can filter on.
Now for the more tricky part, the merge statement:
MERGE Customer_Master CM USING Customer_Source CS ON ( CM.Source_Cust_ID = CS.Source_Cust_ID ) WHEN NOT MATCHED THEN INSERT VALUES ( CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) , '12/31/2199' , 'y' ) WHEN MATCHED AND CM.Current_Flag = 'y' AND ( CM.Last_Name <> CS.Last_Name ) THEN UPDATE SET CM.Current_Flag = 'n' , CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101) OUTPUT $Action Action_Out , CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date , '12/31/2199' End_Date , 'y' Current_Flag
OK, still a little confusing so I’m going to describe what’s going on in plain English but first a quick note:
The current_flag field was added to the SCD so that an index could be created on it for real time reporting rather than spending the time searching for the current record through the detailed history. This is one of those cases that knowing what a customer really wants, regardless of what they ask for can make for simple performance tuning.
Now for the logic of the “Joe on the street” description of the merge statement:
MERGE Customer_Master CM USING Customer_Source CS ON ( CM.Source_Cust_ID = CS.Source_Cust_ID )
I am going to attempt to merge data into the Customer Master table from the Customer Source table where the ID values in each table match.
WHEN NOT MATCHED THEN INSERT VALUES ( CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) , '12/31/2199' , 'y' )
If I do not find a match in the customer source table I am going to insert a row with the values I require and move to the next record.
WHEN MATCHED AND CM.Current_Flag = 'y' AND ( CM.Last_Name <> CS.Last_Name ) THEN UPDATE SET CM.Current_Flag = 'n' , CM.End_date = CONVERT(CHAR(10), GETDATE() - 2, 101)
If I do find a match, and the current_flag is ‘y’ and the last names do not match, then I want to update the Master table with the new values and then move to the next record.
Every other match/no match scenario in this sample, I don’t need so there are no further checks done.
OUTPUT $Action Action_Out , CS.Source_Cust_ID , CS.First_Name , CS.Last_Name , CONVERT(CHAR(10), GETDATE() - 1, 101) Eff_Date , '12/31/2199' End_Date , 'y' Current_Flag;
When I’m done with the merge, I want to access the internal table and pass certain values out to any other statements. Note the $Action field in the output, this is what tells us the action the merge statement performed and will be what is used on the outer insert statement to perform the new record insert.
So putting it all together you get this:
INSERT INTO Customer_Master SELECT Source_Cust_ID, First_Name, Last_Name, Eff_Date, End_Date, Current_Flag FROM ( MERGE Customer_Master CM USING Customer_Source CS ON (CM.Source_Cust_ID = CS.Source_Cust_ID) WHEN NOT MATCHED THEN INSERT VALUES (CS.Source_Cust_ID, CS.First_Name, CS.Last_Name, convert(char(10), getdate()-1, 101), '12/31/2199', 'y') WHEN MATCHED AND CM.Current_Flag = 'y' AND (CM.Last_Name <> CS.Last_Name ) THEN UPDATE SET CM.Current_Flag = 'n', CM.End_date = convert(char(10), getdate()-2, 101) OUTPUT $Action Action_Out, CS.Source_Cust_ID, CS.First_Name, CS.Last_Name, convert(char(10), getdate()-1, 101) Eff_Date, '12/31/2199' End_Date, 'y'Current_Flag) AS MERGE_OUT WHERE MERGE_OUT.Action_Out = 'UPDATE';
Hopefully this will help you see just how powerful the merge statement in SQL Server can be.
By using this process we streamlined not only our code base for the data warehouse being built, but also the performance of maintaining a Type 2 SCD.
Helpful links:
SQL Merge Statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
SCD Types: http://www.folkstalk.com/2012/03/slowly-changing-dimensions-scd-types.html