Introduction
This is the first part of a 2 part blog that outlines techniques in dealing with very large dimensions (or small ones for that matter) when going through your ETL (Extract, Transform & Load) process in your data warehouse using Microsoft SSIS (SQL Server Integration Services) as your ETL tool.
The Problem
- You have designed your data warehouse and you are left with a very large and complex dimension. (When I say large and complex I mean a million plus rows and 10 plus data sources, but the process works on any size of dimension)
- You still need to meet your ETL time requirements and it would not be plausible to process the full dimension for each ETL run.
For the purpose of this blog I have simplified the problem to a small simple dimension.
Here is a Entity Relationship diagram of the source for our dimension.
This is a very simplified Product dimension and the end result we want is:
Solution
To start I would like to caveat this with the fact that this is by no means the only and definitive way but a way that has worked for me and works well.
All ETL processed go through 4 main steps:
Extract - Extract the data from source systems
Clean - Clean the data for your data warehouse.
Conform - Change the data into your data warehouse format
Load - Load it into your data warehouse
Extract
Firstly we need to Extract the 3 source tables (Product, Product_Pricing, Product_Hierarchy). We only want to extract changes to these tables since the last load as we do not want to process the whole dimension each time. There are many ways that this can be done and is not part of the scope of this blog.
Clean
In clean we keep a current snapshot of the source table and mark any rows that change to 'changed'. For example if a department (1, Produce) in Product_Hierarchy changes to (1, Food & Produce) we change that row and mark it as changed.
Here is the sample data I am using in the clean area.
Product_Hierarchy
| Department |
Department_Name |
| 1 |
Produce |
| 2 |
Grocery |
| 3 |
BWS |
| 4 |
Technology |
Products
| Product_Key |
Product_Name |
Department |
| 1234 |
Lettuce |
2 |
| 1235 |
Potato |
2 |
| 1236 |
Tomato |
2 |
| 2341 |
Crisps |
1 |
| 2342 |
Biscuits |
1 |
| 3241 |
DVD |
4 |
| 3242 |
CD |
4 |
| 4321 |
Fosters |
3 |
| 4322 |
Stella |
3 |
| 4323 |
Pinot Grigio |
3 |
Product_Pricing
| Product_Key |
RRP |
| 1234 |
0.99 |
| 1235 |
0.89 |
| 1236 |
2.99 |
| 2341 |
1.89 |
| 2342 |
1.47 |
| 3241 |
3.57 |
| 3242 |
2.54 |
| 4321 |
1.50 |
| 4322 |
4.20 |
| 4323 |
2.30 |
So this is the data we have in our clean tables. We now need to merge these tables into one that would be used as out source dimension to conform.
This is the SQL used:
SELECT
p.[Product_Key] as [Source_Product_Key]
,p.[Product_Name] as [Source_Product_Name]
,p.[Department] as [Source_Department]
,h.[Department_Name] as [Source_Department_Name]
,price.[RRP] as [Source_RRP]
FROM [clean].[Products] p -- this is the driving table
LEFT OUTER JOIN [clean].[Product_Pricing] price
ON
p.[Product_Key] = price.[Product_Key]
LEFT OUTER JOIN [Clean].[Product_Hierarchy] h
ON
p.[Department] = h.[Department]
WHERE
p.[Changed] = 1 -- driving table has changed
OR price.[Changed] = 1 -- the price has changed
OR h.[Changed] = 1 -- the department has changed
This merges the the tables and makes sure we get all changes (by using the OR Changed flags). One of the problems you get with multiple sources is that if the Department_Name in Product_Hierarchy changes it could change more than one row in the Dimension (so you can't just use your driving product table to track changes).
For example, if (1, Produce) changes to (1, Food & Produce) as previously mentioned we need to make sure Products 2341 and 2342 are included to conform. Holding the full snapshot of your relational data in your staging area allows you to get ALL changes and only load changes (not the whole dimension). This approach can save many minutes on large dimensions.
The above solution does require the storage space in your staging database but this is a small price to pay for time saved.
Part 2 of this blog will explain how to flow this through a conform stage in SSIS. You can download the sample database I have used to have a play for yourself below