Altius Community

Altius Consulting Community
Welcome to Altius Community Sign in | Join | Help
in Search

Jake Smillie

June 2008 - Posts

  • Data Warehousing - ETL Dealing With Large Dimensions in SSIS

    Introduction

    This is the first part of a 2 part blog that outlines techniques in dealing with very large dimensions (or small ones for that matter) when going through your ETL (Extract, Transform & Load) process in your data warehouse using Microsoft SSIS (SQL Server Integration Services) as your ETL tool.

     

    The Problem

    - You have designed your data warehouse and you are left with a very large and complex dimension. (When I say large and complex I mean a million plus rows and 10 plus data sources, but the process works on any size of dimension)

    - You still need to meet your ETL time requirements and it would not be plausible to process the full dimension for each ETL run.

     

    For the purpose of this blog I have simplified the problem to a small simple dimension.

    Here is a Entity Relationship diagram of the source for our dimension.

     

    DataWarehouse_Diagram_1

     

    This is a very simplified Product dimension and the end result we want is:

    DataWarehouse_Diagram_2

     

    Solution

    To start I would like to caveat this with the fact that this is by no means the only and definitive way but a way that has worked for me and works well.

    All ETL processed go through 4 main steps:

    Extract   - Extract the data from source systems

    Clean      - Clean the data for your data warehouse.

    Conform - Change the data into your data warehouse format

    Load       - Load it into your data warehouse

     

    Extract

    Firstly we need to Extract the 3 source tables (Product, Product_Pricing, Product_Hierarchy). We only want to extract changes to these tables since the last load as we do not want to process the whole dimension each time. There are many ways that this can be done and is not part of the scope of this blog.

     

    Clean

    In clean we keep a current snapshot of the source table and mark any rows that change to 'changed'. For example if a department (1, Produce) in Product_Hierarchy changes to (1, Food & Produce) we change that row and mark it as changed.

     

    Here is the sample data I am using in the clean area.

    Product_Hierarchy

    Department Department_Name
    1 Produce
    2 Grocery
    3 BWS
    4 Technology

    Products

    Product_Key Product_Name Department
    1234 Lettuce 2
    1235 Potato 2
    1236 Tomato 2
    2341 Crisps 1
    2342 Biscuits 1
    3241 DVD 4
    3242 CD 4
    4321 Fosters 3
    4322 Stella 3
    4323 Pinot Grigio 3

    Product_Pricing

    Product_Key RRP
    1234 0.99
    1235 0.89
    1236 2.99
    2341 1.89
    2342 1.47
    3241 3.57
    3242 2.54
    4321 1.50
    4322 4.20
    4323 2.30

     

    So this is the data we have in our clean tables. We now need to merge these tables into one that would be used as out source dimension to conform.

    This is the SQL used:

     

    SELECT 
           p.[Product_Key]        as [Source_Product_Key]
          ,p.[Product_Name]        as [Source_Product_Name]
          ,p.[Department]        as [Source_Department]
          ,h.[Department_Name]    as [Source_Department_Name]
          ,price.[RRP]            as [Source_RRP]
    FROM [clean].[Products] p -- this is the driving table 
    LEFT OUTER JOIN [clean].[Product_Pricing] price
    ON
        p.[Product_Key] = price.[Product_Key]
    LEFT OUTER JOIN [Clean].[Product_Hierarchy] h
    ON
        p.[Department] = h.[Department]
    WHERE
        p.[Changed] = 1 -- driving table has changed
    OR    price.[Changed] = 1 -- the price  has changed 
    OR    h.[Changed] = 1 -- the department has changed 

     

    This merges the the tables and makes sure we get all changes (by using the OR Changed flags). One of the problems you get with multiple sources is that if the Department_Name in Product_Hierarchy changes it could change more than one row in the Dimension (so you can't just use your driving product table to track changes).

    For example, if (1, Produce) changes to (1, Food & Produce) as previously mentioned we need to make sure Products 2341 and 2342 are included to conform. Holding the full snapshot of your relational data in your staging area allows you to get ALL changes and only load changes (not the whole dimension). This approach can save many minutes on large dimensions.

    The above solution does require the storage space in your staging database but this is a small price to pay for time saved.

     

     

    Part 2 of this blog will explain how to flow this through a conform stage in SSIS. You can download the sample database I have used to have a play for yourself below

  • Quick Tip - Using Excel or TextPad to manipulate strings

    I am sure many times you have been asked to compare values in a database table against an Excel Spreadsheet or text file. You may be given a list of keys and a list of values. A quick way to write a SQL statement to retrieve the values is to use the SQL 'in' clause, but this poses the problem of adding commas or even commas and single quotes around the key value e.g.

    SELECT
        myKey,
        myValue
    FROM myTable
    WHERE
        myKey in 
    (
    'Key_1',
    'Key_2',
    'Key_3'
    )

    I have seen many times people do this by manually adding the single quotes and commas, this is fine if you have handful of values to compare but if you have 20+ it can become pretty tedious!

    There is a simpler way, in fact 2 that I regularly use.

     

    Excel

    You can use Excel formula to add the quotes and commas for you. Say you have you key values in column A you could write this formula:

    ="'" & A1 & "',"

    This will convert:

    Key_1 into 'Key_1',

    Once you have typed one formula you can use the Excel cell drag functionality to apply that formula to all your rows and then just copy the new column to your SQL query.

     

    TextPad

    TextPad is a pretty rich functionality text editor for Windows, and is definitely worth a look if you haven't used it before. You can download a trial copy from http://www.textpad.com/

    One useful functionality in TextPad is the ability to record macros. When you do this TextPad will record your key strokes and apply them to the whole text file when run.

    So using the same example above where you have a list:

    Key_1

    Key_2

    Key_3

     

    You can start recording a new macro and hit these keys: [Home], ['], [End], ['], [,], [Enter] then stop recording. Now just run your new macro applying it to the whole document and before you know it all your rows are in your desired format.

     

    I hope this tip might save you some effort in the future!