Altius Community

Altius Consulting Community
Welcome to Altius Community Sign in | Join | Help
in Search

Jake Smillie

October 2008 - Posts

  • Kimball Microsoft Data Warehouse: What they don’t tell you! Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

    Welcome to my 4th and final post of my 4 part series of the Kimball Microsoft Data Warehouse toolkit. I hope you have found it interesting so far. If you have missed any posts, below is the outline of my series:

     

    Part 1 – SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

     

    The Problem

    Early arriving facts and implicitly failed dimension lookups are a common situation in Data Warehousing. One of the best examples of this would be the Customer Loyalty Card scheme. I am sure when shopping you have been asked to open up a new customer loyalty scheme, whilst being ensured that the current purchase will be added to your points total.

    Well, in this situation the point of sale will be the first time the new customer exists in a Data Warehouse. When we do a lookup against the Customer dimensions we won’t find the key. What do we do?

     

    What Kimball Say

    Kimball give a few ideas in handling a situation like this, but their preferred method is using inferred dimension members. Which, in short, is adding a stub dimension member (for example in the problem described above just adding the Customer number – probably the loyalty card number – to the Customer Dimension and wait for the details of that customer to come into the dimension at a later date). You can read a bit more about this here. Additionally there is a more in depth explanation from Tom Stagg here.

    This approach works very well for the problem described and is probably the best method for this type of problem. However, what happens if you need some dimensional data to work out fact data (e.g. If you have a situation where a new product can lend itself to early arriving facts – the transactional system is not kept up to date that well – you might want to store profit in the fact using the ETL to avoid calculating it. For this you need cost price of the product which will be dimensional data and not likely to feed from sales data, but wait a minute: This is a new product! We don’t know about it so how can we know the cost price).

     

    The Solution

    Well there are a few things you can do.

    - Don’t work out the profit of the sale (We don’t want to do this because this would make our Data Warehouse wrong)

    - Load it in initially with the profit as 0 and infer the new product. When we get the cost price of the new product we then update the fact. (This would work, but to me, it sounds very complex to add workflow into dimensions to then work out what facts, if any it needs to update and additionally updating fact rows is considered as best avoided)

    - Load them into a Fact_Failed_Lookup table. (This is my preferred approach and I will tell you why)

     

    So, lets take a look at what I mean.

    Imagine we have a Product dimension with one product in and an Unknown member.

    Products_ID Product_Key Product_Name RRP Cost_Price RowIsCurrent RowStartDate RowEndDate
    -1 0000 Unknown 0.00 0.00 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00
    1 1234 Lettuce 0.99 0.50 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00

    If we sell product 1234 we can work out the Profit using the cost price and the transaction price (Price - Cost Price).

    Transaction Date Product_Key Qty Price Profit
    1 20081028 1234 1 0.99 0.49

    However if we sell a “Beer” (product code 5555) which isn’t in our Dimension we don’t yet know the cost price and cannot work out the profit.

    Transaction Date Product_Key Qty Price Profit
    2 20081028 5555 1 4.99 xxx

    So, we load the transactions as follows.

    Fact_Sales

    Transaction Date_ID Product_ID Qty Price Profit
    1 20081028 1 1 0.99 0.49

    Fact_Sales_Failed_Lookup

    Transaction Date_ID Date Product_ID Product_Key Qty Price Profit Estimate
    2 20081028 20081028 -1 5555 1 4.99 0 TRUE

    When we load the cube we use a view that unions the Fact_Sales and the Fact_Sales_Failed_Lookup key values (additionally we add a view for RDBMS reporting access that does the same)

    This means the overall takings are correct but the profit may not be. I like to add Estimate to the cube so that business users can filter on estimates or not if they wish.

    Now, each day the failed lookup fact table is monitored by the Standing Data team where any invalid business / natural keys can be modified (e.g. Typo issues) and investigation into failures. Each day as part of the ETL the Failed Lookups are truncated and dimension lookups are re-tried (or this can be scheduled to do so on a weekend perhaps). If they fail again they go back into the failed lookups, if they succeed they go into the real fact.

    This is the workflow:

     

    FailedLookups

     

    This approach has added advantages in that:

    - If typos occur in fact data they can be updated easily and the ETL just fixes it

    - If you want / need fact data more frequent than dimension processing, you can just add all fact data that is newer than your dimensional load date to your failed lookup table and mark them as estimates. The next time the dimensions are processed the estimate rows can simply flow into the real fact.

    However be warned using this approach making sure the Failed Lookup facts are actually looked at by the Standing Data Team and they don’t just keep growing and growing.

     

    This concludes my series, which I hope you have found useful and valuable.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 3 – A Composite Type Change (Type-1 & Type-2)

    This is the third part of my 4 part blog series that is aimed to give the world more insight on some of the Kimball Group concepts and, sometimes difference in opinions. Below is the outline of the series.

     

    Part 1  - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

     

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Today I am going to talk about a Type of change that is a combination of Type-1 and Type-2 changes. This is not the well understood Type-1.5 change but a composite Type change where multiple columns are related to one and other and each having a different change type. Below is the outline of the problem.

     

    The Problem

    In my previous posts I have used a simple Product dimension, and I will continue to use it. It is quite likely a Product dimension will have some sort of Product Category or Brand attributes e.g. Produce, Beers Wines & Spirits, etc. It is also quite likely that there will be a business key for the product categories e.g. 100 = Produce, 200 = Beers Wines & Spirits, etc. Lets take a look at what that might look like:

     

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

     

    In this simple dimension model it would be quite likely to have Department as a Type-2 and Department_Name as a Type-1. The business should be able to change the alias of the department without affecting their figures but if a product changes department then we don’t want historic sales to move with it.

     

    So what would happen if we get a new row like this:

    Product_Key Product_Name Dep_Code Dep_Name RRP
    1234 Bag of Lettuce 2 Grocery & Frozen 1.3

     

    So, here the Product_Name (Type-1) has changed as has the Department_Name (Type-1). If we flow it through the Kimball sample ETL approach

     

    SCD_Simple

     

    In this situation the new row will flow down the Type-1 change route, updating the Product_Name and Department_Name for all historic rows and current row. The table will now look like this:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery & Frozen 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery & Frozen 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    This is all fine and expected. However have you spotted the issue yet? What happens if Product: 1234 changes Department e.g:

    Product_Key Product_Name Dep_Code Dep_Name RRP
    1234 Bag of Lettuce 3 Produce 1.3

    In this situation the Checksum for Type-1 and Type-2 will mark a change; Department has changed from ‘2’->’3’ causing a Type-2 checksum to change and Department_Name has also changed from ‘Grocery & Frozen’->’Produce’ causing the Type-1 checksum to change.

    So, we update all the historic rows with the new Type-1 data and insert the new Type-2 row (after handling the Type-2 change). Using this logic the Dimension will now look like this.

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Produce 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Produce 1.3 0 24-Oct-2008 00:00:00 27-Oct-2008 00:00:00
    3 1234 Bag of Lettuce 3 Produce 1.3 1 27-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    Uh-oh!! The current row is correct but the previous rows have had their Department_Name’s updated which is wrong!! We want the Dimension to look like this:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery & Frozen 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery & Frozen 1.3 0 24-Oct-2008 00:00:00 27-Oct-2008 00:00:00
    3 1234 Bag of Lettuce 3 Produce 1.3 1 27-Oct-2008 00:00:00 31-Dec-9999 00:00:00

     

    What Kimball Say

    Well I haven’t read anywhere in Kimball media regarding this specific problem (however I would like to caveat that with the fact I haven’t read all their media), but they do give out a sample SSIS package that would fall over if a dimension was modelled in this way. I do not think by any stretch of the imagination that Kimball have provided this sample as an all conquering, all problem solving solution to all your dimension needs and why should they? However it would be quite easy to fall into the trap of assuming it would work for your dimension. This post’s purpose is to share knowledge and try and prevent people tripping over this problem.

     

    The Solution

    Some of you may be reading this and thinking, “I would just use a Snowflake for the Department”, and yes this would solve the issue for the specific problem I have described (where you only have 1 level in the Department hierarchy), but this is probably quite unlikely in a production Data Warehouse, as you would probably have multiple levels; meaning you would just get the same problem in the Snowflaked dimension (I doubt you would want snowflake on top of snowflake for each level in the hierarchy).

    In actual truth I do not have a specific (generic) solution to this problem yet, except to say that you need to build in some logic into your Type-1 updates to ensure that if the related Type-2 attribute changes, you don’t update all the historic data also. Each problem will have a slightly different approach to best solving this problem. If I do work out a way to capture all issues I will be sure to post the solution.

    Let me know how you have solved this in the past.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 2 – Type-1 Changes & Historical Data

    This is the second part of my 4 part blog that is aimed to give the world more insight on some of the Kimball Group concepts and, sometimes difference in opinions. In my first blog I talked about my difference in opinion to Kimball Group on handling Type-2 changes in dimensions and what dates to use to end and start the new row. You can read that here.

     

    Below is an overview of my series:

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 - A Composite Type Change (Type-1 & Type-2)

    Part 4 - Dealing with Failed Lookups & Early Arriving Facts - Alternative Approach

     

    Part 2 – Type-1 Changes and Historical Data

    In general the concepts of different Type changes within dimension processing is pretty well understood. There is a lot of material out there that describes the different types. ETL-Tools.info has a brief explanation if you need one. There is also a lot of information on how to deal with them within your ETL.

     

    The Problem

    In my experience many of your Dimensions within your Data Warehouse will have both Type-1 and Type-2 attributes. This is fine and also accepted, but when we have a member that has gone through many Type-2 changes over time, what do we do when we get a Type-1 change on that member?

     

    Consider the changes below:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    On 25th of October we get a Type-1 change to the Product name changing it to ‘Bag of Lettuce’

    Do we change Products_ID 1 and 2? or just 2 (the current row)?

    What Kimball Say

    In fairness, Kimball does suggest the correct way of dealing with this, but I would be more forceful and say it is THE way to deal with it.

    So Kimball suggests that we would normally change 1 & 2 (change for all historical rows also) and I believe this to be correct. So lets see why.

    If we had this data in the product dimension

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    and this data in our sales fact

    Date_ID Products_ID Qty
    20081023 1 1
    20081024 2 1

    How many “Bag of Lettuce” have we sold in total?

    Well lets see… This could be the SQL:

       1: SELECT
       2:     Qty as 'Total Qty'
       3: FROM Fact_Sales s
       4: INNER JOIN Dim_Products p
       5: ON
       6:     s.Products_ID = p.Products_ID
       7: WHERE
       8:     p.Product_Name = 'Bag of Lettuce'

    The answer given here would be 1.

    This is wrong because we have actually sold 2. (I appreciate that for this type of query I would probably use the Product_Key but I am just demonstrating the issue).

    This means our aggregations are invalid and not correct. If we change the historical row(s) (as suggested) as well, our aggregations remain in tact.

    So now our product dimension looks like this

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    And if we use the same SQL above then we get the correct answer.

    It does mean on large dimensions that change quite frequently, Type-1 changes can become expensive within your ETL, but expensive is OK if you get the correct answer.

    The Solution

    The solution is easy. Always update all your history for Type-1 changes.

    An alternative to this approach is to create a view on top of your dimensions that joins on itself to retrieve the most recent Type-1 attributes but I see this as a bit of a hack and find it much more sensible to keep the dimension as clean and conformed as possible.

    Let me know your thoughts.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 1 - SCD Dates

    I was recently lucky enough to attend the Kimball University "Microsoft Data Warehouse in Depth" course, which I thoroughly recommend to anyone looking at the Microsoft toolkit for their Data Warehouse solution. It is well structured, well thought out and delivered in a clear and authoritative manner. Both Warren and Joy know their stuff (details about Warren and Joy can be found here).

    All this being said I did feel they missed out a few 'gotchas', and some of their methods and advice conflicted with a few of my personal opinions regarding Data Warehousing best practices (at least in the Microsoft Toolkit). This blog series is going to talk about these things and why I feel they are important.

     

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 - A Composite Type Change (Type-1 & Type-2)

    Part 4 - Dealing with Failed Lookups & Early Arriving Facts - Alternative Approach

     

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    The Problem

    When I Type-2 change a dimension member what Start and End dates do I use?

    Lets have a look at an example scenario:

     

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00

     

    Now we load the dimensional data for 24-Oct-2008 which changes the RRP to £1.30 of Lettuce (we are all feeling the credit crunch squeeze!). This would be a Type-2 change. So how do we add a new row and what dates do we use in the RowStart and RowEnd dates? Is it 24-Oct-2008 or 23-Oct-2008?

    This is a problem that I have had arguments about in the past, and I bet if you ask 10 different Data Warehouse experts 5 will give you one answer and 5 will give you another. In the main I do not think it matters too much as long as you are consistent, except when using Microsoft SQL Server as your RDBMS (at least pre 2008) or when you are loading more frequently than once per day.

     

    What Kimball Say

    Kimball tell you to switch the row like below:

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 0 01-Jan-1900 00:00:00 23-Oct-2008 00:00:00
    2 1234 Lettuce 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    So the RowEnd date is the load date (remember we are loading 24-Oct-2008) –1 day.

    If we had a simple Sales fact when we do lookups loading our facts we can either do it like follows:

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date >= p.RowStartDate
       8: AND f.Date <= p.RowEndDate

     

    or

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date BETWEEN p.RowStartDate AND p.RowEndDate

     

    Why is it an Issue?

    It is often common practice within a transactional system to mark dates something occurred as the current system Datetime (especially when using SQL Server i.e. GetDate())

    Pre SQL Server 2008 there is no Date type so all dates are time specific also. So if we consider a fact table feed like this:

    Transaction Date Product_Key Qty
    1 23-Oct-2008 00:00:00 1234 1
    2 23-Oct-2008 09:00:00 1234 1
    3 24-Oct-2008 00:00:00 1234 1

    Using the the Kimball approach the data we receive in the join will be:

    Transaction Products_ID
    1 1
    2 NULL
    3 2

    Uh oh!! We are missing a product ID! This is because Transaction 2 happened at 9am (not exactly midnight) and due to the RowStart and RowEnd dates being accurate to the second (smalldatetime) or even 1000th of a second (datetime) we miss the join. To get round this you could strip the Fact date to just date (force it to be 00:00:00) but this can be expensive, especially when doing this for millions of fact rows and when there is an alternative way.

     

    The Solution

    Instead of using the load day –1 we use the load day as the RowEndDate so our Lettuce now looks like this:

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    We now change our join very slightly to the below:

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date >= p.RowStartDate
       8: AND f.Date <  p.RowEndDate

    So we only join on rows less than but not equal to the RowEndDate. So now we get these values:

    Transaction Products_ID
    1 1
    2 1
    3 2

    Yay!! we are not missing any Product_IDs now, and all is happy.

    This approach has saved expensive date conversions when they are not necessary. As long as the RowEndDate of the previous row is the same as the RowStartDate of the next row then all is happy. This is my standard approach and it is also an approach that will work if you need to move to real time processing (changes occur multiple times a day).

    I hope you can see how this approach ticks more of the boxes in issue resolution. Let me know about your thoughts and how you have tackled this problem in the past.