Welcome to my 4th and final post of my 4 part series of the Kimball Microsoft Data Warehouse toolkit. I hope you have found it interesting so far. If you have missed any posts, below is the outline of my series:
Part 1 – SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)
Part 2 – Type-1 Changes and Historical Data
Part 3 – A Composite Type Change (Type-1 & Type-2)
Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach
Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach
The Problem
Early arriving facts and implicitly failed dimension lookups are a common situation in Data Warehousing. One of the best examples of this would be the Customer Loyalty Card scheme. I am sure when shopping you have been asked to open up a new customer loyalty scheme, whilst being ensured that the current purchase will be added to your points total.
Well, in this situation the point of sale will be the first time the new customer exists in a Data Warehouse. When we do a lookup against the Customer dimensions we won’t find the key. What do we do?
What Kimball Say
Kimball give a few ideas in handling a situation like this, but their preferred method is using inferred dimension members. Which, in short, is adding a stub dimension member (for example in the problem described above just adding the Customer number – probably the loyalty card number – to the Customer Dimension and wait for the details of that customer to come into the dimension at a later date). You can read a bit more about this here. Additionally there is a more in depth explanation from Tom Stagg here.
This approach works very well for the problem described and is probably the best method for this type of problem. However, what happens if you need some dimensional data to work out fact data (e.g. If you have a situation where a new product can lend itself to early arriving facts – the transactional system is not kept up to date that well – you might want to store profit in the fact using the ETL to avoid calculating it. For this you need cost price of the product which will be dimensional data and not likely to feed from sales data, but wait a minute: This is a new product! We don’t know about it so how can we know the cost price).
The Solution
Well there are a few things you can do.
- Don’t work out the profit of the sale (We don’t want to do this because this would make our Data Warehouse wrong)
- Load it in initially with the profit as 0 and infer the new product. When we get the cost price of the new product we then update the fact. (This would work, but to me, it sounds very complex to add workflow into dimensions to then work out what facts, if any it needs to update and additionally updating fact rows is considered as best avoided)
- Load them into a Fact_Failed_Lookup table. (This is my preferred approach and I will tell you why)
So, lets take a look at what I mean.
Imagine we have a Product dimension with one product in and an Unknown member.
| Products_ID | Product_Key | Product_Name | RRP | Cost_Price | RowIsCurrent | RowStartDate | RowEndDate |
| -1 | 0000 | Unknown | 0.00 | 0.00 | 1 | 01-Jan-1900 00:00:00 | 31-Dec-9999 00:00:00 |
| 1 | 1234 | Lettuce | 0.99 | 0.50 | 1 | 01-Jan-1900 00:00:00 | 31-Dec-9999 00:00:00 |
If we sell product 1234 we can work out the Profit using the cost price and the transaction price (Price - Cost Price).
| Transaction | Date | Product_Key | Qty | Price | Profit |
| 1 | 20081028 | 1234 | 1 | 0.99 | 0.49 |
However if we sell a “Beer” (product code 5555) which isn’t in our Dimension we don’t yet know the cost price and cannot work out the profit.
| Transaction | Date | Product_Key | Qty | Price | Profit |
| 2 | 20081028 | 5555 | 1 | 4.99 | xxx |
So, we load the transactions as follows.
Fact_Sales
| Transaction | Date_ID | Product_ID | Qty | Price | Profit |
| 1 | 20081028 | 1 | 1 | 0.99 | 0.49 |
Fact_Sales_Failed_Lookup
| Transaction | Date_ID | Date | Product_ID | Product_Key | Qty | Price | Profit | Estimate |
| 2 | 20081028 | 20081028 | -1 | 5555 | 1 | 4.99 | 0 | TRUE |
When we load the cube we use a view that unions the Fact_Sales and the Fact_Sales_Failed_Lookup key values (additionally we add a view for RDBMS reporting access that does the same)
This means the overall takings are correct but the profit may not be. I like to add Estimate to the cube so that business users can filter on estimates or not if they wish.
Now, each day the failed lookup fact table is monitored by the Standing Data team where any invalid business / natural keys can be modified (e.g. Typo issues) and investigation into failures. Each day as part of the ETL the Failed Lookups are truncated and dimension lookups are re-tried (or this can be scheduled to do so on a weekend perhaps). If they fail again they go back into the failed lookups, if they succeed they go into the real fact.
This is the workflow:
This approach has added advantages in that:
- If typos occur in fact data they can be updated easily and the ETL just fixes it
- If you want / need fact data more frequent than dimension processing, you can just add all fact data that is newer than your dimensional load date to your failed lookup table and mark them as estimates. The next time the dimensions are processed the estimate rows can simply flow into the real fact.
However be warned using this approach making sure the Failed Lookup facts are actually looked at by the Standing Data Team and they don’t just keep growing and growing.
This concludes my series, which I hope you have found useful and valuable.