Altius Community

Altius Consulting Community
Welcome to Altius Community Sign in | Join | Help
in Search

Jake Smillie

  • UK OUG & Shared Services Security

    This week I attended and presented at my first UK Oracle User Group, Special Interest Group (SIG) for Essbase and Planning. Attending the UK OUG was definitely beneficial and great to find out what other people are doing with the products. I would thoroughly recommend this to anyone who has an interest in Essbase and Planning to attend these events. You can join the UK OUG by signing up http://www.ukoug.org/ and you should.

    As I said I also presented a piece on Hyperion Shared Services 11.1.1.3 with tips and tricks on how to setup security and why you should use it. To see the slides please join UK OUG and go http://www.ukoug.org/lib/show_document.jsp?id=11565

    Enjoy, and I hope to see you at future UK OUG meetings!

  • Modelling & Loading a Slowly Changing Many-to-Many Relationship

    I haven’t blogged for a while, mostly because I have been busy. Today I am on a train to London , so I thought what better thing could I do to fill the time than to blog about something I have been meaning to blog about for a while: Modelling Slowly changing many-to-many relationships in a Data Warehouse.

    This is something that, on paper, seems easy, but also very easy to get wrong / fall into common pitfalls.

    When I was first asked to model such a scenario I went looking in text books and Google (or do I mean Bing now?). Anyway, I was shocked to see a distinct lack of information about this or at least no information that went into enough detail. So here is my in detail explanation on how to do this.

    Consider the scenario where you have products and groups of products. A product can belong to many groups and they can also move in and out of groups, but we want to measure sales at any given time for these groups. Your Products dimension can slowly change, as can your ProductGroups, as can your ProductGroup_Mappings!

    Below are extracts of an example Product, ProductGroup dimensions and a Sales Fact Extract.

    image

    image

    image

    Group 1 has Product A up until 20th-Feb-2009 and has Product B always. So if we wanted to get the Sum of all sales for Group 1 we would get 5(Transactions 1,2,4,5,6). We would not include 3 because this sold product A when it was not part of Group 1 (after 20th-Feb-2009).

    To model this in a Data Warehouse, you may initially do it like the following:

    Linked on Business Keys

    image

    So from the above you can see after 20th-Feb-2009 Product A is no longer part of group 1. This would be fine, and work if you are using SQL to retrieve data because you can add some between clauses using the RowStart and End dates etc when doing a retrieval to find total sales for Group 1, but for me, this is missing the point of a Data Warehouse. Querying a well designed Data Warehouse should be simple, I don’t want to have to worry about complex where clauses to get the data I want. It should all just work! Besides if you had an OLAP engine such as SSAS on top of your Data Warehouse it would not know what to do with RowStart and End Dates.You need to be able to write queries by just using simple joins without where clauses on RowStart and End dates to do things like this. The best way to think about this is VLOOKUPS in Excel. SSAS will in essence do VLOOKUPS from your fact to your dimension to get the correct dimension member.

    So let’s see how this would aggregate all sales for Group 1.

    image

    So what’s happening here?:

    First find all products in Grp1 (I used a VLOOKUP for this: Matches twice and return Products A & B.

    Then:

    (sum all sales where Product = A) + (sum all sales where Product = B): I used SUMPRODUCT for this

    As you can see it has matched on all sales for Product A which is wrong. We want it to exclude transaction 3 as it happened when Product A was not in Group 1.

    This approach on joining on business keys works fine for groups that do not slowly change. E.g. You add a product to a group and all sales of that product are attributed to that group. You then remove it and none are attributed to that group.

    Ah… I need to do the relationship on surrogate keys. After all this is why we have them in slowly changing dimensions!

    Linked on Surrogate Keys

    Product Dimension for reference:

    image

    image

    Now we see and extra row (2) due to the slow change of product A. Now lets see what happens when we link on keys to the sales.

    Sales data for reference:

    image

    image

    So again, we first get the product keys that are attributed to Group 1: (P1,P2 & P3) and we then aggregate all sales where those are the keys for.

    Bingo! We have the right answer. Modelling on surrogate keys is definitely the way forward. But is that it? The more astute of you will realise that this has the exact same problem as the Business Key modelling. If we had a sale of Product A on the 22-Feb-2009, after Product A has been removed from group 1 but before it slowly changed itself (on 01-Mar-2009) the join would include that sale, wouldn’t it?

    Well lets see…

    image

    Transaction 7 is the new transaction.

    image

    Well yes our suspicions are correct it does break!

    What this means is that if any change to a group happens (add a remove a product to it) we need to slowly change the products in the ETL or ELT process before any sales data (or any fact data) comes through. So lets see what the new Product dimension looks like when this happens.

    image

    There is a new product key for product A which starts on the 20th-Feb-2009 (when Product A is removed from Group1) and then again for P5 is the normal slow change on 01-Mar-2009. This makes the sales fact look like this:

    image

    Now it definitely works. Good stuff.

    Conclusion

    - You need to model such scenarios using surrogate keys and not business ones

    - Every time the members of a group change you need to force a change on the affected dimensions. This process makes the loading of the warehouse complex especially when you have multiples of these relationships and they change frequently, but who said loading Data Warehouses was easy?

    There is one other way to do this relationship without having to force the change of the members and that is to include the date key in the many-to-many mapping. Whilst this would work, it could have a detrimental affect on performance as you could quickly have millions of rows in your mapping table / bursting view (if you join on dates dimension).

    Finally I have attached my Excel spreadsheet of my working so you can have a play. Enjoy!

    Jake

  • SmartView – Timeout Increase NetRetryCount

    Recently I have been tearing my hair out with SmartView as whenever I perform a large query against my Essbase cube I receive a timeout that suggests I should up the NetRetryCount.

    “Fine” I thought, so I updated the Essbase.cfg file with the following (just to be sure):

    NETDELAY 8000
    NETRETRYCOUNT 8000

    and restarted Essbase and re-tried my query.

    Same Error! That config change seems to have done nothing.

    A little hunting around later, I discovered there was another setting I need to change (in APS essbase.properties file). Of course! I am connecting through provider services to connect to Essase. It is APS that is timing out. So I update essbase.properties to:

    olap.server.netConnectRetry=8000
    olap.server.netDelay=8000
    olap.server.netSocketTimeOut=800

    restarted Provider Services (and Essbase for good measure) and re-tried my query.

    Grrr! Same error! What is going on?

    Some more hunting around and reading endless posts of making sure you restart the services after changing the config settings I stumbled across this forum http://www.network54.com/Forum/58296/message/1212599342/Hyperion+figured+out+my+issue

    It seems Internet Explorer has a default timeout, and it seems that SmartView uses these settings:

    http://support.microsoft.com/kb/813827

    http://support.microsoft.com/kb/181050

    IE 6 has a default of timeout of 60 seconds and 7 has a default of 30 seconds. I added the following registry settings

    [HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings]

    "ReceiveTimeout"=dword:00075300

    "KeepAliveTimeout"=dword:000493e0

    "ServerInfoTimeout"=dword:000493e0

    Where 000493e0 = 5 minutes and 00075300 = 8 minutes.

    I restarted Excel and I no longer received the timeout Increase NetRetryCount issue. I will now include this in any install I do of SmartView

    I hope this helps someone else. I feel your pain!

  • Importance of Unrealised FX Gain and Loss

    Many businesses want to see measures across multiple currencies, for example: sales for all countries or customer balances for all account currency types. Typically when dealing with multiple currencies a ‘base’ currency to report on is chosen to make life easier (you can’t simply add Euro sale values to Sterling sale values and get the correct total sales answer). This means data needs to be converted to base currency so we can accurately see how we are doing.

    Currency exchange rates are by no means static, so movement in exchange rates can influence your measures.

    For example, if you wanted to see total balances across months you might see something like this:

    image

    Looking at this we can see that May 2009 was a great month!…. Or was it?

    Well lets look at the raw data that makes this chart.

    image

    You can see in May 2009 we actually lost €100,000. The above report paints a very different picture to what actually happened! The reason why the chart shows such a large jump is that Sterling became a lot stronger against the Euro giving the illusion of a good month, when really it wasn’t.

    This means we have to track Unrealised FX Gain and Loss to understand the numbers better.

    Depending on whether you are using ‘End of Day’ or ‘Start of Day’ exchange rates this is done differently.

    End of Day Rate

    To calculate unrealised FX using end of day rates use this formula:

    (Today’s Measure @ Today’s Exchange Rate) - (Today’s Measure @ Previous Exchange Rate)

     

    Start of Day Rate

    To calculate unrealised FX using start of day rates use this formula (rates should be in direct quotation e.g. local ccy value * rate = base ccy value)

    (Today’s Measure - (measure delta)) * (exchange rate delta)

     

    Once unrealised FX is calculated we can show a slightly different report:

    image

    We will use the Balance £ – Unrealised FX Accumulation to display in our report, which gives us:

     

    image

    This shows something quite different from the report above, but it shows a more accurate visualisation of the numbers. It shows that May 2009 was actually a bad month.

     

    Now that you understand the importance of tracking Unrealised FX you need to decide at what granularity to track it. I would recommend tracking it at the lowest level possible so that you can see it at any aggregation level. This makes your design a little more tricky but gets you the most flexibility.

  • Multi Level Hierarchy Dimension Design Tip

    When building a dimensional modelling solution, somewhere along the line you will almost certainly have at least one multi level hierarchy within one or more dimensions. A good example of this is a Product hierarchy, which might go:

    Department->Sub Department->Type->Product

    Typically for each level member you will have an ID (be it numeric or alpha numeric) and a Description. In the Adventure Works database there is a product hierarchy that goes: Department->Sub Department->Model->Product, so this seems like a nice example to share this tip.

    Lets have a look at how you might store this data in your conformed dimension:

    image

    So here, Product 709 rolls into Mountain Bike Socks, Socks and Clothing.

    Building this in a cube is simple, but sometimes not all reports are suited to cube and are better delivered using SQL.

    Lets consider a requirement whereby the user wants a dump of all sales between dates and wants to filter by any level in the Product hierarchy (e.g. any Clothing, or just Socks).

    There are many ways to provide this solution but there is 1 way I like.

    Assuming that you would do this in a stored procedure, you might pass in each level ID for Category, Subcategory, Model and Product as a parameter and then determine what to filter on based on their selection (e.g. If they selection ‘All’ for Subcategory do not filter on this and so on. Doing this is complicated and gives me a headache, you might use 4 different almost identical SQL statements and have IF (SubcategoryIDSelection = ‘All’) then run this query, otherwise run another query. Confused? I am. You may use dynamic SQL to build up the query (which is a nightmare to maintain / understand what is going on).

    The way I like to do achieve this is to store a unique hierarchy key as a string for each product. So for Product 709 we would have: ‘0003_0023_000013_000709’ as the ProductHierarchyKey (we pad the keys to make them sortable). This makes filtering on any level of the hierarchy very easy. For example if the user requests to see all sales for Clothing_Socks we pass into the procedure ‘0003_0023’ as the value we can then do a select and filter like this:

       1: SELECT
       2:     p.*
       3: FROM dbo.D_Product p
       4: WHERE
       5:     p.ProductHierarchyCode like '0003_0023%'

    Which will bring back products 709 and 710. As you can see, it doesn’t matter what level you pass into the procedure, it will always be the same query (no need to build any funky logic or use dynamic SQL). It also makes it easier to filter on multiple selections, e.g. the user might want to see sales for ‘Socks’ and ‘Road Frames’ together. So we would pass in ‘0003_0023,0002_0014’ then split the parameter by ‘,’ into a table variable (I will let you decide how to do this as there are many techniques) and then execute SQL similar to:

       1: SELECT
       2:     p.*
       3: FROM dbo.D_Product p
       4: INNER JOIN @SelectedProducts sel
       5: ON
       6:     p.ProductHierarchyCode like sel.Value + '%'

    This approach, for me, is far simpler and easier to understand.

    Hope it helps.

  • Creating Slowly Changing Hierarchies in SSAS 2005

    In a Data Warehouse, you’ll typically get at least one slowly changing hierarchy. You’ll probably also have other

    slowly changing attributes on the same dimension that are not in the hierarchy, causing many changes.

    Setting up this scenario in SSAS 2005 is easy, but it is also easy to do it wrong! Therefore I thought it would be

    useful to blog on how I do this and what potential pitfalls one can come across when attempting this for themselves.

    Consider a Product dimension that has a Supplier hierarchy. The business wants this hierarchy to change over

    time. Additionally there is another Type-2 attribute: RRP, which also changes over time.

    image

    So, Product ‘1234’ changed from SupplierGroup1->SupplierA to SupplierGroup1->SupplierB on ‘2002-Feb-01’.

    That supplier didn’t work out so it went back to SupplierGroup1->SupplierA on ‘2002-Apr-01’. Also on’2002-Mar-01’ the price changed from £9.99 to £10.99.

    If browsing the cube by Supplier Hierarchy and Date you would want to see all sales for Jan-2002 and Apr-2002

    onwards to by SupplierGroup1->SupplierA and from Feb-2002 to end of Mar-2002 to SupplierGroup1->SupplierB.

    So, lets see how you might go about setting this up in SSAS 2005.

    (For the purpose of this blog I have setup 3 dimensions all using the same Product dimension but setup

    differently and created a Fact Sales that has 1 sale for every day from 2002-Jan-01 to 2002-Apr-30).

     

    Initially you might set this up like a normal type-1 hierarchy that does not track history, and expect SSAS to sort it out when things change. This is what I call the Bad Hierarchy Product dimension.

    Bad Hierarchy Product Dimension

    image

    In this scenario we have created a Supplier hierarchy that goes Supplier Group->Supplier->Product Code. The product code being the natural key of the product dimension e.g. ‘1234’.

    You can see by the attribute relationship that Product Code rolls up into Supplier. Well, lets see what results that gives us.

     

    image

    Hmm, not quite what we were hoping for. The hierarchy doesn’t seem to have changed at all!

    Well this is easy to explain. In the attribute relationship we have told SSAS that Product Code rolls up into Supplier. When SSAS processes dimensions it will retrieve the distinct value of every attribute and then aggregate them based on attribute relationships we define. SSAS will only have one value for product code ‘1234’ and that is, yep you guessed it, ‘1234’. SSAS will then find the first Supplier it finds for Product Code ‘1234’ and say, “right that’s it I’ve got what I want, no need to do anything else” hence the supplier hierarchy never changes.

    So the solution is simple:

    Type-1 attributes are attributes of the natural key (Product Code)

    Type-2 attributes are attributes of the surrogate key (Product Key)

     

    So having just worked out this rule, you may try building the hierarchy by using the surrogate key as the bottom level of the hierarchy. I call this the duplicate product dimension

     

    Duplicate Product Dimension

     

    image

    This time, we have used the Product Key as the lowest level of the hierarchy, and all Type-2 attributes are attributes of the surrogate key. For the Product Key attribute we have changed the ‘Name’ property to be the Product Code, but the ‘Key’ property to be the surrogate key.

    So lets see what results that give us.

     

    image

    Hmm well the good news is that the hierarchy has changed, but we seem to be getting duplicate product codes beneath each one.

    Well again this is easy to explain. We have 4 surrogate keys for product code ‘1234’. We have set the hierarchy to roll up from Product Key to Supplier. SSAS does not know that product key 1,2,3 and 4 have the same Product code and therefore you want to group by them so it just adds them all.

    So, at least our numbers above product code are correct, but this will still not do!

     

    So let see how we do this properly.

     

    Product Dimension

    We create a new attribute in the dimension called ‘SupplierLeaf’ that has composite keys: Supplier Group, Supplier and Product Code and set the ‘Name’ property to ‘Product Code’.

    image

    We then setup the hierarchy as follows.

    image

    So we use the new attribute ‘SupplierLeaf’ as the lowest level of the supplier hierarchy, but rename it to ‘Product Code’ within the hierarchy. We set supplier to be an attribute of Supplier leaf and set supplier leaf as an attribute of the surrogate key. So lets see how this looks.

     

    image

    Bingo! we done it correctly. So, for every slowly changing hierarchy in a dimension you need to add a new ‘Leaf’ attribute. to get it to work correctly.

     

    One you have set this up, it is quite logical and easy to understand how to to it. But as I say, it is easy to get wrong. I hope this helps

  • Oracle Essbase 11 Excel Add-in Installation Gotcha

    Today I installed the Oracle Essbase 11 Excel add-in. All worked well and I could connect to Essbase fine.

    When another user logged on to my machine and attempted to use it, the user could not connect to Essbase using the Excel add-in (nothing would seem to happen) and when trying to initialise the API would receive error #1030000.

    It seems that whilst installing the add-in, the normal environment variables ARBORPATH and ESSBASEPATH were created, but only as user defined variables.

    After installation you will need to check this and change them to system environment variables so that any user can use the add-in. They should both be set to:

    %HYPERION_HOME%\products\Essbase\EssbaseClient

  • Kimball Microsoft Data Warehouse: What they don’t tell you! Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

    Welcome to my 4th and final post of my 4 part series of the Kimball Microsoft Data Warehouse toolkit. I hope you have found it interesting so far. If you have missed any posts, below is the outline of my series:

     

    Part 1 – SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

     

    The Problem

    Early arriving facts and implicitly failed dimension lookups are a common situation in Data Warehousing. One of the best examples of this would be the Customer Loyalty Card scheme. I am sure when shopping you have been asked to open up a new customer loyalty scheme, whilst being ensured that the current purchase will be added to your points total.

    Well, in this situation the point of sale will be the first time the new customer exists in a Data Warehouse. When we do a lookup against the Customer dimensions we won’t find the key. What do we do?

     

    What Kimball Say

    Kimball give a few ideas in handling a situation like this, but their preferred method is using inferred dimension members. Which, in short, is adding a stub dimension member (for example in the problem described above just adding the Customer number – probably the loyalty card number – to the Customer Dimension and wait for the details of that customer to come into the dimension at a later date). You can read a bit more about this here. Additionally there is a more in depth explanation from Tom Stagg here.

    This approach works very well for the problem described and is probably the best method for this type of problem. However, what happens if you need some dimensional data to work out fact data (e.g. If you have a situation where a new product can lend itself to early arriving facts – the transactional system is not kept up to date that well – you might want to store profit in the fact using the ETL to avoid calculating it. For this you need cost price of the product which will be dimensional data and not likely to feed from sales data, but wait a minute: This is a new product! We don’t know about it so how can we know the cost price).

     

    The Solution

    Well there are a few things you can do.

    - Don’t work out the profit of the sale (We don’t want to do this because this would make our Data Warehouse wrong)

    - Load it in initially with the profit as 0 and infer the new product. When we get the cost price of the new product we then update the fact. (This would work, but to me, it sounds very complex to add workflow into dimensions to then work out what facts, if any it needs to update and additionally updating fact rows is considered as best avoided)

    - Load them into a Fact_Failed_Lookup table. (This is my preferred approach and I will tell you why)

     

    So, lets take a look at what I mean.

    Imagine we have a Product dimension with one product in and an Unknown member.

    Products_ID Product_Key Product_Name RRP Cost_Price RowIsCurrent RowStartDate RowEndDate
    -1 0000 Unknown 0.00 0.00 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00
    1 1234 Lettuce 0.99 0.50 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00

    If we sell product 1234 we can work out the Profit using the cost price and the transaction price (Price - Cost Price).

    Transaction Date Product_Key Qty Price Profit
    1 20081028 1234 1 0.99 0.49

    However if we sell a “Beer” (product code 5555) which isn’t in our Dimension we don’t yet know the cost price and cannot work out the profit.

    Transaction Date Product_Key Qty Price Profit
    2 20081028 5555 1 4.99 xxx

    So, we load the transactions as follows.

    Fact_Sales

    Transaction Date_ID Product_ID Qty Price Profit
    1 20081028 1 1 0.99 0.49

    Fact_Sales_Failed_Lookup

    Transaction Date_ID Date Product_ID Product_Key Qty Price Profit Estimate
    2 20081028 20081028 -1 5555 1 4.99 0 TRUE

    When we load the cube we use a view that unions the Fact_Sales and the Fact_Sales_Failed_Lookup key values (additionally we add a view for RDBMS reporting access that does the same)

    This means the overall takings are correct but the profit may not be. I like to add Estimate to the cube so that business users can filter on estimates or not if they wish.

    Now, each day the failed lookup fact table is monitored by the Standing Data team where any invalid business / natural keys can be modified (e.g. Typo issues) and investigation into failures. Each day as part of the ETL the Failed Lookups are truncated and dimension lookups are re-tried (or this can be scheduled to do so on a weekend perhaps). If they fail again they go back into the failed lookups, if they succeed they go into the real fact.

    This is the workflow:

     

    FailedLookups

     

    This approach has added advantages in that:

    - If typos occur in fact data they can be updated easily and the ETL just fixes it

    - If you want / need fact data more frequent than dimension processing, you can just add all fact data that is newer than your dimensional load date to your failed lookup table and mark them as estimates. The next time the dimensions are processed the estimate rows can simply flow into the real fact.

    However be warned using this approach making sure the Failed Lookup facts are actually looked at by the Standing Data Team and they don’t just keep growing and growing.

     

    This concludes my series, which I hope you have found useful and valuable.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 3 – A Composite Type Change (Type-1 & Type-2)

    This is the third part of my 4 part blog series that is aimed to give the world more insight on some of the Kimball Group concepts and, sometimes difference in opinions. Below is the outline of the series.

     

    Part 1  - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Part 4 – Dealing with Failed Lookups & Early Arriving Facts – Alternative Approach

     

    Part 3 – A Composite Type Change (Type-1 & Type-2)

    Today I am going to talk about a Type of change that is a combination of Type-1 and Type-2 changes. This is not the well understood Type-1.5 change but a composite Type change where multiple columns are related to one and other and each having a different change type. Below is the outline of the problem.

     

    The Problem

    In my previous posts I have used a simple Product dimension, and I will continue to use it. It is quite likely a Product dimension will have some sort of Product Category or Brand attributes e.g. Produce, Beers Wines & Spirits, etc. It is also quite likely that there will be a business key for the product categories e.g. 100 = Produce, 200 = Beers Wines & Spirits, etc. Lets take a look at what that might look like:

     

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

     

    In this simple dimension model it would be quite likely to have Department as a Type-2 and Department_Name as a Type-1. The business should be able to change the alias of the department without affecting their figures but if a product changes department then we don’t want historic sales to move with it.

     

    So what would happen if we get a new row like this:

    Product_Key Product_Name Dep_Code Dep_Name RRP
    1234 Bag of Lettuce 2 Grocery & Frozen 1.3

     

    So, here the Product_Name (Type-1) has changed as has the Department_Name (Type-1). If we flow it through the Kimball sample ETL approach

     

    SCD_Simple

     

    In this situation the new row will flow down the Type-1 change route, updating the Product_Name and Department_Name for all historic rows and current row. The table will now look like this:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery & Frozen 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery & Frozen 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    This is all fine and expected. However have you spotted the issue yet? What happens if Product: 1234 changes Department e.g:

    Product_Key Product_Name Dep_Code Dep_Name RRP
    1234 Bag of Lettuce 3 Produce 1.3

    In this situation the Checksum for Type-1 and Type-2 will mark a change; Department has changed from ‘2’->’3’ causing a Type-2 checksum to change and Department_Name has also changed from ‘Grocery & Frozen’->’Produce’ causing the Type-1 checksum to change.

    So, we update all the historic rows with the new Type-1 data and insert the new Type-2 row (after handling the Type-2 change). Using this logic the Dimension will now look like this.

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Produce 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Produce 1.3 0 24-Oct-2008 00:00:00 27-Oct-2008 00:00:00
    3 1234 Bag of Lettuce 3 Produce 1.3 1 27-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    Uh-oh!! The current row is correct but the previous rows have had their Department_Name’s updated which is wrong!! We want the Dimension to look like this:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery & Frozen 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery & Frozen 1.3 0 24-Oct-2008 00:00:00 27-Oct-2008 00:00:00
    3 1234 Bag of Lettuce 3 Produce 1.3 1 27-Oct-2008 00:00:00 31-Dec-9999 00:00:00

     

    What Kimball Say

    Well I haven’t read anywhere in Kimball media regarding this specific problem (however I would like to caveat that with the fact I haven’t read all their media), but they do give out a sample SSIS package that would fall over if a dimension was modelled in this way. I do not think by any stretch of the imagination that Kimball have provided this sample as an all conquering, all problem solving solution to all your dimension needs and why should they? However it would be quite easy to fall into the trap of assuming it would work for your dimension. This post’s purpose is to share knowledge and try and prevent people tripping over this problem.

     

    The Solution

    Some of you may be reading this and thinking, “I would just use a Snowflake for the Department”, and yes this would solve the issue for the specific problem I have described (where you only have 1 level in the Department hierarchy), but this is probably quite unlikely in a production Data Warehouse, as you would probably have multiple levels; meaning you would just get the same problem in the Snowflaked dimension (I doubt you would want snowflake on top of snowflake for each level in the hierarchy).

    In actual truth I do not have a specific (generic) solution to this problem yet, except to say that you need to build in some logic into your Type-1 updates to ensure that if the related Type-2 attribute changes, you don’t update all the historic data also. Each problem will have a slightly different approach to best solving this problem. If I do work out a way to capture all issues I will be sure to post the solution.

    Let me know how you have solved this in the past.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 2 – Type-1 Changes & Historical Data

    This is the second part of my 4 part blog that is aimed to give the world more insight on some of the Kimball Group concepts and, sometimes difference in opinions. In my first blog I talked about my difference in opinion to Kimball Group on handling Type-2 changes in dimensions and what dates to use to end and start the new row. You can read that here.

     

    Below is an overview of my series:

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 - A Composite Type Change (Type-1 & Type-2)

    Part 4 - Dealing with Failed Lookups & Early Arriving Facts - Alternative Approach

     

    Part 2 – Type-1 Changes and Historical Data

    In general the concepts of different Type changes within dimension processing is pretty well understood. There is a lot of material out there that describes the different types. ETL-Tools.info has a brief explanation if you need one. There is also a lot of information on how to deal with them within your ETL.

     

    The Problem

    In my experience many of your Dimensions within your Data Warehouse will have both Type-1 and Type-2 attributes. This is fine and also accepted, but when we have a member that has gone through many Type-2 changes over time, what do we do when we get a Type-1 change on that member?

     

    Consider the changes below:

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    On 25th of October we get a Type-1 change to the Product name changing it to ‘Bag of Lettuce’

    Do we change Products_ID 1 and 2? or just 2 (the current row)?

    What Kimball Say

    In fairness, Kimball does suggest the correct way of dealing with this, but I would be more forceful and say it is THE way to deal with it.

    So Kimball suggests that we would normally change 1 & 2 (change for all historical rows also) and I believe this to be correct. So lets see why.

    If we had this data in the product dimension

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    and this data in our sales fact

    Date_ID Products_ID Qty
    20081023 1 1
    20081024 2 1

    How many “Bag of Lettuce” have we sold in total?

    Well lets see… This could be the SQL:

       1: SELECT
       2:     Qty as 'Total Qty'
       3: FROM Fact_Sales s
       4: INNER JOIN Dim_Products p
       5: ON
       6:     s.Products_ID = p.Products_ID
       7: WHERE
       8:     p.Product_Name = 'Bag of Lettuce'

    The answer given here would be 1.

    This is wrong because we have actually sold 2. (I appreciate that for this type of query I would probably use the Product_Key but I am just demonstrating the issue).

    This means our aggregations are invalid and not correct. If we change the historical row(s) (as suggested) as well, our aggregations remain in tact.

    So now our product dimension looks like this

    Products_ID Product_Key Product_Name Dep_Code Dep_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Bag of Lettuce 2 Grocery 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Bag of Lettuce 2 Grocery 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    And if we use the same SQL above then we get the correct answer.

    It does mean on large dimensions that change quite frequently, Type-1 changes can become expensive within your ETL, but expensive is OK if you get the correct answer.

    The Solution

    The solution is easy. Always update all your history for Type-1 changes.

    An alternative to this approach is to create a view on top of your dimensions that joins on itself to retrieve the most recent Type-1 attributes but I see this as a bit of a hack and find it much more sensible to keep the dimension as clean and conformed as possible.

    Let me know your thoughts.

  • Kimball Microsoft Data Warehouse: What they don't tell you! Part 1 - SCD Dates

    I was recently lucky enough to attend the Kimball University "Microsoft Data Warehouse in Depth" course, which I thoroughly recommend to anyone looking at the Microsoft toolkit for their Data Warehouse solution. It is well structured, well thought out and delivered in a clear and authoritative manner. Both Warren and Joy know their stuff (details about Warren and Joy can be found here).

    All this being said I did feel they missed out a few 'gotchas', and some of their methods and advice conflicted with a few of my personal opinions regarding Data Warehousing best practices (at least in the Microsoft Toolkit). This blog series is going to talk about these things and why I feel they are important.

     

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    Part 2 – Type-1 Changes and Historical Data

    Part 3 - A Composite Type Change (Type-1 & Type-2)

    Part 4 - Dealing with Failed Lookups & Early Arriving Facts - Alternative Approach

     

    Part 1 - SCD Dates - (What dates should I use in my RowStart and RowEnd dates?)

    The Problem

    When I Type-2 change a dimension member what Start and End dates do I use?

    Lets have a look at an example scenario:

     

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 1 01-Jan-1900 00:00:00 31-Dec-9999 00:00:00

     

    Now we load the dimensional data for 24-Oct-2008 which changes the RRP to £1.30 of Lettuce (we are all feeling the credit crunch squeeze!). This would be a Type-2 change. So how do we add a new row and what dates do we use in the RowStart and RowEnd dates? Is it 24-Oct-2008 or 23-Oct-2008?

    This is a problem that I have had arguments about in the past, and I bet if you ask 10 different Data Warehouse experts 5 will give you one answer and 5 will give you another. In the main I do not think it matters too much as long as you are consistent, except when using Microsoft SQL Server as your RDBMS (at least pre 2008) or when you are loading more frequently than once per day.

     

    What Kimball Say

    Kimball tell you to switch the row like below:

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 0 01-Jan-1900 00:00:00 23-Oct-2008 00:00:00
    2 1234 Lettuce 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    So the RowEnd date is the load date (remember we are loading 24-Oct-2008) –1 day.

    If we had a simple Sales fact when we do lookups loading our facts we can either do it like follows:

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date >= p.RowStartDate
       8: AND f.Date <= p.RowEndDate

     

    or

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date BETWEEN p.RowStartDate AND p.RowEndDate

     

    Why is it an Issue?

    It is often common practice within a transactional system to mark dates something occurred as the current system Datetime (especially when using SQL Server i.e. GetDate())

    Pre SQL Server 2008 there is no Date type so all dates are time specific also. So if we consider a fact table feed like this:

    Transaction Date Product_Key Qty
    1 23-Oct-2008 00:00:00 1234 1
    2 23-Oct-2008 09:00:00 1234 1
    3 24-Oct-2008 00:00:00 1234 1

    Using the the Kimball approach the data we receive in the join will be:

    Transaction Products_ID
    1 1
    2 NULL
    3 2

    Uh oh!! We are missing a product ID! This is because Transaction 2 happened at 9am (not exactly midnight) and due to the RowStart and RowEnd dates being accurate to the second (smalldatetime) or even 1000th of a second (datetime) we miss the join. To get round this you could strip the Fact date to just date (force it to be 00:00:00) but this can be expensive, especially when doing this for millions of fact rows and when there is an alternative way.

     

    The Solution

    Instead of using the load day –1 we use the load day as the RowEndDate so our Lettuce now looks like this:

    Products_ID Product_Key Product_Name RRP RowIsCurrent RowStartDate RowEndDate
    1 1234 Lettuce 0.99 0 01-Jan-1900 00:00:00 24-Oct-2008 00:00:00
    2 1234 Lettuce 1.3 1 24-Oct-2008 00:00:00 31-Dec-9999 00:00:00

    We now change our join very slightly to the below:

       1: SELECT 
       2:      p.Products_ID 
       3: FROM conform.Fact_Sales f
       4: LEFT OUTER JOIN conform.Dim_Products p
       5: ON
       6:     f.Product_Key = p.Product_Key
       7: AND f.Date >= p.RowStartDate
       8: AND f.Date <  p.RowEndDate

    So we only join on rows less than but not equal to the RowEndDate. So now we get these values:

    Transaction Products_ID
    1 1
    2 1
    3 2

    Yay!! we are not missing any Product_IDs now, and all is happy.

    This approach has saved expensive date conversions when they are not necessary. As long as the RowEndDate of the previous row is the same as the RowStartDate of the next row then all is happy. This is my standard approach and it is also an approach that will work if you need to move to real time processing (changes occur multiple times a day).

    I hope you can see how this approach ticks more of the boxes in issue resolution. Let me know about your thoughts and how you have tackled this problem in the past.

  • Essbase API Load Data - Error 1042015 - Network error: Cannot Locate Connect Information For [%s]

    Today I was trying to load data into Essbase v7.1.5 using the VB API command EsbImport.

    The Essbase API documentation is great and gives you very useful code examples. The example they give for EsbImport is defined below:

    Declare Function EsbImport Lib "ESBAPIN" (ByVal hCtx As Long, Rules As ESB_OBJDEF_T, 
                                              Data As ESB_OBJDEF_T, User As ESB_MBRUSER_T, 
                                              ByVal ErrName As String, ByVal AbortOnError As Integer) 
                                              As Long
    Sub ESB_Import ()
       Dim sts          As Long
       Dim Rules        As ESB_OBJDEF_T
       Dim Data         As ESB_OBJDEF_T
       Dim User         As ESB_MBRUSER_T
       Dim ErrorName    As String
       Dim AbortOnError As Integer    
       '*********************************
       ' Rules file resides at the server
       '********************************* 
       Rules.hCtx     = hCtx
       Rules.Type     = ESB_OBJTYPE_RULES    
       Rules.AppName  = "Demo"
       Rules.DbName   = "Basic"   
       Rules.FileName = "Test"           
       
       '********************************
       ' Data file resides at the server
       '******************************** 
       Data.hCtx      = hCtx
       Data.Type      = ESB_OBJTYPE_TEXT      
       Data.AppName   = "Demo"
       Data.DbName    = "Basic"
       Data.FileName  = "Data"               
       '********************************
       ' Specify file to redirect errors 
       ' to if any 
       '******************************** 
       ErrorName      = "IMPORT.ERR"   
       '*************************
       ' Abort on the first error 
       '*************************
       AbortOnError   = ESB_YES   
       '*******
       ' Import
       '*******
       sts            = EsbImport (hCtx, Rules, Data, User, ErrorName, AbortOnError)    
     End Sub

     

    This code however is incomplete and will generate an Essbase error 1042015 - Network error: Cannot Locate Connect Information For [%s].

    The problem is that the database you have selected to import data into is not active. Therefore before the EsbImport command is called, you need to call EsbSetActive, like below

    Dim pAccess AS Integer
     
    sts = EsbSetActive (hCtx, "Demo", "Basic", pAccess)

    This will fix your error 1042015 woes.

  • Data Warehousing - ETL Dealing With Large Dimensions in SSIS

    Introduction

    This is the first part of a 2 part blog that outlines techniques in dealing with very large dimensions (or small ones for that matter) when going through your ETL (Extract, Transform & Load) process in your data warehouse using Microsoft SSIS (SQL Server Integration Services) as your ETL tool.

     

    The Problem

    - You have designed your data warehouse and you are left with a very large and complex dimension. (When I say large and complex I mean a million plus rows and 10 plus data sources, but the process works on any size of dimension)

    - You still need to meet your ETL time requirements and it would not be plausible to process the full dimension for each ETL run.

     

    For the purpose of this blog I have simplified the problem to a small simple dimension.

    Here is a Entity Relationship diagram of the source for our dimension.

     

    DataWarehouse_Diagram_1

     

    This is a very simplified Product dimension and the end result we want is:

    DataWarehouse_Diagram_2

     

    Solution

    To start I would like to caveat this with the fact that this is by no means the only and definitive way but a way that has worked for me and works well.

    All ETL processed go through 4 main steps:

    Extract   - Extract the data from source systems

    Clean      - Clean the data for your data warehouse.

    Conform - Change the data into your data warehouse format

    Load       - Load it into your data warehouse

     

    Extract

    Firstly we need to Extract the 3 source tables (Product, Product_Pricing, Product_Hierarchy). We only want to extract changes to these tables since the last load as we do not want to process the whole dimension each time. There are many ways that this can be done and is not part of the scope of this blog.

     

    Clean

    In clean we keep a current snapshot of the source table and mark any rows that change to 'changed'. For example if a department (1, Produce) in Product_Hierarchy changes to (1, Food & Produce) we change that row and mark it as changed.

     

    Here is the sample data I am using in the clean area.

    Product_Hierarchy

    Department Department_Name
    1 Produce
    2 Grocery
    3 BWS
    4 Technology

    Products

    Product_Key Product_Name Department
    1234 Lettuce 2
    1235 Potato 2
    1236 Tomato 2
    2341 Crisps 1
    2342 Biscuits 1
    3241 DVD 4
    3242 CD 4
    4321 Fosters 3
    4322 Stella 3
    4323 Pinot Grigio 3

    Product_Pricing

    Product_Key RRP
    1234 0.99
    1235 0.89
    1236 2.99
    2341 1.89
    2342 1.47
    3241 3.57
    3242 2.54
    4321 1.50
    4322 4.20
    4323 2.30

     

    So this is the data we have in our clean tables. We now need to merge these tables into one that would be used as out source dimension to conform.

    This is the SQL used:

     

    SELECT 
           p.[Product_Key]        as [Source_Product_Key]
          ,p.[Product_Name]        as [Source_Product_Name]
          ,p.[Department]        as [Source_Department]
          ,h.[Department_Name]    as [Source_Department_Name]
          ,price.[RRP]            as [Source_RRP]
    FROM [clean].[Products] p -- this is the driving table 
    LEFT OUTER JOIN [clean].[Product_Pricing] price
    ON
        p.[Product_Key] = price.[Product_Key]
    LEFT OUTER JOIN [Clean].[Product_Hierarchy] h
    ON
        p.[Department] = h.[Department]
    WHERE
        p.[Changed] = 1 -- driving table has changed
    OR    price.[Changed] = 1 -- the price  has changed 
    OR    h.[Changed] = 1 -- the department has changed 

     

    This merges the the tables and makes sure we get all changes (by using the OR Changed flags). One of the problems you get with multiple sources is that if the Department_Name in Product_Hierarchy changes it could change more than one row in the Dimension (so you can't just use your driving product table to track changes).

    For example, if (1, Produce) changes to (1, Food & Produce) as previously mentioned we need to make sure Products 2341 and 2342 are included to conform. Holding the full snapshot of your relational data in your staging area allows you to get ALL changes and only load changes (not the whole dimension). This approach can save many minutes on large dimensions.

    The above solution does require the storage space in your staging database but this is a small price to pay for time saved.

     

     

    Part 2 of this blog will explain how to flow this through a conform stage in SSIS. You can download the sample database I have used to have a play for yourself below

  • Quick Tip - Using Excel or TextPad to manipulate strings

    I am sure many times you have been asked to compare values in a database table against an Excel Spreadsheet or text file. You may be given a list of keys and a list of values. A quick way to write a SQL statement to retrieve the values is to use the SQL 'in' clause, but this poses the problem of adding commas or even commas and single quotes around the key value e.g.

    SELECT
        myKey,
        myValue
    FROM myTable
    WHERE
        myKey in 
    (
    'Key_1',
    'Key_2',
    'Key_3'
    )

    I have seen many times people do this by manually adding the single quotes and commas, this is fine if you have handful of values to compare but if you have 20+ it can become pretty tedious!

    There is a simpler way, in fact 2 that I regularly use.

     

    Excel

    You can use Excel formula to add the quotes and commas for you. Say you have you key values in column A you could write this formula:

    ="'" & A1 & "',"

    This will convert:

    Key_1 into 'Key_1',

    Once you have typed one formula you can use the Excel cell drag functionality to apply that formula to all your rows and then just copy the new column to your SQL query.

     

    TextPad

    TextPad is a pretty rich functionality text editor for Windows, and is definitely worth a look if you haven't used it before. You can download a trial copy from http://www.textpad.com/

    One useful functionality in TextPad is the ability to record macros. When you do this TextPad will record your key strokes and apply them to the whole text file when run.

    So using the same example above where you have a list:

    Key_1

    Key_2

    Key_3

     

    You can start recording a new macro and hit these keys: [Home], ['], [End], ['], [,], [Enter] then stop recording. Now just run your new macro applying it to the whole document and before you know it all your rows are in your desired format.

     

    I hope this tip might save you some effort in the future!

  • Regional Reports With a Difference

    Many Business Intelligence solutions involve some sort of regional report which in the past has been shown in text in perhaps a spreadsheet type view. In this day and age businesses are increasingly wanting more and more aesthetically pleasing, rich reports. Step in the Microsoft Virtual Earth Platform.

    Virtual Earth Overview

    Virtual Earth : "The Microsoft Virtual Earth platform is an integrated set of services that provides quality geospatial data, rich imagery, cutting-edge technology, and dependable performance that helps organizations visualize data and provide immersive end-user experiences. Backed by a commitment to ongoing investments in innovation, driven by customer feedback, the Virtual Earth platform continues to offer new map detail and precision imagery, feature enhancements, and robust platform capabilities." Microsoft Definition.

    Microsoft Live Maps is powered by Virtual Earth and you can see this in action here.

    Virtual Earth platform includes an SDK which can be invoked using SOAP based HTML or JavaScript APIs.

    Virtual Earth SDK

    The Virtual Earth SDK is very easy to use and invoke. Details of the SDK and how to invoke methods can be found in the interactive SDK.

    Example Reports

    There are many types of reports that Virtual Earth can help visualise but one that comes to mind is: KPI reports e.g. Sales Performance.

    Virtual Earth can give you instant visualisation on how regions are performing using a few neat visualisation tricks.

    A nice way I find to visualise performance in regions is "Heat Mapping". You can add colour to your map in areas where performance is good and bad, giving users instant visualisation of performance.

    A screen shot is of example locations and heat mapping is below.

    Simple Locations

    image

    You can add your own icons for locations by using the following code:

    function AddPushpinToLayer(layer, name, longitude, latitude, icon, html)
    {
        shape = new VEShape(VEShapeType.Pushpin, new VELatLong(latitude, longitude));
        shape.SetTitle(name);
        shape.SetDescription(html);
        shape.SetCustomIcon(icon);
     
        
        layer.AddShape(shape);
            
    }     
    I prefer to use layers where possible as once added you can simply show and hide them.
    Performance Heat

    image

    As you can see from the heat map above the South area is not performing too well however London is performing very well. The circles used can be defined as:

    Large Green = Best Performance

    Large Red    = Worst Performance

    You could show this in colour intensity using the transparent colours if you wish.

    The ability to draw circles code used above was adapted from this blog entry http://virtualearth.spaces.live.com/blog/cns!2BBC66E99FDCDB98!7927.entry 

    Once the data is added the full Virtual Earth functionality is available to the end user so they can switch to 3D View, Road View, Birds Eye view, etc. You can also add your own HTML to the hover over tags so you could add graphs of performance if you wish for example. You can also create events on mouse clicks for example that might export the data to a spreadsheet.

    image

    The above shows how I've used Virtual Earth to simply zoom in on a location and show the Custom HTML tag.

    Conclusion

    You can see how Virtual Earth can be used to show data for a number of scenarios to help visualise data and having done a simple report like the above I can tell you the SDK is very easy to use and there are also a lot of examples out there to work from.

    Some people might see this as a bit of a gimmick but I think it is important to make your reports look good and by spending a bit more time you can produce great looking reports.

    References

    http://www.microsoft.com/virtualearth/platform/default.aspx - Microsoft Definition of Virtual Earth Platform.

    http://virtualearth.spaces.live.com/blog/cns!2BBC66E99FDCDB98!7927.entry  - For code to draw Circles.

More Posts Next page »