Sunday, February 26, 2017

DW Patterns: The good - Accumulating Snapshot facts

In the last post I talked about Periodic Snapshot facts: where they're good, why they're not actually that great for the things people usually say they're good for, and what they are good for instead.

This post will focus on the other kind of standard snapshot fact. This is, of course, the...


Accumulating Snapshot Fact

While each row in a periodic snapshot represents an observation of the current state of an ongoing process (like a reading of a thermometer, or a count of the number of items in inventory), a row in the Accumulating Snapshot table represents an entire process, from beginning to end. The idea is that you have one date/time column for each step in the process, that records when the record hit that stage. It would be possible to have separate rows and treat it as a transaction fact, but frequently there will be measures that only relate to one step in the process - and then you'd have to have columns that only contain values for a small proportion of rows, and figure out what to do in other cases, and so on.
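To make that concrete, here's a minimal sketch of what such a fact table might look like, assuming a hypothetical four-stage order fulfilment process (ordered, packed, shipped, delivered) - the table and column names are purely illustrative, not taken from any real system:

create table dbo.Fact_OrderFulfilment
(
    SK_Order          int not null,    -- one row per order, i.e. per process instance
    SK_Customer       int not null,
    SK_Date_Ordered   int null,        -- each date key is filled in when that stage is reached
    SK_Date_Packed    int null,
    SK_Date_Shipped   int null,
    SK_Date_Delivered int null,
    Order_Amount      money null,      -- a measure that belongs to the first stage
    Shipping_Cost     money null       -- a measure that only exists once shipping happens
);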

One important thing to note here is that the process must have absolutely well-defined stages in order for this pattern to work - if it's possible for the process to move backwards and forwards, skip steps or go through the stages in a non-linear order, this pattern is useless. If the process has an undefined number of stages, this pattern is useless.

In cases where you do have a strictly defined process, however, this isn't just a good solution, it's ideal - you minimize the amount of data stored, you don't waste space on pointless columns for most records (some records may have empty columns, but only when the process hasn't yet reached that stage), and you can easily see where each item going through the process has got to. It's great - it just seems to me to be oddly specific to have earned a place among the Three Patterns of Facts.
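For example (still against the hypothetical Fact_OrderFulfilment sketched above), seeing where everything currently sits is just a matter of checking which milestone dates have been filled in:

select
       SK_Order,
       case
            when SK_Date_Delivered is not null then 'Delivered'
            when SK_Date_Shipped   is not null then 'Shipped'
            when SK_Date_Packed    is not null then 'Packed'
            else 'Ordered'
       end as Current_Stage
from dbo.Fact_OrderFulfilment;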

Does it fix the problems that were left by the previous two types though? It seems to have invented its own niche problem and solved that rather than helping with these basic questions of history tracking, but let's check. The problems I was evaluating against previously when fact data changed were:
  1. Data entry error (the value was always supposed to be X, but it was originally measured as Y)
  2. The grain of the measure doesn't match our fact table because the quantity we're measuring is not temporally related to when the event occurred (e.g. attention score)
  3. The measure is a continuous variable (e.g. influence of individuals, or temperature)
As with the previous two types, 1 is not covered - our only option is to overwrite one of the existing measurements, with no way to track what the incorrect value was. Not too promising - what about number 2? To achieve that kind of tracking we'd need infinite columns, which doesn't seem feasible... Number 3 already falls under the Periodic Snapshot's area of expertise, so we can skip that (this pattern can't really do much there anyway).

So now we have three problems remaining - problems 1 and 2 from above, plus the new problem of how to model processes for which we'd like to use the Accumulating Snapshot but can't, because the process is flexible. One example I'm particularly familiar with is student enrolment - it generally involves a lot of back-and-forth between the institution, the student and various other parties, and it doesn't follow a set path - or if it does, it's such a convoluted one that the simplistic modelling of the Accumulating Snapshot can't handle it.

So what can be done? Well, Kimball does have a couple of pointers that may help - we even alluded to some in the previous post - but that will be the topic for next time.

Tuesday, January 31, 2017

DW Patterns: The good - Periodic Snapshot facts

Last time I left off talking about the Transactional fact type, and what it does for us (matches how we want data warehouses to behave for many scenarios) and what it doesn't do (provide any history tracking). This time I'll get into...

Periodic Snapshot Fact

This one is the "other" fact type - the main fallback for any DW developer for whom Transactional facts don't work. The principle behind this pattern is that rather than an "event-based" measurement (which is what the Transactional fact would be) we have a "polling" measurement. This is where we take stock at the end of the day and say "ah, we had X amount of Y stock in the warehouse today." If you take that each day, you get to see trends.

So, what you wind up with is a log, that looks something like this:

Periodic_Snapshot_Fact
SK_Product | SK_Date  | SK_Warehouse | Amount_Of_Stock
63         | 20160528 | 7            | 58
64         | 20160528 | 12           | 103
63         | 20160529 | 7            | 54
64         | 20160529 | 12           | 91

Now, there are a couple of important things to note here. First, the measure (in fact almost any measure associated with this table) is non-additive across time. This is a pain, but not the end of the world. Worse than that, though, is that we're no longer tracking each individual item - our data is not at the finest possible grain that it could be. This may not be a problem, and don't get me wrong - when this pattern is used it's frequently as good as it gets (an important phrase to get used to...), and it may not be fixable (if we don't track data about each item, this is as good as it gets) - but consider the case where one of the products is cabbages.
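Before we get to the cabbages, a quick illustration of the aggregation point - a sketch of the kinds of queries that do make sense against the table above (summing across warehouses on a single date is fine; summing across dates is not, so over time we use things like averages instead):

-- total stock per product on one snapshot date
select SK_Product, sum(Amount_Of_Stock) as Total_Stock
from Periodic_Snapshot_Fact
where SK_Date = 20160529
group by SK_Product;

-- average daily stock per product over the whole period (never sum across dates)
select SK_Product, avg(cast(Amount_Of_Stock as decimal(18, 2))) as Avg_Daily_Stock
from Periodic_Snapshot_Fact
group by SK_Product;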

Now, to start with, the manager just said "please show me my stock of cabbages each day". No problem! Daily snapshot, that's what you use for tracking stock, so we'll go ahead and slap it in.

Next day, the manager says "Oh, that's great. Now how do I get this to show me the longest amount of time that a cabbage is sitting in the warehouse? We've had some complaints of spoiled cabbages and I want to make sure the problem is downstream of us." Huh - well, OK, no, you can't do that; I'll have to start from scratch. If you read my previous post, you'll have already noted that I would suggest treating each incoming and outgoing item as a transaction - then you can track the duration it has spent in the warehouse. Again, that may not be possible, which is one reason why this pattern exists - after all, design patterns are here to help in tricky situations.

The other situation where the periodic snapshot shines is when the amount of data is so phenomenal that you can't use a transaction fact. If possible, of course, you should use the Transactional pattern and do things like partitioning, but frequently life isn't all friendly like that. As far as pre-aggregated data goes, this pattern is a good contestant, but as always when violating principles, make sure you need to, and make sure that this is the way you need to do it. There may be other patterns more appropriate.

Finally, and more importantly than the above scenarios (depressing, since those are the ones most frequently espoused as when this pattern is appropriate), there actually are facts that by their very nature don't support transactional data. Of course, they don't perfectly fit this pattern as it is typically outlined either, so that's a shame. Let me give you an example.

Consider whatever national park service you have in your country. In New Zealand, that's the Department of Conservation. They might be interested in temperature readings, in order to care for an endangered species. This species is very sensitive to high temperatures, but can survive cold ones OK. So, they set up a temperature monitor in the creature's nest that measures every hour, except during the summer months, when it measures every minute (they want to conserve power during the winter).

Now, let's take a look at how we could model that. Periodic snapshot looks bad as written, since it's a specific period, but we have variable periods. We could decide our period is the fastest period (per minute) and simply copy out the previous reading during the winter months, but that's going to store 60 times as much data as necessary during winter, and the data will be misleading anyway (those "fake" readings are actually fake readings - the temperature will certainly not stay consistent throughout each hour). In addition, that's asking for trouble when the rangers get new devices that measure every 30 seconds. Instead, we can alter it to be an arbitrary period - a pretty minor change to the pattern really. All of our aggregate functions still work (we won't be using sums, but averages, max, min and many of the more arcane methods you might think of will still work as normal).
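Sketched out, the arbitrary-period version might look something like this (the table and column names are just assumptions for the example) - the reading timestamp simply replaces the fixed snapshot period, and the usual aggregates carry on working:

create table dbo.Fact_TemperatureReading
(
    SK_Sensor     int           not null,
    Reading_Time  datetime2     not null,   -- whatever cadence the device happens to use
    Temperature_C decimal(5, 2) not null
);

select
       SK_Sensor,
       cast(Reading_Time as date) as Reading_Date,
       min(Temperature_C) as Min_Temp,
       max(Temperature_C) as Max_Temp,
       avg(Temperature_C) as Avg_Temp
from dbo.Fact_TemperatureReading
group by SK_Sensor, cast(Reading_Time as date);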

Alternatively, we could treat each reading as a transaction - the difference between the reading and the previous reading could be treated as a delta - if the temperature went up by 1.3 degrees, we'd store 1.3; if the temperature went down by 0.8, we'd store -0.8. This way the temperature would be summable - but have we gained anything? Our rationale for wanting to swap snapshots for transactions was to avoid reducing our grain - but in this case the snapshot is already at the finest grain we can get. The only way to get a finer grain is to increase the precision of the measuring instruments, in which case our snapshot is at the finer grain anyway. Furthermore, with the snapshot, if someone asks "what was the temperature at 5:30 on a midwinter morning?" you can have an aggregate function that interpolates between the 5:00 reading and the 6:00 reading - something that wouldn't be (easily) possible with the transactional version.
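For completeness, the delta version is easy to sketch with a windowing function against the same hypothetical reading table - each row becomes the change since the previous reading (the first reading per sensor has no previous value, so its delta is null):

select
       SK_Sensor,
       Reading_Time,
       Temperature_C
           - lag(Temperature_C) over (partition by SK_Sensor order by Reading_Time)
           as Temperature_Delta
from dbo.Fact_TemperatureReading;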

It's probably worth pointing out here that we now have two main uses (with different implementations) for the Periodic Snapshot. One is a performance optimization, or a necessary evil, when we would otherwise rather use a Transaction fact. The other is when we have a continuous variable over time, and the snapshot is (probably) the most appropriate option. This might actually be worth splitting out and giving two names, but I'll get to that later...

So, our snapshot fact has some legitimate uses, some worrying uses, and some outright harmful uses. That's cool. There's one more thing to check though - history tracking. Now in my previous post I mentioned three cases where we might see changes to a fact:

  1. Data entry error (the value was always supposed to be X, but it was originally measured as Y)
  2. The grain of the measure doesn't match our fact table because the quantity we're measuring is not temporally related to when the event occurred (e.g. attention score)
  3. The measure is a continuous variable (e.g. influence of individuals, or temperature)
You'll see that point 3 above naturally falls under the Daily Snapshot's area of effectiveness. What about 1 and 2? Well, let's assume that at 4:00 on the 5th of May, our temperature reader recorded a value of 80°C. That seems unlikely, so we go back and update the source database to an average of the surrounding readings. The snapshot table has no capability for handling this kind of change other than to overwrite the relevant record. That's fine in some circumstances but fails in others (I'll get more into this in the next post); regardless, it doesn't track the change, so it fails our test.

For 2, it seems on the surface like the snapshot fact is great. We're good at tracking continuous variables, right? Well, it's not really continuous. You see, the attention score is altered every time a person tweets about a publication, or it's mentioned in a policy document, or whatever. These are discrete events. However, each paper is also published on a certain date, and many measures are interested in specifically this date. This is getting a little confusing, so let's jump to some example questions:

"How many papers were published by my department in January?" - OK, so we need a count of publications, department and the publication date. Not a problem, we can just use a transaction fact, with each record representing a publication.

"What is the average attention score of papers published from my department?" - Also easy; again, this can just be the transaction fact. We'll have the up-to-date attention score as a fact, and just overwrite it (incidentally, overwriting fact as part of day-to-day business is a BIG RED FLAG that you've got your fact very wrong, because...)

"Ok, some of the papers are 5 years old, and some are only a week old, so we need a better average. Let's try... 'What is the average attention score for each paper one year after it was published?'" - Uh... OK, we can just stop storing the attention score after one year...

"Actually, I'm not sure if one year is the right threshold - can we just have that be a parameter?" ... Ok, we'll change this to a snapshot fact table. Oh, except now we can't answer the "How many papers were published by my department in January" question, since we don't have the transaction date, since our date dimension is our snapshot date...

You might notice that the obvious solution is to include a second date dimension for the snapshot date, which is perfectly valid, but this situation is really just a specific example of a more generic problem. We have one fact pattern left, though - let's hold off on that discussion until we've taken a look at the Accumulating Snapshot (sounds kinda promising, right?).
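For what it's worth, that "second date dimension" version is simple enough to sketch - a snapshot of attention scores that also carries the publication date key, so both kinds of question can be answered (the names here are purely illustrative):

create table dbo.Fact_AttentionScoreSnapshot
(
    SK_Publication     int not null,
    SK_PublicationDate int not null,   -- when the paper was published (the transaction date)
    SK_SnapshotDate    int not null,   -- when this score was observed
    Attention_Score    int not null
);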

Saturday, January 7, 2017

DW Patterns: The good - Transactional facts

So we left off last time recapping the different kinds of SCD patterns, and I mentioned some of the ways that SCDs don't provide adequate history capture. Some of these are, for example, when data related to an event is changed but lives in a junk dimension rather than an "entity" dimension, or when the entity related to an event is changed - something that Kimball would probably call a bad design decision; I'm not so sure.

There's another type of change which can occur - when the amount measured changes. These can be measurement errors that we correct, or they can represent a grain that isn't factored into the fact table in the traditional way - for example, when a research paper is released, the public sentiment around it can change over time (Altmetric, for example, measure something similar), or tracking the influence of people over time a la Forbes' list of influential people (assuming they base that list on data and not thumb-sucking - I don't know either way). There are also a couple of other situations I'll get into through the remainder of this series.

Anyway, due to these issues with dimensions, next we need to take a look at the ways that the fact table can contribute - can we make up for these shortcomings using standard design patterns? Wikipedia lists four design patterns, while Kimball lists three. Fortunately this time they mostly agree on what they are - Wikipedia's fourth pattern is a variant of one of the other three. The place to start however is always with...

Transactional Fact

These are the most basic, intuitive facts, and with a bit of thought you can normally model (and should probably at least go through the exercise of modelling) the other fact types as a transactional fact - though sometimes you don't have access to that kind of information. Anyway, the basic premise of this fact is that we measure each event that happens - the classic example being sales transactions, where we basically know everything about the transaction when it happens, so we can store it, measure it, aggregate it and display it in really nice over-time views.

In fact, as I mentioned earlier, just about everything can be measured as a transactional fact in some way or another. One example that often arises of when you might not want to use the transaction fact is measuring stock levels. The amount of stock in a warehouse each day is usually the recommended scenario for Daily Snapshot facts, but if you were to have a positive-valued event for every item that entered the warehouse, and a negative-valued event for every item that left, you could happily sum them up over time and get correct values.
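As a rough sketch of that idea (the table and column names are assumptions for the example), a running total over the signed movements recovers the stock level at any point in time:

select
       SK_Product,
       SK_Warehouse,
       SK_Date,
       -- running total of the signed movements = stock on hand at the end of each day
       sum(sum(Quantity)) over (partition by SK_Product, SK_Warehouse order by SK_Date) as Stock_On_Hand
from dbo.Fact_StockMovement
group by SK_Product, SK_Warehouse, SK_Date;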

Looking back at our list of history tracking requirements, though, we can see pretty clearly that we don't get anything from this fact type. That's not to say you can't combine it with other kinds of patterns to provide history tracking, but this pattern itself doesn't provide anything. Updates to a fact either won't get represented, or they'll overwrite the existing records - either way we don't get history.

Next time we'll take a look at the periodic snapshot fact, and see if that can get us what we want...

Saturday, December 24, 2016

DW Patterns: The good - a recap

Ok, the next obvious step for looking at DW design patterns is fact design patterns - which I will start on next time. First, however, I want to take a quick look back and summarize the dimension patterns.


So the main takeaway from this is that Type 2 is the master pattern, at least for the under-the-hood part. Thought should be put into whether your users actually want Type 2, but even if they don't, it's still the way to go. Unless you're confident that your users will never want it, or that your disk space will suffer too much from it, it should be your default implementation.

Let me just say that again: you should default to Type 2 and only consider not retaining the history in exceptional circumstances. This is because if you have the history in Type 2, it's trivially easy to implement types 0, 1, 2, 3 and 6 as a view - plus it renders type 4 pointless. That leaves us with types 4k and 5k, which are both optimizations, and thus should only be considered in exceptional circumstances anyway.

So: if you're making an ETL automation tool, or anything like that, you should start with Type 2 and add others only as they are needed.


Stepping back from the individual patterns for a moment though, let's talk about what kinds of problems they solve and what assumptions they make about our data. First, the most important aspect to understand about SCDs - they do not store all history. The only thing any of the above SCDs track is changes to the entities represented by dimensions. This is great when your dimensions nicely fit into business concepts, but breaks down the moment you have something like, say, a junk dimension. Or a dimension with an awkwardly split scope (e.g. a student dimension that's split into two records - one for the student's international study, one for their domestic study). Or any other such information.

The weaknesses in the scope of what SCDs track don't stop there, though - what if a fact record (an event) is updated so that it refers to a completely different dimension member (e.g. a student's enrolment is updated to change the paper they're enrolled in)? Since no entity has changed, SCDs are hopeless here - we have to hope that our facts deal with it instead.

Sadly, the standard patterns don't do this, as we'll see...

Saturday, December 3, 2016

DW Patterns: The good - type 6

Alright, carrying on from last week where we looked at Type 5, this time we'll be spending a fair amount of word-count on...

Type 6: Hybrid

Yay, mommy and daddy are speaking again! For Type 6, Kimball and Wikipedia are once again in agreement in terms of definitions. Type 6 is apparently so named because it's like combining types 1, 2 and 3, which - 6 being a perfect number - both multiply and add to 6. The similarities are these:
  • It's like Type 1 because we have a type 1 column for the attribute
  • It's like Type 2 because we also have a type 2 column for the attribute
  • It's like Type 3 because there's more than one column. Yup.
I guess that sarcasm about type 3 isn't really warranted - it is kinda like taking Type 3 to the extreme - instead of "current and previous" or "current, previous and time-before-that", it's "current and at-the-time". I won't let that stop me making snarky remarks though, because that's just who I am.

Anyway, I've said before that if you're going to pick a default, you should pick type 6 rather than type 1 or 2. It also makes Type 5's "outrigger" redundant, since we can just lump the Type 1 attributes in the same spot as the Type 2s and be done with it. Not only that, but the same concept allows us to produce Type 3 attributes, if that's what you're going for.

Actually we get all of this from keeping Type 2 history, so that's the important bit. The reason I like Type 6 as a default is that if you're not going to make a decision about whether an attribute should be Type 1 or 2, you should allow the user to do so. Otherwise, someone (or a lot of people) is going to think they're looking at one thing when they're actually looking at another. If they get confused and ask you what the current versus historical attributes mean, well, great - you can explain and make sure they get it right.

Let's back up a minute though, because this blog hasn't had any SQL in it since my discussion of gluten-free dimension baking and I think it's time to rectify that, because I'm kind of horrified at the thought that someone might actually take Kimball or the wiki article's design seriously. To be clear, your table should look like this:

Dim_Student
SK_Table | TableID | Type2Attribute
1        | 1234    | Original Value
2        | 1234    | New Value
3        | 1234    | Newer Value
4        | 1234    | Newest Value

That's right - no Type 1 columns, no "last-value" columns, nothing, just Type 2. What you then do is calculate whatever else you need in the view. In terms of performance, if it starts getting slow, you can always throw an index or two on - it should be very easy to get the necessary indexes to make this trivial for almost any number of records. Let's have some examples though, using modern T-SQL windowing functions:

select
       SK_Table,
       TableID,
       Type2Attribute,
       -- FIRST_VALUE over a descending sort returns the newest row's value for each TableID,
       -- i.e. the Type 1 (current) view of the attribute
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table desc) as Type2AttributeAsType1,
       -- LAG returns the previous version's value (null for the first version) - a Type 3-style "previous value"
       Lag(Type2Attribute) over (partition by TableID order by SK_Table asc) as Type2AttributeAsType3
from type6.Dim_Something

Chuck this in a view, and bam, all the most useful actually functionally different dimension types (you can even achieve Type 0 by taking the Type 1 and ordering asc instead of desc). You'd probably want to order by something other than the surrogate key in practice (RowEffectiveDate or something like that) but the principle stands.
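In case it helps, the "chuck this in a view" step might look something like the following - the schema and view names are whatever suits your conventions, and the Type 0 column is just the ascending-order variant mentioned above:

create view type6.Dim_Something_AllTypes as
select
       SK_Table,
       TableID,
       Type2Attribute,
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table desc) as Type2AttributeAsType1,
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table asc)  as Type2AttributeAsType0,
       Lag(Type2Attribute)         over (partition by TableID order by SK_Table asc)  as Type2AttributeAsType3
from type6.Dim_Something;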

So that means that we can achieve Types 0, 1, 2 and 3. It's also like a better version of Type 4, and Type 4k is just a performance optimization for Type 2 dimensions, so who cares about that. We even make Type 5 redundant, since we can just use the mini-dimension and skip the outrigger. So really, what it comes down to is that Type 2 is god - as long as you know how to use it. And now you do!

So in terms of what it is and how to do it, that's all very well and good - but how about what it's useful for? Well... lots of things. The most common use case has to be "Type 2 is what we need, but Type 1 is so darn useful for filtering". What do I mean by that? Well, let's take one of the textbook cases of a Type 2 dimension attribute: Employee Team (or department, or whatever). You want to be able to have the team as it was when things happened, because that's important context. However, isn't it really handy to just be able to say "How do the people in Bob's team compare to the people in Terry's team?"

Moral questions of comparing people like this aside (it's just the first example that popped into my head - there are plenty of non-worrying questions like it), the Type 2 is necessary for being able to do analysis of the relevant attribute, but the Type 1 is a really nice entry point into the dimensional model. It's great to have as a filter in situations like these, where you want to evaluate the whole history of a group of individuals, even from before they were in the group you're using to identify them.

To take the flipped case, there can actually be a use case for a Type 0 attribute alongside a Type 2. Ever had someone ask something like "we have a lot of students dropping out of our BAs - where are they going?" You then proceed to beat your head against the wall and cry, because arbitrary digraphs aren't exactly one of data warehousing's strong points. With a Type 0/Type 2 combination, however, you can get at least an approximation of this, by filtering the Type 0 Programme to "BA" and then grouping by the Type 2 attribute - bam! You've got an idea of how many students are transferring to which programme, and it didn't even take five minutes. Incidentally, I will get to discussing how you can deal with arbitrary digraphs in DWs and, specifically, SSAS/Power BI, since those are my technologies of choice.
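A rough sketch of that BA question, assuming a hypothetical enrolment fact and a student dimension that exposes Programme both as-at (Type 2) and as the original value (Type 0) - every name here is made up for the example:

select
       s.Programme as Transferred_To,
       count(distinct s.StudentID) as Student_Count
from dbo.Fact_Enrolment as f
join dbo.Dim_Student as s on s.SK_Student = f.SK_Student
where s.ProgrammeAsType0 = 'BA'     -- the student started out in a BA...
  and s.Programme <> 'BA'           -- ...but this (Type 2) version of the student is in something else
group by s.Programme;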

DW Patterns: The good - type 5

Ok, so my dramatic return from hiatus didn't work out so well - turns out life is very life-like in that respect. This time I'm writing up a backlog so that when life happens I can continue updating, so there shouldn't be another hiatus until I actually run out of things to say.

Anyway, the last episode was about SCD 4 attributes, so today it's time for...


Type 5k: Add Mini-Dimension and a Type 1 Outrigger

Last time you may remember that Kimball and Wikipedia disagreed as to what SCD 4 meant (not just in details, but completely different concepts). Well, Type 5 doesn't even exist in the Wikipedia definition, so this time we only have to deal with the Kimball definition - so that's good.

What's not so good is that Kimball's Type 5 is even worse than Type 4 - it's an optimization on an optimization, and the way they do it offends one of the most basic DW design sensibilities beaten into me by Kimball themselves - don't snowflake. We're getting ahead of ourselves though - what is Type 5? Well, in Type 4 we pull out rapidly-changing Type 2 attributes into a separate table called a mini-dimension in order to spare us some space in the main dimension. Type 5 adds an "outrigger" to this, which is basically a snowflake from the original dimension containing Type 1 versions of the attributes in the mini-dimension. Why Kimball suggests you snowflake this I can't conceive - if it's that you want it attached to the original dimension, why not just add the Type 1 attributes there? If you want it as a separate dimension, just make it separate!

Anyway, as we'll see when we get to Type 6, there are better ways of doing this anyway, even if you really must have the Type 4 mini-dimension, so I'd rate this SCD type as basically a non-pattern - why waste our consciousness remembering this when we could just attach a comment to the design and learn it on-the-spot if it's ever used?

I know this is supposed to be "The Good" of the SCDs, but I find myself hard-pressed to write anything good about this design.

The series continues with a discussion of the significantly more useful (despite not introducing any new concepts) Type 6.

Saturday, September 17, 2016

DW Patterns: The good - type 4

I have to apologize - it's been three weeks since my last update. This is because two weeks ago my fiancée and I bought a house. I know that's a pretty pathetic excuse for being busy, but there it is. Anywho, I'm not here to lament my debt, I'm here to continue my series on data warehousing design patterns. I last left off at Type 3, so the next in line is...

Type 4: ???

Ok so there's a bit of disagreement from here on. Wikipedia says Type 4 is "Add history table" while Kimball Group say that it's "Add mini-dimension". Anyway, it's not my ambition to disambiguate this (currently), so I'm going to cover both here - in which case let's get on with it:

Type 4: Add history table

See definition on Wikipedia

This is just how it sounds - you keep a history table mirroring the dimension, containing the history of the data in that table. Now, the Wikipedia article (currently) isn't completely clear as to how the temporal table is referenced in the fact table ("Both the surrogate keys are referenced in the Fact table to enhance query performance"), meaning there are a couple of interpretations:
  • The fact table only contains references to the "current" record (i.e. type 1) and the history table doesn't have direct links (hereafter referenced as type 4a)
  • The fact table contains two references - one to the "current" record (i.e. type 1) and another to the "as-at" record (i.e. type 2) (hereafter referenced as type 4b)
Now, I haven't used either of these before, so I'm going to be commenting based on comparing these strategies with problems I've encountered before. 4b is interesting, but isn't terrifically user-friendly - imagine trying to use a pivot table against a cube with 20 dimensions, where there are two copies of each! It is, however, a relatively easy-to-implement version of Type 6, for every attribute on the dimension, which might be quite attractive in certain situations. I wouldn't recommend using this for every dimension, but one or two could actually improve usability - I could imagine using this for something like a Course dimension, where lots might change (e.g. the name or even the code, which some might consider the natural key) and users want to be able to do analysis on the current state or look up a specific version.
Type 4a is actually really cool, and if you're using SQL Server 2016 then this actually comes out of the box with Temporal Tables. Just whip up a Type 1 dimension, make it a Temporal Table and you're done - simple as that. The reason I like this is that if I were working on 2016 and wasn't storing history of my data in another way, I would make Type 4a my default - it's strictly better than Type 1. This is because if you've made everything a Type 4a and your BA decides "now X attribute needs to be Type 2/4a/6/etc." then you have the necessary information to backdate it. Otherwise you're stuck in the frustrating situation of having to say "we can make it Type 2, but only going forwards - it won't have history from before today", which is a conversation I always hate having.
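As a sketch of what that looks like (the table and column names are illustrative - the Course dimension mentioned above, for instance), SQL Server 2016's system-versioned temporal tables take care of the history table for you:

create table dbo.Dim_Course
(
    SK_Course   int identity(1, 1) primary key,
    Course_Code nvarchar(20)  not null,
    Course_Name nvarchar(200) not null,
    ValidFrom   datetime2 generated always as row start not null,
    ValidTo     datetime2 generated always as row end   not null,
    period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.Dim_Course_History));

-- point-in-time queries then come for free:
select * from dbo.Dim_Course for system_time as of '2016-05-28';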


Alright cool, so what about the Kimball one, which I'm going to refer to as...

Type 4k: Add mini-dimension

I'm not going to talk much about this, as the mini-dimension is purely an optimization technique. The idea is to pull out a few very volatile or extremely frequently queried attributes from a Type 2 dimension and form them into their own junk dimension. Because it's an optimization technique only, I'm not a huge fan, and I would suggest looking at indexes or checking your dimensional architecture to see if you've made a dimension granularity mistake. If, however, your dimension meets all of the following conditions, you might consider pulling some attributes out into a mini-dimension:
  • The dimension you're looking at is Type 2-ish (2, 4a, 4b, 6 etc...)
  • The dimension is extremely large (to the point that it slows queries significantly) and a few attributes are in high demand (e.g. an invoice line-item dimension for a large company where item-type and quantity banding is used a lot)
  • Or the dimension is unmanageably fast-growing due to a few attributes (I can't even think of an example that I don't immediately rule out as a bad design for this)
  • User-transparent methods for optimizing (e.g. indexing, stats etc.) have failed to provide adequate performance increases
So it could be useful occasionally, but it's not exactly something I'd teach in Data Warehousing 101...

Stick with us as I continue this series on existing DW design patterns by looking in to SCD Type 5