Talk:Data Vault Modeling

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Okay - edited it a bit, but I still see a few issues. Will update it this afternoon. RonaldKunenborg (talk) 09:48, 9 July 2009 (UTC)

I'm done for now. Added several references, lots of context and some explanation. Removed all the flags for lack of context and lack of notability. Moved the marketing blurb to a quote (since it was written by Dan Linstedt on wikipedia it's a bit hard to refer to it :)) and added some background and general information. I may add more once I read up a bit more on Data Vault. I haven't addressed Point-In-Time tables, for instance.RonaldKunenborg (talk) 12:52, 9 July 2009 (UTC)
Removed the tag for the introduction length (yes, it was a bit too long) and cleaned up the article, adding a few references here and there. More work needs to be done in that area so I leave the tag for that item where it is. RonaldKunenborg (talk) 15:57, 6 December 2009 (UTC)
I've changed the article a bit more, cleaning up and adding references (as requested). Also added a section on reference tables (how appropriate :)). I will try and incorporate the request by Jomsviking sometime this week. And add some pictures and other resources. RonaldKunenborg (talk) 22:31, 13 September 2011 (UTC)

Can somebody Explain[edit]

Can somebody explain what you can do with a Data vault model, but can't do with f.ex "dimensional modelling" (or why it's advantageous to use data vault over dimensional modelling, some cases)??Jomsviking (talk) 14:54, 7 January 2011 (UTC)

Hi - sorry, this took ages and I forgot about your comment. A small explanation should be put in the article but for now, let;s just say this.

You (should want to) separate the layers of your DWH into at least three layers nowadays, as a best practice. One layer is where the data comes in, modelled to reflect the source. Let's call this the staging area. Then we have one area designed for the users, where data is stored as the users like to receive it, ready for consumption. Let's call this the data mart area.

And then we have the area inbetween, where we have to store the data in such a manner that it will keep for a long time - the storage area. We manage to achieve this long-time stability by modelling the data in such a way that we disconnect keys from attributes, relationships from the keys, and functional areas from each other. We also want to load this area fast, but do not care about query performance all that much. Hence, the Data Vault: highly scalable during loads, disconnects all areas and only connects data over the business keys that are much more stable than the underlying software systems. However, for query performance almost everyone will use a datamart on top of this, with facts and dimensions. The good news is that you can transform a well-designed DataVault into facts and dimensions quite easily, often with views. This means that dropping the datamart before reloading is not a problem and is actually standard practice when rebuilding smaller data warehouses. This makes it very easy to change business rules, dimensions, add columns to facts or change the granularity of a fact table. RonaldKunenborg (talk) 21:44, 26 September 2012 (UTC)

Another point is that it is often better with a 3nf model and a persistent staging area that has good metadata in ETL tool. In some respects data vault has many of the problems with dimenensional but is impossible the query. A 3nf model is almost as easy to load, says more about the data and is possible to query even if performence is worsa than a deimansional model With a modern applienace like Netezza, MS PDW, Pivotal, teradata you often get away with 3nf and some simplifying views. With DV (Data Vault) you MUST have views and since you need to self join tables and they are plentiful the SQL is impossible to generate with query tools or push down SQL generated from ETL tools. The best advice would be to avoid data vault in a EDW architecture except for sotuations where preserving history where sources don't is the primary interest. Abergdahl (talk) 21:22, 10 December 2013 (UTC)

Example inconsistency[edit]

In the Link example, the surrogate primary key L_DRIVER_ID is optional, but in the satellite example (which refers to the link example), the field L_DRIVER_ID is mandatory as a foreign key to the driver link. This requires resolution/explanation. --Joachim Pense (talk) 11:35, 7 April 2013 (UTC)

Link Example is potentially confusing[edit]

The link example uses the example of a "Driver" (a person driving a car). Link's in data vault also have the concept of "Driving Key", readers might get confused between the "Driver" example, and the "Driving Key".. — Preceding unsigned comment added by 203.53.49.245 (talk) 06:40, 19 April 2013 (UTC)

Good point, I'll take a lot at it later. The example needs a touch-up anyway, I think. RonaldKunenborg (talk)

Which Hubs?[edit]

How do you figure out the Hubs you should have? Is there some kind of technique to make use of? 62.220.160.156 (talk) 21:34, 30 October 2013 (UTC)

Hi, I'll re-read the article to see if I missed something there but the technique is simple: use the business keys. Like: license plate, invoice number, SSN, etcetera. If a system doesn't have natural business keys, you need to find other keys, perhaps even the surrogate keys. For more information you should really read the references at the end of the article, they explain this in more detail. RonaldKunenborg (talk)