Talk:Data warehouse

From Wikipedia, the free encyclopedia
Jump to: navigation, search
          This article is of interest to the following WikiProjects:
WikiProject Computing (Rated Start-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 
Note icon
This article has been marked as needing immediate attention.
WikiProject Databases / Computer science  (Rated Start-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Databases, a collaborative effort to improve the coverage of database related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
Taskforce icon
This article is supported by WikiProject Computer science (marked as High-importance).
 
WikiProject Business (Rated Start-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Business, a collaborative effort to improve the coverage of business articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.
 
Note icon
This article has been marked as needing immediate attention.

Removed Section[edit]

I have removed the section "Important Considerations in Building a Data Warehouse" as it is not appropriate to have a "how-to" section in a definition. I've written to the originator of the change but have received no response and therefore have taken this action. The text that has been removed is listed below.


Important questions to ask when using a data warehouse 
  • Do you need a data warehouse? Data warehouses are expensive,additionally you need to train employees on how to use them which is also costly.
  • Do all employees need to have access to the data warehouse? Not all employees need to access the whole data warehouse, in case they need some information you can consider building data marts.
  • How frequently should you update it? This depends on the nature of the data, some needs to be updated instantly, others can be updated weekly or monthly.
  • Which data mining tools should you use? End-users should be the primary determinants of what tools to use, yet the most important key is training.

English please[edit]

I've read this page a couple times over, and I still don't understand what data warehousing, or a data warehouse is. Can anyone please provide some concrete examples of what the heck this article is talking about? —Preceding unsigned comment added by 78.101.133.126 (talk) 13:32, 21 March 2009 (UTC)


We're still waiting for a english article! Whats good for a encyclopedia article to explain something for the ones who already know it? Just to satisfy egos?

Major Edits Made[edit]

Data warehouse is simply a storage location where we store data for our benifits like easy traversing of data etc. Data warehouse has some advantage and some disadvantage also Disadvantage: Data warehouse are costly and human resources are trained for work on data warehouse that is also costly. Advantage: we can easily access the data accessing become faster data become consistant etc.

This is overall description of data warehouse.

Thank you lalit bhardwaj

This page needs serious work![edit]

I'm a Senior Architect in a data warehouse in a Fortune 100 company. We have 150 people on staff and an annual budget of 30 million, and if my management came to this page to see what they're paying all that staff for, we'd be fired!!! There is VASTLY more to data warehousing than this article outlines. I will be back to enhance it dramatically, but in the mean time, don't think this is all there is to data warehousing! --[[User:an annual budget of 30 million is pretty extreme for a data warehouse. What typically happens is that there are companies that need data to manage their business (most of them) and companies for whom data IS their business. My guess is that this writer was in the latter category. Neil Raden (talk) 23:30, 17 February 2008 (UTC)

This text isn't quite right[edit]

Conventional database systems use highly normalized data formats so that they will execute transactions and queries as fast as possible, in minimal time and space.

Normalization gives you tremendous flexibility on the kinds of queries you may run, but often at the expense of time. (unsigned)

tzeh 5.jan.2004 wrote[edit]

This entry of "data warhouse" is not the definition resp. description of a data warehouse. This text defines resp. describes the data warehouse system with some components of the data warehouse system. A data warehouse is a data store only which is build out of separate internal and/or external data sources where the data is integrated in a consistent manner. (unsigned)

Shark Tank on Computer World[edit]

State agency's data warehouse project is renamed: It's now officially an "administrative systems reporting database." Why not just call it a data warehouse? "Several legislators threatened to block the project if the warehouse wasn't constructed within their own districts," grumbles on-scene pilot fish. "All attempts to explain that no physical structures would ever be built fell short. The lawmakers had never heard of a warehouse without a building somewhere." [1] 4.250.198.106 14:18, 29 Mar 2005 (UTC)


References[edit]

I removed the Wiley title from the bibliography since they publish many titles on the subject, and singling one out seemed rather arbitrary. The link provided in the References section includes all Wiley titles on data warehousing.

The "References" and "External links" sections should really be combined, since "External links are still "references". Ringbang 13:15, 22 July 2005 (UTC)

That sounds reasonable to me. I added an external link yesterday (to the data warehousing knowledge base at http://www.datamgmt.com), but I see someone has removed it with no comment. That knowledge base is a very good source of information on data warehousing, authored by the UK's leading DW authority, and is every bit as useful here as Kimball's book. GreenInker 22:12, 25 May 2006 (UTC)

The reason I feel these aren’t arbitrary book choices--even though both are from Wiley--is because these are the two books I recommend people start with. One is a general overview of the Corporate Information Factory approach and the Kimball book is more of a nuts-and-bolts view. They aren’t opposing, but some people feel that they are, and these are really the two biggest names in Data Warehousing. They deserve to be highlighted. Stephen Pace 9 Aug

Makes sense. If you re-add them, I won't remove them, but it would be nice if the listings were qualified in some way in the article (i.e., some mention that they're the standard references for beginners). Better yet would be subsections for special topics in data warehousing, general titles for beginners, etc. I definitely see the value in helping people to sift through Wiley's huge information management library. — Ringbang 20:22, 10 August 2005 (UTC)

I found an article discussing the two cited approaches for organizing data in a datawarehouse, (Inmon & Kimball), from an ISI magazine: March 2005/Vol. 48, No. 3 COMMUNICATIONS OF THE ACM, page 79-84, "A Comparison of Data Warehousing Methodologies", but I don't know how to put a reference. I think this reference could be in the "The Data Architecture - Different methods of storing data in a data warehouse" section. Gueta 20:31, 21 December 2006 (UTC)

Merge[edit]

Recommend merging the article Data mart with this one. Comments? SqlPac 14:46, 17 May 2007 (UTC)


Absolutely Not. Data mart is a separate concept from data warehouse. Data mart is worthy of its own article. Steve 13 June 2007

I agree that data marts are sufficiently distinct from data warehouses to warrant a separate article. Further, the two terms are often used interchangably by the non-cogniscenti and combining the articles might add to that misconception. However, data marts and data warehouses are often used as complementary components of an overall Business Intelligence solution. For that reason, I think the relationship between them needs to be drawn out in both articles. John 22 June 2007


Well, I have seen (as an IBM consultant for seven years in Asia and NA) and used (as a developer now for six years) data marts that source its data from data warehouses. So, they can be closely related. Djoni 5 July 2007.

It may be closely related but does that mean data mart does not have an identity of its own distinct from a data warehouse? Cant a data mart operate without one? If it cant, then it may be merged. But if it can, it is a different concept in its own right and worthy of a separate article. --soum talk 19:45, 5 July 2007 (UTC)

A bit unbalanced.[edit]

Parts of this article are written very pro Inmon at the expense of describing the Kimball method completely. It could use a writeup of the Kimball method from Ralph Kimball's perspective I think rather than from Inmon's. Perhaps each method needs it's own article instead of trying to add a small amount of information on each to this article and end up being a bit lean on both approache's pros and cons.

Both Kimball and Inmon have their own articles. I'd suggest your expand their POV's there. We can draw comparisons here. Neil Raden (talk) 22:58, 24 February 2008 (UTC)

I think "less" should be "more"[edit]

I believe the sentence... "Less complex information is broken down into its most simple structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules." could be better crafted as something like... "In this approach, each of the more complex information items are resolved into a set of records in multiple tables, each of which satisfies the normalization rules." Even that sounds like a mouthful, though...any thoughts? Kiwi137 (talk) In the absence of any objections I will amend text as per the above. Kiwi137 (talk) 11:51, 14 January 2008 (UTC)

Article Needs a Rewrite[edit]

Any article about data warehousing that starts out with Inmon's 20-year-old principles and decade-old debate with Kimball is a history lesson, not an informative piece. I can't even agree with the first sentence: "A data warehouse is the main repository of an organization's historical data, its corporate memory." That is only partially when true when there is a true enterprise data warehouse, which is rare. A data warehouse only contains strucutured data culled from operational systems. That's pretty short of the corporate memory.

Is anyone interested in working with me to improve this article? I'd likke to see more explanation of how a data warehouse is integrat4ed into today's computing environment, the design implications of scale and real-time, etc. Neil Raden (talk) 23:58, 27 January 2008 (UTC)

I'd be happy to help out. Inmon and Kimball can't be ignored in the article, however, as their methodologies and opinions are widely regarded. SqlPac (talk) 20:50, 24 February 2008 (UTC)
I'm not suggesting we ignore them, but they are not the most important issue, or rather, the debate over them isn't. Data warehousing is on the verge of a major re-think thanks to the convergence of operational and analytical processes. How does the high-latency batch update process through multiple physical layers (the Corporate Information Factory) work with today's externalized business, huge scale and demand for immediacy? Where do we fit streaming databases and the processing of complex events and decision automation? So I guess my point is that the history of data warehousing is interesting, but it's more important that we focus on writing an article that is useful to the reader. Neil Raden (talk) 22:47, 24 February 2008 (UTC)
Why don't you add a final section on the future of data warehousing? (Writerguy71 (talk) 15:27, 26 February 2008 (UTC))
I took a stab at a brief section on data warehousing futures. (Writerguy71 (talk) 10:47, 7 March 2008 (UTC))
"Warehousing futures" - I feel like I should call my stock broker and invest heavily <g>. SqlPac (talk) 17:20, 9 March 2008 (UTC)
I agree that they shouldn't be ignored, and I can agree that the discussion over these two personalities in particular may not be the most important issue on the plate--but to understand the future we need to know the past right? :) I think information about these two is important because their methodologies are in such widespread use today. It's important to look at what the future is bringing, but perhaps even more important to most readers is what is in common use today. SqlPac (talk) 17:20, 9 March 2008 (UTC)
I've rewritten the "Kimball versus Inmon" section to be a bit less dogmatic, and to frame the discussion more in terms of the "top-down" vs. "bottom-up" design methodologies. Obviously Inmon and Kimball were mentioned, but since this article isn't about them per se, but rather about data warehousing in general I've basically stated "this is the design approach... and this guy is one of the leading proponents of this approach..." I've also added references and listed some of the benefits and risks associated with the different design methodologies. Feel free to tweak or change as you feel necessary. I think we can add some more information about the "hybrid design" approaches commonly in use. SqlPac (talk) 19:15, 10 March 2008 (UTC)

One physical repository?[edit]

Someone put this sentence in there: "These two influential experts represent the traditional views on one aspect of data warehousing - whether it should be in one physical repository." I disagree with the wording, as I don't believe either one has really said your data needs to be stored in one "physical repository". What's a "physical repository"? A single database on a single server? A single server on a single network? Does this exclude server farms and geographically separated redundancy systems? SqlPac (talk) 20:50, 24 February 2008 (UTC)

I think the point is that Kimball and Inmon were largely focused on the database aspects of data warehousing, which is not really true. Kimball has written extensively about ETL, Web Analytics and the whole data warehouse lifecycle. Inmon started the first company (Prism) to do ETL and has written extensively about many aspects of data warehousing. I think what this person meant was that the Inmon vs Kimball debate was about was about how to build the mega data warehouse, but that isn't really true. Their differences were mostly methodology (how to proceed), though Imon rejected dimensional modeling initially, aceeding to it later for "data marts." But the really interesting part of this statement is the comment about one physical repository. Today, it should be thought of as a logical repository as it can be, as you said, distributed. In addition, there is a rapid pick-up of in-memory databases and query federation, through a meta-layer, which is making slow but steady progress. Neil Raden (talk) 22:57, 24 February 2008 (UTC)

Another round of edits[edit]

Another round of edits were attempted. There was an attempt to keep most of the substance intact. These edits will be fodder for making the article better. For example, the above comment on the physical repository was correct. Writerguy71 (talk) 25 February 2008 (UTC)

There was an attempt to incorporate someone's personal diagram into the substance the article. Frankly, the diagram does not add much though. Writerguy71 (talk) 25 February 2008 (UTC)

Also, admittedly, the article greatly lacks the annotation it should. If there are unannotated controversial statements, then, per Wikipedia policy, then remove them. That being said, the controversial material may revolve around, for lack of a better term, "Inmon - Kimball religion issues". An attempt was to put those issues in a "fair and balanced" light. Writerguy71 (talk) 25 February 2008 (UTC)

Finally, I do not know how the above discusser infers there is a statement that Inmon and Kimball themselves were focused on the database issue. Nonetheless, the language has been changed with the attempt to make it so other readers do not make the same inference. - This article needed so much work that there probably are many other points that could be better said. Writerguy71 (talk) 25 February 2008 (UTC)

It would help if you use indenting under the comments you are commenting one. Also, what would you think about removing the picture at the top? It depicts one type of data warehouse environment only. But more importantly, it uses the term "Data Vault" which is a term used by a vendor. Neil Raden (talk) 21:07, 25 February 2008 (UTC)
As stated previously, it is my opinion that the diagram adds little. (Writerguy71 (talk) 11:53, 26 February 2008 (UTC))
It's gone Neil Raden (talk) 00:16, 27 February 2008 (UTC)

Repository[edit]

The first sentence uses the word "repository." Is a data warehouse really a repository? Most diagrams depict it as much more than that. Is ETL part of a warehouse? Is metadata part of a warehouse (metadata may be housed in its own repository, but isn't metadata itself much more than a repository?). If you want to get expansive about it, aren't reports, metrics, templates and even the forgotten users part of a data warehouse? Neil Raden (talk) 21:12, 25 February 2008 (UTC)

For purposes of simplicity, keep the first sentence as a "repository". If you want to be technically correct, then include ETL, metadata, etc. in the first paragraph. Like to see your edits and how you draw the boundary. Just don't call data warehousing the corporate memory. (Writerguy71 (talk) 16:10, 26 February 2008 (UTC))
I changed the intro. around, using two definitions - one definition from an oft-cited ACM paper and another definition from an oft-cited Inmon paper. I also listed more benefits as cited by the Data Warehousing at Stanford project. You may want to revise or rearrange as necessary. I think it's important to establish a level of credibility with citations in the intro. though. SqlPac (talk) 18:59, 9 March 2008 (UTC)
I noticed others have been rearranging content, and it looks good. I've rearranged the intro. and the first paragraph to combine them, and I also moved the benefits out of the intro and into a separate section. This was basically to keep the intro short, sweet, and to the point. I also reworded the "expanded definition" slightly since it started with "The previous definition focuses on data..." and then goes on to explain that the expanded definition includes "the movement of *data*, the analysis of *data*, the extraction, transformation, and loading of *data*". Seemed like the author might have been trying to say that the previous definition was focused entirely on the "storage of data". Anyway, feel free to change around whatever as necessary. SqlPac (talk) 02:05, 12 March 2008 (UTC)

Normalization/Denormalization[edit]

This is all wrong. In my experience, operational systems are the most denormalized systems. Look at SAP for example. It violates every rule. The most normalized schema in the whole system are the Star Schemas in the SAP BI. It is completely false that dimensional schema are denormalized. A well-designed star schema is in 2nd normal form and if you snowflake the dimensions, which I always do to promote sparse aggregation, they are in 3rd normal form. Some may argue that duplicating data (via aggregates) violates the normalization rule, but if you stop to think about it, so does an index. If you look at a snowflaked schema, what you see are a series of 3rd normal form stars. It is a complete MYTH that 3rd normal form doesn't provide good performance. It depends on the type of schema. Eliminate circular joins and it performs quite well. Neil Raden (talk) 21:30, 25 February 2008 (UTC)

Okay, should the section exist? Or should there be a section on "design" that, perhaps, explains granularity, that some data warehouse designs follow some/all the Codd rules, and that some data warehouses follow dimensional modeling which makes use of "facts" and "dimensions" and which may or may not follow some of the Codd rules? If not, what should be in the section if you agree something like it should exist? (Writerguy71 (talk) 15:56, 26 February 2008 (UTC))
Dimensional schemas consist of denormalized tables in many instances. Indexes are a separate issue from "data", and are an implementation-specific detail. They are managed internally by the system, are usually not exposed to users, and are not considered "data" in most systems. Some indexes do not duplicate data, so I'm at a loss here as to what you're talking about. Whether or not 3NF provides good performance is dependent on many factors. Some systems perform better with a star schema, some perform better with a snowflake schema. I would be interested to see a "well-designed star schema in 2NF." Usually a simple Time dimension can be a determining factor as to whether it even meets the criteria for 1NF, much less 2NF or 3NF. Do you have any examples of 2NF well-designed star schemas (schemata? whatever...)? SqlPac (talk) 17:35, 9 March 2008 (UTC)
I took a stab at rewrites and moving information to a more appropriate section. The section still does not read well so your edits are welcome (Writerguy71 (talk) 11:57, 4 March 2008 (UTC))
Big improvement Neil Raden (talk) 04:48, 11 April 2008 (UTC)
A star schema is defined as "dimensions on 2nd normal form and facts on 3rd normal form". A snow flake schema can be defined as "dimensions on 3nf and facts on 3nf". So, technically, if you "Snowflake" your dimensions you are not making a star schema model.
I think some of the confusion, originates in the concept "dimensional modeling". In a "Kimball setting" the concept "dimensional model" denotes a star schema model (with simple keys), though, ofcourse a snowflake also has "dimensions".
I think that the question of performance between the two models is difficult to answer. Algebraically (say, number of floops needed) a simple query running on a snowflake should be fastest. Though the key performance parameter for modern servers aren't floops, but rather the number of "pages" that has to be loaded into memory, in order to resolve the query, and it will have a lot to do with the internal workings of the query engine, as somebody hinted above. F.ex, I fell pretty convinced that MSSQL2005-2008 does not have significant performance gains with any form of snowflakes and that SSAS2005-2008 performs better on "clean" star schemas (an important consideration if you plan to use that tech on your relational model)Jomsviking (talk) 21:19, 26 February 2010 (UTC).
Besides that, I agree that the section is strange. It could be that I completely misses the point of the author, by "normalised" I understand a normalised relational database model, a snowflake. With that said:
F.ex. it states (Concerning dimensional models) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated Now it is perhaps true that it's complicated, but not more complicated than in a "normalized" model. In fact the operations are similar, though in the dimensional model you skip the tedious normalisation part.
also the It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business. Well, that's always true, irrespectable of "dimensional" or "normalised".
And
"The main advantage of this approach is that it is straightforward to add information into the database."
No, it's not. Jomsviking (talkcontribs) 22:02, 26 February 2010 (UTC)

I completely agree with Jomsviking's comments. —Preceding unsigned comment added by Eleven one (talkcontribs) 16:04, 28 July 2010 (UTC)

Dates for key developments in early years of data warehousing[edit]

A list of some key dates was addded. I thought a little historical context would be good for the article. The list is debatable so edit away. (Writerguy71 (talk) 02:26, 6 March 2008 (UTC))

Roenbaeck (talk) 11:37, 17 February 2011 (UTC) Key dates are fine, but as the list is made now, Daniel Lindstedt seems to have been as important as Inmon and Kimball. Has he really been? Furthermore, if DV is released in 2000, it seems to me that only a very select few could have known that he began work in 1990. Unless generally known, how can that be a key development?

Data mart article badly in need of help![edit]

I just visited what might be considered a "sister article" to this one - the Data mart article. It is badly in need of attention. I figured I'd post a quick note here, since people who are editing this one are more likely to be interested in that one as well (at least more likely to be interested than the general population of editors). Thanks. SqlPac (talk) 02:15, 12 March 2008 (UTC)

External Links, Inline References[edit]

Can anyone inline these references? I'm not sure what material was sourced from these "seminal works" in the article. If they're truly considered seminal works, there should be at least one sentence worth referencing directly from each, right? :)

  • William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Son's, ISBN 0-471-05966-8 - One of Inmon's seminal work on data warehousing
  • Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0-471-20024-7 - One of Kimball's (with the help of Ross) seminal works on data warehhousing

Also I moved all the external links to blogs, etc., from the References section to a new External Links section. Someone might want to go through those links at some point and make sure they meet the Wikipedia criteria for inclusion in the article. I'm not familiar enough with the criteria to do it myself, but there are a couple of blogs in there I'm not familiar with and I wouldn't be a good judge as to whether or not they meet inclusion criteria. Thanks. SqlPac (talk) 04:00, 12 March 2008 (UTC)

I'm Shocked[edit]

...at how good this article is looking these days :) I remember when it was just an ugly little stub! One thing that could use a little attention is the "Data Warehouse Architecture" section. Some of the questions that come to mind from the little bit of content in there are: 1) Are there any other conceptualizations available besides the one presented? 2) How are the layers "interconnected" to one another? 3) Are there any general guidelines or common/best practices around data warehouse architecture? It seems like this section could be fleshed out nicely with some quality content, but right now it stands out as very minimal in an otherwise good article. I noticed a few little nitpicky things that I'll come back and fix when I have time but overall you all have improved this article by several orders of magnitude from where it began! SqlPac (talk) 05:10, 24 June 2008 (UTC)

I'm shocked too, although for a different reason![edit]

This article mentions that Ralph is a proponent of bottom-up approach, while his book that I am snailing right now - "The Data Warehouse Toolkit" clearly scoffs at this approach. He says in the Introduction section of this book, "There is a false sense of security in the diagram detail. 'If its detailed, it must be good'". And then he goes on, "The dimensional model is the top-down model (notice that we started by talking to the CEO), and the data dependencies model is a bottom-up model". May I request a knowledgeable volunteer to check this out? —Preceding unsigned comment added by 12.107.188.130 (talk) 17:29, 17 November 2008 (UTC)


I completely agree. The guy that wrote the section on Inmon-Kimball, clearly hasn't understood Kimball. In kimball's approch you start by talking to the buisness, and through such buisness requirement analysis coupled with the realities of the data avaliable arrive at the "business bus matrix", which is a diagram (a matrix) that show buisness processes and the dimensional context. This is per definition a top-down approch. THEN you use your "business bus matrix" as a blue print for your actual modelling of the problem at hand. This process is a bottom-up. So it's top-down followed by bottom-up. I believe that any database programmer at some point, possibly at the first day at work, realises that in the real world you have to build your database much like the mason builds the chimney, though if the end user don't like your finished datawarehouse, or chimney, all your work is in wain. So the most important thing is to be able to meet end buisness requirements - and that's reflected in Kimball's method which starts with the business (top-down).Jomsviking (talk) 23:13, 26 February 2010 (UTC)

I agree with this statement. The Inmon vs. Kimball approached is confused and almost backwards. This nees to be corrected because it represents the two schools of thoughts that need to be very clear for people who want to understand which architecture approach they want to take. —Preceding unsigned comment added by 210.54.1.141 (talk) 21:03, 24 March 2010 (UTC)

And again I completely agree with Jomsviking's comments. This should be corrected asap. —Preceding unsigned comment added by Eleven one (talkcontribs) 16:05, 28 July 2010 (UTC)

The Inmon approach has been somewhat misrepresented. It is not a boil the ocean approach, but a well engineered, iterative and cost effective approach. Whatever way you build the enterprise DW it's gpoing to be expensive, but an iterative approach - either Inmon or Kimbal etc. - means that you don't boil the ocean and you don't spend significant amounts of cash on something that you subsequently have to throw away (as not fit for purpose). MRJ

Hybrid Design Examples[edit]

It would be nice to highlight some of the most popular Hybrid design methodologies--Madmonky (talk) 16:38, 1 April 2009 (UTC)

History under the Future developments section[edit]

Why does the first section of the Future developments section state historical info? While I think the info should go somewhere in the article, that is clearly not the section for it. Also, it'd be nice if it detailed what some of those rejected ideas are.--Madmonky (talk) 17:06, 1 April 2009 (UTC)

The footnoted article gives examples of rejected ideas. Not sure what you mean by the futures section including historical information. If you mean the references to 2009, 2009 is only a quarter over. Re hybrid methodologies, not sure there are written down hybrid methodologies.--writerguy71 (talk) —Preceding undated comment added 00:47, 3 April 2009

Unclear definition Real Time/Integrated[edit]

What is the difference:

   Real Time Data Warehouse 
   Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order or a delivery or a booking.)
   Integrated Data Warehouse 
   Data warehouses at this stage are updated every time an operational system performs a transaction.

I think this needs an overwork.

RAWR(UTC).

I've fixed the second definition (Integrated Data Warehouse). Crysb (talk) 20:35, 27 September 2010 (UTC)


As far as I am concerned the list is bogus. First of all I don't like section's "to the level of sophistication of a data warehouse:" Because a datawarehouse is "offline" doesn't mean that it's less sophisticated than a (pseudo) real time DW. You (should) build realtime datawarehouse or build parts of the DW as realtime, when there is a specific buisness requirement, which btw never is "real time" but a specification of the length of the latency of the data (well, at least when you have made the customer acquainted with reality). The strategy or technique deployed to build a (pseudo) real time DW, can be more or less sophisticated, which usually has to do with how "real time" your DW should be; in other words, The length of the latency. Anway the list mixes in "integrated" DW into the fray. That's just weird. an integrated DW can be real time or not, a real time DW can be integrated or not. It doesn't make sense to rank DWs like thatJomsviking (talk) 10:25, 5 January 2011 (UTC).

Disadvantages[edit]

This is a terrible section;

   * Data warehouses are not the optimal environment for unstructured data.
   * Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
   * Over their life, data warehouses can have high costs.
   * Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
   * There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems.

That "Data warehouses are not the optimal environment for unstructured data."
Is utterly irrelevant; One major point of DW is to turn unstructed data into structured data (data consolidation) A DW should not contain "unstructured data" (besides in staging-repository).

"Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data."
Which is true and it is also true for the greater part of all complex operational systems (take a finacial operating system that keeps track of payments; The customer pays his bill, at the bank. The bank keeps track of this payment in it's own systems and somehow informs the fincial operating system of the payment, which then processes this payment" at every link or joint there is latency. The key thing is, can the DW meet the business demands on the length of the latency? If yes there is no problem.


" Over their life, data warehouses can have high costs." As compared to what??? Datawarehousing techniques are, in my oppinion, far superior and far cheaper in delivering REQUIRED BI (business intelligence) on a platform, be that simple flat reports or complex online multidimensional data, than f.ex trying to create those NECESSARY reports in the operational system (even standard systems like Axapta, ofcourse things are totally impossible with custom build systems that usually only the original programmers can understand without using endless hours ). I am aware that certain people, who usually fail to appreciate the complexity, the history and the weaknesses of the data of an larger organisation, thinks that delivering buisness information is "just" a question of programing some "web parts", "web services" nice, on paper, looking "SOA" and getting at the data is "just" writing som "SQL-queries" against the operational system. They are wrong, and that approch always ends up in countless hours of (expensive) programming (and re-programming) the end result might be that information is delivered and in a nice fashion, but the costs is partly relatively many hours of programming (nice for people like me) but more serious, a "web of confusion" of inter-dependent, inter-locking web services, web pages and the like, that after a month or so nobody can figure out. Which creates serveral problems for the organisation, amoung those: Critical dependency on those who builded it; Failure at flexible integration of organisations' data; Sacred "Black boxes", that nobody dares to thouch or fix (which f.ex. is very well suited for freezing the organisations future natural IT-infrastructure development).

What can be said, is that, on the contenary, Datawarehouses can have a high "upfront cost", that has to be paid/done before information can be delivered to the business, proving (or disproving) the worth of the datawarehouse.

" Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization." Which is true for all "IT-efforts". Ofcourse if one operational system is altered or replaced that creates a need to update the DW as well as all other systems (including humans) dependent on that altered system - that's part of life.


" There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems."

True, but " functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems." In my experience it's almost always the other way round. Expensive "Add ons" are built into the operating system to make it facilitate buisness requirements. But this could either have been done in the DW (often at a significantly lower cost) or the add-on could be a (perhaps, simple) stand alone system, that integrates with the data of the organisation via the DW. (which btw is the more flexible solution in most cases)Jomsviking (talk) 12:42, 5 January 2011 (UTC)

Quite a good killing of that section, Jomsviking. The section was not sourced, needs multiple "compared to X ..."s, and more nuances. I bet Kimball and Inmon have spend a paragraph or two on the "don't do that, it's wrong or inefficient"-tradeoff. Now what to do? I must say, your arguments here, however plausible, are OR too. May I invite you to delete the current text, and replace it with a new list, including sourced arguments? Imo the new text can be structured along aspects of "economies", "functionalities", "quality". Or maybe I'll take a look in Kimball myself (the book should fall open on the right page ;-) ). -DePiep (talk) 18:15, 5 January 2011 (UTC)
I don't think you can speak of "disadvantages" (of datawarehousing), however you can speak of "requirements". Is a datawarehouse is required by the business in question? and to which extend? If there is a requirement and it is meet by the datawarehouse, then it was probably an advantage to get it or vice versa.
When that's said, "practical" datawarehousing is many things, spanning from a dozen reports and small "marts" to "real" datawarehouses. Ofcourse you can head into a huge architectural discussion of how a DW can be utilized (or not) in a IT-infrastructe, that's beyound the scope of this article and one of those discussions that doesn't end. OR you can discuss pro et con with the old-school IT department that thinks that "those buisness guys should learn SQL if they want to query the data" or the new generation of half-wits "web programmers" who not only thinks that everything can be done in a SOA architecture, but also think that everything should be done in a SOA architecture.
Personally, I would just delete the sectionJomsviking (talk) 20:06, 5 January 2011 (UTC).
Right, again. Current section is deletable, if only for the OR/unsourced reason. After that, I think there could be a description of tradeoffs: is it usefull (effectivity) and is it worth it (efficiency) in a business. This will be nuanced as it is on a more gradual scale, not black/white. (btw, for me personally, useful borderlines are "they should learn SQL" and "just build it in the Transaction DB" indeed.) Now you delete. :-) -DePiep (talk) 20:52, 5 January 2011 (UTC)
Done! Jomsviking (talk) 21:45, 5 January 2011 (UTC)

Opening[edit]

I really think that we should try to write an opening to the article that explains what "datawarehousing" is as well as what a datawarehouse is. Kinda like, What's the point, why do people build datawarehouses etc. I think people that visit this article wonders what a datawarehouse (and datawarehousing) is. They are not primarely interested in abstract nonsense or theoretical biggering.

Current opening[edit]

A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers (analysis and support). The integration layer is used to integrate data and to have a level of abstraction from users. The access layer is for getting data out for users.

This is just nonsense, I don't understand it (and, in all modesty, if I don't understand a Datawarehousing topic, very few ordinarry people do, which kinda defeat the purpose of a wiki article, imo). Though I do have a guess on the meaning:

"Staging is used to store raw data for use by developers (analysis and support)"

No, staying with the ETL "picture", staging is used to store and prepare data for the final load. Now, you can make "analysis and support" on staging data, but that's not the primary reason why you have it.

"integration layer"

Never heard of it. Very possibly some people think of their datawarehouses with an "integration layer", and that's fine with me. But that's not part of the standard "sacrosant" models put forth by Kimball and (as I remember) Inmon, which all people in the trade have heard of and respect (if not worships). Not that Kimball and Inmon have said the last word worth saying on datawarehousing, but when it comes the the structure of a datawarehouse and the ETL system, I suggest that we keep to them as refference points, because of the wide respect they hold in the industry.

sign Jomsviking (talk) 10:20, 7 January 2011 (UTC)

Please update this article.[edit]

This article is out of date and does not truly represent what the article is about. It simply lists two of the many known ways to normalize and store data into a Data Warehouse (data repository used for reporting).

I might add that two users have colluded to prevent information from reaching the rest of the world, based upon what they view as a conflict of interest; as I am the inventor of the concept. Yet, these same users have created a page which has direct links to websites where the owners are selling books, and other materials to promote their concepts.

While I have tried to expose a new to the world (as of July 2012) a Schema called the Spider Schema for data warehousing to this page, these users have removed all references to it, suggesting that someone else needs to add the content not the creator. I wonder, who is more knowledgeable than the creator, especially when its new?

While my attempt to list something new on this page is removed, the site I referenced has zero advertising or links that are not relevant to the Spider Schema. Everything is free for the world to consume, and comment on.

Introducing something new, when what is existing is over 30 years old (Star Schema), is not spam, advertising, or a conflict of interest; it's called inovation, something the Data Warehousing community badly needs.

MHargraves (talk) 14:06, 16 July 2012 (UTC)

I (and others) have explained to Mr. Hargraves on his personal talk page that the article will be updated to include his invention of the "spider schema" when 3rd party reliable sources cover the new material. Studerby (talk) 17:00, 18 July 2012 (UTC)

Introduction needs updating[edit]

In the third paragraph it says 'the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema.' That would eliminate the need for an access layer that is very briefly touched upon in the next sentence.

I think the quotation i mention is a description of the access layer and not the data warehouse. The data warehouse needs to be organised in a way that is abtracted away from any specific usage of data and is closer to a description of the "real" world. The "real" world doesn't really change much whereas the usage of data changes very frequently. This typically leads to a logical data model of the data warehouse being in 3NF.

My suggestion would be to have the last three sentences of the third paragraph look like the following paragraph.

The integrated data are then moved to yet another database, often called the data warehouse database, where data is stored independently of any current or future use of data typically in third normal form (3NF). The access layer helps users retrieve data that is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The Access layer can be implemented as views on the data warehouse or as physically instantiated objects populated with data from the data warehouse.

This describes the best architecture according to my more than 20 years experience with data warehousing.Ehaahr (talk) 15:16, 29 January 2013 (UTC)