Talk:Extract, transform, load
This is the talk page for discussing improvements to the Extract, transform, load article.
This is not a forum for general discussion of the article's subject.
WikiProject Databases / Computer science (Rated C-class, High-importance)
I'm not an expert on this subject, but you guys who are experts should update this to include at least a reference to ELT (extract, load, transform). You can find many articles about this, and many huge companies using this approach for hundreds-of-terabytes data warehouses. —Preceding unsigned comment added by 18.104.22.168 (talk) 20:00, 28 December 2009 (UTC)
The original term "ETL" was popularized in the early 1990s by W.H. Inmon in "Building the Data Warehouse." Many advocate using the term "ELT" because the term "load" is typically used to describe the step of loading the data into the RDBMS. For example, in Oracle, the "SQL Loader" tool is used to load external data into the database. After loading comes the transformation processing--hence ELT. — Preceding unsigned comment added by Randygrenier (talk • contribs) 15:34, 10 May 2015 (UTC)
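The ETL/ELT distinction discussed above comes down to where in the pipeline the transformation step runs. A minimal sketch, in which the extract/transform/load functions and the sample data are illustrative placeholders rather than any real tool's API:

```python
# Illustrative sketch of ETL vs. ELT step ordering. All functions and
# data here are hypothetical placeholders, not part of any real product.

def extract():
    # Pull raw rows from a source system (hard-coded sample data here).
    return [{"id": 1, "gender_code": 1}, {"id": 2, "gender_code": 2}]

def transform(rows):
    # Recode source values into the warehouse's conventions.
    code_map = {1: "M", 2: "F"}
    return [{"id": r["id"], "gender": code_map[r["gender_code"]]} for r in rows]

def load(rows, table):
    # Append rows to the (simulated) warehouse table.
    table.extend(rows)

# ETL: transform *before* loading into the warehouse.
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: load the raw data first, then transform inside the warehouse.
elt_staging = []
load(extract(), elt_staging)
elt_warehouse = transform(elt_staging)
```

Either ordering produces the same final rows; the difference is whether the transformation work happens before the RDBMS load or inside the database afterwards.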
Specifically referring to the statement, "Increasingly, companies are buying ETL tools to help in the creation of ETL processes"... Really? I'm not sure we need citations when statements are common sense. Companies are always looking for ways to increase the bottom line, and IT departments to increase the "plug-and-play" supportability of components. Certainly this is not accomplished through creating (or allowing to be created) tons of proprietary, in-house scripts. So, obviously, companies that discover ETL tools will buy ETL tools to create ETL processes. I wouldn't expect to see "Wind is the movement of air" flagged as needing a citation. That's just silly. 22.214.171.124 (talk) 01:42, 24 October 2009 (UTC)
- I do think that the citation is needed to validate the term "Increasingly" used in the paragraph. I believe there is still a massive amount of manual creation of ETL without the use of ETL tools, so some proof that this is diminishing and that the use of COTS products is filling in that part of the market share would be useful. 126.96.36.199 (talk) 14:35, 11 March 2010 (UTC)
- I'm with the guy above. As someone who has worked in the industry for 20+ years, there's an awful lot of "ETL Tool evaluation" that goes on, but companies (IMO) still shy away from the up-front cost and err towards sticking with their in-house scripted solutions. This might be changing, but we do need a citation. If it were my $2 I'd say it's not on the increase. Michaelfromtheuk (talk) 09:18, 17 July 2012 (UTC)
Totally agree. In fact, there are organizations that have aborted the use of ETL tools (after much expense) because of their lack of flexibility compared to programming in the RDBMS language (e.g. PL/SQL, Transact SQL, etc.) — Preceding unsigned comment added by Randygrenier (talk • contribs) 15:36, 10 May 2015 (UTC)
I suggest that deleting the "External links to other tools" section is long overdue. It has become a spam magnet and a link repository, and Wikipedia is not a link repository. —Veyklevar 05:24, 12 May 2006 (UTC)
Very much agreed. I'd do it myself, but I'm biased.
I'd suggest that rather than a list of links, what would make more sense is a list of "major products" by market share - with citation, although this might be hard to come by. From my own experience (which is over 20 years) I'd say the top 3 are Oracle, IBM, and Ab Initio (which isn't even mentioned). Michaelfromtheuk (talk) 09:51, 5 March 2012 (UTC)
The Image on this page looks like it has been copied from some book. Is that a copyright violation? I don't know how to sign this comment, sorry. — Preceding unsigned comment added by 188.8.131.52 (talk) 18:25, 1 May 2013 (UTC)
Some ETL Tools
This list of ETL tools is a small selection of the usable products. One recent product isn't present in this list: Talend Open Studio. I have found several English sources of information about this product, and they show its relevance:
The Official website is http://www.talend.com/default.html
I currently work for this company, so it would be preferable that I not be the author of this addition!
Can somebody help me complete the ETL article by adding this reference?
Ocarbone 13:36, 17 October 2006 (UTC)
Some more companies offering ETL tools are:
http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html
https://www.informatica.com/
http://www.cloveretl.com/
https://adeptia.com
http://support.sas.com/software/products/etls/
https://www.mulesoft.com/
Hadoop based tools
Under the section that lists open source tools, shouldn't we mention tools like Cloudbase, Hive, and Pig (all based on Hadoop) as well? —Preceding unsigned comment added by Saurabhnanda (talk • contribs) 04:51, 2 June 2009 (UTC)
It certainly raises an interesting question, and big-data tools can have a part to play in processing very large data sets into aggregate sets more useful for analysis. However, Hadoop et al. typically don't perform the Extract or Load steps particularly well, and a traditional ETL tool is about orchestrating the movement of data as well as its transformation. Gradientdescent (talk) 05:21, 22 June 2012 (UTC)
Kettle seems to be a synonym for Pentaho Data Integration. It would be useful to mention this somewhere in the tool list, and maybe also have a disambiguation page for "Kettle". March 2010, ThP —Preceding unsigned comment added by 184.108.40.206 (talk) 10:30, 19 March 2010 (UTC)
In addition to the criticism that has been posted as to the "recipe" nature of part of this section (do this first, then do that, etc.), as a database consultant with almost 20 years' experience, I actually disagree with the fundamental point the author is making here. He says the "bottleneck" is in the loading of the database. Firstly, the "speed" of a load is rarely the step that takes the longest time in any migration / data-cleansing type project. It's the *analysis* that must be performed to map data and data types(!) from disparate sources into a new "consolidated" target database. This involves many, many man-hours, meetings, talking to subject-matter experts, etc. In other words, the "technical" aspect of actually loading the final data is in a sense the *easiest* part of the project. It's figuring out all the nasty problems that crop up *before* one has a clean mapping of old-->new data and old-->new data types that takes up the most time and intellectual effort. Finally, with the ever-increasing speed of computers and ever-increasing quantities of fast memory, the technical "loading" of data (yes, even with constraints etc. needing to be created) is simply not even in the top 5 problems on a person's mind when working on such a project (ETL).
In conclusion, there are many difficult intellectual problems associated with ETL. However, the one mentioned by the author (DB load speed) is probably the least among them. Finally, is it our job to "rank" these problems (associated with a project)? I'm not sure. —Preceding unsigned comment added by Frumiousfalafel (talk • contribs) 19:34, 26 October 2009 (UTC)
I agree with you - Wikipedia shouldn't be a recipe book of 'how to do', so any section like this ought to simply mention (by way of example) the types of challenges and activities that occur within ETL, rather than describing any specific process, as it's rare that "one solution fits all". So maybe a 'Challenges' or 'Activities' section detailing what typically goes on, maybe linking into how this fits into an organisation. There could possibly be a refer-back to 'IT maturity' (I think Gartner publish something on what identifies the evolution of company IT policies - though bound to be behind their pay-wall). Michaelfromtheuk (talk) 09:57, 5 March 2012 (UTC)
I am confused as to whether this project is just similar to "Data Mining in heterogeneous Distributed Systems" or is completely different. —Preceding unsigned comment added by 18.104.22.168 (talk) 18:20, 11 August 2010 (UTC)
The section Transform says: "Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this calls for automated Data cleansing; no manual cleansing occurs during ETL". The Data cleansing article talks about removing or correcting invalid information, which I would see as different and separate from transcoding. In other words if the data source has "Male" and "Female" and the destination "M" and "F" then this change is transcoding but removing a record that has "Hermaphrodite" in the column would be data cleansing. -- Q Chris (talk) 15:22, 11 April 2011 (UTC)
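The distinction drawn above can be sketched in a few lines. This is a minimal illustration, not any tool's actual behaviour; the field names and the code map are assumptions made for the example:

```python
# Sketch of transcoding vs. data cleansing, per the distinction above:
# transcoding maps known source values onto warehouse codes, while
# cleansing rejects (or corrects) values invalid for the target.
# The "gender" field and CODE_MAP are illustrative assumptions.

CODE_MAP = {"Male": "M", "Female": "F"}

def transcode_and_cleanse(rows):
    loaded, rejected = [], []
    for row in rows:
        value = row.get("gender")
        if value in CODE_MAP:
            # Transcoding: a pure change of representation.
            loaded.append({**row, "gender": CODE_MAP[value]})
        else:
            # Cleansing: the value has no valid target representation,
            # so the record is removed (or routed for correction).
            rejected.append(row)
    return loaded, rejected

loaded, rejected = transcode_and_cleanse(
    [{"id": 1, "gender": "Male"},
     {"id": 2, "gender": "Female"},
     {"id": 3, "gender": "Hermaphrodite"}]
)
```

Rows 1 and 2 are merely transcoded; row 3 falls outside the target's domain, so dropping it is cleansing rather than transcoding.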
The author claims "Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step." to gain speed. I don't consider this being a good practice, because triggers ensure database integrity, and if the ETL process fails after the initial step, this might mess up the database completely. — Preceding unsigned comment added by 22.214.171.124 (talk) 08:12, 20 October 2011 (UTC)
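For readers unfamiliar with the pattern being criticised, here is a minimal sketch using SQLite (which has no DISABLE TRIGGER statement, so the trigger is dropped and recreated instead). The schema and the trigger are illustrative assumptions; the comments mark exactly the window the comment above warns about, where invalid rows sit in the table unchecked:

```python
import sqlite3

# Sketch of the "disable triggers during the load, simulate their effect
# as a separate step" practice described in the article. Schema and
# trigger are hypothetical examples.

conn = sqlite3.connect(":memory:")
trigger_ddl = """
    CREATE TRIGGER check_gender BEFORE INSERT ON customers
    WHEN NEW.gender NOT IN ('M', 'F')
    BEGIN
        SELECT RAISE(ABORT, 'invalid gender code');
    END;
"""
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, gender TEXT)")
conn.executescript(trigger_ddl)

rows = [(1, "M"), (2, "F"), (3, "X")]  # row 3 violates the rule

# Step 1: drop the trigger so the bulk load runs at full speed.
# From here until step 2 completes, integrity is NOT enforced -- the
# risk raised in the comment above.
conn.execute("DROP TRIGGER check_gender")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Step 2: simulate the trigger's effect as a separate validation pass.
bad = conn.execute(
    "SELECT id FROM customers WHERE gender NOT IN ('M', 'F')"
).fetchall()

# Step 3: recreate the trigger for normal operation.
conn.executescript(trigger_ddl)
```

If the process dies between steps 1 and 2, the invalid row (id 3) remains in the table with no trigger in place, which is exactly the integrity hazard the comment describes.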
A web search backs up my feeling that the alternative 'data model recasting' listed in the section is not notable and the references cannot be considered reliable. Gradientdescent (talk) 02:12, 5 June 2012 (UTC)