Talk:Extract, transform, load

WikiProject Databases / Computer science (Rated C-class, High-importance)
This article is within the scope of WikiProject Databases, a collaborative effort to improve the coverage of database-related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.
This article is supported by WikiProject Computer science (marked as High-importance).

Jaspersoft ETL

IMO it makes sense to include Jaspersoft ETL in the list of proprietary ETL frameworks. —Preceding unsigned comment added by 178.236.241.148 (talk) 12:55, 16 February 2011 (UTC)

Refer ELT

I'm not an expert on this subject, but those of you who are should update the article to include at least a reference to ELT (Extract, Load, Transform). There are many articles about it, and many large companies use this approach for data warehouses of many hundreds of terabytes. —Preceding unsigned comment added by 201.48.130.65 (talk) 20:00, 28 December 2009 (UTC)

The original term "ETL" was popularized in the early 1990s by W.H. Inmon in "Building the Data Warehouse." Many advocate using the term "ELT" because the term "load" is typically used to describe the step of loading the data into the RDBMS. For example, in Oracle, the SQL*Loader tool is used to load external data into the database. After loading comes the transformation processing, hence ELT. — Preceding unsigned comment added by Randygrenier (talkcontribs) 15:34, 10 May 2015 (UTC)
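
To make the load-then-transform ordering concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of a real warehouse. The table names, columns, and sample rows are invented for illustration and do not come from the article; a production setup would use the database's own bulk loader instead of `executemany`.

```python
import sqlite3

# Hypothetical raw rows, standing in for an external file.
raw_rows = [("1", "alice", " 10.50"), ("2", "bob", "3.25 ")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id TEXT, customer TEXT, amount TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# Load: copy the raw data into the database unchanged.
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_rows)

# Transform: done afterwards, inside the database, in SQL.
conn.execute(
    """
    INSERT INTO orders (id, customer, amount)
    SELECT CAST(id AS INTEGER), UPPER(customer), CAST(TRIM(amount) AS REAL)
    FROM staging_orders
    """
)
conn.commit()

orders = conn.execute("SELECT id, customer, amount FROM orders ORDER BY id").fetchall()
print(orders)
```

In a classic ETL flow the trimming and casting would happen in the tool before the insert; here the database engine does that work after the raw load.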

Citation Needed??

Specifically referring to the statement, "Increasingly, companies are buying ETL tools to help in the creation of ETL processes.[citation needed]"... Really? I'm not sure we need citations when statements are common sense. Companies are always looking for ways to increase the bottom line, and IT departments for ways to increase the "plug-and-play" supportability of components. Certainly this is not accomplished through creating (or allowing to be created) tons of proprietary, in-house scripts. So, obviously, companies that discover ETL tools will buy ETL tools to create ETL processes. I wouldn't expect to see "Wind is the movement of air [citation needed]". That's just silly. 71.67.189.53 (talk) 01:42, 24 October 2009 (UTC)

I do think that the citation is needed to validate the term "Increasingly" used in the paragraph. I believe there is still a massive amount of manual creation of ETL without the use of ETL tools, so some proof that this is diminishing and that COTS products are filling that part of the market would be useful. 157.203.43.103 (talk) 14:35, 11 March 2010 (UTC)
I'm with the guy above. As someone who has worked in the industry for 20+ years, I see an awful lot of "ETL tool evaluation" going on, but companies (IMO) still shy away from the up-front cost and err towards sticking with their in-house scripted solutions. This might be changing, but we do need a citation. If it were my $2 I'd say it's not on the increase. Michaelfromtheuk (talk) 09:18, 17 July 2012 (UTC)

Totally agree. In fact, there are organizations that have aborted the use of ETL tools (after much expense) because of their lack of flexibility compared to programming in the RDBMS language (e.g. PL/SQL, Transact SQL, etc.) — Preceding unsigned comment added by Randygrenier (talkcontribs) 15:36, 10 May 2015 (UTC)

External links to other tools

I suggest that deleting the "External links to other tools" section is long overdue. It has become a spam magnet and a link repository, and Wikipedia is not a link repository. —Veyklevar 05:24, 12 May 2006 (UTC)

Very much agreed. I'd do it myself, but I'm biased.

I'd suggest that rather than a list of links, what would make more sense is a list of "major products" by market share - with citation, although this might be hard to come by. From my own experience (which is over 20 years) I'd say the top 3 are Oracle, IBM, and Ab Initio (which isn't even mentioned). Michaelfromtheuk (talk) 09:51, 5 March 2012 (UTC)

Copyvio

Some of the "Challenges" section of this article has been copied verbatim from [1]. Haakon 11:57, 22 May 2006 (UTC)

The image on this page looks like it has been copied from some book. Is that a copyright violation? I don't know how to sign this comment, sorry. — Preceding unsigned comment added by 208.91.190.85 (talk) 18:25, 1 May 2013 (UTC)

Some ETL Tools

This list of ETL tools is a small selection of the usable products. One recent piece of software is missing from the list: Talend Open Studio. I have found several English-language sources of information on this product, and they show the relevance of this software:


The official website is http://www.talend.com/default.html

I work for this company, so it would be preferable that I not be the author of this addition!

Can somebody help me complete the ETL article by adding this reference?

Ocarbone 13:36, 17 October 2006 (UTC)

Some more companies offering ETL tools are:
http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html
https://www.informatica.com/
http://www.cloveretl.com/
https://adeptia.com
http://support.sas.com/software/products/etls/
https://www.mulesoft.com/

Please add more providers if you know of any. — Preceding unsigned comment added by 125.63.70.135 (talk) 11:34, 7 March 2017 (UTC)

Hadoop based tools

Under the section that lists open source tools, shouldn't we mention tools like Cloudbase, Hive, and Pig (all based off Hadoop) as well? —Preceding unsigned comment added by Saurabhnanda (talkcontribs) 04:51, 2 June 2009 (UTC)

It certainly raises an interesting question, and big-data tools can have a part to play in processing very large data sets into aggregate sets more useful for analysis. However, Hadoop et al. typically don't perform the Extract or Load steps particularly well, and a traditional ETL tool is about orchestrating the movement of data as well as its transformation. Gradientdescent (talk) 05:21, 22 June 2012 (UTC)

Kettle

Kettle seems to be a synonym for Pentaho Data Integration. It would be useful to mention this somewhere in the tool list, and maybe also to have a disambiguation page for "Kettle". March 2010, ThP —Preceding unsigned comment added by 88.101.4.80 (talk) 10:30, 19 March 2010 (UTC)
dsfsdfdf —Preceding unsigned comment added by 202.67.6.11 (talk) 10:50, 12 August 2010 (UTC)

Performance Section

In addition to the criticism that has been posted as to the "recipe" nature of part of this section (do this first, then do that, etc.), as a database consultant with almost 20 years' experience, I actually disagree with the fundamental point the author is making here. He says the "bottleneck" is in the loading of the database. Firstly, the "speed" of a load is rarely the step that takes the longest time in any migration / data-cleansing type project. It's the *analysis* that must be performed to map data and data types(!) from disparate sources into a new "consolidated" target database. This involves many, many man-hours, meetings, and conversations with subject-matter experts. In other words, the "technical" aspect of actually loading the final data is in a sense the *easiest* part of the project. It's the figuring out of all the nasty problems that crop up *before* one has a clean mapping of old-->new data and old-->new data types that takes up the most time and intellectual effort. Finally, with the ever increasing speeds of computers and ever increasing quantities of fast memory, technical "loading" of data (yes, even with constraints etc. needing to be created) is simply not even in the top 5 problems on a person's mind when working on such a project (ETL).

In conclusion, there are many difficult intellectual problems associated with ETL. However the one mentioned by the author (DB load speed) is probably the least among them. Finally, is it our job to "rank" these problems (associated in a project)? I'm not sure. —Preceding unsigned comment added by Frumiousfalafel (talkcontribs) 19:34, 26 October 2009 (UTC)

I agree with you - Wikipedia shouldn't be a recipe book of 'how to do', so any section like this ought simply to mention (by way of example) the types of challenges and activities that occur within ETL, rather than describing any specific process, as it's rare that "one solution fits all". So maybe a 'Challenges' or 'Activities' section detailing what typically goes on, maybe linking to how this fits into an organisation. There could possibly be a reference back to 'IT maturity' (I think Gartner publish something on what identifies the evolution of company IT policies - though it's bound to be behind their paywall). Michaelfromtheuk (talk) 09:57, 5 March 2012 (UTC)

ETL Implementation

I am very confused as to whether this project is similar to "Data Mining in Heterogeneous Distributed System" or completely different. —Preceding unsigned comment added by 116.75.132.149 (talk) 18:20, 11 August 2010 (UTC)

Dubious

The section Transform says: "Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this calls for automated Data cleansing; no manual cleansing occurs during ETL". The Data cleansing article talks about removing or correcting invalid information, which I would see as different and separate from transcoding. In other words if the data source has "Male" and "Female" and the destination "M" and "F" then this change is transcoding but removing a record that has "Hermaphrodite" in the column would be data cleansing. -- Q Chris (talk) 15:22, 11 April 2011 (UTC)
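
The distinction drawn above can be sketched in a few lines of Python. The code table, field names, and sample rows are hypothetical and only illustrate the point: `transcode` re-encodes values that are already valid, while `cleanse` decides what to do with values that are not.

```python
# Hypothetical code table: the source stores 1/2, the warehouse stores M/F.
CODE_MAP = {"1": "M", "2": "F"}


def transcode(value):
    """Translate a valid source code into the warehouse code (pure re-encoding)."""
    return CODE_MAP[value]


def cleanse(rows):
    """Split rows into those with a recognised code and those needing correction or removal."""
    valid = [r for r in rows if r["sex"] in CODE_MAP]
    rejected = [r for r in rows if r["sex"] not in CODE_MAP]
    return valid, rejected


rows = [{"id": 1, "sex": "1"}, {"id": 2, "sex": "2"}, {"id": 3, "sex": "9"}]

# Cleansing removes the row with the unrecognised code ...
valid, rejected = cleanse(rows)
# ... and transcoding then maps the remaining codes one-to-one.
loaded = [{"id": r["id"], "sex": transcode(r["sex"])} for r in valid]
print(loaded, rejected)
```

Keeping the two steps in separate functions makes it visible that transcoding never discards information, whereas cleansing does.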

Disable Triggers

The author recommends "Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step." to gain speed. I don't consider this good practice, because triggers ensure database integrity, and if the ETL process fails after the initial step, this might mess up the database completely. — Preceding unsigned comment added by 88.1.46.185 (talk) 08:12, 20 October 2011 (UTC)
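
One way to limit the risk described above is to keep the trigger change, the load, and the simulated trigger effect inside a single transaction, so that a failure restores everything. The sketch below is an illustration only, not the article's method: it uses SQLite through Python's sqlite3 module, where `DROP TRIGGER` stands in for an engine-specific "disable trigger" statement, and all table and trigger names are invented. It relies on SQLite's transactional DDL; engines without it would need a different safeguard.

```python
import sqlite3

# isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("CREATE TABLE audit (account_id INTEGER, note TEXT)")

# Hypothetical trigger that the load wants to bypass for speed.
TRIGGER_SQL = """
CREATE TRIGGER accounts_audit AFTER INSERT ON accounts
BEGIN
    INSERT INTO audit VALUES (NEW.id, 'inserted');
END
"""
conn.execute(TRIGGER_SQL)


def bulk_load(rows):
    """Load rows with the trigger removed; roll everything back on failure."""
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TRIGGER accounts_audit")  # stand-in for "disable"
        conn.executemany("INSERT INTO accounts VALUES (?, ?)", rows)
        # Simulate the trigger's effect for this batch as a separate step.
        conn.executemany(
            "INSERT INTO audit VALUES (?, 'inserted')", [(r[0],) for r in rows]
        )
        conn.execute(TRIGGER_SQL)  # re-create the trigger
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # also undoes the DROP TRIGGER
        return False


failed = bulk_load([(1, 10.0), (2, None)])  # second row violates NOT NULL
succeeded = bulk_load([(1, 10.0), (2, 5.0)])
print(failed, succeeded)
```

After the failed batch the table is empty and the trigger is still in place, which is the property the comment above is worried about losing.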

ETL Alternatives

A web search backs up my feeling that the alternative 'data model recasting' listed in the section is not notable and the references cannot be considered reliable. Gradientdescent (talk) 02:12, 5 June 2012 (UTC)