Talk:Apache Hadoop

From Wikipedia, the free encyclopedia
WikiProject Java (Rated C-class, Low-importance)
This article is within the scope of WikiProject Java, a collaborative effort to improve the coverage of Java on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.
WikiProject Computing (Rated C-class)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has not yet received a rating on the project's importance scale.


I started taking a crack at cleaning this page up. Started with the description. Please let me know if this looks okay. If so, I'll add more citations and proceed to fixing the main body. Vinod (talk) 16:12, 28 October 2013 (UTC)

I still have no idea WHAT this thing does!

And I'm in IT! What is the (at least intended) reading audience for this article? If it's other Hadoop experts then it needs a total rewrite. — Preceding unsigned comment added by (talk) 19:11, 25 January 2013 (UTC)

Haha, my thoughts exactly! This article is a huge pile of buzzwords, none of which is understandable to the general public. A translation from Marketing to English would be much appreciated. (talk) 10:27, 25 August 2013 (UTC)

This is an important part of IT infrastructure that underpins many emerging technologies. Sadly, the whole article is so badly written that many readers are struggling to understand Hadoop from this explanation. Vote: +1 rewrite. Andmark (talk) 11:56, 1 September 2013 (UTC)

Uh... Yes... But... I can clarify that I've read lots of little bits about Hadoop over the last few years, but my proximate cause for reading the article was some corporate presentations about so-called big data. Overall the article has failed to provide me with the kind of insight that I was hoping for, but I admit that part of my framing was that I expected to see more on the position of Amazon vis-à-vis Hadoop, insofar as the company that employs me is mentioned in several places and I had somehow reached the conclusion that Amazon was the big barrier to overcome here... Should I conclude that Amazon is much less relevant to Hadoop than I thought, or that the article has a PoV against Amazon? I sort of agree with the comment about too many buzzwords, too, but mostly I'm just kind of disappointed in the lack of enlightenment after spending the time to read the entire thing fairly carefully. I actually feel I do have a small idea of "WHAT this thing [Hadoop] does", but it wasn't helped or refined by the article. Shanen (talk) 07:45, 9 December 2013 (UTC)

Visit their website; and I will too. -- Charles Edwin Shipp (talk) 15:42, 30 May 2014 (UTC)

I have a degree in Computing Science and I still don't know what Hadoop is after reading this article. There needs to be an explanation of what it does, and what it is for, before any talk about its components. If describing a car, I would start with it being a vehicle that is driven by a person, and that cars typically transport between 1 and 7 people. I would not begin by describing it as a framework consisting of an engine, gearbox and body. The article needs to be written as an encyclopedia article. FreeFlow99 (talk) 13:15, 6 June 2014 (UTC)
Another request for a plain explanation of what Hadoop is. The fog arrived for me when "data set" was used where I expected "data". Seriously, this topic is important enough to deserve the attention of a subject matter expert to explain what this magical blend of project, framework, distributed file system, distributed database, distributed operating system, etc. is. I echo the above comments. patsw (talk) 14:55, 19 August 2014 (UTC)

What Hadoop is Not

It is not a cluster. Distributed computing and cluster computing are two very different things.

A 100-computer Hadoop system with 100 CPUs can never have all 100 CPUs working on the data held on one system. Each system with one CPU can have precisely one CPU working on the data on that node. You own 100 CPUs but get the benefit of only the CPUs that happen to be local to the data being worked on. A true cluster (Slurm, MPI, Ganglia, etc.) allows all the CPUs you own to work with whatever data you have, to whatever extent they can access the data and share the computation. In Hadoop, data needing more processor attention must be duplicated as many times as you need processors to work on it. In practice, 10 or more copies of the same data may be needed. If the data is already massive, this duplication can be costly and prohibitive. It is possible that data-aware clustering has made Hadoop and similar poor man's cluster technologies obsolete. Rocks clusters and Beowulf-style clusters with a data-aware Slurm implementation can outperform it at a lower cost and with less duplication of data.

In any case published Hadoop data from government users reveal that the cost of electricity is often so high no savings are realized over traditional data warehousing. Scottprovost (talk) 18:52, 1 September 2013 (UTC)
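To make the replication argument above concrete, here is a minimal sketch in plain Python. It is a simplified model, not Hadoop code: it assumes a task runs only on a node that already holds a copy of the block, and the names `local_workers`, `storage_needed`, the node names, and the block IDs are all hypothetical.

```python
def local_workers(placement, block_id):
    """Return the nodes that hold a replica of block_id (assumed to be the only eligible workers)."""
    return sorted(node for node, blocks in placement.items() if block_id in blocks)


def storage_needed(data_size_tb, replicas):
    """Raw storage consumed when every byte is kept `replicas` times."""
    return data_size_tb * replicas


# Hypothetical placement: which blocks each node stores locally.
placement = {
    "node1": {"blk_a", "blk_b"},
    "node2": {"blk_a"},
    "node3": {"blk_b", "blk_c"},
    "node4": {"blk_a", "blk_c"},
}

print(local_workers(placement, "blk_a"))  # ['node1', 'node2', 'node4']
print(storage_needed(50, 10))             # 500
```

Under this model, the number of CPUs that can work on a block equals the number of replicas, so ten-way parallelism on 50 TB of data would need 500 TB of raw storage.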

What Hadoop Is

Over the years Hadoop has become many things and applications, most of which can be run without HDFS or even any core Hadoop components. It would be a good addition to this article to provide a list of, and links to, the 40-plus components that have become known as part of, or in themselves, "Hadoop". This is sometimes referred to as the alphabet soup of the "Hadoop ecosystem": 2. Apache Pig 3. Apache Hive 4. Apache HCatalog 5. Apache HBase 6. Apache ZooKeeper 7. Apache Oozie 8. Apache Sqoop 9. Apache Flume 10. Apache Mahout ... Scottprovost (talk) 16:58, 15 March 2014 (UTC)


The article says "On February 19, 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query."

I thought that Bing was powering Yahoo search??? Kitplane01 (talk) 18:54, 26 August 2010 (UTC)

Y! are switching/have switched to Bing for index and search; I don't know what they use those same clusters for now, but as of August they were running a 4000-machine cluster, as mentioned on the Hadoop general mailing list [1]. That cluster has the largest number of machines in a single Hadoop cluster, though it is believed that Facebook have a bigger filestore in a cluster with fewer machines. (Newer servers have more higher-capacity disks in them.) I have a photo of Arun and Owen from Y! running Terasort on one of Y!'s clusters at Apachecon 2009; this includes a screenshot of the laptop as they set the then record for the petasort benchmark; this might make a good addition to the article.
Yahoo! runs more than 38,000 nodes across its various Hadoop clusters, the largest of which are 4,000 nodes. Even after the Bing switch-over, the clusters are used for analytics, machine-learning, ad targeting, content customization, etc. Yahoo! is still by far the largest user of Hadoop. —Preceding unsigned comment added by (talk) 07:30, 28 September 2010 (UTC)


This feels a little too much like promotional literature to me.

I don't think that's the case, but it is just fairly minimal right now. What we need is some information on the underlying architecture, some discussion of its strengths (scales) and weaknesses (Name node is a single point of failure, base performance not great, can be tricky to nurture if you don't know how to manage a cluster). Are you volunteering to add these? SteveLoughran (talk) 21:46, 23 June 2008 (UTC)
I agree that it looks more like a marketing brochure than a real Wikipedia entry. 14:00, 30 October 2009 (UTC)
Added an architecture section, including coverage of limitations and specifics of the filesystems. Better?

I think this is a good overview of Hadoop ... concise ... relates the project and product well to the Who What Where and Why you'd be looking for in an Encyclopedia entry. The only thing I'd add is comparative discussion of other ways similar problems are solved to anchor context (FreddyMack (talk) 14:02, 14 April 2009 (UTC))

I would like to know what is involved in implementing it. What sort of limitations are imposed on a developer writing data processing code for this system? What sort of techniques can be used to make code more efficient for such a setup? Chillum 03:41, 21 May 2009 (UTC)

Google patents Hadoop?

Excerpt from

In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications. (talk) 17:35, 23 February 2010 (UTC)

Oh yeah? So they want to forbid anyone else from slicing an SQL query over several servers within a cluster? Doesn't make any sense to me... -- (talk) 12:25, 12 January 2014 (UTC)

Podcast with Hadoop

A recent Software Engineering Radio podcast was about Hadoop:

Episode 157: Hadoop with Philip Zeyliger. Released 2010-03-08. Direct download URL for MP3. Length: 51 minutes 04 seconds.

It could be included in the article, e.g. in External Links, as in the article "Aspect-oriented programming".

--Mortense (talk) 12:00, 9 March 2010 (UTC)

Hadoop Podcast Focused On All Things Hadoop

Perhaps this can be put into the main Hadoop page as a resource. —Preceding unsigned comment added by Omniomega (talk • contribs) 04:42, 5 September 2010 (UTC)

What is the problem that Apache Hadoop is trying to solve?

I read the article, but was unable to separate out the problem that the system seeks to solve from the implementation details. As far as I can tell, it seems to be useful wherever there is a large quantity of file-based data which can be processed independently from other data, but is expensive to transfer. This seems to read as if the problem is to create an index (hashmap?) that can direct you to an appropriate node to compute on. Is this right? Can someone splice in a section after the lede to aid understanding of this? (talk) 11:46, 19 April 2011 (UTC)

Actually, it's solving very different things. E.g. MapReduce is about accessing different clusters containing different data (where a cluster consists of several servers containing the exact same data). So it's basically distributing the SQL query, then asking each server for the result of a different subset, and finally merging the data to create one data set. However, this can be done easily, and probably any large-scale DB developer already does it. Finally, I think that Hadoop is great as a distributed file server, but only that, since distributed DB queries can easily be done without Hadoop. Anyway, it's basically a Java query implementation; the question is, do we need it, or shouldn't we just implement our own map-reducing systems? -- (talk) 12:31, 12 January 2014 (UTC)
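For anyone following this thread, the distribute-then-merge idea can be sketched in a few lines. This is a minimal single-process illustration in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real framework would run the map tasks on separate machines.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group values by key across the outputs of all map tasks."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Merge each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    # In a real deployment each split would be handled by a separate map task.
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))


# Hypothetical input, already divided into two splits.
splits = [
    ["hadoop stores data", "hadoop processes data"],
    ["data is split into blocks"],
]
counts = word_count(splits)
print(counts["data"])    # 3
print(counts["hadoop"])  # 2
```

Each split is processed independently, and results are combined only in the shuffle and reduce steps, which is why the pattern suits data that can be processed in independent pieces.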

Hadoop inspired by Google's GFS and MapReduce

The introduction erroneously says that Hadoop inspired Google's MapReduce and GFS. It is the other way around. Sanjay Ghemawat et al. published the GFS paper in 2003 [2], and Jeffrey Dean and Sanjay Ghemawat published the MapReduce paper in 2004 [3]. Hadoop developers have clearly stated that they used these works as inspiration to solve their scalability problems [4] [5]. (talk) 13:44, 1 June 2011 (UTC)

Well spotted! Someone edited the page last week and flipped the credits. Reverted and added another warning to the IP address. SteveLoughran (talk) 20:38, 1 June 2011 (UTC)

Current Hadoop Versions are wrong

The current Hadoop versions rendered in the infobox are wrong. Version 1.0.0 is the current beta version for the 1.0X branch, and 0.20.203.X is the current stable version from the 0.20 branch [6]. — Preceding unsigned comment added by Aalexand85 (talk • contribs) 15:07, 2 February 2012 (UTC)

HDFS Not Mountable?

The section on HDFS contains the following paragraph, "Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems."

That's a pretty big contradiction: with a FUSE-based filesystem for HDFS, it can be mounted by an existing operating system. Also, what's the deal with the phrase "existing operating system"? Is that opposed to an operating system that doesn't even exist? Onlynone (talk) 17:22, 20 April 2012 (UTC)


This website has some of the same content as the article: [7]

Do we think it's someone copying Wikipedia, or could it be a copyvio? Andrew327 07:57, 2 April 2013 (UTC)

Data nodes can talk to themselves?

"Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high." This is wrong!

Jargon and techno-babble

The intro claims that Hadoop provides "reliability and data motion to applications". Data motion? Is this like interpretive dance? A bit ballet? The term "data motion" is undefined elsewhere in WP (thank heaven). It is also the name of a company and product line (that has nothing to do with Hadoop). As well, a previous entry in this talk page draws attention to text in the article where process nodes "talk to each other". Do they do this via Twitter? Or do they use couriers on cyber bikes like in the movie Tron? The intro also refers to "computation-independent computers". Nice to see computers finally moving away from being dependent on computation...

This whole article needs a re-write to avoid sloppy writing, breezy jargon, and dubious techno-babble. Ross Fraser (talk) 22:12, 15 July 2013 (UTC)

A glossary and advisory statement at the beginning of the article would go a long way toward demystifying it. 2601:2:8D00:1E3:E986:AB21:7172:AA44 (talk) 19:40, 22 July 2014 (UTC) John Beale

Stratosphere extends Hadoop

There is no mention of Stratosphere. — Preceding unsigned comment added by (talk) 09:28, 26 March 2014 (UTC)

Unbelievably bad bad bad article

The beginning of this article reads as follows:

"Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. "

This is fine, but the crucial word "processing" is one of the vaguest verbs in the English language and requires immediate elaboration.

Unfortunately, the word is nowhere elaborated. As a result, readers are left with the impression that Hadoop does nothing whatsoever.

Instead, all we get is innumerable paragraphs about its underlying architecture.

This is totally unacceptable. Hadoop is above all defined by what it does, not by how it is built. So: if the architecture paragraphs are left in the article, they belong only after a good description of what Hadoop does.

As currently constituted, this article is exactly as if an article about Facebook mentioned in its first sentence that it was "social software" and then, with no elaboration on that description, proceeded to discuss for many, many paragraphs the software architecture of Facebook. That is how utterly ridiculous this article is.

I strongly urge that this article either be fixed immediately to explain what Hadoop does, or that it be removed, lest it give other unknowledgeable editors the wrong idea about what an encyclopedia article should be. Daqu (talk) 16:12, 30 October 2014 (UTC)

OK, I rewrote the introduction. I apologize, though, for the WP:SELFPUBLISH -- I couldn't find any other good source. Michaelmalak (talk) 17:17, 30 October 2014 (UTC)