Talk:MapReduce

WikiProject Google (Rated C-class, Low-importance)
This article is within the scope of WikiProject Google, a collaborative effort to improve the coverage of Google and related topics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.

Why no 'needs work' tag?

Hello; I'm a new contributor; sorry for this comment's simplicity... This article on (what I would have thought) an important topic is pretty bad; the C grade it's received reflects this. It's also considered of "low-importance"; I have not yet located the project's importance scale, so I don't understand this. MY QUESTION: why is there no "needs work" indicator at the top of the article itself, like I see in so many others? Readers should be warned. DrTLesterThomas (talk) 15:01, 26 January 2014 (UTC)

Comparison with fork-join

Hi there,

I think a comparison with fork-join would be helpful. These two concepts seem similar, and pointing out both the similarities and differences would be helpful for the reader. There's an academic paper on the subject actually. Thank you. 205.175.116.125 (talk) 21:03, 18 March 2014 (UTC)

More prior art

I first heard about map-reduce frameworks in a lecture at Imperial College in 1994 given by Qian Wu. [This paper] covers some of that work. I wish I could find the lecture notes because they actually contained a diagram which was map-reduce. I'm just putting this here for people looking for prior art against Google's patent. Richard W.M. Jones (talk) 12:14, 3 May 2014 (UTC)

Interesting pointer. However, note that map-reduce is, IMHO, not about map or reduce. It's about optimizing the shuffle once, to get fail-safe recovery from machine loss for a whole class of programs. I don't know whether fail-safety and recovery have been handled in this prior art. (As for map and reduce, these have been common in functional programming anyway.) Nevertheless, your source would make a good addition to the article to emphasize this point: map-reduce is not about the map+reduce functions themselves, but about how to make this scale. --Chire (talk) 08:36, 5 May 2014 (UTC)
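
To make the point above concrete, here is a minimal single-process Python sketch of the three phases. It is only an illustration under stated assumptions: all function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`, `word_mapper`, `count_reducer`) are hypothetical and not taken from any real framework, and the sketch does not model distribution, machine loss, or recovery. It merely shows that the user supplies the map and reduce functions, while the grouping step in between (the shuffle) belongs to the framework.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-supplied mapper to every input record."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key.

    In a real system this is the step the framework distributes and
    makes fault tolerant; here it is just an in-memory grouping.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# User code: a word count expressed as one mapper and one reducer.
def word_mapper(document):
    return [(word, 1) for word in document.split()]


def count_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], word_mapper, count_reducer))
```

Swapping in a different mapper and reducer changes the job without touching `shuffle`, which is the sense in which the framework, rather than the two functions, carries the scaling work.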

Request for cleanup of Talk Page

I am new to this topic on Wikipedia (and a bit new to this page), though I already had a good understanding of the MapReduce concept. As suggested at the top of the article, I wanted to improve the article to make it easier to understand. To do so, I looked at the previous discussions and feedback, which seem pretty old (more than 3-4 years old now, in 2015). It is also difficult to tell which comments have already been addressed and which ones still need attention. For instance, the discussions regarding the examples (K1, K2) and the citation cleanup seem to have been addressed already. If some existing followers of this topic could shed some light on what is done and what is still pending, that would be helpful. -- Preceding unsigned comment added by Vishal0soni (talk · contribs) 13:30, 2 January 2015 (UTC)

If you notice that some discussion point no longer applies, consider adding a {{done}} template or {{Resolved mark}} or {{Fixed}} to it, so that others can easily see that it does not require attention. (See Template:done for a list of such markers.) Later on, we can also add a Wikipedia:Archive, once the outdated and the still-current discussions have been flagged.
As for your YARN change, I reverted it, sorry. YARN is not a "programming model", and MapReducev2 is not a new model either. YARN is a Hadoop API change, but I don't think it is notable on its own for Wikipedia. It coincides with the Hadoop 2 milestone, and it cannot be used without Hadoop. It is a refactoring of the Hadoop codebase to allow sharing certain code between MapReduce and other jobs. MapReducev2 isn't fundamentally different: it's simply Hadoop's MR, now using YARN for resource management instead of having its own resource management. I'm not aware of any major breakthrough on MR enabled by YARN; from an MR point of view this is only maintenance. The appropriate article for YARN is Apache Hadoop, and it is already covered there. --138.246.2.241 (talk) 18:00, 2 January 2015 (UTC)
Thanks for the suggestion about using resolved templates. I'll have another look at the suggestions and try to update the article and talk page accordingly.

Regarding YARN, can you please provide references for your explanations? For the changes I made, I had already cited an appropriate source. As far as I understand, with MRv2 the entire architecture and functioning of MapReduce has changed. The Job Tracker and Task Tracker no longer exist; they are replaced by the resource manager, the Application manager and a few other additional components. So the entire processing workflow has been redefined. Vishal0soni (talk) 02:15, 3 January 2015 (UTC)

Inclusion of YARN

Statement by 138.246.2.241 (talk): As for your YARN change, I reverted it, sorry. YARN is not a "programming model", and MapReducev2 is not a new model either. YARN is a Hadoop API change, but I don't think it is notable on its own for Wikipedia. It coincides with the Hadoop 2 milestone, and it cannot be used without Hadoop. It is a refactoring of the Hadoop codebase to allow sharing certain code between MapReduce and other jobs. MapReducev2 isn't fundamentally different: it's simply Hadoop's MR, now using YARN for resource management instead of having its own resource management. I'm not aware of any major breakthrough on MR enabled by YARN; from an MR point of view this is only maintenance. The appropriate article for YARN is Apache Hadoop, and it is already covered there. --138.246.2.241 (talk) 18:00, 2 January 2015 (UTC)

Regarding YARN, can you please provide references for your explanations? For the changes I made, I had already cited an appropriate source. As far as I understand, with MRv2 the entire architecture and functioning of MapReduce has changed. The Job Tracker and Task Tracker no longer exist; they are replaced by the resource manager, the Application manager and a few other additional components. So the entire processing workflow has been redefined. Vishal0soni (talk) 02:15, 3 January 2015 (UTC)
A few more references:
  • "MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN." as stated by Apache Software Foundation [1]
  • "Sometimes called MapReduce 2.0, YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component,..." [2]
This only says they changed their implementation of MapReduce. But the article is not about Hadoop MapReduce; it is about the MapReduce concept. Can you find any reference about anything that has changed on the theoretical side (not Hadoop implementation details)? Class name changes etc. are not of interest, nor is the fact that they like calling their new implementation (of the same concept!) MRv2 now. The references you have are for nonencyclopedic implementation details of Hadoop, not for MapReduce as a processing model. As far as I know, YARN/MRv2 is still the same MapReduce, only implemented slightly differently internally. --94.216.222.254 (talk) 17:04, 5 January 2015 (UTC)
I totally agree that this is about the generic MapReduce concept and not about its Hadoop implementation. My main concern was that, in the technology world, we very commonly hear the terms MapReduce and its so-called later version YARN together (it is only an implementation-level change, but it has become a much-discussed topic). So I would like YARN to be mentioned where someone is reading about the MapReduce concept. That is why I suggested just mentioning the latest update to Apache's MapReduce implementation, right next to where Apache Hadoop is mentioned in the article. Vishal0soni (talk) 05:16, 15 January 2015 (UTC)
If it's about the implementation only, then it should be mentioned under MapReduce#Implementations_of_MapReduce. But it already says "Apache Hadoop", which includes both the old MapReduce and YARN, doesn't it? It's not as if YARN wasn't Hadoop! And I could not spot any reference specific to Apache Hadoop MapReduce v1, where it would make sense to mention the "v2 based on YARN". --Chire (talk) 10:41, 15 January 2015 (UTC)