Talk:Big data

WikiProject Mass surveillance
Big data is within the scope of WikiProject Mass surveillance, which aims to improve Wikipedia's coverage of mass surveillance and mass surveillance-related topics. If you would like to participate, visit the project page, or contribute to the discussion.
This article has not yet received a rating on the quality scale.
This article has not yet received a rating on the importance scale.
WikiProject Computing (Rated C-class, High-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.


The term has grown immensely in usage and notability. The previous entry cited one source. Mine cites a large number of well regarded journals and two books. jk (talk) 22:20, 21 April 2010 (UTC)

As the author of the initial entry that was deleted, I'm glad to see it resurrected. It has my vote to keep! --Nick (talk) 17:41, 12 May 2010 (UTC)

What happened to petabytes?

'Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.[5]' — Preceding unsigned comment added by Samdutton (talkcontribs) 16:25, 26 October 2011 (UTC)

'Every day, 2.5 quintillion bytes of data are created and 90% of the data in the world today was created within the past two years.[11]' That reference leads to an IBM web page rather than a report or scientific paper. Please link to primary sources and peer-reviewed studies so the reader can verify the claim and gather more information about its context and scope. -- (talk) 22:21, 1 February 2012 (UTC) This is correct.

"Big Data" vs. "big data"

Should we not stick to a consistent upper-/lower-case spelling in the article? --Sigi fikanz (talk) 14:55, 4 February 2012 (UTC)

I can't imagine why one would use upper case; feel free to change to lower case wherever you see it. Kuru (talk) 21:25, 4 February 2012 (UTC)

Data mining and Dana Boyd

Should Data Mining be added as a link? That article would need to be updated, though. Dana Boyd: I have my doubts as well, as their article talks only about "social" data. The article does, however, ask some questions that need to be discussed. Renebach (talk) 10:12, 19 April 2012 (UTC)


The origin of the term "big data" would be interesting here. Who coined it?

I'm not sure if anybody actually coined the term, and maybe others were using it as well. Of course, all should read MS Fnd in a Lbry for a cautionary tale.

But here is some relevant history. [I was at Silicon Graphics 1992-2000, usually as a Chief Scientist split about 50/50 between doing {computer architecture, software, strategy, etc.} and {talking to customers, analysts, press, i.e., evangelism}. SGI started with 64-bit micros in 1992, and by 1994 had large scalable multiprocessors (up to 128 CPUs with shared memory) with 64-bit operating systems and a high-performance, journaled 64-bit file system (XFS, later contributed to Linux). Back in 1997, when a Terabyte was a lot of disks, we demonstrated backing it up in an hour, of interest to the technical and commercial customers who bought systems for large capacity and high-performance I/O.]

I started using the phrase Big Data around 1994/1995 as part of general talks to customers, i.e., often used to set the stage for technology trends and what SGI was doing with them. The earliest mention I know of the HW/WW/SW talk was 11/05/94, in an abstract, which likely means it had existed for a while, as I'd used some of the material in talks in 1993.

By 4Q 1995 I was using "Hardware, Wetware, Software" in keynote talks at conferences (USENIX LISA, Tri-ADA). University Videos did a VHS tape of it in 1996. During 1995-1997, it was my usual talk about technology trends and their implications for audiences like universities or conferences, i.e., not straight product sales pitches. I even gave it for a group of local Red Cross people once. Sadly, I don't know of an online copy (used SGI Showcase, not PowerPoint), but I might scan a slide or two some day and put them up.

By then, I had a bunch of talks: "Big Band(width)," "Big Physics," "Big Latency," "Big Transition/Big Trouble (64-bit)," "Big Challenges in ComputerCity." I think the first time I used "Big Data" as the talk title itself was for a sales meeting in mid-1996, which was a derivative of the HW/WW/SW talk. I showed a foot-long piece of the high-performance NUMALink cables used in the SGI Origin 2000 ccNUMA systems that started shipping in 1996, the point to the salesforce being that it was not enough to have big graphics and big compute; we had to be able to address and transfer big data. SC'96 in November 1996 in Pittsburgh, PA was a launch event for us, although I think I was out of the country, so didn't attend. Michael Woodacre kindly sent me a picture of the Origin 2000. Three of the signs say "Big Graphics," "Big Compute," and "Big Data." The latter included "Effortless data handling for complex simulation problems," i.e., a message for high-performance technical customers, of whom about 8-10,000 attend these events.

By 1997, there were talks with slides like "Big Advantages from Big Tools for Big Data." Big Data had moved to become the main theme. I added another general talk (albeit much techier) called "Big Data and the Next Wave of Infrastress," about explosions in data and the stresses to be expected on infrastructure. The earliest slideset I still have is from September 1997. There was a talk at Stanford, July 29, 1998.

Both talks were used for training SGI salesforce (either initial or at sales meetings), talking to press and sometimes industry analysts, through 2000, for example:

1998 July 14-15 Industry Analyst Briefing "Technology Trends, Big Data, Silicon Graphics". Although some of the earlier (1993-1995) Challenge systems had been used for database data mining, the Origin systems were much more scalable and had more software and organization support for customers outside SGI's traditional technical markets. I found a slide from October 1998 that had lots of commercial customers, such as financial institutions, banks, telecom companies, US Census Bureau, retail, etc., who didn't care about 3D graphics or traditional technical number-crunching, but liked having a mature 64-bit UNIX, 64-bit journal filesystem, and high-performance I/O. It was not enough just to have big data; people had to be able to move it around fast enough to be useful. As one (I/O) engineer always said, "CPU for show, I/O for dough." Applications included various operations research uses, risk analyses, data mining for marketing, telecom fraud detection, etc.

1999 A later version was a keynote for 1999 USENIX, which links to a copy of the slides.

I think I did at least 200-300 public presentations and 500-700 sales pitches while at SGI, and some audiences were 500 people. I have no idea how many people saw videos or passed copies of slides around their companies. Thousands of people heard these talks live.

I certainly would not claim to have coined the phrase Big Data, as it surely had to have been used earlier, given how simple it is. Inside companies, phrases float around and people try them. In the mid-1990s, SGI was quite visible and I probably was the most persistent evangelist of Big Data there, but of course, others may have been doing that as well, regardless of the terminology used. We certainly competed with folks like Teradata, for example, who had a long history in data mining.

In 2000, I wrote a retrospective on the Numaflex scalable systems family architecture, and what it took to build really scalable I/O systems. A crucial goal of this approach was the ability to scale I/O connectivity relatively independent of the CPU and memory. We started on those designs in 1997. The discussion included some examples from the earlier Origin 2000s, starting in 1996, back in the old days when a few Terabytes was Big Data:

'Some of them loved the massive I/O capabilities:
7 GB/sec from one file system [although one customer I knew wanted 15]
4+ GB/sec to/from one file
1 TB backup in an hour (to tape)
7-TB individual files actually used by real customers.
Disk farms in use at 10-100+ TBs, with no fsck.'

The first example was an oil and gas exploration outfit, who typically got multi-Terabyte seismic datasets and wanted to visually "fly around" in them, which simultaneously required Big Data, Big Compute, and Big Visualization. Commercial data mining customers (that I saw) tended to use the first, and maybe the second. The oil and gas folks would spend $1M to build curved-screen 3D graphics auditoriums driven by graphics supercomputers like this. — Preceding unsigned comment added by JohnMashey (talkcontribs) 20:14, 28 August 2012 (UTC)

It is amusing that the phrase has really taken off. The vagueness of origin has some resemblance to the history of the specific phrase "creeping featurism," where I don't recall actually coining the phrase, but so far, the earliest known usage is in a paper I wrote and I gave a lot of ACM National Lectures that mentioned it.

Sadly, there actually is little coinage in coining phrases. JohnMashey (talk) 18:23, 15 August 2012 (UTC)

Fascinating and colorful SGI details above! Other pieces of the story involve Francis Diebold at UPenn and Douglas Laney at Gartner. See Fdiebold (talk) 17:54, 30 August 2012 (UTC)

Yes, see Francis' paper, as it adds some useful perspectives. JohnMashey (talk) 23:50, 30 August 2012 (UTC)

See at NY Times blog, Steve Lohr's recent The Origins of ‘Big Data’: An Etymological Detective Story. JohnMashey (talk) 05:57, 7 February 2013 (UTC)

Fascinating information. Thanks! You're right, it is amusing how popular this little phrase has become. Apptrain (talk) 22:15, 23 March 2013 (UTC)

Capitalization of danah boyd

There are ongoing edits that capitalize/uncapitalize danah boyd. Please stop these ridiculous edit-war-like changes, and once and for all discuss here which spelling to use. In my opinion, sentences should start with an uppercase letter, even when she prefers to write her own name in lowercase. Plus, we should be free to spell her name in uppercase since it adheres to English language standards. Anyway, please stop changing it every week, but try to reach a consensus first. Thank you. -- (talk) 05:17, 15 May 2012 (UTC)

IMHO, Danah Boyd is a name, danah boyd is a logotype. Not sure that helps, but isn't it supposed to be written in English, not PR-speak? p.r.newman (talk) 12:09, 15 May 2012 (UTC)

It is also acceptable capitalised. See, for example, "danah boyd ...also known as Danah Michele Boyd" and "Microsoft hires social-net scholar Danah Boyd" (talk) 09:46, 29 July 2012 (UTC)

As a name, it ought to be capitalized, in my opinion. "Alexander" and "alexander" are both the same name, but only the former is grammatically correct. Even if someone decides to change their name legally to "alexander" from "Alexander", the standard rules of capitalization ought to still apply, should they not? By itself, in that instance, "alexander" would be correct -- say, perhaps, if you are signing your name. In a sentence, particularly at the front, it should be "Alexander". There are no rules of grammar that allow for the first word in a sentence to remain uncapitalized; thus, the following would be incorrect: "alexander was an architect." The following might be correct: "Today, I met alexander, an architect." However, to establish that it is in fact a name and not a regular word, as it is in a sentence, it still ought to be capitalized. At the moment, the Wikipedia article says "danah Boyd [...]", which isn't correct in the slightest. Talvieno (talk) 15:03, 19 January 2013 (UTC)


I've added the technologies for big data analysis from the 2011 McKinsey report (Manyika et al.) and moved that from additional reading to a reference. Very pleasing to see every one of them already had a Wikipedia entry! p.r.newman (talk) 12:02, 15 May 2012 (UTC)

Simplified Introduction

Not everyone visiting Wikipedia has a degree. The number of polysyllabic words in the introduction exceeds good practice on readability. Any big data expert want to take on making a plain English version? MaryEFreeman (talk) 23:52, 5 December 2012 (UTC)

I took a look at the version from December 5 and compared it to the current version; it hasn't changed all that much since the time you commented. The problem with readability isn't the number of polysyllabic words; the problem is "too many cooks" spoiling the broth. The three-paragraph lead reads as if it were written by multiple editors, some of whom might not have a grasp on how we write lead sections, use references, and paraphrase quotations. Yes, it should be rewritten. Viriditas (talk) 11:48, 3 January 2013 (UTC)

New resource coming up

I wanted to call the attention of the regular editors/participants of this page to a new book that should be of interest to them: Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger & Kenneth Cukier (NY: Houghton Mifflin Harcourt, 2013). I've just read the last-stage manuscript and have no commercial involvement with it -- but, as an educated non-techie, I found it very interesting and engaging. Should be out in late spring, 2013. --Michael K SmithTalk 19:58, 8 December 2012 (UTC)

SECTION: Science and research. EXISTING TEXT (quoted by me): "When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information" MY COMMENT: The numbers do not add up, unless it is mentioned that the 140 TB is for 2 years OR clarified as to how many nights a year the data is taken. At 200 GB a night, it is over 72 TB a year, assuming data is collected every night. — Preceding unsigned comment added by Kodign (talkcontribs) 07:25, 15 December 2012 (UTC)
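
The arithmetic in the comment above can be checked with a short back-of-the-envelope sketch. It assumes decimal units (1 TB = 1000 GB) and one observing run every night of the year; neither assumption comes from the article or from SDSS itself, and the function names are purely illustrative.

```python
# Sanity check of the SDSS figures quoted above.
# Assumptions (not from the source): decimal units, 1 TB = 1000 GB,
# and data collected on every night of a 365-night year.
GB_PER_NIGHT = 200
TOTAL_TB = 140
GB_PER_TB = 1000
NIGHTS_PER_YEAR = 365


def tb_per_year(gb_per_night: float, nights: int = NIGHTS_PER_YEAR) -> float:
    """Yearly volume in TB at a constant nightly rate."""
    return gb_per_night * nights / GB_PER_TB


def nights_to_reach(total_tb: float, gb_per_night: float) -> float:
    """Number of nights needed to accumulate total_tb at gb_per_night."""
    return total_tb * GB_PER_TB / gb_per_night


yearly = tb_per_year(GB_PER_NIGHT)
nights = nights_to_reach(TOTAL_TB, GB_PER_NIGHT)
print(f"{yearly:.0f} TB per year")
print(f"{nights:.0f} nights, about {nights / NIGHTS_PER_YEAR:.1f} years")
```

Under these assumptions the nightly rate gives 73 TB per year, and 140 TB corresponds to 700 nights of observation, a little under two years, which is consistent with the comment's point that the article should state the time span.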

Architecture Section

This help request has been answered. If you need more help, place a new {{help me}} request on this page followed by your questions, contact the responding user(s) directly on their user talk page, or consider visiting the Teahouse.

The entire section on architecture appears to be single source, from self published material, added by an anonymous user. Can this be substantiated or removed?

  • This really isn't a question appropriate for asking the community-at-large. If you see something that needs a citation, look for one and add it to the article. If you can't find a citation, mark the questionable content with the {{cn}} template, then contact the editor that added the content and ask for a citation. If it is unsourced controversial material or negative content about a living person, remove the content. Hope this helps. Cindy(need help?) 13:33, 22 March 2013 (UTC)

I tried cleaning up that section without realizing it contained self-published material from an anonymous user. That material isn't really relevant and should probably go. Also, an Education section was added that may not belong either. This is a very trendy topic, and this article could get out of hand easily. Much of the material may belong in articles on Information Architecture, Data Analytics, Cloud Computing, Parallel processing, etc. Apptrain (talk) 22:10, 23 March 2013 (UTC)

This Architecture section still needs to be reworked. A sentence or two about database architecture would help non-expert readers understand the subject material. After that, a domain expert needs to fill in some details. This reads like it was cut and pasted from another document and just dropped in here. F3meyer (talk) 21:25, 25 January 2014 (UTC)

Content with French blog as source

Hi there. To whom it may concern: I read the content below, but it looks a bit strange to me. It seems to be translated from another language, and it has two sources from a French blog, which I don't understand. Please check if possible. Thanks.

If Gartner’s definition (the 3Vs) is still widely used ... to Big Data some predictive capabilities.

-- Pchackal (talk) 15:25, 29 June 2013 (UTC)


The human genome contains 3 billion base pairs, which amounts to about 700 MB of information. Hardly "big data". So why is it included here? (talk) 10:40, 17 October 2013 (UTC)
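
The "about 700 MB" figure can be reproduced with a minimal sketch, assuming each base is stored in 2 bits (four possible bases: A, C, G, T), with no compression and no metadata. The 2-bit encoding and the function name are assumptions made for illustration, not something stated in the article.

```python
# Rough size of the human genome if each base is stored in 2 bits.
# Assumptions (not from the source): 4 possible bases (A, C, G, T), so
# 2 bits per base; no compression, no metadata, one copy of the genome.
BASE_PAIRS = 3_000_000_000
BITS_PER_BASE = 2  # log2(4)


def genome_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store base_pairs bases at bits_per_base bits each."""
    return base_pairs * bits_per_base // 8


size = genome_bytes(BASE_PAIRS)
print(f"{size / 10**6:.0f} MB (decimal)")
print(f"{size / 2**20:.0f} MiB (binary)")
```

Under these assumptions the raw sequence comes to 750 MB in decimal units, or roughly 715 MiB, in line with the figure quoted in the comment.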

Because it's all about marketing bla blah...? -- (talk) 12:09, 12 January 2014 (UTC)

Is Wikipedia a "classic example"?

The caption to the picture says "the text and images of Wikipedia are a classic example of big data". However Wikipedia to my knowledge uses relatively standard relational databases in a souped up LAMP architecture, lightly sharded. No MapReduce, no relaxed consistency NoSQL, unless it's changed. Do you fit the model if relational databases still work? Is it a classic example? If so, the aspects of its architecture that make it "big data" should probably be briefly described among the list of examples. Tono-bungay (talk) 18:34, 22 October 2013 (UTC)

Applications of big data

What about adding a section covering big data applications such as predictive analytics, text analytics, social media analytics, etc.?

Rick jens (talk) 03:12, 19 November 2013 (UTC)

About original research

Wikipedia is great partly because of its rules, made by many sharp people over time. One of these rules is no original research; Wikipedia is not a place to showcase new papers under the guise of citations, since it suggests the new paper is a reliable source; the fact that it cites other sources does not mean that it itself can be viewed as a secondary source for our purposes; what is wanted is citations one-source-removed, such as established journals, newspapers, textbooks -- impartial analysts, looking objectively at primary sources. In this case, the citation added is a primary source -- a pdf file of a research paper; don't see why it is in this article other than to promote this particular paper using search engine optimization.--Tomwsulcer (talk) 17:51, 3 February 2014 (UTC)

What happened to the 3V's?

The current article no longer mentions volume, velocity, and variety as ways of characterizing Big Data. Why? I thought the combination was a good way to describe important aspects of Big Data. (talk) 15:45, 22 March 2014 (UTC) Mark Kerstetter

Proper analysis is still needed

Headline-1: Big data: are we making a big mistake? March 28, 2014

[The science of 'statistical learning' and 'computer learning' can keep research on track.] — Charles Edwin Shipp (talk) 01:48, 30 March 2014 (UTC)

Canadian Open Data Experience

Another editor is insisting that this article include a mention of "the Canadian Open Data Experience (CODE) Inspiration Day event held at the University of Waterloo Stratford Campus located in Stratford, Ontario" at which "renowned Data Scientist Hilary Mason spoke about Big Data." I don't see how this adds to a reader's understanding of this topic and remain convinced that it should be removed. Can other editors please comment or contribute to this discussion? Thanks! ElKevbo (talk) 11:57, 7 April 2014 (UTC)

I agree with ElKevbo.--Tomwsulcer (talk) 22:30, 7 April 2014 (UTC)
I do not agree with ElKevbo because the mention is similar in tone to the previous line about the IBM-sponsored championships. Perhaps the editor could consider removing the Hilary Mason mention and simply identifying the event. Statdata (talk) 03:24, 11 April 2014 (UTC)
You're right - the IBM event needs to be removed, too. ElKevbo (talk) 03:47, 11 April 2014 (UTC)
Agree with ElKevbo again. Perhaps an external link could be included to this stuff but in the body of the text, Wikipedia's rules want secondary sources, impartial one-step-distanced from source material.--Tomwsulcer (talk) 11:06, 11 April 2014 (UTC)

WP copyright policy violation in Benefits section

Running the Duplication Detector report reveals the following:

Comparing documents for duplicated text:

Downloaded document from (239986 characters, 7914 words)
Downloaded document from (71117 characters (UTF8), 5217 words)
Total match candidates found: 1202 (before eliminating redundant matches)

Please run the report itself to see which sentences are an exact match (about 6) and which are close paraphrases (about a dozen or more).
Peaceray (talk) 02:51, 8 June 2014 (UTC)

Notability of cartoon

Funny though it is, what is the ultimate source of the cartoon? Is it just a WP editor? WP:NOTBLOG... (talk) 08:05, 13 August 2014 (UTC)

Hi. As you can see on the file page this is a cartoon by T. Gregorius. I'm not aware that he is a Wikipedia editor (and I rather doubt that). IMHO, the cartoon is a good and very compact visualisation of the criticism aimed at Big Data. Well, of part of the criticism, of course. I did give this some thought before putting it in the article, whose criticism section is rather abstract. It's really hard to give a meaningful and comprehensible illustration of the Big Data paradigm. I think it's a perfect fit. We could add "Cartoon critical of big data application, by T. Gregorius" if you think that this would make it clearer that this is not "Wikipedia's" commentary. BTW, Wikipedia contains a lot of schematic illustrations (and these often have to leave something out, some complex aspects) - are they notable? (And are their authors notable? Wikipedians?) It comes down to editorial decisions. Are those illustrations appropriate, unduly biased, educational, help explain the subject etc. I think this cartoon is a perfect fit in the criticism section, it really makes the text more comprehensible. --Atlasowa (talk) 10:31, 13 August 2014 (UTC)
Thanks for the clarification, although I still have some doubts on Thierry Gregorius's notability, and of this work in particular, if it is just self-published. Anyway, I'm fine with your amended caption. Thanks. (talk) 12:11, 13 August 2014 (UTC)