Talk:Big data

From Wikipedia, the free encyclopedia
Article milestones
Date               Process                Result
November 25, 2009  Articles for deletion  Deleted
April 21, 2010     Articles for deletion  Kept
July 25, 2012      Articles for deletion  Kept


It appears to me that all of the three variants "Big Data", "Big data" and "big data" are used throughout the text. This should be cleaned up.

Grstein (talk) 07:08, 16 April 2015 (UTC)

MOS:CAPS seems to be telling me, no proper names, hence no caps here. Except of course at head of sentence, in name of book, and other usual capitals. Jim.henderson (talk) 03:46, 18 April 2015 (UTC)
Yes, agree. It seems more and more excessive capital letters are creeping back in. I think this might be from marketing sources that do not use English but marketing-speak, where capital letters are used to promote terms by making them Sound More Grand. In particular, a term for which an acronym is coined does not get capital letters unless it too is a proper noun. W Nowicki (talk) 21:59, 28 September 2016 (UTC)


The texts on U.S., India, and UK were copied from Copyright issue?--K3vinvmp (talk) 19:49, 12 September 2016 (UTC)

The material in the article appears to pre-date that publication; at least the US part. Kuru (talk) 22:09, 12 September 2016 (UTC)
Some journals like these do not care about copyright violations, and surprisingly many authors (in particular from India and China) copy from Wikipedia and get away with it. When seeing such an overlap, always check the Wikipedia version at that time - it most likely already contained the text before that paper was published... HelpUsStopSpam (talk) 20:03, 17 September 2016 (UTC)

Further reading & spam assessment

Please, let's go through Further reading section source by source here & not ax the entire section. I am an IT professional & I do not agree with HelpUsStopSpam that all the sources are spam. Please discuss each source first before removing it. Peaceray (talk) 19:55, 16 September 2016 (UTC)

So, which ones do you consider to be fundamental work on big data? "New Horizons for a Data-Driven Economy" has 0 citations. Clearly non-notable work yet. "A BRIEF REVIEW ON LEADING BIG DATA MODELS" has just 27 citations since 2014, another spam from a third-tier journal. "Technical Report CLOUDS-TR-2013-1" is a tech report, not even peer reviewed; so no fundamental work either. "Encrypted search & cluster formation in Big Data", as the name implies, is on a very specific subtopic only. "Big data for good" - 3 citations. The G&E whitepaper: not peer reviewed, very few citations. "Product Lifecycle Management: Vol 2. The Devil is in the Details." it's the appendix that is cited, and this book has 0 citations. and so on.
Why don't you name which of the sources you consider notable, and why? We can always add them back. But currently, the list consists of entries collected because of spam, not because someone decided to survey literature and identify the most fundamental work. HelpUsStopSpam (talk) 19:57, 17 September 2016 (UTC)
For the only somewhat well-cited source, you can see here how they added this with heavily promotional text: [1], at a time when it did not have many citations yet. Surprise: the IP is at the same university as the authors... so yes, this qualifies as WP:CITESPAM, and I have thus also removed this now. If you want a more notable broad source, consider ACM XRDS magazine. HelpUsStopSpam (talk) 20:21, 17 September 2016 (UTC)
  • "A BRIEF REVIEW ON LEADING BIG DATA MODELS" has 27 citations since 2014, & that's a lot for the Computer Science field, especially for only 2 years, so reinstating that.
  • Big Data computing and clouds: Trends and future directions has 126 citations, I'll take that as notable. Reinstating that.
  • Comment for removal of Product Lifecycle Management was that it was uncited. However, lists 705 citations, so reinstating that & updating citation information
  • The comment for removing "Big Data and the History of Information Storage" was "Spam link". While the link was to a page on Winshuttle's site, an SAP vendor, I don't accept that the page linked to was trying to sell something or otherwise qualified as spam as defined under WP:SPAM. Nevertheless, the information on the site was taken from a Forbes magazine article, "A Very Short History Of Big Data", so I will use the original source instead. Cited by 29 according to, & one should not expect to see any more for that since we are talking Computer Science history.
Peaceray (talk) 00:21, 18 September 2016 (UTC)
P.S. I think it would be more constructive at this point to find replacements if you have better sources. Wholesale removal of Further reading citations would be disruptive at this point. While I am generally supportive of being bold in removing spam links, I believe further deletion of citations/external links from this section would no longer address that goal. Please get consensus here on this talk page before removing any citations from this section. Peaceray (talk) 00:46, 18 September 2016 (UTC)
@Peaceray: 27 citations is meagre for a broad topic such as big data, not "a fair amount". And J-stage is not the best publisher: [2]. Big data is not something very special. If you take e.g. this book ISBN 0544002695 (I have no idea if it's good) from 2013 titled "Big data: A revolution that will transform how we live, work, and think", it has 1886 citations on Google Scholar. Silverman's "Qualitative research", which probably is already too specific, has 2957 citations. Danah Boyd's "Critical questions for big data" has 916 citations... Jure Leskovec's "Mining of massive datasets" has 1043 citations. And you call 27 citations good?!? You must be kidding. As for "Product Lifecycle Management", only the main book has citations, not the appendix of vol 2, which is a separate book (appeared a year later). No, none of these are notable, and they are not on the main subject: Product Lifecycle Management != Big Data. They are not "further reading", but "other crap that happens to mention big data somewhere". Let me spell out the requirements of Wikipedia:Further reading to you: Topical (not just related), Reliable (which would probably be 1000+ citations here, and rule out techreports and low-quality journals), Balanced (not just specific details) and Limited (no, we do not need to include everything; but we should include the most on-topic relevant literature only, i.e. the highly relevant textbooks). The section *never* was a good selection. It was spam right from the beginning, nobody chose these works because they were on-topic and good... thus, boldly removing them is the proper way, rather than keeping this useless random selection. HelpUsStopSpam (talk) 19:58, 18 September 2016 (UTC)
  • A couple things about the "A BRIEF REVIEW ON LEADING BIG DATA MODELS":
    • You are confusing the validity of the journal source, Data Science Journal, with the site that is hosting the paper, Japanese Science and Technology Information Aggregator, Electronic. Regarding that, I attach much more significance to the Scimago Journal & Country Rank for Data Science Journal than to the individual's blog entry that you posted as proof that J-Stage is not the best publisher. Indeed, the blog post was not about J-Stage, but about a journal aggregated by J-Stage, the Journal of Physical Therapy Science. So this is totally irrelevant to our discussion here. However, if I can find a more direct link to the article I will use it, but right now seems broken. Done
    • Big data is still in many ways an emergent field & while there is much that is written about aspects of this field, there are aspects that are not as well covered. My search for scholarly research for "big data" & "data modeling" together yields sources with a similar number of citations. Is data modeling crucial to big data? I think so, but with an MS concentrating on data modeling & systems analysis & over 2½ decades in IT, what do I know? Perhaps I am biased ...
  • As someone who has 14,000+ edits on en.wikipedia, I have read Wikipedia:Further reading several times, although I can always use a refresher. As stated on the page "This page is an essay, containing the advice or opinions of one or more Wikipedia contributors. Essays are not Wikipedia policies or guidelines. Some essays represent widespread norms; others only represent minority viewpoints." That said, there are no numerical criteria in that essay for how many citations make a source reliable. Indicating that 1,000 citations makes something notable is essentially an opinion about an opinion essay. As a former part-time university reference librarian (9 years), I know that in some more obscure areas, as few as a dozen citations can be considered a lot.
  • What really is at debate here is our approach. I categorize myself as an inclusionist whereas I believe you to be an exclusionist. There is a place in Wikipedia & other Wikimedia projects for both, & there is a need for balance & cooperation.
  • As I stated before, if you can find comparable but better sources to replace those currently in Further reading, please go ahead.
Peaceray (talk) 23:40, 18 September 2016 (UTC)
I am an inclusionist as far as notability is given, and the sources are on-topic. Also, from an inclusionist's perspective, nothing is actually lost; the further reading links are not part of the actual textual content... For many of these sources, they are neither. The J-Stage journal is really low quality and low impact. The very link you gave, Scimago Journal & Country Rank for Data Science Journal, usually has it in the worst quarter. The h index of 11 is that of a single low-reputation author. That tech report is not peer reviewed at all. By keeping off-topic links such as "Product Lifecycle Management" in the further reading, you encourage spam, instead of textual contributions. I can live with the Forbes History of Big Data. Forbes is reputable enough, and this one is also on-topic. I have offered you some high reputation and likely on topic sources, why don't you consider them for "further reading" instead of that spam? HelpUsStopSpam (talk) 18:44, 26 September 2016 (UTC)
"A Brief Review on Leading Big Data Models" was spammed on April 30, 2015 [3] by the first author. It is pretty much the definition of WP:CITESPAM. He tried before, [4] in January, twice [5] but back then someone noticed. It is spam, it must be removed. HelpUsStopSpam (talk) 18:51, 26 September 2016 (UTC)
Similarly, have a look at the edits that led to the "Product Lifecycle Management" spam: Special:Contributions/LGB2015. This is spam. HelpUsStopSpam (talk) 19:09, 26 September 2016 (UTC)
@HelpUsStopSpam: First, thank you for your recent edits to the Further reading section. I welcome replacing the sources there with those of higher quality & number of citations by others. I believe that article is much improved due to the addition of quality sources rather than merely excising the section. Second, with regard to the H-index, you express an opinion but do not cite any sources. I want to know where to find the criteria for judging the quality of an H-index, as per the discussion at Wikipedia talk:Notability (academics)#Further on h-index... Since the scale & relevancy of an H-index can vary widely from one field of study to another, I am searching for an authority on this. Peaceray (talk) 21:49, 26 September 2016 (UTC)
@Peaceray: a highly respected journal for big data is this: Scimago Journal & Country Rank for VLDB Journal (VLDB = very large data bases). It has h-index 63. So within the domain of big data, 63 is big. And aforementioned Jure Leskovec has an h-index of 67 on Google Scholar. danah boyd, who I also mentioned, had 46. Jeffrey Ullman has 106. John Langford, who wrote one of the articles in the XRDS magazine, has 49. Yes, h-indexes need to be taken relative to other authors in the same field. Computer science has very high h-indexes. HelpUsStopSpam (talk) 09:44, 27 September 2016 (UTC)

Definition of Big Biomedical and Health Data

Suggested for inclusion at the bottom of the Big Data Definition: the six characteristics of Big Biomedical and Health Data:

In biomedical and health sciences, the 3Vs qualitative euphemism describing "Big Data" has been extended to a constructive definition including characteristics like large size, diverse sources, multiple scales, incongruences, incompleteness, and complexity [1]. This description illustrates explicitly the types of methods, techniques, tools and services that are necessary to tackle intricate big data and predictive analytics challenges.
Wikipedia is NOT for advertising your own research. This definition lacks wide enough acceptance to be noteworthy for Wikipedia yet. Maybe in two years, when it has received a lot of citations? HelpUsStopSpam (talk) 17:26, 25 November 2016 (UTC)


Critical data studies

I found the Critical data studies article while doing new-page patrolling, and it seems to me that it is a rather narrow viewpoint, and should either be expanded with content from the Critique section of this article, or should even be merged into that section. I don't know much about the scholarship in this field, however, so I'm not confident to do this myself. --Slashme (talk) 08:59, 21 December 2016 (UTC)

Danah Boyd (photography) really needed on the article

Does the Big data article really need a photo of one of thousands of researchers, such as ... Danah Boyd? April 2017. — Preceding unsigned comment added by (talk) 12:22, 13 April 2017 (UTC)

Definitions, Socio-Economic Aspects, Society Impact, and Characterization of Data Sources

The article mainly focuses on the mainstream opinions around Big Data, but neglects many notions, definitions, and ideas about Big Data that exist outside the typical mainstream opinion. The article would require some better definitions, ideas around the concept of Big Data, and a clearer categorization of the actual data sources. Often Big Data is nothing more than a database plus a bit of analytics. The article would require some polishing and additions which are not in the mainstream opinion - or which are new ideas for describing the concept of Big Data.

More history on the origins of the "Big Data" phrase ~1993, certainly in public use by 1994

This may or may not be usable, but it is the real history, put together from old slides after Steve Lohr wrote in NY Times and meetup groups wanted talks. See "Big Data - Yesterday, Today and Tomorrow" (slides) or "Big Data - Yesterday, Today and Tomorrow" (video at Stanford). Slide 16 shows the use in "Hardware, Wetware, Software", the general-purpose talk I used 1994-1996 (and maybe a bit during 1993), which was captured on video by University Video Communications in 1996. It was the opening keynote for the TRI-Ada conference, November 1995.

For a few years, most of the "Big Data" use was in my talks. By 1996, it was part of external marketing. Slide 15 shows "Big Data" as part of the SGI booth at SC'96, the supercomputing conference in Pittsburgh. Slides 23-25 show sample slides from 1997, "Big Advantages from Big Tools for Big Data". JohnMashey (talk) 06:20, 19 May 2017 (UTC)