Talk:Big data

From Wikipedia, the free encyclopedia
WikiProject Mass surveillance (Rated C-class, Top-importance)
Big data is within the scope of WikiProject Mass surveillance, which aims to improve Wikipedia's coverage of mass surveillance and mass surveillance-related topics. If you would like to participate, visit the project page, or contribute to the discussion.
This article has been rated as C-Class on the quality scale.
This article has been rated as Top-importance on the importance scale.
WikiProject Computing (Rated C-class, High-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.
Article milestones
Date Process Result
November 25, 2009 Articles for deletion Deleted
April 21, 2010 Articles for deletion Kept
July 25, 2012 Articles for deletion Kept

About original research

Wikipedia is great partly because of its rules, made by many sharp people over time. One of these rules is no original research; Wikipedia is not a place to showcase new papers under the guise of citations, since it suggests the new paper is a reliable source; the fact that it cites other sources does not mean that it itself can be viewed as a secondary source for our purposes; what is wanted is citations one-source-removed, such as established journals, newspapers, textbooks -- impartial analysts, looking objectively at primary sources. In this case, the citation added is a primary source -- a pdf file of a research paper; don't see why it is in this article other than to promote this particular paper using search engine optimization.--Tomwsulcer (talk) 17:51, 3 February 2014 (UTC)

What happened to the 3V's?

The current article no longer mentions volume, velocity, and variety as ways of characterizing Big Data. Why? I thought the combination was a good way to describe important aspects of Big Data. (talk) 15:45, 22 March 2014 (UTC) Mark Kerstetter

Someone must have put it back in, along with a couple other Vs. But I have a question about the para in the "Characteristics" sxn that refers to Variety:

Variety - The next aspect of Big Data is its variety. This means that the category to which Big Data belongs to is also a very essential fact that needs to be known by the data analysts. This helps the people, who are closely analyzing the data and are associated with it, to effectively use the data to their advantage and thus upholding the importance of the Big Data.

I think this is supposed to be a definition, but all it really says is that variety is important, not what it means. And it says that in several uninteresting ways ("very essential", "needs to be known", "helps...effectively use", "to their advantage", "upholding the importance"). Sounds like a college essay where the student is using a bunch of verbiage to disguise the fact that he doesn't know the answer.
I can imagine several meanings for variety: variety of data type (string/ int/ real...), variety of data structures (objects in the object oriented programming sense, maybe), variety in how much of a given data structure is included for a given entity (like bibliographic information might have date of birth for Einstein, but not for Socrates), variety in the ways data structures are encoded (XML vs. plain text vs. tables), etc. So is the Variety in Big Data one of these, or all, or something else? Mcswell (talk) 20:58, 15 February 2015 (UTC)
Replying to myself: there are definitions of the Vs (and the C) here [[1]] which look to me to be much better than the existing definitions. But I'm not an expert on this stuff, so I hesitate to paraphrase it, and I'm unclear on the wikipedia requirements for quoting something verbatim. Mcswell (talk) 21:09, 15 February 2015 (UTC)

Proper analysis is still needed

Headline-1: Big data: are we making a big mistake? March 28, 2014

[The science of 'statistical learning' and 'computer learning' can keep research on track.] — Charles Edwin Shipp (talk) 01:48, 30 March 2014 (UTC)

Canadian Open Data Experience

Another editor is insisting that this article include a mention of "the Canadian Open Data Experience (CODE) Inspiration Day event held at the University of Waterloo Stratford Campus located in Stratford, Ontario" at which "renowned Data Scientist Hilary Mason spoke about Big Data." I don't see how this adds to a reader's understanding of this topic and remain convinced that it should be removed. Can other editors please comment or contribute to this discussion? Thanks! ElKevbo (talk) 11:57, 7 April 2014 (UTC)

I agree with ElKevbo.--Tomwsulcer (talk) 22:30, 7 April 2014 (UTC)
I do not agree with Elkevbo because the mention is similar in tone to the previous line about the IBM-sponsored championships. Perhaps the editor could consider removing the Hilary Mason mention - and simply identify the event. Statdata (talk) 03:24, 11 April 2014 (UTC)
You're right - the IBM event needs to be removed, too. ElKevbo (talk) 03:47, 11 April 2014 (UTC)
Agree with ElKevbo again. Perhaps an external link could be included to this stuff but in the body of the text, Wikipedia's rules want secondary sources, impartial one-step-distanced from source material.--Tomwsulcer (talk) 11:06, 11 April 2014 (UTC)

WP copyright policy violation in Benefits section

Running the Duplication Detector report reveals the following:

Comparing documents for duplicated text:

Downloaded document from (239986 characters, 7914 words)
Downloaded document from (71117 characters (UTF8), 5217 words)
Total match candidates found: 1202 (before eliminating redundant matches)

Please run the report itself to see which sentences are an exact match (about 6) and which are close paraphrases (about a dozen or more).
Peaceray (talk) 02:51, 8 June 2014 (UTC)

Notability of cartoon

Funny though it is, what is the ultimate source of the cartoon? Is it just a WP editor? WP:NOTBLOG... (talk) 08:05, 13 August 2014 (UTC)

Hi. As you can see on the file page this is a cartoon by T. Gregorius. I'm not aware that he is a Wikipedia editor (and I rather doubt that). IMHO, the cartoon is a good and very compact visualisation of the criticism aimed at Big Data. Well, of part of the criticism, of course. I did give this some thought before putting it in the article, whose criticism section is rather abstract. It's really hard to give a meaningful and comprehensible illustration of the Big Data paradigm. I think it's a perfect fit. We could add "Cartoon critical of big data application, by T. Gregorius" if you think that this would make it clearer that this is not "Wikipedia's" commentary. BTW, Wikipedia contains a lot of schematic illustrations (and these often have to leave something out, some complex aspects) - are they notable? (And are their authors notable? Wikipedians?) It comes down to editorial decisions. Are those illustrations appropriate, unduly biased, educational, help explain the subject etc. I think this cartoon is a perfect fit in the criticism section, it really makes the text more comprehensible. --Atlasowa (talk) 10:31, 13 August 2014 (UTC)
Thanks for the clarification, although I still have some doubts on Thierry Gregorius's notability, and of this work in particular, if it is just self-published. Anyway, I'm fine with your amended caption. Thanks. (talk) 12:11, 13 August 2014 (UTC)

Lablanche & Company

I propose adding the CSS product designed by Lablanche & Company for data compression and data encryption in one step. Is it possible?

SL — Preceding unsigned comment added by (talk) 16:30, 10 January 2015 (UTC)

No, as there is no indication whatsoever that the company is notable in any way.--McSly (talk) 23:18, 10 January 2015 (UTC)

The CSS product is of interest for data compression and encryption and can be sold to big companies and institutions. This product can generate tens of millions of dollars, so I think you should agree to include it in the Wikipedia big data page (just one sentence). "The start-up Lablanche & Company commercializes a prototype named CSS for big data visualization and data compression/encryption using the recent compressed sensing theory". SL — Preceding unsigned comment added by (talk) 04:02, 12 January 2015 (UTC)

First, please use your account (Fgtyg78) instead of the IP to make communication easier. As mentioned on your talk page, the company you want to add here is completely unknown so it cannot be included (see WP:NOTABILITY). To be clear, there won't be any exception to that rule. It also looks like you have a conflict of interest which is not helping your case (see WP:COI). So I'm going to remove the link again. If you re-add it, I will file a report to have this page protected and the url added to the black list. --McSly (talk) 04:32, 12 January 2015 (UTC)

This company is not unknown because its website is referenced on Google and on the Compressed Sensing Wikipedia page. Fgtyg78

What do you have to answer to this? Fgtyg78

Please respond to appearance of conflict of interest. Peaceray (talk) 06:41, 12 January 2015 (UTC)

There is no conflict of interest because CSS is a unique software prototype for big data problems with an innovative mechanism of compression/encryption. This prototype has no equivalent in big companies and institutions. So it must appear in the encyclopedia to inform people about the use of compressed sensing in big data.

Fgtyg78 — Preceding unsigned comment added by (talk) 11:01, 12 January 2015 (UTC)

The repeated prominent posting of a non-prominent company in the lede of an article will always appear to be spam and a conflict of interest. If you had paid attention, you would have noticed that no other company is mentioned in the lede. Please stop posting this change to the lede. Discuss the issue fully here first to arrive by consensus at the appropriate way to include the information, or not. Peaceray (talk) 16:57, 12 January 2015 (UTC)

Tone of article

Hi. A few reverts back the tone of the text changed entirely and now sounds like a poorly written magazine article. Can someone take the article back to a stable point please? Rui ''Gabriel'' Correia (talk) 08:10, 29 January 2015 (UTC)

I have since restored parts of an earlier version to strengthen the lede. Rui ''Gabriel'' Correia (talk) 12:21, 29 January 2015 (UTC)

I disagree with these edits, but am not going to engage in edit warring on a subject for which I have no passion. The most I could do would be to slap about 4 or 5 requests for clarification tags — which the editor would most likely remove, as he has twice done with the "tone" tags, without appreciating that these are there to help.

  • "Data has always been Big." — this is opinion (original research) and vague and meaningless.
  • "The one aspect that differs now (if compared with the past) would be the sheer scale and accessibility of Data, which is the direct result of the super efficient speeds in which data can now be computed." — I think the editor is confusing concepts of size and importance
  • "Big Data is therefore an all-encompassing term for any collection of large data sets that were once difficult to process." — so, nowadays it is easy? Just like that? Why?

Rui ''Gabriel'' Correia (talk) 14:34, 29 January 2015 (UTC)

For what it's worth, I agree with Gabriel on all these points. And in particular, I think it's important that the opening paragraph be a clear summary of the topic--if someone sees just the first couple of sentences (e.g. in a search results page) they should have some idea of what the topic is about. (I also chimed in on the other editor's talk page, User talk:Jugdev#Manual_of_Style.)
I also don't want to get into an edit war here! I've asked for admins to help. (Dispute Resolution Noticeboard) In the meantime, I'll put the "tone" tag back up on this page with a link to the discussion here. AIUI, removing a tag like that without participating in the discussion is a clear violation of WP mores, so let's assume that won't happen again... -- Narsil (talk) 00:41, 1 February 2015 (UTC)
I've been told that there hasn't been enough discussion to merit admin intervention yet. So... As I agree with User:Rui Gabriel Correia's take, I'm going to restore his edits. User:Jugdev, please respond to the issues he and I raised here rather than just reverting! Thanks, -- Narsil (talk) 20:54, 1 February 2015 (UTC)
Thank you for your edits. I have noted my thoughts on your commentary:
  • "Data has always been Big." — this is opinion (original research) and vague and meaningless.
The concept of big data is vague. original research is varied and does not convey the multidisciplinary perspective of the quote in question. The only quote that encapsulates the concept in a clear manner is the one which has been included.
  • "The one aspect that differs now (if compared with the past) would be the sheer scale and accessibility of Data, which is the direct result of the super efficient speeds in which data can now be computed." — I think the editor is confusing concepts of size and importance
This is an open-ended criticism, without any depth - could you please elaborate on what you think is being confused? If I understand correctly, size is the reason why the concept (Big Data) has been given its rather unusual name. Its importance is not referenced in the opening paragraph - more prominence has been given to the term itself (i.e. the subject of the article) and how it has entered into the public sphere...
  • "Big Data is therefore an all-encompassing term for any collection of large data sets that were once difficult to process." — so, nowadays it is easy? Just like that? Why?
This is the sentence that Narsil keeps reverting back to... In response to the question, big data will always be difficult to process but it is easier now, as tools previously used by statisticians and analysts were limited. The study of statistics and data will evolve and procedures will become more elaborate as data sets grow larger.
Here are a few quotes from the wiki guidelines on tone:
"Wikipedia articles, [...] should be written in a formal tone. Standards for formal tone vary depending upon the subject matter, but should follow the style used by reliable sources, while remaining clear and understandable"
I believe that my version of this article is in line with the standards noted above. The sources are reliable and the content is clear and easily digested.
"Normally, the opening paragraph summarizes the most important points of the article. It should clearly explain the subject so that the reader is prepared for the greater level of detail that follows."
I believe that the opening sentence in this article is clearer than the version written prior to my involvement. It summaries the most important aspects of this new topic, which is presently being debated within the academy (i.e. all academic institutions) as I write. — Preceding unsigned comment added by Jugdev (talkcontribs) 09:46, 2 February 2015 (UTC)

Jugdev. I offered to help, yet you rejected my help, claiming you understood what you were doing. Apparently not. I completely understand what you are trying to do by presenting the most recent developments first, but it does not work that way. Let's look at the first sentence of Elephant, for example: "Elephants are large mammals of the family Elephantidae and the order Proboscidea." Now, if I want to add that the market for ivory is driving the African elephant to extinction, where do I add this? Before the original text - i.e., present the most recent information as you are doing with big data, or after the existing sentence? Let's take a look:

  • 1. The illegal trade in ivory is driving the African elephant to extinction throughout most of Africa. Elephants are large mammals of the family Elephantidae and the order Proboscidea.
  • 2. Elephants are large mammals of the family Elephantidae and the order Proboscidea. The illegal trade in ivory is driving the African elephant to extinction throughout most of Africa.

Which one (1. or 2.) is the more logical sequence?

That is ONE aspect of it. The other is the tone. The tone is going wrong precisely because you are swinging around the logical order of the bits of information. Which is why you need to add "Data has always been Big", otherwise the next sentence hangs, because segments like "aspect that differs now", "compared with the past", trigger in the reader a sense that something is missing. So you patched in the bit about "Data has always been Big" to cover it up (and plagiarised - read below).

You also claim to be familiar with the styleguide and have now had ample opportunity to analyse your edits to see if they comply. It amazes me then that you keep on claiming that your edit is in line with the styleguide and yet you have not yet picked up that there is a problem with "sheer scale" and "super efficient speed". This is partly because you just plagiarised the source, then changed or moved one or two words around: "Data has been “big” all along. What has changed now is not just scale and cross-channel inputs, but the sheer speed and accessibility of data".

Greetings. Rui ''Gabriel'' Correia (talk) 11:00, 2 February 2015 (UTC)

Semi-protected edit request on 29 January 2015

Would it be possible to add link to e-Science page ( in the section discussing research applications of Big Data?

The e-Science page seems to provide more details of the examples mentioned on the Big Data one, and this linking could eventually motivate avoiding some duplication of the material.

Mheikkurinen (talk) 20:02, 29 January 2015 (UTC)

Not done: it's not clear what changes you want to be made. Please mention the specific changes in a "change X to Y" format. It's already a wikilink. — Technical 13 (etc) 20:23, 29 January 2015 (UTC)

RfC: Is the opening paragraph a good summary of the topic?


Jugdev has agreed to stop inserting the text in question. Manul ~ talk 23:39, 8 February 2015 (UTC)

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Is the opening paragraph a good summary of the Big data topic? Narsil (talk) 19:16, 2 February 2015 (UTC)

Background: There has been disagreement among three editors about the topic paragraph for this article. (You can see the discussion in #Tone of article.) User:Jugdev, the author of the current version (here's one diff where he restored it), feels that with this version "the opening sentence in this article is clearer than the version written prior to my involvement. It summaries the most important aspects of this new topic, which is presently being debated within the academy". User:Rui Gabriel Correia and I feel that this version is unencyclopedic in tone, contains subjective statements, and does not meet WP style (in particular, the opening sentence is not a summary of the topic). We would prefer the summary as it stood before Jugdev's edits (version 644139720).
We've reached an impasse--our discussion is basically, one says X, the other says "I disagree, Y", the first person says "no not Y, X". ;-) The edits have basically consisted of us reverting each other repeatedly, which is not good! Since there are already three editors involved I didn't know if WP:3O was appropriate. I'm hoping input from more editors will settle this one way or the other. Narsil (talk) 19:27, 2 February 2015 (UTC)

Original. The version which starts "Data has always been Big." is certainly unencyclopedic in tone and such conversational style should be avoided. I much prefer the one sentence version. The current formatting (with bold Big, Data and Big Data in the first, second and third sentences) is clearly not per manual of style. I think the previous version which starts "Big data is an all-encompassing" is better, although it could possibly be improved by breaking down "difficult to process them using traditional data processing applications." though with my limited familiarity with the topic I wouldn't be sure how. SPACKlick (talk) 10:17, 3 February 2015 (UTC)

Thank you for your contribution. Firstly, the above overview by Narsil is a convenient version of events. Please refer to the talk page (pasted below for your convenience) for a more thorough overview and also my criticism of the changes made. In response to your comments SPACKlick, I loosely agree with your concern regarding the formatting, and in defense feel that the formatting may help the article aesthetically by allowing users to identify keywords. I however disagree with the first point, as the sentences in question happen to be a quotation from a well regarded publication about Big Data. It is my understanding that the quote evokes a particular frame of mind/ thought, which in turn allows the reader to begin grappling with the complex topic. I do not believe that we have enough evidence to completely revert the article. All we have seen is a critique of two sentences that happen to be from a publication that specialises in the subject in question.

Please see a summary of the talk page below: [quoted from User talk:Jugdev#Manual of style ]

User:Jugdev--I hope it's okay, I removed the copied-and-pasted version of the "Talk:Big data#Tone_of_article" section (people can scroll up to read it, and/or can follow the link I just pasted). Pasting the entire discussion was confusing the formatting of this talk page, since it included a heading and lots of paragraphs. If there are particular parts of that discussion you think are most relevant, feel free to copy those! (Though then it's usually good to mark them clearly as quotes so we can see who said what.) Narsil (talk) 20:11, 3 February 2015 (UTC)
Evidently you feel it's necessary to quote the whole discussion instead of providing a link. If you really really think so... I've at least set it off in a blockquote so people can see what's new and what's quoted. Narsil (talk) 19:13, 4 February 2015 (UTC)
To offer an opinion/response: I think it is quite appropriate for editors to offer a "critique of two sentences", since the very issue we are discussing is whether those two sentences are a good lead for the article! It may well be true that those two sentences come from a publication on the topic. For all I know, it's a very good publication indeed (I have to take your word for it since you don't give us a link). But even if it is, that publication probably has a whole lot of sentences that may be true and helpful but are not a good lead for the WP page.
Per WP:LEADSENTENCE, "If possible, the page title should be the subject of the first sentence." (If you look at other WP pages, the vast majority begin with something like "A Foo is a..." or "Bar were a...") I see no reason why we should defer defining "big data" until the third sentence. When people land on this page, the first thing we should tell them is what "big data" means. If it needs saying, I'm voting for the Original version. -- Narsil (talk) 20:21, 3 February 2015 (UTC)
And by way of examples--I just hit "random article" 5 times, and got articles with these lead sentences:
(Sandcastle Waterpark) "Sandcastle is a water park located in the Pittsburgh suburb of West Homestead." — (Hypersthene) "Hypersthene is a common rock-forming inosilicate mineral belonging to the group of orthorhombic pyroxenes." — (Psychostick discography) "The following is the complete discography of official releases by Psychostick." — (Rainbow Gladiator) "Rainbow Gladiator is an album by the American jazz violinist Billy Bang recorded in 1981 and released on the Italian Soul Note label." — (Khanlar Safaraliyev) "Khanlar Safaraliyev was an Azerbaijani oil field worker, labor organizer, and Moslem social democrat."
So four out of five random examples begin with a sentence that uses the article subject as the sentence subject. It's not an ironclad rule--one of the five was an exception--but there needs to be a good reason for it (in this case, the exception is a page that contains a list, so it doesn't need a definition). Narsil (talk) 20:35, 3 February 2015 (UTC)

I've added a "tone" tag to the page to direct visitors to this discussion. User:Jugdev, please do not remove the tag. The tag is there to indicate that editors disagree about whether the tone is appropriate, and this disagreement clearly exists. Don't remove the tag just because you think the tone is good--we know that! honestly!--the tag is so other editors will come here and give their opinions. If they agree with you about your edits, then they'll say we should remove the tag, and this should get wrapped up sooner. But if you remove the tag yourself, this could be considered disruptive (per WP:DISRUPT) or even edit-warring. Narsil (talk) 03:15, 4 February 2015 (UTC)

I disagree with the tone tag - just to repeat myself : the paragraph in question is a published quotation from a highly regarded title from the field of big data. -JG (talk) 08:57, 4 February 2015 (UTC)

So what part don't you get? Wikipedia tone is wikipedia tone. What your "highly regarded title from the field of big data" does is its own business. If it is such a "highly regarded title", I guess it has a style guide. And guess what - that style guide is for their publications; we have ours, as all other big and serious publications each have their own. Rui ''Gabriel'' Correia (talk) 14:46, 4 February 2015 (UTC)
I believe my question is clear. Please let me know if you need me to rephrase in a more digestible manner. the title has been referenced... which publication do you write on behalf? -JG (talk) 15:00, 4 February 2015 (UTC)
The issue is, the opening paragraph is not in keeping with wikipedia tone. It may be from an extremely reputable source, but that doesn't mean it's written in WP style. The Iliad and Finnegans Wake are both highly respected books, but that doesn't mean they're written in the right style for Wikipedia articles.
Since you aren't offering any response on the issue we're discussing--whether the opening paragraph is in Wikipedia style--I'm going to revert it again. Please do not re-revert it until after you've offered a response on that issue here. Right now every WP editor who's commented has agreed that your edits are not appropriate for Wikipedia--saying the quote comes from a very good book doesn't change that.
(To answer your question to Gabriel, "which publication do you write on behalf?"--Gabriel writes on behalf of Wikipedia. So do I. So do you, while you're writing here. So if you're writing here, write in WP style!) -- Narsil (talk) 19:13, 4 February 2015 (UTC)
apologies for the delay. In response, I have quoted the wikipedia style guide above, which suggests that the opening sentence is within the requirements. Slightly confused why this has been reverted again... -JG (talk) 09:19, 5 February 2015 (UTC)
You have not quoted the style guide in any way that is relevant to the conversation. The part you keep quoting says "The lead should be able to stand alone as a concise overview. It should define the topic, establish context, explain why the topic is notable, and summarize the most important points". But you have not responded to the issues we're discussing. For example: (WP:LEAD) "If possible, the page title should be the subject of the first sentence"; (WP:TONE) "English language should be used in a businesslike manner". ...User:Rui Gabriel Correia, User:Bluerasberry, User:SPACKlick--would one of you be willing to restore the original version? I frankly feel that User:Jugdev is engaged in disruptive editing but if I'm the only one reverting his changes we don't have a very clear case... Narsil (talk) 19:02, 5 February 2015 (UTC)
  • Comment I find this RfC to be malformed. In my opinion, a better way to do it is to propose a specific change to the lead of the article. After that, ask whether that change should be enacted. Obviously the proposed change is controversial. It should not be live without consensus as there is opposition. It would be best to find if some parts are problematic and other parts acceptable, or to otherwise find the nature of the dispute. In any case - the usual way to manage this would be to revert to the previous version, then only update it if there is consensus on the talk page. That is not special advice for this case, but in my opinion, the usual workflow of Wikipedia. Blue Rasberry (talk) 19:32, 5 February 2015 (UTC)
    • Thanks much! I'm not sure how to play that, though. I created this RfC because User:Jugdev had already changed this page to the current one ("Data has always been big"). This change is certainly controversial--I'm not finding any other editors who agree with it--but if people try to revert it, he just reverts it back. Admins told me there had not been enough discussion to justify admin intervention, so that's why I created this RfC (to try to get more editors involved). I would love to go with your approach--switch to the old lead ("Big Data is an all-encompassing term for any...") and keep it that way until there's consensus for a change. But as we have seen, JG will simply revert to his version, and say "my version is in keeping with WP style". So what's our next play? Narsil (talk) 19:47, 5 February 2015 (UTC)
Narsil This is not a complicated issue yet. The article had been stable for years with small changes till about a week ago. At this time, a user made changes. Multiple people immediately called for more discussion.
Per WP:BRD, the user who made the change was WP:BOLD, then someone WP:REVERTed the change, and now it is time to discuss it here. If the result of the discussion is that the change should stand, then it does. If there is no consensus to keep the change, then it is not kept. The fact that this RfC came shortly after the change is not relevant to the BRD process which is typical on Wikipedia. Blue Rasberry (talk) 19:54, 5 February 2015 (UTC)
  • Obvious oppose to opening with "Data has always been Big." Narsil and Rui have covered the high points. This RfC is very unclear, amongst a host of other problems (e.g. why is there a cut & pasted chunk of Jugdev's talk page here, warning template and all?). Jugdev aka JG, you say, "the paragraph in question is a published quotation from a highly regarded title from the field of big data". If you are talking about the lead paragraph -- or any paragraph without a direct quotation -- then it must be removed per WP:COPYVIO. I suggest an immediate close of this RfC per WP:SNOW to avoid wasting the time of anyone drawn into this. Manul ~ talk 00:41, 6 February 2015 (UTC)
Just to clarify, I opened the RfC to get the page changed back from "Data has always been Big" to the dryer version. ;-) At the time, there'd only been two editors involved besides Jugdev, and one of those two had said he was giving up--the RfC seemed like the best way to get other editors involved. Apologies if it was the wrong approach! (As for why we quoted the entire discussion from the other talk page--that's because JG pasted it here and insists on keeping it, and I only wanted to fight one battle at a time...) Narsil (talk) 02:23, 6 February 2015 (UTC)
My comment sounds harsher than intended. I appreciate that you were looking to obtain outside input in order to avoid warring with Jugdev. When an editor's preferred change is almost nonsensical ("Data has always been Big"), and everyone else opposes the change, and the editor doesn't back down, it's like a Randy from Boise situation. What to do? Perhaps the original research noticeboard would be the closest fit -- after all, if "Data has always been Big" isn't nonsense then it is WP:OR. But I would say that if an editor doesn't agree to stop inserting stuff that's universally recognized as weird, it becomes more of a conduct issue. Jugdev, would you please agree to drop this? "Data has always been Big" will never gain consensus, so there's no need for this RfC. Manul ~ talk 03:46, 6 February 2015 (UTC)
Although I still disagree, I will not revert again. I will find a better quote as the present one does not work - I look forward to working with you all soon. -JG (talk) 09:42, 6 February 2015 (UTC)

Jugdev, writing for the Wikipedia is not about collating quotes. It is about making sense of information found in reliable sources, conveying the information found in the sources in your own words in an encyclopaedic style and citing the sources consulted. If you are going to use quotations, this must be done sparingly, where applicable and justifiable, but not as the opening of a lede or article. Regards, Rui ''Gabriel'' Correia (talk) 10:58, 6 February 2015 (UTC)

Rui, I've been told the nothing encapsulates the essence of a debated topic the way a published quotation does. I will find one that's fit for our purpose. -JG (talk) 11:14, 6 February 2015 (UTC)

Thanks, fixed. And you used a wrong word. As for quotation, I am certain you will do as you please - as always. And cheers, this is the very last time you hear from me. Rui ''Gabriel'' Correia (talk) 11:26, 6 February 2015 (UTC)
thanks - I look forward to working with you. — Preceding unsigned comment added by Jugdev (talkcontribs) 03:30, 6 February 2015

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

The way forward[edit]

I don't think it is very productive to have a whole team of editors monitoring one edit by one editor. We have done all that can be considered par for the course, we have pointed out what is deficient about the version the editor would like to use, all to no avail. If said editor cannot grasp a simple thing, such as not starting an article/lede on a minor sentence, then he needs a tutor. I don't know if appointing a tutor is foreseen in the mechanisms to deal with stubborn editors. If not, progressive blocking seems to be the only solution. Regrettably. Rui ''Gabriel'' Correia (talk) 23:57, 5 February 2015 (UTC)

Lede Sentence 3[edit]

The trend to larger data sets equates to additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on." This is badly written. I tried to correct it, but I don't understand what it's trying to say well enough to repair the English. SPACKlick (talk) 17:05, 10 February 2015 (UTC)

As I read that sentence, you could rephrase it as "When a large amount of related data is in a single data set, it is possible to derive information from it that could not be derived from an equivalent amount of data in smaller data sets. This allows correlations to be found..." But that sentence isn't supported by the citation (in The Economist: Data, data everywhere), which just talks about the total amount of data, and not whether that data is in one data set or many. OTOH, as I read it, the original sentence has the same problem (that it's drawing a conclusion not supported by the source). So I'd just cut out the whole bit about numbers of data sets (one big vs many small), and change it to The trend to larger data sets allows new correlations to be found to "spot business trends, prevent diseases, combat crime and so on." Narsil (talk) 19:23, 10 February 2015 (UTC)
Quite right; it used far too many words to say too little of value, and perhaps to suggest something that isn't even true. I further adjusted your words to "Analysis of these larger data sets can find new correlations, to 'spot . . .'" partly because of my personal dislike of passive voice. Feel free, as usual, to point out where I may have gone wrong. Jim.henderson (talk) 13:46, 12 February 2015 (UTC)


Previously the article said that they filtered 99.999% of the data. Upon further reading of the following thesis, it looks like they filter more than that. I've updated things accordingly and thrown in what was surely a clumsy citation. Feel free to clean it up, and then delete this talk entry.

L1 filtering: 40 MHz to ~60-65 kHz (so ~.015% of data retained). L2 filtering: 65 kHz to 6 kHz (so 10% of data retained). L3 filtering: 5-6 kHz to 500-600 Hz (so 10% of data retained). So 99.99995 % of data was filtered. — Preceding unsigned comment added by (talk) 13:34, 24 March 2015 (UTC)
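
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the retained and filtered fractions directly from the event rates quoted above. It assumes the L1 output rate is about 65 kHz (the rate at which L2 starts) and takes the upper end of each quoted range; the names `STAGES`, `retained_fraction` and `overall_retained` are made up for this illustration and do not come from the thesis. The figures it yields depend entirely on those assumed rates and may not match the rounded percentages quoted above.

```python
# Recompute per-stage and overall data retention from the quoted event rates.
# All rates are in Hz and are assumptions read off the comment above
# (upper end of each quoted range), not values checked against the thesis.

STAGES = [
    ("L1", 40e6, 65e3),   # 40 MHz in, ~65 kHz out (assumed L1 output rate)
    ("L2", 65e3, 6e3),    # 65 kHz in, 6 kHz out
    ("L3", 6e3, 600.0),   # 6 kHz in, 600 Hz out
]


def retained_fraction(rate_in_hz, rate_out_hz):
    """Fraction of events that survive one filter stage."""
    return rate_out_hz / rate_in_hz


def overall_retained(stages):
    """Product of the per-stage retained fractions."""
    total = 1.0
    for _name, rate_in, rate_out in stages:
        total *= retained_fraction(rate_in, rate_out)
    return total


for name, rate_in, rate_out in STAGES:
    kept = retained_fraction(rate_in, rate_out)
    print(f"{name}: {kept:.4%} of input retained")

kept_overall = overall_retained(STAGES)
print(f"overall retained: {kept_overall:.6%}")
print(f"overall filtered: {1.0 - kept_overall:.6%}")
```

Because the stages are chained, the overall retained fraction is just the final output rate divided by the initial input rate, so only the first and last numbers affect the headline percentage.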