Talk:Data mining

This article is of interest to the following WikiProjects:

WikiProject Mass surveillance
Data mining is within the scope of WikiProject Mass surveillance, which aims to improve Wikipedia's coverage of mass surveillance and mass surveillance-related topics. If you would like to participate, visit the project page, or contribute to the discussion.
This article has not yet received a rating on the quality scale.
This article has not yet received a rating on the importance scale.

WikiProject Computing (Rated B-class, High-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as B-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.
An image has been requested for this article. Please remove the needs-image parameter once the image is added.

WikiProject Computer science (Rated B-class, High-importance)
This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as B-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.

WikiProject Databases / Computer science (Rated B-class, High-importance)
This article is within the scope of WikiProject Databases, a collaborative effort to improve the coverage of database related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as B-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.
This article is supported by WikiProject Computer science (marked as High-importance).

WikiProject Statistics (Rated B-class, High-importance)
This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.
This article has been rated as B-Class on the quality scale.
This article has been rated as High-importance on the importance scale.

Data mining in Econometrics[edit]

Is it possible/desirable to include the rather different approach/attitude to data mining that econometricians tend to have? I was expecting to find this here but found a rather different approach. The stuff I have in mind might be found here: Lovell, Michael C. (1983) ‘Data mining’, The Review of Economics and Statistics, 65: 1–12; here: Hoover, Kevin D. (1995) ‘In defense of data mining: some preliminary thoughts’, in Kevin D. Hoover and Steven M. Sheffrin (eds) Monetarism and the Methodology of Economics: Essays in Honour of Thomas Mayer. Aldershot: Edward Elgar; or here: Kevin D. Hoover and Stephen J. Perez (2000) ‘Three attitudes towards data mining’, Journal of Economic Methodology 7:2, 195–210. In this article they offer a definition of data mining:

"‘Data mining’ refers to a broad class of activities that have in common a search over different ways to process or package data statistically or econometrically with the purpose of making the final presentation meet certain design criteria."
And they list three attitudes towards it. "Data mining is"
  1. "to be avoided and, if it is engaged in, we must adjust our statistical inferences to account for it"
  2. "inevitable and that the only results of any interest are those that transcend the variety of alternative data mined specifications."
  3. "essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data."

This stuff does seem to be using data mining more like a synonym for some aspects of data dredging. Anyway I was expecting to find stuff on this data mining here but didn't and think it ought to be somewhere. Best wishes (Msrasnw (talk) 10:38, 11 September 2012 (UTC))

Yes, this clearly refers to the "old" use of the term data mining with respect to generating hypotheses, which is covered by the article data dredging. The term "data mining" is way too broad to cover everything, and it is not used consistently (or correctly) throughout the literature. The much more clearly defined term is "knowledge discovery". I do not think we need to cover all abuses of the term in the article, but instead we should focus on the "knowledge discovery" based term; others are to be referenced as "maybe you are looking for: data dredging". --Chire (talk) 11:08, 15 September 2012 (UTC)

Should we add SEMMA to the Process section?[edit]

Should we add SEMMA to the Process section? On the one hand, yes, it is a process model, but on the other hand, it's been proposed and adopted as a standard by only one vendor (SAS). What do people think? Karl (talk) 14:04, 14 November 2012 (UTC)

I found several independent sources for SEMMA and a comparison of it and CRISP-DM, so I added it to this page. If other people feel differently about this addition, please edit it. The SEMMA page received an orphan-page tag today also, so adding a link to it on this page also helps resolve that problem. FYI, I added a link to SEMMA on the CRISP-DM page also. Karl (talk) 16:49, 15 November 2012 (UTC)

I'm wondering how much SEMMA is really about data mining in the scientific sense of the word. From what I can see, it probably fits better into the Business Intelligence part that loves to label itself with the buzzword "data mining". IMHO an article on SEMMA is not needed, but it should be merged into some larger SAS Institute Inc. related article. Clearly, SAS Enterprise Miner is a superset of SEMMA, and doesn't have an article yet. This is like discussing the steering wheel without having an article on what a car is. --Chire (talk) 14:16, 16 November 2012 (UTC)

Including material about R's increasing popularity & Satisfaction ratings of tools[edit]

At the bottom of the lists of open-source and commercial data mining tools, I inserted a sentence that indicates which are the most widely used tools; and for the commercial tools, I also noted which have received the highest satisfaction ratings. I want to openly state that I have a complete COI, since I am the primary author on the research I cite to support these statements. However, when I try to look objectively at this Wikipedia page, I think that the inclusion of this information will be very useful to readers, especially to readers who are not familiar with the large number of tools that are available. So, in good faith I've made these additions to data mining. If other Wikipedians think the material should be removed, I will not argue. I am simply putting it out there for others to evaluate. When reviewing this material I encourage people to look at the value of the material being included on the page, and not simply to remove it with a black/white interpretation of COI. Thank you. Karl (talk) 15:08, 27 November 2012 (UTC)

I don't find information about software popularity to be particularly helpful in this article, so I removed it again. - MrOllie (talk) 15:52, 27 November 2012 (UTC)
I like how you said that, MrOllie. I agree with your statement that "If it is important someone else will add it." So I agree that, due to my COI, it is better for me not to add this information.
However, I disagree with your statement about software popularity not being helpful to the article. It is my personal view that an encyclopedia entry on a technology topic is enhanced by knowing which software has been more widely adopted. But perhaps I am too close to this topic, so I will leave it to others to evaluate and add material if they think it would be helpful. I found two other published sources (that I have no COI with) that speak to the wide and growing adoption of R. One is in the NY Times, and the other is in Java Developers Journal. I will leave my thoughts here on the TALK page, and if other people feel the material is worth adding, they can add it.
This is the material I suggest adding after the list of open-source tools:
R is the most popular open-source data mining tool.[1] [2] In 2010, the R language overtook commercial tools to become the most widely used analytic tool among data miners (43% of data miners reported using it).[3]
This is the material I suggest adding after the commercial tool list:
Among the commercial tools, SAS, IBM/SPSS, and StatSoft software are the most widely used. And STATISTICA Data Miner, IBM SPSS Modeler, and Salford Systems tools have received the strongest satisfaction ratings from data miners in recent surveys.[4][5]
And this is the bullet point I suggest adding to the Marketplace survey list:
I once again want to openly state that I recognize I have a COI with the above material, so I leave it to others to decide if it is useful to add or not. Karl (talk) 16:23, 27 November 2012 (UTC)
I can't think of any encyclopedic reason to add the material. I think it best to follow WP:RECENTISM in such situations. --Ronz (talk) 18:42, 27 November 2012 (UTC)
Thank you. I am a novice Wikipedia editor, and hadn't seen WP:RECENTISM before. I like the perspectives expressed there, and will try to keep them in mind during my Wikipedia writing. In this case, I agree that, on the one hand, the material is recent. However, on the other hand, my interpretation of WP:RECENTISM is that it is OK to include small amounts of recent material -- it's just that we don't want large amounts of recent material in an entry to overwhelm the stable agreed-upon and historical information in a Wikipedia entry. E.g., in United States presidential election we don't want the bulk of the material to be on the most recent election. However, it is OK for the entry to briefly mention the most recent election. In my view, this material about the adoption rate of data mining tools is only a couple of sentences, and would be OK. I also feel that the ideas of WP:RECENTISM are balanced by the usefulness of informing readers (briefly) about the current state of tool usage. Over the past decade, open source tools in general have seen increasing adoption. This multi-year trend has been the case with the R programming language as well. While I have never used R myself, it would be a COI for me to put the following info in data mining (and it would be excessive and therefore violate both good judgement and WP:RECENTISM), but other Wikipedians evaluating this TALK page discussion might be interested to see the following trend in R adoption that we have seen in our five surveys:
  • 23% of data miners reported using R in the 2007 Survey (N=314)
  • 36% of data miners reported using R in the 2008 Survey (N=348)
  • 38% of data miners reported using R in the 2009 Survey (N=710)
  • 43% of data miners reported using R in the 2010 Survey (N=735)
  • 47% of data miners reported using R in the 2011 Survey (N=1,319)
Karl (talk) 19:42, 27 November 2012 (UTC)
As KDnuggets has found the same in their surveys, it might be an option to cite them. However, is it of encyclopedic relevance what is the current favorite toy? Much of R's popularity might come from the recent trend of trying to solve everything by matrix factorization. A few years ago, it probably was Weka. In a year or two, it might be Hadoop/Mahout, for example. We don't try to cover all that there is on the Interwebs, but try to keep focused. This article should focus on the scientific aspects IMHO, and I see little value in putting tool rankings into Wikipedia. There are websites such as yours for that. --Chire (talk) 23:52, 27 November 2012 (UTC)

Picture[edit]

Having some pictures would be nice (for example a classification diagram) --Lbertolotti (talk) 14:39, 21 February 2013 (UTC)

Making it more readable[edit]

Excuse my uncoordinated overhaul (Jan 2014). I find the article unclear: it has long blocks of text containing redundant information, and some parts consist mainly of examples. I am suggesting an overhaul to make the article readable. Philip Habing (talk) 12:30, 19 January 2014 (UTC)

The article is quite a mess because everybody wants his products and results to be prominently listed. Which is largely why the "notable uses" section is so big, gets additions again and again and isn't called "examples".
As you can see, I added back most of your changes. But when it is done in smaller steps (have a look at the diff of your edit!), it's easier to see how much has really changed.
I'd prefer to keep the “buzzword” paragraph in the lead; because one of the reasons that the article is such a mess is that literally everything is being dubbed “data mining” by public media and marketing these days. NSA: data mining. Amazon: data mining. Google: data mining. Audi: data mining. Wikipedia: godfather of data mining.
The article should help readers understand that not every kind of data collection and processing should sensibly be called “data mining”; sometimes we should stick to calling it “massive data collection”, “privacy breach”, or “mass surveillance”. The current abuse of the term (fortunately, they tend to prefer the big data bullshit bingo these days) IMHO belongs in the introduction, not the “history”. 188.98.222.114 (talk) 16:53, 19 January 2014 (UTC)
Aha, I can see. :-) I understand doing it in smaller steps would've been better, sorry about that :-(
Your idea about splitting the lemma sounds like a good idea: pull out all the examples.
I partly agree on your buzzwords text. The buzzwords paragraph itself adds information. The sentence "Even the popular book..." doesn't add to understanding the ideas behind data mining.
When splitting the lemma, the Privacy concerns section still belongs in the data mining article.
By the way, your name, an IP address, is sort of strange to talk to. No offence meant. Philip Habing (talk) 15:05, 20 January 2014 (UTC)

Sharad is a .NET developer at LERA TECHNOLOGIES, but now he is working on Hadoop. Unfortunately he succeeded and became a Hadoop expert, and now he is going to start an institute in UP, his native place. Now — Preceding unsigned comment added by 183.82.3.79 (talk) 10:57, 24 June 2014 (UTC)

Data Access Mining[edit]

With the advent of the new cryptography-based virtual currencies such as Bitcoin and myriads of other so-called alternative or Alt-coins, "Data Mining" has taken on a new meaning besides the older connotation that it has in this sense. I am currently in the process of creating a startup enterprise, named Data Access Mining And Gaming, and I have decided to use the word "access" to differentiate my business's activities from traditional data mining. Data Access Mining in this case would be running computer hardware, be it CPU, GPU, ASIC, or otherwise, to process the public registry, or Blockchain, for reward based on the established algorithms of the networks of the virtual coin. I encourage others to help me find this topic's rightful place on Wikipedia, because I feel like this topic deserves its own listing. Matthew Biebel (talk) 19:30, 24 June 2014 (UTC)

Can you name literature that uses it this way? Bitcoin mining seems to be the common term. If I were you, I would avoid overloading "data mining", it only causes confusion. --Chire (talk) 09:50, 25 June 2014 (UTC)
Thank you for your response, I will have to do some research to find "literature" that has used the terminology that I have described, but Bitcoin itself is essentially data. Also, the mining I am referring to does not necessarily restrict itself to only Bitcoin but to alternative coins based on similar algorithms as Bitcoin such as Litecoin or the myriads of other examples. Encountering "Data Mining" in reference to bitcoin mining has been something more of a first hand experience for me. For instance, I spoke recently with someone, and after telling him the name of my company, he said to me "Data Mining? Oh, like Bitcoin? I keep getting emails about that. Have you found any yet?" Perhaps the literature has yet to catch up with the lingo. Matthew Biebel (talk) 01:23, 26 June 2014 (UTC)
Everything is data. If you move your mouse, it is data. Yet, visualizing your mouse movements on the screen is not data mining. I don't think literature has to catch up with uninformed use of words. You will notice that the article Bitcoin does not make use of the term "data" much, nor does cryptocurrency. If you want to avoid the name Bitcoin because of alternative coins, why don't you use cryptocurrency mining? That would be much more standard and precise. Why use something ambiguous? Do yourself a favor, and avoid the stop word "data", and the buzzword "data mining"! --Chire (talk) 09:54, 26 June 2014 (UTC)
To be honest, the reason I chose the term for my company is because we had an acronym first and I developed and trademarked the name. At the time, I did not realize that "data mining" was its own field in computing, and when I spoke of "data mining" in the context of "bitcoin mining" people understood that I was talking about machines crunching data to produce bitcoin. I agree that "cryptocurrency mining" is more precise. Thank you for your contribution and clarifications. Matthew Biebel (talk) 19:00, 26 June 2014 (UTC)