How can we clean up this page so that there is no longer a constant argument about links? The edit history looks very bad.
What is web mining?
To me, "web mining" is the act of extracting (machine-readable) information from the Web (or any web, for that matter), as web search engines such as Google do. This covers extracting information both from content (the general information that web pages provide) and from structure (the relations between web pages that hyperlinks provide).
This page also includes, e.g., SGML document structure, which belongs on other pages such as DOM, SGML or Markup. It also includes things like Web usage tracking, which should either be moved to Internet tracking or given its own page. IMHO these subjects don't fall under "Web mining". —Preceding unsigned comment added by SvartMan (talk • contribs) 18:20, 4 June 2009 (UTC)
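To make the content/structure distinction above concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. The names `PageMiner`, `mine_page` and `SAMPLE_HTML` are made up for illustration, and the sample markup is invented; a real crawler would also fetch pages, skip script and style text, and resolve relative links.

```python
from html.parser import HTMLParser


class PageMiner(HTMLParser):
    """Hypothetical helper: collects text (content) and link targets (structure)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # content: text fragments found on the page
        self.links = []       # structure: href targets of <a> elements

    def handle_starttag(self, tag, attrs):
        # Hyperlinks describe how this page relates to other pages.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Naive content extraction: keep every non-empty text fragment.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def mine_page(html):
    """Return (text, links) extracted from one HTML string."""
    miner = PageMiner()
    miner.feed(html)
    miner.close()
    return " ".join(miner.text_parts), miner.links


# Invented sample page, used only to exercise the sketch.
SAMPLE_HTML = (
    "<html><body><h1>Web mining</h1>"
    '<p>See <a href="/wiki/Data_mining">data mining</a> and '
    '<a href="/wiki/Web_crawler">web crawlers</a>.</p>'
    "</body></html>"
)

text, links = mine_page(SAMPLE_HTML)
print(text)
print(links)
```

In this sketch the text would feed content mining (e.g. indexing), while the list of links would feed structure mining (e.g. building a link graph between pages).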
How to do?
I've read a few things on vertical search engines, specialised search engines based on the APIs of general-purpose search engines such as Google. Could anyone add some information on this? Many thanks in advance. —Preceding unsigned comment added by 220.127.116.11 (talk) 08:09, 6 May 2010 (UTC)
Web content mining section looks wrong
I tried to fix it up a bit (added citations, linked to the TF-IDF wiki page), but it still says "when the length of the words in a document goes to", and there are some mathematical symbols that do not render correctly. Does anybody know what the author was trying to say? Ctorchia87 (talk) 03:14, 10 October 2012 (UTC)
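For reference while the garbled passage is sorted out, here is a minimal Python sketch of one common TF-IDF weighting: term count divided by document length, multiplied by log(N / document frequency). This is an assumption about which variant is meant, not a reconstruction of what the original author wrote; the function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for this example.
docs = [["web", "mining", "web"], ["data", "mining"]]
w = tf_idf(docs)
print(w)
```

With this variant, a term that occurs in every document (here "mining") gets weight zero, while a term concentrated in one document (here "web") gets a positive weight.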
Data mining and Data collection are two different things
Data mining is the analysis of trends in data, whereas data collection is the gathering, harvesting and structuring of raw data into a usable form. The article does not reflect this distinction at all, so it lacks clarity.
If it is correct that data collection is a subset of data mining, rather than a separate entity, then this should also be mentioned. See http://blog.import.io/post/data-mining-vs-data-collection