Talk:Data scraping

From Wikipedia, the free encyclopedia
WikiProject Computing (Rated Start-class, Mid-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.


Data_extraction) has been a persistent target of spammers. The commercial link was typically labeled "Know About Screen Scraping" in the See Also section. It has been removed and replaced several times. history Mrnatural (talk) 19:07, 5 August 2009 (UTC)

Lynx stdout scrapers

Somewhat intermediate between true "screen scrapers" in the modern sense (which interface with the GUIs of applications, and consequently need either OCR to read the bitmapped screen directly and convert it to text, or access to the underlying data objects) and HTML parsers would be something that takes already-rendered textual output from lynx and tries to figure out what it is seeing, basically trying to infer the underlying HTML to some degree.
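To make the idea concrete, here is a minimal sketch in Python, not a reference implementation. It assumes text in the layout that `lynx -dump` typically produces: numbered `[n]` markers in the body and a trailing "References" list of URLs. The sample text, the function name `infer_links`, and the "first word after the marker is the link label" heuristic are all illustrative assumptions; the rendered text does not mark where a link label ends, which is exactly why this kind of scraper has to guess.

```python
import re

# Hypothetical sample of already-rendered text, in the style of `lynx -dump`.
SAMPLE_DUMP = """\
                              Example page

   Welcome to the [1]home page. See also the [2]archive.

References

   1. http://example.org/
   2. http://example.org/archive
"""


def infer_links(dump: str) -> list[tuple[str, str]]:
    """Guess (label, url) pairs from rendered text.

    Heuristic (an assumption): the label is the first word after "[n]".
    """
    # Split the rendered body from the trailing numbered URL list.
    body, _, refs = dump.partition("\nReferences\n")
    # Map each reference number to its URL, e.g. "   1. http://example.org/".
    urls = {
        int(num): url
        for num, url in re.findall(
            r"^[ \t]*(\d+)\.[ \t]+(\S+)[ \t]*$", refs, re.MULTILINE
        )
    }
    # Pair each "[n]word" marker in the body with the URL numbered n.
    return [
        (label, urls[int(num)])
        for num, label in re.findall(r"\[(\d+)\](\w+)", body)
        if int(num) in urls
    ]


links = infer_links(SAMPLE_DUMP)
```

Here `links` holds the recovered label/URL pairs, which is roughly the anchor information an HTML parser would have read directly from the markup.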

Some wiki expert, please find an appropriate place to put this topic in this wiki page, either as a new section between "2 Screen scrapers" and "3 Web scrapers", or as a sub-part of one of those two sections. IMO renaming section 2 to read "Application-UI scrapers", with sub-sections "2.1 General", "2.2 GUI scrapers", and "2.3 Standard output scrapers", would make the most sense. Section 2.3 would include scraping any of: PTY output, sub-process pipe stdout, true pipe output, Unix command-line vertical-bar piping, TELNET output, etc.
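For the "sub-process pipe stdout" case, a rough sketch of what such a standard output scraper could look like in Python follows. The helper name `scrape_stdout`, the stand-in child command, and its "name: value" output format are assumptions made up for illustration; a real scraper would run whatever program it targets and would need parsing rules matched to that program's output.

```python
import subprocess
import sys


def scrape_stdout(argv: list[str]) -> list[str]:
    """Run a command and return its standard output as a list of lines."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Stand-in child process (hypothetical): prints two "name: value" lines.
child = [sys.executable, "-c", "print('host: alpha'); print('load: 0.42')"]

# Recover structured fields from the purely textual output.
fields = dict(line.split(": ", 1) for line in scrape_stdout(child))
```

The same pattern applies to the other sources listed above only in outline; PTY and TELNET output, for instance, usually also contain terminal control sequences that would have to be stripped first.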

Serious but COI addition thoughts, LOL

Scraping is often viewed as a way to get around web site attempts to "protect" data. I've gotten into these battles with a number of people and encourage anyone working on this to cite some of these pieces or reliable refs therein:

OTS needs API

cygwin api comments

outdated summary

bio discussion on API

request for SEC API

These topics sometimes come up on the iText mailing list, as PDF authors seem to be the most prone to creating "protected" documents that are difficult to use with computers.

Thanks. Nerdseeksblonde (talk) 17:26, 24 August 2009 (UTC)

I guess I'd be thinking about reliable sources that discuss why data scraping is even needed (it sounds silly). That would naturally lead to issues with commercial sites supported by ads, which have no value if no one is exposed to them, and to concerns about slowing public awareness of the contents of required public filings (everything from building permits to SEC filings could be an issue here, but the SEC is at least making machine-readable documents available, if not a complete automated API). I'm not sure, however, whether these topics get much beyond forums and blogs. I'm also not entirely sure it is an encyclopedic issue. Nerdseeksblonde (talk) 23:57, 24 August 2009 (UTC)


I merged Report mining into this article. Reyk YO! 21:42, 3 April 2013 (UTC)