|The content of Report mining was merged into Data scraping on Apr 3 2013. That page now redirects here. For the contribution history and old versions of the redirected page, please see ; for the discussion at that location, see its talk page.|
|WikiProject Computing||(Rated Start-class, Mid-importance)|
Data_extraction) has been a persistent target of spammers. The commercial link was typically labeled "Know About Screen Scraping" in the See Also section. It has been removed and replaced several times (see the page history). Mrnatural (talk) 19:07, 5 August 2009 (UTC)
Lynx stdout scrapers
Somewhat intermediate between true "screen scrapers" in the modern sense (which interface with the GUIs of applications, and consequently need either OCR to read the bitmapped screen directly and convert it to text, or access to the underlying data objects) and web scrapers would be something that takes already-rendered textual output from lynx and tries to figure out what it is seeing; basically, it tries to infer the underlying HTML to some degree.
Some wiki expert, please find an appropriate place to put this topic in this wiki page, either as a new section between "2 Screen scrapers" and "3 Web scrapers", or as a sub-part of one of those two sections. IMO renaming section 2 to read "Application-UI scrapers", with sub-sections "2.1 General", "2.2 GUI scrapers" and "2.3 Standard output scrapers", would make the most sense. Section 2.3 would include scraping any of: PTY output, sub-process pipe stdout, true pipe output, Unix command-line vertical-bar piping, TELNET output, etc.
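To make the idea concrete, here is a rough sketch of what such a "standard output scraper" might look like. It assumes lynx is run as `lynx -dump`, and that the dump marks links inline as `[1]label` and lists the targets under a trailing "References" heading; that layout, and the helper names `run_and_capture`, `parse_dump` and `inline_links`, are illustrative assumptions rather than anything described in the article.

```python
import re
import subprocess
import sys

# Assumed layout of a reference line in the dump: "   1. https://example.org/"
REF_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s*$")


def run_and_capture(argv):
    """Run a sub-process and return everything it wrote to stdout as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def parse_dump(text):
    """Split rendered text into body lines and a {number: url} map."""
    body, refs = [], {}
    in_refs = False
    for line in text.splitlines():
        if line.strip() == "References":
            in_refs = True  # everything after this heading is the link list
            continue
        match = REF_LINE.match(line) if in_refs else None
        if match:
            refs[int(match.group(1))] = match.group(2)
        elif not in_refs:
            body.append(line)
    return body, refs


def inline_links(body, refs):
    """Pair the first word after each "[n]" marker with its URL."""
    pairs = []
    for line in body:
        for num, label in re.findall(r"\[(\d+)\](\S+)", line):
            url = refs.get(int(num))
            if url:
                pairs.append((label, url))
    return pairs


if __name__ == "__main__":
    # Requires lynx on the PATH; the URL is a placeholder.
    dump = run_and_capture(["lynx", "-dump", "https://example.org/"])
    body, refs = parse_dump(dump)
    for label, url in inline_links(body, refs):
        print(label, "->", url)
```

The same `run_and_capture` step would apply to any command that writes text to a pipe; only the parsing step is specific to the guessed lynx layout, which is the "inferring the underlying HTML" part.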
Serious but COI addition thoughts, LOL
Scraping is often viewed as a way to get around web site attempts to "protect" data. I've gotten into these battles with a number of people and encourage anyone working on this to cite some of these pieces or reliable refs therein.
These topics sometimes come up on the iText mailing list, as PDF authors seem to be the most prone to creating "protected" documents that are difficult to use with computers.
I guess I'd be thinking about reliable sources that discuss reasons why data scraping is even needed (it sounds silly). That would naturally lead to issues with commercial sites that are supported by ads that have no value if no one is exposed to them, and to things like concern about slowing public awareness of the contents of required public filings (everything from building permits to SEC filings could be an issue here, but the SEC is at least making machine-readable documents available, if not a complete automated API). I'm not sure, however, whether these topics get much beyond forums and blogs. I'm also not entirely sure it is an encyclopedic issue. Nerdseeksblonde (talk) 23:57, 24 August 2009 (UTC)