Talk:Web crawler/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Archive 1

More details needed for personal project

Well, this covers the basics of the web crawler. I have to design a web crawler that will work in a client/server architecture, and I have to build it in Java. Actually, I am confused about how to implement the client/server architecture. What I have in mind is to create a lightweight Swing component for client interaction and an EJB that receives instructions from the client to start crawling. The server will have another GUI to monitor and administer the web crawler.

Does anyone have a simpler or alternative way of doing this?

It is actually not that difficult to build a web crawler. Off-the-shelf components are available in languages such as Java, Python and Perl. If you need to build one in Python (I am talking about a simple crawler) you can use the urllib library, and in Perl, LWP. For more information, search for these terms on the web. You may also want to look at libcurl or curl, which provide a very good starting point for C/C++-based crawlers. A lot of academic websites also provide crawlers, but make sure you obtain the documentation for these too. --IMpbt 20:25, 20 May 2005 (UTC)

For downloading a single Web site or a small bunch of Web sites, you can use almost any tool, for instance, just "WGET"; however, for downloading millions of pages from thousands of Web sites the problem gets much more complicated. Run some test crawls with 2-3 candidates and see which crawler best suits your needs. --ChaTo 08:50, 18 July 2005 (UTC)

The biggest problem I hit writing one of these was keeping track of URLs. I'd keep a list of URLs to be processed, and a list of URLs that had been processed. Thus, before I added one to the list of URLs to be processed, I'd check that it was in neither list. My spider is written in Python, and on my first attempt, I simply used a list and listname.__contains__. This got slow. Eventually I wrote a binary search tree for this. This was very fast, but after several hundred thousand URLs processed (and quite a few more to be processed), it used up all the RAM (the machine I dedicated can only hold 192 MB). The solution I finally settled upon was a hash table. It only uses a few MB of RAM, yet can process tens of thousands of operations per second on my slow machine. I guess in summary, if you hit a roadblock with your URL list, hash tables work well. This has to be the biggest thing I've tackled involving my spider. Also, because most of your table is stored on disk, you can store as much additional info about each URL as you want without a big hit.
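The "neither list" check described above can be sketched with Python's built-in hash-based set. This is only an illustrative sketch, not the commenter's code: the class name UrlFrontier and its methods are hypothetical, and everything is kept in memory, whereas the comment describes a table stored mostly on disk.

```python
from collections import deque


class UrlFrontier:
    """Hypothetical helper: URLs waiting to be crawled plus every URL already seen."""

    def __init__(self):
        self._queue = deque()  # URLs still to be processed, in FIFO order
        self._seen = set()     # hash-based set of every URL ever queued

    def add(self, url):
        """Queue url unless it was queued before; return True if it was added."""
        if url in self._seen:  # average O(1) lookup, unlike list.__contains__
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next(self):
        """Return the next URL to process, or None when nothing is pending."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Because a URL stays in the seen set after it has been handed out by next(), one membership test covers both the "to be processed" and the "already processed" lists.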

Use a database with a primary key for URLs; RAM-based objects aren't really optimized for this in the way an RDBMS is, as you've learned.
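A minimal sketch of the primary-key suggestion, assuming SQLite through Python's standard sqlite3 module. The table name urls, the done column and the helper functions are made up for illustration; any RDBMS with a unique key on the URL would serve the same purpose.

```python
import sqlite3


def open_url_store(path=":memory:"):
    """Open (or create) the URL table; pass a file path to keep it on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"               # the key rejects duplicate URLs
        " done INTEGER NOT NULL DEFAULT 0)"    # 0 = pending, 1 = processed
    )
    return conn


def add_url(conn, url):
    """Insert url as pending; return True only if it was not already stored."""
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def next_pending(conn):
    """Return one URL that has not been processed yet, or None."""
    row = conn.execute("SELECT url FROM urls WHERE done = 0 LIMIT 1").fetchone()
    return row[0] if row else None


def mark_done(conn, url):
    """Flag url as processed."""
    conn.execute("UPDATE urls SET done = 1 WHERE url = ?", (url,))
    conn.commit()
```

Extra per-URL information (fetch time, status code and so on) could be added as further columns without changing the duplicate check.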

As far as client/server, I just used xmlrpclib. Probably something similar in Java.

What I would like to know is: how does a Web crawler automatically collect as many URLs as possible?

Anti-merge

I disapprove of merging this article, as not all web crawlers are search bots, for example maintenance bots and spam bots! The Neokid 09:55, 28 January 2006 (UTC)

I think crawlers include: search engine bots (for indexing), maintenance bots (for checking links, validating pages) and spam bots (for harvesting e-mail addresses). ChaTo 10:29, 28 January 2006 (UTC)

Vandalism

I've never come across a vandalized page before and was not quite sure what to do about it. I removed some of the material on the vandalized page, but did not revert the content. If someone with more experience could assist, I would be grateful. AarrowOM 16:16, 20 February 2007 (UTC)

Bad PDF links

There are a lot of bad links among the PDFs at the bottom of the page. I am not an experienced wiki editor, but they should have their hyperlinks removed or fixed or something. 70.142.217.250 13:33, 15 July 2007 (UTC)

Seemingly contradictory section

I added {{confusing}} to the section Crawling policies because it essentially seems to say both that the nature of the Web makes crawling very easy, and that the nature of the Web makes crawling difficult. Can someone rewrite it in a way that clarifies things, or is the problem with how I am reading it? ~ Lenoxus " * " 13:06, 24 March 2007 (UTC)

I couldn't find where it says that crawling is easy. Here is the (apparent) contradiction: building a simple crawler should in theory be very straightforward, because it's basically download, parse, download, parse, ... so, if you want to download a web site or a small set of, say, a few thousand pages, it's very easy to write a program to do so. But in practice it's very hard, because of the issues listed in the article. If you want you can add an explanation like this to the article, but ... where? - ChaTo 15:26, 12 April 2007 (UTC)
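The "download, parse, download, parse" loop can be sketched in a few lines of Python. This is only an illustration of the simple case: the fetch argument is a placeholder for whatever downloads a page (for instance a wrapper around urllib.request.urlopen), the names LinkParser and crawl are made up, and none of the hard parts listed in the article (politeness, robots.txt, scale, duplicates) are handled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)            # download
        if html is None:             # page missing or not fetchable
            continue
        visited.append(url)
        parser = LinkParser(url)
        parser.feed(html)            # parse
        for link in parser.links:
            if link not in seen:     # queue each URL only once
                seen.add(link)
                queue.append(link)
    return visited
```

For a quick try without touching the network, fetch can be the get method of a dictionary that maps URLs to HTML strings.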

Oh, wow, I'd totally forgotten about this. Well, it seems perfectly good now. :) ~ Lenoxus " * " 15:01, 4 May 2008 (UTC)

SEOENGBot

SEOENGBot, originally created for the purpose of providing a focused crawler for the SEOENG engine on a per-Website basis (2004-2007), was later retrofitted as a general-purpose, highly distributed crawler which is responsible for crawling millions of webpages, while archiving both webpages and links. The archived data is injected into the SEOENG engine for its own commercial use. SEOENGBot, as well as SEOENG, remains a highly guarded system and its source and location are not currently published. Seoeng (talk) 04:45, 10 May 2008 (UTC)

References

I changed several references to an inline format as there were two styles of referencing, which had attracted a cleanup tag. Once all the references mentioned in the article text had been converted to inline form there were five references left over that were not mentioned. They may be left over from previous edits that have since been deleted or they may be appropriate references for certain parts of the text. Rather than delete them I have moved them here so that the article only uses one reference style. I will try going back through the history to see if they are references that should have been deleted. Please add them back as inline references if you can see an appropriate place. WaysToEscape (talk) 01:14, 23 March 2009 (UTC)

Burner, M. (1997). Crawling towards eternity – building an archive of the World Wide Web. Web Techniques, 2(5).
Castillo, C. (2004). Effective Web Crawling. PhD thesis, University of Chile.
Miller, R. and Bharat, K. (1998). Sphinx: A framework for creating personal, site-specific web crawlers. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia. Elsevier Science.
da Silva, A. S., Veloso, E. A., Golgher, P. B., Ribeiro-Neto, B. A., Laender, A. H. F., and Ziviani, N. (1999). Cobweb – a crawler for the Brazilian web. In Proceedings of String Processing and Information Retrieval (SPIRE), pages 184–191, Cancun, Mexico. IEEE CS Press.
Yibei Ling and Jie Mi, An optimal trade-off between content freshness and refresh cost, Journal of applied probability, 2004, vol. 41, no3, pp. 721-734. —Preceding unsigned comment added by WaysToEscape (talk • contribs) 01:13, 23 March 2009 (UTC)

web crawler

A web crawler is useful for indexing. For more details, please visit www.php.com; in this field you can see more details about crawlers. rishi -- Preceding unsigned comment added by 122.168.235.103 (talk) 13:38, 9 April 2009 (UTC)

"Web" as a proper noun

As a matter of style I believe that "Web" should be capitalized when used as a proper noun -- for example as in "World Wide Web" (meaning the single largest connected graph of HTML documents available by HTTP), or "the Web" (short for the above) -- but not when used in a compound noun such as "web crawler", "web page", "web server", where it acts like an adjective meaning something more like "HTML/HTTP". -- 86.138.3.231 11:58, 13 October 2007 (UTC)

It should also be noted that 'Internet' is always a proper noun. Someone with some time ought to clean up this page. —Preceding unsigned comment added by 24.96.244.134 (talk) 21:56, 10 August 2008 (UTC)

English teacher here. Adjectives formed from proper nouns are proper adjectives. As such, they, too, are capitalized (America -> American). But so many rules have been bent and broken in the digital age, I don't know that Web has to be capitalized, or even if Internet does. I searched for style clues, and it seems to be up for grabs. Some authorities have announced they will no longer capitalize either word. Some, only if used as a noun (as you indicated). A couple, always -- as per the general English rule for proper nouns and adjectives. Perhaps Wikipedia could create its own style page? PapaWhitman (talk) 16:17, 22 April 2009 (UTC)

Simple English version

Can someone create the Simple English version of this, so non-geeks can understand the concepts? -- Preceding unsigned comment added by 83.104.132.41 (talk) 12:49, 3 June 2009 (UTC)

Are there any particular bits that are difficult to follow? The article is only likely to be changed by editors who understand the concepts; as they understand them they are not in the best position to know which parts are unclear. If anyone finds any part of the article difficult to understand please mention it here. Thanks. WaysToEscape (talk) 00:58, 4 June 2009 (UTC)

picture?

Does anyone have a good picture of one of these web crawlers in action? It might add interest to the article. 80.69.30.17 (talk) 14:55, 22 May 2009 (UTC)

There is no such thing as a "picture" of a crawler; they are simply scripts that usually start on one page, read its links, then visit them, read the links on the visited pages and then visit those, and so on. You could take a picture of the crawler's source code (but then again, it's different depending on its quality and the language it's written in), or of its GUI if it has one. I doubt it would "add interest" to the page, though. 78.1.130.222 (talk) 02:58, 18 December 2009 (UTC)

Open Source and Free to use Web Crawlers

What about Web Harvest - http://web-harvest.sourceforge.net/ On their website it is advertised as a Web data extraction tool (Java based), but really it also works as a web crawler, autonomously visiting a set of links, etc. Maybe this could be added to the Open Source Web Crawlers list! -- Preceding unsigned comment added by 131.231.108.132 (talk) 16:56, 22 February 2010 (UTC)

What about Seekquarry/Yioop - http://www.seekquarry.com/ It is an open source, GPLv3 search engine and crawler. The demo site http://www.yioop.com/ currently has a crawl of over 1/3 of a billion pages.

What about import.io? - http://import.io -it's free and the pricing page claims it always will be (disclaimer: I work with them, but that doesn't make it less true!)

"no search engine indexes more than 16% of the Web" Out of date and without proper context

This study was done in 1998. It may be the case that the Internet grows at a pace comparable to the rate at which web crawlers crawl. This ratio may also be based on all accessible content rather than indexable content. Not all content is indexable; for example, some pages don't want to be crawled.

Another published estimate, although in a very short form, claims that there are 11.5 billion indexable pages [1] on the web, and that Google, for example, has indexed 8 billion of them. This paper was published in 2005 at the World Wide Web Conference in Chiba, Japan (I have not heard of it but it seems to be associated with ACM [2]).

This should be updated.

FIXED, referenced Gulli & Signorini ChaTo (talk) 10:21, 3 August 2010 (UTC)

Transwikied Content

The following came from b:Web crawler. The sole contributor was 129.186.93.50

Web Crawler A program that downloads pages from the internet by following links

Examples: - google bot - yahoo ...

In general, all the search engines have a web crawler that collects pages from the web for them. This is done by starting with a page, then downloading the pages that it points to, then downloading the pages that these pages point to, and so on and so forth. The names of the already downloaded pages are kept in a database in order to avoid redownloading them.

The reach (pages from the web that are downloaded) of this whole technology is dependent upon the initial pages where the downloading starts. Basically the downloaded pages are all the pages reachable from those initial pages (unless additional constraints are specified). The current eight billion pages that Google crawls are estimated to be only 30% of the web for this reason.

Article History

21:55, 18 May 2005 Popski (VfD)

20:38, 15 May 2005 129.186.93.50

20:29, 15 May 2005 129.186.93.50

Merge with spidering

The new article on Spidering should definitely be moved into this article. Fmccown 18:47, 8 May 2006 (UTC)

Absolutely Not! Spidering and Web Crawling are exactly opposite terms.

   Spidering = The network of web pages and their inter-connection to each other.

   web-crawling = The art of finding specific information from that web or internet.

I guess this is the most comprehensive answer I can give! Any comments or suggestions are welcome.

  Raza Kashif (l1f05mscs1025@ucp.edu.pk)

These should not be merged. Spidering is more a method of navigation compared to scraping. Scraping is the specific task of acquiring data from a page, and navigating to it is just one facet among hundreds in scraping. -- Preceding unsigned comment added by DataFace (talk • contribs) 10:46, 17 February 2014 (UTC)

Verifiability

"Some spiders have been known to cause viruses." No citation, examples, or explanation for how this is possible. I'm removing this sentence, as I don't believe it is true. Requesting a document by URL can't give the server a virus! (Of course, if somebody knows something I don't, please restore the sentence, and cite your sources!)

--Sorry, no source for this, but I have heard of several cases where a spider has literally flooded a server with requests, resulting in the server going down temporarily. It's not a virus in any way, but it is certainly possible to try to overwhelm a server with thousands of requests. A simple way to prevent this would be a cap on the number of requests a server accepts per minute. -- Preceding unsigned comment added by 203.161.72.40 (talk) 11:57, 21 May 2007 (UTC)
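A per-minute cap of the kind suggested above could look roughly like this sliding-window sketch in Python. The class name RequestCap, its parameters and the idea of keying on a client identifier (for example an IP address) are assumptions made for illustration; real servers usually rely on their own rate-limiting modules.

```python
import time
from collections import defaultdict, deque


class RequestCap:
    """Hypothetical limiter: allow at most max_requests per client per time window."""

    def __init__(self, max_requests, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                  # injectable so the cap can be tested
        self._hits = defaultdict(deque)     # client id -> times of recent requests

    def allow(self, client):
        """Return True and record the request if client is under the cap."""
        now = self.clock()
        hits = self._hits[client]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                    # over the cap: reject or delay
        hits.append(now)
        return True
```

A server would call allow() once per incoming request and answer rejected ones with an error such as HTTP 429 (Too Many Requests) instead of serving the page.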

Question

How come WebBase is considered an open source crawler when its source is unknown?! -- Preceding unsigned comment added by 196.203.196.12 (talk) 11:35, 1 February 2007 (UTC)

WebCrawler the Search Engine

Before Google, there were four search engines, i.e. MS, Yahoo, Lycos, and Web Crawler. Web Crawler is nowhere to be found, not even on Wikipedia. Any idea what happened? Hassanfarooqi (talk) 15:02, 4 August 2010 (UTC)

You're probably thinking of Excite, Lycos, AltaVista, Inktomi, etc. Microsoft didn't have its own search engine until later. Yahoo was only a directory and used Inktomi and Google before launching their own search. However, [Webcrawler] does indeed have its own Wikipedia page. NipponBill (talk) 02:28, 5 August 2010 (UTC)

Rest easy, Hassanfarooqi as WebCrawler is alive and well: http://www.webcrawler.com/ Yours, Wordreader (talk) 02:44, 17 June 2013 (UTC)

World's first web crawling search engine

http://www.bbc.co.uk/news/technology-23945326 I think this might be JumpStation (albeit archived version) http://web.archive.org/web/19971210202429/http://js.stir.ac.uk/jsbin/jsii

maybe also

http://web.archive.org/web/20040721083401/http://js.stir.ac.uk/

(213.167.69.4 (talk) 09:55, 4 September 2013 (UTC))

Visual Scrapers vs Programmatic

There are a number of visual scraper products available on the web. One of the main differences between a classic and a visual scraper is the level of programming ability required to create a crawler/scraper. The latest generation of 'visual scrapers' like OutWit Hub and import.io remove most of the need to be able to program in order to scrape. That opens up the ability for non-programmers to scrape and structure data from the web.

The Visual Scraping methodology relies on the user 'teaching' a piece of software which patterns to follow to get data and in what schema. The dominant method is highlighting data in a browser and training columns and rows. This video demonstrates the theory: https://www.youtube.com/watch?v=xaktDXnBxNs DataFace (talk) 11:17, 17 February 2014 (UTC)

Investment and acquisitions

In 2013 investment in 'visual web scrapers' has increased; just look at Crunchbase for plenty of examples. This is being driven by the increase in demand for business intelligence apps, particularly driven by big data.

In 2013 Import.io, a visual scraper/crawler company, raised 1.3m in seed funding, and Google has acquired scraping companies in the past (Needlebase being the prime example) and subsequently shut them down. -- Preceding unsigned comment added by DataFace (talk • contribs) 11:18, 17 February 2014 (UTC)

Eyeball poison

A lot of work has clearly gone into this article, but the "nomenclature" paragraph I just demoted from the lead was verging on eyeball poison. First impressions, and all that. — MaxEnt 01:35, 22 February 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Web crawler. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Corrected formatting/usage for http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf

Cheers. -- cyberbot II (Talk to my owner: Online) 14:44, 31 March 2016 (UTC)

Proposal to merge Knowbot into this article

I propose that Knowbot be merged into this article. It's highly unlikely that we're going to find new sources for that article, and I think the content there would be better suited here than as a permastub. Enterprisey (talk!) (formerly APerson) 03:49, 19 July 2016 (UTC)

Merge - I agree. That/This should happen. –MJL ‐Talk‐☖ 19:23, 9 December 2017 (UTC)
Merge - I agree with APerson and MJL. = paul2520 (talk) 00:09, 19 February 2018 (UTC)
Do not merge - I disagree with APerson, MJL, and paul2520. = alexandrenf (talk!) 04:34, 5 March 2018 (UTC)

[1] A. Gulli, A. Signorini, "The Indexable Web is More than 11.5 Billion Pages." World Wide Web Conference. Chiba, Japan. 2005

[2] ACM Portal Link
