Talk:Deep web (disambiguation)/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

The pay per view site plague

Unmentioned in the article (and I am not about to add it; the resulting warfare is one in which I do not have time to engage) is that when search engines choose to search the deep web and thereby fill the early set of returned hits with pay per view site hits, the very same information available for free elsewhere (say on the original authors' home pages of publications) may be effectively hidden from the person searching for it, in favor of the versions on pay per view sites, which are often high ranked hits. To me, it would be useful for all search engines to have an easy way to exclude from the returned list of hits all returned URLs that go through a pay per view site path. If any participant can think of a way to say this in the article that wouldn't result in a "remove/repost" war, you have my blessing to steal the idea. Xanthian (talk) 07:51, 26 June 2008 (UTC)

Title change from "invisible web" to "deep web"

Why was this page moved from invisible web to deep web? The first term produces 102,000 results in Google; the latter only 29,900. I'd suggest keeping invisible web as the main page, with a redirect from deep web, not the reverse as is the case now. Any thoughts? Spinster 12:40, 2 Apr 2004 (UTC)

I changed the name after reading the arguments in http://library.albany.edu/internet/deepweb.html , which seem reasonable to me, but I do not mind too much if it is changed back. --Patrick 19:55, 2 Apr 2004 (UTC)

I see -- I can agree with these arguments, so on second thought I don't have too strong an opinion for or against either term (more widespread use vs. a more correct term). What to do? Turn this into a poll and let more opinions speak? ;-) Spinster 16:13, 3 Apr 2004 (UTC)

Web pages moving from deep Web to surface Web

I think there is a lot of confusion about whether something can move from the deep web to the surface web. For example, assume I have a web page at http://foo.org?abc that can only be found by entering a search query, and I don't have a link pointing to it. We would classify this web page as being in the deep web. But if someone else found the page and put a link to it from their web site, would the page still remain in the deep web? If a search engine finds my page by using some new technique (not crawling), does the page remain in the deep web?

In other words, under what conditions would we say a web page has moved from the deep web to the surface web?

Discussion? I think this info would be useful in the article.

Thanks, Frank

I think that if a page can normally only be accessed by interfacing with a database it is part of the deep web, even if linked to from other places. This, however, causes a contradiction in some cases, such as with Wikipedia itself. The only real way to get to articles in Wikipedia is to use the search box, making it part of the deep web. But so many people link to it, and it links to itself so much, that it is ranked high in all Google searches. -- Singpolyma^{T E} 09:30, 29 September 2005 (UTC)

Actually, Wikipedia is not deep web; it's surface web. On a top page there is a link to http://en.wikipedia.org/wiki/Special:Allpages , which is the first page of the alphabetical index with links to ALL pages. It also has a link to the next index page, so a crawler can find that index, follow all the successive index pages, and from there reach every page. But if we ignore that, then it is a good example.

Many digital libraries do this as well to make their websites crawler-friendly. They create webpages for crawlers to follow so their once deep web resources become surface web resources. Since everyone wants to be in Google, websites are being altered to move deep web resources into the surface web. -- Fmccown 14:29, 18 November 2005 (UTC)

A point I notice is that the surface web depends on what the set of starting pages is. RJFJR 05:36, 9 October 2005 (UTC)
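The last two points in this thread (crawler-friendly index pages, and the dependence on the starting set) can be illustrated with a small reachability sketch. The link graph and page names below are entirely hypothetical; the "surface web" here just means whatever a link-following crawler reaches from its seed pages.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["index-1"],
    "index-1": ["article-a", "index-2"],  # index page with a "next" link
    "index-2": ["article-b"],
    "article-a": [],
    "article-b": [],
    "private-hub": ["search-only"],       # nothing links to the hub itself
    "search-only": [],
}

def reachable(seeds, links):
    """Return the set of pages a link-following crawler reaches from seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Same graph, two different seed sets.
surface_from_home = reachable(["home"], LINKS)
surface_from_hub = reachable(["home", "private-hub"], LINKS)
```

With only "home" as a seed, "search-only" stays outside the reachable set; adding "private-hub" to the seeds pulls it in, without any change to the page itself.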

Hosting website on my computer

If I was to host a website off my computer, it would only be identified by my IP. Since it is inaccessible to the vast majority of users, does this make it part of the deep web? Deskana 00:04, 15 December 2005 (UTC)

Yes, as long as search engines are not aware of the site, and as long as someone doesn't post a link from the surface web to your web site (thus making it discoverable to a search engine), your site will remain in the deep web. Fmccown 18:25, 24 May 2006 (UTC)

Site referred to but not linked to

If I created a site without meta-tags, and, keep in mind, no-one knows about it, so that there are no hyperlinks pointing to it, and someone posts a link like www.deepornot.com, not this kind of link, but just a plain text link, would it be part of the deep web? 70.25.138.179 04:21, 9 February 2006

Maybe, maybe not; who knows. But you can use the <meta name="robots" content="none"> tag to prevent robots from indexing it, and also place a robots.txt file on the site, as defined by the Robots Exclusion Standard. -- Frap 18:42, 3 May 2006 (UTC)

If there is an actual hypertext link to the site from the surface web, then the site is no longer in the deep web. If the site pointing to your site is unknown to all search engines, then your site will remain in the deep web until you somehow make your presence known (for example, submitting your URL to Google). Fmccown 18:23, 24 May 2006 (UTC)
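As a rough illustration of the robots.txt suggestion above, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the /private/ path, and the "ExampleBot" user agent are assumptions made up for the example; www.deepornot.com is just the placeholder name from the question. Note that robots.txt only asks well-behaved crawlers to stay away; it does not enforce anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content asking all robots to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="ExampleBot"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

blocked = may_crawl("http://www.deepornot.com/private/page.html")
allowed = may_crawl("http://www.deepornot.com/index.html")
```

A crawler that honors the standard would skip the first URL and fetch the second.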

External link to Discovery Engine

User:Vald, you added a link to the Discovery Engine (currently the last external link). The link now redirects to another website, which does not really have much information and is more of an advertising nature. Is this what you wanted to link to? Or was the Discovery Engine once a search engine that has since closed?

Would be nice if you could have a look at that and remove the link if appropriate. Thanks! —84.157.234.77 07:26, 20 August 2006 (UTC)

Accessing: Poogee

When I used the external link to Poogee, all I saw was a People and Business Search. I don't think this site is a very good example of a deep web engine. Perhaps more is under development from Poogee; however, it is not very useful for searching the deep web with only yellow/white page searches made available. I would recommend providing a link to the following site instead: BrightPlanet's http://www.completeplanet.com. There are also other resources, such as subject directories that are useful in digging deeper and more effectively through the web: Infomine (http://infomine.ucr.edu), LII (http://www.lii.org), and RDN (http://www.rdn.ac.uk/).

Changes to Crawling the deep web & Classifying resources

I just made additions to the content to Crawling the deep web & Classifying resources. This is my very first time editing on wikipedia, so I hope I did it correctly and followed all the rules. I mistakenly marked one change as minor, but it wasn't. I created a dummy edit to indicate the mistake. Let me know if there is anything that I need to adjust.... Dside 22:23, 6 September 2006 (UTC)

Images

Should the bit about images not being the deep web be removed, since most search engines index images now? --Dan Leveille 05:40, 22 January 2007 (UTC)

Size of the deep Web

The textual reference to "How Much Information Is There?" is vague and is not backed up by a reference or external link (don't confuse Lesk 1997, which is ref'd and linked, with the U. of Cal. publication). There are in fact two such UC publications/projects, a 2003(!) version: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/ and the "original" 2000 version: http://www2.sims.berkeley.edu/research/projects/how-much-info/ . Somebody should listen to the Science Friday archive at http://www.sciencefriday.com/pages/2007/Jul/hour2_072707.html , figure out which one Mostafa's citing, and modify the article to be appropriately specific (and link to the relevant resource). Moreover, it would be helpful to ascertain whether the projections -- they must be projections since the project dates to 2003 at the most recent -- are part of the original work by Lyman et al. at Berkeley or whether they represent additional work by Mostafa. —Preceding unsigned comment added by Paregorios (talk • contribs) 20:43, 19 September 2007 (UTC)

The article says the Deep Web is greater than the Surface Web by several orders of magnitude. But it also says its estimated size as of 2004 is about 300k sites. The comparison with the Surface Web isn't clear, since the article on the latter says its size is about 63 billion pages. A site should have a lot of pages, but it still seems the Surface Web is bigger. Also, the Deep Web is estimated at 7,500 TB, but that says nothing because that kind of information isn't available in the Surface Web article. Khullah (talk) 13:27, 2 August 2012 (UTC)

Yes, the size of the deep web is very outdated. — Preceding unsigned comment added by 69.132.221.40 (talk) 02:50, 18 April 2013 (UTC)

Interesting facts

I like this section, but don't you think using a 500kbps connection for the statistics is a little outdated? Mine's 8Mbps and you can get over 30Mbps now. —Preceding unsigned comment added by RaphaelBriand (talk • contribs) 00:38, 21 December 2007 (UTC)

Source of term "invisible Web"

According to Bergman (2001), Jill Ellsworth coined the term "invisible Web" in 1994. Bergman cites Garcia (1996) which reads like a contemporary interview with Ellsworth. Are there any 1994-95 sources where Ellsworth used the term? Nurg (talk) 03:00, 10 February 2008 (UTC)

HTTP headers Pragma and Cache-Control

"pragma:no-cache/cache-control:no-cache HTTP headers), prohibiting search engines from browsing them and creating cached copies" ... can these HTTP heders really stop any SE from browsing or caching document, e.g. in Google cache? --JanRenner 14:56, 25 April 2008 (UTC)

Requested move 1

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was: page moved per request. - GTBacchus^(talk) 01:44, 2 July 2011 (UTC)

Deep Web → Invisible Web – In a cursory search of Google Books, most books call it only invisible, some call it both invisible and deep, and very few call it only deep. Examples:

Per WP:COMMONNAME, the most common name in English language reliable sources is Invisible Web.

--Enric Naval (talk) 15:10, 22 June 2011 (UTC)

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Have Alternate DNS root servers been taken into consideration?

Such as http://en.wikipedia.org/wiki/Alternate_DNS_root Although there are plenty more.

What?

167 terabytes.

So the whole open web fits on a hundred hard drives. — Preceding unsigned comment added by 69.157.176.191 (talk) 11:06, 16 September 2011 (UTC)

Maybe they were referring to the Internet in 1997. Wikifan21century (talk) 19:10, 28 September 2011 (UTC)

Back when hard drives had at most 8.4 GB[2] = 0.0084 terabytes. (remember that hard drive manufacturers use base 10 for disk capacities, so all numbers are off! I'm too tired to make the correct calculations)

So you would need 167/0.0084 ≈ 19,881 hard disks.

And, at 9.31 cents per megabyte, 9.31 * 167,000,000 = 1,554,770,000 cents = 15,547,700.00 dollars = over 15 million dollars to store the whole Internet, in 1997 dollars.

Not to mention the physical space occupied by the disks. --Enric Naval (talk) 21:21, 28 September 2011 (UTC)
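The back-of-the-envelope figures above can be rechecked with integer arithmetic. This is only a sketch using the numbers quoted in the comment (167 TB, 8.4 GB per disk, 9.31 cents per megabyte) and base-10 units throughout; the variable names are made up for the example.

```python
# Figures quoted in the comment above, in base-10 units.
TOTAL_TB = 167
DISK_TENTHS_OF_GB = 84               # 8.4 GB per disk, kept as an integer
PRICE_HUNDREDTHS_CENT_PER_MB = 931   # 9.31 cents per MB, kept as an integer

# Number of whole disks: 167 TB = 167,000 GB = 1,670,000 tenths of a GB.
total_tenths_of_gb = TOTAL_TB * 1000 * 10
disks_needed = -(-total_tenths_of_gb // DISK_TENTHS_OF_GB)  # ceiling division

# Storage cost: 167 TB = 167,000,000 MB.
total_mb = TOTAL_TB * 1_000_000
cost_cents = total_mb * PRICE_HUNDREDTHS_CENT_PER_MB // 100
cost_dollars = cost_cents // 100
```

The division 167/0.0084 comes to roughly 19,880.95, so 19,881 whole disks are needed, and the cost works out to 1,554,770,000 cents, or 15,547,700 dollars.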

Requested move 2

The following discussion is an archived discussion of the proposal. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the proposal was move per request per the proposal and support. In addition to the Scholar search cited, I also performed this restricted search of Google Books for 2011 to 2012, excluding the topics suggested in the proposal, and found Deep Web to be more common on a pure numbers basis.--Fuhghettaboutit (talk) 13:30, 6 May 2012 (UTC)

Invisible Web → Deep Web – As the author most cited for this concept^[1], I request a change of the standard title for this page back to "Deep Web". It has been the most common title for this article for eight years, and was only changed to "Invisible Web" following a move request by Enric Naval in June 2011.

While Naval noted a higher incidence of Google Book references to justify the change to "Invisible Web", I believe this justification to be incomplete and in error for the following reasons (all citations are as of April 29, 2012):

As cited in the main article, I am the original explicator of this concept and noted the appropriateness of "deep Web" over "invisible Web" in my original paper with this sentence: "For this study, we have avoided the term 'invisible Web' because it is inaccurate. The only thing 'invisible' about searchable databases is that they are not indexable nor able to be queried by conventional search engines."
While it is true there are more Google Book citations for "invisible Web" (22,200) than "deep Web" (13,900), the first five "deep Web" citations are all relevant and from the years 2011, 2011, 2005, 2008 and 2000 (my original paper); the first five "invisible Web" citations are earlier (2003, 2001, 2001, 1991 and 1929), and only two are relevant to the concept, with these three clearly out of scope:
- The Invisible Web: Gender Patterns in Family Relationships, by Marianne Walters, Betty Carter, Peggy Papp, MSW - 1991 - 422 pages
- An Invisible Web (fiction), by Julie Coffin - 2003 - 99 pages
- The invisible web: strange tales of the French sûreté, by Harry Ashton-Wolfe, Edmond Locard - 1929 - 284 pages
A more authoritative basis for citation counts is Google Scholar. There are nearly twice as many citations on Scholar for "deep Web" (7,560) as for "invisible Web" (3,860), reflecting the research community's embracing of the "deep Web" term since I first published on it.
There are more than an order of magnitude more Web-wide citations of "deep Web" (10,700,000) than of "invisible Web" (750,000), also reflecting its common adoption.

Thus, per WP:COMMONNAME, the most common name in English language reliable sources, especially from both authoritative and common usage standpoints, is Deep Web. I therefore request reversion back to the standard title of "deep Web". Mkbergman (talk) 19:53, 29 April 2012 (UTC)

Now there's someone who's done his research. Not knowing much of the concept, I'm persuaded to Support, but I'm willing to entertain counterevidence. Powers ^T 18:59, 30 April 2012 (UTC)
Oh, well, I stand corrected. And very humbled. Wikipedia is one of the few places where you can get your edits pulled apart and corrected by the most informed people in the relevant field. I of course support this move. --Enric Naval (talk) 22:27, 30 April 2012 (UTC)

The above discussion is preserved as an archive of the proposal. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Access

Should we list a few sites that can access it? DeepPeep, Intute, Deep Web Technologies, and Scirus have articles. These others may be worthy of articles or external links?

--Canoe1967 (talk) 22:47, 20 July 2012 (UTC)

Another part of the Web that is neither Surface nor Deep?

There are many parts of the Web that are simply private sections where you have to enter with a password; for example, many discussion groups. These are, as far as I know, not in general indexed by any standard search engines, yet I doubt that most people would think the phrase "Deep Web" should be used to describe them.

Is this correct? If so, then the article should not define the Deep Web just as that portion of the Web not indexed by standard search engines. There should be (at least) a third portion of the Web, perhaps called the "Middle Web" or something like that. (Already in article as Private Web.) Daqu (talk) 20:54, 7 November 2014 (UTC)

I agree, the article looks totally unfocused to me. The article should stick to what the majority of sources say. Which, I would guess, is mainly sites on the Tor network. zzz (talk) 21:27, 7 November 2014 (UTC)

I mostly disagree with the main idea. Quote from the article, "Methods used to prevent web pages from being indexed by traditional search engines may be categorized as"..."Private Web: sites that require registration and login (password-protected resources)". (Note: I have edited the section). In regard to "in general, [indexing] by any standard search engines": There are many forum/discussion websites with username/password that are indexed by standard search engines.

Tor is another method of preventing a page from being indexed by traditional search engines.--Nodove 20:59, 14 November 2014 (UTC)

"theoretically almost any site can be accessed via its IP address"

While this statement does contain the qualifiers "theoretically" and "almost", I think it is still misleading. Apache can be configured to only allow a DNS URL and not return a page if the IP address is used, especially when many sites share one IP address. — Preceding unsigned comment added by 216.13.210.98 (talk) 17:33, 16 December 2014 (UTC)

See 09:55, 17 December 2014 edit in article history --Nodove (talk) 09:58, 17 December 2014 (UTC)
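The point about many sites sharing one IP address can be sketched as a toy model of name-based virtual hosting, where the server picks a site from the Host header of the request. This is not Apache's implementation; the host names, the table, and the function are hypothetical, and real servers often fall back to a default site unless configured to refuse unmatched requests.

```python
# Hypothetical virtual-host table: Host header value -> site served.
VHOSTS = {
    "example.org": "Example site",
    "example.net": "Another site",
}

def select_site(host_header):
    """Return the site for a Host header, or None if no virtual host matches."""
    host = host_header.split(":", 1)[0].lower()  # drop an optional port
    return VHOSTS.get(host)

by_name = select_site("example.org")    # request made with the DNS name
by_ip = select_site("192.0.2.10:80")    # request made with the bare IP address
```

In this model a request that carries only the IP address matches no entry, so the server has nothing to return for it, even though both sites live at that address.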

[1] Bergman, Michael K (2001). "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing. 7 (1). doi:10.3998/3336451.0007.104.
