Talk:Robots exclusion standard

WikiProject Computing (Rated C-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.

WikiProject Internet (Rated C-class, High-importance)
This article is within the scope of WikiProject Internet, a collaborative effort to improve the coverage of the internet on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.

Old archive

For the old archive, please see Talk:Robots.txt protocol. —Preceding unsigned comment added by Vacuum (talk • contribs) 02:29, 27 March 2004

This is a red link. Whoever moved the page messed up, or the page was moved using some form of bot/app, which just posted the message without checking for the presence of a talk page.--|333173|3|_||3 05:38, 27 June 2006 (UTC)

Renaming

Perhaps this article should be named Robots exclusion standard instead of Robots Exclusion Standard? Wmahan. 00:17, 2004 Sep 12 (UTC)

Seems like it was approved. -62.219.97.118 (talk) 23:07, 2 July 2008 (UTC)

guest: This page used to be findable under "robot exclusion standard", as my browser still remembers finding it there. I see no reason not to retain a top-level link under robot versus robots so that either "robot exclusion" or "robots exclusion" will find the same page, esp. as nobody would guess to search for it under the plural (I didn't). I gripe because the old page was not found, not even with a "this was renamed or moved" notice. —Preceding unsigned comment added by 76.235.68.177 (talk) 01:09, 13 December 2008 (UTC)

Warning

I have fixed the Warning section, reduced it to a level 3 heading (=== ... === instead of == ... ==), and removed the {{tone}} tag.--|333173|3|_||3 05:38, 27 June 2006 (UTC)

Google info removed

Google uses comments for the same purpose: <!--googleoff: index--> ... <!--googleon: index-->

A source is needed. - Ta bu shi da yu 13:53, 14 August 2006 (UTC)

NOINDEX

Can anyone confirm this? It sounds general, but I know of not a single reference anywhere. projectphp 00:13, 15 August 2006 (UTC)

AFAIK, the NOINDEX tag was introduced by Yandex, a Russian search engine; see the Yandex help page (in Russian). 212.176.39.52 12:12, 15 August 2006 (UTC)

<!--noindex--> / <!--/noindex-->

Another way to exclude a portion of a web page from indexing is used by the ASPSeek and DataparkSearch search engines: two special comments mark the beginning and the end of the region to exclude, <!--noindex--> and <!--/noindex-->; see DataparkSearch's documentation.
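
To illustrate the idea only: the sketch below removes everything between the two comments before the remaining text would be handed to an indexer. How ASPSeek or DataparkSearch actually implement this is not stated here, so the regular expression and the helper name strip_noindex_regions are assumptions made for the example.

```python
import re

# Matches one region delimited by <!--noindex--> ... <!--/noindex-->.
# Non-greedy, so several separate regions on one page are removed independently.
NOINDEX_REGION = re.compile(
    r"<!--noindex-->.*?<!--/noindex-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex_regions(html):
    """Return the page source with every noindex region removed (hypothetical helper)."""
    return NOINDEX_REGION.sub("", html)


page = (
    "<p>Indexed text.</p>"
    "<!--noindex--><p>Navigation menu</p><!--/noindex-->"
    "<p>More indexed text.</p>"
)
indexable = strip_noindex_regions(page)
```

Only the two paragraphs outside the comments remain in indexable; an engine that does not know these comments would simply index the whole page.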

Examples section

There's a bit of a discrepancy between the first two examples and the others; the first two talk about "robots" while the others talk about "crawlers". Should this be fixed/changed? Aeluwas 21:14, 30 May 2007 (UTC)

When search engines talk about their robots, they tend to call them "crawlers". However, the robots.txt applies to all robots, even the ones that don't crawl (and just check sites). Accordingly, I suggest that we use "robots" as a standard term for this article unless it's in a section that is very clearly only about a search engine crawler (such as the crawl delay). Ian McAnerin (talk) 05:20, 21 November 2007 (UTC)


The first example says that it allows all robots to crawl all directories so why is Mediapartners-Google mentioned in the user-agent section?--87.80.96.31 (talk) 19:37, 30 June 2008 (UTC)

Spam / Useless Links

I just removed the following external link: *[ht tp://www.google-msn-yahoo.info/ Windows XP Update Repaire] It caught my eye when I noticed "repaire" was spelled wrong. When I followed the link, it went to one of the spammier sites I've ever seen. The top half was all about wooden flooring, and there was a little tiny note at the bottom saying that robots.txt is important. Ian McAnerin (talk) 05:05, 21 November 2007 (UTC)

Great, have a cookie M. Poirrot. —Preceding unsigned comment added by 91.125.242.254 (talk) 00:45, 16 July 2009 (UTC)

History

http://yro.slashdot.org/comments.pl?sid=377285&cid=21554125 gives the history of the robots.txt standard. However, I'm not sure if the information is particularly encyclopedic, and I'm betting a Slashdot comment isn't a reliable, verifiable source. OTOH, the people monitoring this talk page might want to chase it down. Theorbtwo 23:54, 2 December 2007 (UTC)

Dynamic Links

There's no info on dynamic links. ceo 13:21, 7 December 2007 (UTC)

www.share_ali.com —Preceding unsigned comment added by 82.38.218.169 (talk) 07:48, 21 September 2008 (UTC)

standard ?

The title is misleading. This is no standard but a protocol. hAl (talk) 15:30, 13 February 2009 (UTC)

A protocol can be a standard. But this one has no standards body, no formal procedures, and the document that describes it is rather colloquial and lacks the formal rigor that is to be expected from a spec. The criteria for an actual standard are not fulfilled. It might be called a de facto standard, but the lemma shouldn't use the word standard at all.--87.162.37.163 (talk) 02:38, 5 January 2010 (UTC)

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was no consensus. @harej 00:16, 24 August 2009 (UTC)



Robots exclusion standard → robots.txt. Some googling suggests this term may be more common. --Cybercobra (talk) 18:29, 14 August 2009 (UTC)

  • Weak oppose — I've been putting off giving my opinion on this one, but I see that it's in the backlog now. I see your point Cybercobra, but that "robots.txt" is a more common term seems to be an artifact of the fact that people are uninformed or just lazy. There is an actual standard here, and "robots.txt" is simply a filename given to an implementation of that standard, so I think that naming the article robots.txt could be unnecessarily constraining. Using a redirect from the filename to the actual standard seems to be a more appropriate setup, here.
    V = I * R (talk) 23:31, 22 August 2009 (UTC)
The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

ACAP

Does ACAP belong here? It's no kind of extension for robots.txt - it's a totally different proposal (not even a standard) —Preceding unsigned comment added by 78.86.8.122 (talk) 10:30, 16 October 2009 (UTC)

Regular expressions

The big 3 (Google, Yahoo, MSN/Live/Bing) now support wildcards (* and $) in robots.txt, e.g.:

disallow: /*.php$

matches any URL path ending in .php —Preceding unsigned comment added by Smremde (talk • contribs) 12:14, 25 October 2009 (UTC)
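
A rough sketch of how such a pattern could be evaluated, assuming the usual reading that * stands for any run of characters and a trailing $ anchors the end of the URL path. The function names are made up for the example, and the engines' real matchers may differ in details.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """True if the rule pattern matches the path, starting from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


ends_in_php = path_matches("/*.php$", "/forum/index.php")
has_query = path_matches("/*.php$", "/forum/index.php?page=2")
unanchored = path_matches("/*.php", "/forum/index.php?page=2")
```

With the $ the rule stops matching as soon as a query string follows .php, which is why the anchored and unanchored forms give different answers for the last two paths.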

External Links Suggestions

As per the comments in the external links code, can I propose adding a link to

Top Ten active webcrawlers ?

Mtcooper (talk) 18:20, 22 February 2010 (UTC)

I like the proposal, but that page doesn't tell us much. It may just be the top ten among the page requests made to that particular site during some unknown recent time interval, or not even that. A similar link to a well-argued, reputable quantification of relative search spider visitation frequencies would be welcome. This doesn't appear to be one. Rp (talk) 22:38, 19 March 2010 (UTC)

"Sorry, this page no longer valid"

When I tried to check the url: http://www.kpmg.com.hk/robots.txt I got this message: "Sorry, this page is either no longer valid or currently under maintenance." What does it mean? The main URL does have a valid web page. Ottawahitech (talk) 17:29, 18 March 2010 (UTC)

It means that the webmaster of KPMG's Hong Kong office forgot to create a new robots.txt when they installed a new site, so they are not providing any directions to spiders (such as Google's) on which pages to index and which pages to avoid. (This should be clear to you from reading this article. What is missing?) Rp (talk) 20:33, 18 March 2010 (UTC)
Thanks for the response, Rp. I am wondering if this means it is OK to spider this site? Ottawahitech (talk) 18:25, 19 March 2010 (UTC)
I don't think so - it only means we don't know. I think legally spidering is OK but some webmasters get very upset when you do it so if it's a big deal for you, ask. Rp (talk) 20:48, 19 March 2010 (UTC)
It means that no URL at their site is restricted. The original robots.txt specification says that when the file is missing (404 response), everything at the site may be fetched without limitation. Any other 4xx or 5xx response means that everything is forbidden. The draft RFC from 1997, section 3.1, makes this clear (404 -> everything is fair game). 71.106.210.230 (talk) 06:59, 28 July 2010 (UTC)

Should "Agent" in "User-agent" be capitalized?

Granted, the robots exclusion protocol has not historically capitalized "agent" in any of its specifications or examples. However, as it's an informal agreement, not a standard or a proposal (including an RFC), such is not controlling. RFC 2616 does capitalize "agent", and it has been accepted into the official HTTP standard. Therefore, as the use of "User-agent" in the robots context refers to the same header name and data as in the HTTP protocol ("User-Agent" at Section 14.43 et al.), shouldn't "agent" be capitalized in the robots context also? 71.106.210.230 (talk) 06:36, 13 July 2010 (UTC)

I can't find an example on robotstxt.org that does capitalize it, so no. <sarcasm>Inconsistent standards, yay!</sarcasm> --Cybercobra (talk) 10:01, 13 July 2010 (UTC)
Further:
The exact mixed-case directives may be required, so be sure to capitalize Allow: and Disallow: , and remember the hyphen in User-agent:

http://www.searchtools.com/robots/robots-txt.html

(emphasis mine) --Cybercobra (talk) 10:04, 13 July 2010 (UTC)

Looking back at the original 1994 robots.txt definition, it says that field names are case insensitive, so the "A" may be capitalized or not.[1] Additionally, the field data (for User-Agent info) should also be interpreted by robots as case insensitive for matching purposes. The draft RFC in 1997 repeats the case insensitivity for field names, even though it shows a lowercase "a" for this field's name in the ABNF syntax.[2] 71.106.210.230 (talk) 06:32, 28 July 2010 (UTC)

I can't find anything in the 1997 RFC to support the field names being case-insensitive. Regarding case, it says only "The name comparisons are case-insensitive.", but this is clearly referring to the names of the robots themselves. Could you specify a section number or quote from the RFC supporting your position? --Cybercobra (talk) 07:12, 28 July 2010 (UTC)

Both appear to be acceptable in robots.txt, but "User-agent" seems more commonplace, and that outweighs the RFC anyway. I would go with that. --Hm2k (talk) 08:34, 28 July 2010 (UTC)
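
For what it's worth, one widely available parser takes the case-insensitive reading: Python's standard urllib.robotparser lowercases field names before interpreting them. The sketch below only shows that one implementation's behavior; it says nothing about what the 1994 document or the 1997 draft require. The helper name, the ExampleBot agent and the /private/ path are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def blocks_private(lines):
    """Parse the given robots.txt lines and report whether /private/ is off limits."""
    parser = RobotFileParser()
    parser.parse(lines)
    return not parser.can_fetch("ExampleBot", "/private/page.html")


# The same two rules, with three different capitalizations of the field names.
lowercase_a = blocks_private(["User-agent: *", "Disallow: /private/"])
capital_a = blocks_private(["User-Agent: *", "Disallow: /private/"])
all_caps = blocks_private(["USER-AGENT: *", "DISALLOW: /private/"])
```

All three spellings are treated alike by this parser, so for sites read by such tools the capitalization of "agent" makes no practical difference.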

Glitched up Wayback Machine Beta in Night?

Hi, when you try to access the Wayback Machine in its Beta version, you get this error message: robots.txt has blocked this content from being crawled. Is there a way it could be fixed at night, Korea time? 121.164.146.185 (talk) 16:31, 11 November 2010 (UTC)

Chinese robots.txt?

From [ http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2011-08-15/Technology_report ]:

"A question raised at Wikimania – why the Chinese Wikipedia was getting so much more traffic than it used to – turned out to have a technical answer. The robots.txt file for the Chinese Wikipedia was written in both traditional and simplified Chinese, causing problems for bots from search engines and the like, a Chinese Wikimedian explained ( http://ultimategerardm.blogspot.com/2011/08/why-chinese-wikipedia-is-doing-so-well.html )."

I have no idea why someone would want to make a file that is not meant to be read by humans have multiple languages, but I don't see anything in the Robots exclusion standard that covers this. Does anyone have more information? Guy Macon (talk) 01:17, 17 August 2011 (UTC)

Robot (and/)or Crawler?

In section Examples the words robots and crawlers are intermixed. It may be confusing, but it may also be educational. I'm not sure if it needs fixing, and if so, what word to choose. David A se (talk) 17:21, 3 March 2012 (UTC)

Good catch. I just made it all "robot". "Crawler" is incorrect; "crawler" is a subset of "robot", and robots.txt makes requests of all robots, not just those robots that are also web crawlers. I also changed a few places where the page said robots.txt allows robots to do something with the more correct robots.txt telling the robot what to do; robots.txt doesn't actually allow or disallow anything but rather makes requests which robots are free to ignore. --Guy Macon (talk) 19:21, 3 March 2012 (UTC)

Robot blocker

Robot blocker can be cited [3]. --Trivanderumtequila (talk) 05:06, 26 November 2013 (UTC)

It looks like this phrase has been used a handful of times in the past month as tabloid journalists tried to explain robots.txt to a non-technical readership (possibly all quoting the same initial article). If that's all this is, I don't think it needs recording for posterity. --McGeddon (talk) 09:34, 26 November 2013 (UTC)

bingbot

The article states that bingbot complies with robots.txt, but that's not always factual. Last year we blocked crawlers from downloading images to save on bandwidth. This worked for every bot except bingbot, which continued to crawl directories that have been explicitly disallowed by robots.txt. (There are no bingbot/msnbot sections overriding this).

User-agent: *
Disallow: /images/
Disallow: /image.php
Disallow: /imagesize.php
Disallow: /photogallery/

Recently rules were added to "Forbid" bot requests to these disallowed URLs, and bingbot is the only bot to trigger them (at a rate of about 2-5 requests per minute). They all seem to come from legitimate MS IPs; here is one line:

157.55.39.42 - - [28/Feb/2015:20:25:53 -0500] "GET /images/deals/BT_1_973646942.jpg HTTP/1.1" 403 266 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

I looked it up and it seems I'm not the only one who's observed bingbot's disregard for robots.txt:

http://www.abivia.net/forum/13-web-site-hosting-and-web-site-management/763-blocking-spiders-that-ignore-robotstxt-eg-bing

https://graphiclineweb.wordpress.com/2013/06/14/bing-banned/

https://www.techinasia.com/bing-denies-wrongdoing-sogou-privacy-leak-mess/

What would be the best way to improve the article's accuracy? Add examples of bingbot ignoring robots.txt at this point? "Some major search engines following this standard include Ask,[7] AOL,[8] Baidu,[9] Bing,[10] Google,[11] Yahoo!,[12] and Yandex.[13]" A new section maybe? 69.112.203.39 (talk) 02:02, 1 March 2015 (UTC)
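
As a sanity check on the rules themselves, the robots.txt quoted above can be run through Python's standard urllib.robotparser together with the path from the quoted log line. This is only a sketch of what one compliant parser concludes; it does not show what bingbot's own parser does.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as quoted in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /image.php
Disallow: /imagesize.php
Disallow: /photogallery/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Path taken from the quoted access-log line.
requested = "/images/deals/BT_1_973646942.jpg"
bingbot_allowed = parser.can_fetch("bingbot", requested)

# A path outside the disallowed directories, for comparison.
homepage_allowed = parser.can_fetch("bingbot", "/")
```

Under this parser the logged request falls under Disallow: /images/ while the home page stays fetchable, so the file as written is not ambiguous.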

First contact the authors and ask them what their bot is doing? Rp (talk) 09:11, 4 March 2015 (UTC)
I asked, and this is what they say:

We've received a response from our Product Group and we're happy to inform you that Bingbots respected the Robots.txt directives by not showing the content. The robots.txt disallow directive generally does not exclude a URL from index but blocks its content. Evidently, Bing's image crawlers does not show the blocked images of the site as configured in its robots.txt, this shows that Bingbot is following the directives.

I've asked for further clarification if there is any mechanism supported by bingbot that would stop the crawling. In regards to the article, it looks like a new section may be needed to clarify that not all bots interpret "disallow" to mean stop crawling. 69.112.203.39 (talk) 16:41, 8 March 2015 (UTC)
They've finally responded:

We would like to inform you that a backend fix was done by our Product Group which should address your concern. Kindly verify this from your end.

Their fix did stop bingbot from crawling blocked URLs on our site. It's not clear to me if this was a local fix, but hopefully, as of March 27, bingbot is obeying everyone's robots.txt. If so, then the article's facts are now accurate. 24.190.42.28 (talk) 14:07, 31 March 2015 (UTC)

This Article Needs a Major Disclaimer about Robots.txt not Working

Robots.txt instructs crawlers not to crawl a page, but that's not sufficient to keep content out of search results. To accomplish that (with Google, Bing and Yahoo), you must *not* use robots.txt and instead you must use the noindex HTML tag and the X-Robots-Tag HTTP headers. Since most people coming to this page are trying to understand how to block a page from appearing in search engines, I think a new section should be added that explains this in detail. Currently, this article addresses this at the very end by saying:

> Even if a robot honors robots.txt, it is still possible for the robot to find and index a disallowed URL from other places on the web. This can be prevented by using robots.txt directives in combination with robots meta tags or X-Robots-Tag headers.

That's just barely correct. If you use robots.txt in combination with robots meta or X-Robots-Tag headers, the result will be that your content is not crawled, the meta tags will not be seen, and the item will show up in search results.

I intend to update this article with this information because in my experience almost all web devs expect robots.txt to work, and are then shocked when it doesn't. For many search engines (including Bing, Yahoo! and Google, the three most important ones to English readers) it's a deprecated way of blocking content, and this article should reflect that.

Before I make these changes, I'm interested in any feedback people might have about this proposal. --mjlissner (talk) 18:04, 8 April 2015 (UTC)

I think it would be wrong. Specifically, your statement that you must not use robots.txt is wrong; Google, Bing and Yahoo do honor robots.txt and I can't find any evidence that it is deprecated by anyone. See e.g. Bing's explanation on how to create one. Before you make any change I think it would be good to find some sources for your statements. Rp (talk) 09:40, 9 April 2015 (UTC)
If you have a reference that explains the problem and solution, that would be great. I don't think sharing your experiences directly on this article fits the Wikipedia model; it would make more sense to write that up as a blog post or article somewhere. I would be very interested to see references about robots.txt being deprecated. Hope this helps, Npdoty (talk) 17:33, 9 April 2015 (UTC)
Sorry, deprecated isn't what I meant. It's not technically deprecated, it's more that it just doesn't work anymore and shouldn't be encouraged. If a page is listed in robots.txt, and it's linked to by any other page on the Web, it can show up in Google and Bing, even if you use the noindex HTML or HTTP tags. This is because robots.txt tells crawlers not to crawl a page, and since they do support robots.txt, that means that they can't crawl the page and thus can't see the HTML or HTTP noindex flag. As a reference, here's Google's article on this, which explains it well. The only way to make sure a page doesn't show up in Google and Bing is to not include it in robots.txt, and instead to use noindex flags. Bonus points for including the page in your sitemap.xml so crawlers are explicitly encouraged to visit the page and discover the noindex flags. This is why I say that robots.txt is deprecated -- it doesn't work for the purpose it was intended.--mjlissner (talk) 23:44, 20 April 2015 (UTC)
You *did* write a blog post about this, I think. Link to Google is good, but a link to your post also explains this. For reliability purposes the Google page is best. Are there similar pages on Bing or other search engines? Brianwc (talk) 01:28, 21 April 2015 (UTC)
Yes, Bing has a page on this topic too, including juicy quotes such as: "make sure not to disallow the URL from being crawled using robots.txt" and "you should not block the URL from being re-crawled through robots.txt". Here's my blog post, though it's not relevant until about halfway through, and the Google/Bing pages are probably more authoritative. --mjlissner (talk) 01:35, 21 April 2015 (UTC)
It occurs to me that the way to frame this is something like, "It is ironic that, due to the policies of the two leading U.S. search engines, the robots.txt protocol does not actually prevent pages from being indexed in all cases and indeed, in order to ensure a page's content is not indexed, both of these search engines require that one not list a page one wishes blocked in one's robots.txt file.[1,2cites]" That is, it's entirely their fault that things are like this. They could adopt a different policy that would say: "we'll retrieve your robots.txt, compare it to our index, and we'll proactively remove anything you list." Nothing stops them adopting that policy but their own desire to index more pages and decision to thwart the purposes of the robots.txt protocol. So 'deprecated' is definitely the wrong word. Thwarted by powerful incumbents is accurate. Brianwc (talk) 03:33, 21 April 2015 (UTC)
FWIW, here's Google's official explanation of why they don't use robots.txt as the sole determiner of whether to crawl.--mjlissner (talk) 06:37, 21 April 2015 (UTC)
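
The interaction argued about in this thread can be put as a small sketch: a noindex flag only takes effect if a compliant crawler is allowed to fetch the page and therefore gets to read it. The function name, the ExampleBot agent and the paths below are made up for the example, and only the X-Robots-Tag header form is modelled; real search engines may apply further rules.

```python
from urllib.robotparser import RobotFileParser


def noindex_takes_effect(robots_lines, path, response_headers, user_agent="ExampleBot"):
    """True if a compliant crawler can fetch `path` and finds a noindex directive on it."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(user_agent, path):
        # The page is never requested, so its headers are never read.
        return False
    # Assumption: headers are given as a plain dict with this exact key.
    directives = response_headers.get("X-Robots-Tag", "")
    return "noindex" in [d.strip().lower() for d in directives.split(",")]


headers = {"X-Robots-Tag": "noindex"}

# Page disallowed in robots.txt: the noindex header goes unseen.
blocked_page = noindex_takes_effect(
    ["User-agent: *", "Disallow: /secret/"], "/secret/page.html", headers
)

# Page left crawlable: the crawler fetches it and sees noindex.
crawlable_page = noindex_takes_effect(
    ["User-agent: *", "Disallow: /other/"], "/secret/page.html", headers
)
```

In the first case the URL can still be listed from links elsewhere on the web, which is the situation the quoted Google and Bing pages warn about.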