For the old archive, please see Talk:Robots.txt protocol. —Preceding unsigned comment added by Vacuum (talkcontribs) 02:29, 27 March 2004

Perhaps this article should be named Robots exclusion standard instead of Robots Exclusion Standard? Wmahan. 00:17, 2004 Sep 12 (UTC)

Google uses comments for the same purpose: <!--googleoff: index--> ... <!--googleon: index-->

AFAIK, the NOINDEX tag was introduced by Yandex, a Russian search engine; see the Yandex help page (in Russian). 12:12, 15 August 2006 (UTC)

Another way to exclude a portion of a webpage from indexing is used by the ASPSeek and DataparkSearch search engines: two special comments mark the beginning and the end of the region to exclude, <!--noindex--> and <!--/noindex-->; see DataparkSearch's documentation.

There's a bit of a discrepancy between the first two and the other examples; the first two talk about "robots" while the latter talk about "crawlers". Should this be fixed/changed? Aeluwas 21:14, 30 May 2007 (UTC)

When search engines talk about their robots, they tend to call them "crawlers". However, the robots.txt applies to all robots, even the ones that don't crawl (and just check sites). Accordingly, I suggest that we use "robots" as a standard term for this article unless it's in a section that is very clearly only about a search engine crawler (such as the crawl delay). Ian McAnerin (talk) 05:20, 21 November 2007 (UTC)

The first example says that it allows all robots to crawl all directories so why is Mediapartners-Google mentioned in the user-agent section?-- (talk) 19:37, 30 June 2008 (UTC)

History gives the history of the robots.txt standard. However, I'm not sure if the information is particularly encyclopedic, and I'm betting a Slashdot comment isn't a reliable, verifiable source. OTOH, the people monitoring this talk page might want to chase it down. Theorbtwo 23:54, 2 December 2007 (UTC)

Dynamic Links

There's no info on dynamic links. ceo 13:21, 7 December 2007 (UTC) —Preceding unsigned comment added by (talk) 07:48, 21 September 2008 (UTC)

standard?

The title is misleading. This is no standard but a protocol. hAl (talk) 15:30, 13 February 2009 (UTC)

A protocol can be a standard. But this one has no standards body, no formal procedures, and the document that describes it is rather colloquial and lacks the formal rigor that is to be expected from a spec. The criteria for an actual standard are not fulfilled. It might be called a de facto standard, but the lemma shouldn't use the word standard at all.-- (talk) 02:38, 5 January 2010 (UTC)

Does ACAP belong here? It's no kind of extension for robots.txt - it's a totally different proposal (not even a standard) —Preceding unsigned comment added by (talk) 10:30, 16 October 2009 (UTC)

Regular expressions

The big 3 (Google, Yahoo, MSN/Live/Bing) now support wildcards (* and $) in robots.txt, e.g.:

disallow: /*.php$

matches any URL path ending in .php —Preceding unsigned comment added by Smremde (talkcontribs) 12:14, 25 October 2009 (UTC)
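To make the wildcard rule above concrete, here is a minimal Python sketch of how such a pattern could be matched. It assumes the commonly described semantics (a rule matches from the start of the path, * matches any run of characters, and a trailing $ anchors the end); the function names are hypothetical, and real crawlers may differ in details.

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule with * and $ into a regex.

    Assumed semantics: the rule matches as a prefix of the path,
    '*' matches any sequence of characters, and a trailing '$'
    anchors the match at the end of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def rule_matches(rule: str, path: str) -> bool:
    """Return True if the rule applies to the given URL path."""
    return rule_to_regex(rule).match(path) is not None
```

Under these assumptions, rule_matches("/*.php$", "/index.php") is true, while a path such as "/a/b.php?x=1" is not matched because the $ requires the path to end in .php.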

As per the comments in the external links code, can I propose adding a link to

I like the proposal, but that page doesn't tell us much. It may just be the top ten among the page requests made to that particular site during some unknown recent time interval, or not even that. A similar link to a well-argued, reputable quantification of relative search spider visitation frequencies would be welcome. This doesn't appear to be one. Rp (talk) 22:38, 19 March 2010 (UTC)

When I tried to check the url: I got this message: "Sorry, this page is either no longer valid or currently under maintenance." What does it mean? The main URL does have a valid web page. Ottawahitech (talk) 17:29, 18 March 2010 (UTC)

It means that the webmaster of KPMG's Hong Kong office forgot to create a new robots.txt when they installed a new site, so they are not providing any directions to spiders (such as Google's) on which pages to index and which pages to avoid. (This should be clear to you from reading this article. What is missing?) Rp (talk) 20:33, 18 March 2010 (UTC)
Thanks for the response, Rp. I am wondering if this means it is OK to spider this site? Ottawahitech (talk) 18:25, 19 March 2010 (UTC)
I don't think so - it only means we don't know. I think legally spidering is OK but some webmasters get very upset when you do it so if it's a big deal for you, ask. Rp (talk) 20:48, 19 March 2010 (UTC)
It means that no URL at their site is restricted. The original robots.txt specification says that when the file is missing (404 response), everything at the site may be fetched without limitation. Any other 4xx or 5xx response means that everything is forbidden. The draft RFC from 1997, section 3.1, makes this clear (404 -> everything is fair game). (talk) 06:59, 28 July 2010 (UTC)
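The reading given in the last reply can be sketched as a small decision function. This is only an illustration of that comment's interpretation (404 means no restrictions, any other 4xx or 5xx means everything is off limits); the function name and return labels are hypothetical, and individual crawlers may treat some status codes differently.

```python
def policy_for_robots_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy.

    Follows the interpretation stated in the comment above; it is not
    a definitive statement of what any particular robot does.
    """
    if 200 <= status < 300:
        return "parse"          # file exists: read and apply its rules
    if status == 404:
        return "allow_all"      # no file: nothing at the site is restricted
    if 400 <= status < 600:
        return "disallow_all"   # other errors: treat the whole site as forbidden
    return "undetermined"       # e.g. redirects, not covered by the comment
```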

Should "Agent" in "User-agent" be capitalized?

Granted, the robots exclusion protocol has not historically capitalized "agent" in any of its specifications or examples. However, as it's an informal agreement, not a standard or a proposal (including an RFC), such is not controlling. RFC 2616 does capitalize "agent", and it has been accepted into the official HTTP standard. Therefore, as the use of "User-agent" in the robots context refers to the same header name and data as in the HTTP protocol ("User-Agent" at Section 14.43 et al.), shouldn't "agent" be capitalized in the robots context also? (talk) 06:36, 13 July 2010 (UTC)

I can't find an example on that does capitalize it, so no. <sarcasm>Inconsistent standards, yay!</sarcasm> --Cybercobra (talk) 10:01, 13 July 2010 (UTC)

The exact mixed-case directives may be required, so be sure to capitalize Allow: and Disallow: , and remember the hyphen in User-agent:

(emphasis mine) --Cybercobra (talk) 10:04, 13 July 2010 (UTC)

Looking back at the original 1994 robots.txt definition, it says that field names are case insensitive, so the "A" may be capitalized or not.[1] Additionally, the field data (for User-Agent info) should also be interpreted by robots as case insensitive for matching purposes. The draft RFC in 1997 repeats the case insensitivity for field names, even though it shows a lowercase "a" for this field's name in the ABNF syntax.[2] (talk) 06:32, 28 July 2010 (UTC)

I can't find anything in the 1997 RFC to support the field names being case-insensitive. Regarding case, it says only "The name comparisons are case-insensitive.", but this is clearly referring to the names of the robots themselves. Could you specify a section number or quote from the RFC supporting your position? --Cybercobra (talk) 07:12, 28 July 2010 (UTC)

Both appear to be acceptable in robots.txt, but "User-agent" seems more commonplace and outweighs the RFC anyway. I would go with that. --Hm2k (talk) 08:34, 28 July 2010 (UTC)
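For what it's worth, a parser that treats field names as case-insensitive (the reading of the 1994 definition cited above) does not care which capitalization the file uses. A minimal Python sketch, with a hypothetical helper name, assuming one "name: value" field per line and # comments:

```python
from typing import Optional, Tuple

def parse_field(line: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt line into (field name, value).

    The field name is lowercased, so "User-agent", "User-Agent" and
    "user-agent" are all treated alike. Returns None for blank lines,
    comment-only lines, and lines without a colon.
    """
    line = line.split("#", 1)[0].strip()   # drop any trailing comment
    if ":" not in line:
        return None
    name, value = line.split(":", 1)
    return name.strip().lower(), value.strip()
```

With this approach, "User-Agent: Googlebot" and "user-agent: Googlebot" both yield the field name "user-agent", so the capitalization question only affects how the article should spell it.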

From [ ]:

"A question raised at Wikimania – why the Chinese Wikipedia was getting so much more traffic than it used to – turned out to have a technical answer. The robots.txt file for the Chinese Wikipedia was written in both traditional and simplified Chinese, causing problems for bots from search engines and the like, a Chinese Wikimedian explained ( )."

I have no idea why someone would want to make a file that is not meant to be read by humans have multiple languages, but I don't see anything in the Robots exclusion standard that covers this. Does anyone have more information? Guy Macon (talk) 01:17, 17 August 2011 (UTC)

Robot (and/)or Crawler?

In section Examples the words robots and crawlers are intermixed. It may be confusing, but it may also be educational. I'm not sure if it needs fixing, and if so, what word to choose. David A se (talk) 17:21, 3 March 2012 (UTC)

Good catch. I just made it all "robot". "Crawler" is incorrect; "crawler" is a subset of "robot", and robots.txt makes requests of all robots, not just those robots that are also web crawlers. I also changed a few places where the page said robots.txt allows robots to do something with the more correct robots.txt telling the robot what to do; robots.txt doesn't actually allow or disallow anything but rather makes requests which robots are free to ignore. --Guy Macon (talk) 19:21, 3 March 2012 (UTC)

Robot blocker can be cited [3]. --Trivanderumtequila (talk) 05:06, 26 November 2013 (UTC)

It looks like this phrase has been used a handful of times in the past month as tabloid journalists tried to explain robots.txt to a non-technical readership (possibly all quoting the same initial article). If that's all this is, I don't think it needs recording for posterity. --McGeddon (talk) 09:34, 26 November 2013 (UTC)


The article states that bingbot complies with robots.txt, but that's not always factual. Last year we blocked crawlers from downloading images to save on bandwidth. This worked for every bot except bingbot, which continued to crawl directories that have been explicitly disallowed by robots.txt. (There are no bingbot/msnbot sections overriding this).

User-agent: *
Disallow: /images/
Disallow: /image.php
Disallow: /imagesize.php
Disallow: /photogallery/

Recently rules were added to "Forbid" bot requests to these disallowed urls, and bingbot is the only bot to trigger them (at a rate of about 2-5 requests per minute). They all seem to come from legitimate MS IPs, here is one line: - - [28/Feb/2015:20:25:53 -0500] "GET /images/deals/BT_1_973646942.jpg HTTP/1.1" 403 266 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +"

I looked it up and it seems I'm not the only one who's observed bingbot's disregard for robots.txt:

What would be the best way to improve the article's accuracy? Add examples of bingbot ignoring robots.txt at this point? "Some major search engines following this standard include Ask,[7] AOL,[8] Baidu,[9] Bing,[10] Google,[11] Yahoo!,[12] and Yandex.[13]" A new section maybe? (talk) 02:02, 1 March 2015 (UTC)

First contact the authors and ask them what their bot is doing? Rp (talk) 09:11, 4 March 2015 (UTC)
I asked, and this is what they say:

We've received a response from our Product Group and we're happy to inform you that Bingbots respected the Robots.txt directives by not showing the content. The robots.txt disallow directive generally does not exclude a URL from index but blocks its content. Evidently, Bing's image crawlers does not show the blocked images of the site as configured in its robots.txt, this shows that Bingbot is following the directives.

I've asked for further clarification if there is any mechanism supported by bingbot that would stop the crawling. In regards to the article, it looks like a new section may be needed to clarify that not all bots interpret "disallow" to mean stop crawling. (talk) 16:41, 8 March 2015 (UTC)
They've finally responded:

We would like to inform you that a backend fix was done by our Product Group which should address your concern. Kindly verify this from your end.

Their fix did stop bingbot from crawling blocked URLs on our site. It's not clear to me if this was a local fix, but hopefully, as of March 27, bingbot is obeying everyone's robots.txt. If so, then the article's facts are now accurate. (talk) 14:07, 31 March 2015 (UTC)
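For anyone wanting to double-check rules like the ones posted above, Python's standard-library urllib.robotparser can be fed the lines directly. This sketch only shows how that parser reads the rules for a robot identifying as "bingbot"; it says nothing about what the real bingbot did.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted earlier in this thread.
rules = [
    "User-agent: *",
    "Disallow: /images/",
    "Disallow: /image.php",
    "Disallow: /imagesize.php",
    "Disallow: /photogallery/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of fetching a URL

# Path taken from the access-log line quoted above.
blocked = parser.can_fetch("bingbot", "/images/deals/BT_1_973646942.jpg")
# A path not covered by any Disallow rule (hypothetical example).
allowed = parser.can_fetch("bingbot", "/index.html")
```

Here blocked is False and allowed is True, i.e. the logged request was for a URL that these rules ask every robot to stay away from.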

This Article Needs a Major Disclaimer about Robots.txt not Working

Robots.txt instructs crawlers not to crawl a page, but that's not sufficient to keep content out of search results. To accomplish that (with Google, Bing and Yahoo), you must *not* use robots.txt and instead you must use the noindex HTML tag and the X-Robots-Tag HTTP headers. Since most people coming to this page are trying to understand how to block a page from appearing in search engines, I think a new section should be added that explains this in detail. Currently, this article addresses this at the very end by saying:

> Even if a robot honors robots.txt, it is still possible for the robot to find and index a disallowed URL from other places on the web. This can be prevented by using robots.txt directives in combination with robots meta tags or X-Robots-Tag headers.

That's just barely correct. If you use robots.txt in combination with robots meta tags or X-Robots-Tag headers, the result will be that your content is not crawled, the meta tags will not be seen, and the item will show up in search results.

I intend to update this article with this information because in my experience almost all web devs expect robots.txt to work, and are then shocked when it doesn't. For many search engines (including Bing, Yahoo! and Google, the three most important ones to English readers) it's a deprecated way of blocking content, and this article should reflect that.

Before I make these changes, I'm interested in any feedback people might have about this proposal. --mjlissner (talk) 18:04, 8 April 2015 (UTC)

I think it would be wrong. Specifically, your statement that you must not use robots.txt is wrong; Google, Bing and Yahoo do honor robots.txt and I can't find any evidence that it is deprecated by anyone. See e.g. Bing's explanation on how to create one. Before you make any change I think it would be good to find some sources for your statements. Rp (talk) 09:40, 9 April 2015 (UTC)
If you have a reference that explains the problem and solution, that would be great. I don't think sharing your experiences directly on this article fits the Wikipedia model; it would make more sense to write that up as a blog post or article somewhere. I would be very interested to see references about robots.txt being deprecated. Hope this helps, Npdoty (talk) 17:33, 9 April 2015 (UTC)
Sorry, deprecated isn't what I meant. It's not technically deprecated, it's more that it just doesn't work anymore and shouldn't be encouraged. If a page is listed in robots.txt, and it's linked to by any other page on the Web, it can show up in Google and Bing, even if you use the noindex HTML or HTTP tags. This is because robots.txt tells crawlers not to crawl a page, and since they do support robots.txt, that means that they can't crawl the page and thus can't see the HTML or HTTP noindex flag. As a reference, here's Google's article on this, which explains it well. The only way to make sure a page doesn't show up in Google and Bing is to not include it in robots.txt, and instead to use noindex flags. Bonus points for including the page in your sitemap.xml so crawlers are explicitly encouraged to visit the page and discover the noindex flags. This is why I say that robots.txt is deprecated -- it doesn't work for the purpose it was intended.--mjlissner (talk) 23:44, 20 April 2015 (UTC)
You *did* write a blog post about this, I think. Link to Google is good, but a link to your post also explains this. For reliability purposes the Google page is best. Are there similar pages on Bing or other search engines? Brianwc (talk) 01:28, 21 April 2015 (UTC)
Yes, Bing has a page on this topic too, including juicy quotes such as: "make sure not to disallow the URL from being crawled using robots.txt" and "you should not block the URL from being re-crawled through robots.txt". Here's my blog post, though it's not relevant until about halfway through, and the Google/Bing pages are probably more authoritative. --mjlissner (talk) 01:35, 21 April 2015 (UTC)
It occurs to me that the way to frame this is something like, "It is ironic that, due to the policies of the two leading U.S. search engines, the robots.txt protocol does not actually prevent pages from being indexed in all cases and indeed, in order to ensure a page's content is not indexed both of these search engines require that one not list a page you wish blocked in one's robots.txt file.[1,2cites]" That is, it's entirely their fault that things are like this. They could adopt a different policy that would say: "we'll retrieve your robots.txt, compare it to our index, and we'll proactively remove anything you list." Nothing stops them adopting that policy but their own desire to index more pages and decision to thwart the purposes of the robots.txt protocol. So 'deprecated' is definitely the wrong word. Thwarted by powerful incumbents is accurate. Brianwc (talk) 03:33, 21 April 2015 (UTC)
FWIW, here's Google's official explanation of why they don't use robots.txt as the sole determiner of whether to crawl.--mjlissner (talk) 06:37, 21 April 2015 (UTC)
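The interaction argued over in this thread can be summarized as a toy model. This is a deliberately simplified sketch of the behavior described above, not a description of any search engine's actual pipeline; the function name, parameters, and result labels are all hypothetical.

```python
def listing_outcome(disallowed_in_robots: bool,
                    has_noindex: bool,
                    linked_from_elsewhere: bool) -> str:
    """Toy model of whether a page can appear in search results.

    Assumes the engine honors both robots.txt and noindex, and that a
    page which is neither blocked nor marked noindex gets discovered.
    """
    if disallowed_in_robots:
        # The page is never fetched, so a noindex tag or header on it
        # is never seen; the bare URL can still be listed via links.
        if linked_from_elsewhere:
            return "URL may be listed without content"
        return "not listed"
    if has_noindex:
        # The page is fetched, the noindex flag is seen and honored.
        return "not listed"
    return "indexed"
```

In this model, combining a robots.txt Disallow with noindex gives the same result as Disallow alone, which is the point being made about the article's closing sentence.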

