Talk:Robots exclusion standard

WikiProject Computing (Rated C-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.
 
WikiProject Internet (Rated C-class, High-importance)
This article is within the scope of WikiProject Internet, a collaborative effort to improve the coverage of the internet on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.
 

Old archive

For the old archive, please see Talk:Robots.txt protocol. —Preceding unsigned comment added by Vacuum (talkcontribs) 02:29, 27 March 2004

This is a red link. Whoever moved the page messed up, or the page was moved using some form of bot/app, which just posted the message without checking for the presence of a talk page.--|333173|3|_||3 05:38, 27 June 2006 (UTC)

Renaming

Perhaps this article should be named Robots exclusion standard instead of Robots Exclusion Standard? Wmahan. 00:17, 2004 Sep 12 (UTC)

Seems like it was approved. -62.219.97.118 (talk) 23:07, 2 July 2008 (UTC)

guest: This page used to be findable under "robot exclusion standard", as my browser still remembers finding it there. I see no reason not to retain a top-level link under roBOT versus robotS so that either "robot exclusion" or "robots exclusion" will find the same page, esp. as nobody would guess to search for it under the plural (I didn't). I gripe because the old page was not found, and there was not even a "this was renamed or moved" notice. —Preceding unsigned comment added by 76.235.68.177 (talk) 01:09, 13 December 2008 (UTC)

Warning

I have fixed the Warning section, reduced its heading to level 3 (=== ... === instead of == ... ==), and removed the {{tone}} tag.--|333173|3|_||3 05:38, 27 June 2006 (UTC)

Google info removed

Google uses comments for the same purpose: <!--googleoff: index--> ... <!--googleon: index-->

A source is needed. - Ta bu shi da yu 13:53, 14 August 2006 (UTC)

NOINDEX

Can anyone confirm this? It sounds general, but I know of not a single reference anywhere. projectphp 00:13, 15 August 2006 (UTC)

AFAIK, the NOINDEX tag was introduced by Yandex, a Russian search engine; see the Yandex help page (in Russian). 212.176.39.52 12:12, 15 August 2006 (UTC)

<!--noindex--> / <!--/noindex-->[edit]

Another way to exclude a portion of a web page from indexing is used by the ASPSeek and DataparkSearch search engines: two special comments mark the beginning and the end of the region to exclude, <!--noindex--> and <!--/noindex-->; see DataparkSearch's documentation.
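
To illustrate how an indexer might honour such markers, here is a minimal Python sketch. It is not taken from ASPSeek or DataparkSearch; the function name `strip_noindex` is hypothetical, and the sketch assumes the markers appear exactly as written above and are not nested.

```python
import re

# Assumption: markers are written exactly as "<!--noindex-->" and
# "<!--/noindex-->", and regions are not nested.
NOINDEX_REGION = re.compile(r"<!--noindex-->.*?<!--/noindex-->", re.DOTALL)


def strip_noindex(html: str) -> str:
    """Return the page text with every noindex region removed."""
    return NOINDEX_REGION.sub("", html)


page = "<p>Indexed</p><!--noindex--><p>Skipped</p><!--/noindex--><p>Also indexed</p>"
print(strip_noindex(page))
```

The non-greedy `.*?` keeps two separate regions on one page from being merged into a single excluded span.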

Examples section

There's a bit of a discrepancy between the first two examples and the others; the first two talk about "robots" while the latter talk about "crawlers". Should this be fixed/changed? Aeluwas 21:14, 30 May 2007 (UTC)

When search engines talk about their robots, they tend to call them "crawlers". However, the robots.txt applies to all robots, even the ones that don't crawl (and just check sites). Accordingly, I suggest that we use "robots" as a standard term for this article unless it's in a section that is very clearly only about a search engine crawler (such as the crawl delay). Ian McAnerin (talk) 05:20, 21 November 2007 (UTC)


The first example says that it allows all robots to crawl all directories so why is Mediapartners-Google mentioned in the user-agent section?--87.80.96.31 (talk) 19:37, 30 June 2008 (UTC)

Spam / Useless Links

I just removed the following external link: *[ht tp://www.google-msn-yahoo.info/ Windows XP Update Repaire] It caught my eye when I noticed "repaire" was spelled wrong. When I followed the link, it went to one of the spammier sites I've ever seen. The top half was all about wooden flooring, and there was a little tiny note at the bottom saying that robots.txt is important. Ian McAnerin (talk) 05:05, 21 November 2007 (UTC)

Great, have a cookie M. Poirrot. —Preceding unsigned comment added by 91.125.242.254 (talk) 00:45, 16 July 2009 (UTC)

History

http://yro.slashdot.org/comments.pl?sid=377285&cid=21554125 gives the history of the robots.txt standard. However, I'm not sure if the information is particularly encyclopedic, and I'm betting a Slashdot comment isn't a reliable, verifiable source. OTOH, the people monitoring this talk page might want to chase it down. Theorbtwo 23:54, 2 December 2007 (UTC)

Dynamic Links

There's no info on dynamic links. ceo 13:21, 7 December 2007 (UTC)

www.share_ali.com —Preceding unsigned comment added by 82.38.218.169 (talk) 07:48, 21 September 2008 (UTC)

standard ?

The title is misleading. This is no standard but a protocol. hAl (talk) 15:30, 13 February 2009 (UTC)

A protocol can be a standard. But this one has no standards body, no formal procedures, and the document that describes it is rather colloquial and lacks the formal rigor that is to be expected from a spec. The criteria for an actual standard are not fulfilled. It might be called a de facto standard, but the lemma shouldn't use the word standard at all.--87.162.37.163 (talk) 02:38, 5 January 2010 (UTC)

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was no consensus. @harej 00:16, 24 August 2009 (UTC)



Robots exclusion standard → robots.txt — Some googling suggests this term may be more common. --Cybercobra (talk) 18:29, 14 August 2009 (UTC)

  • Weak oppose — I've been putting off giving my opinion on this one, but I see that it's in the backlog now. I see your point Cybercobra, but that "robots.txt" is a more common term seems to be an artifact of the fact that people are uninformed or just lazy. There is an actual standard here, and "robots.txt" is simply a filename given to an implementation of that standard, so I think that naming the article robots.txt could be unnecessarily constraining. Using a redirect from the filename to the actual standard seems to be a more appropriate setup, here.
    V = I * R (talk) 23:31, 22 August 2009 (UTC)
The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

ACAP

Does ACAP belong here? It's no kind of extension for robots.txt - it's a totally different proposal (not even a standard) —Preceding unsigned comment added by 78.86.8.122 (talk) 10:30, 16 October 2009 (UTC)

Regular expressions

The big 3 (Google, Yahoo, MSN/Live/Bing) now support wildcards (* and $) in robots.txt, for example:


disallow: /*.php$

matches any URL ending in .php —Preceding unsigned comment added by Smremde (talkcontribs) 12:14, 25 October 2009 (UTC)
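
A minimal Python sketch of how such a pattern could be evaluated, for illustration only: it assumes `*` stands for any run of characters and a trailing `$` anchors the end of the URL, as described above. The helper names `robots_pattern_to_regex` and `path_matches` are hypothetical and are not taken from any search engine's code.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without "$" the pattern is a prefix match, so nothing is appended.
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the URL path is covered by the pattern (matched from the start)."""
    return robots_pattern_to_regex(pattern).match(path) is not None


for candidate in ["/index.php", "/dir/page.php", "/index.php?x=1", "/index.html"]:
    print(candidate, path_matches("/*.php$", candidate))
```

Note that this is plain wildcard matching, not full regular-expression support: only `*` and a trailing `$` have special meaning in this sketch.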

External Links Suggestions

As per the comments in the external links code, can I propose adding a link to

Top Ten active webcrawlers ?

Mtcooper (talk) 18:20, 22 February 2010 (UTC)

I like the proposal, but that page doesn't tell us much. It may just be the top ten among the page requests made to that particular site during some unknown recent time interval, or not even that. A similar link to a well-argued, reputable quantification of relative search spider visitation frequencies would be welcome. This doesn't appear to be one. Rp (talk) 22:38, 19 March 2010 (UTC)

"Sorry, this page no longer valid"[edit]

When I tried to check the url: http://www.kpmg.com.hk/robots.txt I got this message: "Sorry, this page is either no longer valid or currently under maintenance." What does it mean? The main URL does have a valid web page. Ottawahitech (talk) 17:29, 18 March 2010 (UTC)

It means that the webmaster of KPMG's Hong Kong office forgot to create a new robots.txt when they installed a new site, so they are not providing any directions to spiders (such as Google's) on which pages to index and which pages to avoid. (This should be clear to you from reading this article. What is missing?) Rp (talk) 20:33, 18 March 2010 (UTC)
Thanks for the response, Rp. I am wondering if this means it is OK to spider this site? Ottawahitech (talk) 18:25, 19 March 2010 (UTC)
I don't think so - it only means we don't know. I think legally spidering is OK but some webmasters get very upset when you do it so if it's a big deal for you, ask. Rp (talk) 20:48, 19 March 2010 (UTC)
It means that no URL at their site is restricted. The original robots.txt specification says that when the file is missing (404 response), everything at the site may be fetched without limitation. Any other 4xx or 5xx response means that everything is forbidden. The draft RFC from 1997, section 3.1, makes this clear (404 -> everything is fair game). 71.106.210.230 (talk) 06:59, 28 July 2010 (UTC)
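
As a rough sketch of the reading given in the reply above (and not a statement of what any particular crawler actually does), the decision could be written like this in Python. The function name and the returned labels are made up for illustration; treating 2xx as "parse the file" is an assumption, and status codes outside 2xx, 4xx and 5xx are deliberately left undefined here.

```python
def robots_policy_for_status(status: int) -> str:
    """Map the HTTP status of a /robots.txt request to a crawl policy.

    Follows the summary in the comment above: 404 means no restrictions,
    any other 4xx or 5xx means everything is forbidden.
    """
    if 200 <= status < 300:
        return "parse"         # assumption: file retrieved, apply its rules
    if status == 404:
        return "allow_all"     # no robots.txt: nothing is restricted
    if 400 <= status < 600:
        return "disallow_all"  # other 4xx/5xx, per the comment above
    return "undefined"         # e.g. redirects; not covered by this sketch


for code in (200, 404, 403, 503):
    print(code, robots_policy_for_status(code))
```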

Should "Agent" in "User-agent" be capitalized?

Granted, the robots exclusion protocol has not historically capitalized "agent" in any of its specifications or examples. However, as it's an informal agreement, not a standard or a proposal (including an RFC), such is not controlling. RFC 2616 does capitalize "agent", and it has been accepted into the official HTTP standard. Therefore, as the use of "User-agent" in the robots context refers to the same header name and data as in the HTTP protocol ("User-Agent" at Section 14.43 et al.), shouldn't "agent" be capitalized in the robots context also? 71.106.210.230 (talk) 06:36, 13 July 2010 (UTC)

I can't find an example on robotstxt.org that does capitalize it, so no. <sarcasm>Inconsistent standards, yay!</sarcasm> --Cybercobra (talk) 10:01, 13 July 2010 (UTC)
Further:

The exact mixed-case directives may be required, so be sure to capitalize Allow: and Disallow: , and remember the hyphen in User-agent:

http://www.searchtools.com/robots/robots-txt.html

(emphasis mine) --Cybercobra (talk) 10:04, 13 July 2010 (UTC)

Looking back at the original 1994 robots.txt definition, it says that field names are case insensitive, so the "A" may be capitalized or not.[1] Additionally, the field data (for User-Agent info) should also be interpreted by robots as case insensitive for matching purposes. The draft RFC in 1997 repeats the case insensitivity for field names, even though it shows a lowercase "a" for this field's name in the ABNF syntax.[2] 71.106.210.230 (talk) 06:32, 28 July 2010 (UTC)

I can't find anything in the 1997 RFC to support the field names being case-insensitive. Regarding case, it says only "The name comparisons are case-insensitive.", but this is clearly referring to the names of the robots themselves. Could you specify a section number or quote from the RFC supporting your position? --Cybercobra (talk) 07:12, 28 July 2010 (UTC)

Both appear to be acceptable in robots.txt, but "User-agent" seems more commonplace and outweighs the RFC anyway. I would go with that. --Hm2k (talk) 08:34, 28 July 2010 (UTC)
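
For what it's worth, a parser that takes the lenient reading (field names compared without regard to case) makes the capitalization question moot. The Python sketch below only illustrates that reading, which is disputed in this thread; the function name `parse_field` is hypothetical, and stripping `#` comments is an assumption about the file format.

```python
from typing import Optional, Tuple


def parse_field(line: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt line into (lower-cased field name, value).

    Returns None for blank lines, comment-only lines, and lines without a colon.
    """
    line = line.split("#", 1)[0].strip()  # assumption: "#" starts a comment
    if ":" not in line:
        return None
    name, value = line.split(":", 1)
    # Lower-casing the name makes "User-agent" and "User-Agent" equivalent.
    return name.strip().lower(), value.strip()


print(parse_field("User-Agent: ExampleBot"))
print(parse_field("user-agent: ExampleBot"))
```

Both calls yield the same tuple, so a robot written this way would accept either spelling.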

Glitched up Wayback Machine Beta in Night?

Hi, when you try to access the Wayback Machine in the Beta version, you get this error message: "robots.txt has blocked this content from being crawled." Is there a way it could be fixed at night, Korean time? 121.164.146.185 (talk) 16:31, 11 November 2010 (UTC)

Chinese robots.txt?

From [ http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2011-08-15/Technology_report ]:

"A question raised at Wikimania – why the Chinese Wikipedia was getting so much more traffic than it used to – turned out to have a technical answer. The robots.txt file for the Chinese Wikipedia was written in both traditional and simplified Chinese, causing problems for bots from search engines and the like, a Chinese Wikimedian explained ( http://ultimategerardm.blogspot.com/2011/08/why-chinese-wikipedia-is-doing-so-well.html )."

I have no idea why someone would want to make a file that is not meant to be read by humans have multiple languages, but I don't see anything in the Robots exclusion standard that covers this. Does anyone have more information? Guy Macon (talk) 01:17, 17 August 2011 (UTC)

Robot (and/)or Crawler?

In section Examples the words robots and crawlers are intermixed. It may be confusing, but it may also be educational. I'm not sure if it needs fixing, and if so, what word to choose. David A se (talk) 17:21, 3 March 2012 (UTC)

Good catch. I just made it all "robot". "Crawler" is incorrect; "crawler" is a subset of "robot", and robots.txt makes requests of all robots, not just those robots that are also web crawlers. I also changed a few places where the page said robots.txt allows robots to do something with the more correct robots.txt telling the robot what to do; robots.txt doesn't actually allow or disallow anything but rather makes requests which robots are free to ignore. --Guy Macon (talk) 19:21, 3 March 2012 (UTC)
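
The point that robots.txt only makes requests can be seen in how a cooperating robot uses it: the check is something the robot chooses to run before fetching. A small sketch with Python's standard `urllib.robotparser` follows; the rules, the bot name `ExampleBot`, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string so no network access is needed.
RULES = """\
User-agent: *
Disallow: /private/
"""


def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    """Ask the parsed rules whether a well-behaved robot should fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


print(polite_fetch_allowed("ExampleBot", "https://example.com/private/page.html"))
print(polite_fetch_allowed("ExampleBot", "https://example.com/index.html"))
```

Nothing stops a robot from skipping this check entirely, which is why the article wording about "telling" rather than "allowing" is the more accurate one.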

Robot blocker

Robot blocker can be cited [3]. --Trivanderumtequila (talk) 05:06, 26 November 2013 (UTC)

It looks like this phrase has been used a handful of times in the past month as tabloid journalists tried to explain robots.txt to a non-technical readership (possibly all quoting the same initial article). If that's all this is, I don't think it needs recording for posterity. --McGeddon (talk) 09:34, 26 November 2013 (UTC)