Wikipedia talk:Controlling search engine indexing

Wikipedia Help Project
This page is within the scope of the Wikipedia Help Project, a collaborative effort to improve Wikipedia's help documentation for readers and contributors. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. To browse help-related resources see the Help Menu or Help Directory. Or ask for help on your talk page and a volunteer will visit you there.
This page does not require a rating on the project's quality scale.
This page has not yet received a rating on the project's importance scale.

noindex on new, unpatrolled pages

I was asked in the help channel about a page being noindexed, and it turns out (according to Phabricator) that new, unpatrolled pages are noindexed by default. It might be worth putting on this page, but at the very least, I wanted to put it on the talk page so people might be able to find it. --MarkTraceur (talk) 14:02, 25 October 2016 (UTC)

Possible updates needed

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

  • (copied from User_talk:Xaosflux)
NOINDEX in mainspace

I was reading through the latest NPP RfC and related threads and noticed something I'm apparently confused about. I figure you'd probably know :) Much has been made of the possible dangers of "bad" new articles getting indexed by search engines. Speedy tags apply {{NOINDEX}}, but is it actually the case that new articles are not indexed until patrolled? I don't believe that was true last year (last time I had a reason to notice), but I may be behind. WP:NOINDEX seems to be out of date. I tried to test it by creating an article with my sock, but was lucky enough that it was patrolled quickly. (Yet the article still isn't indexed on google, and others I created today with my main account are... unfortunately I don't have enough articles stored up for any more sampling :) Opabinia regalis (talk) 06:59, 23 November 2016 (UTC)

@Opabinia regalis: (no)indexing isn't always an exact science. Since October 2016 (cf. phab:T147544) new articles get an HTML meta tag applied, <meta name="robots" content="noindex,nofollow"/>, until they are patrolled (or until 90 days go by; I haven't tested that). This tag is also applied if the article contains a deletion tag such as {{speedy}}. Note that this is similar, but not identical, to the behavior of __NOINDEX__. There is a configuration parameter, $wgExemptFromUserRobotsControl, that prevents the INDEX and NOINDEX magic words from overriding the namespace default (which for Namespace:0 (Article) is set to INDEX). This is to prevent vandals from NOINDEX'ing random pages out of search. The noindex meta tag is merely a request to web crawlers - Google generally honors these - but some search engines may not. Finally, being available for indexing doesn't require or "push" a notice to all of the search providers of the world - it is up to them to fetch and index a page - sometimes this is fast, sometimes it takes a long time. Hope this helps. — xaosflux Talk 14:06, 23 November 2016 (UTC)
Yeah, I know it's not a push notification that pops up immediately; it's just very noticeable that I created three articles yesterday on very similar topics, and the two created by this account were indexed immediately, but the one that needed "patrolling" is still absent. Hardly statistically significant, but I hadn't given it much thought because most of my articles are on very obscure topics and they usually pop up in google searches for the title near-instantly. I suppose we'll have to at least update the boilerplate for autopatrolled - the conventional wisdom is that the user right doesn't benefit the holder, but does benefit others by saving them some work; that's clearly no longer true if we assume that people create new articles because they want others to find and read them.
Anyway, thanks for the phab link, that's what I was looking for. Opabinia regalis (talk) 20:34, 23 November 2016 (UTC)
Opabinia regalis, I can't find the documentation - but I hear that Google does follow our new pages feed - but that non-autopatrolled page (autopatrol is included in your sysop group) would have been skipped - so now it would have to wait to get spider-indexed. I'm assuming you are referring to New Jersey polyomavirus. I pulled the source on it, and it is not (now) flagged for noindex. I made a minor edit on it, which may help kickstart indexing on an external site. Please note, none of this behavior has changed due to removing the patrol behavior from autoconfirmed users - non-autopatrolled editors would still have needed someone else to mark their page as patrolled. This indexing behavior is likely different due to the October software update. — xaosflux Talk 20:49, 23 November 2016 (UTC)
Opabinia regalis FYI - I used google webmaster tools to submit a request to crawl that page now - and now it is the #2 search result: google-result-here. — xaosflux Talk 20:53, 23 November 2016 (UTC)
P.S. Google has massive caches - I got it to show me that result, but when reloading it's not up - their index will take a little time to replicate. — xaosflux Talk 20:57, 23 November 2016 (UTC)
Ahhh, if google is directly following the new pages feed then that would make sense. Thanks, I see it now! (The other two articles are MW polyomavirus and STL polyomavirus, which were autopatrolled and indexed right away.) Opabinia regalis (talk) 21:26, 23 November 2016 (UTC)
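For anyone who wants to repeat the "pulled the source" check described in this thread, here is a minimal sketch in Python. It only looks for a robots meta tag containing noindex in HTML you already have in hand. The names RobotsMetaParser and is_noindexed are made up for illustration and are not part of MediaWiki or any crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def is_noindexed(html: str) -> bool:
    """Return True if the page source asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


sample = '<html><head><meta name="robots" content="noindex,nofollow"/></head></html>'
print(is_noindexed(sample))
```

This only reports what the page requests. Whether a given search engine honors the request, and how quickly its index catches up, is outside what the page source can tell you.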

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

DuckDuckGo Zero-click box showing content of no-index pages

FYI: [1]. Tigraan Click here to contact me 17:50, 4 January 2017 (UTC)

Indexing in user space

I have raised a question here about permitting particular pages in user space to be indexed: Noyster (talk), 20:50, 28 February 2017 (UTC)

On userpages

Suppose we were to put the following in https://en.wikipedia.org/robots.txt:

Disallow: /wiki/User:

Would that serve as a deterrent to people putting their resumes, etc., in userspace? St. claires fire (talk) 22:29, 12 April 2017 (UTC)

Would this mean that no page in user space could be indexed? If so, the discussion here has made it abundantly clear that many established editors value the facility to invoke indexing of their user pages and would not accept its removal: Noyster (talk), 08:53, 13 April 2017 (UTC)
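
To make the effect of the proposed rule concrete, here is a rough sketch using Python's standard urllib.robotparser. It assumes the rule would sit under a "User-agent: *" group (the proposal does not say), and the bot name and page titles are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the proposed Disallow line is placed under a catch-all user agent.
PROPOSED_RULES = """\
User-agent: *
Disallow: /wiki/User:
"""


def crawl_allowed(url: str, rules: str = PROPOSED_RULES, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "https://en.wikipedia.org/wiki/User:Example",
    "https://en.wikipedia.org/wiki/User_talk:Example",
    "https://en.wikipedia.org/wiki/Main_Page",
):
    print(url, crawl_allowed(url))
```

Under these assumptions the rule is a plain prefix match: every /wiki/User: page is off limits to a compliant crawler, while /wiki/User_talk: pages and articles are unaffected. Since such a crawler would never fetch the page, it would presumably never see any per-page indexing setting either.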

Automatic article indexing

What was the reasoning behind the 30-day indexing? Is it a holdover from before NPP, as a way to stop new junk being indexed? I would think indexing when an article is patrolled should be sufficient now.

The people who care most about indexing are those who are NOTHERE and creating articles for promotional purposes. Their articles should not be indexed at all. If their stuff manages to fall through the cracks and not get seen at creation, it would in effect be "accepted" automatically after sitting around for a month, which isn't good. With the huge backlog of pages to patrol that we have now, an article not being looked at for 30 days isn't unlikely, and basically the effect is that NPP is made ineffective by there being a big backlog. Apart from promotion, other poor articles will also show up on search engines just by hanging around.

So... Do we really need that 30-day thing? If not, how can we get rid of it?

Yeryry (talk) 16:08, 13 April 2017 (UTC)
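
For reference, the rule being questioned can be written down as a small sketch. This is not MediaWiki code. The function name should_noindex and its parameters are hypothetical, the day limit is left as a parameter because this page mentions both 90 and 30 days, and deletion tags are ignored for simplicity.

```python
from datetime import datetime, timedelta, timezone


def should_noindex(created: datetime, patrolled: bool, now: datetime, max_age_days: int) -> bool:
    """Hypothetical model: a new page stays noindexed until it is patrolled
    or until max_age_days have passed since creation, whichever comes first."""
    if patrolled:
        return False
    return now - created < timedelta(days=max_age_days)


created = datetime(2017, 3, 1, tzinfo=timezone.utc)
later = created + timedelta(days=45)

# An unpatrolled page, 45 days old, under a 30-day and a 90-day limit.
print(should_noindex(created, False, later, 30))
print(should_noindex(created, False, later, 90))
```

Dropping the time limit, as suggested above, would amount to returning "not patrolled" regardless of the page's age.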