Wikipedia:Bots/Requests for approval/Navi-bot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Navi-darwin (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 22:02, Monday, October 17, 2016 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Go
Source code available: https://gist.github.com/DarwinLYang/6a8e1067fdf22a20d9a48bd9c62fe33c
Function overview: Reads the Wikipedia articles recorded in Navi's database and imports all of the links on those pages into the Navi servers
Links to relevant discussions (where appropriate):
Edit period(s): Occasionally, as needed. After the initial run, it may be run only about once a week, against newly updated Wikipedia pages.
Estimated number of pages affected: 0
Exclusion compliant (Yes/No): No; the bot does not edit pages, it only gathers links from them
Already has a bot flag (Yes/No):
Function details: The Navi bot will scan existing Wikipedia pages, gathering the links on each page as well as its HTML source. The purpose is to build a comprehensive knowledge graph on the Navi servers for finding related pages. The bot will first run once over all Wikipedia pages to create an initial set of links, and will then run about once a week to pick up new links from updated pages.
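For reference, the following is a minimal sketch in Go of the link-gathering step described above. It is illustrative only, not the submitted source: the helper name, example titles, and User-Agent string are placeholders. It asks the standard Action API for prop=links over a batch of article titles and follows the continue parameters until the batch is exhausted; with apihighlimits (which the bot flag grants), up to 500 titles can be combined per query and pllimit can be raised to 5000. A companion sketch of the weekly recentchanges pass appears after the discussion below.

```go
// links_sketch.go — standalone, illustrative sketch (not the operator's code).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

const endpoint = "https://en.wikipedia.org/w/api.php"

// linksResponse mirrors the parts of the action=query&prop=links reply we use
// (format=json, formatversion=2).
type linksResponse struct {
	Continue map[string]string `json:"continue"`
	Query    struct {
		Pages []struct {
			Title string `json:"title"`
			Links []struct {
				Title string `json:"title"`
			} `json:"links"`
		} `json:"pages"`
	} `json:"query"`
}

// fetchLinks returns a map from each requested title to the article-space
// titles it links to, following the API's continuation until complete.
func fetchLinks(client *http.Client, titles []string) (map[string][]string, error) {
	result := make(map[string][]string)
	cont := map[string]string{"continue": ""} // signals we understand continuation

	for {
		params := url.Values{}
		params.Set("action", "query")
		params.Set("format", "json")
		params.Set("formatversion", "2")
		params.Set("prop", "links")
		params.Set("plnamespace", "0") // article-space links only
		params.Set("pllimit", "max")   // 500, or 5000 with apihighlimits
		params.Set("titles", strings.Join(titles, "|"))
		for k, v := range cont {
			params.Set(k, v)
		}

		req, err := http.NewRequest("GET", endpoint+"?"+params.Encode(), nil)
		if err != nil {
			return nil, err
		}
		// Descriptive User-Agent per the Wikimedia User-Agent policy (placeholder contact).
		req.Header.Set("User-Agent", "NaviBot-sketch/0.1 (contact: example@example.org)")

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		var body linksResponse
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}

		for _, p := range body.Query.Pages {
			for _, l := range p.Links {
				result[p.Title] = append(result[p.Title], l.Title)
			}
		}

		if len(body.Continue) == 0 { // no continue block: the batch is complete
			return result, nil
		}
		cont = body.Continue
	}
}

func main() {
	links, err := fetchLinks(http.DefaultClient, []string{"Go (programming language)", "Wikipedia"})
	if err != nil {
		panic(err)
	}
	for page, out := range links {
		fmt.Printf("%s -> %d links\n", page, len(out))
	}
}
```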
Discussion
- Will this bot need to work on any project besides the English Wikipedia? If so the apihighlimits-requestor group may be more appropriate. — xaosflux Talk 00:18, 18 October 2016 (UTC)
- This bot will only be working in the English Wikipedia. The main purpose of using a bot instead of an unregistered user was to be able to use apihighlimits. Would I still need a bot request if that is the case? — Navi-darwin Talk 13:48, 18 October 2016 (UTC)
- If you just need high speed api, especially if it is for multiple projects, you would request that over on meta-wiki. — xaosflux Talk 13:55, 18 October 2016 (UTC)
- Yes, it's mostly to be able to access more pages and links per api call. Can you point me in the direction of where I should be making the request? I thought it would be a bot request from looking at this table. — Navi-darwin (talk) 14:07, 18 October 2016 (UTC)
- If this is only for the English Wikipedia then this is the right place; if it is for many wikis then it would be meta:Steward requests/Global permissions. — xaosflux Talk 18:03, 18 October 2016 (UTC)
- Yes, this will only be for the English Wikipedia. — Navi-darwin (talk) 20:23, 18 October 2016 (UTC)
- Have you evaluated gathering your information from database dumps? — xaosflux Talk 00:19, 18 October 2016 (UTC)
- I have considered using database dumps, but since I am only looking for a small subset of information for each page, it was concluded that it would be easier to just go through the API. — Navi-darwin Talk 13:48, 18 October 2016 (UTC)
- What is your read rate expected to be (reads per hour, per day, per week)? — xaosflux Talk 18:03, 18 October 2016 (UTC)
- In the beginning, since we will be trying to create an initial database, we will try to do as many reads as quickly as possible. An optimistic guess would be 300,000 reads an hour for about 3 days, but it will probably be slower than that. We are trying to get through about 24,000,000 pages in total, so the total time will depend on how many reads we can do per hour. After the initial read, we may decide to read once a week over any pages with updates to get any new links, but that is still to be determined. — Navi-darwin (talk) 20:23, 18 October 2016 (UTC)
- That sounds like you're planning to make a lot of requests very quickly. Please review and follow mw:API:Etiquette, particularly #Request limit, if for some reason you can't parse a dump instead. Note there are dumps of the pagelinks and pages SQL tables, which may be just the information you need. Anomie⚔ 00:33, 19 October 2016 (UTC)
- Yup, I have read the API Etiquette page. We will only be making requests in series, and we will group as many titles together as possible per request, which is why I wanted to use a bot: it increases the number of titles I can group together. We have considered the dump, but we believe it will be easier and faster to query the database through the API, as we need to match our own data with the Wikipedia pages. Navi-darwin (talk) 13:31, 19 October 2016 (UTC)
- Are you really thinking of making ~80 API calls a second? That is about 2% of the total API load, including all internal API calls. If you actually plan on serializing your API calls, then API latency will not let you get anywhere near that speed. —RP88 (talk) 15:06, 21 October 2016 (UTC)
- I think they're hoping for averaging 80-some pages per second, by querying multiple pages in one request. Anomie⚔ 20:00, 21 October 2016 (UTC)
- Yes, we are. Sorry for the confusion. We made this estimate based on the fact that we can request 5000 pages at a time with a bot account, the limitation being that we need to use the continue parameters, which slows us down. Navi-darwin (talk) 13:29, 24 October 2016 (UTC)
- Looking at the source you've provided it doesn't look like the bot complies with the User-Agent policy. It also looks like you don't support maxlag. Bots that don't support maxlag should limit their requests to 10 per minute. You might consider using one of the existing API access libraries rather than creating your own. —RP88 (talk) 15:06, 21 October 2016 (UTC)
- Sorry, I missed the User-Agent policy and I will add that in. I was looking at maxlag and I thought it was only necessary when making edits on Wikipedia, but I can add that as well if it is necessary. The reason I wanted to write my own bot is that these API access libraries generally only return full pages rather than a list of all the links on each page, which is what I want. Navi-darwin (talk) 13:43, 24 October 2016 (UTC)
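To illustrate the two points above, here is a minimal, standalone sketch in Go of a compliant request helper; it is an assumption-laden example rather than the bot's submitted code (the helper name, User-Agent string, and fallback wait are placeholders). Every request carries a descriptive User-Agent, passes maxlag=5, is issued serially, and backs off for the Retry-After interval when the API returns a maxlag error, as mw:Manual:Maxlag parameter recommends.

```go
// etiquette_sketch.go — standalone, illustrative sketch (not the operator's code).
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

const (
	endpoint  = "https://en.wikipedia.org/w/api.php"
	userAgent = "NaviBot-sketch/0.1 (contact: example@example.org)" // placeholder contact
)

// apiError mirrors the error envelope the Action API returns, e.g.
// {"error":{"code":"maxlag","info":"Waiting for ...: 6 seconds lagged"}}.
type apiError struct {
	Error struct {
		Code string `json:"code"`
		Info string `json:"info"`
	} `json:"error"`
}

// politeGet performs one API request at a time, retrying when the reply is a
// maxlag error. It returns the raw JSON body for the caller to decode.
func politeGet(client *http.Client, params url.Values) ([]byte, error) {
	params.Set("maxlag", "5") // back off whenever replication lag exceeds 5 s

	for {
		req, err := http.NewRequest("GET", endpoint+"?"+params.Encode(), nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("User-Agent", userAgent)

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(resp.Body)
		retryAfter := resp.Header.Get("Retry-After")
		resp.Body.Close()
		if err != nil {
			return nil, err
		}

		var e apiError
		if json.Unmarshal(body, &e) == nil && e.Error.Code == "maxlag" {
			wait := 5 * time.Second // fallback if no Retry-After header is present
			if secs, err := strconv.Atoi(retryAfter); err == nil && secs > 0 {
				wait = time.Duration(secs) * time.Second
			}
			fmt.Printf("servers lagged (%s); sleeping %v before retrying\n", e.Error.Info, wait)
			time.Sleep(wait)
			continue
		}
		return body, nil
	}
}

func main() {
	params := url.Values{}
	params.Set("action", "query")
	params.Set("format", "json")
	params.Set("prop", "links")
	params.Set("titles", "Wikipedia")
	body, err := politeGet(http.DefaultClient, params)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes received")
}
```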
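The weekly incremental pass mentioned in the function details and in the read-rate answer above could be driven by list=recentchanges. The sketch below is again illustrative only (the helper name, seven-day window, and User-Agent are placeholders): it collects the distinct article titles edited or created in the last week, which could then be fed to the link-gathering step.

```go
// weekly_sketch.go — standalone, illustrative sketch (not the operator's code).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

const endpoint = "https://en.wikipedia.org/w/api.php"

type rcResponse struct {
	Continue map[string]string `json:"continue"`
	Query    struct {
		RecentChanges []struct {
			Title string `json:"title"`
		} `json:"recentchanges"`
	} `json:"query"`
}

// changedTitles returns the distinct article titles edited or created since `since`.
func changedTitles(client *http.Client, since time.Time) ([]string, error) {
	seen := make(map[string]bool)
	cont := map[string]string{"continue": ""}

	for {
		params := url.Values{}
		params.Set("action", "query")
		params.Set("format", "json")
		params.Set("formatversion", "2")
		params.Set("list", "recentchanges")
		params.Set("rcnamespace", "0")   // articles only
		params.Set("rctype", "edit|new") // skip log entries
		params.Set("rcprop", "title")
		params.Set("rclimit", "max") // 500, or 5000 with apihighlimits
		// With the default rcdir=older, rcend is the older bound of the window.
		params.Set("rcend", since.UTC().Format(time.RFC3339))
		for k, v := range cont {
			params.Set(k, v)
		}

		req, err := http.NewRequest("GET", endpoint+"?"+params.Encode(), nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("User-Agent", "NaviBot-sketch/0.1 (contact: example@example.org)")

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		var body rcResponse
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}

		for _, rc := range body.Query.RecentChanges {
			seen[rc.Title] = true
		}
		if len(body.Continue) == 0 {
			break
		}
		cont = body.Continue
	}

	titles := make([]string, 0, len(seen))
	for t := range seen {
		titles = append(titles, t)
	}
	return titles, nil
}

func main() {
	titles, err := changedTitles(http.DefaultClient, time.Now().AddDate(0, 0, -7))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(titles), "pages changed in the last week")
}
```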
- Withdrawn by operator. — Preceding unsigned comment added by Navi-darwin (talk • contribs)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.