Wikipedia talk:Version 1.0 Editorial Team/Article selection/Archive2


Next phase of testing

I have moved the page from MartinBotII to SelectionBot, and revamped the main page. Part of the revamping has been to write an algorithm for importance (for option B) based on four separate parameters - correction factors have been removed for now - we really just need to get a rough test going first, I think. I've also added in a table of sample articles, so we can see how these fare with the different methods - please expand this list to cover other topics. Walkerma 04:19, 30 October 2007 (UTC)

I have a working demo for the multiplicative rating system [1]. In order to make a demo for the additive system, I will need to know the various parameters for the included projects: Math, Chemistry, Medicine, Physics. I think we should add some humanities subjects to the demo, to have a broader sample. Also, there is a "vital articles" system that might allow us to give those articles a boost in their score - has that been considered?
Testing the additive rating system will have to be delayed until I get a toolserver account. I applied several weeks ago but haven't heard back yet. — Carl (CBM · talk) 13:01, 30 October 2007 (UTC)
I also registered User:SelectionBot to use for this project. — Carl (CBM · talk) 13:11, 30 October 2007 (UTC)
This looks like a good idea. Certainly, some of the major projects can do importance assessments, at least for the Top and High levels, fairly easily. Maybe we could get experts on the various articles in a subject (like, say, religion or video games) to come up with a collaborative list of what they collectively think are the most important articles in their field? John Carter 23:03, 30 October 2007 (UTC)
What I'd like to do is to get a working bot, and get it to give us what we think is a decent selection. Then we should consult the WikiProjects in those areas and ask, "How does this list look?" In many cases, though, the WikiProject's own importance ranking will be a significant part of the judgement. The main reasons for not relying on that alone are (a) that some projects don't assess for importance and (b) that for it to work we would also have to rank the projects themselves by importance, which isn't trivial. Walkerma 04:20, 4 November 2007 (UTC)
I am hoping that my toolserver account comes through soon (it should be this week). Once that is in place, I will start investigating how to do things like count backlinks and interwikis in a sufficiently efficient manner. — Carl (CBM · talk) 13:52, 4 November 2007 (UTC)
Great! You just made my week! Please keep us posted. Thanks, Walkerma 02:39, 5 November 2007 (UTC)

More progress in data collection

I have made more progress in gathering data; as can be seen at User:VeblenBot/version1.0/Demotable, I now have means to collect data on interwiki links, hit counts, and backlink counts for articles, as well as project ratings. The real question is what to do with the data. I don't expect to have a chance to look at it again for a week (at least), but I think some planning is needed before more useful code can be written. — Carl (CBM · talk) 19:12, 20 November 2007 (UTC)
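
For the curious, counts like these can be read straight off the toolserver's database replicas. A minimal sketch in Python, assuming the standard MediaWiki page, pagelinks, and langlinks tables - the connection details and function names are illustrative, not VeblenBot's actual code:

    import os
    import MySQLdb  # the toolserver provides MySQL replicas such as enwiki_p

    conn = MySQLdb.connect(db="enwiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    cur = conn.cursor()

    def linksin_count(title):
        # Count mainspace pages linking to the article (backlinks).
        cur.execute("""SELECT COUNT(*) FROM pagelinks
                       JOIN page ON page_id = pl_from
                       WHERE pl_namespace = 0 AND pl_title = %s
                         AND page_namespace = 0""",
                    (title.replace(" ", "_"),))
        return cur.fetchone()[0]

    def interwiki_count(title):
        # Count interlanguage links recorded for the article.
        cur.execute("""SELECT COUNT(*) FROM langlinks
                       JOIN page ON page_id = ll_from
                       WHERE page_namespace = 0 AND page_title = %s""",
                    (title.replace(" ", "_"),))
        return cur.fetchone()[0]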

Is there an explanation somewhere about the meanings of the columns? What are the hit count and the additive score? --Itub (talk) 11:02, 21 November 2007 (UTC)
Take a look over the article space associated with this page, where the additive score is described; the hit count is how many views the page has in a given time. Sorry it's a bit heavy going! Walkerma (talk) 16:04, 21 November 2007 (UTC)

Preliminary output from SelectionBot

The SelectionBot importance tests have been carried out (thanks, CBM!) and the raw data are listed below. (See the article page for a full explanation.)

I'm pretty sure that the hit-ranking data are VERY rough, and indeed some articles are listed as having zero hits - an anomaly of the measurement system. We hope to start using better data soon from here. The lists below indicate that iron(III) chloride received 168,000 hits in a year compared to aluminium chloride getting only 6,000, yet this and this suggest that these articles get a pretty similar number of hits. Of course, there may have been a major news story distorting the numbers, and certainly Raney nickel got a big boost from being on the main page. I believe that overall these hit counts are reading a little low in many cases, but we'll have to wait and see the new data to confirm that.

Parameters used

Quality
  • WikiProject's quality assessment (FA=500, A/GA=400, B=300, Start=150)
Importance

Based on four parameters, combined to give a single importance score I.

  • WikiProject's importance assessment (P1 uses Top=400, High=300, Mid=200, Low=100)
  • No. of hits, mainly using the original hit data from our dump (but a few are from Henrik's site). I believe it excludes redirects, but hopefully we can improve on that in time.
  • No. of mainspace links-in to the page.
  • No. of interwikis (IW) (other language versions of the articles).

Formulae tested

We have been trying to weight the various factors by just the right amount. Note that a formula that may work well in one subject area may not work well in another. The goal was to make 1000 points the threshold; this can (and will) be modified, however, depending on the size of the release. What matters is that the articles should appear in decreasing order of importance-quality, with importance carrying more weight than quality.

The formulae we are actively looking at right now include:

Score3 = quality + WikiProjectImportanceRating + hits/1640 + linksin (max of 400) + 2*(interwikis)

Score5 = quality + WikiProjectImportanceRating + 50*log_{10}(hits) + 200*log_{10}(linksin) + 200*log_{10}(interwikis)

Score7 = quality + WikiProjectImportanceRating + 50*log_{10}(hits) + 100*log_{10}(linksin) + 250*log_{10}(interwikis)
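
A minimal sketch, in Python, of how these formulas could be computed from the point scales listed above - illustrative names only, not the actual SelectionBot code; it takes log_{10} of a zero count as 0 so that articles with no interwikis still get a score:

    import math

    # Point values from the quality and importance scales above.
    QUALITY_POINTS = {"FA": 500, "A": 400, "GA": 400, "B": 300, "Start": 150}
    IMPORTANCE_POINTS = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}

    def safe_log10(n):
        # Tolerate zero counts (e.g. an article with no interwikis yet).
        return math.log10(n) if n > 0 else 0.0

    def score5(quality, importance, hits, linksin, interwikis):
        return (QUALITY_POINTS[quality] + IMPORTANCE_POINTS[importance]
                + 50 * safe_log10(hits)
                + 200 * safe_log10(linksin)
                + 200 * safe_log10(interwikis))

    def score7(quality, importance, hits, linksin, interwikis):
        return (QUALITY_POINTS[quality] + IMPORTANCE_POINTS[importance]
                + 50 * safe_log10(hits)
                + 100 * safe_log10(linksin)
                + 250 * safe_log10(interwikis))

Selection is then just a comparison against the release cutoff, e.g. keeping an article when score5(...) >= 1000, with the cutoff adjusted to the size of the release.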

Example spreadsheets of SelectionBot data

These are given in OpenOffice Calc format. I'm having trouble uploading the Excel versions, but I'll keep trying.

Selection of around 40 projects - unmodified.

Numeric fields are pagelinks, interwikis, and hitcount estimate, in that order.

American Animation
Australia
Chemicals (original hit data)
Chemicals (Henrik's hit data)

Using Henrik's hit data for Jan 2008

French Military History
Mathematics

Note: Any reference to Score6 in these spreadsheets is in fact equivalent to Score5. Score9 uses the more extreme weighting (with less weight on hits) suggested by CRGreathouse in the comments below.

Psychology
Sociology

Comments

  • I like weighting #3, but this is still a work in progress. Walkerma (talk) 17:47, 23 January 2008 (UTC)
    • It seems mostly correct, but there are many articles that I have never heard of before. You need to exclude low-importance articles. Eyu100(t|fr|Version 1.0 Editorial Team) 17:09, 28 January 2008 (UTC)
  • Can you be more specific? Which low-importance articles are getting a high score, and why? How can we adjust the algorithm to avoid these? Note that the collection of articles used for testing includes a few mid-importance articles, mostly FAs, as long as they had a multiplicative score of 27 or greater. Walkerma (talk) 06:29, 2 February 2008 (UTC)
I played with the chemicals spreadsheet a bit and found a scoring function that I like: assessment + project rating + 50*log(hits) + 200*log(links) + 200*log(interwiki), where log is the base-10 logarithm. With this function, the range for the articles in the spreadsheet goes from about 1200 to 2100; I would cut off somewhere around 1600, which selects 31 out of the 45 articles. My reasoning for this function is that: 1) IMO, the most important variables are the number of links and interwikis (just ranking by either of them alone already produces a pretty reasonable list. I might even suggest a weight greater than 200). 2) the range of values for these two variables, and for the number of hits, spans several orders of magnitude (from a few to many thousands), so the values need to be "compressed" to avoid large numbers dominating the scoring function too much. The approach I saw in the spreadsheets above was to truncate the number of links at 400; I believe that a fairer approach is to use a logarithm, which has the effect of "compressing" the large values without truncating them. This is especially important for the number of hits, which is often very large and which I don't trust that much for deciding which articles are most essential. --Itub (talk) 13:17, 31 January 2008 (UTC)
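
A quick illustration of the "compression" Itub describes, with hypothetical links-in counts (a sketch, not part of the bot):

    import math

    # Hypothetical links-in counts spanning several orders of magnitude.
    for links in (40, 400, 4000, 40000):
        truncated = min(links, 400)           # earlier approach: cap at 400
        compressed = 200 * math.log10(links)  # Itub's log weighting
        print(links, truncated, round(compressed))
    # 40     ->  40, 320
    # 400    -> 400, 520
    # 4000   -> 400, 720
    # 40000  -> 400, 920

Past the cap the truncated score goes flat, while the log keeps distinguishing heavily linked articles without letting them dominate the total.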
I like this a lot, and I've done a couple of tables with it for Chemicals (Score5 and Score6), shown above. Thanks a lot! Walkerma (talk) 06:32, 2 February 2008 (UTC)
I like the log_{10}(linksin) and log_{10}(interwikis) options, but I'm not sure about whether they should have the same weight. Interwiki links are usually better indicators of global interest, while internal links can be skewed by en.wikipedia overlinking. So, I'd put something more like this:

quality + WikiProjectImportanceRating + 50*log_{10}(hits) + 100*log_{10}(linksin) + 250*log_{10}(interwikis)
Maybe that's just me, but I'd like to see the results. Titoxd(?!? - cool stuff) 03:21, 5 February 2008 (UTC)
It's not just you - I think I largely agree, but I wanted to see others' thoughts before I said anything. There can be a danger with interwikis - a US baseball player or Indian cricketer may be important in parts of the English-speaking world, but largely unknown in non-Anglophone countries. But a high number for interwikis is around 100, while the equivalent for linksin would be perhaps 1500, so these two do need to be better balanced. I'll try and do some new tables tomorrow with your formula. I also heard from CBM; he says that this type of log formula should work fine with his bot. Walkerma (talk) 03:54, 5 February 2008 (UTC)
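
For concreteness, using the rough maxima quoted above: with equal weights of 200 the links-in term dominates, since 200*log_{10}(1500) ≈ 635 against 200*log_{10}(100) = 400; a 100/250 split (as in Score7) tips the balance the other way, giving 100*log_{10}(1500) ≈ 318 against 250*log_{10}(100) = 500.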
That is a good point, and certainly the weights can be refined further. I only tried a few combinations before presenting my formula here. Just one thing I noticed: there is one case of an extremely important chemistry article that has very few interwiki links due to the way it was split here: Water (molecule) has only 11 interwiki links, while the main article, Water, has a gazillion. The article is still selected by most formulas, but ranks much lower than one would expect. --Itub (talk) 10:38, 5 February 2008 (UTC)
Yes, very true, H2O is a tricky one, and there are thousands of other similar ones (such as Shakespeare's plays, with ZERO interwikis - does this mean those plays are unimportant?!). This shows, though, the advantages of a system based on four parameters - Water (molecule) still ranks fairly high (though still below methanol and toluene) because of the "Top" rank by WP:Chem, as well as lots of hits and linksin. Throw in the fact that an important topic is likely to be B-Class or better, and we should be able to catch most of these, and I think your log formula really helps with that. Walkerma (talk) 05:08, 6 February 2008 (UTC)
Interestingly enough, the cutoffs need to be adjusted for both log formulae. Titoxd(?!? - cool stuff) 05:41, 6 February 2008 (UTC)
Do you mean the cutoff for inclusion? (If not, then please clarify.) Yes, but that's no problem; it will vary from release to release anyway. But the ordering of the articles looks pretty good, and the weightings of the factors are good too? Or do you see an even better formula? Walkerma (talk) 07:49, 6 February 2008 (UTC)
I looked through the mathematics ratings a little while this evening. Both of the log based scores seem better than the purely additive one, but I had a hard time deciding between those two. The overall ordering seemed good, with no incidents where a very important article was rated much lower than less important articles. — Carl (CBM · talk) 01:39, 7 February 2008 (UTC)
I think that the logarithmic (= multiplicative) ones work fine, but that the hits are given far too much weight. I'd prefer something like the more extreme weighting shown as Score9 in the spreadsheets above.
CRGreathouse (t | c) 23:19, 8 February 2008 (UTC)
I definitely like the logarithmic formula, especially with varying weights for the factors. The five factor system is a really good basis for selection. MahangaTalk 01:31, 9 February 2008 (UTC)
  • I've looked at the online (zoho) version of ratings for the mathematics articles, but was left a bit puzzled. How was the order determined? It doesn't seem to correspond to the order of any particular column. Certain columns are easy to guess, but others are unclear (it would help to include a brief description in the table). For the purposes of testing, it may be a good idea to break up the results according to the subject areas (on the Math ratings template), and then order them from the top down according to each of your algorithms. I'd be happy to give an actual assessment once the results are presented in a friendlier format. Arcfrk (talk) 02:05, 9 February 2008 (UTC)
I've completely revamped this page, and explained what all the formulae mean. Also, I think one of the Zoho pages had a mistake in it, which might explain your problem. Thanks, Walkerma (talk) 07:09, 16 February 2008 (UTC)
  • My background is in psychology, and it seems like Score3 does the best job with the list (although some of the major fields are still stuck in the middle of the list, like Cognitive psychology). Also, on a random tangent, are categories factored in at all? If an article has a lot of categories, might it mean that it is broad enough and thus important enough to get a score bump? I have no idea how much water that might hold; it was just a thought. JoeSmack Talk 01:03, 29 February 2008 (UTC)
Thanks! Most of the psychology articles on this list would make it onto the DVD; only the last 2-4 articles would not qualify, so cognitive psychology is still OK. The categories idea is interesting; I'll have to think about that. I don't think that more categories necessarily equals more important, but there may still be something useful we can use there - as you say, it may indicate a broader topic or an interdisciplinary one. Cheers, Walkerma (talk) 03:11, 29 February 2008 (UTC)
Regarding how to weight the different log factors, I notice that for most Math articles logHits is around 3 to 6 times the logIW number, and around 2 to 4 times the logLinksin number. If we want a balance between all three, then Score7 probably comes closest. One needs to bear in mind the limitations of allowing any one parameter to dominate:
  • A high interwiki does indicate an important article, but a low interwiki does not necessarily mean low importance - consider Shakespeare's plays which has ZERO interwikis, but should be included in V0.7 IMHO.
  • A high Linksin rank may simply mean that it is a minor article that makes it onto one or two popular or populous templates, as with this, my old constituency.
  • A high hit rank indicates popularity, and we certainly want to include popular articles in our releases. However, there is a bias towards pornography and popular culture in such a selection, and we need to ensure that our selection is not swamped by such articles or we will look silly - though we shouldn't be intellectual snobs, either. Example: List of gay porn stars received 113,768 hits last month, whereas Napoleon Bonaparte only received 26,941. Hits can also be severely distorted temporarily by appearing on the WP front page - look at this example!
That's why we need all three parameters, and why I don't want to depend too much on any single one of them. Walkerma (talk) 03:53, 29 February 2008 (UTC)
First, I must say that the log-weighted scores are all fairly similar, so any preference between them must be small. But I think that the extreme weighting I suggested (Score9) is my favorite, giving (compared to Score7) higher rankings to number, function (mathematics), and normal distribution. Unfortunately it also demotes prime number and game theory, which seems regrettable, but on the whole I prefer it. CRGreathouse (t | c) 04:58, 29 February 2008 (UTC)

Updated algorithm

I updated the main page to describe the algorithm currently used by the selection script.

  • I learned from Walkerma today that the bot was not giving the desired points to articles with unassessed importance. I changed the code to give them 200 points, the same as articles from projects that don't use the importance system (a sketch of this fallback rule follows below). It's possible that the "assessed importance" points need to be computed some other way.
  • Walkerma and I are working on the WikiProject importance ratings.
  • I'm planning to run the selection script again once I download updated data. — Carl (CBM · talk) 19:16, 17 July 2008 (UTC)
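
For reference, a minimal sketch of the fallback rule described above - the function name is illustrative, and the 200 points match the "Mid" value of the importance scale:

    IMPORTANCE_POINTS = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}

    def importance_points(rating):
        # Unassessed articles, and articles from projects that don't use
        # the importance scale, both fall back to 200 points.
        return IMPORTANCE_POINTS.get(rating, 200)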