Wikipedia talk:Category intersection

About this proposal

This proposal was started by User:Rick Block and User:SamuelWantman. The initial discussions leading to the proposal are in Archive 1. There is also discussion about the different options here.

Please leave comments about the proposal on this page. Thank you for your input.

A working category intersection today

I wanted to share a new development from the Obi-wan labs. Take a look at Category:Singaporean poets. What I've done is create an easy way for users to do category intersections. The steps were as follows:

  1. I've taken all of the poets in Category:Singaporean poets
  2. I added them to Category:Singaporean men or Category:Singaporean women - top level, generic cat
  3. Then I created a pre-populated link to do category intersections on "Singaporean poet" + "Singaporean woman" or "Singaporean poet" + "Singaporean people of Chinese descent" + "LGBT people from Singapore" - using the WP:CATSCAN tool developed by Magnus Manske.
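
For readers who want to see what such an intersection computes, here is a minimal sketch in Python. It assumes the membership of each category is already available as a set of article titles. The article names are placeholders, and the `intersect` helper is hypothetical and not part of the catscan tool.

```python
# Hypothetical membership data: category name -> set of article titles.
# The article titles are placeholders, not real pages.
members = {
    "Singaporean poets": {"Poet A", "Poet B", "Poet C"},
    "Singaporean women": {"Poet B", "Poet C", "Scientist D"},
    "Singaporean people of Chinese descent": {"Poet C", "Scientist D"},
}


def intersect(categories, members):
    """Return the sorted titles that appear in every named category."""
    sets = [members[name] for name in categories]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Two-way and three-way intersections use the same call.
women_poets = intersect(["Singaporean poets", "Singaporean women"], members)
narrower = intersect(
    [
        "Singaporean poets",
        "Singaporean women",
        "Singaporean people of Chinese descent",
    ],
    members,
)
```

Each article only needs the broad categories; the narrower lists are derived on demand rather than stored as categories of their own.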

Using this technique, we could make a big chunk of this problem go away, at least in the short term, while we're waiting for wikidata to get fully set up.

For any given category, we could just move people up to top-level national men/women/gay/straight/black/white categories (I'm thinking by-country, since that would reduce the search space somewhat), and then stick them all in non-gendered, non-ethnic, non-religious jobs, cities, what have you. And at the top of each category, editors could create pre-populated links to their favorite intersections, essentially replicating the categories that used to exist below. The cats on each bio would become vastly simplified.

Researchers would be able to intersect to their heart's content - something which is actually hard right now with hard-coded ethnic/gendered categories.

In the meantime, we would be bit by bit getting rid of *all* of the ethnic/gendered/sexuality categories, except at the very top level, because they aren't really necessary in the wikidata world - in a way we'd be priming the pump for wikidata by simplifying our bio categorization structure entirely. For now this is v0.1, but please take a look and let me know your thoughts as a possible approach to fix this mess. If all of those commenting just focused on being gnomes on this issue, we could change the whole world in a few weeks I bet, and show the outside world that we're changing, we're doing something about it, immediately.

In cases of more complex categories, we could also add links to recursively enumerate all subcategories, or even to enumerate all subcategories with a particular gender/ethnicity/etc. For example, show me all African-American male athletes, no matter where they are categorized in the Category:American track and field athletes tree - this would be trivial if you have the right categories set up to start with.
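
A rough sketch of that recursive enumeration, assuming the category tree is available as plain dictionaries. The subcategory and article names are placeholders, and the `articles_in_tree` helper is hypothetical; it is not how catscan is implemented.

```python
from collections import deque


def articles_in_tree(root, subcats, articles, max_depth=None):
    """Collect articles in `root` and its subcategories, breadth-first.

    subcats:   category -> list of direct subcategories
    articles:  category -> set of article titles filed directly in it
    max_depth: how many levels below root to descend (None = no limit)
    """
    seen = {root}
    queue = deque([(root, 0)])
    found = set()
    while queue:
        cat, depth = queue.popleft()
        found |= articles.get(cat, set())
        if max_depth is not None and depth >= max_depth:
            continue
        for child in subcats.get(cat, []):
            if child not in seen:  # category graphs can contain loops
                seen.add(child)
                queue.append((child, depth + 1))
    return found


# Placeholder tree with a deliberate loop back to the root.
root = "American track and field athletes"
subcats = {root: ["Subcat 1", "Subcat 2"], "Subcat 1": [root]}
articles = {
    root: {"Athlete A"},
    "Subcat 1": {"Athlete B"},
    "Subcat 2": {"Athlete C"},
}
top_level = {"African-American men": {"Athlete B", "Athlete C", "Someone else"}}

# Everyone in the tree who is also in the top-level category.
matches = articles_in_tree(root, subcats, articles) & top_level["African-American men"]
```

The `seen` set matters because a subcategory can point back at one of its ancestors, and the depth limit corresponds to the "how deep" choice an editor would make when setting up a link.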

Finally, many thanks to Magnus, who wrote the catscan tool. If we did this, we'd want to reach out to him, to make the interface/display a little nicer for the newbies. But as a hack, I think it's not a bad start.

----> Category:Singaporean poets <----

cheers, --Obi-Wan Kenobi (talk) 00:38, 2 May 2013 (UTC)


A place to store links to other relevant conversations:

  • [1] Posting which suggests tagging/etc with wikidata in the works, but no concrete plans on the horizon.
  • [2] nice posting on the challenges of categorization

Known issues

  • The search is slow. We need to find out what the performance might be like for much larger categories. Any search time > 10 seconds is probably way too long.
  • The UI doesn't look like wikipedia. I am chatting with the developer to see if he can address this.

Bandaid category intersection discussion

  • I think this is a great idea! (as nom). :p --Obi-Wan Kenobi (talk) 00:41, 2 May 2013 (UTC)
  • (I get about half of this--the intersecting and all that, I get it, but the technical details are outside of my domain): go for it, please. Drmies (talk) 03:05, 3 May 2013 (UTC)
  • a few more thoughts - some tasks that need to be done to start this:
  1. Set up a template - one of the first tasks to make this workable would be to create a template, {{cat-intersect}}, so that authors could easily add it to the top of a category and specify the cats they want intersected (and whether they want to recurse through subcats, and if so, how deep). That way the user wouldn't need to mess with catscan URLs. So we need someone who knows how to build templates. The template should allow multiple different intersections to be added, and perhaps have collapsible sections for less frequently used intersections.
  2. Refine the catscan UI - so that results come back looking a bit friendlier, and a bit more like a wikipedia page. I plan to get in touch with the developer on this point.
  3. Choose a tree - we should choose a category tree of biographies (perhaps the American novelists tree since it's gotten so much attention? :) ), but I rather think a much smaller tree like Category:American poets would be a better start, and then implement the final solution fully, as a full-blown prototype, and see what feedback we get.
  4. Decide on the top-level categorizations - Should we categorize by nationality + gender, nationality + sexuality, nationality + job, etc., e.g. Category:American men, Category:American catholics, Category:African-Americans, Category:European Americans, Category:American politicians, Category:American journalists, etc? It would reduce the size of the search space and perhaps increase performance. I believe on the German wiki, they put all men into the "Man" category - but with a database as big as enwiki's, that might have major performance implications. The other advantage is that we could keep many bios right where they are - they'd just need to be added to some of the top-level ethnic/gender/religion/sexuality cats (e.g. Category:American men, Category:American women, Category:American intersex people, etc.) We won't even have to do LGBT anymore - that can be split out if needed.
  5. Just do it - Then we need to go and start re-categorizing articles, and decide on the first set of intersections - will there be a somewhat 'standard' set of intersections proposed (this could even be baked into the template)? For example, always intersecting cat + men, cat + women? One challenge will be that, since we have these sorts of pre-formatted intersections, the old rules of categorization are now gone - anyone will be able to propose any intersection. So epic debates may still continue...but even if someone loses, they can still do it, they just have to do it manually.
  6. Write a bot - Once we're comfortable with the approach, someone could write a bot which would automatically de-populate gendered/ethnic cats and stick the bios in the appropriate high-level categories.
  7. Delete the old - Once we've cleaned out the lower-level gendered/ethnic cats, we could delete them, and dance on their grave, celebrating the new dawn of category intersections. Won't it be fun when we delete Category:American women novelists, and yet still have the ability to pull up the list with a click, and also see all of the American women novelists sitting next to their male colleagues in the same exact non-gendered category? We should invite Amanda to that party.
  8. Get feedback - see what the broader community thinks. Tweak it, and then roll it out more.
  9. Change the guidance - Once we have a workable pilot, several guidance pages will need to be changed to reflect the new approach.
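
To make the template idea in step 1 concrete, here is a hedged sketch of the link-building it would do, written in Python rather than template syntax. The endpoint and the parameter names `categories` and `depth` are placeholders chosen for illustration; they are not the real catscan address or query format, which would need to be checked with the developer.

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real catscan address or query format.
BASE_URL = "https://example.org/intersect"


def intersection_link(categories, depth=0):
    """Build a pre-populated link for intersecting `categories`.

    The query parameter names "categories" and "depth" are hypothetical.
    """
    if len(categories) < 2:
        raise ValueError("an intersection needs at least two categories")
    query = urlencode({"categories": "|".join(categories), "depth": depth})
    return f"{BASE_URL}?{query}"


# One link per intersection an editor wants shown at the top of a category.
link = intersection_link(["American poets", "American women"], depth=1)
```

Keeping the address in one place is the point of the sketch: if the tool moves or its query format changes, only the template needs updating, not every category page that uses it.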

The best part is, I believe we can do all of this ourselves - and we can salvage much of the existing category tree (we would just delete intersects like Category:African-American women poets and Category:LGBT writers from the United States), and it probably doesn't require much special technical skill or back-end database hacking. Wikidata is coming, but this hack can be here tomorrow.--Obi-Wan Kenobi (talk) 05:54, 2 May 2013 (UTC)

Comment. I like the idea in general, but: the toolserver (which runs catscan) has been quite shaky over the last months, and we should probably not expect it to become reliably stable again until everything is ported over to Wikimedia Labs. Now the problem with this proposal is that anything except for the top level categories will only be accessible if and when the toolserver is available. I'm not saying that kills the idea, but we should be aware of it when using the toolserver to replace rather than complement a functionality we now use directly on mediawiki. — HHHIPPO 07:09, 2 May 2013 (UTC)
Thanks, it's a good point - and I fully agree this would only be a complement. That's why I'm proposing to use it first and foremost just for the specific ethnic/gender/etc categories - the rest of the structure would remain, and I wasn't suggesting we'd replace *all* of the cats - it would really be an enhancement to the cats, especially gender/ethnicity/sexuality biographic cats (we'd delete most of those) - you've probably seen the recent drama around Category:American women novelists. So if the toolserver was down, people for example wouldn't be able to easily find African-American women poets or American women novelists, but they could find all American poets easily for example. I'm hoping this tradeoff is worth the benefits.--Obi-Wan Kenobi (talk) 07:17, 2 May 2013 (UTC)
Ah, OK. Yes, I've seen the drama, but I try to stay out of it :-). I didn't know the expected outcome is to delete those intersection categories. In that case this sounds indeed like a good test case for your suggestion. Of course in the long run it would be preferable to have that function integrated in MediaWiki itself, so one has more familiar formatting, internal links, related changes and all that, but until then this seems like a nice addition. — HHHIPPO 18:27, 2 May 2013 (UTC)
A technical note: Catscan should be reimplemented on labs. This can't happen yet, but it will be possible later in the year I think. Labs is much more stable than toolserver ever was, although it is more complex for the people developing the tools. I think this would work even better with a MediaWiki extension ... — This, that and the other (talk) 11:32, 3 May 2013 (UTC)
Ok - can you give more details? what does this mean? And more importantly, do we have to worry about it? If we create a template with a URL for catscan, could we just update the template once a newer/better version is in Labs? --Obi-Wan Kenobi (talk) 21:18, 3 May 2013 (UTC)
You can look at mw:Wikimedia Labs, but that page is quite technical in nature. The Labs project badly needs some simple, accessible, clear documentation, but I am not going to be the one to write it!
You shouldn't have to worry about Labs, but if you are making something that depends on the flaky Toolserver, my advice would be to wait until Labs is a viable choice. If the only way of viewing certain "categories" as we now know them is via a semi-broken server, people will not be terribly happy about it. — This, that and the other (talk) 02:07, 4 May 2013 (UTC)
Ok, thanks. I do think it's worth trying, at least at small scale within part of the tree, so we can prototype and get a feel for whether people like it in general or not. What would it take to move the cat scan tool over? Should we ask the developer if he's planning on moving it over to labs?--Obi-Wan Kenobi (talk) 05:18, 4 May 2013 (UTC)
That would be a good start. — This, that and the other (talk) 03:47, 5 May 2013 (UTC)
Note: I have refined the prototype and am now using Category:Singaporean poets as a better/more complex example. --Obi-Wan Kenobi (talk) 20:47, 5 May 2013 (UTC)

As the co-author of this page, I spent many months of my life on this topic, and it is nice to see that interest is still there in making it happen. Something like this proposal was tried in the past, and if I recollect correctly, more than once. We tried to populate the larger categories to illustrate how intersections would work, but also because there is value in having large categories (they function as an index). Each time it was tried, it was quickly reverted, even if the categories were labeled as being temporary or demonstrations. Each time it got harder to implement, as the categorizations of the articles in the large categories were removed almost as fast as they were added by well-meaning editors recognizing that the categorizations were incompatible with current best practice. I talked at one point with Brion Vibber to create invisible categories so we could try to do this experiment under the radar. Invisible categories were created quickly, but then all attempts to use them this way were quashed by the masses.

If you want this to happen, I'd suggest that the developers add a parallel categorization scheme in a new namespace, which I'd call tags. They would be created and populated exactly the same way that categories are, but their function would be different. The developers wouldn't have to create the means to create tag intersections from the outset, that could come later. The tags should be extremely broad -- Men, Women, British, French, Writers, Actors, etc... The guidelines for creating these tags should be well thought out in advance, along with the criteria for applying them, with the understanding that intersection would be implemented at some point in the near future. Then the developers can announce that this new namespace has been created, along with the guidelines for how they should work.

If it is not possible to create dynamic tag intersections on the fly, then the tag intersections could either be created manually or with a bot. Perhaps there is some way to cache the intersections so that they are only updated once a day, or once a week. This might greatly reduce the server load. The update timing could be dynamically adjusted as needed.
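
The caching idea can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: `IntersectionCache` and its `compute` callback are hypothetical names, and a real implementation would live on the server side rather than in a script.

```python
import time


class IntersectionCache:
    """Cache intersection results and recompute only when they are stale."""

    def __init__(self, compute, max_age_seconds=24 * 60 * 60, clock=time.time):
        self._compute = compute  # function: tuple of tag names -> result
        self.max_age_seconds = max_age_seconds  # adjustable at run time
        self._clock = clock
        self._entries = {}  # key -> (timestamp, result)

    def get(self, tags):
        key = tuple(sorted(tags))  # the order of the tags does not matter
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.max_age_seconds:
            return entry[1]  # still fresh: no recomputation
        result = self._compute(key)
        self._entries[key] = (now, result)
        return result
```

The default refresh interval here is one day. Raising `max_age_seconds` to a week, or lowering it for heavily used intersections, is the dynamic adjustment of update timing described above.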

I don't think it is possible to get this to work using the current categorization system. It will be too frustrating, and ultimately fail. It might work the way I described. I'd be happy to help in the effort. -- SamuelWantman 01:45, 8 May 2013 (UTC)

Thanks Sam - I appreciate your insights and would love your help. I think trying to get significant new dev work started on this might be a non-starter, as the future is apparently Wikidata, which has some thoughts of doing something like this (but much more sophisticated) - but I have no idea how far off that is, and in any case I'm not sure if WMF would be willing to throw dev resources behind it. I've asked at several places, but have been told to go to the mailing list, which I may do next. For me, one question is, what can we do today that doesn't require major changes to the MediaWiki software? Do you think, if we piloted this in a significant tree - say British novelists for example - and got buy-in from the users there, we could do a community-wide RFC to get approval to do this at scale wherever editors felt it would be useful (esp on bios, I mean)? We could update the WP:Categorization guidelines to reflect this consensus. Then, you could point to that RFC + guidance as evidence of community consensus to de-genderize/de-ethnicize various categories. I've also just been reading a bit about how they deal with this on the Italian Wikipedia - it seems they have a bio template, and this template generates categories automatically for the people. I've also been speaking with the developer of the catscan tool, and he has some ideas on UI improvements that might make it faster and feel more integrated vs going out to a separate page that looks different. In any case, I see your point about reverts - I wonder if the hidden category trick would work again? If you could also point me to past debates/discussions on this issue, that would be useful. Cheers, Obi-Wan Kenobi (talk) 02:59, 8 May 2013 (UTC)

IEG proposal on the category system in the English Wikipedia

I have submitted a proposal for an Individual Engagement Grant for the first phase of a project looking at the category systems in Wikimedia wikis. In this first phase I will research the nature of the English Wikipedia's category system, as the first step in designing ways to optimize category systems throughout WMF wikis. In later phases, I plan to

  • Research how readers and editors utilize the category system in the English Wikipedia.
  • Investigate the category systems in other language Wikipedias and in other WMF projects.
  • Explore the value and feasibility of using Wikidata as the basis for the category system across WMF wikis. If deemed appropriate by the community, work with the community to develop and implement this.
  • Utilize user-centered design methodologies to prototype various enhancements to the category system, including category intersection, to improve the user experience. If deemed appropriate by the community, work with the community to develop and implement such enhancements.

If you would like to endorse this proposal, you can do so here. I would also appreciate any other feedback, pro or con, which can be posted here. Thanks! Libcub (talk) 06:21, 7 April 2014 (UTC)