User:Tazerdadog/sandbox 2

From Wikipedia, the free encyclopedia

Hello everyone. I have developed a way to procedurally generate short descriptions for the 387,816 articles in Category:Articles with 'species' microformats. My method, while not perfect, generally produces a better short description than the Wikidata description. It operates by copying a snippet from the first sentence of the lede that is suitable as the short description. This method relies on the relatively systematic way articles in this category are written - use caution before expanding this beyond this category.

Pseudocode

The purpose of this is to explain exactly how a program would generate these summaries:

Download the wikitext for an article in Category:Articles with 'species' microformats

Run the regex string (?<=(. a | an ))(.*?)(?=(\.|,| which | known | found | describe|<ref|\(| native | grow| that | within | from | cause)) on the wikitext

Take the first match generated by this regex, ignore/discard the rest.

This produces the basic short description. Now we need to clean it up.

Start loop

Run the regex \[\[[^\]]*\| to identify the left side of piped links.

If any matches were found, remove the matched text and repeat the loop

End loop

Run the above loop three more times, replacing the regex line with each of the lines below in turn, to strip out the remaining link brackets, bold, and italics

Run the regex \[

Run the regex \]

Run the regex [']{2,}

If the string is exactly "Gram-negative", replace it with "Gram-negative bacteria"

Check whether there's a space in the string

If there is not, add the article to a list for carbon-based intelligence to deal with, then skip the article

Check the length of the string

If the length in characters is greater than 70, add the article to a list for carbon-based intelligence to deal with, then skip the article

Otherwise, add {{shortdescription|(the remaining regex match)}} to the article. Include attribution in the edit summary.
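
The extraction and cleanup steps above can be sketched in Python as follows. This is a minimal sketch, not the code behind the results on this page: the function names and the sample sentence are hypothetical, the regex groups are written as non-capturing so that group 1 is the snippet, and the space and length checks are left out.

```python
import re

# Extraction pattern from the steps above; groups made non-capturing (an
# assumption for convenience) so that group(1) is the snippet itself.
SNIPPET_RE = re.compile(
    r"(?<=(?:. a | an ))(.*?)"
    r"(?=(?:\.|,| which | known | found | describe|<ref|\(| native | grow"
    r"| that | within | from | cause))"
)

# Cleanup patterns, in the order the steps list them.
CLEANUP_PATTERNS = [
    re.compile(r"\[\[[^\]]*\|"),  # left side of a piped link, e.g. "[[target|"
    re.compile(r"\["),            # remaining opening brackets
    re.compile(r"\]"),            # remaining closing brackets
    re.compile(r"[']{2,}"),       # runs of apostrophes (bold and italics)
]


def first_snippet(wikitext):
    """Return the first match of the extraction regex, or None."""
    match = SNIPPET_RE.search(wikitext)
    return match.group(1) if match else None


def strip_markup(snippet):
    """Apply each cleanup pattern repeatedly until it no longer matches."""
    for pattern in CLEANUP_PATTERNS:
        while pattern.search(snippet):
            snippet = pattern.sub("", snippet)
    return snippet


# Hypothetical lede sentence, not copied from a real article.
sample = ("'''''Examplea testus''''' is a species of "
          "[[Gastropoda|sea snail]], a marine [[mollusc]].")
print(strip_markup(first_snippet(sample)))
```

Because `sub` removes every occurrence in one pass, each `while` loop runs at most twice here; it is kept only to mirror the loop in the steps above.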

How it works

The regex looks for a string that is immediately preceded by "any character, space, lowercase a, space" or "space, lowercase a, lowercase n, space". The leading "any character" is there because both lookbehind alternatives must be the same length (four characters in this case). The regex then matches as few characters as possible (the short description) until the text immediately following the match is one of several stop codes.
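
The fixed-width requirement can be checked directly. The regex engine used is not stated here; the sketch below assumes Python's `re` module, which is one engine that rejects a lookbehind whose alternatives differ in length. The helper name `compiles` and the test strings are hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Both alternatives are four characters wide, so the pattern is accepted.
FIXED = r"(?<=(?:. a | an ))\w+"
# " a " is three characters and " an " is four, so the pattern is rejected.
VARIABLE = r"(?<=(?: a | an ))\w+"

print(compiles(FIXED))
print(compiles(VARIABLE))

# The padded alternative still matches text that follows " a ".
print(re.search(FIXED, "It is a seal").group())
print(re.search(FIXED, "It is an owl").group())
```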

All links and bolding/italics are then stripped out of the short description. If the short description is longer than 70 characters, it is left for a human. If the short description contains no spaces, it's left for a human. Otherwise it is posted at the top of the article in the shortdescription template.
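
The final checks can be sketched as a small function. The function name `triage`, its return convention, and the `MAX_LENGTH` constant are hypothetical; the 70-character limit, the space check, and the "Gram-negative" special case come from the pseudocode.

```python
MAX_LENGTH = 70  # limit in characters, from the pseudocode


def triage(description):
    """Return ("post", template_text) or ("human", reason) for a cleaned description."""
    if description == "Gram-negative":
        description = "Gram-negative bacteria"
    if " " not in description:
        return ("human", "no spaces")
    if len(description) > MAX_LENGTH:
        return ("human", "longer than 70 characters")
    return ("post", "{{shortdescription|" + description + "}}")


print(triage("species of sea snail"))
print(triage("demersal"))
```

Returning a reason alongside the decision is a design choice made for this sketch, so that the list left for humans records why each article was skipped.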

Results

I used Random page in category Articles with 'species' microformats to generate a sample of articles. For each article, the table lists my procedurally generated short description, the Wikidata description, and any notes.

| Article | Procedurally generated description | Wikidata description | Notes |
| Profundiconus pacificus | species of sea snail | species of mollusc | A good example of the improvements achievable over a Wikidata import. |
| Catocala caesia | moth of the Erebidae family | species of insect | |
| Pterostylis daintreana | species of orchid endemic to eastern Australia | species of plant | "endemic" should probably be added to the stop codes. |
| Sewa taiwana | moth of the Drepanidae family | species of insect | |
| Lactobacillus pontis | (skipped, added to human list) | species of prokaryote | Algorithm produced "rod-shaped", which gets kicked for a lack of spaces. Bacteria articles are hard on my algorithm. Is there a subcategory I can skip? |
| Ross seal | true seal | species of mammal | |
| Turner's thick-toed gecko | species of gecko | species of reptile | |
| Coleophora sylvaticella | moth of the Coleophoridae family | species of insect | |
| Solirubrobacter pauli | mesophilic Gram-positive and aerobic bacterium | (none) | 46 characters; the algorithm got lucky here. |
| Leucotabanus ambiguus | species of horse flies in the subfamily Tabaninae | species of insect | |
| Chersodromus | genus of snakes of the family Colubridae | genus of reptiles | |
| Artedius harringtoni | (skipped, added to a list for humans to parse) | species of fish | Algorithm returned "demersal", which is rejected for lack of spaces. |
| Mitrella blanda | species of sea snail | species of mollusc | |
| Givira aregentipuncta | moth in the Cossidae family | species of insect | |
| Medicorophium | genus of amphipod crustaceans | genus of crustaceans | |
| Scrophularia ningpoensis | perennial plant of the family Scrophulariaceae | species of plant | |
| Anadasmus sororia | moth of the Depressariidae family | species of insect | |
| Hakea flabellifolia | shrub of the genus Hakea | species of plant | |
| Shrew | small mole-like mammal classified in the order Eulipotyphla | family of mammals | 58 characters |
| Gascoyne's Scarlet | English cultivar of domesticated apple | apple | |
| Barred thicklip | species of fish belonging to the wrasse Family | species of fish | Need to switch order in which I strip links and check regex. |
| Moluccan scops owl | owl found in Indonesia | species of owl | |
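
The note on the Barred thicklip row concerns the order of operations. One plausible reading (an assumption, since the article's wikitext is not quoted here) is that the "(" stop code fired inside a piped link target before the link was stripped. The sketch below compares the two orders on a hypothetical sentence, with hypothetical function names; the cleanup regexes are condensed into a single alternation for brevity.

```python
import re

# Extraction pattern from the pseudocode, with non-capturing groups.
SNIPPET_RE = re.compile(
    r"(?<=(?:. a | an ))(.*?)"
    r"(?=(?:\.|,| which | known | found | describe|<ref|\(| native | grow"
    r"| that | within | from | cause))"
)
# Piped-link prefix, brackets, and apostrophe runs in one alternation.
MARKUP_RE = re.compile(r"\[\[[^\]]*\||\[|\]|'{2,}")


def extract(text):
    """Return the first snippet matched by the extraction regex, or None."""
    match = SNIPPET_RE.search(text)
    return match.group(1) if match else None


def extract_then_strip(wikitext):
    """Order used in the pseudocode: match first, then remove markup."""
    snippet = extract(wikitext)
    return MARKUP_RE.sub("", snippet) if snippet is not None else None


def strip_then_extract(wikitext):
    """Switched order: remove markup first, then match."""
    return extract(MARKUP_RE.sub("", wikitext))


# Hypothetical sentence; the "(" sits inside the target of a piped link.
sample = ("The '''barred thicklip''' is a species of [[fish]] belonging to the "
          "[[wrasse]] [[Family (biology)|family]] Labridae, found in the Indo-Pacific.")
print(repr(extract_then_strip(sample)))
print(repr(strip_then_extract(sample)))
```

Stripping first changes the text that the stop codes see, so the switched order would need to be re-checked against the whole sample before any run.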

Decisions

The things we need to decide:

  1. Are the short descriptions generated in this way good enough for semiautomatic posting? For automatic posting? Even at one edit per second, semi-automatic posting across all 387,816 articles is still a job of roughly 108 hours.
  2. Do we bias towards shorter summaries by adding " in " and " belonging " to the ending criteria?
  3. What is an acceptable "Fail Rate" wherein the bot posts something in the short description that is inappropriate for the short description?

Moving Forward

To move this forward, we need to do a couple of things:

  1. Develop a strong consensus that adding these summaries is a good thing, and that the occasional mistakes are worth it.
  2. Refine this process more. There is still some low-hanging fruit for improvement.
  3. Find a bot operator willing to implement this and make the runs. I could probably do it, but it will be difficult for me, as this would be my first bot.

I'm inviting comments on this now - if it looks good, we can get a consensus for it and I'll start refining it.

Cheers, Tazerdadog (talk) 05:38, 3 June 2018 (UTC)