User:Tazerdadog/sandbox 2
Hello everyone. I have developed a way to procedurally generate short descriptions for the 387,816 articles in Category:Articles with 'species' microformats. My method, while not perfect, generally produces a better short description than the Wikidata description. It operates by copying a snippet from the first sentence of the lede that is suitable as the short description. This method relies on the relatively systematic way articles in this category are written - use caution before expanding this beyond this category.
Pseudocode
The purpose of this is to explain exactly how a program would generate these summaries:
Download the wikitext for an article in Category:Articles with 'species' microformats
Run the regex string (?<=(. a | an ))(.*?)(?=(\.|,| which | known | found | describe|<ref|\(| native | grow| that | within | from | cause)) on the wikitext
Take the first match generated by this regex, ignore/discard the rest.
This produces the basic short description; now we need to clean it up.
Start loop
Run the regex \[\[[^\]]*\| to identify the left side of piped links.
If any matches were found, remove the matched text and repeat the loop
End loop
Run the above loop three more times, replacing the regex line with each of the lines below in turn, to strip out links, bold, and italics
Run the regex \[
Run the regex \]
Run the regex [']{2,}
If the string is "Gram-negative" replace it with "Gram-negative bacteria"
Check whether there's a space in the string
if there is not
Add the article to a list for carbon-based intelligence to deal with, then skip the article
Check the length of the string
if length in characters > 70
Add the article to a list for carbon-based intelligence to deal with, then skip the article
else, add {{shortdescription|(the remaining regex match)}} to the article. Include attribution in the edit summary.
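The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tested bot: the names `short_description`, `strip_markup`, `remove_all`, and `MAX_LENGTH` are made up for illustration, the regex uses non-capturing groups (which should not change what is matched), and downloading the wikitext and saving the edit are left out entirely. A return value of `None` stands for "add the article to the human list".

```python
import re

# Stop codes copied from the regex in the pseudocode.
STOP_CODES = [
    r"\.", ",", " which ", " known ", " found ", " describe", "<ref", r"\(",
    " native ", " grow", " that ", " within ", " from ", " cause",
]

# Same pattern as in the pseudocode, written with non-capturing groups.
SNIPPET_RE = re.compile(
    r"(?<=(?:. a | an ))(.*?)(?=(?:" + "|".join(STOP_CODES) + "))"
)

MAX_LENGTH = 70  # longer descriptions are left for a human


def remove_all(pattern, text):
    """Delete matches of pattern repeatedly until none remain."""
    while True:
        text, count = re.subn(pattern, "", text)
        if count == 0:
            return text


def strip_markup(text):
    """Strip piped-link targets, link brackets, and bold/italic quote runs."""
    for pattern in (r"\[\[[^\]]*\|", r"\[", r"\]", r"[']{2,}"):
        text = remove_all(pattern, text)
    return text


def short_description(wikitext):
    """Return the short description, or None if a human should handle it."""
    match = SNIPPET_RE.search(wikitext)  # only the first match is used
    if match is None:
        return None
    text = strip_markup(match.group(1))
    if text == "Gram-negative":
        text = "Gram-negative bacteria"
    if " " not in text or len(text) > MAX_LENGTH:
        return None
    return text
```

Called on an invented first sentence such as `"Foo is a [[moth]] of the [[Erebidae|Erebidae family]]."`, the sketch strips the link markup and returns `moth of the Erebidae family`; a snippet with no spaces, such as `rod-shaped`, comes back as `None`.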
How it works
The regex looks for a string that is immediately preceded by either "any character, space, lowercase a, space" or "space, lowercase a, lowercase n, space". The any-character wildcard is there because both lookbehind alternatives must be the same length (four characters in this case), at least in regex engines that require fixed-width lookbehinds. It then lazily matches characters (the short description) until the text immediately following it is one of several stop codes.
All links and bolding/italics are then stripped out of the short description. If the short description is longer than 70 characters, it is left for a human. If the short description contains no spaces, it's left for a human. Otherwise it is posted at the top of the article in the shortdescription template.
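The same-length requirement on the lookbehind can be checked directly. The sketch below assumes Python's built-in `re` module, which rejects lookbehinds whose alternatives differ in width; other engines may behave differently. The helper names `compiles` and `first_word_after_article` are invented for this illustration.

```python
import re

UNPADDED = r"(?<=(?: a | an ))"   # alternatives are 3 and 4 characters wide
PADDED = r"(?<=(?:. a | an ))"    # both alternatives are 4 characters wide


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def first_word_after_article(text):
    """Return the first word that follows ' a ' or ' an ', or None."""
    match = re.search(PADDED + r"\w+", text)
    return match.group(0) if match else None


print(compiles(UNPADDED))  # rejected: lookbehind is not fixed-width
print(compiles(PADDED))    # accepted
print(first_word_after_article("It is a moth"))
print(first_word_after_article("It is an owl"))
```

One side effect of the padding is that " a " needs at least one character before it, so a string that begins with " a " is not matched by that branch. In article text this rarely matters, since the article follows a verb such as "is".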
Results
I used Random page in category Articles with 'species' microformats to generate a sample of articles. The article, my procedurally generated short description, and the Wikidata description are included.
Article | Procedurally generated description | Wikidata description | Notes |
---|---|---|---|
Profundiconus pacificus | species of sea snail | species of mollusc | A good example of the improvements achievable over a Wikidata import. |
Catocala caesia | moth of the Erebidae family | species of insect | |
Pterostylis daintreana | species of orchid endemic to eastern Australia | species of plant | endemic should probably be added to the stop codes |
Sewa taiwana | moth of the Drepanidae family | species of insect | |
Lactobacillus pontis | (skipped, added to human list) | species of prokaryote | Algorithm produced "rod-shaped", which gets kicked for a lack of spaces. Bacteria articles are hard on my algorithm. Is there a subcategory I can skip? |
Ross seal | true seal | species of mammal | |
Turner's thick-toed gecko | species of gecko | species of reptile | |
Coleophora sylvaticella | moth of the Coleophoridae family | species of insect | |
Solirubrobacter pauli | mesophilic Gram-positive and aerobic bacterium | (none) | 46 characters, algorithm got lucky here. |
Leucotabanus ambiguus | species of horse flies in the subfamily Tabaninae | species of insect | |
Chersodromus | genus of snakes of the family Colubridae | genus of reptiles | |
Artedius harringtoni | (skipped, added to a list for humans to parse) | species of fish | Algorithm returned "demersal", which is rejected for lack of spaces. |
Mitrella blanda | species of sea snail | species of mollusc | |
Givira aregentipuncta | moth in the Cossidae family | species of insect | |
Medicorophium | genus of amphipod crustaceans | genus of crustaceans | |
Scrophularia ningpoensis | perennial plant of the family Scrophulariaceae | species of plant | |
Anadasmus sororia | moth of the Depressariidae family | species of insect | |
Hakea flabellifolia | shrub of the genus Hakea | species of plant | |
Shrew | small mole-like mammal classified in the order Eulipotyphla | family of mammals | 58 characters |
Gascoyne's Scarlet | English cultivar of domesticated apple | apple | |
Barred thicklip | species of fish belonging to the wrasse Family | species of fish | Need to switch order in which I strip links and check regex. |
Moluccan scops owl | owl found in Indonesia | species of owl | |
Decisions
The things we need to decide:
- Are the short descriptions generated in this way good enough for semiautomatic posting? For automatic posting? Even at one edit per second, semi-automatic posting across 387,816 articles is a job of more than 100 hours.
- Do we bias towards shorter summaries by adding " in ", and " belonging " to the ending criteria?
- What is an acceptable "Fail Rate" wherein the bot posts something in the short description that is inappropriate for the short description?
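To get a feel for the second question, the extra stop codes can be tried on descriptions from the Results table. This is only an approximation: it cuts the already-cleaned descriptions rather than re-running the regex on the wikitext, and `EXTRA_STOPS` and `truncate` are names invented for this sketch.

```python
EXTRA_STOPS = (" in ", " belonging ")


def truncate(description, extra_stops=EXTRA_STOPS):
    """Cut the description at the earliest extra stop code, if any occurs."""
    cut = len(description)
    for stop in extra_stops:
        index = description.find(stop)
        if index != -1:
            cut = min(cut, index)
    return description[:cut]


samples = [
    "species of horse flies in the subfamily Tabaninae",
    "species of fish belonging to the wrasse Family",
    "moth in the Cossidae family",
    "small mole-like mammal classified in the order Eulipotyphla",
]

for sample in samples:
    shortened = truncate(sample)
    # A result without spaces would be sent to the human list.
    verdict = "kept" if " " in shortened else "human list"
    print(f"{sample!r} -> {shortened!r} ({verdict})")
```

The first two samples shorten cleanly, but "moth in the Cossidae family" collapses to "moth", which the no-spaces check would send to the human list, and the shrew description ends awkwardly at "classified". Biasing towards shorter summaries therefore trades some good descriptions for more manual work.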
Moving Forward
To move this forward, we need to do a couple of things:
- Develop a strong consensus that adding these summaries is a good thing, and that the occasional mistakes are worth it.
- Refine this process more. There's still some low-hanging fruit for improvement.
- Find a bot operator willing to implement this and make the runs. I could probably do it, but it will be difficult for me, as this would be my first bot.
I'm inviting comments on this now - if it looks good, we can get a consensus for it and I'll start refining it.
Cheers, Tazerdadog (talk) 05:38, 3 June 2018 (UTC)