User:Mill 1/Wikipedia queries for death dates
Context
During the first phase of Chaining back the Years I relied on an Excel tool that used macros to semi-automate the huge list of tedious tasks. One of them was to resolve the list of bios whose subjects had died on the date I was processing. To do that, I searched Wikipedia for the death date in particular formats by using regular expressions. This process was automated by firing them at the website in a series of HTTP requests. The tool would then parse the responses into results I could use for further processing. See an excerpt of the actual code at the end of this article.
Automated queries
The Excel application would look for the date of death only in the infobox of the person's bio. The reasoning was that the presence of an infobox would act as a first filter: an article without even an infobox should not be added to a Deaths in Year list. This was a decision I came to regret later (indicator).[1]
For each date, 12 different regular expressions are applied. For instance: who died on 4 August 2009? The search expressions:
- August insource:/[Dd]eath date and age\|2009\|0?8\|0?4\|/
- August insource:/[Dd]eath date and age\|df=yes\|2009\|0?8\|0?4\|/
- August insource:/[Dd]eath date and age\|df=y\|2009\|0?8\|0?4\|/
- August insource:/[Dd]eath date and age\|mf=yes\|2009\|0?8\|0?4\|/
- August insource:/[Dd]eath date and age\|mf=y\|2009\|0?8\|0?4\|/
- August insource:/[Dd]eath date\|2009\|0?8\|0?4\|/
- August insource:/death_date[ ]*=[ ]*August 4, 2009/
- August insource:/death_date[ ]*=[ ]*4 August 2009/
- August insource:/[Dd]eath-date and age\|August 4, 2009\|/
- August insource:/[Dd]eath-date and age\|4 August 2009\|/ (e.g. {{death-date and age|4 August 2009|6 May 1957}} in Mbah Surip)
- August insource:/d-da\|August 4, 2009\|/ (e.g. Robert Mitsuhiro Takasugi)
- August insource:/d-da\|4 August 2009\|/ (e.g. Gunnar Håkansson)
Manual queries
If the automated searches yielded unsatisfactory results, I would resort to manual searches using the following templates:
Inside the infobox
"August" insource:/\|2009\|0?8\|0?4/
Then search in the (max 500) results for death_date. Works great in Google Chrome!
Outside the infobox
- August insource:/August 11, 1997\)/ August insource:/-[ ]*August 11, 1997\)/ August insource:/ndash;[ ]*August 11, 1997\)/
If all this yields no results, one can also use these queries, which look for additional hits:
- August insource:/August 4, 2009\)/
- August insource:/-[ ]*August 4, 2009\)/
- August insource:/ndash;[ ]*August 4, 2009\)/
- "Died August 4, 2009)"
- "Died 4 August 2009)"
- Google the site: "died on August 4, 2009"
- Google the site: "died August 4, 2009"
- Google the site: "died on 4 August 2009"
- Google the site: "died 4 August 2009"
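To see what the three regular expressions for text outside the infobox would catch, they can be tried locally on sample text. This is a rough sketch in Python, not part of the Excel tool. It assumes that `insource:` matches anywhere in the page source (hence `re.search`) and that these particular patterns behave the same in Python's `re` as in the search engine's regex dialect. The function names and sample sentences are invented for illustration.

```python
import re


def outside_infobox_patterns(date_text: str) -> list[str]:
    """Return the three regexes used for a date that closes a lifespan in the lead.

    date_text is plain text such as 'August 4, 2009'; it contains no regex
    metacharacters, so it is concatenated without escaping.
    """
    return [
        date_text + r"\)",                # ... August 4, 2009)
        r"-[ ]*" + date_text + r"\)",     # ... - August 4, 2009)
        r"ndash;[ ]*" + date_text + r"\)",  # ... &ndash; August 4, 2009)
    ]


def matching_patterns(text: str, date_text: str) -> list[str]:
    """Return the patterns that match somewhere in text (unanchored search)."""
    return [p for p in outside_infobox_patterns(date_text) if re.search(p, text)]


if __name__ == "__main__":
    # Invented sample sentences, not taken from real articles.
    samples = [
        "Some Person (May 6, 1957 - August 4, 2009) was a singer.",
        "Some Person (May 6, 1957 &ndash; August 4, 2009) was a singer.",
        "Some Person died on August 4, 2009 in a hospital.",
    ]
    for sample in samples:
        print(sample, "->", matching_patterns(sample, "August 4, 2009"))
```

The first pattern is the broadest and also matches whenever either of the other two does, which is why it can return many more hits than the hyphen and ndash variants.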
Code excerpt
This is a small example of the VBA code in the Excel tool. The subroutine sends a request containing a specific regular expression and passes the response on for further processing.
Private Sub ProcessRegularExpression(sRegExpBase As String, sRegExpText As String, sRegExpDoD As String, ByRef iRow As Integer, eDateType As DateType)
    Dim sResponse As String
    Dim sXML As String
    Dim lPos As Long

    'Assemble the search term: base, then the regular expression between slashes
    sXML = sRegExpBase & F_SLASH & sRegExpText & sRegExpDoD & F_SLASH
    Debug.Print URL_BASE & sXML

    'Show progress in the worksheet and keep Excel responsive
    Me.Range("Result") = "Checking expression " & GetSearchTerm(sXML) & "..."
    DoEvents

    'Send http request and receive response
    sResponse = SendRequest(URL_BASE & sXML, sXML, True)

    'Process every occurrence of "data-serp-pos" in the response
    Do
        lPos = InStr(lPos + 1, sResponse, "data-serp-pos")
        If lPos > 0 Then
            Call ProcessDeceased(sResponse, lPos, iRow, sRegExpText & sRegExpDoD, eDateType)
        End If
    Loop While lPos > 0
End Sub
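For readers without Excel, the same two steps (build a search URL, then walk through the response marker by marker) can be sketched in Python. This is not a translation of the tool: the value of `URL_BASE` is not shown in the excerpt, so the URL and its `search`, `ns0` and `limit` parameters below are assumptions, and `build_search_url` and `find_hits` are hypothetical names. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the regular Wikipedia search page. The tool's real URL_BASE is unknown.
SEARCH_URL = "https://en.wikipedia.org/w/index.php"


def build_search_url(query: str, limit: int = 500) -> str:
    """Return a search URL for one query, with the query string URL-encoded."""
    params = {"search": query, "ns0": 1, "limit": limit}
    return SEARCH_URL + "?" + urlencode(params)


def find_hits(response: str, marker: str = "data-serp-pos") -> list[int]:
    """Return the position of every marker in the response.

    Mirrors the Do ... Loop While in the VBA excerpt: search again from just
    past the previous position until the marker is no longer found.
    Positions are 0-based here, whereas VBA's InStr is 1-based.
    """
    positions = []
    pos = response.find(marker)
    while pos != -1:
        positions.append(pos)
        pos = response.find(marker, pos + 1)
    return positions


if __name__ == "__main__":
    print(build_search_url(r"August insource:/d-da\|4 August 2009\|/"))
    # Invented response fragment with two markers.
    html = '<a data-serp-pos="0">A</a><a data-serp-pos="1">B</a>'
    print(find_hits(html))
```

In the VBA original each position is handed to `ProcessDeceased` together with the full response; here the positions are simply collected in a list.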