User:Mill 1/Wikipedia queries for death dates

From Wikipedia, the free encyclopedia

Context

During the first phase of Chaining back the Years I relied on an Excel tool that used macros to semi-automate a huge list of tedious tasks. One of them was to compile the list of bios whose subject had died on the date I was processing. To do that, I searched Wikipedia for the death date in particular formats by using regular expressions. This process was automated by firing them at the website in a series of HTTP requests. The tool would then parse the response into results I could use for further processing. See an excerpt of the actual code at the end of this article.

Automated queries

Screenshot of the Excel tool which generated most of the wikitext

The Excel application looked for the date of death only in the infobox of the person's bio. The reasoning was that the presence of an infobox would act as a first filter: an article without even an infobox should not be added to a Deaths in Year list. I came to regret that decision later (indicator).[1]

Twelve different regular expressions are applied per date. For instance, to find out who died on 4 August 2009, the following queries are used:

  1. August insource:/[Dd]eath date and age\|2009\|0?8\|0?4\|/
  2. August insource:/[Dd]eath date and age\|df=yes\|2009\|0?8\|0?4\|/
  3. August insource:/[Dd]eath date and age\|df=y\|2009\|0?8\|0?4\|/
  4. August insource:/[Dd]eath date and age\|mf=yes\|2009\|0?8\|0?4\|/
  5. August insource:/[Dd]eath date and age\|mf=y\|2009\|0?8\|0?4\|/
  6. August insource:/[Dd]eath date\|2009\|0?8\|0?4\|/
  7. August insource:/death_date[ ]*=[ ]*August 4, 2009/
  8. August insource:/death_date[ ]*=[ ]*4 August 2009/
  9. August insource:/[Dd]eath-date and age\|August 4, 2009\|/
  10. August insource:/[Dd]eath-date and age\|4 August 2009\|/ (for instance, {{death-date and age|4 August 2009|6 May 1957}} in Mbah Surip)
  11. August insource:/d-da\|August 4, 2009\|/ (for instance, Robert Mitsuhiro Takasugi)
  12. August insource:/d-da\|4 August 2009\|/ (for instance, Gunnar Håkansson)
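
The twelve queries differ only in the template name and the date format, so they can be generated from a single date. The following is a minimal sketch in Python, not the VBA of the Excel tool; the function name `infobox_queries` and the `MONTHS` tuple are hypothetical names introduced for illustration.

```python
from datetime import date

# English month names, spelled out so the result does not depend on the locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def infobox_queries(d: date) -> list[str]:
    """Build the twelve insource: queries listed above for one date of death."""
    month = MONTHS[d.month - 1]
    # Numeric template arguments, e.g. 2009\|0?8\|0?4\| (leading zeros optional).
    ymd = rf"{d.year}\|0?{d.month}\|0?{d.day}\|"
    mdy = f"{month} {d.day}, {d.year}"   # August 4, 2009
    dmy = f"{d.day} {month} {d.year}"    # 4 August 2009

    # Queries 1-5: {{death date and age}} with an optional df/mf flag.
    flags = ("", r"df=yes\|", r"df=y\|", r"mf=yes\|", r"mf=y\|")
    patterns = [rf"[Dd]eath date and age\|{flag}{ymd}" for flag in flags]
    # Query 6: {{death date}} without the age.
    patterns.append(rf"[Dd]eath date\|{ymd}")
    # Queries 7-8: plain text after the death_date parameter.
    patterns += [f"death_date[ ]*=[ ]*{text}" for text in (mdy, dmy)]
    # Queries 9-12: the text-based templates and their d-da shorthand.
    patterns += [rf"[Dd]eath-date and age\|{text}\|" for text in (mdy, dmy)]
    patterns += [rf"d-da\|{text}\|" for text in (mdy, dmy)]

    # The leading month name is the plain search term placed before insource:.
    return [f"{month} insource:/{p}/" for p in patterns]


if __name__ == "__main__":
    for number, query in enumerate(infobox_queries(date(2009, 8, 4)), start=1):
        print(f"{number}. {query}")
```

Calling `infobox_queries(date(2009, 8, 4))` reproduces the list above in the same order.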

Manual queries

If the automatic searches yielded unsatisfactory results, I would resort to manual searches using the following templates:

Inside the infobox

"August" insource:/\|2009\|0?8\|0?4/
Then search in the (max 500) results for death_date. Works great in Google Chrome!
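
The second step matters because the pattern matches any template that carries these three numbers, not only a date of death. A rough sketch in Python of the same two-step check on a piece of wikitext follows; the function name `mentions_death_date` and the sample lines are made up for illustration, and Python's `re` module stands in for the regular expression engine of the Wikipedia search, which may behave differently on more complex patterns.

```python
import re

# The pattern from the manual query, written for Python's re module.
DATE_FIELDS = re.compile(r"\|2009\|0?8\|0?4")


def mentions_death_date(wikitext: str) -> bool:
    """True when some line contains both death_date and the numeric date."""
    return any(
        "death_date" in line and DATE_FIELDS.search(line) is not None
        for line in wikitext.splitlines()
    )


if __name__ == "__main__":
    # Hypothetical infobox fragments, not taken from real articles.
    hit = "| death_date = {{death date and age|2009|8|4|1957|5|6}}"
    miss = "| birth_date = {{birth date|2009|08|04}}"
    print(mentions_death_date(hit))
    print(mentions_death_date(miss))
```

Both sample lines match the regular expression, but only the first one also contains death_date.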

Outside the infobox

  1. August insource:/August 11, 1997\)/
  2. August insource:/-[ ]*August 11, 1997\)/
  3. August insource:/ndash;[ ]*August 11, 1997\)/

If all of this yields no results, one can also use these queries, which look for additional hits:

  1. August insource:/August 4, 2009\)/
  2. August insource:/-[ ]*August 4, 2009\)/
  3. August insource:/ndash;[ ]*August 4, 2009\)/
  4. "Died August 4, 2009)"
  5. "Died 4 August 2009)"
  6. Google the site: "died on August 4, 2009"
  7. Google the site: "died August 4, 2009"
  8. Google the site: "died on 4 August 2009"
  9. Google the site: "died 4 August 2009"
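
The first three queries all end in an escaped closing parenthesis, so they only match a date that is immediately followed by ")". A plausible reading, stated here as an assumption, is that they target the lifespan in parentheses at the start of a bio. The Python sketch below shows which of the three patterns match a few sample strings; the function name `matching_patterns` and the samples are invented for illustration, and Python's `re` module stands in for the search engine's own regular expression engine.

```python
import re

# The regular expressions from queries 1-3 above.
PATTERNS = (
    r"August 4, 2009\)",
    r"-[ ]*August 4, 2009\)",
    r"ndash;[ ]*August 4, 2009\)",
)


def matching_patterns(text: str) -> list[str]:
    """Return the patterns that occur somewhere in the given text."""
    return [p for p in PATTERNS if re.search(p, text)]


if __name__ == "__main__":
    samples = (
        "(May 6, 1957 - August 4, 2009)",        # plain hyphen
        "(May 6, 1957 &ndash; August 4, 2009)",  # HTML entity for the dash
        "The album was released on August 4, 2009 in Europe.",
    )
    for sample in samples:
        print(sample, "->", matching_patterns(sample))
```

The first pattern is the broadest; the other two narrow the hits down to a date preceded by a hyphen or by the ndash entity.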

Code excerpt

This is a small example of the VBA code in the Excel tool. The subroutine sends the request containing a specific regular expression and passes the response on for further processing.

Private Sub ProcessRegularExpression(sRegExpBase As String, sRegExpText As String, sRegExpDoD As String, ByRef iRow As Integer, eDateType As DateType)
    Dim sResponse As String
    Dim sXML As String
    Dim lPos As Long
    
    'Compose the search term: the base term, then the regular expression between F_SLASH delimiters
    sXML = sRegExpBase & F_SLASH & sRegExpText & sRegExpDoD & F_SLASH
    
    Debug.Print URL_BASE & sXML
    
    'Show progress on the worksheet and let Excel refresh the screen
    Me.Range("Result") = "Checking expression " & GetSearchTerm(sXML) & "..."
    DoEvents
    
    'Send http request and receive response
    sResponse = SendRequest(URL_BASE & sXML, sXML, True)
    
    'Walk through every occurrence of "data-serp-pos" in the response
    lPos = 0
    Do
        lPos = InStr(lPos + 1, sResponse, "data-serp-pos")
        
        If lPos > 0 Then
            Call ProcessDeceased(sResponse, lPos, iRow, sRegExpText & sRegExpDoD, eDateType)
        End If
    Loop While lPos > 0
End Sub
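
For readers without Excel, the two mechanical steps of the subroutine (composing the request URL and walking through the "data-serp-pos" markers in the response) can be sketched in Python. This is not the tool's code: the value of `URL_BASE` is not shown in the excerpt, so the sketch assumes the standard Wikipedia search page with its `search`, `ns0`, `limit` and `fulltext` parameters, and `build_search_url` and `marker_positions` are hypothetical names. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the regular Wikipedia search page; the tool's URL_BASE is not shown.
SEARCH_URL = "https://en.wikipedia.org/w/index.php"
MARKER = "data-serp-pos"


def build_search_url(query: str, limit: int = 500) -> str:
    """Compose a search URL for one query, restricted to the article namespace."""
    params = {"search": query, "ns0": 1, "limit": limit, "fulltext": 1}
    return SEARCH_URL + "?" + urlencode(params)


def marker_positions(html: str, marker: str = MARKER) -> list[int]:
    """Return the offset of every marker occurrence, like the InStr loop above."""
    positions = []
    pos = html.find(marker)
    while pos != -1:
        positions.append(pos)
        pos = html.find(marker, pos + 1)
    return positions


if __name__ == "__main__":
    print(build_search_url(r"August insource:/d-da\|4 August 2009\|/"))
    # Hypothetical response fragment with two result links.
    sample = '<a data-serp-pos="0">A</a><a data-serp-pos="1">B</a>'
    print(marker_positions(sample))
```

Unlike VBA's `InStr`, which is 1-based and returns 0 when nothing is found, `str.find` is 0-based and returns -1, which is why the loop condition differs from the original.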

References

  1. ^ The application also implemented some other result filters. They are mentioned here