Wikidata: the new Rosetta Stone
With more than fifteen million items compiled in the space of just three years, Wikidata is set to become the main open data repository worldwide. The eagerly awaited promise of linked open data seems to have finally arrived: a multilingual, totally open database in the public domain, which can be read and edited by both humans and machines. A lot more free information, accessible to many more people, in their own language. The structure of the Wikidata information system and the open format allows us to make complex, dynamic queries, such as: what are largest cities in the world with a female lord mayor or the number of ministers who are themselves the children of ministers, to name just two of innumerable examples. Wikidata is a new step forward in the democratisation of access to information, which is why the most important thing right now is the questions we ask ourselves: what information do we want to compile? How can we contextualise it? How does this new tool affect knowledge management?
With the introduction of the Internet, we now assume that information is just a click away. Thousands of people around the world post their creations online without expecting anything in return: guide books, manuals, photos, videos, tutorials, encyclopaedias and databases. All of it information at our fingertips. To ensure that the sum of all this knowledge reaches all human beings in their own language, free of charge, the Wikimedia Foundation runs many projects, free of charge, with one of the most successful being Wikipedia. The English version of Wikipedia reached five million entries in October 2015. But this version is culturally biased, with an over-representation of Western culture. In fact, it only includes 30% of the items entered in the other 287 languages that form part of the Wikipedia project, which now has a total of more than 34 million articles. Many of the articles that refer to a particular culture only exist in the language of that culture, as can be seen just by looking at the maps of geolocated items. There is a lot of work to be done: it is estimated that in order to cover all human knowledge, an encyclopaedia today should have over 100 million articles. Now that we know that it is possible and that everything is just a click away, we want to have the biographies of all the Hungarian writers available in a language that we understand, and we want it now. Local wiki communities around the world try to compile their own culture in their own language as best they can, but they often have limited capacity to influence the main body of the overall project. There are thousands of articles about Catalans in the Catalan version of Wikipedia, but there are not so many in the Spanish version, much less the French, and much, much less the English version. How can we disseminate our culture internationally if we’re still trying to compile it in our own language? How can we access information that is not written in any of the languages that we are fluent in? The defense of online multilinguism entails as many challenges as opportunities.
Data is beautiful. Data is information.
For this reason among many others, in 2012 the Wikimedia Foundation created Wikidata: a collaborative, multilingual database that aims to provide a common source for certain types of data such as dates of birth, coordinates, names, and authority records, managed collaboratively by volunteers around the world. This means that when a change of government occurs, for example, simply updating the corresponding element on Wikidata will automatically update all the applications that are linked to it, be it Wikipedia or any other third-party application. It means that we do not have to constantly reinvent the wheel. This collaborative model helps to reduce the effects of the existing cultural diglossia, given that small communities can have a greater global impact in a more efficient manner. In the medium term, all Wikidata queries will include data from all over the world, not just from the cultures or historical communities with greater power to influence. A search for “doctors who graduated before they turned 20”, for example, will not only display French and English doctors, but also doctors from Taiwan and Andorra.
This project opens up a whole new world of possibilities, for collaboration and for using the data: the Wikidata game allows users to make thousands of small contributions while playing, even from a mobile phone while waiting for a bus. Inventaire allows people to share their favourite books, and Histropedia offers a new way of visualising history through timelines. Meanwhile, scientists from around the world are uploading their research databases, and the cultural sector is building a database of paintings from all over the world . All of these projects run on the Wikidata engine, which is becoming a new international standard.
And why Wikidata and not some other project? Internet standards do not necessarily become accepted because of their ability to generate authority, but because of their capacity to generate traffic, or their capacity to be updated. The winner is not the best, but the one that can assemble the greatest number of users and be updated more quickly. This is one of the strengths of the Wikidata project, given that thousands of volunteers are constantly updating the information. As a result, any application or project based on big data can take advantage of all of this structured knowledge, and do so free of charge. All of this means that we have to reconsider the role that traditional agents of knowledge (universities, research centres, cultural institutions) want to play, and the role or the possible role of the repositories of authorities around the world, now that new tools are mixing and matching and creating a new centrality.
Cultural institutions, for example, have to deal with the challenge of the lack of standard matching criteria used to document artworks in their catalogues, such as for example: dimensions with frame, without frame, with or without passe-partout, descriptions in text format, number fields… institutions have to bring order to their own data at home before opening up to the world. Being open means interoperability. Many institutions are already adapting: authority file managers such as VIAF are openly collaborating with Wikidata, and the Museum of Modern Art has also started using it in its catalogue. In Catalonia, Barcelona University, in collaboration with Amical Wikimedia, is behind one of groundbreaking projects in this field, which aims to create an open database of all works of Catalan Modernism.
Data is not knowledge. Data is not objective.
Data in itself is not knowledge. It is information. With the emergence of a new, very dense ecology of data that is accessible to everybody, we run the risk of trying to over-simplify the world: a description, no matter how detailed, will not necessarily make us understand something. Knowing that Dostoyevsky was born in 1821 and died in 1881 and that he was an existentialist is not the same as understanding Dostoyevsky or existentialism. Now more than ever, we need tools that will help us to contextualise information, to develop our own point of view, and to generate knowledge based on this information, in order to promote a society with a strong critical spirit. And we shouldn’t forget that data in itself is not objective either, even though it supposedly purports to be neutral. Data selection is a bias in itself. The decision of whether or not to analyse the gender, origin, religion, height, eye colour, political position, or nationality of a human group can condition the subsequent analysis. Codifying or failing to codify a particular item of information within a data set can both inform and disguise a particular reality. Data is useless without interpretation.
The impact of the emergence of Wikipedia on traditional print encyclopaedias is common knowledge. What will be the impact of Wikidata? In line with the wiki philosophy, the work is done collaboratively in an asymmetric but ongoing process. We can all collaborate in the creation and maintenance of the content, but also of the vocabulary, of the properties of different items, and of the taxonomies used to classify the information. We are deciding how to organise existing information about the world, and we are doing it in an open, participatory manner, as an example of the potential of technology. We know that human knowledge evolves cumulatively, and that Western culture is essentially inherited. Our reality is determined, in a sense, through the technological, social, political, and philosophical advances of those who came before us. This means that today’s generations don’t have to discover electricity all over again, for example. We enjoy the fruits of the efforts of our ancestors. But the Internet, for the first time, allows us to be involved in a phenomenon that will mark human history: we are defining and generating a new information ecosystem that will become the foundation for a possible cognitive revolution. And we are lucky to be able to participate, question, and improve it as it evolves. Together, we can participate in a historic project on a par with humanity’s greatest advances. We can create a new Rosetta Stone that can serve as an open, transparent key to unlock the secrets of today’s world, and perhaps as a documentary source for future generations or civilisations. Let us take responsibility for it.
This article originally appeared on the CCCB Lab blog of the Centre de Cultura Contemporània de Barcelona and is reprinted here with the permission of the author.
Discuss this story
How old was someone, knowing that he was born in 1821 and died in 1881? Maybe 1881-1821=60 years old. But born 1821-01-01, died 1881-12-31 gives 61 years old, while born 1821-12-31, died 1881-01-01 gives 59 years old. But there are countries where the birth of a child is her first anniversary. But there are lunar years. And what remains is something between 58 and 63 years old. When someone is reported as 1821--1881, this is even worse. And therefore, the question is not about what is written in the database, but about the confidence we can give to the way the data were collected to build the database. E.g. what says Wikidata about the death of Kim Hong-do ? Pldx1 (talk) 08:25, 30 November 2015 (UTC)Reply[reply]
“doctors who graduated before they turned 20” – How would this query look like?--Kopiersperre (talk) 15:41, 30 November 2015 (UTC)Reply[reply]
Wikimedia-l discussion, Slate article
There is an ongoing discussion about Wikidata's quality issues and their wider implications on the Wikimedia-l mailing list: http://www.gossamer-threads.com/lists/wiki/foundation/654001
A key fact here is that at present, only about 20% of Wikidata content is referenced to a reliable source. About half is unreferenced, and about a third is only referenced to a Wikipedia. 
For wider context, see yesterday's article in Slate exploring the links between Wikidata and Google's Knowledge Graph: "Why Does Google Say Jerusalem Is the Capital of Israel?" Andreas JN466 15:54, 1 December 2015 (UTC)Reply[reply]
Wikidata has some way to go but has the potential to be a massive help to building and maintaining Wikipedia. For me, the biggest advantage is the ability to store information in once place that's referenced in many Wikipedia articles, and updated suddenly. The example was given of election results; I'm still finding many articles that list incorrect members of parliament or local councillors because they haven't been updated and there's no central reference of which articles contain such information. Another prime example is census data; many UK geography articles still list the population as at the 2001 census, not the (more recent) 2011 census or any of the subsequent population estimates from the Office for National Statistics.
Working through articles that find such information to update them is time consuming and mindnumbingly dull. Because we prefer to write information in prose, writing a bot to do it isn't really an option; using templates could work but would be much harder to update than Wikidata's slick user interface is. Out of date governance and demographic information is a big problem in geographical articles and Wikidata solves that problem for us; that alone is reason enough to embrace it and welcome it with open arms. Yes, it has flaws, but let's remember it's in its infancy. When someone views an article and sees a population figure that's 14 years out of date, it doesn't make us look good. So I say let's put the effort in to make WikiData work for us. WaggersTALK 11:26, 4 December 2015 (UTC)Reply[reply]