Big data ethics

From Wikipedia, the free encyclopedia

Big data ethics, also known simply as data ethics, refers to systemizing, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data.[1] Since the dawn of the Internet, the sheer quantity and quality of data have increased dramatically and continue to do so exponentially. Big data describes data so voluminous and complex that traditional data processing application software is inadequate to deal with it. Recent innovations in medical research and healthcare, such as high-throughput genome sequencing, high-resolution imaging, electronic medical patient records and a plethora of internet-connected health devices, have triggered a data deluge that will reach the exabyte range in the near future. Data ethics becomes more relevant as the quantity of data grows, because the scale of its impact grows with it.

Big data ethics differs from information ethics: information ethics is more concerned with issues of intellectual property and with concerns relating to librarians, archivists, and information professionals, while big data ethics is more concerned with collectors and disseminators of structured or unstructured data, such as data brokers, governments, and large corporations.


Data ethics is concerned with the following principles:

  1. Ownership - Individuals own their own data.
  2. Transaction transparency - If an individual's personal data is used, they should have transparent access to the algorithm design used to generate aggregate data sets.
  3. Consent - If an individual or legal entity would like to use personal data, it needs the informed and explicitly expressed consent of the owner of the data, specifying what personal data moves to whom, when, and for what purpose.
  4. Privacy - If data transactions occur, all reasonable effort needs to be made to preserve privacy.
  5. Currency - Individuals should be aware of financial transactions resulting from the use of their personal data and of the scale of these transactions.
  6. Openness - Aggregate data sets should be freely available.

Ownership


Who owns data? Ownership involves determining rights and duties over property. The concept of data ownership is linked to one's ability to exercise control over, and limit the sharing of, one's own data. If one person records their observations of another person, who owns those observations: the observer or the observed? What responsibilities do the observer and the observed have in relation to each other? Because the Internet has enabled the observation of people and their thoughts on a massive and systematic scale, these questions are increasingly important to address. Slavery, the ownership of a person, is outlawed in all recognised countries. The question of personal data ownership falls into unknown territory between corporate ownership, intellectual property, and slavery. Who owns a digital identity?

European law, in particular the General Data Protection Regulation, indicates that individuals own their own personal data.[2]

Personal data refers to data sets describing a person, ranging from physical attributes to preferences and behaviour. Examples of personal data include genome data, GPS location, written communication, spoken communication, lists of contacts, internet browsing habits, financial transactions, supermarket spending, tax payments, criminal records, laptop and mobile phone camera recordings, device microphone recordings, driving habits via car trackers, mobile and health records, fitness activity, nutrition, substance use, heartbeat, sleep patterns and other vital signs. The collection of one individual's personal data forms a digital identity (or perhaps digital alter ego is more fitting). A digital identity encompasses all of a person's data, shadowing, representing and connected to their physical and ideological self. The distinction between data categories is not always clear-cut. For example, health data and banking data are intertwined: behaviour and lifestyle can be inferred from banking data, which is hugely valuable for predicting the risk of chronic disease, so banking data is also health data. Conversely, health data can indicate how much an individual spends on healthcare, so health data is also banking data. Such overlaps exist between other data categories too; location data, Internet browsing data and tax data, for example, are all essentially about individuals.

The protection of the moral rights of an individual is based on the view that personal data is a direct expression of the individual's personality: the moral rights are therefore personal to the individual, and cannot be transferred to another person except by testament when the individual dies. Moral rights include the right to be identified as the source of the data and the right to object to any distortion or mutilation of the data which would be prejudicial to his or her honour or reputation. These moral rights to personal data are perpetual.

A key component of personal data ownership is unique and controlled access, i.e. exclusivity. Ownership implies exclusivity, particularly with abstract concepts like ideas or data points. It is not enough simply to have a copy of one's own data; others should be restricted in their access to what is not theirs. Knowing what data others keep is a near-impossible task, so a simpler approach is to cloak oneself in nonsense information. Where one cannot ensure that corporations or institutions hold no copy, it is possible to send noise that confuses the data they do hold. For example, a bot could issue random searches for commonly used terms, making the data obtained by the search engine useless through confusion (see TrackMeNot, developed at New York University).
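The noise approach can be illustrated with a minimal Python sketch. It is not the TrackMeNot implementation; the term list `COMMON_TERMS` and the function `mix_with_decoys` are hypothetical names introduced here only to show the idea of hiding genuine queries among randomly chosen decoys.

```python
import random

# Hypothetical pool of commonly used search terms (an assumption for this sketch);
# a real tool would draw on a much larger and changing vocabulary.
COMMON_TERMS = [
    "weather", "news", "recipes", "football scores",
    "train times", "cheap flights", "film reviews", "bank holidays",
]


def mix_with_decoys(real_queries, decoys_per_query=3, rng=None):
    """Return the real queries shuffled together with random decoy queries."""
    rng = rng or random.Random()
    stream = []
    for query in real_queries:
        stream.append(query)
        # Add a fixed number of decoys for every genuine query.
        stream.extend(rng.choice(COMMON_TERMS) for _ in range(decoys_per_query))
    # Shuffle so that position in the stream does not reveal the real queries.
    rng.shuffle(stream)
    return stream


if __name__ == "__main__":
    for query in mix_with_decoys(["knee pain treatment"], decoys_per_query=3):
        print(query)
```

An observer who sees only the resulting stream cannot tell from the stream alone which of the queries reflects a genuine interest, although timing and other signals, which this sketch ignores, might still give it away.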

Ownership puts emphasis on the ability to conveniently move data from one service to another, i.e. portability. When personal data is owned by the individual, they have the option to simply remove it and take it to another site if they become dissatisfied with the service. Individuals should be offered a high degree of convenient portability, allowing them to switch to alternatives without losing historic data collections describing product preferences and personal conversations. For example, one may choose to switch to an alternative messaging app, and this should be possible without losing the record of previous conversations and contacts. Giving individuals the option to switch services without the inconvenience of losing historical data means that services need to keep customers happy by providing good service rather than locking them in through incompatibility with alternatives.

For portability, data expression must be standardised so that transfers can happen seamlessly. For example, if one system describes a unit as "kilograms" and another as "kg", software treats them as different, although they are the same. These small variations result in messy data that cannot easily be combined or transferred into a new system that does not recognise them. Currently, Apple states that it provides privacy services; however, it is difficult to extract data from Apple systems, which makes it difficult to migrate to an alternative. In the personal data trading framework, data expression would be standardised for easy portability at the click of a button. Standardisation would also make it easier to set up mechanisms for cleaning data, which are needed for checks and balances that validate data quality. By joining multiple sources, one would be able to identify erroneous or falsely entered data.
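As a minimal sketch of what standardisation and cross-source checking could look like in practice, the Python example below maps different spellings of a unit onto one canonical form and then compares two sources. The alias table, the function names and the tolerance value are assumptions made for illustration, not part of any existing standard.

```python
# Hypothetical alias table (an assumption for this sketch): each spelling maps
# to a canonical unit symbol and a factor converting the value into that unit.
UNIT_ALIASES = {
    "kg": ("kg", 1.0),
    "kilogram": ("kg", 1.0),
    "kilograms": ("kg", 1.0),
    "g": ("kg", 0.001),
    "gram": ("kg", 0.001),
    "grams": ("kg", 0.001),
}


def normalise(value, unit):
    """Convert a measurement to its canonical unit, e.g. (70, 'Kilograms') -> (70.0, 'kg')."""
    key = unit.strip().lower()
    if key not in UNIT_ALIASES:
        raise ValueError(f"unknown unit: {unit!r}")
    canonical, factor = UNIT_ALIASES[key]
    return value * factor, canonical


def find_conflicts(source_a, source_b, tolerance=0.5):
    """Return the record ids whose values differ between two sources by more than the tolerance.

    Both arguments are dicts mapping a record id to a value in canonical units.
    """
    conflicts = []
    for record_id in sorted(source_a.keys() & source_b.keys()):
        if abs(source_a[record_id] - source_b[record_id]) > tolerance:
            conflicts.append(record_id)
    return conflicts


if __name__ == "__main__":
    print(normalise(70, "Kilograms"))
    clinic = {"2024-01-01": 70.0, "2024-01-02": 70.2}
    fitness_app = {"2024-01-01": 70.1, "2024-01-02": 81.0}
    print(find_conflicts(clinic, fitness_app))
```

Once both sources use the same unit, the second entry stands out as a likely data-entry error, which is the kind of check that joining multiple sources makes possible.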

Who owns data today? Today data is controlled, and therefore in effect owned, by the owner of the sensors. The individual making the recording or the entity owning the sensor controls what happens to that data by default. For example, banks control banking data, researchers control research data, and hospitals control health record data. For historical reasons, research institutions each hold fragments of data describing only part of an individual. Health research data in Europe exists in a fragmented manner, controlled by different institutions. Data categories often say more about who controls the data and where it is stored than about what the data describes or what it could be applied to. While the Internet is not owned by anyone, corporations have come to control much of the personal data, creating value by making use of data collection, search engines and communication tools.[3] By default, as a side effect of owning the intellectual property that makes up Internet tools, these corporations have been collecting our digital identities as raw material for services delivered to other companies at a profit. Most of the data collected via Internet services is personal data describing individuals. Traditionally, medicine organises data around the individual because doing so enables an understanding of health. In epidemiology, the data of groups is still organised around the individual. Many of the processes that are being made more efficient concern individuals and group dynamics. However, data is not necessarily organised around the individual; rather, it is controlled by the owner of the sensors.

In China, the government largely owns data. In one Chinese province, data was used to generate a social index score for each person based on online and offline behaviour, such as jaywalking and the amount of toilet paper used in a public lavatory. The social index determines access to particular public services.

Transaction transparency

Concerns have been raised around how biases can be integrated into algorithm design resulting in systematic oppression.[4] The algorithm design should be transparently disclosed. All reasonable efforts should be made to take into account the differences between individuals and groups, without losing sight of equality. Algorithm design needs to be inclusive.

In terms of governance, big data ethics is concerned with which types of inferences and predictions should be made using big data technologies such as algorithms.[5]

Anticipatory governance is the practice of using predictive analytics to assess possible future behaviours.[6] This has ethical implications because it affords the ability to target particular groups and places, which can encourage prejudice and discrimination.[6] For example, predictive policing highlights certain groups or neighbourhoods to be watched more closely than others, which leads to more sanctions in these areas and closer surveillance of those who fit the same profiles as those who are sanctioned.[3]

The term "control creep" refers to data that has been generated with a particular purpose in mind but which is repurposed.[6] This practice is seen with airline industry data which has been repurposed for profiling and managing security risks at airports.[6]

In regard to personal data, the individual has the right to know:

  1. Why the data is being collected.
  2. How it is going to be used.
  3. How long it will be stored.
  4. How it can be amended by the individual concerned.

Examples of ethical uses of data transaction include:

  • Statutory purposes: All collection and use of personal data by the state should be completely transparent and covered by a formal license negotiated prior to any data collection. This civil contract between the individual and the responsible authorities sets out the conditions under which the individual licenses the use of their data to the responsible authorities, in accordance with the above transparency principles.
  • Social purposes: All uses of individual data for social purposes should be opt-in, not opt-out. They should comply with the transparency principles.
  • Crime: For crime prevention, an explicit set of general principles for the harvesting and use of personal data should be established and widely publicised. The governing body of the state should consider and approve these principles.
  • Commerce: Personal data used for commercial purposes belongs to the individual and may not be used without a license from the individual setting out all permitted uses. This includes data collected from all websites, page visits, transfers from site to site, and other Internet activity. Individuals have the right to decide whether, how and where their personal data is used for commercial purposes, on a case-by-case or category basis.
  • Research: Personal data used for research purposes belongs to the individual and must be licensed from the individual under the terms of a personal consent form which fulfils all the transparency principles outlined above.
  • Extra-legal purposes: Personal data can only be used for extra-legal purposes with the explicit prior consent of the rights holder.

Consent


If an individual or legal entity would like to use personal data, one needs informed and explicitly expressed consent of what personal data moves to whom, when, and for what purpose from the subject of the data. The subject of the information has the right to know how their data has been used.

The data transaction cannot be used as a bargaining chip for an unrelated or superfluous issue of consent, for example improving marketing recommendations while the individual is trying to contact a relative. While there are services for which data sharing is needed, these transactions should not be exaggerated and should be kept within context. For example, an individual needs to share data to receive adequate medical recommendations; however, that medical data does not automatically need to go to a health insurance provider. These are separate data transactions and should be dealt with as such. It ultimately comes down to the individual to make the decision about their data. Implied consent to the transfer of data ownership simply because a chat application is used is not considered valid.

The full scope and extent of the transaction needs to be explicitly detailed to the individual, who has to be given a reasonable opportunity to evaluate whether they would like to engage. Timing is critical, i.e. these issues should be dealt with in a calm moment with time to reflect, not at the moment an urgent purchase is being made or a medical emergency is occurring.

The permission needs to be given in a format that is explicit, not implied. Just because an application has been chosen for chatting does not mean that access to a list of contacts is needed. The button that is clicked to give permission should not be designed in such a way that the automatic behaviour is opting in. Examples of such design in a binary choice include making one button smaller than the other, hiding one button in the design while the other jumps out, or requiring multiple clicks for one option while the other needs a single click.

While a person could give continuous consent on a general topic, it should always be possible to retract that permission for future transactions. As with consent for sexual activity, consent cannot be retracted retroactively for data transactions that have already taken place. For example, an individual could give consent for their personal data to be used for any cause advancing the treatment of cardiovascular disease until further notice. Until the individual changes their mind, these transactions can continue to occur seamlessly without their involvement.
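The consent rules described in this section (what data moves to whom, for what purpose, with retraction applying only to future transactions) could be recorded along the following lines. This is a hedged Python sketch: the `Consent` and `ConsentLedger` classes and their field names are hypothetical and not taken from any existing consent-management system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Consent:
    """One explicit permission: what data may move to whom and for what purpose."""
    data_category: str
    recipient: str
    purpose: str
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    def permits(self, data_category, recipient, purpose, when):
        # Consent is specific: category, recipient and purpose must all match.
        if (data_category, recipient, purpose) != (
            self.data_category, self.recipient, self.purpose
        ):
            return False
        if when < self.granted_at:
            return False
        # Revocation only blocks transactions at or after the moment of revocation.
        return self.revoked_at is None or when < self.revoked_at


@dataclass
class ConsentLedger:
    consents: List[Consent] = field(default_factory=list)

    def grant(self, data_category, recipient, purpose, when):
        self.consents.append(Consent(data_category, recipient, purpose, when))

    def revoke(self, data_category, recipient, purpose, when):
        for consent in self.consents:
            if consent.revoked_at is None and consent.permits(
                data_category, recipient, purpose, when
            ):
                consent.revoked_at = when

    def is_permitted(self, data_category, recipient, purpose, when):
        return any(
            c.permits(data_category, recipient, purpose, when) for c in self.consents
        )


if __name__ == "__main__":
    ledger = ConsentLedger()
    ledger.grant("heartbeat", "cardiology research group", "cardiovascular research",
                 datetime(2024, 1, 1))
    ledger.revoke("heartbeat", "cardiology research group", "cardiovascular research",
                  datetime(2024, 6, 1))
    print(ledger.is_permitted("heartbeat", "cardiology research group",
                              "cardiovascular research", datetime(2024, 3, 1)))
    print(ledger.is_permitted("heartbeat", "cardiology research group",
                              "cardiovascular research", datetime(2024, 7, 1)))
```

In this design a transaction dated before the revocation remains valid, while any later transaction, or any request for a different purpose or recipient, is refused.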

Dynamic consent in the context of health and genomic research might provide a more appropriate consent approach than once-off or broad informed consent, in terms of the issues outlined above.

Privacy


If data transactions occur, all reasonable effort needs to be made to preserve privacy.

"No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks." - United Nations Universal Declaration of Human Rights, Article 12.

Why does privacy matter? Data is useful to make systems more efficient; however, defining the end goal of this efficiency is essential in assessing how ethical data usage is.

The use of data monitoring by government to observe citizens needs explicit authorization through an appropriate judicial process. It might even be more efficient to observe the relatively small number of criminals manually than to track the relatively large population. Blanket observation of inhabitants by national governments and corporations is a slippery slope to an Orwellian style of governance. Privacy is not about keeping secrets; it is about choice, human rights, freedom, and liberty. For example, sharing medical data with a doctor on the understanding that it will be used to improve health is ethically sound, even when the doctor reveals that data to another doctor. However, when that same data is shared with a marketing agency, or with a technology company as happened with the British National Health Service and Google's DeepMind artificial intelligence company, the ethical implications are more uncertain (see "Google DeepMind and healthcare in an age of algorithms" by Julia Powles and Hal Hodson). Privacy is about choosing the context: what data is shared, with whom, for which purpose, and when. Privacy is currently not being implemented, possibly because the personal power and wealth gained from not doing so act as a disincentive for both private companies and governments. Also, using data to measure actual social impact could reveal inefficiency, which would be inconvenient for the politicians involved or for companies' claims.

The public debate on privacy is often unfairly reduced to an over-simplistic binary choice between privacy and scientific progress. Marketing campaigns have even dismissed critics of centralized data collection as resisting progress and holding on to the past. However, the benefits of scientific progress through data can be achieved in a manner consistent with privacy values, as has historically been the case in epidemiological research. Extracting value from data without compromising identity privacy is certainly possible technologically, e.g. by using homomorphic encryption and algorithmic design which makes reverse engineering difficult.

Homomorphic encryption allows different services to be chained together without exposing the data to each of them. Even the software engineers working on the software would not be able to override the user. Homomorphic encryption schemes are malleable by design, meaning they can be used in a cloud computing environment while ensuring the confidentiality of the processed data. The technique allows analytical computations to be carried out on ciphertext, generating encrypted results which, when decrypted, match the results of the same operations performed on the plaintext.
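To make the idea of computing on ciphertext concrete, the following Python sketch implements a toy version of the Paillier cryptosystem, one well-known additively homomorphic scheme chosen here purely as an illustration. The primes are tiny and hard-coded, so this is an assumption-laden teaching example and offers no real security; it requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under the public key n (with g = n + 1)."""
    n_squared = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts yields an encryption of the sum of the plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


if __name__ == "__main__":
    n, private_key = keygen()
    # A service that only sees ciphertexts can still add the two values.
    encrypted_total = add_encrypted(n, encrypt(n, 12), encrypt(n, 30))
    print(decrypt(n, private_key, encrypted_total))
```

The party performing `add_encrypted` never sees 12, 30 or their sum; only the holder of the private key can decrypt the total, which matches the result of adding the plaintexts directly.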

The results of analytics can be presented in such a way as to be fit for purpose without compromising identity privacy. For example, a data sale stating that "20% of Amsterdam eats muesli for breakfast" would transmit the analytical value of the data without compromising privacy, whereas saying that "Ana eats muesli for breakfast" would not maintain privacy. Algorithmic design and the size of the sample group are critical to minimizing the capacity to reverse engineer statistics and track targeted individuals. One technical countermeasure against reverse engineering of aggregate metrics is to introduce fake data points about made-up people which do not alter the end result, for example the percentage of a group that eats muesli.
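One simple way to respect the sample-size concern is to refuse to publish a statistic when the group is too small. The Python sketch below shows this under stated assumptions: the function name `share_percentage` and the threshold of 50 are hypothetical, and a minimum group size alone does not guarantee that individuals cannot be re-identified.

```python
def share_percentage(records, predicate, min_group_size=50):
    """Return the percentage of records matching the predicate.

    Returns None when the group is smaller than min_group_size, because a
    statistic over very few people can reveal information about individuals.
    """
    total = len(records)
    if total < min_group_size:
        return None
    matching = sum(1 for record in records if predicate(record))
    return round(100.0 * matching / total, 1)


if __name__ == "__main__":
    # Synthetic records, generated only for this example.
    residents = [
        {"name": f"person-{i}", "breakfast": "muesli" if i % 5 == 0 else "toast"}
        for i in range(100)
    ]

    def eats_muesli(record):
        return record["breakfast"] == "muesli"

    print(share_percentage(residents, eats_muesli))       # aggregate over 100 people
    print(share_percentage(residents[:10], eats_muesli))  # too few people: withheld
```

Only the aggregate figure leaves the function; names and individual breakfast choices stay with the data holder.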

Privacy has been presented as a limitation on data usage, and such a limitation could itself be considered unethical.[7] For example, the sharing of healthcare data can shed light on the causes of diseases and the effects of treatments, and can allow for tailored analyses based on individuals' needs.[7] This is of ethical significance in the big data ethics field because, while many value privacy, the affordances of data sharing are also quite valuable, although they may contradict one's conception of privacy. Attitudes against data sharing may be based on a perceived loss of control over data and a fear of the exploitation of personal data.[7] However, it is possible to extract the value of data without compromising privacy.

Some scholars, such as Jonathan H. King and Neil M. Richards, are redefining the traditional meaning of privacy, while others question whether privacy still exists.[5] In a 2014 article for the Wake Forest Law Review, Richards and King argue that privacy in the digital age can be understood not in terms of secrecy but in terms of regulations which govern and control the use of personal information.[5] In the European Union, the Right to be Forgotten entitles EU countries to force the removal or de-linking of personal data from databases at an individual's request if the information is deemed irrelevant or out of date.[8] According to Andrew Hoskins, this law demonstrates the moral panic of EU members over the perceived loss of privacy and of the ability to govern personal data in the digital age.[9] In the United States, citizens have the right to delete voluntarily submitted data.[8] This is very different from the Right to be Forgotten because much of the data produced using big data technologies and platforms is not voluntarily submitted.[8]

Currency


The business models driving tech giants have uncovered the possibility of making the human identity the product to be consumed. While the tech services including search engines, communication channels and maps are provided for free, the new currency that has been uncovered in the process is personal data.

There is a variety of opinion about whether it is ethical to receive money in exchange for access to personal data. Parallels have been drawn with blood donation, where the rate of infectious blood donated decreases when the donor receives no payment. Additional questions arise around who should receive the profit from a data transaction.

How Much is Data Worth?

What is the exchange rate of personal data to money? Data is valuable because it allows users to act more efficiently than when they are guessing or operating by trial and error. Two aspects of data have value: trends and real-time availability. The build-up of historical data makes it possible to predict the future based on trends. Real-time data has value because action can be taken instantaneously.

How much are tech services such as a search engine, a communications channel and a digital map actually worth, for example in dollars? The difference between the value of the services provided by tech companies and the equity value of those companies reflects the gap between the exchange rate offered to the citizen and the 'market rate' of their data. Scientifically, there are many holes to be picked in this rudimentary calculation: the financial figures of tax-evading companies are unreliable; it is unclear whether revenue or profit would be more appropriate; it is unclear how a user should be defined; a large number of individuals is needed for the data to be valuable; prices might be tiered for different people in different countries; and not all Google revenue comes from Gmail. Although these calculations are undeniably crude, the exercise serves to make the monetary value of data more tangible. Another approach is to look at data trading rates on the black market. RSA publishes a yearly cybersecurity shopping list that takes this approach.[10] The examples given cover only specific cases, but if profits from data sales were extended to other areas such as healthcare, the monthly profit per individual would increase.

This raises the economic question of whether receiving free tech services in exchange for personal data is a worthwhile implicit exchange for the consumer. In the personal data trading model, rather than companies selling data, an owner can sell their personal data and keep the profit.[11] Personal data trading is a framework that gives individuals the ability to own their digital identity and create granular data sharing agreements via the Internet. Rather than the current model, which tolerates companies selling personal data for profit, in personal data trading individual human beings would directly own and consciously sell their personal data to known parties of their choice and keep the profit. At the core is an effort to re-decentralise the Internet. Personal data trading adds a fourth mechanism for wealth distribution, the other three being salaries via jobs, property ownership, and company ownership. The ultimate goals of the personal data trading model are a more equitable global distribution of resources and a more balanced say in their allocation. Personal data trading by individuals in the proposed framework would distribute profits amongst the population, but could also have radical consequences for societal power structures. It is now widely acknowledged that the current centralised data design exacerbates ideological echo chambers and has far-reaching implications for seemingly unrelated decision-making processes such as elections. The data exchange rate is not only monetary; it is also ideological. Do institutional processes have to be compromised by the centralised use of communication tools guided by freely harvested personal data?

While it is realistic to assume that data would initially be traded for money, it is possible to imagine a future where data would be traded for data. The "I'll show you mine if you show me yours" scenario could replace money altogether. Importantly, this is a future scenario, and the first step is to focus on exchanging personal data for existing monetary currency.

Openness


The idea of open data is centred around the argument that data should be freely available and should not have restrictions that would prohibit its use, such as copyright laws. As of 2014, many governments had begun to move towards publishing open datasets for the purpose of transparency and accountability.[12] This movement has gained traction via "open data activists" who have called for governments to make datasets available so that citizens can extract meaning from the data and perform checks and balances themselves.[12][5] King and Richards have argued that this call for transparency includes a tension between openness and secrecy.[5]

Activists and scholars have also argued that because this open-sourced model of data evaluation is based on voluntary participation, the availability of open datasets has a democratizing effect on a society, allowing any citizen to participate.[13] To some, the availability of certain types of data is seen as a right and an essential part of a citizen's agency.[13]

The Open Knowledge Foundation (OKF) lists several dataset types that should be provided by governments in order for them to truly be open.[14] The OKF has a tool called the Global Open Data Index (GODI), a crowd-sourced survey for measuring the openness of governments[14] according to the Open Definition. The aim of the GODI is to give governments important feedback about the quality of their open datasets.[15]

Willingness to share data varies from person to person. Preliminary studies have been conducted into the determinants of the willingness to share data. For example, some have suggested that baby boomers are less willing to share data than millennials.[16]

The role of institutions

Nation states

Data sovereignty refers to a government's control over the data that is generated and collected within a country.[17] The issue of data sovereignty was heightened when Edward Snowden leaked US government information about a number of governments and individuals whom the US government was spying on.[17] This prompted many governments to reconsider their approach to data sovereignty and the security of their citizens' data.[17]

J. De Jong-Chen points out how the restriction of data flow can hinder scientific discovery, to the disadvantage of many, particularly developing countries.[17] This is of considerable concern to big data ethics because of the tension between the two important issues of cybersecurity and global development.

Banks


Banks hold a position in society as keepers of value. Their data policy should not compromise the relationship of trust with their clients. For example, if a bank shares data about one butcher with another butcher, this could compromise that trust by revealing data to a competitor.

Relevant news items about data ethics

The Edward Snowden revelations, which began on June 5, 2013, marked a turning point in the public debate on data ethics. The ongoing publication of leaked documents has revealed previously unknown details of a global surveillance apparatus run by the United States NSA in close cooperation with three of its Five Eyes partners: Australia's ASD, the UK's GCHQ, and Canada's CSEC.

In the Netherlands, ING Bank made a public statement about their intentions around data usage.

The Facebook-Cambridge Analytica data scandal involves the collection of the personal data of up to 87 million Facebook users, and possibly more, in an attempt to influence voter opinion. Campaigners in the 2016 Brexit vote and the 2015/16 campaigns of US politicians Donald Trump and Ted Cruz paid Cambridge Analytica to use information from the data breach to influence voter opinion.

Relevant legislation about data ethics

On 26 October 2001 the Patriot Act came into force in the US, in response to the broad concern felt among Americans following the September 11 attacks. Broadly speaking, the Patriot Act paved the way for security forces to surveil citizens suspected of involvement in terrorist acts.

On 25 May 2018 the General Data Protection Regulation 2016/679 (GDPR) came into effect across the European Union. GDPR addresses issues of transparency from data controllers towards individuals, referred to as data subjects, and a need for permission from data subjects to handle their personal data.

See also


  1. ^ Kitchin, Rob (August 18, 2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE. p. 27. ISBN 9781473908253.
  2. ^ van Ooijen, I.; Vrabec, Helena U. (December 11, 2018). "Does the GDPR Enhance Consumers' Control over Personal Data? An Analysis from a Behavioural Perspective". Journal of Consumer Policy. 42 (1): 91–107. doi:10.1007/s10603-018-9399-7. ISSN 0168-7034. S2CID 158945891.
  3. ^ a b Zwitter, A. (2014). "Big Data Ethics". Big Data & Society. 1 (2): 4. doi:10.1177/2053951714559253.
  4. ^ O'Neil, Cathy (2016). Weapons of Math Destruction. Crown Books. ISBN 978-0553418811.
  5. ^ a b c d e Richards, N. M.; King, J. H. (2014). "Big data ethics". Wake Forest Law Review. 49: 393–432. SSRN 2384174.
  6. ^ a b c d Kitchin, Rob (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE Publications. pp. 178–179.
  7. ^ a b c Kostkova, Patty; Brewer, Helen; de Lusignan, Simon; Fottrell, Edward; Goldacre, Ben; Hart, Graham; Koczan, Phil; Knight, Peter; Marsolier, Corinne; McKendry, Rachel A.; Ross, Emma; Sasse, Angela; Sullivan, Ralph; Chaytor, Sarah; Stevenson, Olivia; Velho, Raquel; Tooke, John (February 17, 2016). "Who Owns the Data? Open Data for Healthcare". Frontiers in Public Health. 4: 7. doi:10.3389/fpubh.2016.00007. PMC 4756607. PMID 26925395.
  8. ^ a b c Walker, R. K. (2012). "The Right to be Forgotten". Hastings Law Journal. 64: 257–261.
  9. ^ Hoskins, Andrew (November 4, 2014). "Digital Memory Studies". Retrieved November 28, 2017.
  10. ^ RSA (2018). "2018 Cybersecurity Shopping List" (PDF).
  11. ^ László, Mitzi (November 1, 2017). "Personal Data trading Application to the New Shape Prize of the Global Challenges Foundation". online: Global Challenges Foundation. p. 27. Archived from the original on June 20, 2018. Retrieved June 20, 2018.
  12. ^ a b Kalin, Ian (2014). "Open Data Policy Improves Democracy". SAIS Review of International Affairs. 34 (1): 59–70. doi:10.1353/sais.2014.0006. S2CID 154068669.
  13. ^ a b Baack, Stefan (December 27, 2015). "Datafication and empowerment: How the open data movement re-articulates notions of democracy, participation, and journalism". Big Data & Society. 2 (2): 205395171559463. doi:10.1177/2053951715594634. S2CID 55542891.
  14. ^ a b Open Knowledge. "Methodology - Global Open Data Index". Retrieved November 23, 2017.
  15. ^ Open Knowledge. "About - Global Open Data Index". Retrieved November 23, 2017.
  16. ^ Emerce. "Babyboomers willen gegevens niet delen". Retrieved May 12, 2016.
  17. ^ a b c d de Jong-Chen, J. (2015). "Data Sovereignty, Cybersecurity, and Challenges for Globalization". Georgetown Journal of International Affairs: 112–115. ProQuest 1832800533.