[Summary: Preliminary notes on open data, privacy and AI]
At the heart of open data is the idea that when information is provided in a structured form, freely accessible, and with permission granted for anyone to re-use it, latent social and economic value within it can be unlocked.
Privacy positions assert the right of individuals to control their information and data, and data about them, and to have protection from harms that might occur through exploitation of their data.
Artificial intelligence is a field of computing concerned with equipping machines with the ability to perform tasks that many previously have required human intelligence, including recognising patterns, making judgements, and extracting and analysing semi-structured information.
Around each of these concepts vibrant (and broad based) communities exist: advocating respectively for policy to focus on openness, privacy and the transformative use of AI. At first glance, there seem to be some tensions here: openness may be cast as the opposite of privacy; or the control sought in privacy as starving artificial intelligence models of the data they could use for social good. The possibility within AI of extracting signals from messy records might appear to negate the need to construct structured public data, and as data-hungry AI draws increasingly upon proprietary data sources, the openness of data on which decisions are made may be undermined. At some points these tensions are real. But if we dig beneath surface level oppositions, we may find arguments that unite progressive segments of each distinct community – and that can add up to a more coherent contemporary narrative around data in society.
This was the focus of a panel I took part in at RightsCon in Toronto last week, curated by Laura Bacon of Omidyar Network, and discussing with Carlos Affonso Souza (ITS Rio) and Frederike Kaltheuner (Privacy International) – and the first in a series of panels due to take place over this year at a number of events. In this post I’ll reflect on five themes that emerged both from our panel discussion, and more widely from discussions I had at RightsCon. These remarks are early fragments, rather than complete notes, and I’m hoping that a number may be unpacked further in the upcoming panels.
The historic connection of open data and AI
The current ‘age of artificial intelligence’ is only the latest in a series of waves of attention the concept has had over the years. In this wave, the emphasis is firmly upon the analysis of large collections of data, predominantly proprietary data flows. But it is notable that a key thread in advocacy for open government data in the late 2000s came from Artificial Intelligence and semantic web researchers such as Prof. Nigel Shadbolt, whose Advanced Knowledge Technologies (AKT) programme was involved in many early re-use projects with UK public data, and Prof. Jim Hendler at TWC. Whilst I’m not aware of any empirical work that explores the extent to which open government data has gone on to feed into machine-learning models, in terms of bootstrapping data-hungry research, there is a connection here to be explored.
There also an argument to be made that open data advocacy, implementation and experiences over the last ten years have played an important role in contributing to growing public understandings of data, and in embedding cultural norms around seeking access to the raw data underlying decisions. Without the last decade of action on open data, we might be encountering public sector AI based purely on proprietary models, as opposed to now navigating a mixed ecology of public and private AI.
(Some) open data is getting personal
Its not uncommon to hear open data advocates state that open data only covers ‘non-personal data’. It’s certainly true that many of the datasets sought through open data policy, such as bus timetables, school rankings, national maps, weather reports and farming statistics don’t contain an personally identifying information (PII). Yet, whilst we should be able to mark a sizable teritory of the open data landscape as free from privacy concerns, there are increasingly blurred lines at points where ‘public data’ is also ‘personal data’.
In some cases, this may be due to mosaic effects: where the combination of multiple open datasets could be personally identifying. In other cases, the power to AI to extract structured data from public records about people raises interesting questions about how far permissive regimes of access and re-use around those documents should also apply to datasets derived from them. However, there are also cases where open data strategies are being applied to the creation of new datasets that directly contain personally identifying information.
In the RightsCon panel I gave the example of Beneficial Ownership data: information about the ultimate owners of companies that can be used to detect ilicit use of shell companies for money laundering or tax evasion, or that can support better due dilligence on supply chains. Transparency campaigners have called for beneficial ownership registers to be public and available as open data, citing the risk that restricted registers will be underused and will much less effective than open registers, and drawing on the idea of a social contract that means the limited liability conferred by a company comes with the responsibility to be identified as party to that company. We end up then with data that is both public (part of the public record), but also personal (containing information about identified individuals).
Privacy is not secrecy: but consent remains key
Frederike Kaltheuner kicked off our discussions of privacy on the panel by reminding us that privacy and secrecy are not the same thing. Rather, privacy is related to control: and the ability of individuals and communities to excercise rights over the presentation and use of their data. The beneficial ownership example highlights that not all personal data can or should be kept secret, as taking an ownership role in a company comes with a consequent publicity requirement. However, as Ann Cavoukian forcefully put the point in our discussions, the principle of consent remains vitally important. Individuals need to be informed enough about when and how their personal information may be shared in order to make an informed choice about entering into any relationship which requests or requires information disclosure.
When we reject a framing of privacy as secrecy, and engage with ideas of active consent, we can see, as the GDPR does, that privacy is not a binary choice, but instead involves a set of choices in granting permissions for data use and re-use. Where, as in the case of company ownership, the choice is effectively between being named in the public record vs. not taking on company ownership, it is important for us to think more widely about the factors that might make that choice trickier for some individuals or groups. For example, as Kendra Albert expained to me, for trans-people a business process that requires current and former names to be on the public record may have substantial social consequences. This highlights the need for careful thinking about data infrastructures that involve personal data, such that they can best balance social benefits and individual rights, giving a key place to mechanisms of acvice consent: and avoiding the creation of circumstances in which individuals may find themselves choosing uncomfortably between ‘the lesser of two harms’.
Is all data relational?
One of the most challenging aspects of the receny Cambridge Analytica scandal is the fact that even if individuals did not consent at any point to the use of their data by Facebook apps, there is a chance they were profiled as a result of data shared by people in their wider network. Whereas it might be relatively easy to identify the subject of a photo, and to give that individual rights of control over the use and distribution of their image, an individual ownership and rights framework is difficult can be difficult to apply to many modern datasets. Much of the data of value to AI analysis, for example, concerns the relationship between individuals, or between individuals and the state. When there are multiple parties to a dataset, each with legitimate interests in the collection and use of the data, who holds the rights to govern its re-use?
Strategies of regulation
What unites the progressive parts of the open data, privacy and AI communities? I’d argue that each has a clear recognition of the power of data, and a concern with minimising harm (albeit with a primary focus in individual harm in privacy contexts, and with the emphasis placed on wider social harms from corruption or poor service delivery by open data communities)*. As Martin Tisné has suggested, in a context where harmful abuses of data power are all around us, this common ground is worth building on. But in charting a way forward, we need to more fully unpack where there are differences of emphasis, and different preferences for regulatory strategies – produced in part by the different professional backgrounds of those playing leadership roles in each community.
(*I was going to add ‘scepticism about centralised power’ (of companies and states) to the list of common attributes across progressive privacy, open data and AI communities, but I don’t have a strong enough sense of whether this could apply in an AI context.)
In our RightsCon panel I jotted down and shared five distinct strategies that may be invoked:
- Reshaping inputs – for example, where an AI system is generated biased outputs, work can take place to make sure the inputs it recieves are more representative. This strategy essentially responds to negative outcomes from data by adding more, corrective, data.
- Regulating ownership – for example, asserting that individuals have ownership of their data, and can use ownership rights to make claims of control over that data. Ownership plays an important role in open data licensing arrangements, but runs up against the ‘relational data problem’ in many cases, where its not clear who has ownership rights.
- Regulating access – for example, creating a dataset of company ownership only available to approved actors, or keeping potentially disclosive AI training datasets from being released.
- Regulating use – for example, allowing that a beneficial ownership register is public, but ensuring that uses of the data to target individuals is strictly prohibited, and prohibitions are enforced.
- Remediating consequences – for example, recognising that harm is caused to some groups by the publicity of certain data, but judging that the net public benefit is such that the data should remain public, but the harm should be redressed by some other aspect of policy.
By digging deeper into questions of motivations, goals and strategies my sense is we will better be able to find the points where AI, privacy and open data intersect in a joint critical engagement with todays data environment.
I’m looking forward to exploring these themes more, both attending the next panel in this series at the Open Government Partnership meeting in Tblisi in July, and through the State of Open Data project.