How might a Data Pledge function?

[Summary: reflections on the design of the ITU Data Pledge project]

The ITU, under its “Global Initiative on AI and Data Commons”, has launched a process to create a ‘Data Pledge’, designed as a mechanism to facilitate increased data sharing in order to support “response to humanity’s greatest challenges” and to “help support and make available data as a common global resource”.

Described as complementary to existing work such as the International Open Data Charter, the Pledge is framed as a tool to ‘collectively make data available when it matters’, with early scoping work discussing the idea of conditional pledges linked to ‘trigger events’, such that an organisation might promise to make information available specifically in a disaster context, such as the current COVID-19 pandemic. Full development of the Pledge is taking place through a set of open working groups.

This post briefly explores some of the ways in which a Data Pledge could function, and considers some of the implications of different design approaches.

[Context: I’ve participated in one working group call around the Data Pledge project in my role as Project Director of the Global Data Barometer, and this is written up in a spirit of open collaboration. I have no formal role in the Data Pledge project.]

Governments, civil society or private sector

Should a pledge be tailored specifically to one sector? Frameworks for governments to open up data are already reasonably well developed, as are mechanisms that could be used for governments to collaborate on improving standards and practices of data sharing.

However, in the private sector (and to some extent, in civil society), approaches to data sharing for the public good (whether as data philanthropy, or participation in data collaboratives) are much less developed – and are likely the place in which a new initiative could have the greatest impact.

Individual or collective action problems

PledgeBank, a mySociety project that ran from 2005 to 2015, explored the idea of pledging as a solution to collective action problems. Pledges of the form “I’ll do something, if a certain number of people will help me” are now familiar through crowdfunding sites and other online spaces. A Data Pledge could be modelled on the same logic – focussing on addressing collective action problems (a minimal sketch of the threshold logic follows the list below) where either:

  • A single firm doesn’t want to share certain data because doing so, when no-one else is, might have competitive impacts: but if a certain share of the market is sharing this data, it no longer has competitive significance, and instead its public good value can be realised.
  • The value of certain data is only realised as a result of network effects, when multiple firms are sharing similar and standardised data – but the effort of standardising and sharing data is non-negligible. In these cases, a firm might want to know that there is going to be a Social Return on Investment before putting resources into sharing the data.
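
As a minimal sketch of this threshold logic (in Python, with wholly hypothetical firms, shares and threshold), a conditional pledge only converts into action once pledged organisations together cover enough of the market:

```python
def pledges_activate(pledged_shares, threshold=0.5):
    """Return True once the combined market share of pledged firms
    reaches the (hypothetical) point at which sharing the data no
    longer has competitive significance."""
    return sum(pledged_shares.values()) >= threshold

# Hypothetical firms and market shares.
pledges = {"Firm A": 0.20, "Firm B": 0.15}
print(pledges_activate(pledges))   # False: only 35% of the market has pledged
pledges["Firm C"] = 0.25
print(pledges_activate(pledges))   # True: 60% of the market has pledged
```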

However, this does introduce some complexity into the idea of pledging (and the actions pledged) and might, as PledgeBank found, also lead to lots of unrealised potential.

Pledging can also be approached as a means of solving individual motivational problems: helping firms to overcome inertia that means they are not sharing data which could have social value. Here, a pledge is more about making a statement of intent, which garners positive attention, and which commits the firm to a course of action that should eventually result in shared data.

Both forms of pledging can function as useful signalling – highlighting data that might be available in future, and priming potential ecosystems of intermediaries and users.

An organisational or dataset-specific pledge

Should a Pledge be about a general principle of data sharing for social good? Or about sharing a specific dataset? It may be useful to think of the architecture of the Data Pledge as involving both – or at least, as optionally involving dataset-specific pledges under a general pledge to support data sharing for social good.

Think about organisational dynamics. Individual teams in a large organisation may have lots of data they could safely and appropriately share more widely for social good uses, but may not feel empowered even to start thinking about this. A high-level organisational pledge (e.g. “We commit to share data for social good whenever we can do so in ways that do not undermine privacy or commercial position”), setting out a firm’s intention to support data philanthropy, participate in data collaboratives, and provide non-competitive data as open data, could provide the backing that teams across the organisation need to take steps in that direction.

At the same time, there may be certain significant datasets and data sources that can only be shared with significant high-level leadership from the organisation, or where signalling the specific data that might be released, or the purposes it might be released for, can help address the collective action issues noted above. For these, dataset-specific pledging (e.g. “We commit to share this specific dataset for the social good in circumstance X”) can have significant value.

Triggers as required or optional

Should a pledge be structured to place emphasis on ‘trigger conditions’ for data sharing? Some articulations of the Data Pledge appear to think of it as a bank of data that could be shared in particular crisis situations. E.g. “We’ll share detailed supply chain information for affected areas if there is a disaster situation.” There are certainly datasets of value that might not be listed as a Pledge unless trigger conditions can be described, but it’s important that the design of a pledge does not present triggers as shifting all the work on data sharing to some future point. Preparing for data to be used well and responsibly in a crisis situation requires work in advance of the trigger events: aligning datasets, identifying how they might be used, and accounting carefully for possible unintended consequences that need to be mitigated.

There are also many global crises we face that are present and ongoing: the climate crisis, migration, and our collective failure to be on track against the Sustainable Development Goals.

Brokering and curating

Data is always about something, and different datasets exist within (and across) different data communities and cultures. Operationalising a pledge will involve linking actors who pledge to share data into relevant data communities, where they can understand user needs in more depth, and publish with purpose.

The architecture of a Data Pledge, and of any supporting initiative around it, will need to consider how to curate and connect the many organisations that might engage – building thematic conversations, spotting thematic spaces where a critical mass of pledges might unlock new social value, or identifying areas where there are barriers stopping pledges turning into data flows.

Incorporating context, consent and responsible data principles

Increased data sharing is not an unalloyed good. Approaching data for the public good involves balancing openness and sharing with robust principles and practices of data protection and ethics, including attention to data minimisation, individual rights, group data privacy, indigenous data sovereignty and dataset bias. Data should also be shared with clear documentation of its context, allowing an understanding of its affordances and limitations, and supporting debate over how data ecosystems can be improved in service of social justice.

A Pledge has an opportunity both to set the bar for responsible data practice, and to incentivise organisational thinking about these issues, by including terms that require pledging organisations to uphold high standards of data protection: only sharing personal data with clear informed consent, or data derived from personal data only after clear processes that consider the privacy, human rights and bias impacts of data sharing. Similarly, organisations could be asked to commit to putting their data in context when it is shared, and to engaging collaboratively with data users.

There may also be principles to incorporate here about transparency of data sharing arrangements – supporting the development of norms about publishing clearly (a) who data is shared with and for what purpose; and (b) the privacy impact assessments carried out in advance of such sharing.

Conditional on capacity?

Should pledging organisations be able to signal that they would need resources in order to make certain data available? I.e. “We have dataset X, which has a certain social value, but we can’t afford to make it available with our internal resources.” For low-resource organisations, including SMEs or organisations operating in low income economies, this could be a way to signal to philanthropic projects like data.org a need for support. But it could also be used by higher-resource organisations to put a barrier in front of data sharing. However, if a Pledge targets civil society pledgers, then allowing some way to indicate capacity needs if data is to be shared is likely to be particularly important.

A synthesis sketch

Whilst ideologically I’d favour a focus on building and governing data commons – more directly addressing the modern ‘enclosure’ of data by private firms, and not forgetting the importance of proper taxation of data-related businesses to finance the provision of public goods – if it’s viable to treat a data pledge as a pragmatic tool to increase the availability of data for social good uses, then I’d sketch the following structure (with a minimal data-structure sketch after the list):

  • Target private sector organisations
  • A three-part pledge:
    • 1. A general organisational commitment to treat data as a resource for the public good;
    • 2. A linked organisational commitment to responsible data practices whenever sharing data;
    • 3. An optional set of dataset-specific pledges, each with optional trigger conditions;
  • A platform allowing pledging organisations to profile their pledges, detail contact points for specific datasets and contact points for organisation-wide data stewards, and to connect with potential data users;
  • A programme of work to identify the pre-work needed to allow data to be effectively used if trigger conditions are met.
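
To make this concrete, here is a minimal sketch of how such a three-part pledge might be represented as data. Every field and organisation name below is my own illustrative assumption, not part of any ITU design:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetPledge:
    """Part 3: an optional dataset-specific pledge."""
    dataset: str
    contact_point: str
    trigger_condition: Optional[str] = None  # e.g. "declared disaster in an affected area"
    capacity_needs: Optional[str] = None     # support needed before sharing is feasible

@dataclass
class OrganisationalPledge:
    """Parts 1 and 2 are organisation-wide commitments; part 3 is a list."""
    organisation: str
    data_steward_contact: str
    public_good_commitment: bool = True       # 1. treat data as a public-good resource
    responsible_data_commitment: bool = True  # 2. responsible data practices when sharing
    dataset_pledges: List[DatasetPledge] = field(default_factory=list)  # 3. optional

# A hypothetical pledge, combining the general commitments with one
# trigger-conditional dataset pledge.
pledge = OrganisationalPledge(
    organisation="Example Logistics Ltd",
    data_steward_contact="data-steward@example.org",
    dataset_pledges=[DatasetPledge(
        dataset="Detailed supply chain flows",
        contact_point="ops-data@example.org",
        trigger_condition="declared disaster in an affected area",
    )],
)
```

A platform built around records like these could surface pledges whose trigger conditions have been met, and route potential data users to the right contact points.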

Inclusive AI needs inclusive data standards

[Summary: following the Bellagio Center thematic month on AI last year, I was asked to write up some brief notes on where data standards fit into contemporary debates on AI governance. The article below has just been published in the Rockefeller ‘notebook’ AI+1: Shaping our Integrated Future*]

[Image: copy of the AI+1 publication, open at this chapter]

Modern AI was hailed as bringing about ‘the end of theory’. To generate insight and action, we would no longer need to structure the questions we ask of data. Rather, with enough data, and smart enough algorithms, patterns would emerge. In this world, trained AI models would give the ‘right’ outcomes, even if we didn’t understand how they did this.

Today this theory-free approach to AI is under attack. Scholars have called out the ‘bias in, bias out’ problem of machine-learning systems, showing that biased datasets create biased models — and, by extension, biased predictions. That’s why policy makers now demand that if AI systems are used to make public decisions, their models need to be ‘explainable’, offering justifications for the predictions they make. 

Yet, a deeper problem is rarely addressed. It is not just the selection of training data, or the design of algorithms, that embeds bias and fails to represent the world we want to live in. The underlying data structures and infrastructures on which AI is founded were rarely built with AI uses in mind, and the data standards — or lack thereof — used by those datasets place hard limits on what AI can deliver. 

Questionable assumptions

From form fields for gender that only offer a binary choice, to disagreements over whether or not a company’s registration number should be a required field when applying for a government contract, data standards define the information that will be available to machine-learning systems. They set in stone hidden assumptions and taken-for-granted categories that make possible certain conclusions, while ruling others out, before the algorithm even runs. Data standards tell you what to record, and how to represent it. They embody particular world views, and shape the data that shapes decisions. 
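
As a toy illustration of the point (the field definition below is hypothetical): a standard that offers only a binary gender choice has already ruled certain conclusions out before any model is trained, because other identities can never be recorded at all.

```python
from enum import Enum

class Gender(Enum):
    """A hypothetical binary-only field definition: any identity outside
    these two codes simply cannot be recorded, so no downstream
    machine-learning model will ever see it in the data."""
    MALE = "M"
    FEMALE = "F"

# The constraint binds at data-entry time, long before any training run:
print([option.value for option in Gender])   # ['M', 'F']
```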

For corporations planning to use machine-learning models with their own data, creating a new data field or adapting available data to feed the model may be relatively easy. But for the public good uses of AI, which frequently draw on data from many independent agencies, individuals or sectors, syncing data structures is a challenging task. 

Opening up AI infrastructure

However, there is hope. A number of open data standards projects have launched since 2010. 

They include the International Aid Transparency Initiative (IATI) — which works with international aid donors to encourage them to publish project information in a common structure — and HXL, the Humanitarian eXchange Language, which offers a lightweight approach to structure spreadsheets with ‘Who, What, Where’ information from different agencies engaged in disaster response activities. 
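
To give a flavour of the HXL approach (the agencies and rows below are invented, though the hashtag convention is HXL’s own): agencies keep their existing column headings and add a row of hashtags beneath them, which tools can then use to line up columns across spreadsheets from different responders.

```python
import csv
import io

# An invented example of an HXL-tagged spreadsheet: row 1 holds the
# agency's own headers, row 2 the HXL hashtags that make columns
# interoperable across organisations.
hxl_csv = """Organisation,Cluster,Province
#org,#sector,#adm1
Agency A,Health,Coast Province
Agency B,Education,Mountain Province
"""

rows = list(csv.reader(io.StringIO(hxl_csv)))
tags, data = rows[1], rows[2:]
# Columns are located by hashtag rather than by agency-specific header.
org_column = tags.index("#org")
print([row[org_column] for row in data])   # ['Agency A', 'Agency B']
```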

When these standards work well, they allow a broad community to share data that represents their own reality, and make data interoperable with that from others. But for this to happen, standards must be designed with broad participation so that they avoid design choices that embed problematic cultural assumptions, create unequal power dynamics, or strike the wrong balance between comprehensive representation of the world and simple data preparation. Without the right balance, certain populations may drop out of the data sharing process altogether.

To use AI for the public good, we need to focus on the data substrata on which AI systems are built. This requires a primary focus on data standards, and far more inclusive standards development processes. Even if machine learning allows us to ask questions of data in new ways, we cannot shirk our responsibility to consciously design data infrastructures that make possible meaningful and socially just answers.

 

*I’ve only got print copies of the publication right now: happy to share locally in Stroud, and will update with a link to digital versions when available. Thanks to Dor Glick at Rockefeller for the invite and brief for this piece, and to Carolyn Whelan for editing.

Algorithmic systems, Wittgenstein and Ways of Life

I’m spending much of this October as a resident fellow at the Bellagio Centre in Italy, taking part in a thematic month on Artificial Intelligence (AI). Besides working on some writings about the relationship between open standards for data and the evolving AI field, I’m trying to read around the subject more widely, and learn as much as I can from my fellow residents. 

As the first of a likely series of ‘thinking aloud’ blog posts to try and capture reflections from reading and conversations, I’ve been exploring what Wittgenstein’s later language philosophy might add to conversations around AI.

Wittgenstein and technology

Wittgenstein’s philosophy of language, whilst hard to summarise in brief, might be conveyed through reference to a few of his key aphorisms. §43 of the Philosophical Investigations makes the key claim that: “For a large class of cases – though not for all – in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language.” But this does not lead to the idea that words can mean anything: rather, correct use of a word depends on its use being effective, and that in turn depends on a setting, or, as Wittgenstein terms it, a ‘language game’. In a language game, participants have come to understand the rules, even if the rules are not clearly stated or entirely legible: we engage successfully in language games through learning the techniques of participation, acquired through a mix of instruction and of practice. Our participation in these language games is linked to the idea of ‘forms of life’, or, as it is put in §241 of the Philosophical Investigations: “It is what human beings say that is true and false; and they agree in the language they use. That is not agreement in opinions but in form of life.”

As I understand it, one of the key ideas here can be expressed by stating that meaning is essentially social, and it is our behaviours and ways of acting, constrained by wider social and physical limits, that determine the ways in which meaning is made and remade.

Where does AI fit into this? Well, in Wittgenstein as a Philosopher of Technology: Tool Use, Forms of Life, Technique, and a Transcendental Argument, Coeckelbergh & Funk (2018) draw on Wittgenstein’s tool metaphors (and his professional history as an engineer as well as a philosopher) to show that we can apply a Wittgensteinian analysis to technologies, explaining that “we can only understand technologies in and from their use, that is, in technological practice which is also culture-in-practice” (p. 178). At the same time, they point to the role of technologies in constructing the physical and material constraints upon plausible forms of life:

Understanding technology, then, means understanding a form of life, and this includes technique and the use of all kinds of tools—linguistic, material, and others. Then the main question for a Wittgensteinian philosophy of technology applied to technology development and innovation is: what will the future forms of life, including new technological developments, look like, and how might this form of life be related to historical and contemporary forms of live?  [sic] (p 179)

It is important, though, to be attentive to the different properties of different kinds of tools in use (linguistic, material, technological) within any form of life. Mass digital technologies, in particular, appear to spread in less negotiable ways: that is, a new technology, whilst open to being embedded in forms of life in some subtly different ways, often has core features presented only on a take-it-or-leave-it basis, and, once introduced, can be relatively brittle and resistant to shaping by its users.

So – as new technologies are introduced, we may find that they reconfigure the social and material bounds of our current forms of life, whilst also introducing new language games, or new rules to existing games, into our social settings. And with contemporary AI technologies in particular, a number of specific concerns may arise.

AI Concerns and Critical Responses

Before we consider how AI might affect our forms of life, a few further observations (and statements of value):

  • The plural of ‘forms’ is intentional. There are variations in the forms of life lived across our planet. Social agreements in behaviour and action vary between cultural settings, regions or social strata. Many humans live between multiple forms of life, translating in word and behaviour between the different meanings each requires. Multiple forms are not strictly dichotomous: different forms of life may have many resemblances, but their distinctions matter and should be valued (this is an explicit political statement of value on my part).
  • There have been a number of social projects to establish certain universal forms of life over past centuries. The development of consensus on human rights frameworks, seeking equitable treatment of all, is one of these. (I also personally subscribe to the view that a high level of respect for universal human rights should feature as a constraint on all forms of life.)
  • Within this trend, there are also a number of significant projects seeking to establish greater acceptance of different ways of living, including action to reverse the Victorian imposition of certain normative family structures, work to afford individuals greater autonomy in defining their own identities, and activity to embed much more ecological models of thinking about human society.

These trends (or ongoing social struggles, if you like) seeking to make our ways of living more tolerant, open, inclusive and sustainable are important to note when we consider the rise of AI systems. Such systems are frequently reliant on categorised data, and on a reductive modelling of the human experience based on past, rather than prospective, data.

This noted, we might point to two distinct forms of concern about AI:

(A) The use of algorithmic systems, built on reductive data, risks ossifying past ways of life (with their many injustices), rather than supporting struggles for social justice that involve ongoing efforts to renegotiate the meaning of certain categories and behaviours.

(B) Algorithmic systems may embody particular ways of life that, because of the power that can be exercised through their pervasive operation, cause those forms of life to be imposed over others. This creates pressure for humans to adapt their ways of life to fit the machine (and its creators/owners), rather than allowing the adaptation of the machine to fit into different human ways of life.

Brief examples

Gender detection software is AI trained to judge the gender of a person from an image (or from analysing names, text or some other input). In general, such systems define gender using a male–female binary. Such systems are being widely used in research and industry. Yet at the same time as the task of judging gender is being passed from human to machine, there are increasingly present ways of life that reject the equation of gender and sex identity, and the idea of a fixed gender binary. The introduction of AI here risks the ossification of past social forms.

Predictive text tools are increasingly being embedded in e-mail and chat clients to suggest one-click automatic responses, instead of requiring the human to craft a written response. Such AI-driven features are at once a tool of great convenience, but also an imposed shift in our patterns of social interaction.

Such forms of ‘social robot’ are addressed by Coeckelbergh & Funk when they write: “These social robots become active systems for verbal communication and therefore influence human linguistic habits more than non-talking tools.” (p. 185). But note the material limitations of these robots: they can’t construct a full sentence representative of their user. Instead, they push conversation towards the quick short response, creating a pressure to change patterns of human interaction.

[Image: auto-replies suggested by Google Mail, based on a proprietary algorithm]

The examples above, suggested by Gmail for me to use in reply to a recent e-mail, might follow terms I’d often use, but push towards a form of e-mail communication that, at least in my experience, represents a particularly capitalist and functional form of life, in which speed of communication is of the essence, rather than social communication and the exploration of ideas.

Reflections and responses

Wittgenstein was not a social commentator, but it is possible to draw upon his ideas to move beyond conversations about AI bias, to look at how the widespread introduction of algorithmic and machine-learning driven systems may interact with different contemporary forms of living.

I’m always interested, though, in the critical leading to the practical, and so below I’ve started to sketch out possible responses the analysis above leads me to consider. I also strongly suspect that these responses, and the justification for them, can be elaborated much more directly and accessibly without getting here via Wittgenstein. Writing that may be a task for later, but as I came here via the Wittgensteinian route, I’ll stick with it.

(1) Find better categories

If we want future algorithmic systems to represent the forms of life we want to live, not just those lived in the past, or imposed upon populations, we need to focus on the categories and data structures used to describe the world and train machine-learning systems.

The question of when we can develop global categories whose meaning is ‘good enough’ in terms of alignment in use across different settings, and when it is important to have systems that can accommodate more localised categorisations, is one that requires detailed work, and that is inherently political.
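
As one hedged sketch of what accommodating both global and localised categories might look like in a data structure (the field names are my own illustration, loosely inspired by the way standards such as IATI allow alternative code-list vocabularies):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CategoryValue:
    """A categorical field that records which code list a value comes from,
    so locally meaningful categories can coexist with a shared global list
    rather than being forced into it (illustrative only)."""
    code: str
    vocabulary: str = "global"         # "global" or an identifier for a local scheme
    local_label: Optional[str] = None  # human-readable label for local codes

# A value from the shared global list, and one from a hypothetical local scheme:
crop_a = CategoryValue(code="maize")
crop_b = CategoryValue(code="x-017", vocabulary="region-z", local_label="enset")
```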

(2) Build a better machine

Some objections to particular instances of AI may arise because the technology is, ultimately, too blunt in its current form. Would my objection to predictive text tools be the same if they could express more complete sentences, more in line with the way I want to communicate? For many critiques of algorithmic systems, there may be a plausible response suggesting that a better designed or trained system could address the problem raised.

I’m sceptical, however, of whether most current instantiations of machine-learning can be adaptable enough to different forms of life: not least on the grounds that for some ways of living the sample size may be too small to gather enough data points to construct a good model, or the collection of the data required may be too expensive or intrusive for theoretical possibilities of highly adaptive machine-learning systems to be practically feasible or desirable.

(3) Strategic rejection

Recognising the economic and political power embedded in certain AI implementations, and the particular forms of life they embody, may help us to see technologies we want to reject outright. If a certain tool makes moves in a language game that are at odds with the game we want to be playing, and only gains agreement of action through its imposition, then perhaps we should not admit it at all.

To put that more bluntly (and bringing in my own political stance): certain AI tools embody a late-capitalist form of life, rooted in the cultures and practices of a small stratum of Silicon Valley. Such tools should have no place in shaping other ways of life, and should be rejected not because they are biased, or because they have not adequately considered issues of privacy, but simply because the form of life they replicate undermines both equality and ecology.

Where next

Over my time here at Bellagio, I’ll be particularly focussed on the first of these responses – seeking better categories, and understanding how processes of standardisation interact with AI. My goal is to do that with more narrative, and less abstraction, but we shall see…