Rhodes must fall

17 years ago I was an undergraduate at Oriel College, Oxford. I lived for my first year in the ‘Rhodes building’ – not many metres from the statue of Cecil Rhodes that adorns the front of the building.

The only narrative of Rhodes I recall from that time, was one of the college’s proud connection to its alumni and benefactor. To my shame, whilst with student campaigners I was active against contemporary donations to the University that appeared to buy naming rights and launder the reputations of questionable modern day donors – I left unexplored how the ongoing honouring of past donors had allowed them to ‘buy’ a ‘controversial reputation instead of the condemnation their actions deserve. Nor did I consider then how the memorialisation of Rhodes plays a part (even if small compared to other factors) in perpetuating the continued exclusion of marginalised communities from Oxford, and in reinforcing barriers to people from (Oxford) minorities taking greater ownership over the institutions of the University. 

The college has a (belated) opportunity to make the right statement with the removal of the Rhodes statue. Leave it there, and Rhodes remains a ‘controversial figure’ and the college an institution concerned only with reproducing “an educated ruling class” (to quote from the college’s essay on Rhodes). Move it to a museum where it belongs, and the conversation with every undergraduate can be about our importance of questioning and learning from history – using education as a means of creating a more just future. The teachable moment will be all the stronger when the statue’s niche stands empty. 

Rhodes must fall.

Open Contracting & Inclusion – notes from an online discussion

[Summary: Exploring inclusion impacts of data and standards in response to a paper on Open Contracting & inclusion]

Yesterday I had the pleasure of joining a call hosted by HIVOS, and chaired by ILDA’s Ana Sofia Ruiz, to discuss a recent paper from Michael Canares and François van Schalkwyk on “Open Contracting and Inclusion”. The paper is well worth a read, and includes a review of five cases against a theoretical framework looking across data flows, opportunities for action, infomediary presence, and through to inclusion outcomes (see table below for example of how these play out in a few of the cases reviewed)

Table 2: Summary of conditions met by the cases with regard to open contracting and social inclusion

After the discussion, we were asked to summarise some of our inputs – hopefully feeding into a wide write-up. However, in case what I’ve written up doesn’t really fit the format of that, I’m posting a cleaned up and slightly expanded version of the remarks I made below:

This paper, and the discussion around it, raises a number of valuable questions – drawing on a rich theoretical landscape to post them. 

Firstly, it asks us “How are data flows being disrupted?”. This question is important, because in many ‘open contracting’ projects it is rarely explicitly asked. We’re living in a time of mass disruption, yet open contracting is often ‘sold’ as a kind of reform. One of the widely used success stories for work on open contracting data comes from Ukraine, where there was a true disruption in data flows – using the moment of revolution to reconfigure patterns of procurement, and to create data infrastructures that enabled those new more open practices. 

Secondly, this paper calls on us to question “what is the value of data in bringing about inclusion?” In the past we’ve talked about whether open data is either necessary or sufficient to create change. The answer I take from this paper is that increased accessibility of information and data is ‘a very useful, but nowhere near sufficient’ condition for inclusive change. 

Thirdly, the use of Castell’s framework from Communication Power of ‘network power’ (shaping the information that can be transmitted), ‘networking power’ (gatekeeping which information is transmitted), and ‘networked power’ (control by one node in the network of others), and ideas of ‘programming the network’ and ‘reprogramming the network’, raise some critical questions about the role of data standards. Often treated as neutral artefacts, standards are in fact sites of power, and of the negotiation of network and networking power. A standard defines what can be expressed, and its implementation involves choosing what will be expressed. Standards can be at once tools that cross contexts, taking with them the potential of inclusion and exclusion (network power), and at the same time, have that potential left inert if the localised networking power decides not to take up inclusion oriented features. 

To put this more concretely (if still a little complex I fear), the Open Contracting Data Standard was explicitly designed with a technical architecture that permits data about any given contracting process to be published by any actor, not only the ‘official’ information provider, and with a mechanism for extensions, supporting new fields of data to be attached to a contracting process. The ‘protocol’ sought to be inclusive. However, in practice, most tools have not been built to exploit this feature – meaning that in practice, the ‘platforms’ that exist don’t support inclusion of alternative perspectives on the state of a contracting process. This highlights that even at the level of the technical infrastructures, these are not made once, but have to be constantly remade, and their inclusive potential reinforced.

Fourth, the paper calls for a renewed focus on both governance context, and on intermediaries. Whilst technical artefacts can cross contexts, intermediary capacity building needs significant investment setting-by-setting. Equally, the discussion brought into view that this cannot be a short-term process. Intermediaries need not only skills, but also stocks of trust, in order to broker connections and communication. One of the evaluation team who had worked on a case covered by the paper discussed how it was individuals’ ability to maintain trusted relationships across different stakeholder groups that was critical to connecting information and empowerment. The importance of this cannot be overstated. 

Fifth, and finally, in his opening statement, Michael Canares challenged us to consider whether Open Contracting is different from other public sector reforms? After all – there have been decades of procurement reform. To this, I’m prepared to advance an answer: There is a meaningful qualitative difference with government reforms that start from the premise of openness. When a commitment to being open by default is put into practice, the configuration of actors involved in creating change is different, and conventional patterns of bureaucratic reform can be disrupted. Whether they are disrupted or not depends still on individual internal and external actors, and on whether the culture, as well as the practice, of openness has been brought into play. Nevertheless, Open Contracting has certain potential that is simply absent from past procurement reforms – and that is something to continue to build on.

The challenge ahead now is to work out what to do with these questions. We’re starting to unpack the complexity of open contracting practice – and the nuances for each individual setting. But, if all we have are critical questions, we risk inaction rather than advances in inclusion. During the early development of the Open Contracting Data Standard we often turned to the mantra that we should not let the perfect be the enemy of the good. This carries forward: as we avoid the perfect being the enemy of making things better. I’d contend that we need to continue turning our learning into tooling – whether technical tools, evaluation frameworks, to simple planning tools for new initiatives. Only then can we be part of taking on the large scale reforms that this time of disruption needs. 

Inclusive AI needs inclusive data standards

[Summary: following the Bellagio Center thematic month on AI last year, I was asked to write up some brief notes on where data standards fit into contemporary debates on AI governance. The below article has just been published in the Rockefeller ‘notebook’ AI+1: Shaping our Integrated Future*]

Copy of the AI+1 Publication, open at this chapter

Modern AI was hailed as bringing about ‘the end of theory’. To generate insight and action no longer would we need to structure the questions we ask of data. Rather, with enough data, and smart enough algorithms, patterns would emerge. In this world trained AI models would give the ‘right’ outcomes, even if we didn’t understand how they did this. 

Today this theory-free approach to AI is under attack. Scholars have called out the ‘bias in, bias out’ problem of machine-learning systems, showing that biased datasets create biased models — and, by extension, biased predictions. That’s why policy makers now demand that if AI systems are used to make public decisions, their models need to be ‘explainable’, offering justifications for the predictions they make. 

Yet, a deeper problem is rarely addressed. It is not just the selection of training data, or the design of algorithms, that embeds bias and fails to represent the world we want to live in. The underlying data structures and infrastructures on which AI is founded were rarely built with AI uses in mind, and the data standards — or lack thereof — used by those datasets place hard limits on what AI can deliver. 

Questionable assumptions

From form fields for gender that only offer a binary choice, to disagreements over whether or not a company’s registration number should be a required field when applying for a government contract, data standards define the information that will be available to machine-learning systems. They set in stone hidden assumptions and taken-for-granted categories that make possible certain conclusions, while ruling others out, before the algorithm even runs. Data standards tell you what to record, and how to represent it. They embody particular world views, and shape the data that shapes decisions. 

For corporations planning to use machine-learning models with their own data, creating a new data field or adapting available data to feed the model may be relatively easy. But for the public good uses of AI, which frequently draw on data from many independent agencies, individuals or sectors, syncing data structures is a challenging task. 

Opening up AI infrastructure

However, there is hope. A number of open data standards projects have launched since 2010. 

They include the International Aid Transparency Initiative (IATI) — which works with international aid donors to encourage them to publish project information in a common structure — and HXL, the Humanitarian eXchange Language, which offers a lightweight approach to structure spreadsheets with ‘Who, What, Where’ information from different agencies engaged in disaster response activities. 

When these standards work well, they allow a broad community to share data that represents their own reality, and make data interoperable with that from others. But for this to happen, standards must be designed with broad participation so that they avoid design choices that embed problematic cultural assumptions, create unequal power dynamics, or strike the wrong balance between comprehensive representation of the world and simple data preparation. Without the right balance certain populations may drop out of the data sharing process altogether. 

To use AI for the public good, we need to focus on the data substrata on which AI systems are built. This requires a primary focus on data standards, and far more inclusive standards development processes. Even if machine learning allows us to ask questions of data in new ways, we cannot shirk our responsibility to consciously design data infrastructures that make possible meaningful and socially just answers.

 

*I’ve only got print copies of the publication right now: happy to share locally in Stroud, and will update with a link to digital versions when available. Thanks to Dor Glick at Rockefeller for the invite and brief for this piece, and to Carolyn Whelan for editing.

Aligning Insight: standardisation and data collection for COVID-19 responses

[Summary: a brain-dump of thoughts on approaches to data standardisation relevant in the current coronavirus context.]

Over the last few weeks I’ve talked with a number of initiatives that are seeking to bring greater coherence to data collection on the impacts that coronavirus is having on their constituencies. Thousands of organisations, from chambers of commerce, to charity networks, and international agencies, are sending out surveys, or soliciting inputs, to help them understand the social, economic, organisational and operational impacts of the current pandemic – and to start charting ways forward in response.

This has led to a number of conversations asking how data standards could help. Common fears of wasted effort in duplicate data collection, missed insights from siloed data, or confusion created by incompatible categorisations, are all being compounded by the rapid data collection needs in this crisis. Yet, creating new standards can be a time-consuming process: involving in-depth negotiation of different user needs and capacities, careful drafting of definitions, and rigorous testing of schemas, in order to develop something that can function as an equitable tool for long-term communication and collaboration. That doesn’t mean, however, that it’s not possible to iterate towards more aligned and standardised data right now.

In this post I’ll try and set out a few (non-exhaustive) considerations on where some of the data standardisation practices I’ve engaged with over recent years fit in the current landscape, and some approaches to move towards aligning data collection initiatives.

Documentation, documentation, documentation

There are a couple of different parts of a data standard, including definitions that describe what the data should cover, and what each field is about and schemas that determine how the data should be encoded, serialised and shared. But it is documentation that brings these together, and makes them widely usable.

Good documentation should allow people designing data collection instruments (surveys, studies etc.) to quickly identify the building blocks of standardisation that they can draw upon, and should make following the standard the path of least resistance, rather than an uphill struggle.

Ideally documentation should be clearly versioned, and, if intended for global use, published in ways that support language translation.

Start from user needs

It’s easy to fall into the trap of being ‘data driven’, and trying to work out ways to bring together ’all the data’ by imposing top-down structures on data collection or aggregation. But, in working out where to prioritise alignment of definitions and structures it’s crucial to be driven user need. In a crisis context, it may help to identify the primary user need that data pipelines are being built to meet (e.g. a dashboard for operational decision making), and secondary user needs that is is desirable to meet too (e.g. evaluating whether support has been provided equitably; gathering baselines for future research; supporting advocacy for funding certain needs). This will help guide decisions on…

…’just enough standardisation’

Standards are about the distribution of costs and benefits between data producers, intermediaries and data users. Without any standards, data users wanting to draw on data from different sources have to do all the work of reconciling differences and inconsistencies – and sometimes find different datasets are simply irreconcilable. Where multiple datasets have compatible definitions, but different schemas, if may be possible for intermediaries to do the work of creating a consistent dataset by standardising non-standard data. Where data produces are made responsible for data standardisation, they have to do the work of reconciling their own business needs and local definitions, with the definitions and structures provided by a standard.

In the early stages of a crisis, the focus should be on what intermediaries can do: keeping the burden on data producers and users as low as possible, and focussing only on essential standardisation (guided by an understanding of user needs). By seeking to reconcile data from different sources, intermediaries will quickly learn which gaps in data alignment or standardisation are most costly to creating interoperable datasets.

Whilst adopting standards like the Open Contracting Data Standard or Beneficial Ownership Data Standard involves working with organisations over many months and even years to align their data (and in some cases, underlying business processes) with a shared model – in a crisis response, data producers need light-weight building blocks that make their job easier – giving them content to copy and paste into surveys, or data structures that can be easily implemented.

One well-developed approach for alignment in a crisis context comes from HXL – the Humanitarian eXchange Language which provides a simple approach to mark-up columns in spreadsheets using a collection of known # hash-tags, and then provides tools to combine and filter tagged data.

(For more on ‘just enough’ thinking see Rachel Coldicutt’s post on just enough internet)

(Critically) re-use existing standards

It’s rare that you will need to ‘invent’ any standards from scratch: standardisation is often an assembly job: working out which existing standards to align with and which pieces are aligned enough to work together. As a starting point I often turn to schema.org, the ad-hoc effort by search engines to create a common (and relatively loose) vocabulary of terms to describe everything from people, local businesses and books, to pandemic related data, or I look at conventions at use in existing datasets in the domain I’m helping create data models for.

Certain lower-level conventions, like using ISO Dates, unicode for text, and ISO language and country codes, are also worth encouraging and documenting: although in most cases as long as a data source is internally consistent in how it encodes countries, dates, languages and so-on, intermediaries will be able to more-or-less map the data to common codes over the short-term.

I say that one should ‘critically’ re-use existing standards, because, as the fantastic Data Feminism book underscores, definitions of data are about power: about whose lived experience and accounts of the world will be represented and shared. There is often a balance to strike between adopting common ways of representing the world, and challenging oppressive and problematic representations.

Particularly when building standards for use across national and cultural boundaries, this calls for an awareness of the many falsehoods embedded in data models, and consideration of the embedded assumptions in off-the-shelf data models. It can also call for a sensitivity to when standards, even in a crisis, should not take the path of least resistance, but should introduce some friction in deciding which categories to use, or how to disaggregate data. For example, where user needs (and here is where considering diverse secondary user needs can be important, as ‘primary user needs’ may often represent dominant power perspectives) require an understanding of how data varies by gender, or the ability to provide intersectional disaggregation, then standards should make clear how this should be recorded and shared.

Look for the keys

One way to lower the burden on data collectors is to look for the keys that unlock additional existing open datasets. For example:

  • Postcodes in many countries allow data to be geocoded, and allow you to integrate a range of local classifications and statistics. In the UK, collecting the postcode of where a service is delivered allows you to look up the socio-economic status of the are, the local authority responsible for service delivery there, and a whole host of other information. In other countries, location data may be possible to match with satellite observation data to infer other relevant classifications for a survey respondent.
  • Organisation identifiers – which, if collected and well validated, can be reconciled against public databases to find information on companies, charities and other entities. In the UK, a Charity number can be used to look up classification data on the organisation’s beneficiaries taken from annual charity returns. For many nations, company numbers can be reconciled against OpenCorporates to provide detailed corporate information.
  • URLS and Social Media IDs can be useful in some use-cases to crawl web pages and social network and find signals about the networks an organisation is part of, of the topics they work on.

Each sector and domain is also likely to have some of its own ‘keys’ that can hook into existing datasets (e.g. the Common Procurement Vocabulary for classifying public procurements in Europe). If you are lucky, they will be attached to relevant open datasets.

Care still needs to be taken to consider gaps in the lookup data (e.g. some countries lack open corporate register data; satellite data coverage varies; not all organisations have websites), and to avoid introducing biases through faulty assumptions (e.g. if assuming the ‘register office’ postcode of UK charities is where their beneficiaries are, then it looks like London gets more funding than it does). It’s also important to consider how easy it will be for those providing data to enter it. For example, do organisations know their registration number? (On the organisation identifiers point, this is one of the reasons I was involved in creating org-id.guide and there remains a lot still to do in this area).

Decide on your approach to categories

At the heart of many standardisation processes is classification: sorting needs, organisations, events or people into categories. Standardising categories can be notoriously difficult: and is often hard to do in a rush. You might find there are existing classification schemes you can draw upon, or you might find a need to create your own (or, as LandVoc has done, albeit over a number of years, to engage with an existing classification scheme to get the elements you need included).

Good documentation of the boundaries of a category (ideally with examples) is vital for them to be used in interoperable ways.

Many of the standards I’ve worked on have stepped back from settling categorisation debates, but representing classification elements in terms of:

  • A vocabulary – to allow different datasets to use different classification schemes
  • A code – that stays constant across languages
  • A label – that can be translated into local languages

This offers a way to at least avoid two people talking about different things with the same terms, but leaves the alignment problem to later.

In an ideal world, a rapid standardisation project might be able to provide ‘good enough’ categories for data collectors to start with, but then offer them some level of flexibility so that individual data collection exercises can address their local user needs by adapting core categorisations.

Semantic standards such as SKOS have a lot to offer to efforts to bring together data using heterogenous classification schemes: allowing not only hierarchical relationships (i.e. the ability to add a ‘narrower’ concept under a headline category), but also broad and narrow matches between neighbouring concepts. However, tools and skills for working well with this kind of data and classification structure are, in my experience, quite scarce.

Meta-data matters

One of the most important things to help intermediaries align different datasets is ‘data about the data’. Knowing who collected a dataset (ideally with ability to contact them), knowing when and where it was collected, and ideally having pointers to the survey forms or data collection instruments used can make the process of ingesting and reconciling disparate datasets at lot, lot easier.

Conventions like MetaTab provide an easy way to get started providing standardised meta-data when circulating spreadsheets, and there are well established standards for meta-data in most domains.

Meta-data should also include clear information on restrictions or permissions that apply to re-use of a dataset, which brings me onto:

Don’t forget standards of data governance

The first question to ask before making use of any dataset that might contain sensitive information from individuals or organisations is: do I have the right to use this data? Does using or sharing this data (or analysis based on it), put anyone at risk?

As the responsible data initiative puts it, there is a:

…collective duty to account for unintended consequences of working with data by:

1) prioritising people’s rights to consent, privacy, security and ownership when using data in social change and advocacy efforts,

2) implementing values and practices of transparency and openness.

Working out early on a set of shared procedures for assessing the need for, obtaining and recording consents from data subjects for data sharing and re-use can avoid hitting barriers later on. This might take a number of forms, such as:

  • Suggested privacy policy terms that describe how data might be shared and re-used;
  • Identifying the different states that consent might take (.e.g. consent for data to be ‘shared’ with identified partners, or consent for non-personal data to be ‘open’ – drawing on the ODI’s data spectrum and how these should be encoded in each relevant row of a dataset;
  • Adding a section to meta-data templates for those sharing data to indicate who else data can be shared with, and if any fields should be masked from an open version of a dataset.

Standards are about people

Lastly, but by no means least – it is important to think of standards as a process, not a product. That documentation I mentioned at the start? That’s not for users: that’s for you. Because most of the time people don’t read documentation: they don’t have the time, or don’t know where to start. In reality, most of the standards I’ve worked on require conversations, engagement and feedback to help people align their data with them.

If someone is designing a data collection survey, the prime opportunity for standardisation is between their first draft, and it going out in the field. If you can get into a conversation then, and provide prioritised feedback on how it can align more with the documented standard, how it could incorporate some ‘key fields’ that will unlock other data, or how the consent questions could be worded to be compatible with shared data governance, then you have a chance of the data that flows from that data collection will be possible to bring together as part of a wider aligned insight datasets.

In all the standards I’ve worked on, the ‘Helpdesk’ team have been as vital as the documentation and schema to making standards truly work as tools of coordination and collaboration.

 

 

2019 in Review

I started writing this just before the Christmas break, but got interrupted by both festivities and flu. So, below, a slightly belated look back at 2019: where yet again my blogging has been far too sporadic.

January – FOI & Javelin Park Protests

Last Christmas eve, I was pouring over the newly released details of a £100m+ cost increase in the contract for the Javelin Park incinerator I’ve written about before. Over Christmas, we put together calls for an Independent Inquiry into the project, and come January, I was outside the plant, taking part in protests at the price rise.

Since then, the County Council have been taken to court over the contract, putting the calls for an inquiry on hold (although questions were finally put to the Chief Executive of the Council in March, with updates on the court case expected in early 2020.

My other FOI adventures of 2019 have been less conclusive:

  • Gloucestershire’s refusal to provide prices and buyers of the public land they have sold off means the only way to piece this together would be by spending £100s on land registry records: something I’ve not had space to pursue. Promises that this information would be published proactively from September have been broken by Cabinet – and our experiment in using the Local Audit and Accountability Act in June to look at relevant documents didn’t appear to provide a full overview. It seems profoundly odd that there is so little transparency over how public assets are being disposed of.

February – Exploring Arts and Data

At the start of the year, I kicked off a part-time role as ‘Data Catalyst’ with Create Gloucestershire working on a number of fronts to support their internal data practices, but also to scope out ways to connect artists with debates around data. I shared some initial research back in February and in September had great fun co-facilitating a ‘Creative Lab’ at Atelier in Stroud, where we co-created a range of data-informed art works – from VR Design Teachers, to fabric chromatography creations that visualised data on school subject choice.

March – TicTec & The State of Open Data

Much of March was spent working on final editing of chapters for The State of Open Data, and then, late in the month, heading to Paris for The Impacts of Civic Technology (TicTec) conference to present initial finings with my co-editor, Mor. An evening reception and hearing about digital democracy and participation projects at French National Assembly was particularly inspiring.

April – Printing and Driving

2019 was supposed to be a bit of a sabbatical year (learning point: I’m not very good at sabbaticals), but in late March and April I did finally get round to my two main goals of: (a) learning a bit about printmaking; (b) passing my driving test.

A wonderful two day workshop with Rod Nelson had me exploring woodcut designs exploring field patterns and the Stroud landscape.

And Bob Waters got me through my test first time.

I’ve promptly failed to do any more printing or driving this year, but at least I now know a bit more about how to!

First ‘field patterns’ print drying on Rod Nelson’s studio

May – State of Open Data Book Tour and OGP

May took me to the US, for a few weeks of #slowTravel by train around the East Coast, and then up to Canada, for the full launch of the the State of Open Data book. It was a real pleasure to catch up with old friends, and to take part in some really stimulating workshops, including a fascinating Belfer Center session on ‘Data as Development’ which gave rise to this note on the idea of a ‘a data extraction transparency initiative.

Getting hold of physical copies of The State of Open Data book was a great moment: as at times the project has felt quite beyond delivery. I’m pretty pleased indeed with how it turned out – with contributions from 60+ authors, and many more reviewers and contributors.

I’ve still got a few hard copies that can go free to University or organisational libraries, so if you’ve read this far, and you would like one – do drop me a note.

At IDRC for book talk on State of Open Data
With the editors of State of Open Data sharing findings from the book at IDRC HQ.

June – Facilitation fun with IATI

In June I took another #slowTravel trip – heading to Copenhagen by train to facilitate a workshop for the International Aid Transparency Initiative’s technical community on the draft strategy.

This followed some online facilitation work for strategy dialogues earlier in the year. I’ve also had chance this year to co-facilitate an online dialogue for Land Portal: reminding me how much I enjoy this kind of blended online and offline facilitation work. Perhaps something to explore more in 2020.

July – Coast to Coast

In July, Rachel and I set out walking across the UK on Wainright’s Coast to Coast path – raising funds for  Footsteps Counselling and Care .

The weather and walk was stunning – and a real chance for reflection. Photos from the coast to coast walk

 

August – Impact Bonds and Waste Management

Besides the annual August pilgrimage to Greenbelt, it was a month of interesting UK projects – including work with the Government Outcomes Lab at the University of Oxford to scope out ways to improve transparency and data sharing around Social Impact Bonds, and contributing to a  (sadly unsuccessful) pitch by Open Data Manchester and Dsposal to secure innovation funding to build on their prototype KnoWaste standard.

September – Civic Media Observatory

In September, I had my first opportunity to work in-depth on a project with the fantastic Global Voices team – using AirTable to rapid prototype a database and workflow for tracking and analysing mainstream media, social media, and offline events through a local lens, and understanding the context and subtext of the media that platform moderators may be asked to make snap judgements over.

A three-day workshop in Skopje, Northern Macedonia, looking at coverage of the EU Accession talks, put the prototype to the test (and introduced me to some quite remarkable monumental architecture….). 

October – AI at Bellagio

I spent all of October in Italy, first as a residential fellow at the Rockefeller Bellagio Center in Italy, and then with a brief vacation in Verona, and quick trip to Rome to work with Land Portal.

Taking part in the Bellagio Center’s thematic month on Artificial Intelligence was quite simply a once in a lifetime opportunity. I didn’t write much about it at the time (as I was busy trying to pull together the outline of a new book proposal) and with an election called in the UK just as we were heading home, haven’t had the space to follow up. Hopefully some point next year I’ll be publishing a few outputs from the month.

However, I can’t leave my fellow resident’s work un-shared, so if I’ve not already signposted the below to you, do take time to:

I should also mention one of the other highlights of the residency: enjoying two shows, numerous tricks. and sage advice from ‘Magician in residence’ Brad Barton, Reality Thief – go see him if you are ever in the Bay Area!

November – Elections!

I returned from Italy right into the middle of the biggest General Election campaign Stroud District Green Party have ever run, for the fantastic Molly Scott Cato. It was a month both spent both on the doorstep, and juggling spreadsheets – exploring the reality of values-based volunteer-driven political campaigning in an era of data.

December – Global Data Barometer

Over November and December I was also working on the scoping for a potential new project – the Global Data Barometer – a successor to the Open Data Barometer study I helped create at the Web Foundation back in 2013. The goal is to explore how a 100+ country study could provide insight into patterns of ‘responsible re-use’ of data around the world – capturing both use of data as a resource for sustainable development – and efforts to manage the risks that the unregulated collection and processing of ever increasing quantities of data might create. I published the initial draft research framework just before Christmas, and will be exploring the project more in a workshop in Washington next week.

2020 plans

Over 2020 I’m looking forward to more work on the Global Data Barometer, and with the Open Ownership team, as well as some further facilitation projects, and, hopefully, a bit more writing time! We’ll see.

Algorithmic systems, Wittgenstein and Ways of Life

I’m spending much of this October as a resident fellow at the Bellagio Centre in Italy, taking part in a thematic month on Artificial Intelligence (AI). Besides working on some writings about the relationship between open standards for data and the evolving AI field, I’m trying to read around the subject more widely, and learn as much as I can from my fellow residents. 

As the first of a likely series of ‘thinking aloud’ blog posts to try and capture reflections from reading and conversations, I’ve been exploring what Wittgenstein’s later language philosophy might add to conversations around AI.

Wittgenstein and technology

Wittgenstein’s philosophy of language, whilst hard to summarise in brief, might be conveyed through reference to a few of his key aphorisms. §43 of the Philosophical Investigations makes the key claim that: ”For a large class of cases–though not for all–in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language.” But this does not lead to the idea that words can mean anything: rather, correct use of a word depends on its use being effective, and that in turn depends on a setting, or, as Wittgenstein terms it, a ‘language game. In a language game participants have come to understand the rules, even if the rules are not clearly stated or entirely legible: we engage successfully in language games through learning the techniques of participation, acquired through a mix of instruction and of practice. Our participation in these language games is linked to the idea of ‘forms of life, or, as it is put in §241 of the Philosophical Investigations, “It is what human beings say that is false and true; and they agree in the language they use. That is not agreement in opinions but in form of life.”.

As I understand it, one of the key ideas here can be expressed by stating that meaning is essentially social, and it is our behaviours and ways of acting, constrained by wider social and physical limits, that determine the ways in which meaning is made and remade.

Where does AI fit into this? Well in Wittgenstein as a Philosopher of Technology: Tool Use, Forms of Life, Technique, and a Transcendental Argument, Coeckelbergh & Funk (2018) draw on Wittgenstein’s tool metaphors (and professional history as an engineer as well as philosopher) to show that we can apply a Wittgensteinian analysis to technologies, explaining that: that “we can only understand technologies in and from their use, that is, in technological practice which is also culture-in-practice.” (p 178) . At the same time, they point to the role of technologies in constructing the physical and material constraints upon plausible forms of life:

Understanding technology, then, means understanding a form of life, and this includes technique and the use of all kinds of tools—linguistic, material, and others. Then the main question for a Wittgensteinian philosophy of technology applied to technology development and innovation is: what will the future forms of life, including new technological developments, look like, and how might this form of life be related to historical and contemporary forms of live?  [sic] (p 179)

It is important though to be attentive to the different properties of  different kinds of tools in use (linguistic, material, technological) within any form of life. Mass digital technologies, in particular, appears to spread in less negotiable ways: that is, some new technology introduced, whilst open to be embedded in forms of life in some subtly different ways, often has core features presented only on a take-it-or-leave-it basis, and, once introduced, can be relatively brittle and resistant to shaping by its users.

So – as new technologies are introduced, we may find that they reconfigure the social and material bounds of our current forms of life, whilst also introducing new language games, or new rules to existing games into our social settings. And with contemporary AI technologies in particular, a number of specific concerns may arise.

AI Concerns and Critical Responses

Before we consider how AI might affect our forms of life, a few further observations (and statements of value):

  • The plural of ‘forms’ is intentional. There are variations in the forms of life lived across our planet. Social agreements in behaviour and action vary between cultural settings, regions or social strata. Many humans live between multiple forms of life, translating in word and behaviour between the different meanings each requires. Multiple forms are not strictly dichotomous: different forms of life may have many resemblances, but their distinctions matter and should be valued (this is an explicit political statement of value on my part).
  • There have been a number of social projects to establish certain universal forms of life over past centuries. For example, the development of consensus on human rights frameworks is one of these. seeking equitable treatment of all (I also personally subscribe to the view that a high level of respect for universal human rights should feature as a constraint to  all forms of life).
  • Within this trend, there are also a number of significant projects seeking to establish greater acceptance of different ways of living, including action to reverse the victorian imposition of certain normative family structures, work to afford individuals greater autonomy in defining their own identities, and activity to embed much more ecological models of thinking about human society.

These trends (or ongoing social struggles if you like) seeking to make our ways of living more tolerant, open,  inclusive and sustainable are important to note when we consider the rise of AI systems. Such systems are frequently reliant on categorised data, and on a reductive modelling of the human experience based on past, rather than prospective, data.

This noted, it appears then that we might point to two distinct forms of concern about AI:

(A) The use of algorithmic systems, built on reductive data, risks ossifying past ways of life (with their many injustices), rather than supporting struggles for social justice that involve ongoing efforts to renegotiate the meaning of certain categories and behaviours.

(B) Algorithmic systems may embody particular ways of life that, because of the power that can be exercised through their pervasive operation, cause those forms of life to be imposed over others. This creates pressure for humans to adapt their ways of life to fit the machine (and its creators/owners), rather than allowing the adaptation of the machine to fit into different human ways of life.

Brief examples

Gender detection software is AI trained to judge  the gender of a person from an image (or from analysing names, text or some other input). In general, such systems define gender using a male-female binary. Such systems are being widely used in research and industry. Yet, at the same time the task of judging gender is being passed from human to machine, there are increasingly present ways of life that reject the equation of gender and sex identity, and the idea of a fixed gender-binary. The introduction of AI here risks the ossification of past social forms.

Predictive text tools are increasingly being embedded in e-mail and chat clients to suggest one-click automatic responses, instead of requiring the human to craft a written response. Such AI-driven features are at once a tool of great convenience, but also an imposed shift in our patterns of social interaction.

Such forms of ‘social robot’ are addressed by Coeckelbergh & Funk when they write: “These social robots become active systems for verbal communication and therefore influence human linguistic habits more than non-talking tools.” (p 185). But note the material limitations of these robots: they can’t construct a full sentence representative of their user. Instead, they push conversation towards the quick short response, creating a pressure to change patterns of human interaction.

Auto-replies suggested by Google Mail based on a proprietary algorithm.

The examples above suggested by gmail for me to use in reply to a recent e-mail might follow terms I’d often use, but push towards a form of e-mail communication that, at least in my experience, represents a particularly capitalist and functional form of life, in which speed of communication is of the essence, rather than social communication and exploration of ideas.

Reflections and responses

Wittgenstein was not a social commentator, but it is possible to draw upon his ideas to move beyond conversations about AI bias, to look at how the widespread introduction of algorithmic and machine-learning driven systems may interact with different contemporary forms of living.

I’m always interested though in the critical leading to the practical, and so below I’ve started to sketch out possible responses the analysis above leads me to consider. I also strongly suspect that these responses, and justification for them, can be elaborated much more directly and accessibility without getting here via Wittgenstein. Writing that may be a task for later, but as I came here via the Wittgensitinian route, I’ll stick with it.

(1) Find better categories

If we want future algorithmic systems to represent the forms of live we want to live, not just those lived in the past, or imposed upon populations, we need to focus on the categories and data structured used to describe the world and train machine-learning systems.

The question of when we can develop global categories that have meaning that is ‘good enough’ in terms of alignment in use across different settings, and when it is important to have systems that can accommodate more localised categorisations, is one that requires detailed work, and that is inherent political.

(2) Build a better machine

Some objects to particular instances of AI may be because it is, ultimately, too blunt in its current form. Would my objection to the predictive text tools be the same if they could express more complete sentences, more in line with the way I want to communicate? For many critiques of algorithmic systems, there may be a plausible response to suggest that a better designed or trained system could address the problem raised.

I’m sceptical however, of whether it is plausible for most current instantiations of machine-learning to be adaptable enough to different forms of life: not least on the grounds that for some ways of living the sample-size may be too small to gather enough data points to construct a good model, or the collection of the data required may be too expensive or intrusive for theoretical possibilities of highly adaptive machine-learning systems to be practically feasible or desirable.

(3) Strategic rejection

Recognising the economic and political power embedded in certain AI implementations, and the particular form of life it embodies, may help us to see technologies we want to reject outright. If a certain tool makes moves in a language game that are at odds with the game we want to be playing, and only gains agreement of action through its imposition, then perhaps we should not admit it at all.

To put that more bluntly (and bringing in my own political stance), certain AI tools embody a late-capitalist form of life, rooted in cultures and practices of a small strata of Silicon Valley. Such tools should have no place in shaping other ways of life, and should be rejected not because they are biased, or because they have not adequately considered issues of privacy, but simply because the form of life they replicate undermines both equality and ecology.

Where next

Over my time here at Bellagio, I’ll be particularly focussed on the first of these responses – seeking better categories, and understanding how processes of standardisation interact with AI. My goal is to do that with more narrative, and less abstraction, but we shall see…

Creative Lab Report: Data | Culture | Learning

[Summary: report from a one day workshop with Create Gloucestershire bringing together artists and technologists to create artworks responding to data. Part 2 in a series with Exploring Arts Engagement with (Open) Data]

What happens when you bring together a group of artists, scientists, teachers and creative producers, with a collection of datasets, and a sprinkling of technologists and data analysts for a day? What will they create? What can we learn about data through the process?  

There has been a long trend of data-driven artworks, or individual artists incorporating responses to structured data in their work. But how does this work in the compressed context of a one-day collaborative workshop? These are all questions I had the opportunity to explore last Saturday in a workshop co-facilitated with Jay Haigh of Create Gloucestershire and hosted at Atelier in Stroud:  an event we ran under the title “Data | Create | Learning: Creative Lab”

The steady decline in education spending and increased focus on STEM subjects has impacted significantly on arts teaching and teachers. The knock on effect is observed in the take up of arts subjects at secondary, further and higher education level and, ultimately, impacting negatively on the arts and cultural sector in the UK. As such, Create Gloucestershire has been piloting new work in Gloucestershire schools to embed new creative curriculum approaches, supporting its mission to ‘make arts everyday for everyone’. The cultural education agenda therefore provided a useful ‘hook’ for this data exploration. 

Data: preparation

We started thinking about the idea of a ‘art and data hackathon’ at the start of this year, as part of Create Gloucestershire’s data maturity journey and decided to focus on questions around cultural education in Gloucestershire. However, we quickly realised an event could not be entirely modelled on a classic coding hackathon event, so, in April we brought together a group of potential participants for a short design meeting. 

Photo of preparation workshop

For this, we sought out a range of datasets about schools, arts education, arts teaching and funding for arts activities – and I worked to prepare Gloucestershire extracts of these datasets (slimming them down from hundreds of columns and rows) . Inspired by the Dataset Nutrition Project project, and using AirTable blocks to rapidly create a set of cards, we took along profiles of some of these datasets to help give participants at the planning meeting a sense of what might be found inside each of the datasets we looked at. 

Dataset labels: inspired by dataset nutrition project
Through this planning meeting we were able to set our expectations about the kind of analysis and insights we might get to from these datasets, and to think about placing the emphasis of the day on collaboration and learning, rather than being overly directive about the questions to be answered with data. We also decided that, in order to help collaborative groups form in the workshop, and to make sure we had materials prepared for particular art forms, we would invite a number of artists to act as anchor facilitators on the day.

Culture: the hackathon day 

Group photo of hackathon day

After an overview of Create Gloucestershire’s mission to bring about ‘arts everyday for everyone’, we began with introductions, going round the group and completing three sentences:

  • For me, data is…
  • For me, arts everyday is…
  • In Gloucestershire, is arts everyday….? 

For me, data is... (post-it notes)

Through this, we began to surface different experiences of engagement with data (everywhere; semi-transparent; impersonal; information; a goldmine; less well defined than art; complex; connective…), and with questions of access to arts (Arts everyday is: fun; making sense of the world; what you make of it; necessary; a privilege for some; an improbable dream; essential). 

We then turned briefly to look at some of the data available to explore these questions, before inviting our artists to explain the tools and approaches they had brought along to share:

  • Barney Heywood of Stand + Stare demonstrated use of touch-sensitive tape to create physical installations that respond to an audience with sound or visuals, as well as the Mayfly app that links stickers and sounds;
  • Illustrator and filmmaker, Joe Magee described the power of the pen, and how to sketch out responses to data;
  • Digital communications consultant and artist, Sarah Dixon described the use of textiles and paper to create work that mixes 2D and 3D; and
  • Architect Tomas Millar introduced a range of Virtual Reality technologies, and how tools from architecture and gaming could be adapted to create data-related artworks. 

To get our creative ideas flowing, we then ran through some rapid idea generation, with everyone rotating around our four artists groups, and responding to four different items of data (below) with as many different ideas as possible. From the 30+ ideas generated came some of the seeds of the works we then developed during the afternoon.

Slides showing: 38% drop in arts GCSE entries 2010 to 2019; Table of number and percentage of students a local secondary schools eligible for free school meals; Quantitative and qualitative data from a study on arts education in schools.

Following a short break, everyone had the chance to form groups and dig deeper into designing an artwork, guided by a number of questions:

  • What response to data do group members want to focus on? Collecting data? Data representation? Interpretation and response? Or exploring ‘missing data’?
  • Is there a story, or a question you want to explore?
  • Who is the audience for your creation?
  • What data do you need? Individual numbers; graphs; tables; geo data; qualitative data; network data or some other form? 
Example of sketches
Sketching early ideas

Groups then had around three hours to start making and creating prototype artworks based on their ideas, before we reconvened for a showcase of the creations.

The process was chaotic and collaborative. Some groups were straight into making: testing out the physical properties of materials, and then retrofitting data into their works later. Others sought to explore available datasets and find the stories amongst a wall of statistics. In some cases, we found ourselves gathering new data (e.g. lists of extracurricular activities taken from school websites), and in others, we needed to use exploratory data visualisation tools to see trends and extrapolate stories that could be explored through our artforms. People moved between groups to help create: recording audio, providing drawings, or sharing skills to stimulate new ways of increasing access to the stories within the data. Below is a brief summary of some of the works created, followed by some reflections on learning from the day. 

The artworks

Interactive audio: school subjects in harmony

Artwork: Barney Heywood and team | Photo credit: Kazz Hollick

Responding to questions about the balance of the school curriculum, and the low share of teaching hours occupied by the arts, the group recorded a four-part harmony audio clip, and set the volume of each part relative to the share of teaching time for arts, english, sciences and humanities. Through a collection of objects representing each subject, audiences could trigger individual parts, all four parts together, or a distorted version of the harmony. Through inviting interaction, and using volume and distortion, the piece invited reflection on the ‘right’ balance of school subjects, and the effect of loosing arts from the curriculum for the overall harmony of education. 

Fabric chromatography: creative combinations

Artwork: Sarah Dixon and team. Photo credit: Jay Haigh

 Picking up on a similar theme, this fabric based project sought to explore the mix of extracurricular activities available at a school, and how access to a range of activities can interact to support creative education. Using strips of fabric, woven in a grid onto a backcloth, the work immersed a dangling end of each strip in coloured ink, the mix of inks depending on the range of arts activities available at a particular school. As the ink soaked up vertical strands of the fabric, it also started to seep into horizontal strands, which could mix with other colours. The colours chosen reflected a chart representation of the dataset used to inform the work, establishing a clear link between data, information, and art work.

This work offered a powerful connection between art, data and science: allowing an exploration of how the properties of different inks, and different fabrics, could be used to represent data on ‘absorption’ of cultural education, and the benefits that may emerge from combining different cultural activities. The group envisaged works like this being developed with students, and then shown in the reception area of a school to showcase it’s cultural offer. 

The shrinking design teacher (VR installation)

Artwork: Tomas Millar & Pip Heywood. Photo credit: Jay Haigh

Using a series of photographs taken on a mobile phone, a 3D model of representation of Pip, a design teacher, was created in a virtual landscape. An audio recording of Pip describing the critical skill sets engendered through design teaching was linked to the model, which was set to shrink in size over the time of the recording reflecting 7-years of data on the reduction in design teaching hours in school.

Observed through VR goggles, the piece offered an emotive way to engage with a narrative on the power of art to encourage critical questioning of structures, and to support creative engagement with the world, all whilst – imperceptibly at first, and more clearly as the VR observer finds themselves looking down at the shrinking teacher – highlighting current trends in teaching hours. 

Arcade mechanicals

Artwork: Joe Magee and team. Photo credit: Jay Haigh

From the virtual to the physical, this sketch questioned the ‘rigged’ nature of grammar school and private education, imagining an arcade machine where the weight, size and shape of tokens were set according to various data points, and where the mechanism would lead to certain tokens having a better chance of winning. 

By exploring a data-informed arcade mechanisms, this idea captures the idea that statistical models can tell us something about potential future outcomes, but that outcomes are not entirely determined, and there are still elements of chance, or unpredictable interactions, in any individual story. 

Exclusion tags

Artwork: Joe Magee, Sarah Dixon and team. Photo: Jay Haigh

Building on data about different reasons for school exclusion, eight workshop participants were handed paper tags, marking them out for exclusion from the ‘classroom’. They were told to leave the room, where the images on their tags were scanned (using the Mayfly app) playing them a cold explanation of why they have been excluded and for how long.

The group were then invited to create a fabric based sculpture to represent the percentage of children excluded from school in Gloucestershire for the reasons indicated on their tag.  

The work sought to explore the subjective experience of being excluded, and to look behind the numbers to the individual stories – whilst also prototyping a possible creative yarn-bombing workshop that could be used with excluded young people to re-engage them with education.  

The team envisaged a further set of tags linked to personal narratives collected from young people excluded from school, bringing their voices into the piece to humanise the data story.

Library lights: stories from library users

This early prototype explored the potential VR to let an audience explore a space, shedding light on areas that are otherwise in darkness. Drawing on statistics about the fact that 33% of people use libraries, and on audio recordings – drawn from direct participant quotes collected by Create Gloucestershire during their 3-year Art of Libraries test programme describing how people benefitted from engagement with arts interventions in libraries across Gloucestershire – a virtual space was populated with 100 orbs – the percentage lit relating to those who use libraries. As the audience in VR approached a lit orb, an audio recording of an individual experience with a library would play. 

The creative team envisaged the potential to create a galaxy of voices: offseting negative comments about libraries from those that don’t use them (they were able to find a significant number of data sets showing negative perceptions about libraries, but few positive ones) with the good experiences of those that do.

Artwork: Tomas Millar and team (image to come)

Seeing our networks


Not so much an artwork, as a data visualisation, this piece took data gathered over the last five years by Create Gloucestershire to record attendance at Create Gloucestershire events. Adding in data on attendance at the Creative Lab, lists of people, events and event participation (captured and cleaned up using the vTiger CRM), were fed into Kumu, and used to build an interactive network diagram. The visual allows an identification of how, over time, CG events have both engaged with new people (out on the edge of the network), and have started to build ongoing connections. 

A note on naming

*One things we forgot to do (!) in our process was to ask each group to title their works, so the titles and descriptions above are given by the authors of this post. We will happily amend with input from each group. 

Learning

We closed our workshop reflecting on learning from the day. I was particularly struck by the way in which responding to dataset through the lens of artistic creation (and not just data visualisation) provided opportunities to ask new questions of datasets, and to critically question their veracity and politics: digging into the stories behind each data point, and powerfully combining qualitative and quantitative data to look not just at presenting data, but finding what it might mean for particular audiences. 

However, as Joe Magee framed it, it wasn’t always easy to find a route up the “gigantic data coalface”. Faced with hundreds of rows and columns of data, it was important to have access to tools and skills to carry out quick visualisations: yet knowing the right tools to use, or how to shape data so that it can be easily visualised, is not always straightforwards. Unlike a classic data hackathon, where there are often demands for the ‘raw data’, a data and art creative lab benefits from more work to prepare data extracts, and to provide access to layers of data (individual data points, a small set they belong in, the larger set they come from) . 

Our journey, however, took use beyond the datasets we had pre-prepared. One particular resource we came across was the UK Taking Part Survey which offers a range of analysis tools to drill down into statistics on participation in art forms by age, region and socio-economic status. With this dataset, and a number of others, our expectations were often confounded when, for example,  relationships we had expected to find between poverty and arts participation, or age and involvement, were not borne out in the data. 

This points to a useful symmetry: turning to data allowed us to challenge the assumptions that might otherwise be baked into an agenda-driven artwork, but engaging with data through an arts lens also allowed us to challenge the  assumptions behind data points, and behind the ways data is used in policy-making. 

We’ve also learnt more about how to frame an event like this. We struggled to describe it in advance and to advertise it. Too much text was the feedback from some! Now with images of this event, we can think about ways to provide a better visual story for future workshops of what might be involved. 

Given Create Gloucestershire’s commitment to arts everyday for everyone as a wholly inclusive statement of intent, it was exciting to see collaborators on the day truly engaging with data in a way they may not have done previously, and then expanding access to it by representing data in accessible and engaging forms which, additionally, could be explored by subjects of the data themselves.  What might have seemed “boring” or “troublesome” at the start of the day become a font of inspiration and creativity, opening up new conversations that may never have previously taken place and setting up the potential for new collaborations, conversations, advocacy and engagement.

Thanks

Thank you to the team at Create Gloucestershire for hosting the day, and particularly to Caroline, Pippa and Jay for all the organisation. Thanks to Kat at Atelier for hosting us, and to our facilitating artists: Barney, Sarah, Thomas and Joe. And thanks to everyone who gave up a Saturday to take part!

Photo credit where not stated: Jay Haigh

High value datasets: an exploration

[Summary: an argument for the importance of involving civil society, and thinking broad when exploring the concept of high value data (with lots of links to past research and the like smuggled in)]

On 26th June this year the European Parliament and Council published an update to the Public Sector Information (PSI) directive, now recast as Directive 2019/1024 “on open data and the re-use of public sector information.  The new text makes a number of important changes, including bringing data held by publicly controlled companies in utility and transport sectors into the scope of the directive, extending coverage of research data, and seeking to limit the granting of exclusive private sector rights to data created during public tasks, and increase the transparency when such rights are granted.

However, one of the most significant changes of all is the inclusion of Article 14 on High Value Datasets which gives the Commission power to adopt an implementing act “laying down a list of specific high-value datasets” that member states will be obliged to publish under open licenses, and, in some cases, using certain APIs and standards. The implementing acts will have the power to set out those standards. This presents a major opportunity to shape the open data ecosystem of Europe for decades to come.

The EU Commission have already issued a tender for a consultant to support them in defining a ‘List of High-value Datasets to be made Available by the Member States under the PSI-Directive’, and work looks set to advance at pace, particularly as the window granted by the directive to the Commission to set out a list of high value datasets is time-limited.

A few weeks back, a number of open data researchers and campaigners had a quick call to discuss ways to make sure past research, and civil society voices, inform the work that goes forward. As part of that, I agreed to draft a short(ish) post exploring the concept of high value data, and looking at some of the issues that might need to be addressed in the coming months. I’d hoped to co-draft this with colleagues, but with summer holidays and travel having intervened, am instead posting a sole authored post, with an invite to others to add/dispute/critique etc. 

Notably, whilst it appears few (if any) open-data related civil society organisations are in a position to lead a response to the current EC tender, the civil society open data networks built over the last decade in Europe have a lot to offer in identifying, exploring and quantifying the potential social value of specific open datasets.

What counts as high value?

The Commission’s tender points towards a desire for a single list of datasets that can be said to exist in some form in each member state. The directive restricts the scope of this list to six domains: geospatial, earth observation and environment, meteorological, statistical, company and company ownership, and mobility-related datasets. It also appears to anticipate that data standards will only be prescribed for some kinds of data: highlighting a distinction between data that may be high value simply by virtue of publication, and data which is high-value by virtue of it’s interoperability between states.

In the new directive, the definition of ‘high value datasets’ is put as:

“documents the re-use of which is associated with important benefits for society, the environment and the economy, in particular because of their suitability for the creation of value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets;” (§2.10)

Although the ordering of society, environment and economy is welcome, there are subtle but important differences from the definition advanced in a 2014 paper from W3C and PwC for the European Commission which described a number of factors for determining whether there was high value to making a dataset open (and standardising it in some ways). It focussed attention on whether publication of a dataset:

  • Contributes to transparency
  • Helps governments meet legal obligations
  • Relates to a public task
  • Realises cost reductions; and
  • Has some value to a large audience, or substantial value to a smaller audience.

Although the recent tender talks of identifying “socio-economic” benefits of datasets, overall it adopts a strongly economic frame, seeking quantification of these and asking in particular for evaluation of “potential for AI applications of the identified datasets;”. (This particular framing of open data as a raw material input for AI is something I explored in the recent State of Open Data book, where the privacy chapter also picked up on a brief exploration how AI applications may also create new privacy risks for release of certain datasets.)  But to keep wider political and social uses of open data in view, and to recognise that quantification of benefits is not a simple process of adding up the revenue of firms that use that data, any comprehensive method to explore high value datasets will need to consider a range of issues, including that:

  • Value is produced in a range of different ways
  • Not all future value can be identified from looking at existing data use cases
  • Value may result from network effects
  • Realising value takes more than data
  • Value is a two-sided calculation; and
  • The distribution of value matters as well as the total amount

I dig into each of these below.

Value is produced in different ways

A ‘raw material’ theory of change still pervades many discussions of open data, in spite of the growing evidence base about the many different ways that opening up access to data generates value. In ‘raw material’ theory, open data is an input, taken in by firms, processed, and output as part of new products and services. The value of the data can then be measured in the ‘value add’ captured from sales of the resulting product or service. Yet, this only captures a small part of the value that mandating certain datasets be made open can generate. Other mechanisms at play can include:

  • Risk reduction. Take, for example, beneficial ownership data. Quite asides from the revenue generated by ‘Know Your Customer (KYC)’ brokers who might build services off the back of public registers of beneficial ownership, consider the savings to government and firms from not being exposed to dodgy shell-companies, and the consumer surplus generated by supporting a clamp down on illicit financial flows into the housing market by supporting more effective cross-border anti-money laundering investigations. OpenOwnership are planning research later this year to dig more into how firms are using, or could use, beneficial ownership transparency data including to manage their exposure to risk. Any quantification needs to take into account not only value gained, but also value ‘not lost’ because a dataset is made open.
  • Internal efficiency and innovation. When data is made open, and particularly when standards are adopted, it often triggers a reconfiguration of data practices inside the data (c.f. Goëta & Davies), with the potential for this to support more efficient working, and enable innovation through collaboration between government, civil society and enterprise. For example, the open publication of contracting data, particularly with the adoption of common data standards, has enabled a number of governments to introduce new analytical tools, finding ways to get a better deal on the products and services they buy. Again, this value for money for the taxpayer may be missed by a simple ‘raw material’ theory.
  • Political and rights impacts. The 2014 W3C/PWC paper I cited earlier talks about identifying datasets with “some value to a large audience, or substantial value to a smaller audience.”. There may also be datasets that have low likelihood of causing impact, but high impact (at least for those affected) when they do. Take, for example, statistics on school admissions. When I first looked at use of open data back in 2009, I was struck by the case of an individual gaining confidence from the fact that statistics on school admission appeals were available (E7) when constructing an appeal case against a school’s refusal to admit their own child. The open availability of this data (not necessarily standardised or aggregated) had substantial value to empowering a citizen in securing their rights. Similarly, there are datasets that are important for communities to secure their rights (e.g. air quality data), or to take political action to either enforce existing policy (e.g. air quality limits), or to change policy (e.g. secure new air quality action zones). No only is such value difficult to quantify, but whether or not certain data generates value will vary between countries in accordance with local policies and political issues. The definition of EU-wide ‘high value datasets’ should not crowd out the possibility or process of defining data that is high-value in particular country. That said, there may at least be scope to look at datasets in the study categories that have substantial potential value in relation to EU social and environmental policy priorities.

Beyond the mechanisms above, there may also be datasets where we find a high intrinsic value in the transparency their publication brings, even without a clear evidence base that can quantifies their impact. In these cases, we might also talk of the normative value of openness, and consider which datasets deserve a place on the high-value list because we take the openness of this data to be foundational to the kind of societies we want to live in, just as we may take certain freedoms of speech and movement as foundational to the kind of Europe we want to see created.

Not all value can be found from prior examples

The tender cites projects like the Open Data Barometer (which I was involved in developing the methodology for) as potential inspirations for the design of approaches to assess “datasets that should belong to the list of high value datasets”. The primary place to look for that inspiration is not in the published stats, but in the underlying qualitative data which includes raw reports of cases of political, social and economic impact from open data. This data (available for a number of past editions of the Barometer) remains an under-explored source of potential impact cases that could be used to identify how data has been used in particular countries and settings. Equally, projects like the State of Open Data can be used to find inspiration on where data has been used to generate social value: the chapter on Transport is as case-in-point, looking at how comprehensive data on transport can support applications improving the mobility of people with specific needs.

However, many potential uses and impacts of open data are still to be realised, because the data they might work with has not heretofore been accessible. Looking only at existing cases of use and impact is likely to miss such cases. This is where dialogue with civil society becomes vitally important. Campaigners, analysts and advocates may have ideas for the projects that could exist if only particular data was available. In some cases, there will be a hint at what is possible from academic projects that have gained access to particular government datasets, or from pilot projects where limited data was temporarily shared – but in other cases, understanding potential value will require a more imaginative and forward-looking and consultative process. Given the upcoming study may set the list of high value datasets for decades to come – it’s important that the agenda is not be solely determined by prior publication precedent.

For some datasets, certain value comes from network effects

If one country provides an open register of corporate ownership, the value this has for anti-corruption purposes only goes so far. Corruption is a networked game, and without being able to following corporate chains across borders, the value of a single register may be limited. The value of corporate disclosures in one jurisdiction increase the more other jurisdictions provide such data. The general principle here, that certain data gains value through network effects, raises some important issues for the quantification of value, and will help point towards those datasets where standardisation is particularly important. Being able to show, for example, that the majority of the value of public transit data comes from domestic use (and so interoperability is less important), but the majority of value of, say, carbon emission or climate change mitigation financing data, comes from cross-border use, will be important to support prioritisation of datasets.

Value generation takes more than data

Another challenge of of the ‘raw material’ theory of change is that it often fails to consider (a) the underlying quality (not only format standardisation) of source data, and (b) the complementary policies and resources that enable use. For example, air quality data from low-quality or uncalibrated particulate sensors may be less valuable than data from calibrated and high quality sensors, particularly when national policy may set out criteria for the kinds of data that can be used in advancing claims for additional environmental protections in high-pollution areas. Understanding this interaction of ‘local data’ and the governance contexts where it is used is important in understanding how far, and under what conditions, one may extrapolate from value identified in one context, to potential value to be realised in another. This calls for methods that can go beyond naming datasets, to being able to describe features (not just formats) that are important for them to have. 

Within the Web Foundation hosted Open Data Research Network a few years back we spent considerable time refining a framework for thinking about all the aspects that go into securing impact (and value) from open data, and work by GovLab has also identified factors that have been important to the success of initiatives using open data. Beyond this, numerous dataset-specific frameworks for understanding what quality looks like may exist. Whilst recommending dataset-by-dataset measures to enhance the value realised from particular open datasets may be beyond the scope of the European Commission’s current study – when researching and extrapolating from past value generation in different contexts it is important to look at the other complementary factors that may have contributed that value realising alongside the simple availability of data.

Value is a two-sided calculation

It can be temping to quantify the value of a dataset simply by taking all the ‘positive’ value it might generate, and adding it up. But, a true quantification calculation also needs to consider potential negative impacts. In some cases, this could be positive economic value set against some social or ecological dis-benefit. For example, consider the release of some data that might increase use of carbon-intensive air and road transport. While this  could generate quantifiable revenue for haulage and airline firms, it might undermine efforts to tackle climate change, destroying long-term value. Or in other cases, there may be data that provides social benefit (e.g. through the release of consumer protection related data) but that disrupts an existing industry in ways that reduce private sector revenues. 

Recognising the power of data, involves recognising that power can be used in both positive and negative ways. A complete balance sheet needs to consider the plus and the minus. This is another key point where dialogue with civil society will be vital – and not only with open data advocates, but with those who can help consider the potential harms of certain data being more open. 

Distribution of value matters

Last but not least, when considering public investment in ‘high value’ datasets, it is important to consider who captures that value. I’ve already hinted at the fact that value might be captured as government surplus, consumer surplus or producer (private sector) surplus – but there are also relevant question to ask about which countries or industries may be best placed to capture value from cross-border interoperable datasets.

When we see data as infrastructure, then it can help us consider the potential to both provide infrastructure that is open to all and generative of innovation, but also to design policies that ensure those capturing value from the infrastructure are contributing to its maintenance.

In summary

Work on methodologies to identify high value datasets in Europe should not start from scratch, and stand to benefit substantially from engaging with open data communities across the region. There is a risk that a narrow conceptualisation and quantification of ‘high value’ will fail to capture the true value of openness, and to consider the contexts of data production and use. However, there is a wealth of research from the last decade (including some linked in this post, and cited in State of Open Data) to build upon, and I’m hopeful that whichever consultant or consortium takes on the EC’s commissioned study, they will take as broad a view as possible within the practical constraints of their project.

Linking data and AI literacy at each stage of the data pipeline

[Summary: extended notes from an unConference session]

At the recent data literacy focussed Open Government Partnership unConference day (ably facilitated by my fellow Stroudie Dirk Slater)  I acted as host for a break-out discussion on ‘Artificial Intelligence and Data Literacy’, building on the ‘Algorithms and AI’ chapter I contributed to The State of Open Data book.

In that chapter, I offer the recommendation that machine learning should be addressed within wider open data literacy building.  However, it was only through the unConference discussions that we found a promising approach to take that recommendation forward: encouraging a critical look at how AI might be applied at each stage of the School of Data ‘Data Pipeline’.

The Data Pipeline, which features in the Data Literacy chapter of The State of Open Data, describes seven stages for woking with data, from defining the problem to be addressed, through to finding and getting hold of relevant data, verifying and cleaning it, and analysing data and presenting findings.

Figure 2: The School of Data’s data pipeline. Source: https://schoolofdata.org/methodology/
Figure: The School of Data’s data pipeline. Source: https://schoolofdata.org/methodology/

 

Often, AI is described as a tool for data analysis (any this was the mental framework many unConference session participants started with). Yet, in practice, AI tools might play a role at each stage of the data pipeline, and exploring these different applications of AI could support a more critical understanding of the affordances, and limitations, of AI.

The following rough worked example looks at how this could be applied in practice, using an imagined case study to illustrate the opportunities to build AI literacy along the data pipeline.

(Note: although I’ll use machine-learning and AI broadly interchangeably in this blog post, as I outline in the State of Open Data Chapter, AI is a  broader concept than machine-learning.)

Worked example

Imagine a human rights organisation, using a media-monitoring service to identify emerging trends that they should investigate. The monitoring service flags a spike in gender based violence, encouraging them to seek out more detailed data. Their research locates a mix of social media posts, crowdsourced data from a harassment mapping platform, and official statistics collected in different regions across the country. They bring this data together, and seek to check it’s accuracy, before producing an analysis and visually impactful report.

As we unpack this (fictional) example, we can consider how algorithms and machine-learning are, or could be, applied at each stage – and we can use that to consider the strengths and weaknesses of machine-learning approaches, building data and AI literacy.

  • Define – The patterns that first give rise to a hunch or topic to investigate may have been identified by an algorithmic model.  How does this fit with, or challenge, the perception of staff or community members? If there is a mis-match – is this because the model is able to spot a pattern than humans were not able to see (+1 for the AI)? Or could it be because the model is relying on input data that reflects certain bias (e.g. media may under-report certain stories, or certain stories may be over-reported because of certain cognitive biases amongst reporters)?

  • Find – Search engine algorithms may be applying machine-learning approaches to identify and rank results. Machine-translation tools, that could be used to search for data described in other languages, are also an example of really well established AI. Consider the accuracy of search engines and machine-translation: they are remarkable tools, but we also recognise that they are nowhere near 100% reliable. We still generally rely on a human to sift through the results they give.

  • Get – One of the most common, and powerful, applications of machine-learning, is in turning information into data: taking unstructured content, and adding structure through classification or data extraction. For example, image classification algorithms can be trained to convert complex imagery into a dataset of terms or descriptions; entity extraction and sentiment analysis tools can be used to pick out place names, event descriptions and a judgement on whether the event described is good or bad, from free text tweets, and data extraction algorithms can (in some cases) offer a much faster and cheaper way to transcribe thousands of documents than having humans do the work by hand. AI can, ultimately, change what counts as structured data or not.  However, that doesn’t mean that you can get all the data you need using AI tools. Sometimes, particularly where well-defined categorical data is needed, getting data may require creation of new reporting tools, definitions and data standards.

  • Verify – School of Data describe the verification step like this: “We got our hands in the data, but that doesn’t mean it’s the data we need. We have to check out if details are valid, such as the meta-data, the methodology of collection, if we know who organised the dataset and it’s a credible source.” In the context of AI-extracted data, this offers an opportunity to talk about training data and test data, and to think about the impact that tuning tolerances to false-positives or false-negatives might have on the analysis that will be carried out. It also offers an opportunity to think about the impact that different biases in the data might have on any models built to analyse it.

  • Clean – When bringing together data from multiple sources, there may be all sorts of errors and outliers to address. Machine-learning tools may prove particularly useful for de-duplication of data, or spotting possible outliers. Data cleaning to prepare data for a machine-learning based analysis may also involve simplifying a complex dataset into a smaller number of variables and categories. Working through this process can help build an understanding of the ways in which, before a model is applied, certain important decisions have already been made.

  • Analyse – Often, data analysis takes the form of simple descriptive charts, graphs and maps. But, when AI tools are added to the mix, analysis might involve building predictive models, able, for example, to suggest areas of a county that might see future hot-spots of violence, or that create interactive tools that can be used to perform ongoing monitoring of social media reports. However, it’s important in adding AI to the analysis toolbox, not to skip entirely over other statistical methods: and instead to think about the relative strengths and weaknesses of a machine-learning model as against some other form of statistical model. One of the key issues to consider in algorithmic analysis is the ’n’ required: that is, the sample size needed to train a model, or to get accurate results. It’s striking that many machine-learning techniques required a far larger dataset that can be easily supplied outside big corporate contexts. A second issue that can be considered in looking at analysis is how ‘explainable’ a model is: does the machine-learning method applied allow an exploration of the connections between input and output? Or is it only a black box.

  • Present – Where the output of conventional data analysis might be a graph or a chart describing a trend, the output of a machine-learning model may be a prediction. Where a summary of data might be static, a model could be used to create interactive content that responds to user input in some way. Thinking carefully about the presentation of the products of machine-learning based analysis could support a deeper understanding of the ways in which such outputs could or should be used to inform action.

The bullets above give just some (quickly drafted and incomplete) examples of how the data pipeline can be used to explore AI-literacy alongside data literacy. Hopefully, however, this acts as enough of a proof-of-concept to suggest this might warrant further development work.

The benefit of teaching AI literacy through open data

I also argue in The State of Open Data that:

AI approaches often rely on centralising big datasets and seeking to personalise services through the application of black-box algorithms. Open data approaches can offer an important counter-narrative to this, focusing on both big and small data and enabling collective responses to social and developmental challenges.

Operating well in a datified world requires citizens to have a critical appreciation of a wide variety of ways in which data is created, analysed and used – and the ability to judge which tool is appropriate to which context.  By introducing AI approaches as one part of the wider data toolbox, it’s possible to build this kind of literacy in ways that are not possible in training or capacity building efforts focussed on AI alone.

The politics of misdirection? Open government ≠ technology.

[Summary: An extended write-up of a tweet-length critique]

The Open Government Partnership (OGP) Summit is, on many levels, an inspiring event. Civil society and government in dialogue together on substantive initiatives to improve governance, address civic engagement, and push forward transparency and accountability reforms. I’ve had the privilege, through various projects, to be a civil society participant in each of the 6 summits in Brasilia, London, Mexico, Paris, Tbilisi and now Ottawa. I have a lot of respect for the OGP Support Unit team, and the many government and civil society participants who work to make OGP a meaningful forum and mechanism for change. And I recognise that the substance of a summit is often found in the smaller sessions, rather than the set-piece plenaries. But, the summit’s opening plenary offered a powerful example of the way in which a continued embrace of a tech-goggles approach at OGP, and weaknesses in the design of the partnership and it’s events, misdirect attention, and leave some of the biggest open government challenges unresolved.

Trudeau’s Tech Goggles?

We need to call out the techno-elitism, and political misdirection, that mean  the Prime Minister of Canada can spend the opening plenary in an interview that focussed more on regulation of Facebook, than on regulation of the money flowing into politics; and more time answering questions about his Netflix watching, than discussing the fact that millions of people still lack the connectivity, social capital or civic space to engage in any meaningful form of democratic decision making. Whilst (new-)media inevitably plays a role in shaping patterns of populism, a narrow focus on the regulation of online platforms directs attention away from the ways in which economic forces, transportation policy, and a relentless functionalist focus on ‘efficient’ public services, without recognising their vital role in producing social-solidarity,  has contributed to the social dislocation in which populism (and fascism) finds root.

Of course, the regulation of large technology firms matters, but it’s ultimately an implementation detail that some come as part of wider reforms to our democratic systems. The OGP should not be seeking to become the Internet Governance Forum (and if it does want to talk tech regulation, then it should start by learning lessons from the IGFs successes and failures), but should instead be looking deeper at the root causes of closing civic space, and of the upswing of populist, non-participatory, and non-inclusive politics.

Beyond the ballot box?

The first edition of the OGP’s Global Report is sub-titled ‘Democracy Beyond the Ballot Box and opens with the claim that:

…authoritarianism is on the rise again. The current wave is different–it is more gradual and less direct than in past eras. Today, challenges to democracy come less frequently from vote theft or military coups; they come from persistent threats to activists and journalists, the media, and the rule of law.

The threats to democracy are coming from outside of the electoral process and our response must be found there too. Both the problem and the solution lie “beyond the ballot box.”

There appears to be a non-sequitur here. That votes are not being stolen through physical coercion, does not mean that we should immediately move our focus beyond electoral processes. Much like the Internet adage that ‘censorship is damage, route around it, there can be a tendency in Open Government circles to treat the messy politics of governing as a fundamentally broken part of government, and to try and create alternative systems of participation or engagement that seek to be ‘beyond politics’. Yet, if new systems of participation come to have meaningful influence, what reason do we have to think they won’t become subject to the legitimate and illegitimate pressures that lead to deadlock or ‘inefficiency’ in our existing institutions? And as I know from local experience, citizen scrutiny of procurement or public sending from outside government can only get us so far without political representatives willing to use and defend they constitutional powers of scrutiny.

I’m more and more convinced that to fight back against closing civic space and authoritarian government, we cannot work around the edges: but need to think more deeply about about how we work to get capable and ethical politicians elected: held in check by functioning party systems, and engaging in fair electoral competition overseen by robust electoral institutions. We need to go back to the ballot box, rather than beyond it. Otherwise we are simply ceding ground to the forces who have progressively learnt to manipulate elections, without needing to directly buy votes.

Globally leaders, locally laggards?

The opening plenary also featured UK Government Minister John Penrose MP. But, rather than making even passing mention of the UK’s OGP National Action Plan, launched just one day before, Mr Penrose talked about UK support for global beneficial ownership transparency. Now: it is absolutely great that that ideas of beneficial ownership transparency are gaining pace through the OGP process.

But, there is a design flaw in a multi-stakeholder partnership where a national politician of a member country is able to take the stage without any response from civil society. And where there is no space for questions on the fact that the UK government has delayed the extension of public beneficial ownership registries to UK Overseas Territories until at least 2023. The misdirection, and #OpenWashing at work here needs to be addressed head on: demanding honest reflections from a government minister on the legislative and constitutional challenges of extending beneficial ownership transparency to tax havens and secrecy jurisdictions.

As long as politicians and presenters are not challenged when framing reforms as simple (and cheap) technological fixes, we will cease to learn about and discuss the deeper legal reforms needed, and the work needed on implementation. As our State of Open Data session on Friday explored: data and standards must be the means not the ends, and more public scepticism about techno-determinist presentations would be well warranted.

Back, however, to event design. Although when hosted in London, the OGP Summit offered UK civil society at least, an action-forcing moment to push forward substantive National Action Plan commitments, the continued disappearance of performative spaces in which governments account for their NAPs, or  different stakeholders from a countries multi-stakeholder group share the stage, means that (wealthy, and northern) governments are put in control of the spin.

Grounds for hope?

It’s clear that very many of us understand that open government ≠ technology, at least if (irony noted) likes and RTs on the below give a clue. 

But we need to hone our critical instincts to apply that understanding to more of the discussions in fora like OGP. And if, as the Canadian Co-Chair argued in closing, “OGP is developing a new forms of multilateralism”, civil society needs to be much more assertive in taking control of the institutional and event design of OGP Summits, to avoid this being simply a useful annual networking shin-dig. The closing plenary also included calls to take seriously threats to civic space: but how can we make sure we’re not just saying this from the stage in the closing, but that the institutional design ensures there are mechanisms for civil society to push forward action on this issue. 

In looking to the future of OGP, we should consider how civil society spends some time taking technology off the table. Let it emerge as an implementation detail, but perhaps let’s see where we get when we don’t let techo-discussions lead?