The Human Gene Nomenclature Committee have just published new guidance on naming genes: the twists of DNA and RNA that express the traits of living things. The guidance contains one entry that has caught the interest of the data world. Under Scenarios that may Marit a symbol change the guidance includes:
Symbols that affect data handling and retrieval. For example, all symbols that autoconverted to dates in Microsoft Excel have been changed (for example, SEPT1 is now SEPTIN1; MARCH1 is now MARCHF1);
Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software. The description of this problem and workarounds were first highlighted over a decade ago —nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature.
This issue highlights two important consideration for anyone developing data standards:
standards and software interact – and you need to know your users and the tools they will use.
your data will end up in Excel. Accept it and design with this in mind.
Filed under: things to draw on when I finally get around to writing a book on data standards
[Summary: Reflections on the design of ITU Data Pledge project]
The ITU, under their “Global Initiative on AI and Data Commons” have launched a process to create a ‘Data Pledge’, designed as a mechanism to facilitate increased data sharing in order to support “response to humanity’s greatest challenges” and to ”help support and make available data as a common global resource.”.
Described as complementary to existing work such as the International Open Data Charter, the Pledge is framed as a tool to ‘collectively make data available when it matters’, with early scoping work discussing the idea of conditional pledges linked to ‘trigger events’, such that an organisation might promise to make information available specifically in a disaster context, such as the current COVID-19 Pandemic. Full development of the Pledge is taking place through a set of open working groups.
This post briefly explores some of the ways in which a Data Pledge could function, and considers some of the implications of different design approaches.
[Context: I’ve participated in one working group call around the data pledge project in my role as Project Director of the Global Data Barometer, and this is written up in a spirit of open collaboration. I have no formal role in the data pledge project..]
Governments, civil society or private sector
Should a pledge be tailored specifically to one sector? Frameworks for governments to open data are already reasonably well developed, as our mechanisms that could be used for governments to collaborate on improving standards and practices of data sharing.
However, in the private sector (and to some extent, in Civil Society), approaches to data sharing for the public good (whether as data philanthropy, or participation in data collaboratives are much less developed – and are likely the place in which a new initiative could have the greatest impact.
Individual or collective action problems
PledgeBank, a MySociety project that ran from 2005 to 2015, explored the idea of pledging as a solution to collective action problems. Pledges of the form: “I’ll do something, if a certain number of people will help me” are now familiar in some senses through crowdfunding sites and other online spaces. A Data Pledge could be modelled on the same logic – focussing on addressing those collective action problems either where:
A single firm doesn’t want to share certain data because doing so, when no-one else is, might have competitive impacts: but if a certain share of the market are sharing this data, it no longer has competitive significance, and instead it’s public good value can be realised.
The value of certain data is only realised as a result of network effects, when multiple firms are sharing similar and standardised data – but the effort of standardising and sharing data is non-negligible. In these cases, a firm might want to know that there is going to be a Social Return on Investment before putting resources into sharing the data.
However, this does introduce some complexity into the idea of pledging (and the actions pledged) and might, as PledgeBank found, lead also to lots of unrealised potential.
Pledging can also be approached as a means of solving individual motivational problems: helping firms to overcome inertia that means they are not sharing data which could have social value. Here, a pledge is more about making a statement of intent, which garners positive attention, and which commits the firm to a course of action that should eventually result in shared data.
Both forms of pledging can function as useful signalling – highlighting data that might be available in future, and priming potential ecosystems of intermediaries and users.
An organisational or dataset-specific pledge
Should a Pledge be about a general principle of data sharing for social good? Or about sharing a specific dataset? It may be useful to think about the architecture of the Data Pledge involving both: or at least, optionally involving data-specific pledges, under a general pledge to support data sharing for social good.
Think about organisational dynamics. Individual teams in a large organisation may have lots of data they could safely and appropriately share more widely for social good uses, but they do not feel empowered to even start thinking about this. A high-level organisational pledge (e.g. “We commit to share data for social good whenever we can do so in ways that do not undermine privacy or commercial position”) that sets an intention of a firm to support data philanthropy, participate in data collaboratives, and provide non-competitive data as open data, could provide the backing that teams across the organisation need to take steps in that direction.
At the same time, there may be certain significant datasets and data sources that can only be shared with significant high-level leadership from the organisation, or where signalling the specific data that might be released, or purposes it might be released for, can help address the collective action issues noted above. For these, dataset specific pledging (e.g. “We commit to share this specific dataset for the social good in circumstance X ”) can have significant value.
Triggers as required or optional
Should a pledge be structured to place emphasis on ‘trigger conditions’ for data sharing? Some articulations of the Data Pledge appear to think of it as a bank of data that could be shared in particular crisis situations. E.g. “We’ll share detailed supply chain information for affected areas if there is a disaster situation.”.There are certainly datasets of value that might not be listed as a Pledge unless trigger conditions can be described, but it’s important that the design of a pledge does not present triggers as essentially shifting any of the work on data sharing to some future point. Preparing for data to be used well and responsibly in a crisis situation requires work in advance of the trigger events: aligning datasets, identifying how they might be used, and accounting carefully for possible unintended consequences that need to be mitigated against.
There are also many global crisis we face that are present and ongoing: the climate crisis, migration, and our collective failure to be on track against the Sustainable Development Goals.
Brokering and curating
Data is always about something, and different datasets exist within (and across) different data communities and cultures. To operationalise a pledge will involve linking actors pledging to share data into relevant data communities: where they can understand user needs in more depth, and be able to publish with purpose.
The architecture of a Data Pledge, and of any supporting initiative around it, will need to consider how to curate and connect the many organisations that might engage – building thematic conversations, spotting thematic spaces where a critical mass of pledges might unlock new social value, or identifying areas where there are barriers stopping pledges turning into data flows.
Incorporating context, consent and responsible data principles
Increased data sharing is not an unalloyed good. Approaching data for the public good involves balancing openness and sharing, with robust principles and practices of data protection and ethics, including attention to data minimisation, individual rights, group data privacy, indigenous data sovereignty and dataset bias. Data should also be shared with clear documentation of it’s context, allowing an understanding of its affordances and limitations, and supporting debate over how data ecosystems can be improved in service of social justice.
A Pledge has an opportunity to both set the bar for responsible data practice, and to incentivise organisational thinking about these issues, by including terms that require pledging organisations to uphold high standards of data protection, only sharing personal data with clear informed consent or personal-derived data after clear processes that consider privacy, human rights and bias impacts of data sharing. Similarly, organisations could be asked to commit to putting their data in context when it is shared, and to engaging collaboratives with data users.
There may also be principles to incorporate here about transparency of data sharing arrangements – supporting development of norms about publishing clearly (a) who data is shared with and for what purpose; and (b) the privacy impact assessments carried out in advance of such shares.
Conditional on capacity?
Should pledging organisations be able to signal that they would need resources in order to make certain data available? I.e. We have Dataset X which has a certain social value: but we can’t afford to make this available with our internal resources? For low-resource organisations, including SMEs or organisations operating in low income economies, this could be a way to signal to philanthropic projects like data.org a need for support. But it could also be used by higher-resource organisations to put a barrier in front of data sharing. However, if a Pledge targets civil society pledgees, then allowing some way to indicate capacity needs if data is to be shared is likely to be particularly important.
A synthesis sketch
Whilst ideologically, I’d favour a focus on building and governing data commons, more directly addressing the modern ‘enclosure’ of data by private firms, and not forgetting the importance of proper taxation of data-related businesses to finance provision of public goods, if it’s viable to treat a data pledge as a pragmatic tool to increase availability for data for social good uses, then I’d sketch the following structure:
Target private sector organisations
A three part pledge
1. A general organisational commitment to treat data as a resource for the public good;
2. A linked organisational commitment to responsible data practices whenever sharing data;
3. An optional set of dataset specific pledges, each with optional trigger conditions
A platform allowing pledging organisations to profile their pledges, detail contact points for specific datasets and contact points for organisation-wide data stewards, and to connect with potential data users;
A programme of work to identify pre-work needed to allow data to be effectively used if trigger conditions are met ;
The only narrative of Rhodes I recall from that time, was one of the college’s proud connection to its alumni and benefactor. To my shame, whilst with student campaigners I was active against contemporary donations to the University that appeared to buy naming rights and launder the reputations of questionable modern day donors – I left unexplored how the ongoing honouring of past donors had allowed them to ‘buy’ a ‘controversial reputation’ instead of the condemnation their actions deserve. Nor did I consider then how the memorialisation of Rhodes plays a part (even if small compared to other factors) in perpetuating the continued exclusion of marginalised communities from Oxford, and in reinforcing barriers to people from (Oxford) minorities taking greater ownership over the institutions of the University.
The college has a (belated) opportunity to make the right statement with the removal of the Rhodes statue. Leave it there, and Rhodes remains a ‘controversial figure’ and the college an institution concerned only with reproducing “an educated ruling class” (to quote from the college’s essay on Rhodes). Move it to a museum where it belongs, and the conversation with every undergraduate can be about our importance of questioning and learning from history – using education as a means of creating a more just future. The teachable moment will be all the stronger when the statue’s niche stands empty.
[Summary: Exploring inclusion impacts of data and standards in response to a paper on Open Contracting & inclusion]
Yesterday I had the pleasure of joining a call hosted by HIVOS, and chaired by ILDA’s Ana Sofia Ruiz, to discuss a recent paper from Michael Canares and François van Schalkwyk on “Open Contracting and Inclusion”. The paper is well worth a read, and includes a review of five cases against a theoretical framework looking across data flows, opportunities for action, infomediary presence, and through to inclusion outcomes (see table below for example of how these play out in a few of the cases reviewed)
After the discussion, we were asked to summarise some of our inputs – hopefully feeding into a wide write-up. However, in case what I’ve written up doesn’t really fit the format of that, I’m posting a cleaned up and slightly expanded version of the remarks I made below:
This paper, and the discussion around it, raises a number of valuable questions – drawing on a rich theoretical landscape to post them.
Firstly, it asks us “How are data flows being disrupted?”. This question is important, because in many ‘open contracting’ projects it is rarely explicitly asked. We’re living in a time of mass disruption, yet open contracting is often ‘sold’ as a kind of reform. One of the widely used success stories for work on open contracting data comes from Ukraine, where there was a true disruption in data flows – using the moment of revolution to reconfigure patterns of procurement, and to create data infrastructures that enabled those new more open practices.
Secondly, this paper calls on us to question “what is the value of data in bringing about inclusion?” In the past we’ve talked about whether open data is either necessary or sufficient to create change. The answer I take from this paper is that increased accessibility of information and data is ‘a very useful, but nowhere near sufficient’ condition for inclusive change.
Thirdly, the use of Castell’s framework from Communication Power of ‘network power’ (shaping the information that can be transmitted), ‘networking power’ (gatekeeping which information is transmitted), and ‘networked power’ (control by one node in the network of others), and ideas of ‘programming the network’ and ‘reprogramming the network’, raise some critical questions about the role of data standards. Often treated as neutral artefacts, standards are in fact sites of power, and of the negotiation of network and networking power. A standard defines what can be expressed, and its implementation involves choosing what will be expressed. Standards can be at once tools that cross contexts, taking with them the potential of inclusion and exclusion (network power), and at the same time, have that potential left inert if the localised networking power decides not to take up inclusion oriented features.
To put this more concretely (if still a little complex I fear), the Open Contracting Data Standard was explicitly designed with a technical architecture that permits data about any given contracting process to be published by any actor, not only the ‘official’ information provider, and with a mechanism for extensions, supporting new fields of data to be attached to a contracting process. The ‘protocol’ sought to be inclusive. However, in practice, most tools have not been built to exploit this feature – meaning that in practice, the ‘platforms’ that exist don’t support inclusion of alternative perspectives on the state of a contracting process. This highlights that even at the level of the technical infrastructures, these are not made once, but have to be constantly remade, and their inclusive potential reinforced.
Fourth, the paper calls for a renewed focus on both governance context, and on intermediaries. Whilst technical artefacts can cross contexts, intermediary capacity building needs significant investment setting-by-setting. Equally, the discussion brought into view that this cannot be a short-term process. Intermediaries need not only skills, but also stocks of trust, in order to broker connections and communication. One of the evaluation team who had worked on a case covered by the paper discussed how it was individuals’ ability to maintain trusted relationships across different stakeholder groups that was critical to connecting information and empowerment. The importance of this cannot be overstated.
Fifth, and finally, in his opening statement, Michael Canares challenged us to consider whether Open Contracting is different from other public sector reforms? After all – there have been decades of procurement reform. To this, I’m prepared to advance an answer: There is a meaningful qualitative difference with government reforms that start from the premise of openness. When a commitment to being open by default is put into practice, the configuration of actors involved in creating change is different, and conventional patterns of bureaucratic reform can be disrupted. Whether they are disrupted or not depends still on individual internal and external actors, and on whether the culture, as well as the practice, of openness has been brought into play. Nevertheless, Open Contracting has certain potential that is simply absent from past procurement reforms – and that is something to continue to build on.
The challenge ahead now is to work out what to do with these questions. We’re starting to unpack the complexity of open contracting practice – and the nuances for each individual setting. But, if all we have are critical questions, we risk inaction rather than advances in inclusion. During the early development of the Open Contracting Data Standard we often turned to the mantra that we should not let the perfect be the enemy of the good. This carries forward: as we avoid the perfect being the enemy of making things better. I’d contend that we need to continue turning our learning into tooling – whether technical tools, evaluation frameworks, to simple planning tools for new initiatives. Only then can we be part of taking on the large scale reforms that this time of disruption needs.
[Summary: following the Bellagio Center thematic month on AI last year, I was asked to write up some brief notes on where data standards fit into contemporary debates on AI governance. The below article has just been published in the Rockefeller ‘notebook’ AI+1: Shaping our Integrated Future*]
Modern AI was hailed as bringing about ‘the end of theory’. To generate insight and action no longer would we need to structure the questions we ask of data. Rather, with enough data, and smart enough algorithms, patterns would emerge. In this world trained AI models would give the ‘right’ outcomes, even if we didn’t understand how they did this.
Today this theory-free approach to AI is under attack. Scholars have called out the ‘bias in, bias out’ problem of machine-learning systems, showing that biased datasets create biased models — and, by extension, biased predictions. That’s why policy makers now demand that if AI systems are used to make public decisions, their models need to be ‘explainable’, offering justifications for the predictions they make.
Yet, a deeper problem is rarely addressed. It is not just the selection of training data, or the design of algorithms, that embeds bias and fails to represent the world we want to live in. The underlying data structures and infrastructures on which AI is founded were rarely built with AI uses in mind, and the data standards — or lack thereof — used by those datasets place hard limits on what AI can deliver.
From form fields for gender that only offer a binary choice, to disagreements over whether or not a company’s registration number should be a required field when applying for a government contract, data standards define the information that will be available to machine-learning systems. They set in stone hidden assumptions and taken-for-granted categories that make possible certain conclusions, while ruling others out, before the algorithm even runs. Data standards tell you what to record, and how to represent it. They embody particular world views, and shape the data that shapes decisions.
For corporations planning to use machine-learning models with their own data, creating a new data field or adapting available data to feed the model may be relatively easy. But for the public good uses of AI, which frequently draw on data from many independent agencies, individuals or sectors, syncing data structures is a challenging task.
Opening up AI infrastructure
However, there is hope. A number of open data standards projects have launched since 2010.
They include the International Aid Transparency Initiative (IATI) — which works with international aid donors to encourage them to publish project information in a common structure — and HXL, the Humanitarian eXchange Language, which offers a lightweight approach to structure spreadsheets with ‘Who, What, Where’ information from different agencies engaged in disaster response activities.
When these standards work well, they allow a broad community to share data that represents their own reality, and make data interoperable with that from others. But for this to happen, standards must be designed with broad participation so that they avoid design choices that embed problematic cultural assumptions, create unequal power dynamics, or strike the wrong balance between comprehensive representation of the world and simple data preparation. Without the right balance certain populations may drop out of the data sharing process altogether.
To use AI for the public good, we need to focus on the data substrata on which AI systems are built. This requires a primary focus on data standards, and far more inclusive standards development processes. Even if machine learning allows us to ask questions of data in new ways, we cannot shirk our responsibility to consciously design data infrastructures that make possible meaningful and socially just answers.
*I’ve only got print copies of the publication right now: happy to share locally in Stroud, and will update with a link to digital versions when available. Thanks to Dor Glick at Rockefeller for the invite and brief for this piece, and to Carolyn Whelan for editing.
[Summary: a brain-dump of thoughts on approaches to data standardisation relevant in the current coronavirus context.]
Over the last few weeks I’ve talked with a number of initiatives that are seeking to bring greater coherence to data collection on the impacts that coronavirus is having on their constituencies. Thousands of organisations, from chambers of commerce, to charity networks, and international agencies, are sending out surveys, or soliciting inputs, to help them understand the social, economic, organisational and operational impacts of the current pandemic – and to start charting ways forward in response.
This has led to a number of conversations asking how data standards could help. Common fears of wasted effort in duplicate data collection, missed insights from siloed data, or confusion created by incompatible categorisations, are all being compounded by the rapid data collection needs in this crisis. Yet, creating new standards can be a time-consuming process: involving in-depth negotiation of different user needs and capacities, careful drafting of definitions, and rigorous testing of schemas, in order to develop something that can function as an equitable tool for long-term communication and collaboration. That doesn’t mean, however, that it’s not possible to iterate towards more aligned and standardised data right now.
In this post I’ll try and set out a few (non-exhaustive) considerations on where some of the data standardisation practices I’ve engaged with over recent years fit in the current landscape, and some approaches to move towards aligning data collection initiatives.
Documentation, documentation, documentation
There are a couple of different parts of a data standard, including definitions that describe what the data should cover, and what each field is about and schemas that determine how the data should be encoded, serialised and shared. But it is documentation that brings these together, and makes them widely usable.
Good documentation should allow people designing data collection instruments (surveys, studies etc.) to quickly identify the building blocks of standardisation that they can draw upon, and should make following the standard the path of least resistance, rather than an uphill struggle.
Ideally documentation should be clearly versioned, and, if intended for global use, published in ways that support language translation.
Start from user needs
It’s easy to fall into the trap of being ‘data driven’, and trying to work out ways to bring together ’all the data’ by imposing top-down structures on data collection or aggregation. But, in working out where to prioritise alignment of definitions and structures it’s crucial to be driven user need. In a crisis context, it may help to identify the primary user need that data pipelines are being built to meet (e.g. a dashboard for operational decision making), and secondary user needs that is is desirable to meet too (e.g. evaluating whether support has been provided equitably; gathering baselines for future research; supporting advocacy for funding certain needs). This will help guide decisions on…
…’just enough standardisation’
Standards are about the distribution of costs and benefits between data producers, intermediaries and data users. Without any standards, data users wanting to draw on data from different sources have to do all the work of reconciling differences and inconsistencies – and sometimes find different datasets are simply irreconcilable. Where multiple datasets have compatible definitions, but different schemas, if may be possible for intermediaries to do the work of creating a consistent dataset by standardising non-standard data. Where data produces are made responsible for data standardisation, they have to do the work of reconciling their own business needs and local definitions, with the definitions and structures provided by a standard.
In the early stages of a crisis, the focus should be on what intermediaries can do: keeping the burden on data producers and users as low as possible, and focussing only on essential standardisation (guided by an understanding of user needs). By seeking to reconcile data from different sources, intermediaries will quickly learn which gaps in data alignment or standardisation are most costly to creating interoperable datasets.
Whilst adopting standards like the Open Contracting Data Standard or Beneficial Ownership Data Standard involves working with organisations over many months and even years to align their data (and in some cases, underlying business processes) with a shared model – in a crisis response, data producers need light-weight building blocks that make their job easier – giving them content to copy and paste into surveys, or data structures that can be easily implemented.
One well-developed approach for alignment in a crisis context comes from HXL – the Humanitarian eXchange Language which provides a simple approach to mark-up columns in spreadsheets using a collection of known # hash-tags, and then provides tools to combine and filter tagged data.
It’s rare that you will need to ‘invent’ any standards from scratch: standardisation is often an assembly job: working out which existing standards to align with and which pieces are aligned enough to work together. As a starting point I often turn to schema.org, the ad-hoc effort by search engines to create a common (and relatively loose) vocabulary of terms to describe everything from people, local businesses and books, to pandemic related data, or I look at conventions at use in existing datasets in the domain I’m helping create data models for.
Certain lower-level conventions, like using ISO Dates, unicode for text, and ISO language and country codes, are also worth encouraging and documenting: although in most cases as long as a data source is internally consistent in how it encodes countries, dates, languages and so-on, intermediaries will be able to more-or-less map the data to common codes over the short-term.
I say that one should ‘critically’ re-use existing standards, because, as the fantastic Data Feminism book underscores, definitions of data are about power: about whose lived experience and accounts of the world will be represented and shared. There is often a balance to strike between adopting common ways of representing the world, and challenging oppressive and problematic representations.
Particularly when building standards for use across national and cultural boundaries, this calls for an awareness of the many falsehoods embedded in data models, and consideration of the embedded assumptions in off-the-shelf data models. It can also call for a sensitivity to when standards, even in a crisis, should not take the path of least resistance, but should introduce some friction in deciding which categories to use, or how to disaggregate data. For example, where user needs (and here is where considering diverse secondary user needs can be important, as ‘primary user needs’ may often represent dominant power perspectives) require an understanding of how data varies by gender, or the ability to provide intersectional disaggregation, then standards should make clear how this should be recorded and shared.
Look for the keys
One way to lower the burden on data collectors is to look for the keys that unlock additional existing open datasets. For example:
Postcodes in many countries allow data to be geocoded, and allow you to integrate a range of local classifications and statistics. In the UK, collecting the postcode of where a service is delivered allows you to look up the socio-economic status of the are, the local authority responsible for service delivery there, and a whole host of other information. In other countries, location data may be possible to match with satellite observation data to infer other relevant classifications for a survey respondent.
Organisation identifiers – which, if collected and well validated, can be reconciled against public databases to find information on companies, charities and other entities. In the UK, a Charity number can be used to look up classification data on the organisation’s beneficiaries taken from annual charity returns. For many nations, company numbers can be reconciled against OpenCorporates to provide detailed corporate information.
URLS and Social Media IDs can be useful in some use-cases to crawl web pages and social network and find signals about the networks an organisation is part of, of the topics they work on.
Each sector and domain is also likely to have some of its own ‘keys’ that can hook into existing datasets (e.g. the Common Procurement Vocabulary for classifying public procurements in Europe). If you are lucky, they will be attached to relevant open datasets.
Care still needs to be taken to consider gaps in the lookup data (e.g. some countries lack open corporate register data; satellite data coverage varies; not all organisations have websites), and to avoid introducing biases through faulty assumptions (e.g. if assuming the ‘register office’ postcode of UK charities is where their beneficiaries are, then it looks like London gets more funding than it does). It’s also important to consider how easy it will be for those providing data to enter it. For example, do organisations know their registration number? (On the organisation identifiers point, this is one of the reasons I was involved in creating org-id.guide and there remains a lot still to do in this area).
Decide on your approach to categories
At the heart of many standardisation processes is classification: sorting needs, organisations, events or people into categories. Standardising categories can be notoriously difficult: and is often hard to do in a rush. You might find there are existing classification schemes you can draw upon, or you might find a need to create your own (or, as LandVoc has done, albeit over a number of years, to engage with an existing classification scheme to get the elements you need included).
Good documentation of the boundaries of a category (ideally with examples) is vital for them to be used in interoperable ways.
Many of the standards I’ve worked on have stepped back from settling categorisation debates, but representing classification elements in terms of:
A vocabulary – to allow different datasets to use different classification schemes
A code – that stays constant across languages
A label – that can be translated into local languages
This offers a way to at least avoid two people talking about different things with the same terms, but leaves the alignment problem to later.
In an ideal world, a rapid standardisation project might be able to provide ‘good enough’ categories for data collectors to start with, but then offer them some level of flexibility so that individual data collection exercises can address their local user needs by adapting core categorisations.
Semantic standards such as SKOS have a lot to offer to efforts to bring together data using heterogenous classification schemes: allowing not only hierarchical relationships (i.e. the ability to add a ‘narrower’ concept under a headline category), but also broad and narrow matches between neighbouring concepts. However, tools and skills for working well with this kind of data and classification structure are, in my experience, quite scarce.
One of the most important things to help intermediaries align different datasets is ‘data about the data’. Knowing who collected a dataset (ideally with ability to contact them), knowing when and where it was collected, and ideally having pointers to the survey forms or data collection instruments used can make the process of ingesting and reconciling disparate datasets at lot, lot easier.
Conventions like MetaTab provide an easy way to get started providing standardised meta-data when circulating spreadsheets, and there are well established standards for meta-data in most domains.
Meta-data should also include clear information on restrictions or permissions that apply to re-use of a dataset, which brings me onto:
Don’t forget standards of data governance
The first question to ask before making use of any dataset that might contain sensitive information from individuals or organisations is: do I have the right to use this data? Does using or sharing this data (or analysis based on it), put anyone at risk?
As the responsible data initiative puts it, there is a:
…collective duty to account for unintended consequences of working with data by:
1) prioritising people’s rights to consent, privacy, security and ownership when using data in social change and advocacy efforts,
2) implementing values and practices of transparency and openness.
Working out early on a set of shared procedures for assessing the need for, obtaining and recording consents from data subjects for data sharing and re-use can avoid hitting barriers later on. This might take a number of forms, such as:
Identifying the different states that consent might take (.e.g. consent for data to be ‘shared’ with identified partners, or consent for non-personal data to be ‘open’ – drawing on the ODI’s data spectrum and how these should be encoded in each relevant row of a dataset;
Adding a section to meta-data templates for those sharing data to indicate who else data can be shared with, and if any fields should be masked from an open version of a dataset.
Standards are about people
Lastly, but by no means least – it is important to think of standards as a process, not a product. That documentation I mentioned at the start? That’s not for users: that’s for you. Because most of the time people don’t read documentation: they don’t have the time, or don’t know where to start. In reality, most of the standards I’ve worked on require conversations, engagement and feedback to help people align their data with them.
If someone is designing a data collection survey, the prime opportunity for standardisation is between their first draft, and it going out in the field. If you can get into a conversation then, and provide prioritised feedback on how it can align more with the documented standard, how it could incorporate some ‘key fields’ that will unlock other data, or how the consent questions could be worded to be compatible with shared data governance, then you have a chance of the data that flows from that data collection will be possible to bring together as part of a wider aligned insight datasets.
In all the standards I’ve worked on, the ‘Helpdesk’ team have been as vital as the documentation and schema to making standards truly work as tools of coordination and collaboration.
I started writing this just before the Christmas break, but got interrupted by both festivities and flu. So, below, a slightly belated look back at 2019: where yet again my blogging has been far too sporadic.
My other FOI adventures of 2019 have been less conclusive:
Gloucestershire’s refusal to provide prices and buyers of the public land they have sold off means the only way to piece this together would be by spending £100s on land registry records: something I’ve not had space to pursue. Promises that this information would be published proactively from September have been broken by Cabinet – and our experiment in using the Local Audit and Accountability Act in June to look at relevant documents didn’t appear to provide a full overview. It seems profoundly odd that there is so little transparency over how public assets are being disposed of.
Much of March was spent working on final editing of chapters for The State of Open Data, and then, late in the month, heading to Paris for The Impacts of Civic Technology (TicTec) conference to present initial finings with my co-editor, Mor. An evening reception and hearing about digital democracy and participation projects at French National Assembly was particularly inspiring.
April – Printing and Driving
2019 was supposed to be a bit of a sabbatical year (learning point: I’m not very good at sabbaticals), but in late March and April I did finally get round to my two main goals of: (a) learning a bit about printmaking; (b) passing my driving test.
A wonderful two day workshop with Rod Nelson had me exploring woodcut designs exploring field patterns and the Stroud landscape.
Getting hold of physical copies of The State of Open Data book was a great moment: as at times the project has felt quite beyond delivery. I’m pretty pleased indeed with how it turned out – with contributions from 60+ authors, and many more reviewers and contributors.
I’ve still got a few hard copies that can go free to University or organisational libraries, so if you’ve read this far, and you would like one – do drop me a note.
June – Facilitation fun with IATI
In June I took another #slowTravel trip – heading to Copenhagen by train to facilitate a workshop for the International Aid Transparency Initiative’s technical community on the draft strategy.
Wrapping up a very enjoyable three days facilitating technical community & strategy consultation workshop for @IATI_aid at UN City in Copenhagen. Great to see the passion of the group for improving the data infrastructure of aid & humanitarian coordination #IATIpic.twitter.com/pZcHUoMeIE
This followed some online facilitation work for strategy dialogues earlier in the year. I’ve also had chance this year to co-facilitate an online dialogue for Land Portal: reminding me how much I enjoy this kind of blended online and offline facilitation work. Perhaps something to explore more in 2020.
The weather and walk was stunning – and a real chance for reflection.
August – Impact Bonds and Waste Management
Besides the annual August pilgrimage to Greenbelt, it was a month of interesting UK projects – including work with the Government Outcomes Lab at the University of Oxford to scope out ways to improve transparency and data sharing around Social Impact Bonds, and contributing to a(sadly unsuccessful) pitch by Open Data Manchester and Dsposal to secure innovation funding to build on their prototype KnoWaste standard.
September – Civic Media Observatory
In September, I had my first opportunity to work in-depth on a project with the fantastic Global Voices team – using AirTable to rapid prototype a database and workflow for tracking and analysing mainstream media, social media, and offline events through a local lens, and understanding the context and subtext of the media that platform moderators may be asked to make snap judgements over.
I spent all of October in Italy, first as a residential fellow at the Rockefeller Bellagio Center in Italy, and then with a brief vacation in Verona, and quick trip to Rome to work with Land Portal.
Taking part in the Bellagio Center’s thematic month on Artificial Intelligence was quite simply a once in a lifetime opportunity. I didn’t write much about it at the time (as I was busy trying to pull together the outline of a new book proposal) and with an election called in the UK just as we were heading home, haven’t had the space to follow up. Hopefully some point next year I’ll be publishing a few outputs from the month.
However, I can’t leave my fellow resident’s work un-shared, so if I’ve not already signposted the below to you, do take time to:
Consider new challenges of digital colonialism in work coming up from Rumman Chowdhury; and
Find out how funders are thinking about AI futures in work coming up from Vilas Dhar.
I should also mention one of the other highlights of the residency: enjoying two shows, numerous tricks. and sage advice from ‘Magician in residence’ Brad Barton, Reality Thief – go see him if you are ever in the Bay Area!
November – Elections!
I returned from Italy right into the middle of the biggest General Election campaign Stroud District Green Party have ever run, for the fantastic Molly Scott Cato. It was a month both spent both on the doorstep, and juggling spreadsheets – exploring the reality of values-based volunteer-driven political campaigning in an era of data.
December – Global Data Barometer
Over November and December I was also working on the scoping for a potential new project – the Global Data Barometer – a successor to the Open Data Barometer study I helped create at the Web Foundation back in 2013. The goal is to explore how a 100+ country study could provide insight into patterns of ‘responsible re-use’ of data around the world – capturing both use of data as a resource for sustainable development – and efforts to manage the risks that the unregulated collection and processing of ever increasing quantities of data might create. I published the initial draft research framework just before Christmas, and will be exploring the project more in a workshop in Washington next week.
Over 2020 I’m looking forward to more work on the Global Data Barometer, and with the Open Ownership team, as well as some further facilitation projects, and, hopefully, a bit more writing time! We’ll see.
I’m spending much of this October as a resident fellow at the Bellagio Centre in Italy, taking part in a thematic month on Artificial Intelligence (AI). Besides working on some writings about the relationship between open standards for data and the evolving AI field, I’m trying to read around the subject more widely, and learn as much as I can from my fellow residents.
As the first of a likely series of ‘thinking aloud’ blog posts to try and capture reflections from reading and conversations, I’ve been exploring what Wittgenstein’s later language philosophy might add to conversations around AI.
Wittgenstein and technology
Wittgenstein’s philosophy of language, whilst hard to summarise in brief, might be conveyed through reference to a few of his key aphorisms. §43 of the Philosophical Investigations makes the key claim that: ”For a large class of cases–though not for all–in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language.” But this does not lead to the idea that words can mean anything: rather, correct use of a word depends on its use being effective, and that in turn depends on a setting, or, as Wittgenstein terms it, a ‘language game’. In a language game participants have come to understand the rules, even if the rules are not clearly stated or entirely legible: we engage successfully in language games through learning the techniques of participation, acquired through a mix of instruction and of practice. Our participation in these language games is linked to the idea of ‘forms of life’, or, as it is put in §241 of the Philosophical Investigations, “It is what human beings say that is false and true; and they agree in the language they use. That is not agreement in opinions but in form of life.”.
As I understand it, one of the key ideas here can be expressed by stating that meaning is essentially social, and it is our behaviours and ways of acting, constrained by wider social and physical limits, that determine the ways in which meaning is made and remade.
Where does AI fit into this? Well in Wittgenstein as a Philosopher of Technology: Tool Use, Forms of Life, Technique, and a Transcendental Argument, Coeckelbergh & Funk (2018) draw on Wittgenstein’s tool metaphors (and professional history as an engineer as well as philosopher) to show that we can apply a Wittgensteinian analysis to technologies, explaining that: that “we can only understand technologies in and from their use, that is, in technological practice which is also culture-in-practice.” (p 178) . At the same time, they point to the role of technologies in constructing the physical and material constraints upon plausible forms of life:
Understanding technology, then, means understanding a form of life, and this includes technique and the use of all kinds of tools—linguistic, material, and others. Then the main question for a Wittgensteinian philosophy of technology applied to technology development and innovation is: what will the future forms of life, including new technological developments, look like, and how might this form of life be related to historical and contemporary forms of live?[sic] (p 179)
It is important though to be attentive to the different properties ofdifferent kinds of tools in use (linguistic, material, technological) within any form of life. Mass digital technologies, in particular, appears to spread in less negotiable ways: that is, some new technology introduced, whilst open to be embedded in forms of life in some subtly different ways, often has core features presented only on a take-it-or-leave-it basis, and, once introduced, can be relatively brittle and resistant to shaping by its users.
So – as new technologies are introduced, we may find that they reconfigure the social and material bounds of our current forms of life, whilst also introducing new language games, or new rules to existing games into our social settings. And with contemporary AI technologies in particular, a number of specific concerns may arise.
AI Concerns and Critical Responses
Before we consider how AI might affect our forms of life, a few further observations (and statements of value):
The plural of ‘forms’ is intentional. There are variations in the forms of life lived across our planet. Social agreements in behaviour and action vary between cultural settings, regions or social strata. Many humans live between multiple forms of life, translating in word and behaviour between the different meanings each requires. Multiple forms are not strictly dichotomous: different forms of life may have many resemblances, but their distinctions matter and should be valued (this is an explicit political statement of value on my part).
There have been a number of social projects to establish certain universal forms of life over past centuries. For example, the development of consensus on human rights frameworks is one of these. seeking equitable treatment of all (I also personally subscribe to the view that a high level of respect for universal human rights should feature as a constraint toall forms of life).
Within this trend, there are also a number of significant projects seeking to establish greater acceptance of different ways of living, including action to reverse the victorian imposition of certain normative family structures, work to afford individuals greater autonomy in defining their own identities, and activity to embed much more ecological models of thinking about human society.
These trends (or ongoing social struggles if you like) seeking to make our ways of living more tolerant, open,inclusive and sustainable are important to note when we consider the rise of AI systems. Such systems are frequently reliant on categorised data, and on a reductive modelling of the human experience based on past, rather than prospective, data.
This noted, it appears then that we might point to two distinct forms of concern about AI:
(A) The use of algorithmic systems, built on reductive data, risks ossifying past ways of life (with their many injustices), rather than supporting struggles for social justice that involve ongoing efforts to renegotiate the meaning of certain categories and behaviours.
(B) Algorithmic systems may embody particular ways of life that, because of the power that can be exercised through their pervasive operation, cause those forms of life to be imposed over others. This creates pressure for humans to adapt their ways of life to fit the machine (and its creators/owners), rather than allowing the adaptation of the machine to fit into different human ways of life.
Gender detection software is AI trained to judgethe gender of a person from an image (or from analysing names, text or some other input). In general, such systems define gender using a male-female binary. Such systems are being widely used in research and industry. Yet, at the same time the task of judging gender is being passed from human to machine, there are increasingly present ways of life that reject the equation of gender and sex identity, and the idea of a fixed gender-binary. The introduction of AI here risks the ossification of past social forms.
Predictive text tools are increasingly being embedded in e-mail and chat clients to suggest one-click automatic responses, instead of requiring the human to craft a written response. Such AI-driven features are at once a tool of great convenience, but also an imposed shift in our patterns of social interaction.
Such forms of ‘social robot’ are addressed by Coeckelbergh & Funk when they write: “These social robots become active systems for verbal communication and therefore influence human linguistic habits more than non-talking tools.” (p 185). But note the material limitations of these robots: they can’t construct a full sentence representative of their user. Instead, they push conversation towards the quick short response, creating a pressure to change patterns of human interaction.
The examples above suggested by gmail for me to use in reply to a recent e-mail might follow terms I’d often use, but push towards a form of e-mail communication that, at least in my experience, represents a particularly capitalist and functional form of life, in which speed of communication is of the essence, rather than social communication and exploration of ideas.
Reflections and responses
Wittgenstein was not a social commentator, but it is possible to draw upon his ideas to move beyond conversations about AI bias, to look at how the widespread introduction of algorithmic and machine-learning driven systems may interact with different contemporary forms of living.
I’m always interested though in the critical leading to the practical, and so below I’ve started to sketch out possible responses the analysis above leads me to consider. I also strongly suspect that these responses, and justification for them, can be elaborated much more directly and accessibility without getting here via Wittgenstein. Writing that may be a task for later, but as I came here via the Wittgensitinian route, I’ll stick with it.
(1) Find better categories
If we want future algorithmic systems to represent the forms of live we want to live, not just those lived in the past, or imposed upon populations, we need to focus on the categories and data structured used to describe the world and train machine-learning systems.
The question of when we can develop global categories that have meaning that is ‘good enough’ in terms of alignment in use across different settings, and when it is important to have systems that can accommodate more localised categorisations, is one that requires detailed work, and that is inherent political.
(2) Build a better machine
Some objects to particular instances of AI may be because it is, ultimately, too blunt in its current form. Would my objection to the predictive text tools be the same if they could express more complete sentences, more in line with the way I want to communicate? For many critiques of algorithmic systems, there may be a plausible response to suggest that a better designed or trained system could address the problem raised.
I’m sceptical however, of whether it is plausible for most current instantiations of machine-learning to be adaptable enough to different forms of life: not least on the grounds that for some ways of living the sample-size may be too small to gather enough data points to construct a good model, or the collection of the data required may be too expensive or intrusive for theoretical possibilities of highly adaptive machine-learning systems to be practically feasible or desirable.
(3) Strategic rejection
Recognising the economic and political power embedded in certain AI implementations, and the particular form of life it embodies, may help us to see technologies we want to reject outright. If a certain tool makes moves in a language game that are at odds with the game we want to be playing, and only gains agreement of action through its imposition, then perhaps we should not admit it at all.
To put that more bluntly (and bringing in my own political stance), certain AI tools embody a late-capitalist form of life, rooted in cultures and practices of a small strata of Silicon Valley. Such tools should have no place in shaping other ways of life, and should be rejected not because they are biased, or because they have not adequately considered issues of privacy, but simply because the form of life they replicate undermines both equality and ecology.
Over my time here at Bellagio, I’ll be particularly focussed on the first of these responses – seeking better categories, and understanding how processes of standardisation interact with AI. My goal is to do that with more narrative, and less abstraction, but we shall see…
[Summary: report from a one day workshop with Create Gloucestershire bringing together artists and technologists to create artworks responding to data. Part 2 in a series with Exploring Arts Engagement with (Open) Data]
What happens when you bring together a group of artists, scientists, teachers and creative producers, with a collection of datasets, and a sprinkling of technologists and data analysts for a day? What will they create? What can we learn about data through the process?
The steady decline in education spending and increased focus on STEM subjects has impacted significantly on arts teaching and teachers. The knock on effect is observed in the take up of arts subjects at secondary, further and higher education level and, ultimately, impacting negatively on the arts and cultural sector in the UK. As such, Create Gloucestershire has been piloting new work in Gloucestershire schools to embed new creative curriculum approaches, supporting its mission to ‘make arts everyday for everyone’. The cultural education agenda therefore provided a useful ‘hook’ for this data exploration.
We started thinking about the idea of a ‘art and data hackathon’ at the start of this year, as part of Create Gloucestershire’s data maturity journey and decided to focus on questions around cultural education in Gloucestershire. However, we quickly realised an event could not be entirely modelled on a classic coding hackathon event, so, in April we brought together a group of potential participants for a short design meeting.
For this, we sought out a range of datasets about schools, arts education, arts teaching and funding for arts activities – and I worked to prepare Gloucestershire extracts of these datasets (slimming them down from hundreds of columns and rows) . Inspired by the Dataset Nutrition Project project, and using AirTable blocks to rapidly create a set of cards, we took along profiles of some of these datasets to help give participants at the planning meeting a sense of what might be found inside each of the datasets we looked at.
Through this planning meeting we were able to set our expectations about the kind of analysis and insights we might get to from these datasets, and to think about placing the emphasis of the day on collaboration and learning, rather than being overly directive about the questions to be answered with data. We also decided that, in order to help collaborative groups form in the workshop, and to make sure we had materials prepared for particular art forms, we would invite a number of artists to act as anchor facilitators on the day.
Culture: the hackathon day
After an overview of Create Gloucestershire’s mission to bring about ‘arts everyday for everyone’, we began with introductions, going round the group and completing three sentences:
For me, data is…
For me, arts everyday is…
In Gloucestershire, is arts everyday….?
Through this, we began to surface different experiences of engagement with data (everywhere; semi-transparent; impersonal; information; a goldmine; less well defined than art; complex; connective…), and with questions of access to arts (Arts everyday is: fun; making sense of the world; what you make of it; necessary; a privilege for some; an improbable dream; essential).
Illustrator and filmmaker, Joe Magee described the power of the pen, and how to sketch out responses to data;
Digital communications consultant and artist, Sarah Dixon described the use of textiles and paper to create work that mixes 2D and 3D; and
Architect Tomas Millar introduced a range of Virtual Reality technologies, and how tools from architecture and gaming could be adapted to create data-related artworks.
To get our creative ideas flowing, we then ran through some rapid idea generation, with everyone rotating around our four artists groups, and responding to four different items of data (below) with as many different ideas as possible. From the 30+ ideas generated came some of the seeds of the works we then developed during the afternoon.
Following a short break, everyone had the chance to form groups and dig deeper into designing an artwork, guided by a number of questions:
What response to data do group members want to focus on? Collecting data? Data representation? Interpretation and response? Or exploring ‘missing data’?
Is there a story, or a question you want to explore?
Who is the audience for your creation?
What data do you need? Individual numbers; graphs; tables; geo data; qualitative data; network data or some other form?
Groups then had around three hours to start making and creating prototype artworks based on their ideas, before we reconvened for a showcase of the creations.
The process was chaotic and collaborative. Some groups were straight into making: testing out the physical properties of materials, and then retrofitting data into their works later. Others sought to explore available datasets and find the stories amongst a wall of statistics. In some cases, we found ourselves gathering new data (e.g. lists of extracurricular activities taken from school websites), and in others, we needed to use exploratory data visualisation tools to see trends and extrapolate stories that could be explored through our artforms. People moved between groups to help create: recording audio, providing drawings, or sharing skills to stimulate new ways of increasing access to the stories within the data. Below is a brief summary of some of the works created, followed by some reflections on learning from the day.
Interactive audio: school subjects in harmony
Responding to questions about the balance of the school curriculum, and the low share of teaching hours occupied by the arts, the group recorded a four-part harmony audio clip, and set the volume of each part relative to the share of teaching time for arts, english, sciences and humanities. Through a collection of objects representing each subject, audiences could trigger individual parts, all four parts together, or a distorted version of the harmony. Through inviting interaction, and using volume and distortion, the piece invited reflection on the ‘right’ balance of school subjects, and the effect of loosing arts from the curriculum for the overall harmony of education.
Fabric chromatography: creative combinations
Picking up on a similar theme, this fabric based project sought to explore the mix of extracurricular activities available at a school, and how access to a range of activities can interact to support creative education. Using strips of fabric, woven in a grid onto a backcloth, the work immersed a dangling end of each strip in coloured ink, the mix of inks depending on the range of arts activities available at a particular school. As the ink soaked up vertical strands of the fabric, it also started to seep into horizontal strands, which could mix with other colours. The colours chosen reflected a chart representation of the dataset used to inform the work, establishing a clear link between data, information, and art work.
This work offered a powerful connection between art, data and science: allowing an exploration of how the properties of different inks, and different fabrics, could be used to represent data on ‘absorption’ of cultural education, and the benefits that may emerge from combining different cultural activities. The group envisaged works like this being developed with students, and then shown in the reception area of a school to showcase it’s cultural offer.
The shrinking design teacher (VR installation)
Using a series of photographs taken on a mobile phone, a 3D model of representation of Pip, a design teacher, was created in a virtual landscape. An audio recording of Pip describing the critical skill sets engendered through design teaching was linked to the model, which was set to shrink in size over the time of the recording reflecting 7-years of data on the reduction in design teaching hours in school.
Observed through VR goggles, the piece offered an emotive way to engage with a narrative on the power of art to encourage critical questioning of structures, and to support creative engagement with the world, all whilst – imperceptibly at first, and more clearly as the VR observer finds themselves looking down at the shrinking teacher – highlighting current trends in teaching hours.
From the virtual to the physical, this sketch questioned the ‘rigged’ nature of grammar school and private education, imagining an arcade machine where the weight, size and shape of tokens were set according to various data points, and where the mechanism would lead to certain tokens having a better chance of winning.
By exploring a data-informed arcade mechanisms, this idea captures the idea that statistical models can tell us something about potential future outcomes, but that outcomes are not entirely determined, and there are still elements of chance, or unpredictable interactions, in any individual story.
Building on data about different reasons for school exclusion, eight workshop participants were handed paper tags, marking them out for exclusion from the ‘classroom’. They were told to leave the room, where the images on their tags were scanned (using the Mayfly app) playing them a cold explanation of why they have been excluded and for how long.
The group were then invited to create a fabric based sculpture to represent the percentage of children excluded from school in Gloucestershire for the reasons indicated on their tag.
The work sought to explore the subjective experience of being excluded, and to look behind the numbers to the individual stories – whilst also prototyping a possible creative yarn-bombing workshop that could be used with excluded young people to re-engage them with education.
The team envisaged a further set of tags linked to personal narratives collected from young people excluded from school, bringing their voices into the piece to humanise the data story.
Library lights: stories from library users
This early prototype explored the potential VR to let an audience explore a space, shedding light on areas that are otherwise in darkness. Drawing on statistics about the fact that 33% of people use libraries, and on audio recordings – drawn from direct participant quotes collected by Create Gloucestershire during their 3-year Art of Libraries test programme describing how people benefitted from engagement with arts interventions in libraries across Gloucestershire – a virtual space was populated with 100 orbs – the percentage lit relating to those who use libraries. As the audience in VR approached a lit orb, an audio recording of an individual experience with a library would play.
The creative team envisaged the potential to create a galaxy of voices: offseting negative comments about libraries from those that don’t use them (they were able to find a significant number of data sets showing negative perceptions about libraries, but few positive ones) with the good experiences of those that do.
Artwork: Tomas Millar and team (image to come)
Seeing our networks
Not so much an artwork, as a data visualisation, this piece took data gathered over the last five years by Create Gloucestershire to record attendance at Create Gloucestershire events. Adding in data on attendance at the Creative Lab, lists of people, events and event participation (captured and cleaned up using the vTiger CRM), were fed into Kumu, and used to build an interactive network diagram. The visual allows an identification of how, over time, CG events have both engaged with new people (out on the edge of the network), and have started to build ongoing connections.
A note on naming
*One things we forgot to do (!) in our process was to ask each group to title their works, so the titles and descriptions above are given by the authors of this post. We will happily amend with input from each group.
We closed our workshop reflecting on learning from the day. I was particularly struck by the way in which responding to dataset through the lens of artistic creation (and not just data visualisation) provided opportunities to ask new questions of datasets, and to critically question their veracity and politics: digging into the stories behind each data point, and powerfully combining qualitative and quantitative data to look not just at presenting data, but finding what it might mean for particular audiences.
However, as Joe Magee framed it, it wasn’t always easy to find a route up the “gigantic data coalface”. Faced with hundreds of rows and columns of data, it was important to have access to tools and skills to carry out quick visualisations: yet knowing the right tools to use, or how to shape data so that it can be easily visualised, is not always straightforwards. Unlike a classic data hackathon, where there are often demands for the ‘raw data’, a data and art creative lab benefits from more work to prepare data extracts, and to provide access to layers of data (individual data points, a small set they belong in, the larger set they come from) .
Our journey, however, took use beyond the datasets we had pre-prepared. One particular resource we came across was the UK Taking Part Survey which offers a range of analysis tools to drill down into statistics on participation in art forms by age, region and socio-economic status. With this dataset, and a number of others, our expectations were often confounded when, for example, relationships we had expected to find between poverty and arts participation, or age and involvement, were not borne out in the data.
This points to a useful symmetry: turning to data allowed us to challenge the assumptions that might otherwise be baked into an agenda-driven artwork, but engaging with data through an arts lens also allowed us to challenge the assumptions behind data points, and behind the ways data is used in policy-making.
We’ve also learnt more about how to frame an event like this. We struggled to describe it in advance and to advertise it. Too much text was the feedback from some! Now with images of this event, we can think about ways to provide a better visual story for future workshops of what might be involved.
Given Create Gloucestershire’s commitment to arts everyday for everyone as a wholly inclusive statement of intent, it was exciting to see collaborators on the day truly engaging with data in a way they may not have done previously, and then expanding access to it by representing data in accessible and engaging forms which, additionally, could be explored by subjects of the data themselves. What might have seemed “boring” or “troublesome” at the start of the day become a font of inspiration and creativity, opening up new conversations that may never have previously taken place and setting up the potential for new collaborations, conversations, advocacy and engagement.
Thank you to the team at Create Gloucestershire for hosting the day, and particularly to Caroline, Pippa and Jay for all the organisation. Thanks to Kat at Atelier for hosting us, and to our facilitating artists: Barney, Sarah, Thomas and Joe. And thanks to everyone who gave up a Saturday to take part!
[Summary: an argument for the importance of involving civil society, and thinking broad when exploring the concept of high value data (with lots of links to past research and the like smuggled in)]
On 26th June this year the European Parliament and Council published an update to the Public Sector Information (PSI) directive, now recast as Directive 2019/1024 “on open data and the re-use of public sector information”.The new text makes a number of important changes, including bringing data held by publicly controlled companies in utility and transport sectors into the scope of the directive, extending coverage of research data, and seeking to limit the granting of exclusive private sector rights to data created during public tasks, and increase the transparency when such rights are granted.
However, one of the most significant changes of all is the inclusion of Article 14 on High Value Datasets which gives the Commission power to adopt an implementing act “laying down a list of specific high-value datasets” that member states will be obliged to publish under open licenses, and, in some cases, using certain APIs and standards. The implementing acts will have the power to set out those standards. This presents a major opportunity to shape the open data ecosystem of Europe for decades to come.
A few weeks back, a number of open data researchers and campaigners had a quick call to discuss ways to make sure past research, and civil society voices, inform the work that goes forward. As part of that, I agreed to draft a short(ish) post exploring the concept of high value data, and looking at some of the issues that might need to be addressed in the coming months. I’d hoped to co-draft this with colleagues, but with summer holidays and travel having intervened, am instead posting a sole authored post, with an invite to others to add/dispute/critique etc.
Notably, whilst it appears few (if any) open-data related civil society organisations are in a position to lead a response to the current EC tender, the civil society open data networks built over the last decade in Europe have a lot to offer in identifying, exploring and quantifying the potential social value of specific open datasets.
What counts as high value?
The Commission’s tender points towards a desire for a single list of datasets that can be said to exist in some form in each member state. The directive restricts the scope of this list to six domains: geospatial, earth observation and environment, meteorological, statistical, company and company ownership, and mobility-related datasets. It also appears to anticipate that data standards will only be prescribed for some kinds of data: highlighting a distinction between data that may be high value simply by virtue of publication, and data which is high-value by virtue of it’s interoperability between states.
In the new directive, the definition of ‘high value datasets’ is put as:
“documents the re-use of which is associated with important benefits for society, the environment and the economy, in particular because of their suitability for the creation of value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets;” (§2.10)
Although the ordering of society, environment and economy is welcome, there are subtle but important differences from the definition advanced in a 2014 paper from W3C and PwC for the European Commission which described a number of factors for determining whether there was high value to making a dataset open (and standardising it in some ways). It focussed attention on whether publication of a dataset:
Contributes to transparency
Helps governments meet legal obligations
Relates to a public task
Realises cost reductions; and
Has some value to a large audience, or substantial value to a smaller audience.
Although the recent tender talks of identifying “socio-economic” benefits of datasets, overall it adopts a strongly economic frame, seeking quantification of these and asking in particular for evaluation of “potential for AI applications of the identified datasets;”. (This particular framing of open data as a raw material input for AI is something I explored in the recent State of Open Data book, where the privacy chapter also picked up on a brief exploration how AI applications may also create new privacy risks for release of certain datasets.) But to keep wider political and social uses of open data in view, and to recognise that quantification of benefits is not a simple process of adding up the revenue of firms that use that data, any comprehensive method to explore high value datasets will need to consider a range of issues, including that:
Value is produced in a range of different ways
Not all future value can be identified from looking at existing data use cases
Value may result from network effects
Realising value takes more than data
Value is a two-sided calculation; and
The distribution of value matters as well as the total amount
I dig into each of these below.
Value is produced in different ways
A ‘raw material’ theory of change still pervades many discussions of open data, in spite of the growing evidence base about the many different ways that opening up access to data generates value. In ‘raw material’ theory, open data is an input, taken in by firms, processed, and output as part of new products and services. The value of the data can then be measured in the ‘value add’ captured from sales of the resulting product or service. Yet, this only captures a small part of the value that mandating certain datasets be made open can generate. Other mechanisms at play can include:
Risk reduction. Take, for example, beneficial ownership data. Quite asides from the revenue generated by ‘Know Your Customer (KYC)’ brokers who might build services off the back of public registers of beneficial ownership, consider the savings to government and firms from not being exposed to dodgy shell-companies, and the consumer surplus generated by supporting a clamp down on illicit financial flows into the housing market by supporting more effective cross-border anti-money laundering investigations. OpenOwnership are planning research later this year to dig more into how firms are using, or could use, beneficial ownership transparency data including to manage their exposure to risk. Any quantification needs to take into account not only value gained, but also value ‘not lost’ because a dataset is made open.
Internal efficiency and innovation. When data is made open, and particularly when standards are adopted, it often triggers a reconfiguration of data practices inside the data (c.f. Goëta & Davies), with the potential for this to support more efficient working, and enable innovation through collaboration between government, civil society and enterprise. For example, the open publication of contracting data, particularly with the adoption of common data standards, has enabled a number of governments to introduce new analytical tools, finding ways to get a better deal on the products and services they buy. Again, this value for money for the taxpayer may be missed by a simple ‘raw material’ theory.
Political and rights impacts. The 2014 W3C/PWC paper I cited earlier talks about identifying datasets with “some value to a large audience, or substantial value to a smaller audience.”. There may also be datasets that have low likelihood of causing impact, but high impact (at least for those affected) when they do. Take, for example, statistics on school admissions. When I first looked at use of open data back in 2009, I was struck by the case of an individual gaining confidence from the fact that statistics on school admission appeals were available (E7) when constructing an appeal case against a school’s refusal to admit their own child. The open availability of this data (not necessarily standardised or aggregated) had substantial value to empowering a citizen in securing their rights. Similarly, there are datasets that are important for communities to secure their rights (e.g. air quality data), or to take political action to either enforce existing policy (e.g. air quality limits), or to change policy (e.g. secure new air quality action zones). No only is such value difficult to quantify, but whether or not certain data generates value will vary between countries in accordance with local policies and political issues. The definition of EU-wide ‘high value datasets’ should not crowd out the possibility or process of defining data that is high-value in particular country. That said, there may at least be scope to look at datasets in the study categories that have substantial potential value in relation to EU social and environmental policy priorities.
Beyond the mechanisms above, there may also be datasets where we find a high intrinsic value in the transparency their publication brings, even without a clear evidence base that can quantifies their impact. In these cases, we might also talk of the normative value of openness, and consider which datasets deserve a place on the high-value list because we take the openness of this data to be foundational to the kind of societies we want to live in, just as we may take certain freedoms of speech and movement as foundational to the kind of Europe we want to see created.
Not all value can be found from prior examples
The tender cites projects like the Open Data Barometer (which I was involved in developing the methodology for) as potential inspirations for the design of approaches to assess “datasets that should belong to the list of high value datasets”. The primary place to look for that inspiration is not in the published stats, but in the underlying qualitative data which includes raw reports of cases of political, social and economic impact from open data. This data (available for a number of past editions of the Barometer) remains an under-explored source of potential impact cases that could be used to identify how data has been used in particular countries and settings. Equally, projects like the State of Open Data can be used to find inspiration on where data has been used to generate social value: the chapter on Transport is as case-in-point, looking at how comprehensive data on transport can support applications improving the mobility of people with specific needs.
However, many potential uses and impacts of open data are still to be realised, because the data they might work with has not heretofore been accessible. Looking only at existing cases of use and impact is likely to miss such cases. This is where dialogue with civil society becomes vitally important. Campaigners, analysts and advocates may have ideas for the projects that could exist if only particular data was available. In some cases, there will be a hint at what is possible from academic projects that have gained access to particular government datasets, or from pilot projects where limited data was temporarily shared – but in other cases, understanding potential value will require a more imaginative and forward-looking and consultative process. Given the upcoming study may set the list of high value datasets for decades to come – it’s important that the agenda is not be solely determined by prior publication precedent.
For some datasets, certain value comes from network effects
If one country provides an open register of corporate ownership, the value this has for anti-corruption purposes only goes so far. Corruption is a networked game, and without being able to following corporate chains across borders, the value of a single register may be limited. The value of corporate disclosures in one jurisdiction increase the more other jurisdictions provide such data. The general principle here, that certain data gains value through network effects, raises some important issues for the quantification of value, and will help point towards those datasets where standardisation is particularly important. Being able to show, for example, that the majority of the value of public transit data comes from domestic use (and so interoperability is less important), but the majority of value of, say, carbon emission or climate change mitigation financing data, comes from cross-border use, will be important to support prioritisation of datasets.
Value generation takes more than data
Another challenge of of the ‘raw material’ theory of change is that it often fails to consider (a) the underlying quality (not only format standardisation) of source data, and (b) the complementary policies and resources that enable use. For example, air quality data from low-quality or uncalibrated particulate sensors may be less valuable than data from calibrated and high quality sensors, particularly when national policy may set out criteria for the kinds of data that can be used in advancing claims for additional environmental protections in high-pollution areas. Understanding this interaction of ‘local data’ and the governance contexts where it is used is important in understanding how far, and under what conditions, one may extrapolate from value identified in one context, to potential value to be realised in another. This calls for methods that can go beyond naming datasets, to being able to describe features (not just formats) that are important for them to have.
It can be temping to quantify the value of a dataset simply by taking all the ‘positive’ value it might generate, and adding it up. But, a true quantification calculation also needs to consider potential negative impacts. In some cases, this could be positive economic value set against some social or ecological dis-benefit. For example, consider the release of some data that might increase use of carbon-intensive air and road transport. While this could generate quantifiable revenue for haulage and airline firms, it might undermine efforts to tackle climate change, destroying long-term value. Or in other cases, there may be data that provides social benefit (e.g. through the release of consumer protection related data) but that disrupts an existing industry in ways that reduce private sector revenues.
Recognising the power of data, involves recognising that power can be used in both positive and negative ways. A complete balance sheet needs to consider the plus and the minus. This is another key point where dialogue with civil society will be vital – and not only with open data advocates, but with those who can help consider the potential harms of certain data being more open.
Distribution of value matters
Last but not least, when considering public investment in ‘high value’ datasets, it is important to consider who captures that value. I’ve already hinted at the fact that value might be captured as government surplus, consumer surplus or producer (private sector) surplus – but there are also relevant question to ask about which countries or industries may be best placed to capture value from cross-border interoperable datasets.
When we see data as infrastructure, then it can help us consider the potential to both provide infrastructure that is open to all and generative of innovation, but also to design policies that ensure those capturing value from the infrastructure are contributing to its maintenance.
Work on methodologies to identify high value datasets in Europe should not start from scratch, and stand to benefit substantially from engaging with open data communities across the region. There is a risk that a narrow conceptualisation and quantification of ‘high value’ will fail to capture the true value of openness, and to consider the contexts of data production and use. However, there is a wealth of research from the last decade (including some linked in this post, and cited in State of Open Data) to build upon, and I’m hopeful that whichever consultant or consortium takes on the EC’s commissioned study, they will take as broad a view as possible within the practical constraints of their project.