Creative Lab Report: Data | Culture | Learning

[Summary: report from a one-day workshop with Create Gloucestershire bringing together artists and technologists to create artworks responding to data. Part 2 in a series with Exploring Arts Engagement with (Open) Data]

What happens when you bring together a group of artists, scientists, teachers and creative producers, with a collection of datasets, and a sprinkling of technologists and data analysts for a day? What will they create? What can we learn about data through the process?  

There has been a long trend of data-driven artworks, and of individual artists incorporating responses to structured data in their work. But how does this work in the compressed context of a one-day collaborative workshop? These are all questions I had the opportunity to explore last Saturday in a workshop co-facilitated with Jay Haigh of Create Gloucestershire and hosted at Atelier in Stroud: an event we ran under the title “Data | Culture | Learning: Creative Lab”.

The steady decline in education spending and the increased focus on STEM subjects have had a significant impact on arts teaching and teachers. The knock-on effect can be seen in the take-up of arts subjects at secondary, further and higher education level, ultimately impacting negatively on the arts and cultural sector in the UK. As such, Create Gloucestershire has been piloting new work in Gloucestershire schools to embed new creative curriculum approaches, supporting its mission to ‘make arts everyday for everyone’. The cultural education agenda therefore provided a useful ‘hook’ for this data exploration.

Data: preparation

We started thinking about the idea of an ‘art and data hackathon’ at the start of this year, as part of Create Gloucestershire’s data maturity journey, and decided to focus on questions around cultural education in Gloucestershire. However, we quickly realised the event could not be entirely modelled on a classic coding hackathon, so in April we brought together a group of potential participants for a short design meeting.

Photo of preparation workshop

For this, we sought out a range of datasets about schools, arts education, arts teaching and funding for arts activities – and I worked to prepare Gloucestershire extracts of these datasets (slimming them down from hundreds of columns and rows). Inspired by the Dataset Nutrition Project, and using AirTable blocks to rapidly create a set of cards, we took along profiles of some of these datasets to help give participants at the planning meeting a sense of what might be found inside each one.
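To give a sense of what that slimming looked like in practice, here is a minimal sketch of the general shape of the task. It is illustrative only: the file name and column names (such as la_name or arts_gcse_entries) are hypothetical stand-ins rather than the actual source data.

```python
# Illustrative sketch of slimming a national dataset down to a county extract.
# File and column names are hypothetical placeholders.
import pandas as pd

# A (hypothetical) national schools dataset with hundreds of columns
schools = pd.read_csv("national_schools_data.csv", low_memory=False)

# Keep only Gloucestershire rows, and a handful of columns relevant to
# questions about cultural education
columns_of_interest = [
    "school_name",
    "school_type",
    "number_on_roll",
    "percent_free_school_meals",
    "arts_gcse_entries",
]
extract = schools.loc[
    schools["la_name"] == "Gloucestershire", columns_of_interest
]

# Save the slimmed-down extract for workshop participants
extract.to_csv("gloucestershire_schools_extract.csv", index=False)
```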

Dataset labels: inspired by dataset nutrition project
Through this planning meeting we were able to set our expectations about the kind of analysis and insights we might get from these datasets, and to think about placing the emphasis of the day on collaboration and learning, rather than being overly directive about the questions to be answered with data. We also decided that, in order to help collaborative groups form in the workshop, and to make sure we had materials prepared for particular art forms, we would invite a number of artists to act as anchor facilitators on the day.

Culture: the hackathon day 

Group photo of hackathon day

After an overview of Create Gloucestershire’s mission to bring about ‘arts everyday for everyone’, we began with introductions, going round the group and completing three sentences:

  • For me, data is…
  • For me, arts everyday is…
  • In Gloucestershire, is arts everyday….? 

For me, data is... (post-it notes)

Through this, we began to surface different experiences of engagement with data (everywhere; semi-transparent; impersonal; information; a goldmine; less well defined than art; complex; connective…), and with questions of access to arts (Arts everyday is: fun; making sense of the world; what you make of it; necessary; a privilege for some; an improbable dream; essential). 

We then turned briefly to look at some of the data available to explore these questions, before inviting our artists to explain the tools and approaches they had brought along to share:

  • Barney Heywood of Stand + Stare demonstrated use of touch-sensitive tape to create physical installations that respond to an audience with sound or visuals, as well as the Mayfly app that links stickers and sounds;
  • Illustrator and filmmaker, Joe Magee described the power of the pen, and how to sketch out responses to data;
  • Digital communications consultant and artist, Sarah Dixon described the use of textiles and paper to create work that mixes 2D and 3D; and
  • Architect Tomas Millar introduced a range of Virtual Reality technologies, and how tools from architecture and gaming could be adapted to create data-related artworks. 

To get our creative ideas flowing, we then ran through some rapid idea generation, with everyone rotating around our four artists’ groups, and responding to four different items of data (below) with as many different ideas as possible. From the 30+ ideas generated came some of the seeds of the works we then developed during the afternoon.

Slides showing: a 38% drop in arts GCSE entries from 2010 to 2019; a table of the number and percentage of students at local secondary schools eligible for free school meals; and quantitative and qualitative data from a study on arts education in schools.

Following a short break, everyone had the chance to form groups and dig deeper into designing an artwork, guided by a number of questions:

  • What response to data do group members want to focus on? Collecting data? Data representation? Interpretation and response? Or exploring ‘missing data’?
  • Is there a story, or a question you want to explore?
  • Who is the audience for your creation?
  • What data do you need? Individual numbers; graphs; tables; geo data; qualitative data; network data or some other form? 
Example of sketches
Sketching early ideas

Groups then had around three hours to start making and creating prototype artworks based on their ideas, before we reconvened for a showcase of the creations.

The process was chaotic and collaborative. Some groups were straight into making: testing out the physical properties of materials, and then retrofitting data into their works later. Others sought to explore available datasets and find the stories amongst a wall of statistics. In some cases, we found ourselves gathering new data (e.g. lists of extracurricular activities taken from school websites), and in others, we needed to use exploratory data visualisation tools to see trends and extrapolate stories that could be explored through our artforms. People moved between groups to help create: recording audio, providing drawings, or sharing skills to stimulate new ways of increasing access to the stories within the data. Below is a brief summary of some of the works created, followed by some reflections on learning from the day. 

The artworks

Interactive audio: school subjects in harmony

Artwork: Barney Heywood and team | Photo credit: Kazz Hollick

Responding to questions about the balance of the school curriculum, and the low share of teaching hours occupied by the arts, the group recorded a four-part harmony audio clip, and set the volume of each part relative to the share of teaching time for arts, English, sciences and humanities. Through a collection of objects representing each subject, audiences could trigger individual parts, all four parts together, or a distorted version of the harmony. Through inviting interaction, and using volume and distortion, the piece invited reflection on the ‘right’ balance of school subjects, and the effect of losing arts from the curriculum on the overall harmony of education.

Fabric chromatography: creative combinations

Artwork: Sarah Dixon and team. Photo credit: Jay Haigh

Picking up on a similar theme, this fabric-based project sought to explore the mix of extracurricular activities available at a school, and how access to a range of activities can interact to support creative education. Using strips of fabric, woven in a grid onto a backcloth, the work immersed a dangling end of each strip in coloured ink, with the mix of inks depending on the range of arts activities available at a particular school. As the ink soaked up the vertical strands of the fabric, it also started to seep into the horizontal strands, where it could mix with other colours. The colours chosen reflected a chart representation of the dataset used to inform the work, establishing a clear link between data, information and artwork.

This work offered a powerful connection between art, data and science: allowing an exploration of how the properties of different inks, and different fabrics, could be used to represent data on the ‘absorption’ of cultural education, and the benefits that may emerge from combining different cultural activities. The group envisaged works like this being developed with students, and then shown in the reception area of a school to showcase its cultural offer.

The shrinking design teacher (VR installation)

Artwork: Tomas Millar & Pip Heywood. Photo credit: Jay Haigh

Using a series of photographs taken on a mobile phone, a 3D model of Pip, a design teacher, was created in a virtual landscape. An audio recording of Pip describing the critical skill sets engendered through design teaching was linked to the model, which was set to shrink in size over the course of the recording, reflecting seven years of data on the reduction in design teaching hours in schools.

Observed through VR goggles, the piece offered an emotive way to engage with a narrative on the power of art to encourage critical questioning of structures, and to support creative engagement with the world, all whilst – imperceptibly at first, and more clearly as the VR observer finds themselves looking down at the shrinking teacher – highlighting current trends in teaching hours. 

Arcade mechanicals

Artwork: Joe Magee and team. Photo credit: Jay Haigh

From the virtual to the physical, this sketch questioned the ‘rigged’ nature of grammar school and private education, imagining an arcade machine where the weight, size and shape of tokens were set according to various data points, and where the mechanism would lead to certain tokens having a better chance of winning. 

By exploring a data-informed arcade mechanism, this piece captures the idea that statistical models can tell us something about potential future outcomes, but that outcomes are not entirely determined, and there are still elements of chance, or unpredictable interactions, in any individual story.

Exclusion tags

Artwork: Joe Magee, Sarah Dixon and team. Photo: Jay Haigh

Building on data about the different reasons for school exclusion, eight workshop participants were handed paper tags, marking them out for exclusion from the ‘classroom’. They were told to leave the room, where the images on their tags were scanned (using the Mayfly app), playing them a cold explanation of why they had been excluded and for how long.

The group were then invited to create a fabric based sculpture to represent the percentage of children excluded from school in Gloucestershire for the reasons indicated on their tag.  

The work sought to explore the subjective experience of being excluded, and to look behind the numbers to the individual stories – whilst also prototyping a possible creative yarn-bombing workshop that could be used with excluded young people to re-engage them with education.  

The team envisaged a further set of tags linked to personal narratives collected from young people excluded from school, bringing their voices into the piece to humanise the data story.

Library lights: stories from library users

This early prototype explored the potential of VR to let an audience explore a space, shedding light on areas that are otherwise in darkness. Drawing on statistics showing that 33% of people use libraries, and on audio recordings – drawn from direct participant quotes collected by Create Gloucestershire during their three-year Art of Libraries test programme, describing how people benefitted from engagement with arts interventions in libraries across Gloucestershire – a virtual space was populated with 100 orbs, with the percentage lit corresponding to the proportion of people who use libraries. As the audience in VR approached a lit orb, an audio recording of an individual experience with a library would play.

The creative team envisaged the potential to create a galaxy of voices: offsetting negative comments about libraries from those who don’t use them (they were able to find a significant number of datasets showing negative perceptions of libraries, but few positive ones) with the good experiences of those who do.

Artwork: Tomas Millar and team (image to come)

Seeing our networks


Not so much an artwork as a data visualisation, this piece took attendance data gathered by Create Gloucestershire over the last five years of its events. Adding in data on attendance at the Creative Lab, lists of people, events and event participation (captured and cleaned up using the vTiger CRM) were fed into Kumu and used to build an interactive network diagram. The visual shows how, over time, CG events have both engaged new people (out at the edge of the network) and started to build ongoing connections.
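To give a sense of the data shaping this kind of network visual involves, here is a rough sketch (not the exact process used on the day) of turning attendance records into the two simple tables (elements and connections) that Kumu can import. The input file and its person and event columns are hypothetical stand-ins for the cleaned-up CRM export.

```python
# Sketch: shape attendance records into Kumu-importable elements and connections.
# The input file and its "person"/"event" columns are hypothetical placeholders.
import pandas as pd

attendance = pd.read_csv("event_attendance.csv")  # columns: person, event

# Elements: every person and every event becomes a node, labelled with a type
people = pd.DataFrame({"Label": attendance["person"].unique(), "Type": "Person"})
events = pd.DataFrame({"Label": attendance["event"].unique(), "Type": "Event"})
elements = pd.concat([people, events], ignore_index=True)

# Connections: one edge per attendance record, linking a person to an event
connections = (
    attendance.rename(columns={"person": "From", "event": "To"})[["From", "To"]]
    .drop_duplicates()
)

# Kumu accepts spreadsheets of elements (Label, Type) and connections (From, To)
elements.to_csv("kumu_elements.csv", index=False)
connections.to_csv("kumu_connections.csv", index=False)
```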

A note on naming

One thing we forgot to do (!) in our process was to ask each group to title their works, so the titles and descriptions above are given by the authors of this post. We will happily amend them with input from each group.

Learning

We closed our workshop reflecting on learning from the day. I was particularly struck by the way in which responding to datasets through the lens of artistic creation (and not just data visualisation) provided opportunities to ask new questions of datasets, and to critically question their veracity and politics: digging into the stories behind each data point, and powerfully combining qualitative and quantitative data to look not just at presenting data, but at finding what it might mean for particular audiences.

However, as Joe Magee framed it, it wasn’t always easy to find a route up the “gigantic data coalface”. Faced with hundreds of rows and columns of data, it was important to have access to tools and skills to carry out quick visualisations: yet knowing the right tools to use, or how to shape data so that it can be easily visualised, is not always straightforward. Unlike a classic data hackathon, where there are often demands for the ‘raw data’, a data and art creative lab benefits from more work to prepare data extracts, and to provide access to layers of data (individual data points, the small set they belong in, the larger set they come from).

Our journey, however, took us beyond the datasets we had pre-prepared. One particular resource we came across was the UK Taking Part Survey, which offers a range of analysis tools to drill down into statistics on participation in art forms by age, region and socio-economic status. With this dataset, and a number of others, our expectations were often confounded when, for example, relationships we had expected to find between poverty and arts participation, or age and involvement, were not borne out in the data.

This points to a useful symmetry: turning to data allowed us to challenge the assumptions that might otherwise be baked into an agenda-driven artwork, but engaging with data through an arts lens also allowed us to challenge the  assumptions behind data points, and behind the ways data is used in policy-making. 

We’ve also learnt more about how to frame an event like this. We struggled to describe it in advance and to advertise it. Too much text was the feedback from some! Now, with images of this event, we can think about ways to provide a better visual story of what might be involved for future workshops.

Given Create Gloucestershire’s commitment to arts everyday for everyone as a wholly inclusive statement of intent, it was exciting to see collaborators on the day truly engaging with data in a way they may not have done previously, and then expanding access to it by representing data in accessible and engaging forms which, additionally, could be explored by the subjects of the data themselves. What might have seemed “boring” or “troublesome” at the start of the day became a font of inspiration and creativity, opening up new conversations that may never have previously taken place and setting up the potential for new collaborations, conversations, advocacy and engagement.

Thanks

Thank you to the team at Create Gloucestershire for hosting the day, and particularly to Caroline, Pippa and Jay for all the organisation. Thanks to Kat at Atelier for hosting us, and to our facilitating artists: Barney, Sarah, Tomas and Joe. And thanks to everyone who gave up a Saturday to take part!

Photo credit where not stated: Jay Haigh

High value datasets: an exploration

[Summary: an argument for the importance of involving civil society, and thinking broadly, when exploring the concept of high value data (with lots of links to past research and the like smuggled in)]

On 26th June this year the European Parliament and Council published an update to the Public Sector Information (PSI) Directive, now recast as Directive 2019/1024 “on open data and the re-use of public sector information”. The new text makes a number of important changes, including bringing data held by publicly controlled companies in the utility and transport sectors into the scope of the directive, extending coverage to research data, seeking to limit the granting of exclusive private sector rights to data created during public tasks, and increasing transparency when such rights are granted.

However, one of the most significant changes of all is the inclusion of Article 14 on High Value Datasets, which gives the Commission power to adopt an implementing act “laying down a list of specific high-value datasets” that member states will be obliged to publish under open licenses and, in some cases, using certain APIs and standards. The implementing acts will have the power to set out those standards. This presents a major opportunity to shape the open data ecosystem of Europe for decades to come.

The EU Commission has already issued a tender for a consultant to support it in defining a ‘List of High-value Datasets to be made Available by the Member States under the PSI-Directive’, and work looks set to advance at pace, particularly as the window granted to the Commission by the directive to set out a list of high value datasets is time-limited.

A few weeks back, a number of open data researchers and campaigners had a quick call to discuss ways to make sure past research, and civil society voices, inform the work that goes forward. As part of that, I agreed to draft a short(ish) post exploring the concept of high value data, and looking at some of the issues that might need to be addressed in the coming months. I’d hoped to co-draft this with colleagues, but with summer holidays and travel having intervened, I am instead posting a sole-authored piece, with an invitation to others to add, dispute and critique.

Notably, whilst it appears few (if any) open-data related civil society organisations are in a position to lead a response to the current EC tender, the civil society open data networks built over the last decade in Europe have a lot to offer in identifying, exploring and quantifying the potential social value of specific open datasets.

What counts as high value?

The Commission’s tender points towards a desire for a single list of datasets that can be said to exist in some form in each member state. The directive restricts the scope of this list to six domains: geospatial, earth observation and environment, meteorological, statistical, company and company ownership, and mobility-related datasets. It also appears to anticipate that data standards will only be prescribed for some kinds of data: highlighting a distinction between data that may be high value simply by virtue of publication, and data which is high value by virtue of its interoperability between states.

In the new directive, the definition of ‘high value datasets’ is put as:

“documents the re-use of which is associated with important benefits for society, the environment and the economy, in particular because of their suitability for the creation of value-added services, applications and new, high-quality and decent jobs, and of the number of potential beneficiaries of the value-added services and applications based on those datasets;” (§2.10)

Although the ordering of society, environment and economy is welcome, there are subtle but important differences from the definition advanced in a 2014 paper from W3C and PwC for the European Commission which described a number of factors for determining whether there was high value to making a dataset open (and standardising it in some ways). It focussed attention on whether publication of a dataset:

  • Contributes to transparency
  • Helps governments meet legal obligations
  • Relates to a public task
  • Realises cost reductions; and
  • Has some value to a large audience, or substantial value to a smaller audience.

Although the recent tender talks of identifying “socio-economic” benefits of datasets, overall it adopts a strongly economic frame, seeking quantification of these and asking in particular for evaluation of the “potential for AI applications of the identified datasets”. (This particular framing of open data as a raw material input for AI is something I explored in the recent State of Open Data book, where the privacy chapter also briefly explored how AI applications may create new privacy risks from the release of certain datasets.) But to keep wider political and social uses of open data in view, and to recognise that quantification of benefits is not a simple process of adding up the revenue of firms that use that data, any comprehensive method to explore high value datasets will need to consider a range of issues, including that:

  • Value is produced in a range of different ways
  • Not all future value can be identified from looking at existing data use cases
  • Value may result from network effects
  • Realising value takes more than data
  • Value is a two-sided calculation; and
  • The distribution of value matters as well as the total amount

I dig into each of these below.

Value is produced in different ways

A ‘raw material’ theory of change still pervades many discussions of open data, in spite of the growing evidence base about the many different ways that opening up access to data generates value. In ‘raw material’ theory, open data is an input, taken in by firms, processed, and output as part of new products and services. The value of the data can then be measured in the ‘value add’ captured from sales of the resulting product or service. Yet, this only captures a small part of the value that mandating certain datasets be made open can generate. Other mechanisms at play can include:

  • Risk reduction. Take, for example, beneficial ownership data. Quite aside from the revenue generated by ‘Know Your Customer’ (KYC) brokers who might build services off the back of public registers of beneficial ownership, consider the savings to government and firms from not being exposed to dodgy shell companies, and the consumer surplus generated by a clampdown on illicit financial flows into the housing market, supported by more effective cross-border anti-money laundering investigations. OpenOwnership are planning research later this year to dig more into how firms are using, or could use, beneficial ownership transparency data, including to manage their exposure to risk. Any quantification needs to take into account not only value gained, but also value ‘not lost’ because a dataset is made open.
  • Internal efficiency and innovation. When data is made open, and particularly when standards are adopted, it often triggers a reconfiguration of data practices inside the organisations that hold the data (c.f. Goëta & Davies), with the potential for this to support more efficient working, and enable innovation through collaboration between government, civil society and enterprise. For example, the open publication of contracting data, particularly with the adoption of common data standards, has enabled a number of governments to introduce new analytical tools, finding ways to get a better deal on the products and services they buy. Again, this value for money for the taxpayer may be missed by a simple ‘raw material’ theory.
  • Political and rights impacts. The 2014 W3C/PwC paper I cited earlier talks about identifying datasets with “some value to a large audience, or substantial value to a smaller audience”. There may also be datasets that have a low likelihood of causing impact, but high impact (at least for those affected) when they do. Take, for example, statistics on school admissions. When I first looked at use of open data back in 2009, I was struck by the case of an individual gaining confidence from the fact that statistics on school admission appeals were available (E7) when constructing an appeal case against a school’s refusal to admit their own child. The open availability of this data (not necessarily standardised or aggregated) had substantial value in empowering a citizen to secure their rights. Similarly, there are datasets that are important for communities to secure their rights (e.g. air quality data), or to take political action to either enforce existing policy (e.g. air quality limits), or to change policy (e.g. secure new air quality action zones). Not only is such value difficult to quantify, but whether or not certain data generates value will vary between countries in accordance with local policies and political issues. The definition of EU-wide ‘high value datasets’ should not crowd out the possibility or process of defining data that is high value in a particular country. That said, there may at least be scope to look at datasets in the study categories that have substantial potential value in relation to EU social and environmental policy priorities.

Beyond the mechanisms above, there may also be datasets where we find a high intrinsic value in the transparency their publication brings, even without a clear evidence base that quantifies their impact. In these cases, we might also talk of the normative value of openness, and consider which datasets deserve a place on the high-value list because we take the openness of this data to be foundational to the kind of societies we want to live in, just as we may take certain freedoms of speech and movement as foundational to the kind of Europe we want to see created.

Not all value can be found from prior examples

The tender cites projects like the Open Data Barometer (which I was involved in developing the methodology for) as potential inspirations for the design of approaches to assess “datasets that should belong to the list of high value datasets”. The primary place to look for that inspiration is not in the published stats, but in the underlying qualitative data, which includes raw reports of cases of political, social and economic impact from open data. This data (available for a number of past editions of the Barometer) remains an under-explored source of potential impact cases that could be used to identify how data has been used in particular countries and settings. Equally, projects like the State of Open Data can be used to find inspiration on where data has been used to generate social value: the chapter on Transport is a case in point, looking at how comprehensive data on transport can support applications improving the mobility of people with specific needs.

However, many potential uses and impacts of open data are still to be realised, because the data they might work with has not heretofore been accessible. Looking only at existing cases of use and impact is likely to miss such cases. This is where dialogue with civil society becomes vitally important. Campaigners, analysts and advocates may have ideas for the projects that could exist if only particular data was available. In some cases, there will be a hint at what is possible from academic projects that have gained access to particular government datasets, or from pilot projects where limited data was temporarily shared – but in other cases, understanding potential value will require a more imaginative, forward-looking and consultative process. Given that the upcoming study may set the list of high value datasets for decades to come, it is important that the agenda is not solely determined by prior publication precedent.

For some datasets, certain value comes from network effects

If one country provides an open register of corporate ownership, the value this has for anti-corruption purposes only goes so far. Corruption is a networked game, and without being able to follow corporate chains across borders, the value of a single register may be limited. The value of corporate disclosures in one jurisdiction increases the more other jurisdictions provide such data. The general principle here, that certain data gains value through network effects, raises some important issues for the quantification of value, and will help point towards those datasets where standardisation is particularly important. Being able to show, for example, that the majority of the value of public transit data comes from domestic use (and so interoperability is less important), but the majority of the value of, say, carbon emission or climate change mitigation financing data comes from cross-border use, will be important to support prioritisation of datasets.

Value generation takes more than data

Another challenge of the ‘raw material’ theory of change is that it often fails to consider (a) the underlying quality (not only format standardisation) of source data, and (b) the complementary policies and resources that enable use. For example, air quality data from low-quality or uncalibrated particulate sensors may be less valuable than data from calibrated and high quality sensors, particularly when national policy may set out criteria for the kinds of data that can be used in advancing claims for additional environmental protections in high-pollution areas. Understanding this interaction of ‘local data’ and the governance contexts where it is used is important in understanding how far, and under what conditions, one may extrapolate from value identified in one context to potential value to be realised in another. This calls for methods that can go beyond naming datasets, to being able to describe features (not just formats) that are important for them to have.

Within the Web Foundation-hosted Open Data Research Network a few years back, we spent considerable time refining a framework for thinking about all the aspects that go into securing impact (and value) from open data, and work by GovLab has also identified factors that have been important to the success of initiatives using open data. Beyond this, numerous dataset-specific frameworks for understanding what quality looks like may exist. Whilst recommending dataset-by-dataset measures to enhance the value realised from particular open datasets may be beyond the scope of the European Commission’s current study, when researching and extrapolating from past value generation in different contexts it is important to look at the other complementary factors that may have contributed to that value being realised, alongside the simple availability of data.

Value is a two-sided calculation

It can be tempting to quantify the value of a dataset simply by taking all the ‘positive’ value it might generate, and adding it up. But a true quantification also needs to consider potential negative impacts. In some cases, this could be positive economic value set against some social or ecological dis-benefit. For example, consider the release of some data that might increase use of carbon-intensive air and road transport. While this could generate quantifiable revenue for haulage and airline firms, it might undermine efforts to tackle climate change, destroying long-term value. Or in other cases, there may be data that provides social benefit (e.g. through the release of consumer protection related data) but that disrupts an existing industry in ways that reduce private sector revenues.

Recognising the power of data involves recognising that power can be used in both positive and negative ways. A complete balance sheet needs to consider the plus and the minus. This is another key point where dialogue with civil society will be vital – and not only with open data advocates, but with those who can help consider the potential harms of certain data being more open.

Distribution of value matters

Last but not least, when considering public investment in ‘high value’ datasets, it is important to consider who captures that value. I’ve already hinted at the fact that value might be captured as government surplus, consumer surplus or producer (private sector) surplus – but there are also relevant questions to ask about which countries or industries may be best placed to capture value from cross-border interoperable datasets.

Seeing data as infrastructure can help us consider both how to provide infrastructure that is open to all and generative of innovation, and how to design policies that ensure those capturing value from the infrastructure contribute to its maintenance.

In summary

Work on methodologies to identify high value datasets in Europe should not start from scratch, and stands to benefit substantially from engaging with open data communities across the region. There is a risk that a narrow conceptualisation and quantification of ‘high value’ will fail to capture the true value of openness, and fail to consider the contexts of data production and use. However, there is a wealth of research from the last decade (including some linked in this post, and cited in State of Open Data) to build upon, and I’m hopeful that whichever consultant or consortium takes on the EC’s commissioned study, they will take as broad a view as possible within the practical constraints of their project.

Linking data and AI literacy at each stage of the data pipeline

[Summary: extended notes from an unConference session]

At the recent data-literacy-focussed Open Government Partnership unConference day (ably facilitated by my fellow Stroudie Dirk Slater) I acted as host for a break-out discussion on ‘Artificial Intelligence and Data Literacy’, building on the ‘Algorithms and AI’ chapter I contributed to The State of Open Data book.

In that chapter, I offer the recommendation that machine learning should be addressed within wider open data literacy building.  However, it was only through the unConference discussions that we found a promising approach to take that recommendation forward: encouraging a critical look at how AI might be applied at each stage of the School of Data ‘Data Pipeline’.

The Data Pipeline, which features in the Data Literacy chapter of The State of Open Data, describes seven stages for working with data, from defining the problem to be addressed, through finding and getting hold of relevant data, and verifying and cleaning it, to analysing data and presenting findings.

Figure: The School of Data’s data pipeline. Source: https://schoolofdata.org/methodology/

Often, AI is described as a tool for data analysis (and this was the mental framework many unConference session participants started with). Yet, in practice, AI tools might play a role at each stage of the data pipeline, and exploring these different applications of AI could support a more critical understanding of the affordances, and limitations, of AI.

The following rough worked example looks at how this could be applied in practice, using an imagined case study to illustrate the opportunities to build AI literacy along the data pipeline.

(Note: although I’ll use machine-learning and AI broadly interchangeably in this blog post, as I outline in the State of Open Data Chapter, AI is a  broader concept than machine-learning.)

Worked example

Imagine a human rights organisation, using a media-monitoring service to identify emerging trends that they should investigate. The monitoring service flags a spike in gender-based violence, encouraging them to seek out more detailed data. Their research locates a mix of social media posts, crowdsourced data from a harassment mapping platform, and official statistics collected in different regions across the country. They bring this data together, and seek to check its accuracy, before producing an analysis and a visually impactful report.

As we unpack this (fictional) example, we can consider how algorithms and machine-learning are, or could be, applied at each stage – and we can use that to consider the strengths and weaknesses of machine-learning approaches, building data and AI literacy.

  • Define – The patterns that first give rise to a hunch or topic to investigate may have been identified by an algorithmic model.  How does this fit with, or challenge, the perception of staff or community members? If there is a mis-match – is this because the model is able to spot a pattern that humans were not able to see (+1 for the AI)? Or could it be because the model is relying on input data that reflects certain biases (e.g. media may under-report certain stories, or certain stories may be over-reported because of certain cognitive biases amongst reporters)?

  • Find – Search engine algorithms may be applying machine-learning approaches to identify and rank results. Machine-translation tools, which could be used to search for data described in other languages, are also an example of really well-established AI. Consider the accuracy of search engines and machine translation: they are remarkable tools, but we also recognise that they are nowhere near 100% reliable. We still generally rely on a human to sift through the results they give.

  • Get – One of the most common, and powerful, applications of machine-learning, is in turning information into data: taking unstructured content, and adding structure through classification or data extraction. For example, image classification algorithms can be trained to convert complex imagery into a dataset of terms or descriptions; entity extraction and sentiment analysis tools can be used to pick out place names, event descriptions and a judgement on whether the event described is good or bad, from free text tweets, and data extraction algorithms can (in some cases) offer a much faster and cheaper way to transcribe thousands of documents than having humans do the work by hand. AI can, ultimately, change what counts as structured data or not.  However, that doesn’t mean that you can get all the data you need using AI tools. Sometimes, particularly where well-defined categorical data is needed, getting data may require creation of new reporting tools, definitions and data standards.

  • Verify – School of Data describe the verification step like this: “We got our hands in the data, but that doesn’t mean it’s the data we need. We have to check out if details are valid, such as the meta-data, the methodology of collection, if we know who organised the dataset and it’s a credible source.” In the context of AI-extracted data, this offers an opportunity to talk about training data and test data, and to think about the impact that tuning tolerances to false-positives or false-negatives might have on the analysis that will be carried out. It also offers an opportunity to think about the impact that different biases in the data might have on any models built to analyse it. (A small sketch of these ‘get’ and ‘verify’ steps appears after this list.)

  • Clean – When bringing together data from multiple sources, there may be all sorts of errors and outliers to address. Machine-learning tools may prove particularly useful for de-duplication of data, or spotting possible outliers. Data cleaning to prepare data for a machine-learning based analysis may also involve simplifying a complex dataset into a smaller number of variables and categories. Working through this process can help build an understanding of the ways in which, before a model is applied, certain important decisions have already been made.

  • Analyse – Often, data analysis takes the form of simple descriptive charts, graphs and maps. But, when AI tools are added to the mix, analysis might involve building predictive models, able, for example, to suggest areas of a county that might see future hot-spots of violence, or creating interactive tools that can be used to perform ongoing monitoring of social media reports. However, it’s important, in adding AI to the analysis toolbox, not to skip entirely over other statistical methods, and instead to think about the relative strengths and weaknesses of a machine-learning model as against some other form of statistical model. One of the key issues to consider in algorithmic analysis is the ‘n’ required: that is, the sample size needed to train a model, or to get accurate results. It’s striking that many machine-learning techniques require a far larger dataset than can be easily supplied outside big corporate contexts. A second issue that can be considered in looking at analysis is how ‘explainable’ a model is: does the machine-learning method applied allow an exploration of the connections between input and output? Or is it only a black box?

  • Present – Where the output of conventional data analysis might be a graph or a chart describing a trend, the output of a machine-learning model may be a prediction. Where a summary of data might be static, a model could be used to create interactive content that responds to user input in some way. Thinking carefully about the presentation of the products of machine-learning based analysis could support a deeper understanding of the ways in which such outputs could or should be used to inform action.
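To make the ‘get’ and ‘verify’ stages above more concrete, here is a minimal, illustrative sketch (not drawn from the workshop itself) that turns free-text reports into a structured flag with a simple classifier, and then looks at how the choice of decision threshold trades false positives against false negatives on held-out test data. The example texts and labels are invented placeholders, and a real application would need far more training data.

```python
# Minimal sketch of the 'get' and 'verify' steps: classify free-text reports
# into a structured category, then inspect how a decision threshold trades
# false positives against false negatives. Texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = [
    "Report of harassment near the market",
    "Community arts event well attended",
    "Violence reported outside the school",
    "Library reading group met this week",
]  # in practice: thousands of posts or reports
labels = [1, 0, 1, 0]  # 1 = relevant incident, 0 = not relevant

# Hold back test data so the model is verified on examples it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# 'Get': learn to turn unstructured text into a structured prediction
vectoriser = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectoriser.fit_transform(X_train), y_train)
probabilities = model.predict_proba(vectoriser.transform(X_test))[:, 1]

# 'Verify': a lower threshold catches more incidents (fewer false negatives)
# at the cost of more false alarms (more false positives), and vice versa
for threshold in (0.3, 0.5, 0.7):
    flagged = [int(p >= threshold) for p in probabilities]
    print(f"threshold={threshold}: predicted={flagged}, actual={y_test}")
```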

The bullets above give just some (quickly drafted and incomplete) examples of how the data pipeline can be used to explore AI-literacy alongside data literacy. Hopefully, however, this acts as enough of a proof-of-concept to suggest this might warrant further development work.

The benefit of teaching AI literacy through open data

I also argue in The State of Open Data that:

AI approaches often rely on centralising big datasets and seeking to personalise services through the application of black-box algorithms. Open data approaches can offer an important counter-narrative to this, focusing on both big and small data and enabling collective responses to social and developmental challenges.

Operating well in a datified world requires citizens to have a critical appreciation of a wide variety of ways in which data is created, analysed and used – and the ability to judge which tool is appropriate to which context.  By introducing AI approaches as one part of the wider data toolbox, it’s possible to build this kind of literacy in ways that are not possible in training or capacity building efforts focussed on AI alone.

Over the horizons: reflections from a week discussing the State of Open Data

[Summary: thinking aloud with five reflections on future directions for open data related work, following discussions around the US east coast]

Over the last week I’ve had the opportunity to share findings from The State of Open Data: Histories and Horizons in a number of different settings: from academic roundtables, to conference presentations, and discussion panels.

Each has been an opportunity not only to promote the rich open access collection of essays just published, but also a chance to explore the many and varied chapters of the book as the starting point for new conversation about how to take forward an open approach to data in different settings and societies.

In this post I’m going to try and reflect on a couple of themes that have struck me during the week. (Note: These are, at this stage, just my initial and personal reflections, rather than a fully edited take on discussions arising from the book.)

Panel discussion at the GovLab with Tariq Khokhar, Adrienne Schmoeker and Beth Noveck.

Renewing open advocacy in a changed landscape

The timeliness of our look at the Histories and Horizons of open data was underlined on Monday when a tweet from Data.gov announced this week as their 10th anniversary, and the Open Knowledge Foundation also celebrated its 15th birthday with a return to its old name, a re-focussed mission to address all forms of open knowledge, and an emphasis on creating “a future that is fair, free and open”. As they put it:

“…in 2019, our world has changed dramatically. Large unaccountable technology companies have monopolised the digital age, and an unsustainable concentration of wealth and power has led to stunted growth and lost opportunities.”

going on to say

“we recognise it is time for new rules for this new digital world.”

Not only is this a welcome and timely example of the kind of “thinking politically” we call for in the State of Open Data conclusion, but it chimes with many of the discussions this week, which have focussed as much on the ways in which private sector data should be regulated as they have on opening up government data.

While, in tools like the Open Data Charter’s Open Up Guides, we have been able to articulate a general case for opening up data in a particular sector, and then to enumerate ‘high value’ datasets that efforts should attend to, future work may need to go even deeper into analysing the political economy around individual datasets, and to show how a mix of voluntary data sharing, and hard and soft regulation, can be used to more directly address questions about how power is created, structured and distributed through control of data.

As one attendee at our panel at the Gov Lab put it, right now, open data is still often seen as a “perk not a right”.  And although ‘right to data’ advocacy has an important role, it is by linking access to data to other rights (to clean air, to health, to justice etc.) that a more sophisticated conversation can develop around improving openness of systems as well as datasets (a point I believe Adrienne Schmoeker put in summing up a vision for the future).

Policy enables, problems drive

So does a turn towards problem-focussed open data initiatives mean we can put aside work on developing open data policies or readiness assessments? In short, no.

In a lunchtime panel at the World Bank, Anat Lewin offered an insightful reflection on The State of Open Data from a multilateral’s perspective, highlighting the continued importance of developing a ‘whole of government’ approach to open data. This was echoed in Adrienne Schmoeker’s description at The Gov Lab of the steps needed to create a city-wide open data capacity in New York. In short, without readiness assessment and open data policies put in place, initiatives that use open data as a strategic tool are likely to rub up against all sorts of practical implementation challenges.

Where in the past, government open data programmes have often involved going out to find data to release, the increasing presence of data science and data analytics teams in government means the emphasis is shifting onto finding problems to solve. Provided data analytics teams recognise the idea of ‘data as a team sport’, requiring not just technical skills, but also social science, civic engagement and policy development skill sets – and providing professional values of openness are embedded in such teams – then we may be moving towards a model in which ‘vertical’ work on open data policy, works alongside ‘horizontal’ problem-driven initiatives that may make less use of the language of open data, but which still benefit from a framework of openness.

Chapter discussions at the OpenGovHub, Washington DC

Political economy really matters

It’s been really good to see the insights that can be generated by bringing different chapters of the book into conversation. For example, at the Berkman-Klein Centre, comparing and contrasting attitudes in North America vs. North Africa towards the idea that governments might require transport app providers like Uber to share their data with the state revealed the different layers of concern, from differences in the market structure in each country, to different levels of trust in the state. Or as danah boyd put it in our discussions at Data & Society, “what do you do when the government is part of your threat model?”. This presents interesting challenges for the development of transnational (open) data initiatives and standards – calling for a recognition that the approach that works in one country (or even one city) may not work so well in others. Research still does too little to take into account the particular political and market dynamics that surround successful open data and data analytics projects.

A comparison across sectors, emerging from our ‘world cafe’ with State of Open Data authors at the OpenGovHub, also shows the trade-offs to be made when designing transparency, open data and data sharing initiatives. For example, where the extractives transparency community has the benefit of hard law to mandate certain disclosures, such law is comparatively brittle, and does not always result in the kind of structured data needed to drive analysis. By contrast, open contracting, in relying on a more voluntary and peer-pressure model, may be able to refine its technical standards more iteratively, but perhaps at the cost of weaker mechanisms to enforce comprehensive disclosure. As Noel Hidalgo put it, there is a design challenge in making a standard that is a baseline, on top of which more can be shared, rather than one that becomes a ceiling, where governments focus on minimal compliance.

It is also important to recognise that when data has power, many different actors may seek to control, influence and ultimately mess with it. As data systems become more complex, the vectors for attack can increase. In discussions at Data & Society, we briefly touched on one case where a government institution has had to take considerable steps to correct for external manipulation of its network of sensors. When data is used to trigger direct policy responses (e.g. weather data triggering insurance payouts, or crime data triggering policing action), then the security and scrutiny of that data becomes even more important.

Open data as a strategic tool for data justice

I heard the question “Is open data dead?” a few times over this week. As the introductory presentation I gave for a few talks noted, we are certainly beyond peak open data hype. But, the jury is, it seems, still very much out on the role that discourses around open data should play in the decade ahead. At our Berkman-Klein Centre roundtable, Laura Bacon shared work by Omidyar/Luminate/Dalberg that offered a set of future scenarios for work on open data, including the continued existence of a distinct open data field, and an alternative future in which open data becomes subsumed within some other agenda such as ‘data rights’. However, as we got into discussions at Data & Society of data on police violence, questions of missing data, and debates about the balancing act to be struck in future between publishing administrative data and protecting privacy, the language of ‘data justice’ (rather than data rights) appeared to offer us the richest framework for thinking about the future.

Data justice is broader than open data, yet open data practices may often be a strategic tool in bringing it about. I’ve been left this week with a sense that we have not done enough to date to document and understand ways of drawing on open data production, consumption and standardisation as a form of strategic intervention. If we had a better language here, better documented patterns, and a stronger evidence base on what works, it might be easier to both choose when to prioritise open data interventions, and to identify when other kinds of interventions in a data ecosystem are more appropriate tools of social progress and justice.

Ultimately, a lot of the discussions the book has sparked have been less about open data per se, and much more about the shape of data infrastructures, and questions of data interoperability. In discussions of Open Data and Artificial Intelligence at the OpenGovHub, we explored the failure of many efforts to develop interoperability within organisations and across organisational boundaries. I believe it was Jed Miller who put the challenge succinctly: to build interoperable systems, you need to “think like an organiser” – recognising data projects also as projects of organisational change and mass collaboration. Although I think we have mostly moved past the era in which civic technologists were walking around with an open data hammer, seeing every problem as a nail, we have some way to go before we have a full understanding of the open data tools that need to be in everyone’s toolbox, and those that may still need a specialist.

Reconfiguring measurement to focus on openness of infrastructure

One way to support advocacy for openness, whilst avoiding reifying open data, and integrating learning from the last decade on the need to embed open data practices sector-by-sector, could be found in an updated approach to measurement. David Eaves made the point in our Berkman-Klein Centre roundtable that the number of widely adopted standards, as opposed to the number of data portals or datasets, is a much better indicator of progress.

As resources for monitoring, measuring or benchmarking open data per se become more scarce, there is an opportunity to look at new measurement frames that examine the data infrastructure and ecosystem around a particular problem, and ask about the extent of openness, not only of data, but also of governance. A number of conversations this week have illustrated the value of shifting the discussion onto data infrastructure and interoperability: yet (a) the language of data infrastructure has not yet taken hold, and can be hard to pin down; and (b) there is a risk of openness being downplayed in favour of a focus on centralised data infrastructures. Updating open data measurement tools to look at infrastructures and systems rather than datasets may be one way to intervene in this unfolding space.

Thought experiment: a data extraction transparency initiative

[Summary: rapid reflections on applying extractives metaphors to data in an international development context]

In yesterday’s Data as Development Workshop at the Belfer Center for Science and International Affairs we were exploring the impact of digital transformation on developing countries and the role of public policy in harnessing it. The role of large tech firms (whether from Silicon Valley, or indeed from China, India and other countries around the world) was never far from the debate. 

Although in general I’m not a fan of descriptions of ‘data as the new oil’ (I find the equation tends to be made as part of rather breathless techno-deterministic accounts of the future), an extractives metaphor may turn out to be quite useful in asking about the kinds of regulatory regimes that could be appropriate to promote both development, and manage risks, from the rise of data-intensive activity in developing countries.

Over recent decades, principles of extractives governance have developed that recognise the mineral and hydrocarbon resources of a country as at least in part belonging to the common wealth, such that control of extraction should be regulated, firms involved in extraction should take responsibility for externalities from their work, revenues should be taxed, and taxes invested into development. When we think about firms ‘extracting’ data from a country, perhaps through providing social media platforms and gathering digital trace data, or capturing and processing data from sensor networks, or even collecting genomic information from a biodiverse area to feed into research and product development, what regimes could or should exist to make sure benefits are shared, externalities managed, and the ‘common wealth’ that comes from the collected data does not entirely flow out of the country, or into the pockets of a small elite?

Although real-world extractives governance has often not resolved all these questions successfully, one tool in the governance toolbox has been the Extractive Industries Transparency Initiative (EITI). Under EITI, member countries and companies are required to disclose information on all stages of the extractives process: from the granting of permissions to operate, through to the taxation or revenue sharing secured, and the social and economic spending that results. The model recognises that governance failures might come from the actions of both companies and governments – rather than assuming one or the other is the problem or benign. Although transparency alone does not solve governance problems, it can support better debate about both policy design and implementation, and can help address distorting information and power asymmetries that otherwise work against development.

So, what could an analogous initiative look like if applied to international firms involved in ‘data extraction’?

(Note: this is a rough-and-ready thought experiment testing out an extended version of an originally tweet-length thought. It is not a fully developed argument in favour of the ideas explored here).

Data as a national resource

Before conceptualising a ‘data extraction transparency initiative’ we need first to think about what counts as ‘data extraction’. This involves considering the collected informational (and attention) resources of a population as a whole. Although data itself can be replicated (marking a key difference from finite fossil fuels and mineral resources), the generation and use of data is often rival (i.e. if I spend my time on Facebook, I’m not spending it on some other platform, and/or some other tasks and activities), involves first-mover advantages (e.g. the first firm to capture street-view imagery of country X may corner that market), and can be made finite through law (e.g. someone collecting genomic material from a country may gain intellectual property rights protection for their data), or simply through restricting access (e.g. as Jeni considers here, where data is gathered from a community and used to shape policy, without the data being shared back to that community).

We could think then of data extraction as any data collection process which ‘uses up’ a common resource such as attention and time, which reduces the competitiveness of a market (thus shifting consumer to producer surplus), or which reduces the potential extent of the knowledge commons through intellectual property regimes or other restrictions on access and use. Of course, the use of an extracted data resource may have economic and social benefits that feed back to the subjects of the extraction. The point is not that all extraction is bad, but rather to be aware that data collection and use as an embedded process is definitely not the non-rival, infinitely replicable and zero-cost activity that some economic theories would have us believe.

(Note that underlying this lens is the idea that we should approach data extraction at the level of populations and environments, rather than trying to conceptualise individual ownership of data, or to define extraction in terms of a set of distinct transactions between firms and individuals.)

Past precedent: states and companies

Our model for data extraction, then, involves a relationship between firms and communities, which we will assume for the moment can be adequately represented by their states. A ‘data extraction transparency initiative’ would then be asking for disclosure from these firms at a country-by-country level, and disclosure from the states themselves. Is this reasonable to expect?

We can find some precedents for disclosure by looking at the most recent Ranking Digital Rights Report, released last week. This describes how many firms are now providing data about government requests for content or account restriction. A number of companies produce detailed transparency reports that describe content removal requests from government, or show political advertising spend. This at least establishes the idea that, whether voluntarily or through regulation, it is feasible to expect firms to disclose certain aspects of their operations.

The idea that states should disclose information about their relationship with firms is also reasonably well established (if not wholly widespread). Open Contracting, and the kind of project-level disclosure of payments to government that can be seen at ResourceProjects.org, illustrate ways in which transparency can be brought to the government-private sector nexus.

In short, encouraging or mandating the kinds of disclosures we might consider below is nothing new. Targeted transparency has long been in the regulatory toolbox.

Components of transparency

So – to continue the thought experiment: if we take some of the categories of EITI disclosure, what could this look like in a data context?

Legal framework

Countries would publish in a clear, accessible (and machine-readable?) form, details of the legal frameworks relating to privacy and data protection, intellectual property rights, and taxation of digital industries.

This should help firms to understand their legal obligations in each country, and may also make it easier for smaller firms to provide responsible services across borders without the current high costs of finding the basic information needed to make sure they are complying with laws country-by-country.

Firms could also be mandated to make their policies and procedures for data handling clear, accessible (and machine-readable?).
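
To make this a little more concrete, here is a minimal sketch (in Python) of what a machine-readable entry in such a legal-framework register might look like, and how a small firm could consume it. The country code, field names and values are all hypothetical illustrations, not a proposal for an actual schema.

```python
# Hypothetical sketch: a machine-readable summary of one country's data-related
# legal framework, as a small firm might consume it before launching a service.
# Country code, field names and values are illustrative placeholders only.

LEGAL_FRAMEWORKS = {
    "XX": {
        "data_protection_law": "Data Protection Act (illustrative)",
        "requires_local_registration": True,
        "digital_services_tax_rate": 0.02,   # 2% of in-country revenue (made up)
    },
}

def compliance_checklist(country_code: str) -> list[str]:
    """Turn a legal-framework entry into a simple checklist for a small firm."""
    framework = LEGAL_FRAMEWORKS.get(country_code)
    if framework is None:
        return ["No machine-readable framework published for this country."]
    checklist = [f"Review: {framework['data_protection_law']}"]
    if framework["requires_local_registration"]:
        checklist.append("Register locally as a data controller/processor.")
    checklist.append(
        f"Budget for digital services tax at {framework['digital_services_tax_rate']:.0%}."
    )
    return checklist

print(compliance_checklist("XX"))
```

The point of the sketch is simply that, if such registers existed in a consistent machine-readable form, the cost of cross-border compliance checks could fall sharply.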

Contracts, licenses and ownership

Whenever governments sign contracts that allow the private sector to collect or control data about citizens, public spaces, or the environment, these contracts should be public.

(In the Data as Development workshop, Sriganesh related the case of a city that had signed a 20-year deal for broadband provision, signing over all sorts of data to the private firm involved.)

Similarly, licenses to operate, and permissions granted to firms should be clearly and publicly documented.

Recently, EITI has also focussed on beneficial ownership information: seeking to make clear who is really behind companies. For digital industries, mandating clear disclosure of corporate structure, and potentially also of the data-sharing relationships between firms (as GDPR starts to establish) could allow greater scrutiny of who is ultimately benefiting from data extraction.

Production

In the oil, gas and mining context, firms are asked to reveal production volumes (i.e. the amount extracted). The rise of country-by-country reporting, and project-level disclosure has sought to push for information on activity to be revealed not at the aggregated firm level, but in a more granular way.

For data firms, this requirement might translate into disclosure of the quantity of data (in terms of number of users, number of sensors etc.) collected from a country, or disclosure of country-by-country earnings.

Revenue collection

One important aspect of EITI has been an audit and reconciliation process that checks that the amounts firms claim to be paying in taxes or royalties to government match up with the amounts government claims to have received. This requires disclosure from both private firms and government.
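
By way of illustration, the reconciliation step can be sketched as a simple comparison of two sets of declarations. The firm names, revenue streams and figures below are entirely hypothetical; the point is only to show the shape of the check an EITI-style process performs.

```python
# Minimal sketch of an EITI-style reconciliation: compare what firms say they
# paid with what government says it received, and flag discrepancies.
# All firm names, revenue streams and figures are hypothetical.

firm_reports = {          # payments declared by firms, keyed by (firm, revenue stream)
    ("DataCo", "connection_levy"): 1_200_000,
    ("DataCo", "corporate_tax"): 350_000,
    ("StreamCorp", "spectrum_fee"): 900_000,
}

government_reports = {    # receipts declared by the revenue authority
    ("DataCo", "connection_levy"): 1_200_000,
    ("DataCo", "corporate_tax"): 150_000,     # mismatch worth investigating
    ("StreamCorp", "spectrum_fee"): 900_000,
}

def reconcile(firm_side, gov_side, tolerance=0):
    """Return (firm, stream, firm_amount, gov_amount) tuples where declarations differ."""
    discrepancies = []
    for key in sorted(set(firm_side) | set(gov_side)):
        paid = firm_side.get(key, 0)
        received = gov_side.get(key, 0)
        if abs(paid - received) > tolerance:
            discrepancies.append((*key, paid, received))
    return discrepancies

for firm, stream, paid, received in reconcile(firm_reports, government_reports):
    print(f"{firm} / {stream}: firm reports {paid}, government reports {received}")
```

In a data extraction context the same pattern could apply to levies on connections, spectrum fees, or any country-level data taxes.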

A better understanding of whose digital activities are being taxed, and how, may support design of better policy that allows a share of revenues from data extraction to flow to the populations whose data-related resources are being exploited.

In yesterday’s workshop, Sriganesh pointed to the way in which some developing country governments now treat telecoms firms as an easy tax collection mechanism: if everyone wants a mobile phone connection, and mobile providers are already collecting payments, levying a charge on each connection, or a monthly tax, can be easy to administer. But, in the wrong places, and at the wrong levels, such taxes may capture consumer rather than producer surplus, and suppress rather than support the digital economy.

Perhaps one of the big challenges for ‘data as development’, when companies in more developed economies extract data from developing countries but process it back ‘at home’, is that current economic models may suggest that the biggest ‘added value’ is generated from the application of algorithms and processing. This (combined with creative accounting by big firms) can lead to little tax revenue in the countries from which data was originally extracted. Combining ‘production’ and ‘revenue’ data can at least bring this problem into view more clearly – and a strong country-by-country reporting regime may even allow governments to apply taxes more accurately.

Revenue allocation, social and economic spending

Important to the EITI model is the idea that when governments do tax, or collect royalties, they do so on behalf of the whole polity, and they should be accountable for how they are then using the resulting resources.

By analogy, a ‘data extraction transparency initiative’ may include requirements for greater transparency about how telecoms and data taxes are being used. This could further support multi-stakeholder dialogue on the kinds of public sector investments needed to support national development through use of data resources.

Environmental and social reporting

EITI encourages countries to ‘go beyond the standard’ and disclose other information too, including environmental information and information on gender.

Similar disclosures could also form part of a ‘data extraction transparency initiative’: encouraging or requiring firms to provide information on gender pay gaps and their environmental impact.

Is implementation possible?

So far this thought experiment has established ways of thinking about ‘data extraction’ by analogy to natural resource extraction, and has identified some potential disclosures that could be made by both governments and private actors. It has done so in the context of thinking about sustainable development, and how to protect developing countries from data exploitation, whilst also supporting them to appropriately and responsibly harness data as a developmental tool. There are some rough edges in all this: but also, I would argue, some quite feasible proposals too (disclosure of data-related contracts, for example).

Large-scale implementation would, of course, need careful design. The market structure, capital requirements and scale of digital and data firms are quite different from those of the natural resource industry. Compliance costs of any disclosure regime would need to be low enough to ensure that it is not only the biggest firms that can engage. Developing country governments also often have limited capacity when it comes to information management. Yet most of the disclosures envisaged above relate to transactions that, if ‘born digital’, should be fairly easy to publish data on. And where additional machine-readable data (e.g. on laws and policies) is requested, if standards are designed well, there could be a win-win for firms and governments – for example, by allowing firms to more easily identify and select cloud providers that allow them to comply with the regulatory requirements of a particular country.

The political dimensions of implementation are, of course, another story – and one I’ll leave out of this thought experiment for now.

But why? What could the impact be?

Now we come to the real question. Even if we could create a ‘data extraction transparency initiative’, could it have any meaningful developmental impacts?

Here’s where some of the impacts could lie:

  • If firms had to report more clearly on the amount of ‘data’ they are taking out of a country, and the revenue that gives rise to, governments could tailor licensing and taxation regimes to promote more developmental uses of data. Firms would also be encouraged to think about how they are investing in value-generation in countries where they operate. 
  • If contracts that involve data extraction are made public, terms that promote development can be encouraged, and those that diminish the opportunity for national development can be challenged.
  • If a country government chooses to engage in forms of ‘digital protectionism’, or to impose ‘local content requirements’ on the development of data technologies that could bring long-term benefits, but risk creating a short-term hit on the quality of digital services available in a country, greater transparency could support better policy debate. (Noting, however, that recent years have shown us that politics often trumps rational policy making in the real world).

There will inevitably be readers who see the thrust of this thought experiment as fundamentally anti-market, and who are fearful of, or ideologically opposed to, any of the kinds of government intervention that increasing transparency around data extraction might bring. It can be hard to imagine a digital future not dominated by the ever-increasing rise of a small number of digital monopolies. But, from a sustainable development point of view, allowing another path to be sought – one which supports the creation of resilient domestic technology industries, which prices in positive and negative externalities from data extraction, and which therefore allows active choices to be made about how national data resources are used as a common asset – may be no bad thing.

The State of Open Data: Histories and Horizons – panels and conversations

The online and open access versions of ‘The State of Open Data: Histories and Horizons’ went live yesterday. Do check it out!

We’ve got an official book launch on 27th May in Ottawa, but ahead of that, I’m spending the next 8 days on the US East Coast contributing to a few events to share learning from the project.

Over the last 18 months we’ve worked with 66 fantastic authors, and many other contributors, reviewers and editorial board members, to pull together a review of the last decade of activity on open data. The resulting collection provides short essays that look at open data in different sectors, from accountability and anti-corruption, to the environment, land ownership and international aid, as well as touching on cross-cutting issues, different stakeholder perspectives, and regional experiences. We’ve tried to distill key insights in overall and section introductions, and to draw out some emerging messages in an overall conclusion.

This has been my first experience pulling together a whole book, and I’m incredibly grateful to my co-editors, Steve Walker, Mor Rubinstein, and Fernando Perini, who have worked tirelessly over the project to bring together all these contributions, make sure the project is community driven, and present a professional final book to the world, particularly in what has been a tricky year personally. The team at our co-publishers, African Minds and IDRC (Simon, Leith, Francois and Nola), also deserve great thanks for their attention to detail and design.

I’ll try to write up some reflections and learning points on the book process in the near future, and will be blogging more about specific elements of the research in the coming weeks, but for now, let me share the schedule of upcoming events in case any blog readers happen to be able to join. I’ll aim to update this post later with links to any outputs from the sessions.

Book events

Thursday 16th May – 09:00–11:00 – Future directions for open data research and action

Roundtable at the Harvard Berkman Klein Center, with chapter authors David Eaves, Mariel Garcia Montes, Nagla Rizk, and response from Luminate’s Laura Bacon.

Thursday 16th May – Developing the Caribbean

I’ll be connecting via hangouts to explore the connections between data literacy, artificial intelligence, and private sector engagement with open data.

Monday 20th May – 12:00–13:00 – Let’s Talk Data – Does open data have an identity crisis?, World Bank I Building, Washington DC

A panel discussion as part of the World Bank Let’s Talk Data series, exploring the development of open data over the last decade. This session will also be webcast – see details on EventBrite.

Monday 20th May – 17:30–19:30 – World Cafe & Happy Hour @ OpenGovHub, Washington DC

We’ll be bringing together authors from lots of different chapters, including Shaida Badiee (National Statistics), Catherine Weaver (Development Assistance & Humanitarian Action), Jorge Florez (Anti-corruption), Alexander Howard (Journalists and the Media), Joel Gurin (Private Sector), Christopher Wilson (Civil Society) and Anders Pedersen (Extractives) to talk about their key findings in an informal world cafe style.

Tuesday 21st May – The State of Open Data: Open Data, Data Collaboratives and the Future of Data Stewardship, GovLab, New York

I’m joining Tariq Khokhar, Managing Director & Chief Data Scientist, Innovation, The Rockefeller Foundation, Adrienne Schmoeker, Deputy Chief Analytics Officer, City of New York and Beth Simone Noveck, Professor and Director, The GovLab, NYU Tandon (and also foreword writer for the book), to discuss changing approaches to data sharing, and how open data remains relevant.

Wednesday 22nd May – 18:00–20:00 – Small Group Session at Data & Society, New York

Join us for discussions of themes from the book, and how open data communities could or should interact with work on AI, big data, and data justice.

Monday 27th May – 17:00–19:30 – Book Launch in Ottawa

Join me and the other co-editors to celebrate the formal launch of the book!

Notes from a RightsCon panel on AI, Open Data and Privacy

[Summary: Preliminary notes on open data, privacy and AI]

At the heart of open data is the idea that when information is provided in a structured form, freely accessible, and with permission granted for anyone to re-use it, latent social and economic value within it can be unlocked.

Privacy positions assert the right of individuals to control their own information and data about them, and to have protection from harms that might occur through exploitation of their data.

Artificial intelligence is a field of computing concerned with equipping machines with the ability to perform tasks that previously required human intelligence, including recognising patterns, making judgements, and extracting and analysing semi-structured information.

Around each of these concepts vibrant (and broad based) communities exist: advocating respectively for policy to focus on openness, privacy and the transformative use of AI. At first glance, there seem to be some tensions here: openness may be cast as the opposite of privacy; or the control sought in privacy as starving artificial intelligence models of the data they could use for social good. The possibility within AI of extracting signals from messy records might appear to negate the need to construct structured public data, and as data-hungry AI draws increasingly upon proprietary data sources, the openness of data on which decisions are made may be undermined. At some points these tensions are real. But if we dig beneath surface level oppositions, we may find arguments that unite progressive segments of each distinct community – and that can add up to a more coherent contemporary narrative around data in society.

This was the focus of a panel I took part in at RightsCon in Toronto last week, curated by Laura Bacon of Omidyar Network, and discussing with Carlos Affonso Souza (ITS Rio) and Frederike Kaltheuner (Privacy International) – and the first in a series of panels due to take place over this year at a number of events. In this post I’ll reflect on five themes that emerged both from our panel discussion, and more widely from discussions I had at RightsCon. These remarks are early fragments, rather than complete notes, and I’m hoping that a number may be unpacked further in the upcoming panels.

The historic connection of open data and AI

The current ‘age of artificial intelligence’ is only the latest in a series of waves of attention the concept has had over the years. In this wave, the emphasis is firmly upon the analysis of large collections of data, predominantly proprietary data flows. But it is notable that a key thread in advocacy for open government data in the late 2000s came from Artificial Intelligence and semantic web researchers such as Prof. Nigel Shadbolt, whose Advanced Knowledge Technologies (AKT) programme was involved in many early re-use projects with UK public data, and Prof. Jim Hendler at TWC. Whilst I’m not aware of any empirical work that explores the extent to which open government data has gone on to feed into machine-learning models, in terms of bootstrapping data-hungry research, there is a connection here to be explored.

There is also an argument to be made that open data advocacy, implementation and experiences over the last ten years have played an important role in contributing to growing public understandings of data, and in embedding cultural norms around seeking access to the raw data underlying decisions. Without the last decade of action on open data, we might be encountering public sector AI based purely on proprietary models, as opposed to now navigating a mixed ecology of public and private AI.

(Some) open data is getting personal

It’s not uncommon to hear open data advocates state that open data only covers ‘non-personal data’. It’s certainly true that many of the datasets sought through open data policy, such as bus timetables, school rankings, national maps, weather reports and farming statistics, don’t contain any personally identifying information (PII). Yet, whilst we should be able to mark a sizable territory of the open data landscape as free from privacy concerns, there are increasingly blurred lines at points where ‘public data’ is also ‘personal data’.

In some cases, this may be due to mosaic effects: where the combination of multiple open datasets could be personally identifying. In other cases, the power of AI to extract structured data from public records about people raises interesting questions about how far permissive regimes of access and re-use around those documents should also apply to datasets derived from them. However, there are also cases where open data strategies are being applied to the creation of new datasets that directly contain personally identifying information.

In the RightsCon panel I gave the example of Beneficial Ownership data: information about the ultimate owners of companies that can be used to detect illicit use of shell companies for money laundering or tax evasion, or that can support better due diligence on supply chains. Transparency campaigners have called for beneficial ownership registers to be public and available as open data, citing the risk that restricted registers will be under-used and much less effective than open registers, and drawing on the idea of a social contract in which the limited liability conferred by a company comes with the responsibility to be identified as a party to that company. We end up then with data that is both public (part of the public record) and personal (containing information about identified individuals).

Privacy is not secrecy: but consent remains key

Frederike Kaltheuner kicked off our discussions of privacy on the panel by reminding us that privacy and secrecy are not the same thing. Rather, privacy is related to control: the ability of individuals and communities to exercise rights over the presentation and use of their data. The beneficial ownership example highlights that not all personal data can or should be kept secret, as taking an ownership role in a company comes with a consequent publicity requirement. However, as Ann Cavoukian forcefully put the point in our discussions, the principle of consent remains vitally important. Individuals need to be informed enough about when and how their personal information may be shared in order to make an informed choice about entering into any relationship which requests or requires information disclosure.

When we reject a framing of privacy as secrecy, and engage with ideas of active consent, we can see, as the GDPR does, that privacy is not a binary choice, but instead involves a set of choices in granting permissions for data use and re-use. Where, as in the case of company ownership, the choice is effectively between being named in the public record vs. not taking on company ownership, it is important for us to think more widely about the factors that might make that choice trickier for some individuals or groups. For example, as Kendra Albert explained to me, for trans people a business process that requires current and former names to be on the public record may have substantial social consequences. This highlights the need for careful thinking about data infrastructures that involve personal data, such that they can best balance social benefits and individual rights, giving a key place to mechanisms of active consent, and avoiding the creation of circumstances in which individuals may find themselves choosing uncomfortably between ‘the lesser of two harms’.

Is all data relational?

One of the most challenging aspects of the recent Cambridge Analytica scandal is the fact that even if individuals did not consent at any point to the use of their data by Facebook apps, there is a chance they were profiled as a result of data shared by people in their wider network. Whereas it might be relatively easy to identify the subject of a photo, and to give that individual rights of control over the use and distribution of their image, an individual ownership and rights framework can be difficult to apply to many modern datasets. Much of the data of value to AI analysis, for example, concerns the relationship between individuals, or between individuals and the state. When there are multiple parties to a dataset, each with legitimate interests in the collection and use of the data, who holds the rights to govern its re-use?

Strategies of regulation

What unites the progressive parts of the open data, privacy and AI communities? I’d argue that each has a clear recognition of the power of data, and a concern with minimising harm (albeit with a primary focus on individual harm in privacy contexts, and with the emphasis placed on wider social harms from corruption or poor service delivery by open data communities)*. As Martin Tisné has suggested, in a context where harmful abuses of data power are all around us, this common ground is worth building on. But in charting a way forward, we need to more fully unpack where there are differences of emphasis, and different preferences for regulatory strategies – produced in part by the different professional backgrounds of those playing leadership roles in each community.

(*I was going to add ‘scepticism about centralised power’ (of companies and states) to the list of common attributes across progressive privacy, open data and AI communities, but I don’t have a strong enough sense of whether this could apply in an AI context.)

In our RightsCon panel I jotted down and shared five distinct strategies that may be invoked:

  • Reshaping inputs – for example, where an AI system is generating biased outputs, work can take place to make sure the inputs it receives are more representative. This strategy essentially responds to negative outcomes from data by adding more, corrective, data.
  • Regulating ownership – for example, asserting that individuals have ownership of their data, and can use ownership rights to make claims of control over that data. Ownership plays an important role in open data licensing arrangements, but runs up against the ‘relational data problem’ in many cases, where it’s not clear who has ownership rights.
  • Regulating access – for example, creating a dataset of company ownership only available to approved actors, or keeping potentially disclosive AI training datasets from being released.
  • Regulating use – for example, allowing that a beneficial ownership register is public, but ensuring that use of the data to target individuals is strictly prohibited, and that prohibitions are enforced.
  • Remediating consequences – for example, recognising that harm is caused to some groups by the publicity of certain data, but judging that the net public benefit is such that the data should remain public, but the harm should be redressed by some other aspect of policy.

By digging deeper into questions of motivations, goals and strategies, my sense is we will be better able to find the points where AI, privacy and open data intersect in a joint critical engagement with today’s data environment.

Where next?

I’m looking forward to exploring these themes more, both at the next panel in this series at the Open Government Partnership meeting in Tbilisi in July, and through the State of Open Data project.

Publishing with purpose? Reflections on designing with standards and locating user engagement

[Summary: Thinking aloud about open data and data standards as governance tools]

There are interesting shifts in the narratives of open data taking place right now.

Earlier this year, the Open Data Charter launched their new strategy, “Publishing with purpose”, situating it as a move on from the ‘raw data now’ days, when governments took an open data initiative to mean just publishing easy-to-open datasets online and linking to them from data catalogues.

The Open Contracting Partnership, which has encouraged governments to purposely prioritise publication of procurement data for a number of years now, has increasingly been exploring questions of how to design interventions so that they can most effectively move from publication to use. The idea enters here that we should be spending more time with governments focussing on their use cases for data disclosure.

The shifts are welcome, and move us closer to understanding open data as strategy. However, there are also risks at play, and we need to take a critical look at the way these approaches could or should play out.

In this post, I introduce a few initial thoughts, though recognising these are as yet underdeveloped. This post is heavily influenced by a recent conversation convened by Alan Hudson of Global Integrity at the OpenGovHub, where we looked at the interaction of ‘(governance) measurement, data, standards, use and impact’.

(1) Whose purpose?

The call for ‘raw data now’ was not without purpose: but it was about the purpose of particular groups of actors, not least semantic web researchers looking for a large corpus of data to test their methods on. This call configured open data towards the needs and preferences of a particular set of (technical) actors, based on the theory that they would then act as intermediaries, creating a range of products and platforms that would serve the purposes of other groups. That theory hasn’t delivered in practice, with lots of datasets languishing unused, and governments puzzled as to why the promised flowering of re-use has not occurred.

Purpose itself then needs unpacking. Just as early research into the open data agenda questioned how different actors’ interests may have been co-opted or subverted, we need to keep the question of ‘whose purpose’ central to the publish-with-purpose debate.

(2) Designing around users

Sunlight Foundation recently published a write-up of their engagement with Glendale, Arizona on open data for public procurement. They describe a process that started with a purpose (“get better bids on contract opportunities”), and then engaged with vendors to discuss and test out datasets that were useful to them. The resulting recommendations emphasise particular data elements that could be prioritised by the city administration.

Would Glendale have the same list of required fields if they had started asking citizens about better contract delivery? Or if they had worked with government officials to explore the problems they face when identifying how well a vendor will deliver? For example, the Glendale report doesn’t mention including supplier information and identifiers: central to many contract analysis or anti-corruption use cases.

If we see ‘data as infrastructure’, then we need to consider the appropriate design methods for user engagement. My general sense is that we’re currently applying user-centred design methods that were developed to deliver consumer products to questions of public infrastructure, and that this has some risks. Infrastructures differ from applications in their iterability, durability, embeddedness and reach. Premature optimisation for particular data users’ needs may make it much harder to meet the needs of other users in future.

I also have the concern (though, I should note, not in any way based on the Glendale case) that user-centred design done badly can be worse than user-centred design not done at all. User engagement and research is a profession with its own deep skill set, just as work on technical architecture is, even if it looks at first glance easier to pick up and replicate. Learning from the successes, and failures, of integrating user-centred design approaches into bureaucratic contexts and government incentive structures needs to be taken seriously. A lot of this is about mapping the moments and mechanisms for user engagement (and remembering that whilst it might help the design process to talk ‘user’ rather than ‘citizen’, sometimes decisions of purpose should be made at the level of the citizenry, not their user stand-ins).

(3) International standards, local adoption

(Open) data standards are a tool for data infrastructure building. They can represent a wide range of user needs to a data publisher, embedding requirements distilled from broad research, and can support interoperability of data between publishers – unlocking cross-cutting use-cases and creating the economic conditions for a marketplace of solutions that build on data. (They can, of course, also do none of these things: acting as interventions to configure data to the needs of a particular small user group.)

But in seeking to be generally usable, standards are generally not tailored to particular combinations of local capacity and need. (This pairing is important: if resource and capacity were no object, and each of the requirements of a standard were relevant to at least one user need, then there would be a case to just implement the complete standard. This resource-unconstrained world is not one we often find ourselves in.)

How then do we secure the benefits of standards whilst adopting a sequenced publication of data, given the resources available in a given context? This isn’t a solved problem: but in the mix are issues of measurement, indicators and incentive structures, as well as designing some degree of implementation levels and flexibility into standards themselves. Validation tools, guidance and templated processes all help to make sure data can deliver the direct outcomes that might motivate an implementer, whilst not cutting off indirect or alternative outcomes that have wider social value.

(I’m aware that I write this from a position of influence over a number of different data standards. So I have to also introspect on whether I’m just optimising for my own interests in placing the focus on standard design. I’m certainly concerned with the need to develop a clearer articulation of the interaction of policy and technical artefacts in this element of standard setting and implementation, in order to invite both more critique, and more creative problem solving, from a wider community. This somewhat densely written blog post clearly does not get there yet.)

Some preliminary conclusions

In thinking about open data as strategy, we can’t set rules for the relative influence that ‘global’ or ‘local’ factors should have in any decision making. However, the following propositions might act as a starting point for decision making at different stages of an open data intervention:

  • Purpose should govern the choice of dataset to focus on
  • Standards should be the primary guide to the design of the datasets
  • User engagement should influence engagement activities ‘on top of’ published data to secure prioritised outcomes
  • New user needs should feed into standard extension and development
  • User engagement should shape the initiatives built on top of data

Some open questions

  • Are there existing theoretical frameworks that could help make more sense of this space?
  • Which metaphors and stories could make this more tangible?
  • Does it matter?

Exploring participatory public data infrastructure in Plymouth

[Summary: Slides, notes and references from a conference talk in Plymouth]

A few months back I was invited to give a presentation to a joint plenary of the ‘Whose Right to the Smart City‘ and ‘DataAche 2017‘ conferences in Plymouth. Building on some recent conversations with Jonathan Gray, I took the opportunity to try and explore some ideas around the concept of ‘participatory data infrastructure’, linking those loosely with the smart cities theme.

As I fear I might not get time to turn it into a reasonable paper anytime soon, below is a rough transcript of what I planned to say when I presented earlier today. The slides are also below.

For those at the talk, the promised references are found at the end of this post.

Thanks to Satyarupa Shekar for the original invite, Katharine Willis and the Whose Right to the Smart Cities network for stimulating discussions today, and to the many folk whose ideas I’ve tried to draw on below.

Participatory public data infrastructure: open data standards and the turn to transparency

In this talk, my goal is to explore one potential strategy for re-asserting the role of citizens within the smart-city. This strategy harnesses the political narrative of transparency and explores how it can be used to open up a two-way communication channel between citizens, states and private providers.

This not only offers the opportunity to make processes of governance more visible and open to scrutiny, but also creates a space for debate over the collection, management and use of data within governance, giving citizens an opportunity to shape the data infrastructures that do so much to shape the operation of smart cities, and of modern data-driven policy and its implementation.

In particular, I will focus on data standards, or more precisely, open data standards, as a tool that can be deployed by citizens (and, we must acknowledge, by other actors, each with their own, sometimes quite contrary interests), to help shape data infrastructures.

Let me set out the structure of what follows. It will be an exploration in five parts, the first three unpacking the title, and then the fourth looking at a number of case studies, before a final section summing up.

  1. Participatory public data infrastructure
  2. Transparency
  3. Standards
  4. Examples: Money, earth & air
  5. Recap

Part 1: Participatory public data infrastructure

Data infrastructure

infrastructure /ˈɪnfrəstrʌktʃə/ noun. “the basic physical and organizational structures and facilities (e.g. buildings, roads, power supplies) needed for the operation of a society or enterprise.” 1

The word infrastructure comes from the Latin ‘infra-’, meaning below, combined with ‘structure’. It provides the shared set of physical and organizational arrangements upon which everyday life is built.

The notion of infrastructure is central to conventional imaginations of the smart city. Fibre-optic cables, wireless access points, cameras, control systems, and sensors embedded in just about anything, constitute the digital infrastructure that feed into new, more automated, organizational processes. These in turn direct the operation of existing physical infrastructures for transportation, the distribution of water and power, and the provision of city services.

However, between the physical and the organizational lies another form of infrastructure: data and information infrastructure.

(As a sidebar: Although data and information should be treated as analytically distinct concepts, as the boundary between the two concepts is often blurred in the literature, including in discussions of ‘information infrastructures’, and as information is at times used as a super-category including data, I won’t be too strict in my use of the terms in the following).

(That said,) It is by being rendered as structured data that the information from the myriad sensors of the smart city, or the submissions by hundreds of citizens through reporting portals, is turned into management information, and fed into human or machine based decision-making, and back into the actions of actuators within the city.

Seen as a set of physical or digital artifacts, the data infrastructure involves ETL (Extract, Transform, Load) processes, APIs (Application Programming Interfaces), databases and data warehouses, stored queries and dashboards, schema, codelists and standards. Seen as part of a wider ‘data assemblage’ (Kitchin 5), this data infrastructure also involves various processes of data entry and management, of design, analysis and use, as well as relationships to other external datasets, systems and standards.

However, it is often very hard to ‘see’ data infrastructure. By their very nature, infrastructures move into the background, often only ‘visible upon breakdown’, to use Star and Ruhleder’s phrase 2. (For example, you may only really pay attention to the shape and structure of the road network when your planned route is blocked…). It takes a process of “infrastructural inversion” to bring information infrastructures into view 3, deliberately foregrounding the background. I will argue shortly that ‘transparency’ as a policy performs much the same function as ‘breakdown’ in making the contours of infrastructure more visible: taking something created with one set of use-cases in mind, and placing it in front of a range of alternative use-cases, such that its affordances and limitations can be more fully scrutinized, and, building on that scrutiny, its future development shaped. But before we come to that, we need to understand the extent of ‘public data infrastructure’ and the different ways in which we might understand a ‘participatory public data infrastructure’.

Public data infrastructure

There can be public data without a coherent public data infrastructure. In ‘The Responsive City’ Goldsmith and Crawford describe the status quo for many as “The century-old framework of local government – centralized, compartmentalized bureaucracies that jealously guard information…” 4. Datasets may exist, but are disconnected. Extracts of data may even have come to be published online in data portals in response to transparency edicts – but these exist as islands of data, published in different formats and structures, without any attention to interoperability.

Against this background, initiatives to construct public data infrastructure have sought to introduce shared technology, standards and practices that provide access to a more coherent collection of data generated by, and focusing on, the public tasks of government.

For example, in 2012, Denmark launched their ‘Basic Data’ programme, looking to consolidate the management of geographic, address, property and business data across government, and to provide common approaches to data management, update and distribution 6. In the European Union, the INSPIRE Directive and programme has been driving creation of a shared ‘Spatial Data Infrastructure’ since 2007, providing reference frameworks, interoperability rules, and data sharing processes. And more recently, the UK Government has launched a ‘Registers programme’ 8 to create centralized reference lists and identifiers of everything from countries to government departments, framed as part of building government’s digital infrastructure. In cities, similar processes of infrastructure building, around shared services, systems and standards, are taking place.

The creation of these data infrastructures can clearly have significant benefits for both citizens and government. For example, instead of citizens having to share the same information with multiple services, often in subtly different ways, through a functioning data infrastructure governments can pick up and share information between services, and can provide a more joined up experience of interacting with the state. By sharing common codelists, registers and datasets, agencies can end duplication of effort, and increase their intelligence, drawing more effectively on the data that the state has collected.

However, at the same time, these data infrastructures tend to have a particularly centralizing effect. Whereas a single agency maintaining their own dataset has the freedom to add in data fields, or to restructure their working processes, in order to meet a particular local need – when that data is managed as part of a centralized infrastructure, their ability to influence change in the way data is managed will be constrained both by the technical design and the institutional and funding arrangements of the data infrastructure. A more responsive government is not only about better intelligence at the center, it is also about autonomy at the edges, and this is something that data infrastructures need to be explicitly designed to enable, and something that they are generally not oriented towards.

In “Roads to Power: Britain Invents the Infrastructure State” 10, Jo Guldi uses a powerful case study of the development of the national highways network to illustrate the way in which the design of infrastructures shapes society, and to explore the forces at play in shaping public infrastructure. When metalled roads first spread out across the country in the eighteenth century, there were debates over whether to use local materials, easy to maintain with local knowledge, or to apply a centralized ‘tarmacadam’ standard to all roads. There were questions of how the network should balance the needs of the majority with road access for those on the fringes of the Kingdom, and how the infrastructure should be funded. This public infrastructure was highly contested, and the choices made over its design had profound social consequences. Guldi uses this as an analogy for debates over modern Internet infrastructures, but it can be equally applied to explore questions around an equally intangible public data infrastructure.

If you build roads to connect the largest cities, but leave out a smaller town, the relative access of people in that town to services, trade and wider society is diminished. In the same way, if your data infrastructure lacks the categories to describe the needs of a particular population, their needs are less likely to be met. Yet that town might also not want to be connected directly to the road network, and to see its uniqueness and character eroded; much as some groups may want to resist their categorization and integration in the data infrastructure in ways that restrict their ability to self-define and develop autonomous solutions, in the face of centralized data systems that are necessarily reductive.

Alongside this tension between centralization and decentralization in data infrastructures, I also want to draw attention to another important aspect of public data infrastructures: the issue of ownership and access. Increasingly, public data infrastructures may rely upon stocks and flows of data that are not publicly owned. In the United Kingdom, for example, the Postcode Address File, which is the basis of any addressing service, was one of the assets transferred to the private sector when Royal Mail was sold off. The Ordnance Survey retains ownership and management of the Unique Property Reference Number (UPRN), a central part of the data infrastructure for local public service delivery, yet access to this is heavily restricted, and complex agreements govern the ability of even the public sector to use it. Historically, authorities have faced major challenges in relation to ‘derived data’ from Ordnance Survey datasets, where the use of proprietary mapping products as a base layer when generating local records ‘infects’ those local datasets with the intellectual property rights of the proprietary dataset, and restricts who they can be shared with. Whilst open data advocacy has secured substantially increased access to many publicly owned datasets in recent years, when the datasets the state is using are privately owned in the first place, and only licensed to the state, the potential scope for public re-use and scrutiny of the data, and scrutiny of the policy made on the basis of it, is substantially limited.

In the case of smart cities, I suspect this concern is likely to be particularly significant. Take transit data for example: in 2015 Boston, Massachusetts did a deal with Uber to allow access to data from the data-rich transportation firm to support urban planning and to identify approaches to regulation. Whilst the data shared reveals something of travel times, the limited granularity rendered it practically useless for planning purposes, and Boston turned to senate regulations to try and secure improved data 9. Yet, even if the city does get improved access to data about movements via Uber and Lyft in the city – the ability of citizens to get involved in the conversations about policy from that data may be substantially limited by continued access restrictions on the data.

With the Smart City model often involving the introduction of privately owned sensor networks and processes, the extent to which the ‘data infrastructure for public tasks’ ceases to have the properties that we will shortly see are essential to a ‘participatory public data infrastructure’ is a question worth paying attention to.

Participatory public data infrastructure

I will posit then that the growth of public data infrastructures is almost inevitable. But the shape they take is not. I want, in particular, to examine what it would mean to have a participatory public data infrastructure.

I owe the concept of a ‘participatory public data infrastructure’ in particular to Jonathan Gray ([11], [12], [13]), who has, across a number of collaborative projects, sought to unpack questions of how data is collected and structured, as well as released as open data. In thinking about the participation of citizens in public data, we might look at three aspects:

  1. Participation in data use
  2. Participation in data production
  3. Participation in data design

And, seeing these as different in kind, rather than different in degree, we might for each one deploy Arnstein’s ladder of participation [14] as an analytical tool, to understand that the extent of participation can range from tokenism through to full shared decision making. As for all participation projects, we must also ask the vitally important question of ‘who is participating?’.

At the bottom, ‘non-participation’ rungs of Arnstein’s ladder, we could see a data infrastructure that captures data ‘about’ citizens without their active consent or involvement, that excludes them from access to the data itself, and then uses the data to set rules, ‘deliver’ services, and enact policies over which citizens have no influence in either their design or delivery. The citizen is treated as an object, not an agent, within the data infrastructure. For some citizens’ contemporary experience, and in some smart city visions, this description might not be far from a direct fit.

By contrast, when citizens have participation in the use of a data infrastructure they are able to make use of public data to engage in both service delivery and policy influence. This has been where much of the early civic open data movement placed their focus, drawing on ideas of co-production, and government-as-a-platform, to enable partnerships or citizen-controlled initiatives, using data to develop innovative solutions to local issues. In a more political sense, participation in data use can remove information inequality between policy makers and the subjects of that policy, equalizing at least some of the power dynamic when it comes to debating policy. If the ‘facts’ of population distribution and movement, electricity use, water connections, sanitation services and funding availability are shared, such that policy maker and citizen are working from the same data, then the data infrastructure can act as an enabler of more meaningful participation.

In my experience though, the more common outcome when engaging diverse groups in the use of data is not an immediate shared analysis – but instead a lot of discussion of gaps and issues in the data itself. In some cases, the way data is being used might be uncontested, but the input might turn out to be misrepresenting the lived reality of citizens. This takes us to the second area of participation: the ability not just to take from a dataset, but also to participate in dataset production. Simply having data collected from citizens does not make a data infrastructure participatory. That sensors track my movement around an urban area does not make me an active participant in collecting data. But by contrast, when citizens come together to collect new datasets, such as the water and air quality datasets generated by sensors from Public Lab 15, and are able to feed this into the shared corpus of data used by the state, there is much more genuine participation taking place. Similarly, the use of voluntarily contributed data on Open Street Map, or submissions to issue-tracking platforms like FixMyStreet, constitute a degree of participation in producing a public data infrastructure when the state also participates in use of those platforms.

It is worth noting, however, that most participatory citizen data projects, whether concerned with data use or production, are both patchy in their coverage and hard to sustain. They tend to offer an add-on to the public data infrastructure, but to leave the core substantially untouched, not least because of the significant biases that can occur due to inequalities of time, hardware and skills to be able to contribute and take part.

If then we want to explore participation that can have a sustainable impact on policy, we need to look at shaping the core public data infrastructure itself – looking at the existing data collection activities that create it, and exploring whether or not the data collected, and how it is encoded, serves the broad public interest, and allows the maximum range of democratic freedom in policy making and implementation. This is where we can look at a participatory data infrastructure as one that enables citizens (and groups working on their behalf) to engage in discussions over data design.

The idea that communities, and citizens, should be involved in the design of infrastructures is not a new one. In fact, the history of public statistics and data owes a lot to voluntary social reform movements focused on health and social welfare, which collected social survey data in the eighteenth and nineteenth centuries to influence policy, and then advocated for government to take up ongoing data collection. The design of the census and other government surveys has long been a source of political contention. Yet, with the vast expansion of connected data infrastructures, which rapidly become embedded, brittle and hard to change, we are facing a particular moment at which increased attention is needed to the participatory shaping of public data infrastructures, and to considering the consequences of seemingly technical choices on our societies in the future.

Ribes and Baker [16], in writing about the participation of social scientists in shaping research data infrastructures, draw attention to the aspect of timing: highlighting the limited window during which an infrastructure may be flexible enough to allow substantial insights from social science to be integrated into its development. My central argument is that transparency, and the move towards open data, offers a key window within which to shape data infrastructures.

Part 2: Transparency

transparency /tranˈspar(ə)nsi/ noun “the quality of being done in an open way without secrets” 21

Advocacy for open data has many distinct roots: not only in transparency. Indeed, I’ve argued elsewhere that it is the confluence of many different agendas around a limited consensus point in the Open Definition that allowed the breakthrough of an open data movement late in the last decade [17] [18]. However, the normative idea of transparency plays an important role in questions of access to public data. It was a central part of the framing of Obama’s famous ‘Open Government Directive’ in 2009 20, and transparency was core to the rhetoric around the launch of data.gov.uk in the wake of a major political expenses scandal.

Transparency is tightly coupled with the concept of accountability. When we talk about government transparency, it is generally as part of government giving account for its actions: whether to individuals, or to the population at large via the third and fourth estates. To give effective account, government can’t just make claims; it has to substantiate them. Transparency is a tool allowing citizens to exercise control over their governments.

Sweden’s Freedom of the Press law of 1766 was the first to establish a legal right to information, but it was a slow burn until the middle of the last century, when ‘right to know’ statutes started to gather pace, such that over 100 countries now have Right to Information laws in place. Increasingly, these laws recognize that transparency requires not only access to documents, but also access to datasets.

It is also worth noting that transparency has become an important regulatory tool of government, where government may demand transparency of others. As Fung et al. argue in ‘Full Disclosure’, governments have turned to targeted transparency as a way of requiring that certain information (including from the private sector) is placed in the public domain, with the goal of disciplining markets or influencing the operation of marketized public services by improving the availability of information upon which citizens make choices [19].

The most important thing to note here is that demands for transparency are often not just about ‘opening up’ a dataset that already exists – ultimately they are about developing an account of some aspect of public policy. Creating this account might require data to be connected up from different silos, and may require the creation of new data infrastructures.

This is where standards enter the story.

Part 3: Standards

standard /ˈstandəd/ noun

something used as a measure, norm, or model in [comparative] evaluations.

The first thing I want to note about ‘standards’ is that the term is used in very different ways by different communities of practice. For a technical community, a data standard more-or-less means a technical specification, or even a schema, setting out in minute detail exactly how certain information should be represented as data. To assess whether data ‘meets’ the standard is a question of how the data is presented. For a policy audience, talk of data standards is interpreted much more as a question of collection and disclosure norms: to assess whether data meets the standard here is more a question of what data is presented. In practice, these aspects interrelate. With anything more than a few records, assessing ‘what’ has been disclosed requires processing the data, and that requires it to be modelled according to some reasonable specification.
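
To make the contrast concrete, here is a minimal Python sketch (with entirely hypothetical field names) of the two senses in which a record might ‘meet’ a standard: one check cares about how the data is formatted, the other about what has actually been disclosed.

```python
# Illustrative sketch (field names are hypothetical): two different senses of
# "meeting a standard" for a single spending record.

record = {"supplier": "Zurich Insurance Co", "amount": "50,000.00", "date": "31.12.2010"}

def meets_specification(rec):
    """The technical sense: is the data represented in the agreed format?"""
    try:
        float(rec["amount"].replace(",", ""))        # amount parses as a number
        return len(rec["date"].split(".")) == 3      # date follows the DD.MM.YYYY pattern
    except (KeyError, ValueError):
        return False

# The policy sense: the disclosure norm expects more fields than the format check does.
EXPECTED_FIELDS = {"supplier", "amount", "date", "purpose", "transaction_id"}

def meets_disclosure_norm(rec):
    """The policy sense: is everything the norm expects actually present?"""
    return EXPECTED_FIELDS <= rec.keys()

print(meets_specification(record))    # True  – the format is fine
print(meets_disclosure_norm(record))  # False – 'purpose' and 'transaction_id' are missing
```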

The second thing I want to note about standards is that they are highly interconnected. If we agree upon a standard for the disclosure of government budget information, for example, then in order to produce data to meet that standard, government may need to check that a whole range of internal systems are generating data in accordance with the standard. The standard for disclosure that sits on the boundary of a public data infrastructure can have a significant influence on other parts of that infrastructure, or its operation can be frustrated when other parts of the infrastructure can’t produce the data it demands.

The third thing to note is that a standard is only really a standard when it has multiple users. In fact, the greater the community of users, the stronger, in effect, the standard is.

So – with these points in mind, let’s look at how a turn to transparency and open data has created both pressure for application of data standards, and an opening for participatory shaping of data infrastructures.

One of the early rallying cries of the open data movement was ‘Raw Data Now’. Yet it turns out that raw data, as a set of database dumps of selected tables from the silo datasets of the state, does not always produce effective transparency. What it does do, however, is start a conversation between citizens, the private sector and the state over the nature of the data collected, held and shared.

Take for example this export from a council’s financial system in response to a central government policy calling for transparency on spend over £500.

| Service Area | BVA Cop | ServDiv Code | Type Code | Date | Transaction No. | Amount | Revenue / Capital | Supplier |
|---|---|---|---|---|---|---|---|---|
| Balance Sheet | 900551 | Insurance Claims Payment (Ext) | 47731 | 31.12.2010 | 1900629404 | 50,000.00 | Revenue | Zurich Insurance Co |
| Balance Sheet | 900551 | Insurance Claims Payment (Ext) | 47731 | 01.12.2010 | 1900629402 | 50,000.00 | Revenue | Zurich Insurance Co |
| Balance Sheet | 933032 | Other income | 82700 | 01.12.2010 | 1900632614 | -3,072.58 | Revenue | Unison Collection Account |
| Balance Sheet | 934002 | Transfer Values paid to other schemes | 11650 | 02.12.2010 | 1900633491 | 4,053.21 | Revenue | NHS Pensions Scheme Account |
| Balance Sheet | 900601 | Insurance Claims Payment (Ext) | 47731 | 06.12.2010 | 1900634912 | 1,130.54 | Revenue | Shires (Gloucester) Ltd |
| Balance Sheet | 900652 | Insurance Claims Payment (Int) | 47732 | 06.12.2010 | 1900634911 | 1,709.09 | Revenue | Bluecoat C Of E Primary School |
| Balance Sheet | 900652 | Insurance Claims Payment (Int) | 47732 | 10.12.2010 | 1900637635 | 1,122.00 | Revenue | Christ College Cheltenham |

It comes from data generated for one purpose (the council’s internal financial management), now being made available for another purpose (external accountability), but that might also be useful for a range of further purposes (companies looking to understand business opportunities; other councils looking to benchmark their spending, and so on). Stripped of its context as part of internal financial systems, the column headings make less sense: what is BVA Cop? Is the date the date of invoice, or of payment? What does each ServDiv Code relate to? The first role of any standardization is often to document what the data means: and in doing so, to surface unstated assumptions.

But standardization also plays a role in allowing the emerging use cases for a dataset to be realized. For example, when data columns are aligned, comparison across councils’ spending becomes possible. Private firms interested in providing such comparison services also have a strong interest in seeing each authority publish to a common standard, to lower the cost of integrating data from each new source.
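
As a rough illustration of what that alignment involves in practice, the sketch below (column names invented for the example) maps each council’s local headings onto a shared set of field names so that spending rows can be pooled and compared.

```python
import csv

# Hypothetical column mappings: each council exports spend data with its own
# headings; a shared standard lets us align them onto common field names.
COLUMN_MAPS = {
    "council_a": {"Supplier": "supplier", "Amount": "amount", "Date": "date"},
    "council_b": {"Payee Name": "supplier", "Net Amount": "amount", "Payment Date": "date"},
}

def normalise(path, council):
    """Read one council's CSV and yield rows keyed by the common field names."""
    mapping = COLUMN_MAPS[council]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {common: row[local] for local, common in mapping.items()}

# Once rows share field names, spending can be pooled and compared across councils:
# combined = list(normalise("council_a.csv", "council_a")) + \
#            list(normalise("council_b.csv", "council_b"))
```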

If standards are just developed as the means of exchanging data between government and private sector re-users of the data, the opportunities for constructing a participatory data infrastructure are slim. But when standards are explored as part of the transparency agenda, and as part of defining both the what and the how of public disclosure, such opportunities are much richer.

When budget and spend open data became available in Sao Paulo in Brazil, a research group at the University of Sao Paulo, led by Gisele Craveiro, explored how to make this data more accessible to citizens at a local level. They found that by geocoding expenditure, and color coding based on planned, committed and settled funds, they could turn the data from impenetrable tables into information that citizens could engage with. More importantly, they argue that in engaging with government around the value of geocoded data, “moving towards open data can lead to changes in these underlying and hidden process [of government data creation], leading to shifts in the way government handles its own data” [22].

The important act here was to recognize open data-enabled transparency not just as one-way communication from government to citizens, but as an invitation to a dialogue about the operation of the public data infrastructure, and an opportunity to get involved. If government took more care to geocode transactions in its own systems, it would not have to wait for citizens to expend substantial labour manually geocoding a small slice of spending: better geographic analysis of spending would become readily available both inside and outside the state.
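
A minimal sketch, with invented districts and figures, of the kind of analysis that becomes trivial once transactions carry a geocode and a budget stage in the source systems:

```python
from collections import defaultdict

# Hypothetical records: if each transaction carried a geocode (here a district
# name) and a budget stage, summarising spending by place is a one-liner exercise.
transactions = [
    {"district": "Centro", "stage": "settled",   "amount": 120000.0},
    {"district": "Centro", "stage": "committed", "amount": 45000.0},
    {"district": "Itaquera", "stage": "settled", "amount": 80000.0},
]

totals = defaultdict(float)
for t in transactions:
    totals[(t["district"], t["stage"])] += t["amount"]

for (district, stage), amount in sorted(totals.items()):
    print(f"{district:10} {stage:10} {amount:12,.2f}")
```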

I want to give three brief examples of where the development, or not, of standards is playing a role in creating more participatory data infrastructures, and in the process to draw out a couple of other important aspects of thinking about transparency and standardization as part of the strategic toolkit for asserting citizen rights in the context of smart cities.

Part 4: Examples

Contracts

My first example looks at contracts for two reasons. Firstly, it’s an area I’ve been working on in depth over the last few years, as part of the team creating and maintaining the Open Contracting Data Standard. But, more importantly, it’s an under-explored aspect of the smart city itself. For most cities, how transparent is the web of contracts that establishes the interaction between public and private players? Can you easily find the tenders and awards for each component of the new city infrastructure? Can you see the terms of the contracts and easily read up on who owns and controls each aspect of emerging public data infrastructure? All too often the answer to these questions is no. Yet, when it comes to procurement, the idea of transparency in contracting is generally well established, and global guidance on Public Private Partnerships highlights transparency of both process and contract documents as an essential component of good governance.

The Open Contracting Data Standard emerged in 2014 as a technical specification to give form to a set of principles on contracting disclosure. It was developed through a year-long process of research, going back and forth between a focus on ‘data supply’ and understanding the data that government systems are able to produce on their contracting, and ‘data demand’, identifying a wide range of user groups for this data, and seeking to align the content and structure of the standard with their needs. This resulted in a standard that provides a framework for publication of detailed information at each stage of a contracting process, from planning, through tender, award and signed contract, right through to final spending and delivery.
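
For a sense of the shape of the data, here is a deliberately simplified sketch of a single OCDS-style release expressed as a Python structure; the real standard defines many more fields, and the values below are illustrative rather than normative.

```python
# A heavily simplified sketch of the shape of an Open Contracting Data Standard
# release (real releases carry many more fields; see the published schema).
release = {
    "ocid": "ocds-abc123-000-00001",     # hypothetical contracting process identifier
    "date": "2017-06-01T00:00:00Z",
    "tag": ["tender"],                   # which stage of the process this release describes
    "buyer": {"name": "Example City Council"},
    "tender": {
        "title": "Smart street-lighting upgrade",
        "value": {"amount": 250000, "currency": "GBP"},
        "tenderPeriod": {"endDate": "2017-07-15T00:00:00Z"},
    },
    "awards": [],     # populated as the process moves from tender to award...
    "contracts": [],  # ...and on to signed contracts and implementation
}
```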

Meeting this standard in full is quite demanding for authorities. Many lack existing data infrastructures that provide common identifiers across the whole contracting process, and so adopting OCDS for data disclosure may involve updates to internal systems and processes. The transparency standard has an inwards effect, shaping not only the data published, but the data managed. In supporting implementation of OCDS, we’ve also found that the process of working through the structured publication of data often reveals as yet unrecognized data quality issues in internal systems, and issues of compliance with existing procurement policies.

Now, two of the critiques that might be offered of standards are that, as highly technical objects, their development is only open to participation from a limited set of people, and that, in setting out a uniform approach to data publication, they are a further tool of centralization. Both are serious issues.

In the Open Contracting Data Standard we’ve sought to navigate them by working hard on an open governance process for the standard itself, and by using a range of strategies to engage people in shaping the standard, including workshops, webinars, peer-review processes and presenting the standard in a range of more accessible formats. We’re also developing an implementation and extensions model that encourages local debate over exactly which elements of the overall framework should be prioritized for publication, whilst highlighting the fields of data that are needed in order to realize particular use-cases.

This highlights an important point: standards like OCDS are more than the technical spec. There is a whole process of support, community building, data quality assurance and feedback going on to encourage data interoperability, and to support localization of the standard to meet particular needs.

When standards create the space, other aspects of a participatory data infrastructure are also enabled. A reliable flow of data on pipeline contracts may allow citizens to scrutinize the potential terms of tenders for smart city infrastructure before contracts are awarded and signed, and an infrastructure with the right feedback mechanisms could ensure, for example, that performance-based payments to providers are properly influenced by independent citizen input.

The thesis here is one of breadth and depth. A participatively developed open standard allows a relatively small investment to shape a broad section of public data infrastructure, influencing the internal practice of government and establishing the conditions for more ad-hoc, deep-dive interventions that allow citizens to use that data to pursue particular projects of change.

Earth

The second example explores this in the context of land. Who owns the smart city?

The Open Data Index and Open Data Barometer studies of global open data availability have had a ‘Land Ownership’ category for a number of years, and there is a general principle that land ownership information should, to some extent, be public. However, exactly what should be published is a tricky question. An over-simplified schema might ignore the complex realities of land rights, trying to reduce a set of overlapping claims to a plot number and owner. By contrast, the narrative accounts of ownership that often exist in the documentary record may be too complex to render as data [24]. In working on a refined Open Data Index category, the Cadasta Foundation 23 noted that opening up property owners’ names in the context of a stable country with functioning rule of law “has very different risks and implications than in a country with less formal documentation, or where dispossession, kidnapping, and or death are real and pervasive issues” 23.

The point here is that a participatory process around the standards for transparency may not, from the citizen perspective, always drive at more disclosure, but that at times, standards may also need to protect the ‘strategic invisibility’ of marginalized groups [25]. In the United Kingdom, although individual titles can be bought for £3 from the Land Registry, no public dataset of title-holders is available. However, there are moves in place to establish a public dataset of land owned by private firms, or foreign owners, coming in part out of an anti-corruption agenda. This fits with the idea that, as Sunil Abraham puts it, “privacy should be inversely proportional to power” 26.

Central property registers are not the only source of data relevant to the smart city. Public authorities often have their own data on public assets. A public conversation on the standards needed to describe this land, and share information about it, is arguably overdue. Again looking at the UK experience, the government recently consulted on requiring authorities to record all information on their land assets through the Property Information Management system (ePIMS): centralizing information on public property assets, but doing so against a reductive schema that serves central government interests. In the consultation on this I argued that, by contrast, we need an approach based on a common standard for describing public land, but one that allows local areas the freedom to augment a core schema with other information relevant to local policy debates.

Air

From the earth, let us turn very briefly to the air. Air pollution is a massive issue, causing millions of premature deaths worldwide every year. It is an issue that is particularly acute in urban areas. Yet, as the Open Data Institute note, “we are still struggling to ‘see’ air pollution in our everyday lives” 27. They report the case of decision making on a new runway at Heathrow Airport, where policy makers were presented with data from just 14 NO2 sensors. By contrast, a network of citizen sensors provided much more granular information, and information from citizens’ gardens and households, offering a different account from those official sensors by roads or in fields.

Mapping the data from official government air quality sensors reveals just how limited their coverage is: and backs up the ODI’s calls for a collaborative, or participatory, data infrastructure. In a 2016 blog post, Jamie Fawcett describes how:

“Our current data infrastructure for air quality is fragmented. Projects each have their own goals and ambitions. Their sensor networks and data feeds often sit in silos, separated by technical choices, organizational ambition and disputes over data quality and sensor placement. The concerns might be valid, but they stand in the way of their common purpose, their common goals.”

He concludes “We need to commit to providing real-time open data using open standards.”

This is a call for transparency by both public and private actors: agreeing to allow re-use of their data, and rendering it comparable through common standards. The design of such standards will need to carefully balance public and private interests, and to work out how the costs of making data comparable will fall between data publishers and users.
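
As a purely illustrative sketch, not drawn from any published specification, a minimal comparable observation record under such a shared standard might carry fields along these lines, making the provenance and calibration of each reading explicit rather than hiding them:

```python
# Illustrative only: the kind of minimal, comparable record a shared air quality
# standard might require of both official and citizen sensors. Field names are
# my own sketch, not taken from any existing specification.
observation = {
    "sensor_id": "citizen-0472",
    "operator": "community",            # or "public-authority", "private"
    "location": {"lat": 51.4700, "lon": -0.4543},
    "timestamp": "2017-08-17T09:00:00Z",
    "pollutant": "NO2",
    "value": 48.2,
    "unit": "ug/m3",
    "calibration": "uncalibrated",      # surfacing, rather than hiding, quality disputes
}
```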

Part 5: Recap

So, to briefly recap:

  • I want to draw attention to the data infrastructures of the smart city and the modern state;
  • I’ve suggested that open data and transparency can be powerful tools in performing the kind of infrastructural inversion that brings the context and history of datasets into view and opens them up to scrutiny;
  • I’ve furthermore argued that transparency policy opens up an opportunity for a two-way dialogue about public data infrastructures, and for citizen participation not only in the use and production of data, but also in setting standards for data disclosure;
  • I’ve then highlighted how standards for disclosure don’t just shape the data that enters the public domain, but also have an upstream impact on the shape of the public data infrastructure itself.

Taken together, this is a call for more focus on the structure and standardization of data, and more work on exploring the current potential of standardization as a site of participation, and an enabler of citizen participation in future.

If you are looking for a more practical set of takeaways that flow from all this, let me offer a set of questions that can be asked of any smart cities project, or indeed, any data-rich process of governance:

  • (1) What information is pro-actively published, or can be demanded, as a result of transparency and right to information policies?
  • (2) What does the structure of the data reveal about the process/project it relates to?
  • (3) What standards might be used to publish this data?
  • (4) Do these standards provide the data that I, or other citizens, need to be empowered in relation to this process/project?
  • (5) Are these open standards? Whose needs were they designed to serve?
  • (6) Can I influence these standards? Can I afford not to?

References

[1]: https://www.google.co.uk/search?q=define%3Ainfrastructure, accessed 17th August 2017

[2]: Star, S., & Ruhleder, K. (1996). Steps Toward an Ecology of Infrastructure: Design and Access for Large Information Spaces. Information Systems Research, 7(1), 111–134.

[3]: Bowker, G. C., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. The MIT Press.

[4]: Goldsmith, S., & Crawford, S. (2014). The Responsive City. Jossey-Bass.

[5]: Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE Publications.

[6]: The Danish Government. (2012). Good Basic Data for Everyone – a Driver for Growth and Efficiency. (October 2012)

[7]: Bartha, G., & Kocsis, S. (2011). Standardization of Geographic Data: The European INSPIRE Directive. European Journal of Geography, 22, 79–89.

[10]: Guldi, J. (2012). Roads to Power: Britain Invents the Infrastructure State.

[11]: Gray, J., & Davies, T. (2015). Fighting Phantom Firms in the UK : From Opening Up Datasets to Reshaping Data Infrastructures?

[12]: Gray, J., & Tommaso Venturini. (2015). Rethinking the Politics of Public Information: From Opening Up Datasets to Recomposing Data Infrastructures?

[13]: Gray, J. (2015). DEMOCRATISING THE DATA REVOLUTION: A Discussion Paper

[14]: Arnstein, S. R. (1969). A Ladder of Citizen Participation. Journal of the American Institute of Planners, 35(4), 216–224.

[16]: Ribes, D., & Baker, K. (2007). Modes of social science engagement in community infrastructure design. Proceedings of the 3rd Communities and Technologies Conference, C and T 2007, 107–130.

[17]: Davies, T. (2010, September 29). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk.

[18]: Davies, T. (2014). Open Data Policies and Practice: An International Comparison.

[19]: Fung, A., Graham, M., & Weil, D. (2007). Full Disclosure: The Perils and Promise of Transparency (1st ed.). Cambridge University Press.

[22]: Craveiro, G. S., Machado, J. A. S., Martano, A. M. R., & Souza, T. J. (2014). Exploring the Impact of Web Publishing Budgetary Information at the Sub-National Level in Brazil.

[24]: Hetherington, K. (2011). Guerrilla auditors: the politics of transparency in neoliberal Paraguay. London: Duke University Press.

[25]: Scott, J. C. (1987). Weapons of the Weak: Everyday Forms of Peasant Resistance.

Open data for tax justice: the real design challenge is social

[Summary: Thinking aloud about a pragmatic / humanist approach to data infrastructure building]

Stephen Abbott Pugh of Open Knowledge International has just blogged about the Open Data for Tax Justice ‘design sprint’ that took place in London on Monday and Tuesday. I took part in the first day and a half of the workshop, and found myself fairly at odds with the approach being taken, which focussed narrowly on the data-pipelines-based creation of a centralised dataset, and which appeared to create barriers rather than bridges between data and domain experts. Rather than rethink the approach, as I would argue is needed, the Open Knowledge write-up appears to show the Open Data for Tax Justice project heading further down this flawed path.

In this post, I’m offering an (I hope) constructive critique of the approach, trying to draw out some more general principles that might inform projects to create more participatory data infrastructures.

The context

As the OKI post relates:

“Country-by-country reporting (CBCR) is a transparency mechanism which requires multinational corporations to publish information about their economic activities in all of the countries where they operate. This includes information on the taxes they pay, the number of people they employ and the profits they report.”

Country by Country reporting has been a major ask of tax justice campaigners since the early 2000s, in order to address tax avoidance by multi-national companies who shift their profits around the world through complex corporate structures and internal transfers. CBCR got a major boost in 2013 with the launch of reporting requirements for EU Banks to publicly disclose Country by Country reports under the CRD IV regulations. In the extractives sector, campaigners have also secured regulations requiring disclosure of tax and licensing payments to government on a project-by-project basis.

Although in the case of UK extractives firms reporting takes place to Companies House as structured data, with an API available to access reports, for EU banks reporting is predominantly in the form of tables at the back of PDF-format company reports.

If campaigners are successful, public reporting will be extended to all EU multinationals, holding out the prospect of up to 6,000 more annual reports providing a breakdown of turnover, profit, tax and employees country-by-country. If the templates for disclosure are based on existing OECD models for private exchange between tax authorities, the data may also include information on the different legal entities that make up a corporate group, important for public understanding of the structure of the corporate world.

Earlier this year, a report from Alex Cobham, Jonathan Gray and Richard Murphy set out a number of use-cases for such data, making the case that “a global public database on the tax contributions and economic activities of multinational companies” would be an asset for a wide range of users, from journalists and civil society to investors.

Sprinting with a data-pipelines hammer

This week’s design sprint focussed particularly on ‘data extraction’, developing a set of data pipeline scripts and processes that involve downloading a report PDF, marking up the tables where Country by Country data is stored, describing what each column contains using YAML, and then committing this to GitHub, where the process can be replicably run using datapipeline commands. With the data extracted, it can then be loaded into an SQL database and explored by writing queries or building simple charts. It’s a technically advanced approach, and great for ensuring replicability of data extraction.

But it’s also an approach that ultimately misses the point entirely, ignoring the social process of data production, creating technical barriers instead of empowering contributors and users, and offering nothing for campaigners who want to ensure that better data is produced ‘at source’ by companies.

Whilst the OKI blog post reports that “The Open Data for Tax Justice network team are now exploring opportunities for collaborations to collect and process all available CRD IV data via the pipeline and tools developed during our sprint”, I want to argue for a refocussed approach, based around a much closer look at the social dynamics of data creation and use.

An alternative approach: crafting collaborations

I’ve tried below to unpack a number of principles that might guide that alternative approach:

Principle 1: Letting people use their own tools

Any approach that involves downloading, installing, signing-up to, configuring or learning new software in order to create or use data is likely to exclude a large community of potential users. If the data you are dealing with is tabular: focus on spreadsheets.

More technical users can transform data into database formats when the questions they want to answer require the additional power that brings, but it is better if the starting workflow is configured to be accessible to the largest number of likely users.

Back in October I put together a rough prototype of a Google Spreadsheets-based transcription tool for Country by Country reports, which needed just copy-and-paste of data, and a few selections from validated drop-down lists, to go from PDFs to normalised data – allowing a large user community to engage directly with the data, with almost zero learning curve.

The only tool this approach needs to introduce is something like Tabula or PDFTables to convert from PDF to Excel or CSV: but in this workflow the data comes right back to the user after conversion, rather than being taken away from them into a longer processing pipeline. Plus, it introduces a PDF data-extraction technique the user can adopt for other projects in future, and lets them work around failed conversions with a manual transcription approach if they need to.

(Sidenote: from discussions, I understand that one of the reasons the OKI team made their technical choice was that they envisage the primary users as ‘non-experts’ who would engage in crowdsourcing transcriptions of PDF reports. I think this is both highly optimistic and rests on a flawed analysis: the crowdsourcing task is relatively small, on the order of a few thousand reports a year, and there are real benefits to involving a more engaged group of contributors in creating a civil society database.)

Principle 2: Aim for instant empowerment

One of the striking things about Country by Country reporting data is how simple it ultimately is. The CRD IV disclosures contain just a handful of measures (turnover, pre-tax profits, tax paid, number of employees), a few dimensions (company name, country, year), and a range of annotations in footnotes or explanations. The analysis that can be done with this data is similarly simple – yet also very powerful. Being able to go from a PDF table of data to a quick view of the ratios between turnover and tax, or profit and employees, for a country can quickly highlight areas to investigate for profit-shifting and tax-avoidance behaviour.

Calculating these ratios is possible almost as soon as you have the data in spreadsheet form. In fact, a well set up template could calculate them directly, or a user with basic ability to write formulae could fill in the columns they need.
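
A minimal sketch with invented figures (in millions), showing the kind of ratio analysis that becomes possible the moment a country-by-country table is in structured form:

```python
# Invented figures, in millions, purely to illustrate the ratio analysis that a
# structured CRD IV country-by-country table supports.
rows = [
    {"company": "ExampleBank", "country": "United Kingdom", "year": 2015,
     "turnover": 5200.0, "pre_tax_profit": 900.0, "tax_paid": 180.0, "employees": 8200},
    {"company": "ExampleBank", "country": "Luxembourg", "year": 2015,
     "turnover": 310.0, "pre_tax_profit": 260.0, "tax_paid": 4.0, "employees": 35},
]

for r in rows:
    tax_rate = r["tax_paid"] / r["pre_tax_profit"]           # effective tax rate
    profit_per_employee = r["pre_tax_profit"] / r["employees"]
    print(f'{r["country"]:15} effective tax rate {tax_rate:.1%}, '
          f'profit per employee {profit_per_employee:.3f}m')
```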

Many of the use-cases for Country by Country reports are based not on aggregation across hundreds of firms, but on simply understanding the behaviour of one or two firms. Investigators and researchers often have firms they are particularly interested in, and where the combination of simple data, and their contextual knowledge, can go a long way.

Principle 3: Don’t drop context

On the topic of context: all those footnotes and explanations in company reports are an important part of the data. They might not be computable, or easy to query against, but in the data explorations that took place on Monday and Tuesday I was struck by how much the tax justice experts were relying not only on the numerical figures to find stories, but also on the explanations and other annotations from reports.

The data pipelines approach dropped these annotations (and indeed dropped anything that didn’t fit into its schema). An alternative approach would work from the principle that, as far as possible, nothing of the source should be thrown away – and that structure should be layered on top of the messy reality of accounting judgements and decisions.
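
One way to honour that principle, sketched below with illustrative field names, is to keep the reported figure, its footnotes and any transcriber’s notes together in a single record, layering the normalised value on top rather than replacing the source:

```python
# A sketch of the principle: keep the reported figure, the footnotes and the
# transcriber's notes together, layering structure on top of the source rather
# than discarding whatever does not fit a narrow schema. Field names and values
# are illustrative.
cell = {
    "reported_value": "180.0",
    "normalised_value": 180.0,
    "footnotes": ["Includes prior-year adjustment; see note 14 of the annual report."],
    "transcriber_note": "Figure restated from thousands to millions.",
    "source": {"report": "example-bank-annual-report-2015.pdf", "page": 212},
}
```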

Principle 4: Data making is meaning-making

A lot of the analysis of Country by Country reporting data is about looking for outliers. But data outliers and data errors can look pretty similar. Instead of trying to separate the processes of data preparation and analysis, the two need to be brought closer together.

Creating a shared database of tax disclosures will involve not only processes of data extraction, but also processes of validation and quality control. It will require incentives for contributors, and will require attention to building a community of users.

Some of the current structured data available from Country by Country reports has been transcribed by University students as part of their classes – where data was created as a starting point for a close feedback loop of data analysis. The idea of ‘frictionless data’ makes sense when it comes to getting a list of currency codes, but when it comes to understanding accounts, some ‘friction’ of social process can go a long way to getting reliable data, and building a community of practice who understand the data in more depth.

Principle 5: Standards support distributed collaboration

One of the difficulties in using the data mentioned above, prepared by a group of students, was that it had been transcribed and structured to solve the particular analytical problem of the class, and not against any shared standard for identifying countries, companies or the measures being transcribed.

The absence of agreement on key issues such as codelists for tax jurisdictions, company identifiers, codes and definitions of measures, and how to handle annotations and missing data means that the data that is generated by different researchers, or even different regulatory regimes, is not comparable, and can’t be easily combined.

The data pipelines approach is based on rendering data comparable through a centralised infrastructure. In my experience, such approaches are brittle, particularly in the context of voluntary collaboration, and they tend to create bottlenecks for data sharing and innovation. By contrast, an approach based on building light-weight standards can support a much more distributed collaboration approach – in which different groups can focus first on the data that is of most interest to them (for example, national journalists focussing on the tax record of the top-10 companies in their jurisdiction), easily contributing data to a common pool later when their incentives are aligned.
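
To make ‘light-weight standard’ a little more concrete, the sketch below (details illustrative, not an agreed specification) shows the kind of shared codelists and field definitions that separate groups could apply to their own transcriptions before pooling data:

```python
# A sketch of what a light-weight shared standard might amount to in practice:
# agreed codelists and field definitions that separate groups can apply to
# their own transcriptions before pooling data. Details are illustrative.
STANDARD = {
    "jurisdiction": {"codelist": "ISO 3166-1 alpha-2", "example": "LU"},
    "company_id": {"codelist": "organisation identifier scheme + id", "example": "GB-COH-00000006"},
    "turnover": {"unit": "reporting currency, millions", "missing": "null, with reason code"},
    "tax_paid": {"unit": "reporting currency, millions", "missing": "null, with reason code"},
}
```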

Campaigners also need to be armed with use-case-backed proposals for how disclosures should be structured, in order to push for the best quality disclosure regimes.

What’s the difference?

Depending on your viewpoint, the approach I’ve started to set out above might look more technically ‘messy’ – but I would argue it is more in-tune with the social realities of building a collaborative dataset of company tax disclosures.

Fundamentally (with the exception perhaps of standard maintenance, although that should be managed as a multi-stakeholder project long-term) – it is much more decentralised. This is in line with the approach in the Open Contracting Data Standard, where the Open Contracting Partnership have stuck well to their field-building aspirations, and where many of the most interesting data projects emerge organically at the edge of the network, only later feeding into cross-collaboration.

Even then, this sketch of an alternative technical approach is only part of the story in building a better data foundation for action to address corporate tax avoidance. There will still be a lot of labour needed to create incentives, encourage co-operation, manage data quality, and build capacity to work with data. But better that we engage with that labour than spend our efforts chasing frictionless dreams of easily created perfect datasets.