timdavies – Tim's Blog

Beyond October

In a few weeks’ time (October 12th) I’m going to be leaving Open Data Services Co-op and starting a short career-break of sorts: returning to my research roots, spending some time exploring possible areas of future focus, and generally taking a bit of time out.

I’ll be leaving projects in capable hands, with colleagues at Open Data Services continuing to work on Open Contracting, Beneficial Ownership, 360 Giving, Org-id.guide, IATI and Social Economy data standards projects. One of the great advantages of the worker co-operative model we’ve been developing over the last three and a half years is that, instead of now needing to seek new leaders for the technical work on these projects, we’ve been developing shared leadership of these projects from day one.

I first got involved in the development of open data standards out of research interest: curious about how these elements of data infrastructure were created and maintained, and about the politics embedded within, or expressed through, them. Over the last five years my work has increasingly focussed on supporting open data standard adoption, generating tons of learning – but with little time to process it or write it up. So – at least for a while – I’ll be stepping back from day-to-day work on specific standards and data infrastructure, and hopefully next year will find ways to distill the last few years’ learning into some useful form.

Between now and the end of 2018, I’ll be working on editing the State of Open Data collection of essays for the OD4D network. Then in early 2019, I’m planning for a bit of time off completely, before starting to explore new projects from April onwards.

I’m immensely proud of what we’ve done with Open Data Services Co-op over the last 3.5 years, and grateful to colleagues for co-creating something that both supports world-changing data projects and supports team members in their own journeys. If you ever need support with an open data project, do not hesitate to drop them a line.

Javelin Park Episode 5: Return of the ICO

[Summary: The Information Commissioner’s Office has upheld an appeal against continued redaction of key financial information about the Javelin Park Incinerator Public Private Partnership (PPP) project in Gloucestershire]

The Story So Far

I’ve written before about controversy over the contract for Javelin Park, a waste incinerator project worth at least £0.5bn and being constructed just outside Stroud as part of a 25-year Public Private Partnership deal. There’s a short history at the bottom of this article, which breaks off in 2015 when the Information Commissioner’s Office last ruled against Gloucestershire County Council (GCC) and told them to release an unredacted copy of the PPP contract. GCC appealed that decision, but were finally told by the Information Tribunal in 2017 to publish the contract: which they did. Sort of. Because in the papers released, we found out about a 2015 renegotiation that had taken place, meaning that we still don’t know how much local taxpayers are on the hook for, nor how the charging model affects potential recycling rates, or incentives to burn plastics.

In June last year, through FOI, I got a heavily redacted copy of a report considering the value for money of this renegotiated contract, but with all the key figures blacked out. This week the Information Commissioner upheld my appeal against the redactions, ruling that GCC have 35 days to provide unredacted information. They may still make their own appeal against this, but the ICO decision makes very clear that the reasoning from the 2017 Information Tribunal ruling holds firm when it comes to the public interest in knowing salient details of original and renegotiated contracts.

The Story Right Now

For the last two weeks, Gloucestershire resident Sid Saunders has been on hunger strike outside the county’s Shire Hall to call for the release of the full revised contract between Gloucestershire County Council and Urbaser Balfour Beatty. This is, to my knowledge, unprecedented. It demonstrates the strength of feeling over the project, and the crucial importance of transparency around contracts in securing public accountability.

GCC are already weeks overdue responding to the most recent FOI/EIR request for the latest contract text, and continue to stonewall requests for even basic details, repeating discredited soundbites about potential savings that rely on outdated assumptions about comparisons and high waste flows.

On Wednesday, Sid and other local activists staged a dignified silent protest at the meeting of GCC Cabinet, where public and councillor questions on an air quality agenda item had unconstitutionally been excluded.

Tomorrow we’ll be heading to Gloucester in support of Sid’s continued campaign for information, and for action to bring accountability to this mega-project.

It’s against this backdrop that I wanted to draw out some of the key elements of the ICO’s decision notice, and observations on GCC responses to FOI and EIR requests.

Unpacking the decision notice

The decision notice has not yet been published on the ICO website, but I’ve posted a copy here and will update the link once the ICO version is online.

The delays can’t stay

It is notable that every request for information relating to Javelin Park has been met with very delayed replies, exceeding the statutory limits set down in the Freedom of Information Act (FOIA), and the stricter Environmental Information Regulations (EIR).

The decision notice states that the “council failed to comply with the requirements of Regulation 5(2) and Regulation 14(2)” which set strict time limits on the provision of information, and the grounds for which an authority can take extra time to respond.

Yet in the latest requests, GCC suggest that they will need until the end of June (which falls, curiously, just days after the next full meeting of the County Council) to work out what they can release. I suspect consistent breaches of the regulations on timeliness are not likely to be looked on favourably by the ICO in any future appeals.

The information tribunal principles stand

The Commissioner’s decision notice draws heavily on the earlier Information Tribunal ruling that noted that, whilst there are commercial interests of the Authority and UBB at play, there are significant public interests in transparency, and:

“In the end it is the electorate which must hold the Council as a whole to account and the electorate are more able to do that properly if relevant information is available to all”

The decision notice makes clear that the reasoning applies to revisions to the contract:

Even with the disclosures ordered by the Tribunal from the contract the Commissioner considers that it is impossible for the public to be fully aware of the overall value for money of the project in the long term if it is unable to analyse the full figures regarding costs and price estimates which the council was working from at the time of the revised project plan.

going on to say:

The report therefore provides more current, relevant figures which the council used to evaluate and inform its decisions regarding the contract and it will presumably be used as a basis for its future negotiations over pricing and costs. Currently these figures are not publicly available, and therefore the public as a whole cannot create an overall picture as to whether the EfW development provides value for money under the revised agreement.

As the World Bank PPP Disclosure Framework makes clear, amendments and revisions to a contract are as important as the contract itself, and should be proactively published. Not laboriously dragged out of an authority through repeated trips to information tribunals.

Prices come from markets, not from secrets

A consistent theme in GCC’s case for keeping heavy redactions in the contract is that disclosure of information might affect the price they get for selling electricity generated at the plant. However, the decision notice puts the point succinctly:

Whilst she [the Commissioner] also accepts that if these figures are published third parties might take account of them during negotiations, the main issue will be the market value of electricity at the time that negotiations are taking place.

As I recall from first year economics lectures (or perhaps even GCSE business studies…): markets function better with more perfect information. The energy market is competitive, and there is no reason to think that selective secrecy will distort the market or secure the authority a better deal.

(It is worth noting that the same reasoning, hiding information to ‘get a better deal’ seems to be driving the non-disclosure of details of the £53m of land the authority plan to dispose of – again raising major questions about exactly whose interests are being served by a culture of secrecy?).

Not everything is open

The ICO decision notice is nuanced. It does find some areas where, with the commercial interest of the private party invoked, public interest is not strong enough to lead to disclosure. The Commissioner states:

These include issues such as interest and debt rates and operating costs of UBB which do not directly affect the overall value for money to the public, but which are commercially sensitive to UBB.

This makes some sense. As this decision notice relates to a consultant’s report on Value for Money, rather than the contract with the public authority, it is possible for there to be figures that do not warrant wider disclosure. However, following the precedent set by the Information Tribunal, the same reasoning would only apply to parts of a contract if they had been agreed in advance to be commercially confidential. As Judge Shanks found, only a limited part of the agreement between UBB and GCC was covered by such terms. Any redactions GCC now want to apply to a revised agreement should start only from consulting contract Schedule 23 on agreed commercial confidential information.

Where next?

GCC have either 28 days to appeal the decision notice, or 35 days to provide the requested information. The document in question is only a 29-page report, with a small number of redactions to remove, so it certainly should not take that long.

Last time GCC appealed to a Tribunal in the case of the 2013 Javelin Park Contract they spent upwards of £400,000 of taxpayers’ money on lawyers*, only to be told to release the majority of the text. Given the ICO Decision Notice makes clear it is relying on the reasoning of the Tribunal, a new appeal to the tribunal would seem unlikely to succeed.

However, we do now have to wait and see what GCC do, and whether we’ll get to know what the renegotiated contract prices were in 2015. Of course, this doesn’t tell us whether or not there has been further renegotiation, and for that we have to continue to push for proactive transparency and a clear open contracting policy at GCC that will make transparency the norm, rather than something committed local citizens have to fight for through self-sacrificing direct action.

*Based on public spending data payments from Residential Waste Project to Eversheds.

Notes from a RightsCon panel on AI, Open Data and Privacy

[Summary: Preliminary notes on open data, privacy and AI]

At the heart of open data is the idea that when information is provided in a structured form, freely accessible, and with permission granted for anyone to re-use it, latent social and economic value within it can be unlocked.

Privacy positions assert the right of individuals to control their information and data, and data about them, and to have protection from harms that might occur through exploitation of their data.

Artificial intelligence is a field of computing concerned with equipping machines with the ability to perform tasks that may previously have required human intelligence, including recognising patterns, making judgements, and extracting and analysing semi-structured information.

Around each of these concepts vibrant (and broad based) communities exist: advocating respectively for policy to focus on openness, privacy and the transformative use of AI. At first glance, there seem to be some tensions here: openness may be cast as the opposite of privacy; or the control sought in privacy as starving artificial intelligence models of the data they could use for social good. The possibility within AI of extracting signals from messy records might appear to negate the need to construct structured public data, and as data-hungry AI draws increasingly upon proprietary data sources, the openness of data on which decisions are made may be undermined. At some points these tensions are real. But if we dig beneath surface level oppositions, we may find arguments that unite progressive segments of each distinct community – and that can add up to a more coherent contemporary narrative around data in society.

This was the focus of a panel I took part in at RightsCon in Toronto last week, curated by Laura Bacon of Omidyar Network, and discussing with Carlos Affonso Souza (ITS Rio) and Frederike Kaltheuner (Privacy International) – and the first in a series of panels due to take place over this year at a number of events. In this post I’ll reflect on five themes that emerged both from our panel discussion, and more widely from discussions I had at RightsCon. These remarks are early fragments, rather than complete notes, and I’m hoping that a number may be unpacked further in the upcoming panels.

The historic connection of open data and AI

The current ‘age of artificial intelligence’ is only the latest in a series of waves of attention the concept has had over the years. In this wave, the emphasis is firmly upon the analysis of large collections of data, predominantly proprietary data flows. But it is notable that a key thread in advocacy for open government data in the late 2000s came from Artificial Intelligence and semantic web researchers such as Prof. Nigel Shadbolt, whose Advanced Knowledge Technologies (AKT) programme was involved in many early re-use projects with UK public data, and Prof. Jim Hendler at TWC. Whilst I’m not aware of any empirical work that explores the extent to which open government data has gone on to feed into machine-learning models, in terms of bootstrapping data-hungry research, there is a connection here to be explored.

There is also an argument to be made that open data advocacy, implementation and experiences over the last ten years have played an important role in contributing to growing public understandings of data, and in embedding cultural norms around seeking access to the raw data underlying decisions. Without the last decade of action on open data, we might be encountering public sector AI based purely on proprietary models, as opposed to now navigating a mixed ecology of public and private AI.

(Some) open data is getting personal

It’s not uncommon to hear open data advocates state that open data only covers ‘non-personal data’. It’s certainly true that many of the datasets sought through open data policy, such as bus timetables, school rankings, national maps, weather reports and farming statistics don’t contain any personally identifying information (PII). Yet, whilst we should be able to mark a sizable territory of the open data landscape as free from privacy concerns, there are increasingly blurred lines at points where ‘public data’ is also ‘personal data’.

In some cases, this may be due to mosaic effects: where the combination of multiple open datasets could be personally identifying. In other cases, the power of AI to extract structured data from public records about people raises interesting questions about how far permissive regimes of access and re-use around those documents should also apply to datasets derived from them. However, there are also cases where open data strategies are being applied to the creation of new datasets that directly contain personally identifying information.

In the RightsCon panel I gave the example of Beneficial Ownership data: information about the ultimate owners of companies that can be used to detect illicit use of shell companies for money laundering or tax evasion, or that can support better due diligence on supply chains. Transparency campaigners have called for beneficial ownership registers to be public and available as open data, citing the risk that restricted registers will be underused and will be much less effective than open registers, and drawing on the idea of a social contract that means the limited liability conferred by a company comes with the responsibility to be identified as party to that company. We end up then with data that is both public (part of the public record) and personal (containing information about identified individuals).

Privacy is not secrecy: but consent remains key

Frederike Kaltheuner kicked off our discussions of privacy on the panel by reminding us that privacy and secrecy are not the same thing. Rather, privacy is related to control: and the ability of individuals and communities to exercise rights over the presentation and use of their data. The beneficial ownership example highlights that not all personal data can or should be kept secret, as taking an ownership role in a company comes with a consequent publicity requirement. However, as Ann Cavoukian forcefully put the point in our discussions, the principle of consent remains vitally important. Individuals need to be informed enough about when and how their personal information may be shared in order to make an informed choice about entering into any relationship which requests or requires information disclosure.

When we reject a framing of privacy as secrecy, and engage with ideas of active consent, we can see, as the GDPR does, that privacy is not a binary choice, but instead involves a set of choices in granting permissions for data use and re-use. Where, as in the case of company ownership, the choice is effectively between being named in the public record vs. not taking on company ownership, it is important for us to think more widely about the factors that might make that choice trickier for some individuals or groups. For example, as Kendra Albert explained to me, for trans people a business process that requires current and former names to be on the public record may have substantial social consequences. This highlights the need for careful thinking about data infrastructures that involve personal data, such that they can best balance social benefits and individual rights, giving a key place to mechanisms of active consent: and avoiding the creation of circumstances in which individuals may find themselves choosing uncomfortably between ‘the lesser of two harms’.

Is all data relational?

One of the most challenging aspects of the recent Cambridge Analytica scandal is the fact that even if individuals did not consent at any point to the use of their data by Facebook apps, there is a chance they were profiled as a result of data shared by people in their wider network. Whereas it might be relatively easy to identify the subject of a photo, and to give that individual rights of control over the use and distribution of their image, an individual ownership and rights framework can be difficult to apply to many modern datasets. Much of the data of value to AI analysis, for example, concerns the relationship between individuals, or between individuals and the state. When there are multiple parties to a dataset, each with legitimate interests in the collection and use of the data, who holds the rights to govern its re-use?

Strategies of regulation

What unites the progressive parts of the open data, privacy and AI communities? I’d argue that each has a clear recognition of the power of data, and a concern with minimising harm (albeit with a primary focus on individual harm in privacy contexts, and with the emphasis placed on wider social harms from corruption or poor service delivery by open data communities)*. As Martin Tisné has suggested, in a context where harmful abuses of data power are all around us, this common ground is worth building on. But in charting a way forward, we need to more fully unpack where there are differences of emphasis, and different preferences for regulatory strategies – produced in part by the different professional backgrounds of those playing leadership roles in each community.

(*I was going to add ‘scepticism about centralised power’ (of companies and states) to the list of common attributes across progressive privacy, open data and AI communities, but I don’t have a strong enough sense of whether this could apply in an AI context.)

In our RightsCon panel I jotted down and shared five distinct strategies that may be invoked:

Reshaping inputs – for example, where an AI system is generating biased outputs, work can take place to make sure the inputs it receives are more representative. This strategy essentially responds to negative outcomes from data by adding more, corrective, data.
Regulating ownership – for example, asserting that individuals have ownership of their data, and can use ownership rights to make claims of control over that data. Ownership plays an important role in open data licensing arrangements, but runs up against the ‘relational data problem’ in many cases, where it’s not clear who has ownership rights.
Regulating access – for example, creating a dataset of company ownership only available to approved actors, or keeping potentially disclosive AI training datasets from being released.
Regulating use – for example, allowing that a beneficial ownership register is public, but ensuring that use of the data to target individuals is strictly prohibited, and that prohibitions are enforced.
Remediating consequences – for example, recognising that harm is caused to some groups by the publicity of certain data, but judging that the net public benefit is such that the data should remain public, but the harm should be redressed by some other aspect of policy.

By digging deeper into questions of motivations, goals and strategies, my sense is we will better be able to find the points where AI, privacy and open data intersect in a joint critical engagement with today’s data environment.

Where next?

I’m looking forward to exploring these themes more, both attending the next panel in this series at the Open Government Partnership meeting in Tbilisi in July, and through the State of Open Data project.

Publishing with purpose? Reflections on designing with standards and locating user engagement

[Summary: Thinking aloud about open data and data standards as governance tools]

There are interesting shifts in the narratives of open data taking place right now.

Earlier this year, the Open Data Charter launched their new strategy: “Publishing with purpose”, situating it as a move on from the ‘raw data now’ days where governments have taken an open data initiative to mean just publishing easy-to-open datasets online, and linking to them from data catalogues.

The Open Contracting Partnership, which has encouraged governments to purposely prioritise publication of procurement data for a number of years now, has increasingly been exploring questions of how to design interventions so that they can most effectively move from publication to use. The idea enters here that we should be spending more time with governments focussing on their use cases for data disclosure.

The shifts are welcome: and move closer to understanding open data as strategy. However, there are also risks at play, and we need to take a critical look at the way these approaches could or should play out.

In this post, I introduce a few initial thoughts, though recognising these are as yet underdeveloped. This post is heavily influenced by a recent conversation convened by Alan Hudson of Global Integrity at the OpenGovHub, where we looked at the interaction of ‘(governance) measurement, data, standards, use and impact’.

(1) Whose purpose?

The call for ‘raw data now’ was not without purpose: but it was about the purpose of particular groups of actors: not least semantic web researchers looking for a large corpus of data to test their methods on. This call configured open data towards the needs and preferences of a particular set of (technical) actors, based on the theory that they would then act as intermediaries, creating a range of products and platforms that would serve the purpose of other groups. That theory hasn’t delivered in practice, with lots of datasets languishing unused, and governments puzzled as to why the promised flowering of re-use has not occurred.

Purpose itself then needs unpacking. Just as early research into the open data agenda questioned how different actors’ interests may have been co-opted or subverted – we need to keep the question of ‘whose purpose’ central to the publish-with-purpose debate.

(2) Designing around users

Sunlight Foundation recently published a write-up of their engagement with Glendale, Arizona on open data for public procurement. They describe a process that started with a purpose (“get better bids on contract opportunities”), and then engaged with vendors to discuss and test out datasets that were useful to them. The resulting recommendations emphasise particular data elements that could be prioritised by the city administration.

Would Glendale have the same list of required fields if they had started asking citizens about better contract delivery? Or if they had worked with government officials to explore the problems they face when identifying how well a vendor will deliver? For example, the Glendale report doesn’t mention including supplier information and identifiers: central to many contract analysis or anti-corruption use cases.

If we see ‘data as infrastructure’, then we need to consider the appropriate design methods for user engagement. My general sense is that we’re currently applying user centred design methods that were developed to deliver consumer products to questions of public infrastructure: and that this has some risks. Infrastructures differ from applications in their iterability, durability, embeddedness and reach. Premature optimisation for particular data users’ needs may make it much harder to reach the needs of other users in future.

I also have the concern (though, I should note, not in any way based on the Glendale case) that user-centred design done badly can be worse than user-centred design done not at all. User engagement and research is a profession with its own deep skill set, just as work on technical architecture is, even if it looks at first glance easier to pick up and replicate. Learning from the successes, and failures, of integrating user-centred design approaches into bureaucratic contexts and government incentive structures needs to be taken seriously. A lot of this is about mapping the moments and mechanisms for user engagement (and remembering that whilst it might help the design process to talk ‘user’ rather than ‘citizen’, sometimes decisions of purpose should be made at the level of the citizenry, not their user stand-ins).

(3) International standards, local adoption

(Open) data standards are a tool for data infrastructure building. They can represent a wide range of user needs to a data publisher, embedding requirements distilled from broad research, and can support interoperability of data between publishers – unlocking cross-cutting use-cases and creating the economic conditions for a marketplace of solutions that build on data. (They can, of course, also do none of these things: acting as interventions to configure data to the needs of a particular small user group).

But in seeking to be generally usable, standards are generally not tailored to particular combinations of local capacity and need. (This pairing is important: if resource and capacity were no object, and each of the requirements of a standard were relevant to at least one user need, then there would be a case to just implement the complete standard. This resource unconstrained world is not one we often find ourselves in.)

How then do we secure the benefits of standards whilst adopting a sequenced publication of data given the resources available in a given context? This isn’t a solved problem: but in the mix are issues of measurement, indicators and incentive structures, as well as designing some degree of implementation levels and flexibility into standards themselves. Validation tools, guidance and templated processes all help too, making sure data can deliver the direct outcomes that might motivate an implementer, whilst not cutting off indirect or alternative outcomes that have wider social value.
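As a rough illustration of what ‘implementation levels’ plus validation might look like in practice, here is a minimal Python sketch. The level names, the field names and the `assess` function are all hypothetical: they are not taken from any of the standards mentioned in this post. The idea is simply that a validator can tell a publisher which level their data currently reaches, and what is missing for the next one.

```python
# Hypothetical levels for an imagined contracting data standard.
# Each level lists the fields that must be filled in, on top of earlier levels.
LEVELS = [
    ("basic", ["id", "title", "buyer_name"]),
    ("intermediate", ["supplier_name", "supplier_id", "value"]),
    ("advanced", ["milestones", "amendments"]),
]


def assess(record):
    """Return (highest level fully met, fields missing for the next level)."""
    achieved = None
    for level, fields in LEVELS:
        # Treat absent, empty-string and empty-list values as missing.
        missing = [f for f in fields if record.get(f) in (None, "", [])]
        if missing:
            return achieved, missing
        achieved = level
    return achieved, []


example = {
    "id": "c-1",
    "title": "Road repairs",
    "buyer_name": "Example Council",
    "supplier_name": "Example Ltd",
    "value": 1000,
}

level, missing = assess(example)
print(level, missing)
```

In this example the record meets the ‘basic’ level, and the report points at a missing supplier identifier as the gap before ‘intermediate’. A sequenced approach like this lets a publisher start small without the standard losing sight of fields that matter for other use cases.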

(I’m aware that I write this from a position of influence over a number of different data standards. So I have to also introspect on whether I’m just optimising for my own interests in placing the focus on standard design. I’m certainly concerned with the need to develop a clearer articulation of the interaction of policy and technical artefacts in this element of standard setting and implementation, in order to invite both more critique, and more creative problem solving, from a wider community. This somewhat densely written blog post clearly does not get there yet.)

Some preliminary conclusions

In thinking about open data as strategy, we can’t set rules for the relative influence that ‘global’ or ‘local’ factors should have in any decision making. However, the following propositions might act as starting point for decision making at different stages of an open data intervention:

Purpose should govern the choice of dataset to focus on
Standards should be the primary guide to the design of the datasets
User engagement should influence engagement activities ‘on top of’ published data to secure prioritised outcomes
New user needs should feed into standard extension and development
User engagement should shape the initiatives built on top of data

Some open questions

Are there existing theoretical frameworks that could help make more sense of this space?
Which metaphors and stories could make this more tangible?
Does it matter?

Shaping open government in the UK: call for steering committee nominations

[Summary: Looking for great candidates to drive progress on Open Government in the UK through the UK Civil Society OGP Steering Committee and Multi-stakeholder Forum. Nomination deadline: 16th April]

Nominations are now open for civil society members of the UK Open Government Partnership (OGP) Multi-stakeholder Forum. It’s a key time for open government in the UK, as we look to maintain momentum and push forward new reforms, within a wider national and global environment where open, participatory and effective governance is increasingly under threat.

If you are, or know someone who is, passionate about open government reforms and with the capacity to drive change, please consider making a nomination. Self-nominations are welcome, and membership of the Open Government Civil Society Network (the only pre-condition for nomination) is open to anyone who supports the principles of the network.

Shaping open government

The UK is currently preparing its fourth Open Government National Action Plan. In previous plans we’ve pursued and made progress on issues like beneficial ownership transparency (in the news this week as campaigners seek more data on offshore ownership of London property in the context of debates on illicit Russian money invested here), open contracting (equally topical as the Carillion crisis, and debates over passport printing unfold), and open policy making.

Yesterday, members of the current Civil Society Network Steering Committee and other guests were hosted at the Speaker’s House in Parliament to hear an update from Dr Ben Worthy, the independent reviewer of UK progress. The event underscored the importance of active civil society engagement to put issues on the open government agenda, and the unique opportunity offered by the OGP process to accelerate reforms and support deep dialogue between government and civil society. Ben also challenged those assembled to think about the ‘signature reforms’, engagement experiments and high profile interventions that the next National Action Plan should support, and to look to engage more with Parliament to secure parliamentary scrutiny of transparency and open government policy.

.@opengovuk #ogpIRM The speaker emphasising importance of OGP work on transparency and accountability. Closing words on the need to focus on impacts of open gov for the poorest and most vulnerable in society: open gov for all. pic.twitter.com/YvKTJiTfoc

— Tim Davies (@timdavies) March 22, 2018

One of the ways we in the UK OGP Civil Society Network have been preparing to meet these challenges is by updating the Terms of Reference for the Civil Society Network Steering Group so that it is ready to act as the civil society half of a standing Multi-stakeholder Forum on Open Government in the UK. This will meet regularly with government, including with Ministers with Open Government responsibility, to secure and monitor open government commitments.

To bring on board a wider set of skills and experience, we’ve also increased the number of places on the Steering Committee, creating five spaces now up for election through an open process that also seeks to secure a good gender balance, and representation of both civil society organisations and independent citizens. I’m personally keen to see us use this opportunity to bring new skills and experience onboard for the Steering Committee and Multi-stakeholder Forum, including people with experience of working on reforms within government (though current government officials working on open gov policy are not eligible to apply), specialists in civic participation, and experts on right to information issues.

Responsibilities of Steering Group members include:

Engaging with the relevant Minister and civil servants with responsibility for the OGP
Participating in the Multistakeholder Forum between government and civil society
Speaking on behalf of the Open Government Network
Supporting and overseeing the work of the Network Coordinator and ensuring the smooth running of the OGN

and to date it’s been a commitment of 3 – 15 hours a month (depending on the stage of the National Action Plan process) with a regular Steering Committee call and periodic meetings (usually in London, though we’ve been trying to move around the country whenever possible) with government officials and other members of the civil society network. The nomination form is here if you are interested – and even if you’re not interested in a role on the Steering Committee right now, do join the network via its open mailing list for other opportunities to get involved.

As a current Steering Committee member, I’d be happy to answer any questions (@timdavies) about the process and the potential here to take forward open government reforms in the UK, and as part of the 70+ country strong global OGP network.

Where next for Open Contracting in the UK?

[Summary: reflections and ideas building on conversations at the OGP National Action Plan workshop in Bristol yesterday with ideas about a fund for scoping studies, strengthening the ICO role around contract disclosure, and better national Management Information (and a continuation of this blog’s ‘Open Contracting’ season: I promise I’ll write about some other things soon!)]

Open Contracting has been a theme in the last two UK Open Government Partnership National Action Plans. In 2013 Commitment 12 said:

The UK government endorses the principles of open contracting. We will build on the existing foundation of transparency in procurement and contracting and, in consultation with civil society organisations and other stakeholders, we will look at ways to enhance the scope, breadth and usability of published contractual data.

In 2016, Open Contracting moved up to slot number 5, with a commitment to:

…implement the Open Contracting Data Standard (OCDS) in the Crown Commercial Service’s operations by October 2016; [and to] begin applying this approach to major infrastructure projects, starting with High Speed Two, and rolling out OCDS across government thereafter.

As we head towards the next National Action Plan in 2018, it’s time to focus on local implementation. Whilst government policies on procurement, and even on asset disposals (e.g. selling off government land), provide clear guidance on transparency and publication of data and documents (including full contract text), local implementation is sorely lacking.

The day after Carillion’s collapse it was only possible to locate fewer than 30 of the 400+ government contracts with Carillion through the national Contracts Finder dataset. And none had the text of contracts attached. Local authorities continue to invoke ‘commercial confidentiality’ as a blanket reason to keep procurement or asset sale information secret, increasing corruption risks, and undermining opportunities to promote value for money, local economic development and strategic procurement across the public sector.

When policy is good, but implementation is poor, what levers are there? At the recent Bristol workshop we explored a range of opportunities. In general, approaches fall into a few different categories:

Improving enforcement. There are few consequences right now for a government agency that is not following procurement guidance. Although local government is prone to resist new or strengthened requirements that come without funding, there may be opportunities to strengthen regulators, or increase the consequences of non-compliance. However, this often needs to rely on:
Better monitoring. It’s only when we can see which authorities are failing in their procurement transparency obligations, or when we can identify leading and lagging agencies when it comes to use of pre-procurement dialogues for public and supplier engagement, that targeted enforcement of key practices becomes possible. Monitoring alone can sometimes create incentives for improved practice.
Making it easier. Confusion over the meaning of commercial confidentiality may be preventing good practice. Guidance from government, or better design of software tools, can help streamline the process of complying. Government may have a role in setting the standards for procurement software, as well as the data standards for publishing procurement transparency information.
Show the benefits. The irony of low compliance with procurement best practices on transparency is, well, that best practice is often better. It brings savings, and better services. A programme to demonstrate this has a lot of value.

So, what could this look like in terms of concrete commitments?

Scoping study support fund. Open Contracting has the potential to be win-win-win: efficiency for government, accountability to citizens, and opportunities for local businesses. But building multi-stakeholder support for new initiatives, and setting priorities for local action, needs an understanding of the current situation. Where are the biggest blocks to opening up information on procurement? Are the challenges policy or process? Where will leadership for change come from? How can different stakeholders work together to generate, share and use data and information – and to design better procurement processes? These are all questions that can be answered through a scoping study.

Development Gateway, HIVOS and the Open Contracting Partnership have well-tested scoping study methods that have been used around the world to support national-level Open Contracting initiatives. Adapting this method for city or regional use, and providing kick-start funding to help local partnerships come together, assess their situation, and plan for change, could be a very effective way to catalyse a move from open contracting policy to local, relevant and high-impact practice.

With just £100k investment, central government could support studies in 10 or more areas.
Improved national metrics. As part of implementation of the last NAP commitment, the Contracts Finder platform now has a (very) basic statistics page, providing an overview of which public authorities are publishing their contracts. With the underlying open data, it’s possible to compute a few more metrics, exploring the kind of contracts different agencies are publishing information on, or assessing gaps between tender and award. However, central government could go a lot further in providing Business Intelligence dashboards on top of the data in Contracts Finder, and publishing much more accessible reports on policy compliance. The OpenTender.eu project demonstrates some of what can be done with the analysis of collated procurement data, calculating a range of indicators.
Empowering the Information Commissioner’s Office. The ICO has a key role in enforcing the public right to information, yet has a substantial backlog of cases, many including FOI requests relating to contracts. Support for the ICO to develop further guidance, monitor compliance and take enforcement activities against authorities who are hiding behind bogus commercial confidentiality arguments, could shift the balance from the current ‘closed by default’ position when it comes to the contract details that matter, to proper implementation of the open-by-default policy.
Extending FOI for contractors. Although the idea that the Freedom of Information Act should apply to any provider of public services, regardless of whether they are public or private sector, is one that has been put forward, and knocked back, in previous National Action Planning processes, it remains as relevant as ever. In light of the recent Carillion collapse, and with outsourcing arrangements looking increasingly shaky, the public right to know about delivery of public services clearly needs re-asserting more strongly.
Improved model contract clauses. Earlier rounds of the OGP NAP have secured model contract clauses for national government contracts, focussing on provision of performance information. Revisiting the question of model clauses, but with a focus on local government, and on further requirements around transparency of delivery, would offer a parallel route to increase transparency of local service delivery, creating a contractual right to information, pursued alongside efforts to extend the legal right through FOI.
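
As a rough illustration of the ‘improved national metrics’ idea above, the sketch below computes the median gap between tender close and first award for each buyer. It assumes records shaped like Open Contracting Data Standard releases (with `buyer`, `tender.tenderPeriod.endDate` and `awards[].date`); the sample records, buyer names and the `award_gaps_by_buyer` helper are all hypothetical, and real Contracts Finder data would need fetching and cleaning first.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical, hand-written records shaped like OCDS releases (not real data).
SAMPLE_RELEASES = [
    {"ocid": "ocds-example-001", "buyer": {"name": "Council A"},
     "tender": {"tenderPeriod": {"endDate": "2018-01-10T00:00:00Z"}},
     "awards": [{"date": "2018-02-09T00:00:00Z"}]},
    {"ocid": "ocds-example-002", "buyer": {"name": "Council A"},
     "tender": {"tenderPeriod": {"endDate": "2018-01-20T00:00:00Z"}},
     "awards": [{"date": "2018-01-30T00:00:00Z"}]},
    {"ocid": "ocds-example-003", "buyer": {"name": "Council B"},
     "tender": {"tenderPeriod": {"endDate": "2018-03-01T00:00:00Z"}},
     "awards": []},  # tender published, but no award information yet
]


def parse_date(value):
    """Parse the simple UTC timestamp format used in the sample records."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def award_gaps_by_buyer(releases):
    """Return the median days between tender close and first award, per buyer."""
    gaps = defaultdict(list)
    for release in releases:
        buyer = release.get("buyer", {}).get("name", "unknown")
        end = release.get("tender", {}).get("tenderPeriod", {}).get("endDate")
        award_dates = [a["date"] for a in release.get("awards") or [] if a.get("date")]
        if not end or not award_dates:
            continue  # cannot measure a gap without both dates
        first_award = min(parse_date(d) for d in award_dates)
        gaps[buyer].append((first_award - parse_date(end)).days)
    return {buyer: median(days) for buyer, days in gaps.items()}


print(award_gaps_by_buyer(SAMPLE_RELEASES))
```

Buyers with no usable dates simply drop out of the result here; in a compliance report those missing records would probably be the more interesting figure to publish.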

A mix of the commitments above would combine different levers: enforcement, incentives and oversight – with a chance to truly build effective open contracting. Within the wider UK landscape, for the OGP process to remain credible, we will need to see some serious and ambitious commitments, and open contracting is a key area where these could be made.

(Hat-tip to @carla_denyer for the framing of how to motivate government action used in the above, and to all at the Bristol @openGovUK workshop who discussed Open Contracting.)

UK open contracting: good policy & maturing platform – it’s time to invest in implementation

[Summary: reflecting on national open contracting progress in the UK]

Last week the Prime Minister issued a letter reminding central government departments of their transparency responsibilities and providing updated guidance on the information that should be disclosed and how. Amongst the guidance is a revised note on “Publication of Central Government Tenders and Contracts” which provides a good snapshot of the current position for national government contracting (and which is also framed as useful guidance for Local Authorities considering their responsibilities under the local government transparency code).

The note covers:

The legislative requirement to publish most opportunities and awards over £10,000 via the Contracts Finder platform;
The policy commitment of central government to see all tender documents, and contract texts, attached to those notices on Contracts Finder;
Guidance on all the documents that go to make up the contract (and so that should be attached to Contracts Finder);
Re-iteration of the limitations to redaction of contract documents;
Recommendations on transparency clauses to include in new contracts, to have clear agreement with suppliers over information that will be public.

As contracting transparency policy goes: this is good stuff. We’re not yet at the stage in the UK of having the kind of integrated public financial management systems that give us transparency from planning to final payment, nor are there the kind of lock-in measures such as checking a contract has been published before any invoices against it are paid. But it does provide a clear foundation to build on.

The platform that backs up this policy, Contracts Finder, has also seen some good progress recently. With hundreds of tender and award notices posted every week, it continues to provide good structured data in the Open Contracting Data Standard through an open API. In the last few weeks, the data has also started to capture company registration numbers for suppliers – a really important key to linking up contracting and company ownership information, and to better understanding patterns of public sector contracting. The steady progress of Contracts Finder as a national platform (with a number of features also now added to help capture sub-contracting processes too) makes it absolutely key to monitoring and improving implementation of the policies described above.

There are still some challenges for the platform: data quality (and document availability) for many of the records in Contracts Finder relies upon the features of e-Procurement systems used by departments or local authorities to manage their contracting processes. If these systems don’t encourage inclusion of company identifiers, or contracting documents, we may struggle to reach full policy compliance and the best data quality. Ongoing improvements to the APIs for data entry, and to the tools for monitoring data quality, could certainly help here, as would increased engagement with e-procurement system vendors to get them to bake open contracting into their platforms, as Chris Smith has called for.
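
To make the data quality point concrete, here is a minimal sketch of one possible monitoring check: the share of suppliers in a set of releases that carry a company registration number. It assumes OCDS-style `parties` entries with `roles` and an `identifier` object, and uses `GB-COH` as the identifier scheme for Companies House numbers; the sample data and the `supplier_identifier_coverage` function are hypothetical rather than part of any Contracts Finder tooling.

```python
# Hypothetical sample of OCDS-style releases (organisations and numbers are made up).
SAMPLE_RELEASES = [
    {"ocid": "ocds-example-001", "parties": [
        {"name": "Example Council", "roles": ["buyer"]},
        {"name": "Alpha Ltd", "roles": ["supplier"],
         "identifier": {"scheme": "GB-COH", "id": "01234567"}},
    ]},
    {"ocid": "ocds-example-002", "parties": [
        {"name": "Example Council", "roles": ["buyer"]},
        {"name": "Beta Ltd", "roles": ["supplier"]},  # no identifier captured
    ]},
]


def supplier_identifier_coverage(releases, scheme="GB-COH"):
    """Return the fraction of supplier entries with an identifier in `scheme`."""
    total = 0
    with_identifier = 0
    for release in releases:
        for party in release.get("parties", []):
            if "supplier" not in party.get("roles", []):
                continue
            total += 1
            identifier = party.get("identifier") or {}
            if identifier.get("scheme") == scheme and identifier.get("id"):
                with_identifier += 1
    return with_identifier / total if total else 0.0


print(supplier_identifier_coverage(SAMPLE_RELEASES))
```

Run per publishing authority or per e-procurement system, a figure like this could help show where the source systems are not capturing identifiers.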

However, as we head into 2018, whilst we have to keep working on policy and platforms – the real focus needs to be on implementation: monitoring and motivating each department or public agency to be sure they are not only seeing transparency in procurement as a tick-box compliance exercise, but instead making sure it is embraced as a core part of accountable and open government. To date, Open Contracting in the UK has been the work of a relatively small network of dedicated officials, activists and entrepreneurs. If the vibe at OC Global last month was anything to go by, 2018 may well be the year it moves into the mainstream.

Disclosure/notes

I’m a member of the UK Open Contracting Steering Group, working under Commitment 5 of the UK OGP plan and I work for Open Data Services Co-op as one of the Open Contracting Data Standard helpdesk team.

On the journey: five reflections from #ocglobal17 (Open Contracting Global)

At its heart, open contracting is a simple idea: whenever public money and resources are at stake through a contracting process, transparency and participation should be the norm.

Yet, as the Open Contracting Global Summit (#ocglobal17) in Amsterdam this week has demonstrated, it’s also an idea that brings together a very wide community. Reflecting on conversations from the week, I’ve tried here to capture five key reflections on where we are at, and where we might be heading:

(1) It’s not just procurement

Although the open contracting emphasis is often on the way governments buy goods and services, there are many other contracts where public resources are at stake: from licenses and concessions, to Public Private Partnership deals and grant agreements.

These each have different dynamics, and different approaches might be needed to open up each kind of process.

The Open Contracting Data Standard (OCDS) is primarily designed around procurement processes, although at OCGlobal we gave the first public preview of the OCDS for PPPs profile, that extends the OCDS data model to provide a structured way of recording in-depth disclosures for Public Private Partnership deals.

(2) It’s not just JSON

Thanks to Article 19, the corridors at OCGlobal had been turned into a ‘gallery of redaction’. Copies of contracting documents obtained through FOI requests provided tantalising hints of government and private sector deals: yet with all the key facts blacked out. These stood as a reminder of how many times the public are kept in the dark over contracts.

Neither documents, nor data, on their own will answer all the questions citizens or companies might have about contracting. Nor will they automatically spark the kinds of participation, scrutiny and engagement that are the essential complement of transparency.

Although publication of standardised data might be the most concrete manifestation of open contracting, it’s problematic to conflate transparency or open contracting with use of the OCDS JSON schema. Indeed, the 5-star model published as part of the guidance for OCDS 1.0 highlights that governments can take their first steps towards open contracting data by publishing any contracting information on the web, stepping up to machine-readability and standardised data as capacity allows.

Any other approach risks making the perfect into the enemy of the good: preventing publication until data is perfect.

The challenge ahead is in designing and refining the incentive structures that make sure open contracting efforts do not stop at getting a few documents online, or some fields in a JSON dataset – but instead that over time they broaden and deepen both disclosure, and effective use of the information that has been made available.

(3) It’s an iterative journey

There’s a much refreshed implementation section on the Open Contracting website, curating a range of guidance and tools to help put open contracting ideas into practice. The framing of a linear ‘seven steps’ journey towards open contracting is replaced with a ‘hopscotch’ presentation of the steps involved: with interlocking cycles of development and use.

This feels much closer to the reality I’ve experienced supporting open contracting implementations, which involve a dance back and forward between a vision for disclosure, and the reality of getting data and documents published from legacy systems, transparency features added to systems that are in development, or policies and practice changed to support greater citizen engagement in the contracting process.

There was a lot of talk at OC Global about e-procurement systems as the ideal source of open contracting data: yet for many countries, effective e-procurement deployments are a long way off, and so it’s important to keep in mind different ways tools like OCDS can be used:

Based-on – OCDS can provide a guide for reviewing and reflecting on current levels of disclosure, and for turning unstructured information into data to analyse. This is the approach pioneered by projects like Budeshi, who started out transcribing documents to data to demonstrate the value that a more data-driven approach could have to procurement monitoring.
Bolt-on – OCDS can be used as the target format when exporting data from existing contracting data systems. These might be reporting systems that capture regular monitoring returns on the contracting process, or transactional systems through which procurement is run. Here, the process of mapping existing data to OCDS can often reveal data quality issues in the source systems – and with the right feedback loops, this can lead to not only data publication, but also changes to improve data in future.
Built-in – OCDS can be used to inform the design of new systems – providing common shared data models, and a community where extended modelling of data can be discussed. However, it’s important to remember that building new systems is not just about data structures – it’s also about user experience, and right now, the OCDS doesn’t address this.
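
The ‘bolt-on’ route can be sketched in a few lines: map one row exported from an existing system onto an OCDS-shaped record, and collect the data quality issues the mapping reveals so they can be fed back to the source system. Everything specific here is an assumption for illustration: the legacy column names, the `ocds-xxxxxx` placeholder prefix, the GBP currency and the `legacy_row_to_release` helper. The output is a partial structure, not a complete or validated OCDS release.

```python
from datetime import datetime

# Hypothetical row exported from a legacy contracts register.
LEGACY_ROW = {
    "ref": "T-2017-042",
    "title": "Road resurfacing",
    "buyer_name": "Example Council",
    "supplier_name": "Example Ltd",
    "supplier_company_no": "",  # left blank in the source system
    "award_value": "125000",
    "award_date": "03/11/2017",  # DD/MM/YYYY in the source system
}


def legacy_row_to_release(row, ocid_prefix="ocds-xxxxxx"):
    """Map a legacy row to an OCDS-shaped dict and list the issues found."""
    issues = []
    award = {
        "id": f"{row['ref']}-award-1",
        "value": {"amount": float(row["award_value"]), "currency": "GBP"},
        "suppliers": [{"id": "supplier-1", "name": row["supplier_name"]}],
    }
    try:
        parsed = datetime.strptime(row["award_date"], "%d/%m/%Y")
        award["date"] = parsed.strftime("%Y-%m-%dT00:00:00Z")
    except ValueError:
        issues.append("award_date is missing or not DD/MM/YYYY")

    supplier = {"id": "supplier-1", "name": row["supplier_name"], "roles": ["supplier"]}
    if row.get("supplier_company_no"):
        supplier["identifier"] = {"scheme": "GB-COH", "id": row["supplier_company_no"]}
    else:
        issues.append("supplier has no company number")

    release = {
        "ocid": f"{ocid_prefix}-{row['ref']}",
        "id": f"{row['ref']}-award",
        "tag": ["award"],
        "initiationType": "tender",
        "parties": [
            {"id": "buyer-1", "name": row["buyer_name"], "roles": ["buyer"]},
            supplier,
        ],
        "buyer": {"id": "buyer-1", "name": row["buyer_name"]},
        "tender": {"id": row["ref"], "title": row["title"]},
        "awards": [award],
    }
    return release, issues


release, issues = legacy_row_to_release(LEGACY_ROW)
print(release["ocid"], issues)
```

Returning the issues alongside the data is one way of building the feedback loop described above: the export still happens, but the gaps in the source system are reported rather than silently papered over.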

To my mind, OCDS provides a structured framework that should support use in all these different ways. As we iterate on the standard itself, it’s important we don’t undermine this flexibility – but that instead we use it to establish common ground on which publishers and users can debate issues of data quality. With the standard, those debates should be actionable: but it’s not up to the standard itself to settle them.

(4) Contracting is core: but it doesn’t start or end there

Contracting is just one of the government processes that affects how resources are allocated and used. Before contracting starts, budgets are often set, or wide-reaching procurement plans established. During contract implementation, payment processes kick-in. And for the private companies involved in public contracts, there are all sorts of interlocking processes of registration, financing and taxation.

From an architectural perspective it’s important for us to understand the boundaries of the open contracting process, and how it can link up with other processes. For example, whilst OCDS can capture budget information as part of a contracting process (e.g. the amount of budget allocated to that process), it starts stretching the data model to represent a budget process nested within a contracting process.

As one of the break-out groups looking at budget, contract and spend integration noted, the key to joining up data is not putting everything in the same dataset or system, but comes from establishing business processes that ensure common identifiers are used to join up the systems that manage parallel processes.
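
A small sketch of what joining on common identifiers might look like in practice: three separate, hypothetical datasets (budget lines, contracts and payments) stay in their own systems, and are only linked at analysis time through shared IDs. The field names, identifiers and the `summarise_by_budget_line` helper are invented for illustration.

```python
from collections import defaultdict

# Three hypothetical datasets, each managed by a different system.
BUDGET_LINES = {"BL-01": {"description": "Highways maintenance", "amount": 500000}}
CONTRACTS = [
    {"contract_id": "C-100", "budget_line_id": "BL-01", "value": 125000},
    {"contract_id": "C-101", "budget_line_id": "BL-01", "value": 200000},
]
PAYMENTS = [
    {"payment_id": "P-1", "contract_id": "C-100", "amount": 50000},
    {"payment_id": "P-2", "contract_id": "C-100", "amount": 25000},
    {"payment_id": "P-3", "contract_id": "C-999", "amount": 1000},  # unknown contract
]


def summarise_by_budget_line(budget_lines, contracts, payments):
    """Link budget, contract and spend records through their shared identifiers."""
    contract_to_budget = {c["contract_id"]: c["budget_line_id"] for c in contracts}
    contracted = defaultdict(int)
    paid = defaultdict(int)
    unmatched = []

    for contract in contracts:
        contracted[contract["budget_line_id"]] += contract["value"]
    for payment in payments:
        budget_line_id = contract_to_budget.get(payment["contract_id"])
        if budget_line_id is None:
            unmatched.append(payment["payment_id"])  # identifier does not join up
            continue
        paid[budget_line_id] += payment["amount"]

    summary = {
        line_id: {
            "budgeted": line["amount"],
            "contracted": contracted[line_id],
            "paid": paid[line_id],
        }
        for line_id, line in budget_lines.items()
    }
    return summary, unmatched


print(summarise_by_budget_line(BUDGET_LINES, CONTRACTS, PAYMENTS))
```

The join itself is trivial; the unmatched payment is the part that reflects the point above, since it only goes away when the business process ensures the contract identifier is recorded at the time of payment.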

There’s a lot of work to do before we have easy interoperability between different parts of an overall accountability architecture – but the biggest issues are of data production and use, rather than of standards and schemas.

(5) It’s time to tidy our terminology

The open contracting community is broad, and, as I recently wrote over here, “the word ‘standard’ means different things to different people.” So does contracting. And tender. And validation. And assessment. And so on.

Following recent workshops in London and Argentina, the OCDS team have been thinking a lot about how we tighten up our use of key terms and concepts, establishing a set of draft translation principles and policies, and we’ve been reflecting more on how to also be clearer on ideas like data validity, quality and feedback.

But we also have to recognise that debates over language are laden with power dynamics: and specialist language can be used to impose or exclude. Open contracting should not be about dumbing down complex processes of contracting, but nor should it be about requiring every citizen to learn procurement-speak. Again, for OCDS and other tools designed to support open contracting, we have a balancing act: creating boundary objects that help different communities meet in the middle.

The first step towards this is just working out how we’re using words at the moment: checking on current practice, before working out how we can improve.

Gratitude

Aside from sparking a wealth of learning, the other thing an event like #OCGlobal17 does is remind me just how fortunate I am to get to work with such an inspiring network of people: exploring challenging issues with a great collaborative spirit. Thanks all!

The reflections above are more or less fragmentary, and I’m looking forward to working with many of the folk in the picture below to see where the journey takes us next.

Exploring participatory public data infrastructure in Plymouth

[Summary: Slides, notes and references from a conference talk in Plymouth]

Update – April 2020: A book chapter based on this blog post is now published as “Shaping participatory public data infrastructure in the smart city: open data standards and the turn to transparency” in The Routledge Companion to Smart Cities.

Original blog post version below:

A few months back I was invited to give a presentation to a joint plenary of the ‘Whose Right to the Smart City‘ and ‘DataAche 2017‘ conferences in Plymouth. Building on some recent conversations with Jonathan Gray, I took the opportunity to try and explore some ideas around the concept of ‘participatory data infrastructure’, linking those loosely with the smart cities theme.

As I fear I might not get time to turn it into a reasonable paper anytime soon, below is a rough transcript of what I planned to say when I presented earlier today. The slides are also below.

For those at the talk, the promised references are found at the end of this post.

Thanks to Satyarupa Shekar for the original invite, Katharine Willis and the Whose Right to the Smart Cities network for stimulating discussions today, and to the many folk whose ideas I’ve tried to draw on below.

Participatory public data infrastructure: open data standards and the turn to transparency

In this talk, my goal is to explore one potential strategy for re-asserting the role of citizens within the smart-city. This strategy harnesses the political narrative of transparency and explores how it can be used to open up a two-way communication channel between citizens, states and private providers.

This not only offers the opportunity to make processes of governance more visible and open to scrutiny, but it also creates a space for debate over the collection, management and use of data within governance, giving citizens an opportunity to shape the data infrastructures that do so much to shape the operation of smart cities, and of modern data-driven policy and its implementation.

In particular, I will focus on data standards, or more precisely, open data standards, as a tool that can be deployed by citizens (and, we must acknowledge, by other actors, each with their own, sometimes quite contrary interests), to help shape data infrastructures.

Let me set out the structure of what follows. It will be an exploration in five parts, the first three unpacking the title, and then the fourth looking at a number of case studies, before a final section summing up.

Participatory public data infrastructure
Transparency
Standards
Examples: Money, earth & air
Recap

Part 1: Participatory public data infrastructure

Data infrastructure

infrastructure. /ɪnfrəstrʌktʃə/ noun. “the basic physical and organizational structures and facilities (e.g. buildings, roads, power supplies) needed for the operation of a society or enterprise.” 1

The word infrastructure comes from the Latin ‘infra-‘ for below, and structure, meaning structure. It provides the shared set of physical and organizational arrangements upon which everyday life is built.

The notion of infrastructure is central to conventional imaginations of the smart city. Fibre-optic cables, wireless access points, cameras, control systems, and sensors embedded in just about anything, constitute the digital infrastructure that feeds into new, more automated, organizational processes. These in turn direct the operation of existing physical infrastructures for transportation, the distribution of water and power, and the provision of city services.

However, between the physical and the organizational lies another form of infrastructure: data and information infrastructure.

(As a sidebar: Although data and information should be treated as analytically distinct concepts, as the boundary between the two concepts is often blurred in the literature, including in discussions of ‘information infrastructures’, and as information is at times used as a super-category including data, I won’t be too strict in my use of the terms in the following).

(That said,) It is by being rendered as structured data that the information from the myriad sensors of the smart city, or the submissions by hundreds of citizens through reporting portals, are turned into management information, and fed into human or machine based decision-making, and back into the actions of actuators within the city.

Seen as a set of physical or digital artifacts, the data infrastructure involves ETL (Extract, Transform, Load) processes, APIs (Application Programming Interfaces), databases and data warehouses, stored queries and dashboards, schema, codelists and standards. Seen as part of a wider ‘data assemblage’ (Kitchin 5) this data infrastructure also involves various processes of data entry and management, of design, analysis and use, as well as relationships to other external datasets, systems and standards.

However, it is often very hard to ‘see’ data infrastructure. By their very nature, infrastructures move into the background, often only ‘visible upon breakdown’ to use Star and Ruhleder’s phrase 2. (For example, you may only really pay attention to the shape and structure of the road network when your planned route is blocked…). It takes a process of “infrastructural inversion” to bring information infrastructures into view 3, deliberately foregrounding the background. I will argue shortly that ‘transparency’ as a policy performs much the same function as ‘breakdown’ in making the contours of infrastructure more visible: taking something created with one set of use-cases in mind, and placing it in front of a range of alternative use-cases, such that its affordances and limitations can be more fully scrutinized, and building on that scrutiny, its future development shaped. But before we come to that, we need to understand the extent of ‘public data infrastructure’ and the different ways in which we might understand a ‘participatory public data infrastructure’.

Public data infrastructure

There can be public data without a coherent public data infrastructure. In ‘The Responsive City’ Goldsmith and Crawford describe the status quo for many as “The century-old framework of local government – centralized, compartmentalized bureaucracies that jealously guard information…” 4. Datasets may exist, but are disconnected. Extracts of data may even have come to be published online in data portals in response to transparency edicts – but they exist as islands of data, published in different formats and structures, without any attention to interoperability.

Against this background, initiatives to construct public data infrastructure have sought to introduce shared technology, standards and practices that provide access to a more coherent collection of data generated by, and focusing on, the public tasks of government.

For example, in 2012, Denmark launched their ‘Basic Data’ programme, looking to consolidate the management of geographic, address, property and business data across government, and to provide common approaches to data management, update and distribution 6. In the European Union, the INSPIRE Directive and programme has been driving creation of a shared ‘Spatial Data Infrastructure’ since 2007, providing reference frameworks, interoperability rules, and data sharing processes. And more recently, the UK Government has launched a ‘Registers programme’ 8 to create centralized reference lists and identifiers of everything from countries to government departments, framed as part of building government’s digital infrastructure. In cities, similar processes of infrastructure building, around shared services, systems and standards are taking place.

The creation of these data infrastructures can clearly have significant benefits for both citizens and government. For example, instead of citizens having to share the same information with multiple services, often in subtly different ways, through a functioning data infrastructure governments can pick up and share information between services, and can provide a more joined up experience of interacting with the state. By sharing common codelists, registers and datasets, agencies can end duplication of effort, and increase their intelligence, drawing more effectively on the data that the state has collected.

However, at the same time, these data infrastructures tend to have a particularly centralizing effect. Whereas a single agency maintaining their own dataset has the freedom to add in data fields, or to restructure their working processes, in order to meet a particular local need – when that data is managed as part of a centralized infrastructure, their ability to influence change in the way data is managed will be constrained both by the technical design and the institutional and funding arrangements of the data infrastructure. A more responsive government is not only about better intelligence at the center, it is also about autonomy at the edges, and this is something that data infrastructures need to be explicitly designed to enable, and something that they are generally not oriented towards.

In “Roads to Power: Britain Invents the Infrastructure State” 10, Jo Guldi uses a powerful case study of the development of the national highways networks to illustrate the way in which the design of infrastructures shapes society, and to explore the forces at play in shaping public infrastructure. When metaled roads first spread out across the country in the eighteenth century, there were debates over whether to use local materials, easy to maintain with local knowledge, or to apply a centralized ‘tarmacadam’ standard to all roads. There were questions of how the network should balance the needs of the majority with road access for those on the fringes of the Kingdom, and how the infrastructure should be funded. This public infrastructure was highly contested, and the choices made over its design had profound social consequences. Jo uses this as an analogy for debates over modern Internet infrastructures, but it can be equally applied to explore questions around an equally intangible public data infrastructure.

If you build roads to connect the largest cities, but leave out a smaller town, the relative access of people in that town to services, trade and wider society is diminished. In the same way, if your data infrastructure lacks the categories to describe the needs of a particular population, their needs are less likely to be met. Yet that town might also not want to be connected directly to the road network, and to see its uniqueness and character eroded; much like some groups may also want to resist their categorization and integration in the data infrastructure in ways that restrict their ability to self-define and develop autonomous solutions, in the face of centralized data systems that are necessarily reductive.

Alongside this tension between centralization and decentralization in data infrastructures, I also want to draw attention to another important aspect of public data infrastructures. That is the issue of ownership and access. Increasingly public data infrastructures may rely upon stocks and flows of data that are not publicly owned. In the United Kingdom, for example, the Postal Address File, which is the basis of any addressing service, was one of the assets transferred to the private sector when Royal Mail was sold off. The Ordnance Survey retains ownership and management of the Unique Property Reference Number (UPRN), a central part of the data infrastructure for local public service delivery, yet access to this is heavily restricted, and complex agreements govern the ability of even the public sector to use it. Historically, authorities have faced major challenges in relation to ‘derived data’ from Ordnance Survey datasets, where the use of proprietary mapping products as a base layer when generating local records ‘infects’ those local datasets with intellectual property rights of the proprietary dataset, and restricts who they can be shared with. Whilst open data advocacy has secured substantially increased access to many publicly owned datasets in recent years, when the datasets the state is using are privately owned in the first place, and only licensed to the state, the potential scope for public re-use and scrutiny of the data, and scrutiny of the policy made on the basis of it, is substantially limited.

In the case of smart cities, I suspect this concern is likely to be particularly significant. Take transit data for example: in 2015 Boston, Massachusetts did a deal with Uber to allow access to data from the data-rich transportation firm to support urban planning and to identify approaches to regulation. Whilst the data shared reveals something of travel times, the limited granularity rendered it practically useless for planning purposes, and Boston turned to senate regulations to try and secure improved data 9. Yet, even if the city does get improved access to data about movements via Uber and Lyft in the city – the ability of citizens to get involved in the conversations about policy from that data may be substantially limited by continued access restrictions on the data.

With the Smart City model often involving the introduction of privately owned sensor networks and processes, the extent to which the ‘data infrastructure’ for public tasks ceases to have the properties that we will shortly see are essential to a ‘participatory public data infrastructure’ is a question worth paying attention to.

Participatory public data infrastructure

I will posit then that the growth of public data infrastructures is almost inevitable. But the shape they take is not. I want, in particular then, to examine what it would mean to have a participatory public data infrastructure.

I owe the concept of a ‘participatory public data infrastructure’ in particular to Jonathan Gray ([11], [12], [13]), who has, across a number of collaborative projects, sought to unpack questions of how data is collected and structured, as well as released as open data. In thinking about the participation of citizens in public data, we might look at three aspects:

Participation in data use
Participation in data production
Participation in data design

And, seeing these as different in kind, rather than different in degree, we might for each one deploy Arnstein’s ladder of participation [14] as an analytical tool, to understand that the extent of participation can range from tokenism through to full shared decision making. As for all participation projects, we must also ask the vitally important question of ‘who is participating?’.

At the bottom-level ‘non-participation’ rungs of Arnstein’s ladder we could see a data infrastructure that captures data ‘about’ citizens, without their active consent or involvement, that excludes them from access to the data itself, and then uses the data to set rules, ‘deliver’ services, and enact policies over which citizens have no influence in either their design or delivery. The citizen is treated as an object, not an agent, within the data infrastructure. For some citizens’ contemporary experience, and in some smart city visions, this description might not be far from a direct fit.

By contrast, when citizens have participation in the use of a data infrastructure they are able to make use of public data to engage in both service delivery and policy influence. This has been where much of the early civic open data movement placed their focus, drawing on ideas of co-production, and government-as-a-platform, to enable partnerships or citizen-controlled initiatives, using data to develop innovative solutions to local issues. In a more political sense, participation in data use can remove information inequality between policy makers and the subjects of that policy, equalizing at least some of the power dynamic when it comes to debating policy. If the ‘facts’ of population distribution and movement, electricity use, water connections, sanitation services and funding availability are shared, such that policy maker and citizen are working from the same data, then the data infrastructure can act as an enabler of more meaningful participation.

In my experience though, the more common outcome when engaging diverse groups in the use of data is not an immediate shared analysis – but instead a lot of discussion of gaps and issues in the data itself. In some cases, the way data is being used might be uncontested, but the input might turn out to be misrepresenting the lived reality of citizens. This takes us to the second area of participation: the ability to not just take from a dataset, but also to participate in dataset production. Simply having data collected from citizens does not make a data infrastructure participatory. That sensors tracked my movement around an urban area does not make me an active participant in collecting data. But by contrast, when citizens come together to collect new datasets, such as the water and air quality datasets generated by sensors from Public Lab 15, and are able to feed this into the shared corpus of data used by the state, there is much more genuine participation taking place. Similarly, the use of voluntarily contributed data on Open Street Map, or submissions to issue-tracking platforms like FixMyStreet, constitute a degree of participation in producing a public data infrastructure when the state also participates in use of those platforms.

It is worth noting, however, that most participatory citizen data projects, whether concerned with data use or production, are both patchy in their coverage, and hard to sustain. They tend to offer an add-on to the public data infrastructure, but to leave the core substantially untouched, not least because of the significant biases that can occur due to inequalities of time, hardware and skills to be able to contribute and take part.

If then we want to explore participation that can have a sustainable impact on policy, we need to look at shaping the core public data infrastructure itself – looking at the existing data collection activities that create it, and exploring whether or not the data collected, and how it is encoded, serves the broad public interest, and allows the maximum range of democratic freedom in policy making and implementation. This is where we can look at a participatory data infrastructure as one that enables citizens (and groups working on their behalf) to engage in discussions over data design.

The idea that communities, and citizens, should be involved in the design of infrastructures is not a new one. In fact, the history of public statistics and data owes a lot to voluntary social reform focused on health and social welfare collecting social survey data in the eighteenth and nineteenth centuries to influence policy, and then advocating for government to take up ongoing data collection. The design of the census and other government surveys has long been a source of political contention. Yet, with the vast expansion of connected data infrastructures, which rapidly become embedded, brittle and hard to change, we are facing a particular moment at which increased attention is needed to the participatory shaping of public data infrastructures, and to considering the consequences of seemingly technical choices on our societies in the future.

Ribes and Baker [16], in writing about the participation of social scientists in shaping research data infrastructures draw attention to the aspect of timing: highlighting the limited window during which an infrastructure may be flexible enough to allow substantial insights from social science to be integrated into its development. My central argument is that transparency, and the move towards open data, offers a key window within which to shape data infrastructures.

Part 2: Transparency

transparency /tranˈspar(ə)nsi/ noun “the quality of being done in an open way without secrets” 21

Advocacy for open data has many distinct roots: not only in transparency. Indeed, I’ve argued elsewhere that it is the confluence of many different agendas around a limited consensus point in the Open Definition that allowed the breakthrough of an open data movement late in the last decade [17] [18]. However, the normative idea of transparency plays an important role in questions of access to public data. It was a central part of the framing of Obama’s famous ‘Open Government Directive’ in 2009 20, and transparency was core to the rhetoric around the launch of data.gov.uk in the wake of a major political expenses scandal.

Transparency is tightly coupled with the concept of accountability. When we talk about government transparency, it is generally as part of government giving account for its actions: whether to individuals, or to the population at large via the third and fourth estates. To give effective account means it can’t just make claims, it has to substantiate them. Transparency is a tool allowing citizens to exercise control over their governments.

Sweden’s Freedom of the Press law from 1766 was the first to establish a legal right to information, but it was a slow burn until the middle of the last century, when ‘right to know’ statutes started to gather pace such that over 100 countries now have Right to Information laws in place. Increasingly, these laws recognize that transparency requires not only access to documents, but also access to datasets.

It is also worth noting that transparency has become an important regulatory tool of government: where government may demand transparency of others. As Fung et al. argue in ‘Full Disclosure’, governments have turned to targeted transparency as a way of requiring that certain information (including from the private sector) is placed in the public domain, with the goal of disciplining markets or influencing the operation of marketized public services, by improving the availability of information upon which citizens will make choices [19].

The most important thing to note here is that demands for transparency are often not just about ‘opening up’ a dataset that already exists – but ultimately are about developing an account of some aspect of public policy. To create this account might require data to be connected up from different silos, and may require the creation of new data infrastructures.

This is where standards enter the story.

Part 3: Standards

standard /ˈstandəd/ noun

something used as a measure, norm, or model in [comparative] evaluations.

The first thing I want to note about ‘standards’ is that the term is used in very different ways by different communities of practice. For a technical community, the idea of a data standard more-or-less relates to a technical specification or even schema, by which the exact way that certain information should be represented as data is set out in minute detail. To assess if data ‘meets’ the standard is a question of how the data is presented. For a policy audience, talk of data standards may be interpreted much more as a question of collection and disclosure norms. To assess if data meets the standard here is more a question of what data is presented. In practice, these aspects interrelate. With anything more than a few records, to assess ‘what’ has been disclosed requires processing data, and that requires it to be modeled according to some reasonable specification.

The second thing I want to note about standards is that they are highly interconnected. If we agree upon a standard for the disclosure of government budget information, for example, then in order to produce data to meet that standard, government may need to check that a whole range of internal systems are generating data in accordance with the standard. The standard for disclosure that sits on the boundary of a public data infrastructure can have a significant influence on other parts of that infrastructure, or its operation can be frustrated when other parts of the infrastructure can’t produce the data it demands.

The third thing to note is that a standard is only really a standard when it has multiple users. In fact, the greater the community of users, the stronger, in effect, the standard is.

So – with these points in mind, let’s look at how a turn to transparency and open data has created both pressure for application of data standards, and an opening for participatory shaping of data infrastructures.

One of the early rallying cries of the open data movement was ‘Raw Data Now’. Yet, it turns out raw data, as a set of database dumps of selected tables from the silo datasets of the state does not always produce effective transparency. What it does do, however, is create the start of a conversation between citizen, private sector and state over the nature of the data collected, held and shared.

Take for example this export from a council’s financial system in response to a central government policy calling for transparency on spend over £500.

Service Area	ServDiv Code	Type	Code	Date	Transaction No.	Amount	Revenue / Capital	Supplier
Balance Sheet	900551	Insurance Claims Payment (Ext)	47731	31.12.2010	1900629404	50,000.00	Revenue	Zurich Insurance Co
Balance Sheet	900551	Insurance Claims Payment (Ext)	47731	01.12.2010	1900629402	50,000.00	Revenue	Zurich Insurance Co
Balance Sheet	933032	Other income	82700	01.12.2010	1900632614	-3,072.58	Revenue	Unison Collection Account
Balance Sheet	934002	Transfer Values paid to other schemes	11650	02.12.2010	1900633491	4,053.21	Revenue	NHS Pensions Scheme Account
Balance Sheet	900601	Insurance Claims Payment (Ext)	47731	06.12.2010	1900634912	1,130.54	Revenue	Shires (Gloucester) Ltd
Balance Sheet	900652	Insurance Claims Payment (Int)	47732	06.12.2010	1900634911	1,709.09	Revenue	Bluecoat C Of E Primary School
Balance Sheet	900652	Insurance Claims Payment (Int)	47732	10.12.2010	1900637635	1,122.00	Revenue	Christ College Cheltenham

It comes from data generated for one purpose (the council’s internal financial management), now being made available for another purpose (external accountability), but that might also be useful for a range of further purposes (companies looking to understand business opportunities; other councils looking to benchmark their spending, and so on). Stripped of its context as part of internal financial systems, the column headings make less sense: what is BVA COP? Is the date the date of invoice? Or of payment? What does each ServDiv Code relate to? The first role of any standardization is often to document what the data means: and in doing so, to surface unstated assumptions.
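
To make that documentation step concrete, here is a minimal Python sketch that parses two rows of the export above. Even this small amount of code has to commit to guesses the raw file leaves unstated: that dates are written day.month.year, that commas in amounts are thousands separators, and that a negative amount is money coming in. The function name and the output field names are illustrative assumptions, not part of any council system or published standard.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Header and two rows copied from the export above (tab-separated).
RAW_EXPORT = (
    "Service Area\tServDiv Code\tType\tCode\tDate\tTransaction No.\t"
    "Amount\tRevenue / Capital\tSupplier\n"
    "Balance Sheet\t900551\tInsurance Claims Payment (Ext)\t47731\t"
    "31.12.2010\t1900629404\t50,000.00\tRevenue\tZurich Insurance Co\n"
    "Balance Sheet\t933032\tOther income\t82700\t"
    "01.12.2010\t1900632614\t-3,072.58\tRevenue\tUnison Collection Account\n"
)


def parse_export(text):
    """Turn the raw export into records with explicit, documented types."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        records.append({
            # Assumption: an internal code for a service division.
            "service_division_code": row["ServDiv Code"],
            # Assumption: DD.MM.YYYY. The export does not say whether this
            # is the invoice date or the payment date.
            "date": datetime.strptime(row["Date"], "%d.%m.%Y").date(),
            # Assumption: commas are thousands separators, currency is GBP.
            "amount": Decimal(row["Amount"].replace(",", "")),
            "supplier": row["Supplier"],
        })
    return records


records = parse_export(RAW_EXPORT)
```

Each comment marked as an assumption is a question that a documented standard would need to answer explicitly.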

But standardization also plays a role in allowing the emerging use cases for a dataset to be realized. For example, when data columns are aligned, comparison across council spending is facilitated. Private firms interested in providing such comparison services may also have a strong interest in seeing each of the authorities providing data doing so to a common standard, to lower their costs of integrating data from each new source.
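
As a rough illustration of what aligning columns means in practice, the sketch below renames the headings used by two councils to one shared set of field names. Only the first council’s headings come from the export shown earlier; the second council’s headings, the shared field names and the function name are all hypothetical.

```python
# Mapping from each council's local column headings to shared field names.
# "council_b" and its headings are invented for illustration.
COLUMN_MAPS = {
    "council_a": {"Supplier": "supplier", "Amount": "amount", "Date": "date"},
    "council_b": {"Payee": "supplier", "Net Value": "amount", "Paid On": "date"},
}


def align_record(record, council):
    """Return a copy of the record that uses the shared field names.

    Columns that are not in the mapping are dropped, which is itself a
    design choice: a common standard decides what is kept for comparison.
    """
    mapping = COLUMN_MAPS[council]
    return {shared: record[local] for local, shared in mapping.items()}


row_a = align_record(
    {"Supplier": "Zurich Insurance Co", "Amount": "50,000.00",
     "Date": "31.12.2010", "Type": "Insurance Claims Payment (Ext)"},
    "council_a",
)
row_b = align_record(
    {"Payee": "Zurich Insurance Co", "Net Value": "1200.00",
     "Paid On": "2010-12-03"},
    "council_b",
)
```

Without a common standard, every re-user has to maintain a mapping like this for every publisher, which is the integration cost that firms have an interest in lowering. Note that renaming headings is not enough on its own: the two rows still carry dates and amounts in different formats.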

If standards are just developed as the means of exchanging data between government and private sector re-users of the data, the opportunities for constructing a participatory data infrastructure are slim. But when standards are explored as part of the transparency agenda, and as part of defining both the what and the how of public disclosure, such opportunities are much richer.

When budget and spend open data became available in Sao Paulo in Brazil, a research group at University of Sao Paulo, led by Gisele Craveiro, explored how to make this data more accessible to citizens at a local level. They found that by geocoding expenditure, and color coding based on planned, committed and settled funds, they could turn the data from impenetrable tables into information that citizens could engage with. More importantly, they argue that in engaging with government around the value of geocoded data “moving towards open data can lead to changes in these underlying and hidden process [of government data creation], leading to shifts in the way government handles its own data” [22]

The important act here was to recognize open data-enabled transparency not just as a one-way communication from government to citizens, but as an invitation for dialog about the operation of the public data infrastructure, and an opportunity to get involved – explaining that, if government took more care to geocode transactions in its own systems, it would not have to wait for citizens to participate in data use and to expend the substantial labour on manually geocoding some small amount of spending, but instead the opportunity for better geographic analysis of spending would become available much more readily inside and outside the state.

I want to give three brief examples of where the development, or not, of standards is playing a role in creating more participatory data infrastructures, and in the process to draw out a couple of other important aspects of thinking about transparency and standardization as part of the strategic toolkit for asserting citizen rights in the context of smart cities.

Part 4: Examples

Contracts

My first example looks at contracts for two reasons. Firstly, it’s an area I’ve been working on in depth over the last few years, as part of the team creating and maintaining the Open Contracting Data Standard. But, more importantly, it’s an under-explored aspect of the smart city itself. For most cities, how transparent is the web of contracts that establishes the interaction between public and private players? Can you easily find the tenders and awards for each component of the new city infrastructure? Can you see the terms of the contracts and easily read up on who owns and controls each aspect of emerging public data infrastructure? All too often the answer to these questions is no. Yet, when it comes to procurement, the idea of transparency in contracting is generally well established, and global guidance on Public Private Partnerships highlights transparency of both process and contract documents as an essential component of good governance.

The Open Contracting Data Standard emerged in 2014 as a technical specification to give form to a set of principles on contracting disclosure. It was developed through a year-long process of research, going back and forth between a focus on ‘data supply’ and understanding the data that government systems are able to produce on their contracting, and ‘data demand’, identifying a wide range of user groups for this data, and seeking to align the content and structure of the standard with their needs. This resulted in a standard that provides a framework for publication of detailed information at each stage of a contracting process, from planning, through tender, award and signed contract, right through to final spending and delivery.

Meeting this standard in full is quite demanding for authorities. Many lack existing data infrastructures that provide common identifiers across the whole contracting process, and so adopting OCDS for data disclosure may involve some elements of update to internal systems and processes. The transparency standard has an inwards effect, shaping not only the data published, but the data managed. In supporting implementation of OCDS, we’ve also found that the process of working through the structured publication of data often reveals as yet unrecognized data quality issues in internal systems, and issues of compliance with existing procurement policies.
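
To illustrate why common identifiers matter, the following sketch groups records from different stages by a shared process identifier and reports which stages are missing for each process. The stage labels follow the sequence described above; the record layout, the `process_id` and `stage` keys and the function names are simplifying assumptions for illustration, not the structure of OCDS itself.

```python
from collections import defaultdict

# Stage labels loosely follow the sequence described in the text.
STAGES = ["planning", "tender", "award", "contract", "delivery"]


def stages_by_process(records):
    """Group the stages seen for each contracting process identifier."""
    seen = defaultdict(set)
    for record in records:
        seen[record["process_id"]].add(record["stage"])
    return dict(seen)


def missing_stages(records):
    """List, in order, the stages with no published record per process."""
    return {
        process_id: [stage for stage in STAGES if stage not in stages]
        for process_id, stages in stages_by_process(records).items()
    }


# Hypothetical records, as they might come from separate internal systems.
published = [
    {"process_id": "P-1", "stage": "tender"},
    {"process_id": "P-1", "stage": "award"},
    {"process_id": "P-2", "stage": "planning"},
]
gaps = missing_stages(published)
```

A check like this is only possible when the tendering, award and payment systems share an identifier. Where they do not, the grouping step fails, and that failure is often the first data quality issue that publication brings to light.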

Now, two critiques that might be offered of standards are that, as highly technical objects, their development is only open to participation from a limited set of people, and that in setting out a uniform approach to data publication, they are a further tool of centralization. Both these are serious issues.

In the Open Contracting Data Standard we’ve sought to navigate them by working hard on having an open governance process for the standard itself, and using a range of strategies to engage people in shaping the standard, including workshops, webinars, peer-review processes and presenting the standard in a range of more accessible formats. We’re also developing an implementation and extensions model that encourages local debate over exactly which elements of the overall framework should be prioritized for publication, whilst highlighting the fields of data that are needed in order to realize particular use-cases.

This highlights an important point: standards like OCDS are more than the technical spec. There is a whole process of support, community building, data quality assurance and feedback going on to encourage data interoperability, and to support localization of the standard to meet particular needs.

When standards create the space, then other aspects of a participatory data infrastructure are also enabled and facilitated. A reliable flow of data on pipeline contracts may allow citizens to scrutinize the potential terms of tenders for smart city infrastructure before contracts are awarded and signed, and an infrastructure with the right feedback mechanisms could ensure, for example, that performance-based payments to providers are properly influenced by independent citizen input.

The thesis here is one of breadth and depth. An open standard developed in a participatory way allows a relatively small-investment intervention to shape a broad section of public data infrastructure, influencing the internal practice of government and establishing the conditions for more ad-hoc deep-dive interventions that allow citizens to use that data to pursue particular projects of change.

Earth

The second example explores this in the context of land. Who owns the smart city?

The Open Data Index and Open Data Barometer studies of global open data availability have had a ‘Land Ownership’ category for a number of years, and there is a general principle that land ownership information should, to some extent, be public. However, exactly what should be published is a tricky question. An over-simplified schema might ignore the complex realities of land rights, trying to reduce a set of overlapping claims to a plot number and owner. By contrast, the narrative accounts of ownership that often exist in the documentary record may be too complex to render as data [24]. In working on a refined Open Data Index category, the Cadasta Foundation 23 noted that opening up property owners’ names in the context of a stable country with functioning rule of law “has very different risks and implications than in a country with less formal documentation, or where dispossession, kidnapping, and or death are real and pervasive issues” 23.

The point here is that a participatory process around the standards for transparency may not, from the citizen perspective, always drive at more disclosure, but that at times, standards may also need to protect the ‘strategic invisibility’ of marginalized groups [25]. In the United Kingdom, although individual titles can be bought for £3 from the Land Registry, no public dataset of title-holders is available. However, there are moves in place to establish a public dataset of land owned by private firms, or foreign owners, coming in part out of an anti-corruption agenda. This fits with the idea that, as Sunil Abraham puts it, “privacy should be inversely proportional to power” 26.

Central property registers are not the only source of data relevant to the smart city. Public authorities often have their own data on public assets. A public conversation on the standards needed to describe this land, and share information about it, is arguably overdue. Again looking at the UK experience, the government recently consulted on requiring authorities to record all information on their land assets through the Property Information Management system (ePIMS): centralizing information on public property assets, but doing so against a reductive schema that serves central government interests. In the consultation on this I argued that, by contrast, we need an approach based on a common standard for describing public land, but that allows local areas the freedom to augment a core schema with other information relevant to local policy debates.
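
One way to picture a core schema that local areas can augment is sketched below: a record must carry a small set of shared fields, and anything beyond them is kept as local extension data rather than rejected. The field names and the function are hypothetical, chosen only to illustrate the pattern; they are not taken from ePIMS or from any published standard.

```python
# Hypothetical core fields that every authority would be asked to provide.
CORE_FIELDS = {"asset_id", "location", "holding_authority"}


def split_record(record):
    """Separate shared core fields from locally defined extension fields.

    Raises ValueError if a core field is missing, but never rejects a
    record for carrying extra, locally meaningful information.
    """
    missing = CORE_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing core fields: {sorted(missing)}")
    core = {name: record[name] for name in CORE_FIELDS}
    local = {name: value for name, value in record.items()
             if name not in CORE_FIELDS}
    return core, local


core, local = split_record({
    "asset_id": "A1",
    "location": "High Street",
    "holding_authority": "Example Council",
    "community_use": "allotments",  # a locally defined field
})
```

The design choice is in what the validator refuses. A reductive central schema would discard or reject `community_use`, whereas this pattern keeps it available for local policy debate while still guaranteeing the fields needed for comparison.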

Air

From the earth, let us turn very briefly to the air. Air pollution is a massive issue, causing millions of premature deaths worldwide every year. It is an issue that is particularly acute in urban areas. Yet, as the Open Data Institute note “we are still struggling to ‘see’ air pollution in our everyday lives” 27. They report the case of decision making on a new runway at Heathrow Airport, where policy makers were presented with data from just 14 NO2 sensors. By contrast, a network of citizen sensors provided much more granular information, and information from citizens’ gardens and households, offering a different account from those official sensors by roads or in fields.

Mapping the data from official government air quality sensors reveals just how limited their coverage is: and backs up the ODI’s calls for a collaborative, or participatory, data infrastructure. In a 2016 blog post, Jamie Fawcett describes how:

“Our current data infrastructure for air quality is fragmented. Projects each have their own goals and ambitions. Their sensor networks and data feeds often sit in silos, separated by technical choices, organizational ambition and disputes over data quality and sensor placement. The concerns might be valid, but they stand in the way of their common purpose, their common goals.”

He concludes “We need to commit to providing real-time open data using open standards.”

This is a call for transparency by both public and private actors: agreeing to allow re-use of their data, and rendering it comparable through common standards. The design of such standards will need to carefully balance public and private interests, and to work out how the costs of making data comparable will fall between data publishers and users.

Part 5: Recap

So, to briefly recap:

I want to draw attention to the data infrastructures of the smart city and the modern state;
I’ve suggested that open data and transparency can be powerful tools in performing the kind of infrastructural inversion that brings the context and history of datasets into view and opens them up to scrutiny;
I’ve furthermore argued that transparency policy opens up an opportunity for a two-way dialogue about public data infrastructures, and for citizen participation not only in the use and production of data, but also in setting standards for data disclosure;
I’ve then highlighted how standards for disclosure don’t just shape the data that enters the public domain, but they also have an upwards impact on the shape of the public data infrastructure itself.

Taken together, this is a call for more focus on the structure and standardization of data, and more work on exploring the current potential of standardization as a site of participation, and an enabler of citizen participation in future.

If you are looking for a more practical set of takeaways that flow from all this, let me offer a set of questions that can be asked of any smart cities project, or indeed, any data-rich process of governance:

(1) What information is pro-actively published, or can be demanded, as a result of transparency and right to information policies?
(2) What does the structure of the data reveal about the process/project it relates to?
(3) What standards might be used to publish this data?
(4) Do these standards provide the data I, or other citizens, need to be empowered in relation to this process/project?
(5) Are these open standards? Whose needs were they designed to serve?
(6) Can I influence these standards? Can I afford not to?

References

1: https://www.google.co.uk/search?q=define%3Ainfrastructure, accessed 17th August 2017

2: Star, S., & Ruhleder, K. (1996). Steps Toward an Ecology of Infrastructure: Design and Access for Large Information Spaces. Information Systems Research, 7(1), 111–134.

3: Bowker, G. C., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. The MIT Press.

4: Goldsmith, S., & Crawford, S. (2014). The responsive city. Jossey-Bass.

5: Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE Publications.

6: The Danish Government. (2012). Good Basic Data for Everyone – a Driver for Growth and Efficiency, (October 2012)

7: Bartha, G., & Kocsis, S. (2011). Standardization of Geographic Data: The European INSPIRE Directive. European Journal of Geography, 22, 79–89.

10: Guldi, J. (2012). Roads to power: Britain invents the infrastructure state.

[11]: Gray, J., & Davies, T. (2015). Fighting Phantom Firms in the UK : From Opening Up Datasets to Reshaping Data Infrastructures?

[12]: Gray, J., & Tommaso Venturini. (2015). Rethinking the Politics of Public Information: From Opening Up Datasets to Recomposing Data Infrastructures?

[13]: Gray, J. (2015). DEMOCRATISING THE DATA REVOLUTION: A Discussion Paper

[14]: Arnstein, S. R. (1969). A ladder of citizen participation. Journal of the American Institute of Planners, 34(5), 216–224.

[16]: Ribes, D., & Baker, K. (2007). Modes of social science engagement in community infrastructure design. Proceedings of the 3rd Communities and Technologies Conference, C and T 2007, 107–130.

[17]: Davies, T. (2010, September 29). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk.

[18]: Davies, T. (2014). Open Data Policies and Practice: An International Comparison.

[19]: Fung, A., Graham, M., & Weil, D. (2007). Full Disclosure: The Perils and Promise of Transparency (1st ed.). Cambridge University Press.

[22]: Craveiro, G. S., Machado, J. A. S., Martano, A. M. R., & Souza, T. J. (2014). Exploring the Impact of Web Publishing Budgetary Information at the Sub-National Level in Brazil.

[24]: Hetherington, K. (2011). Guerrilla auditors: the politics of transparency in neoliberal Paraguay. London: Duke University Press.

[25]: Scott, J. C. (1987). Weapons of the Weak: Everyday Forms of Peasant Resistance.

Open data for tax justice: the real design challenge is social

[Summary: Thinking aloud about a pragmatic / humanist approach to data infrastructure building]

Stephen Abbott Pugh of Open Knowledge International has just blogged about the Open Data for Tax Justice ‘design sprint’ that took place in London on Monday and Tuesday. I took part in the first day and a half of the workshop, and found myself fairly at odds with the approach being taken, which focussed narrowly on the data-pipelines-based creation of a centralised dataset, and which appeared to create barriers rather than bridges between data and domain experts. Rather than rethinking the approach, as I would argue is needed, the Open Knowledge write-up appears to show the Open Data for Tax Justice project heading further down this flawed path.

In this post, I’m offering an (I hope) constructive critique of the approach, trying to draw out some more general principles that might inform projects to create more participatory data infrastructures.

The context

As the OKI post relates:

“Country-by-country reporting (CBCR) is a transparency mechanism which requires multinational corporations to publish information about their economic activities in all of the countries where they operate. This includes information on the taxes they pay, the number of people they employ and the profits they report.”

Country by Country reporting has been a major ask of tax justice campaigners since the early 2000s, in order to address tax avoidance by multi-national companies who shift their profits around the world through complex corporate structures and internal transfers. CBCR got a major boost in 2013 with the launch of reporting requirements for EU Banks to publicly disclose Country by Country reports under the CRD IV regulations. In the extractives sector, campaigners have also secured regulations requiring disclosure of tax and licensing payments to government on a project-by-project basis.

UK extractives firms report to Companies House as structured data, with an API available to access reports. For EU banks, by contrast, reporting is predominantly in the form of tables at the back of PDF-format company reports.

If campaigners are successful, public reporting will be extended to all EU multinationals, holding out the prospect of up to 6000 more annual reports that can provide a breakdown of turnover, profit, tax and employees country-by-country. If the templates for disclosure are based on existing OECD models for private exchange between tax authorities, the data may also include information on the different legal entities that make up a corporate group, which is important for public understanding of the structure of the corporate world.

Earlier this year, a report from Alex Cobham, Jonathan Gray and Richard Murphy set out a number of use-cases for such data, making the case that “a global public database on the tax contributions and economic activities of multinational companies” would be an asset for a wide range of users, from journalists to civil society and investors.

Sprinting with a data-pipelines hammer

This week’s design sprint focussed particularly on ‘data extraction’, developing a set of data pipeline scripts and processes that involve downloading a report PDF, marking up the tables where Country by Country data is stored, describing what each column contains using YAML, and then committing this to GitHub where the process can then be replicably run using datapipeline commands. Then, with the data extracted, it can be loaded into an SQL database, and explored by writing queries or building simple charts. It’s a technically advanced approach, and great for ensuring replicability of data extraction.

But it’s also an approach that ultimately misses the point entirely: it ignores the social process of data production, creates technical barriers instead of empowering contributors and users, and offers nothing for campaigners who want to ensure that better data is produced ‘at source’ by companies.

Whilst the OKI blog post reports that “The Open Data for Tax Justice network team are now exploring opportunities for collaborations to collect and process all available CRD IV data via the pipeline and tools developed during our sprint”, I want to argue for a refocussed approach, based around a much closer look at the social dynamics of data creation and use.

An alternative approach: crafting collaborations

I’ve tried below to unpack a number of principles that might guide that alternative approach:

Principle 1: Letting people use their own tools

Any approach that involves downloading, installing, signing-up to, configuring or learning new software in order to create or use data is likely to exclude a large community of potential users. If the data you are dealing with is tabular: focus on spreadsheets.

More technical users can transform data into database formats when the questions they want to answer require the additional power that brings, but it is better if the starting workflow is configured to be accessible to the largest number of likely users.

Back in October I put together a rough prototype of a Google spreadsheets based transcription tool for Country by Country reports, that needed just copy-and-paste of data, and a few selections from validated drop-down lists to go from PDFs to normalised data – allowing a large user community to engage directly with the data, with almost zero learning curve.

The only tool this approach needs to introduce is something like Tabula or PDFTables to convert from PDF to Excel or CSV: but in this workflow the data comes right back to the user to work with after it has been converted, rather than being taken away from them into a longer processing pipeline. Plus, it raises awareness of PDF data extraction techniques that the user can adopt for other projects in future, and allows the user to work around failed conversions using a manual transcription approach if they need to.

(Sidenote: from discussions, I understand that one of the reasons the OKI team made their technical choice was that they envisaged the primary users as ‘non-experts’ who would engage in crowdsourcing transcriptions of PDF reports. I think this is highly optimistic, and that it relies on a flawed analysis both of the relatively small scale of the crowdsourcing task (a few thousand reports a year) and of the potential benefits of involving a more engaged group of contributors in creating a civil society database.)

Principle 2: Aim for instant empowerment

One of the striking things about Country by Country reporting data is how simple it ultimately is. The CRD IV disclosures contain just a handful of measures (turnover, pre-tax profits, tax paid, number of employees), a few dimensions (company name, country, year), and a range of annotations in footnotes or explanations. The analysis that can be done with this data is similarly simple – yet also very powerful. Being able to go from a PDF table of data to a quick view of the ratios between turnover and tax, or profit and employees, for a country can quickly highlight areas to investigate for profit-shifting and tax-avoidance behaviour.

Calculating these ratios is possible almost as soon as you have data in spreadsheet form. In fact, a well set up template could calculate them directly, or a user with a basic ability to write formulas could fill in the columns they need.
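To give a sense of how little computation is involved, here is a minimal sketch of the ratio calculation in Python. It is an illustration rather than part of any existing tool: the field names, the `CountryRow` structure and the figures are all made up for the example, and a spreadsheet formula would do the same job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountryRow:
    """One country line from a CRD IV style disclosure (field names are illustrative)."""
    company: str
    country: str
    year: int
    turnover: float
    pre_tax_profit: float
    tax_paid: float
    employees: float


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def ratios(row: CountryRow) -> dict:
    """Compute simple screening ratios for a single country row."""
    return {
        "tax_to_turnover": safe_ratio(row.tax_paid, row.turnover),
        "tax_to_profit": safe_ratio(row.tax_paid, row.pre_tax_profit),
        "profit_per_employee": safe_ratio(row.pre_tax_profit, row.employees),
    }


# Made-up figures for illustration only, not taken from any real report.
rows = [
    CountryRow("Example Bank", "GB", 2015, 1000.0, 200.0, 40.0, 500.0),
    CountryRow("Example Bank", "LU", 2015, 300.0, 250.0, 5.0, 10.0),
]

for row in rows:
    print(row.country, ratios(row))
```

In this invented example the second row shows a much higher profit per employee and a much lower tax-to-profit ratio than the first, which is the kind of contrast an investigator would want to follow up with their own contextual knowledge.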

Many of the use-cases for Country by Country reports are based not on aggregation across hundreds of firms, but on simply understanding the behaviour of one or two firms. Investigators and researchers often have firms they are particularly interested in, and where the combination of simple data, and their contextual knowledge, can go a long way.

Principle 3: Don’t drop context

On the topic of context: all those footnotes and explanations in company reports are an important part of the data. They might not be computable, or easy to query against, but in the data explorations that took place on Monday and Tuesday I was struck by how much the tax justice experts were relying not only on the numerical figures to find stories, but also on the explanations and other annotations from reports.

The data pipelines approach dropped these annotations (and indeed dropped anything that didn’t fit into its schema). An alternative approach would work from the principle that, as far as possible, nothing of the source should be thrown away – and structure should be layered on top of the messy reality of accounting judgements and decisions.
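One way to picture ‘layering structure on top’ is a record that always carries the original text and any footnotes alongside the parsed number. The sketch below is a hypothetical illustration: the `Cell` structure and `parse_cell` helper are invented for this example, and it assumes the common accounting convention of writing negative figures in brackets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """A transcribed table cell that keeps the source text next to the parsed value."""
    raw: str                       # exactly what appeared in the report
    value: Optional[float] = None  # parsed number, if one could be parsed
    notes: List[str] = field(default_factory=list)  # footnotes and explanations


def parse_cell(raw: str, notes: Optional[List[str]] = None) -> Cell:
    """Parse a figure such as '1,234' or '(56)' without discarding the original text."""
    text = raw.strip().replace(",", "")
    # Assumption: a figure wrapped in brackets is a negative number.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    try:
        value = float(text)
        if negative:
            value = -value
    except ValueError:
        value = None  # unparseable cells are kept, not dropped
    return Cell(raw=raw, value=value, notes=list(notes or []))


print(parse_cell("(56)", ["Loss arising from restructuring"]))
print(parse_cell("n/a", ["Not separately disclosed"]))
```

A cell that cannot be parsed still survives with its raw text and notes, so the explanation remains available to whoever reads the data later.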

Principle 4: Data making is meaning-making

A lot of the analysis of Country by Country reporting data is about looking for outliers. But data outliers and data errors can look pretty similar. Instead of trying to separate the processes of data preparation and analysis, these two need to be brought closer together.
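A small sketch of what bringing the two together could look like: rather than cleaning unusual values out of the data, a check marks them for a person to compare against the source report. The median-based rule and the 0.5 tolerance below are arbitrary assumptions chosen for illustration, and the rates are invented.

```python
from statistics import median


def flag_for_review(rates: dict, tolerance: float = 0.5) -> dict:
    """Mark each country's effective tax rate as 'ok' or 'review'.

    A rate is flagged when it differs from the group median by more than
    `tolerance` times the median. The threshold is an arbitrary assumption.
    """
    mid = median(rates.values())
    flags = {}
    for country, rate in rates.items():
        unusual = abs(rate - mid) > tolerance * abs(mid)
        flags[country] = "review" if unusual else "ok"
    return flags


# Made-up effective tax rates (tax paid / pre-tax profit) for one company.
rates = {"GB": 0.20, "FR": 0.24, "DE": 0.22, "LU": 0.02}
print(flag_for_review(rates))
```

A ‘review’ flag does not say whether the figure is a transcription error or a story. That judgement is left to someone who can read the report and its footnotes.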

Creating a shared database of tax disclosures will involve not only processes of data extraction, but also processes of validation and quality control. It will require incentives for contributors, and will require attention to building a community of users.

Some of the current structured data available from Country by Country reports has been transcribed by University students as part of their classes – where data was created as a starting point for a close feedback loop of data analysis. The idea of ‘frictionless data’ makes sense when it comes to getting a list of currency codes, but when it comes to understanding accounts, some ‘friction’ of social process can go a long way to getting reliable data, and building a community of practice who understand the data in more depth.

Principle 5: Standards support distributed collaboration

One of the difficulties in using the data mentioned above, prepared by a group of students, was that it had been transcribed and structured to solve the particular analytical problem of the class, and not against any shared standard for identifying countries, companies or the measures being transcribed.

The absence of agreement on key issues such as codelists for tax jurisdictions, company identifiers, codes and definitions of measures, and how to handle annotations and missing data means that the data that is generated by different researchers, or even different regulatory regimes, is not comparable, and can’t be easily combined.
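To make the idea of agreed codelists concrete, here is a rough sketch of a check that a spreadsheet exported to CSV could be run through. Everything in it is hypothetical: the column names, the codelists, the rule that a missing value needs a note, and the placeholder company identifier are assumptions for illustration, not an existing standard.

```python
import csv
import io

# Hypothetical codelists: a real standard would agree and publish these.
JURISDICTIONS = {"GB", "FR", "DE", "LU"}  # ISO 3166-1 alpha-2 style codes
MEASURES = {"turnover", "pre_tax_profit", "tax_paid", "employees"}
REQUIRED = ["company_id", "jurisdiction", "year", "measure", "value", "note"]


def validate(csv_text: str) -> list:
    """Return a list of human-readable problems found in a long-format CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["jurisdiction"] not in JURISDICTIONS:
            problems.append(f"line {line_no}: unknown jurisdiction {row['jurisdiction']!r}")
        if row["measure"] not in MEASURES:
            problems.append(f"line {line_no}: unknown measure {row['measure']!r}")
        if row["value"] == "" and row["note"] == "":
            problems.append(f"line {line_no}: missing value needs an explanatory note")
    return problems


# Placeholder identifier and made-up figures, for illustration only.
sample = """company_id,jurisdiction,year,measure,value,note
GB-COH-00000000,GB,2015,turnover,1000,
GB-COH-00000000,Luxembourg,2015,tax_paid,5,
GB-COH-00000000,FR,2015,employees,,
"""

for problem in validate(sample):
    print(problem)
```

The check reports problems in plain language rather than rejecting the file, so a contributor working in a spreadsheet can fix the rows and try again.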

The data pipelines approach is based on rendering data comparable through a centralised infrastructure. In my experience, such approaches are brittle, particularly in the context of voluntary collaboration, and they tend to create bottlenecks for data sharing and innovation. By contrast, an approach based on building light-weight standards can support a much more distributed collaboration approach – in which different groups can focus first on the data that is of most interest to them (for example, national journalists focussing on the tax record of the top-10 companies in their jurisdiction), easily contributing data to a common pool later when their incentives are aligned.
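As a sketch of how a common pool might work once groups share identifiers and measure codes, the following merges rows from independent contributors and sets aside any disagreements for people to resolve. The `pool` function, the contributor names and the figures are hypothetical, and the row layout assumes the kind of shared keys a light-weight standard would provide.

```python
from collections import defaultdict


def pool(contributions: dict) -> tuple:
    """Merge rows from several contributors and list any disagreements.

    `contributions` maps a contributor name to rows of
    (company_id, jurisdiction, year, measure, value).
    """
    seen = defaultdict(dict)  # key -> {contributor: value}
    for contributor, rows in contributions.items():
        for company_id, jurisdiction, year, measure, value in rows:
            seen[(company_id, jurisdiction, year, measure)][contributor] = value

    merged, conflicts = {}, {}
    for key, by_contributor in seen.items():
        if len(set(by_contributor.values())) == 1:
            merged[key] = next(iter(by_contributor.values()))
        else:
            conflicts[key] = by_contributor  # left for people to resolve
    return merged, conflicts


# Made-up contributions from two hypothetical groups.
contributions = {
    "journalists": [("GB-COH-00000000", "GB", 2015, "tax_paid", 40.0)],
    "students": [
        ("GB-COH-00000000", "GB", 2015, "tax_paid", 40.0),
        ("GB-COH-00000000", "LU", 2015, "tax_paid", 5.0),
    ],
}
merged, conflicts = pool(contributions)
print(len(merged), "agreed values;", len(conflicts), "conflicts")
```

Because the keys are shared, neither group needs to know about the other in advance: each can work on the companies it cares about and contribute later.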

Campaigners also need to be armed with use-case-backed proposals for how disclosures should be structured, in order to push for the best quality disclosure regimes.

What’s the difference?

Depending on your viewpoint, the approach I’ve started to set out above might look more technically ‘messy’ – but I would argue it is more in tune with the social realities of building a collaborative dataset of company tax disclosures.

Fundamentally (with the exception perhaps of standard maintenance, although that should be managed as a multi-stakeholder project long-term) – it is much more decentralised. This is in line with the approach in the Open Contracting Data Standard, where the Open Contracting Partnership have stuck well to their field-building aspirations, and where many of the most interesting data projects emerge organically at the edge of the network, only later feeding into cross-collaboration.

Even then, the sketch of an alternative technical approach above is only part of the story in building a better data foundation for action to address corporate tax avoidance. There will still be a lot of labour to create incentives, encourage co-operation, manage data quality, and build capacity to work with data. But better that we engage with that labour than spend our efforts chasing after frictionless dreams of easily created perfect datasets.