Developing data standards for Open Contracting

Contracts have a key role to play in effective transparency and accountability: from the contracts governments sign with extractives industries for mineral rights, to the contracts for delivery of aid, contracts for provision of key public services, and contracts for supplies. The Open Contracting initiative aims to improve the disclosure and monitoring of public contracts through the creation of global principles, standards for contract disclosure, and building civil society and government capacity. One strand of work that the Open Contracting team have been exploring to support this is the creation of a set of open data standards for capturing contract information. This blog post reports on some initial groundwork designed to inform this strand of work.

Although I was involved in some of the set-up of this short project, and presented the outcomes at last week’s workshop, the bulk of the work was undertaken by Aptivate’s Sarah Bird.

Update: see also the report of the process here.

Update 2 (12th Sept 2013): Owen Scott has built on the pilot with data from Nepal.

The process

Developing standards is a complex process. Each choice made has implications: for how acceptable the standard will be to different parties; for how easy certain uses of the data will be; and for how extensible the standard will be, or which other standards it will easily align with. However, standards cannot easily be built up choice-by-choice from a blank slate, adopting the ideal choice each time: they are generally created against a background of pre-existing datasets and standards. The Open Contracting data standards team had already gathered together a range of contract information datasets currently published by governments across the world, and so, with just a few weeks between starting this project and the data standards workshop on 28th March, we planned a 5-day development sprint, aiming to generate a very early draft iteration of a standard. Applying an agile methodology, where short iterations are each designed to yield a viable product by the end, but in the anticipation that further early iterations may revise and radically alter it, meant we had to set a reasonable scope for this first sprint.

The focus then was on the supply side, taking a set of existing contract datasets from different parties, and identifying their commonalities and differences. The contract datasets selected were from the UK, USA, Colombia, Philippines and the World Bank. From looking at the fields these existing datasets had in common, an outline structure was developed, working on a principle of taking good ideas from across the existing data, rather than playing to a lowest common denominator. Then, using the International Aid Transparency Initiative activity standard as a basis, Sarah drafted a basic data structure, which can act as a version 0.01 standard for discussion. To test this, the next step was to convert samples from some of the existing datasets into this new structure, and then to analyse how much of the available data was covered by the structure, and how comprehensive the available data was when placed against the draft structure. (The technical approach taken, which can be found in the sprint’s GitHub repository, was to convert the different incoming data to JSON, and post it into a MongoDB instance for analysis).
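To give a flavour of the conversion step described above, here is a minimal sketch of mapping rows from different sources into one common JSON structure. It is not the sprint's actual code (that lives in the GitHub repository): the column names, the dotted field paths and the `FIELD_MAPS` table are all invented for illustration, and the step of posting the result into MongoDB is left out.

```python
import csv
import io
import json

# Hypothetical mappings from each source's column names to draft-structure
# field paths. Neither the column names nor the paths come from the sprint.
FIELD_MAPS = {
    "uk": {"Contract Title": "meta.title", "Awarded Value": "award.value"},
    "worldbank": {"contract_description": "meta.title", "total_amt": "award.value"},
}


def set_path(record, path, value):
    """Store value in a nested dict at a dotted path such as 'award.value'."""
    *parents, leaf = path.split(".")
    node = record
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def convert_rows(source, csv_text):
    """Convert CSV rows from one source into records in the common structure.

    Columns without a mapping are kept under 'unmapped' so that nothing is
    silently dropped.
    """
    mapping = FIELD_MAPS[source]
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"source": source, "unmapped": {}}
        for column, value in row.items():
            if column in mapping:
                set_path(record, mapping[column], value)
            else:
                record["unmapped"][column] = value
        records.append(record)
    return records


sample = "Contract Title,Awarded Value,Local Flag\nRoad repairs,10000,Y\n"
converted = convert_rows("uk", sample)
print(json.dumps(converted, indent=2))
```

Keeping the unmapped columns alongside the mapped ones is a design choice made here so that coverage and left-over data can both be measured later from the same records.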

We discuss the limitations of this process in a later section.

Initial results

The initial pass of data suggested a structure based on:

  • Organisation data – descriptions of organisations, held separately from individual contract information, and linked by a globally unique ID (based on the IATI Organisational ID standard)
  • Contract metadata – general information about the contract in question, such as title, classification, default currency and primary location of supply, including an area for ‘line items’ listing the elements the contract covers.
  • Contract stages – a series of separate blocks of data for different stages of the contract, all contained within the overarching contract element.
    • Bid – key dates and classifications about the procurement stage of a contract process.
    • Award – details of the parties awarded the contract and the details of the award.
    • Performance – details of transactions (payments to suppliers) and work activities carried out during the performance of the contract.
    • Termination – details of the ending of the contract.
  • Documents – fields for linking to related documents.
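As a rough illustration of the outline above, a single contract record might be shaped something like the following. This is only a sketch: every field name, value and identifier here is a made-up placeholder rather than part of the draft schema, and the organisation ID is not a real IATI-style identifier.

```python
import json

# Organisations are held separately and referenced by a globally unique ID.
# The ID below is a made-up placeholder, not a real identifier.
organisations = {
    "XX-EXAMPLE-0001": {"name": "Example Supplier Ltd"},
}

# One contract, with metadata, the four stages, and linked documents.
contract = {
    "meta": {
        "title": "Road repairs",
        "classification": "Construction",
        "default_currency": "GBP",
        "primary_location": "Example Town",
        "line_items": [{"description": "Resurfacing", "quantity": 1}],
    },
    "stages": {
        "bid": {"opening_date": "2013-01-10", "closing_date": "2013-02-10"},
        "award": {"supplier": "XX-EXAMPLE-0001", "value": 10000},
        "performance": {"transactions": [], "activities": []},
        "termination": {},
    },
    "documents": [
        {"title": "Signed contract", "url": "https://example.org/contract.pdf"}
    ],
}


def unresolved_organisations(contract, organisations):
    """Return organisation IDs referenced by the contract but not defined."""
    supplier = contract["stages"]["award"].get("supplier")
    return [supplier] if supplier and supplier not in organisations else []


print(json.dumps(contract, indent=2))
print(unresolved_organisations(contract, organisations))
```

Holding organisations in their own lookup, and only referencing them by ID from the contract, mirrors the separation described in the first bullet.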

A draft annotated schema for capturing this data can be found in XML and JSON format here, and a high-level overview is also represented in the diagram below. In the diagrams that follow, each block represents one data point in the draft standard.

1-Phases

We then performed an initial analysis to explore how much of the data currently available from the sources explored would fit into the standard, and how comprehensively the standard could be filled from existing data. As the diagram below indicates, no single source covered all the available data fields, and some held no information on particular stages of the contracting process at all. This may be down to different objectives of the available data sources, or deeper differences in how organisations handle information on contracts and contracting workflows.

2-Coverage
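The kind of coverage check behind this comparison can be sketched in a few lines. The field paths in `STANDARD_FIELDS` and the sample records are hypothetical, and the sprint's own analysis was run against MongoDB rather than in-memory lists, so treat this as one possible way of doing the counting rather than the method actually used.

```python
# Hypothetical list of field paths in the draft structure.
STANDARD_FIELDS = ["meta.title", "bid.closing_date", "award.value", "termination.date"]


def get_path(record, path):
    """Return the value at a dotted path, or None if any step is missing."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def coverage(records, fields=STANDARD_FIELDS):
    """For each field, return the fraction of records with a non-empty value."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(get_path(r, field) not in (None, "") for r in records) / len(records)
        for field in fields
    }


# Two invented records from a single source.
records = [
    {"meta": {"title": "Road repairs"}, "award": {"value": 10000}},
    {"meta": {"title": "School meals"}, "award": {"value": ""}},
]
result = coverage(records)
print(result)
```

Running the same function over each source in turn gives one row per source, which is the raw material for a coverage diagram of this sort.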

Combining the visualisations above into a single view gives a sense of which data points in the draft standard are most widely used, illustrated in the schematic heat-map below.

3-Heatma

At this point the analysis is very rough-and-ready, hence the presentation of a rough impression, rather than detailed field-by-field analysis. The last thing to check was how much data was ‘left over’ and not captured in the standard. This was predominantly the case for the UK and USA datasets, where many highly specialised fields and flags were present in the dataset, indicating information that might be relevant to capture in local contract datasets, but which might be harder to find standard representations for across contracts.

4-Extra
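One simple way to put a number on the ‘left over’ data is to compare a source's columns with the columns that have a home in the draft structure. The column names and mapping below are invented for illustration, and counting distinct columns is an assumption about how ‘left over’ might be measured, not a description of the sprint's analysis.

```python
def leftover_columns(columns, mapping):
    """Return the source columns that have no home in the draft structure."""
    return sorted(set(columns) - set(mapping))


def leftover_share(columns, mapping):
    """Fraction of distinct source columns not captured by the mapping."""
    distinct = set(columns)
    if not distinct:
        return 0.0
    return len(distinct - set(mapping)) / len(distinct)


# Hypothetical source columns and mapping into the draft structure.
columns = ["Contract Title", "Awarded Value", "Local Flag", "Internal Code"]
mapping = {"Contract Title": "meta.title", "Awarded Value": "award.value"}

extra = leftover_columns(columns, mapping)
share = leftover_share(columns, mapping)
print(extra, share)
```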

The next step was to check whether data that could go into the same fields could be easily harmonised. The existence of organisation details, dates, and classifications of contracts across different datasets does not necessarily mean these are interoperable. Fields like dates and financial amounts appeared to be relatively easy to harmonise, but some elements present greater challenges, such as organisational identifiers, contact people, and the various codelists in use. However, some codelists may be possible to harmonise. For example, when the ‘Category’ classifications from across datasets were translated, grouped and aggregated, up to 92% of the original data in a sample was retained.

5-Sum and Group
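The grouping step can be pictured as a lookup from each source's own category labels to a shared set of groups, followed by a count of how much of the data survives the mapping. The lookup table and sample labels below are invented, and the retained share they produce has nothing to do with the 92% figure above; this is just a sketch of the general approach.

```python
from collections import Counter

# Hypothetical lookup from source-specific category labels to shared groups.
CATEGORY_GROUPS = {
    "construction works": "Construction",
    "obras civiles": "Construction",
    "it services": "ICT",
    "consultancy": "Professional services",
}


def harmonise(categories, groups=CATEGORY_GROUPS):
    """Group raw category labels and report the share that could be mapped."""
    counts = Counter()
    unmapped = 0
    for raw in categories:
        group = groups.get(raw.strip().lower())
        if group is None:
            unmapped += 1
        else:
            counts[group] += 1
    total = len(categories)
    retained = (total - unmapped) / total if total else 0.0
    return counts, retained


counts, retained = harmonise(
    ["Construction works", "Obras civiles", "IT Services", "Misc. flag 7"]
)
print(counts, retained)
```

In practice the hard part is building the lookup table itself, which involves translation and judgement calls about which local categories genuinely belong together.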

Implications, gaps, next steps

This first iteration provides a basis for future discussions. There are, however, some important gaps. Most significant of all is that this initial development has been supply-side driven, based around the data that organisations are already publishing, rather than developed on the basis of the data that civil society organisations, or scrutiny bodies, are demanding in order to make sense of complex contract situations. It also omits certain kinds of contracts, such as complex extractives contracts (on which, see the fantastic work Revenue Watch have been doing with getting structured data from PDF contracts with Document Cloud), and Public Private Partnership (PPP) contracts. And it has not delved deeply into the data structures needed for properly capturing information that can aid in monitoring contract performance. These gaps will all need to be addressed in future work.

At the moment, this stands as a discrete project, and no set next steps are agreed as far as I’m aware. However, some of the ideas explored in the meeting on the 28th included:

  • A next iteration – focussed on the demand side – working with potential users of contracts data to work out how data needs to be shaped, and what needs to be in a standard to meet different data re-use needs. This could build towards version 0.02.
  • Testing against a wider range of datasets – either following, or in parallel with, a demand-driven iteration, to discover how the work done so far evolves when confronted with a larger set of existing contract datasets to synthesise.
  • Connecting with other standards. This first sprint took the IATI Standard as a reference point. There may be other standards to refer to in development. Discussions on the 28th with those involved in other standards highlighted an interest in more collaborative working to identify shared building blocks or common elements that might be re-used across standards, and to explore the practical and governance implications of this.
  • Working on complementary building blocks of a data standard – such as common approaches to identifying organisations and parties to a contract; or developing tools and platforms that will aggregate data and make data linkable. The experience of IATI, Open Spending and many other projects appears to be that validators, aggregation platforms and data-wrangling tools are important complements to standards for supporting effective re-use of open data.

Keep an eye on the Open Contracting website for more updates.