Open data in extractives: meeting the challenges


There’s lots of interest building right now around how open data might be a powerful tool for transparency and accountability in the extractive industries sector. Decisions over where extraction should take place have a massive impact on communities and the environment, yet decision making is often opaque, with wealthy private interests driving exploitation of resources in ways that run counter to the public interest. Whilst revenues from oil, gas and mineral resources have the potential to be a powerful tool for development, with a proportion channelled into public funds, massive quantities of revenue frequently ‘go missing’, lost to corruption, and fuelling elements of a resource curse.

For the last ten years the Extractive Industries Transparency Initiative has been working to get companies to commit to ‘publish what they pay’ to government, and for government to disclose receipts of finance, working to identify missing money through a document-based audit process. Campaigning coalitions, watchdogs and global initiatives have focussed on increasing the transparency of the sector. Now, with a recognition that we need to link together information on different resource flows for development at all levels, potentially through the use of structured open data, and with a “data tsunami” of new information on extractives financials anticipated from the Dodd-Frank Act in the US and similar regulation in Europe, groups working on extractives transparency have been looking at what open data might mean for future work in this area.

Right now, DFID are taking that exploration forward through a series of hack days with Rewired State under the ‘follow the data’ banner, with the first in London last weekend, and one coming up next week in Lagos, Nigeria. The idea of the events is to develop rapid prototypes of tools that might support extractives transparency, putting developers and datasets together over 24 hours to see what emerges. I was one of the judging panel at this weekend’s event, where the three developer teams that formed looked respectively at: making datasets on energy production and prices more accessible for re-use through an API; visualising the relationship between extractives revenues and various development indicators; and designing an interface for ‘nuggets’ of insight discovered through hack-days to be published and shared with useful (but minimal) meta-data.

In their way, these three projects highlight a range of the challenges ahead for the extractives sector in building capacity to track resource flows through open data:

  • Making data accessible – The APIfy project sought to take a number of available datasets and aggregate them together in a database, before exposing a number of API endpoints that made machine-readable, standardised data available on countries, companies and commodities (see the sketch after this list). By translating the data access challenge from one of rooting around in disparate datasets, to one of calling a standard API for key kinds of ‘objects’, the project demonstrated the need developers often have for clear platforms to build upon. However, as I’ve discovered in developing tools for the International Aid Transparency Initiative, building platforms to aggregate together data often turns out to be a non-trivial project: technically (it doesn’t take long to get to millions of data items when you are dealing with financial transactions), economically (as databases serving millions of records to even a small number of users need to be maintained and funded), socially (developers want to be able to trust the APIs they build against to be stable, and outreach and documentation are needed to support developers to engage with an API), and in terms of information architecture (as design choices over a dataset or API can have a powerful effect on downstream re-users).
  • Connecting datasets – none of the applications from the London hack-day were actually able to follow resource flows through the available data. Although visions of a coherent datasphere – in which the challenge is just making the connection between a transaction in one dataset and a transaction in another to see where money is flowing – are appealing, traceability in practice turns out to be a lot harder. To use the IATI example again, across the 100,000+ aid activities published so far, less than 1% include traceability efforts to show how one transaction relates to another, and even here the relationships exist in the data because of conscious efforts by publishers to link transaction and activity identifiers. In following the money there will be many cases where people have an incentive not to make these linkages explicit. One of the issues raised by developers over the hack-day was the scattered nature of data, and the gaps across it. Yet when it comes to financial transaction tracking we’re likely to be dealing with partial data, full of gaps, and it won’t be easy to tell at first glance whether a mismatch between incoming and outgoing finances is a case of missing data or corruption. Right now, a lot of developers attack open data problems with tools optimised for complete and accurate data, yet we need to be developing tools, methods and visualisation approaches that deal with partial and uncertain data. This is developed in the next point.
  • Correlation, causation and investigation – The Compare the Map project developed on the hack day uses “scraped data from GapMinder and EITI to create graphical tools” that allow a user to eye-ball possible correlations between extractives data and development statistics. But of course, correlation is not causation – and the kinds of analysis that dig deeper into possible relationships are difficult to work through on a hack day. Indeed, many of the relationships that mash-ups of this form can show have already been written about in papers that control for many more variables, dealing carefully with statistically challenging issues of missing data and imperfectly matched datasets. Rather than simple comparison visualisations that show two datasets side by side, it may be more interesting to look for all the possible statistically significant correlations in datasets with common reference points, and then to look at how human users could be supported in exploring, and giving feedback on, which of those might be meaningful, and which may or may not already have been researched. Where research does show a correlation to exist, then using open data to present a visual narrative to users about this can have a place, though here the theory of change is very different – not about identifying connections, but about communicating them in interactive and engaging ways to those who may be able to act upon them.
  • Sharing and collaborating – The third project at the London hack-day was ‘Fact Cache’ – a simple concept for sharing nuggets of information discovered in hack-day explorations. Often as developers work through datasets they come across discoveries of interest, yet these are often left aside in the rush to create a prototype app or platform. Fact Cache focussed on making these shareable. However, when it was presented, discussions also explored how it could make these nuggets of information into social objects, open to discussion and sharing. This idea of making open data findings more usable as social objects was also an aspect of the UN Global Pulse HunchWorks project. That project is currently on hold (it would be interesting to know why…), but the idea of supporting collaboration around open data through online tools, rather than seeing apps that present data or initial analysis as the end point, is certainly one to explore more in building capacity for open data to be used in holding actors to account.
  • Developing theories of change – as the judges met to talk about the projects, one of the key themes we looked at was whether each project had a clear theory of change. In some sense, taken together they represent the complex chain of steps involved in an open data theory of change: from making data more accessible to developers, to creating tools and platforms that let end users explore data, and then allowing findings from data to be communicated and to shape discourses and action. Few datasets or tools are likely to be change-making on their own – but rather they can play a key role in shifting the balance of power in existing networks of organisations, activists, companies and governments. Understanding the different theories of change for open data is one of the key themes in the ongoing Open Data in Developing Countries research, where we take existing governance arrangements as a starting point in understanding how open data will bring about impacts.
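To make the first of these challenges concrete, here is a minimal sketch of the kind of standardised, machine-readable endpoint a project like APIfy was aiming at. This is my own illustration, not the hack-day team’s code: the endpoint paths and fields are assumptions.

```python
# Minimal sketch of an API exposing aggregated extractives data.
# Endpoint paths and fields are assumptions for illustration only.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# In practice this would be a maintained database built by aggregating
# source datasets; a hard-coded dictionary stands in for it here.
COUNTRIES = {
    "NG": {"name": "Nigeria", "commodities": ["oil", "gas"], "eiti_member": True},
    "GB": {"name": "United Kingdom", "commodities": ["oil", "gas"], "eiti_member": False},
}

@app.route("/api/countries")
def list_countries():
    """Return machine-readable records for all countries."""
    return jsonify({"countries": list(COUNTRIES.values())})

@app.route("/api/countries/<code>")
def get_country(code):
    """Return a single country record by ISO code, or 404 if unknown."""
    record = COUNTRIES.get(code.upper())
    if record is None:
        abort(404)
    return jsonify(record)

if __name__ == "__main__":
    app.run(debug=True)
```

Keeping such an API stable, documented and funded over time is, as the first bullet suggests, usually the harder part of the job.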

In a complex world, access to data, and the capacity to use it effectively, are likely to be essential parts of building more accountable governance across a wide range of areas, including in the extractives industry. Although there are many challenges ahead if we are to secure the maximum benefits from open data for transparent and accountable governance, it’s exciting and encouraging to see so many passionate people putting their minds early to tackling them, and building a community ready to innovate and bring about change.

Note: The usage of ‘follow the data’ in this DFID project is distinct from the usage in the work I’m currently doing to explore ‘follow the data’ research methods. In the former, the focus is really on following financial and resource flows through connecting up datasets; in the latter the focus is on tracing the way in which data artefacts have been generated, deployed, transferred and used in order to understand patterns of open data use and impact.

 

Intelligent Impact: Evaluating an open data capacity building workshop with voluntary sector organisations

[Summary: sharing the evaluation report (9 pages, PDF) of an open data skills workshop for voluntary sector organisations]


Late last year, through the CSO network on the Open Government Partnership, I got talking with Deirdre McGrath of the Your Voice, Your City project about ways of building voluntary sector capacity to engage with open data. We talked about the possibility of a hack-day, but realised the focus at this stage needed to be on building skills, rather than building tools. It also needed to be on discovering what was possible with open data in the voluntary sector, rather than teaching people a limited set of skills. And as the Your Voice, Your City project was hosted within the London Voluntary Services Council (LVSC), an infrastructure organisation with a policy and research team, we had the possibility of thinking about the different roles needed to make the most of open data, and how a capacity building pilot could work both with frontline Voluntary and Community Sector (VCS) organisations, and an infrastructure organisation. A chance meeting with Nick Booth of podnosh gave form to a theme in our conversations about the need to focus on both ‘stats’ and ‘stories’, ensuring that capacity building worked with both quantitative and qualitative data and information. The result: plans for a short project, centred on a one-day workshop on ‘Intelligent Impact’, exploring the use of social media and open data for VCS organisations.

The day involved staff from VCS organisations coming along with questions or issues they wanted to explore, and then splitting into groups with a team of open data and social media mentors (Nick Booth, Caroline Beavon, Steven Flower, Paul Bradshaw and Stuart Harrison) to look at how existing online resources, or self-created data and media, could help respond to those questions and issues. Alex Farrow captured the story of the day for us using Storify and I’ve just completed a short evaluation report telling the story in more depth, capturing key learning from the event, and setting out possible next steps (PDF).

Following on from the event, the LVSC team have been exploring how a combination of free online tools for curating open data, collating questions, and sharing findings can be assembled into a low-cost and effective ‘intelligence hub‘, where data, analysis and presentation layers are all made accessible to VCS organisations in London.

Developing data standards for Open Contracting

Contracts have a key role to play in effective transparency and accountability: from the contracts governments sign with extractive industries for mineral rights, to the contracts for delivery of aid, contracts for provision of key public services, and contracts for supplies. The Open Contracting initiative aims to improve the disclosure and monitoring of public contracts through the creation of global principles, standards for contract disclosure, and building civil society and government capacity. One strand of work that the Open Contracting team have been exploring to support this is the creation of a set of open data standards for capturing contract information. This blog post reports on some initial groundwork designed to inform that strand of work.

Although I was involved in some of the set-up of this short project, and presented the outcomes at last week’s workshop, the bulk of the work was undertaken by Aptivate’s Sarah Bird.

Update: see also the report of the process here.

Update 2 (12th Sept 2013): Owen Scott has built on the pilot with data from Nepal.

The process

Developing standards is a complex process. Each choice made has implications: for how acceptable the standard will be to different parties; for how easy certain uses of the data will be; and for how extensible the standard will be, or which other standards it will easily align with. However, standards cannot easily be built up choice-by-choice from a blank slate, taking the ideal option at each step: they are generally created against a background of pre-existing datasets and standards. The Open Contracting data standards team had already gathered together a range of contract information datasets currently published by governments across the world, and so, with just a few weeks between starting this project and the data standards workshop on 28th March, we planned a 5-day development sprint, aiming to generate a very rough first iteration of a standard. Applying an agile methodology, where short iterations are each designed to yield a viable product by the end, but with the anticipation that further early iterations may revise and radically alter it, meant we had to set a reasonable scope for this first sprint.

The focus then was on the supply side, taking a set of existing contract datasets from different parties, and identifying their commonalities and differences. The contract datasets selected were from the UK, USA, Colombia, Philippines and the World Bank. From looking at the fields these existing datasets had in common, an outline structure was developed, working on a principle of taking good ideas from across the existing data, rather than playing to a lowest common denominator. Then, using the International Aid Transparency Initiative activity standard as a basis, Sarah drafted a basic data structure, which can act as a version 0.01 standard for discussion. To test this, the next step was to convert samples from some of the existing datasets into this new structure, and then to analyse how much of the available data was covered by the structure, and how comprehensive the available data was when placed against the draft structure. (The technical approach taken, which can be found in the sprint’s GitHub repository, was to convert the different incoming data to JSON, and post it into a MongoDB instance for analysis).
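As a rough illustration of that last step (my own sketch, not the code in the sprint’s repository; file names, field names and the collection layout are invented), converting incoming records to JSON-style documents and loading them into MongoDB for analysis might look something like this:

```python
# Illustrative sketch only: load heterogeneous contract data into MongoDB
# for analysis. File names, field names and the collection layout are
# assumptions, not the sprint's actual code or schema.
import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["open_contracting"]["contracts"]

def load_csv(path, source_name):
    """Convert each CSV row into a JSON-style document, tagged with its source."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            collection.insert_one({"source": source_name, "raw": row})

load_csv("uk_contracts.csv", "UK")
load_csv("world_bank_contracts.csv", "World Bank")

# A simple coverage check: how many records from each source include an award date?
for source in collection.distinct("source"):
    total = collection.count_documents({"source": source})
    with_award = collection.count_documents(
        {"source": source, "raw.award_date": {"$exists": True, "$ne": ""}})
    print(source, with_award, "of", total, "records include an award date")
```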

We discuss the limitations of this process in a later section.

Initial results

The initial pass of data suggested a structure based on:

  • Organisation data – descriptions of organisations, held separately from individual contract information, and linked by a globally unique ID (based on the IATI Organisational ID standard)
  • Contract meta data – general information about the contract in question, such as title, classification, default currency and primary location of supply. Including an area for ‘line items’ of elements the contract covers.
  • Contract stages – a series of separate blocks of data for different stages of the contract, all contained within the overarching contract element.
    • Bid – key dates and classifications about the procurement stage of a contract process.
    • Award – details of the parties awarded the contract and the details of the award.
    • Performance – details of transactions (payments to suppliers) and work activities carried out during the performance of the contract.
    • Termination – details of the ending of the contract.
  • Documents – fields for linking to related documents.

A draft annotated schema for capturing this data can be found in XML and JSON format here, and a high-level overview is also represented in the diagram below. In the diagrams that follow, each block represents one data point in the draft standard.
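To give a sense of the shape this implies, here is a small mock record in roughly that structure, written out as a Python dictionary. It is purely illustrative: the field names and nesting are my own assumptions, not the draft schema linked above.

```python
# Purely illustrative mock record; field names and nesting are assumptions,
# not the actual draft schema produced by the sprint.
example_contract = {
    "contract": {
        "id": "example-0001",
        "title": "Supply of school textbooks",
        "default_currency": "GBP",
        "location": "GB",
        "line_items": [
            {"description": "Primary school textbooks", "quantity": 5000},
        ],
        "stages": {
            "bid": {"notice_date": "2012-09-01", "classification": "open tender"},
            "award": {"date": "2012-11-15", "amount": 250000,
                      "supplier_org_id": "GB-COH-01234567"},
            "performance": {"transactions": [
                {"date": "2013-01-10", "amount": 100000}]},
            "termination": {"date": "2013-06-30", "status": "completed"},
        },
        "documents": [
            {"title": "Award notice", "url": "http://example.org/award.pdf"},
        ],
    },
    # Organisation records held separately, linked by a globally unique ID.
    "organisations": [
        {"id": "GB-COH-01234567", "name": "Example Supplier Ltd"},
    ],
}
```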

[Figure: 1-Phases – high-level overview of the draft contract data structure]

We then performed an initial analysis to explore how much of the data currently available from the sources explored would fit into the standard, and how comprehensively the standard could be filled from existing data. As the diagram below indicates, no single source covered all the available data fields, and some held no information on particular stages of the contracting process at all. This may be down to different objectives of the available data sources, or deeper differences in how organisations handle information on contracts and contracting workflows.

[Figure: 2-Coverage – how far each source dataset fills the fields of the draft standard]

Combining the visualisations above into a single view gives a sense of which data points in the draft standard have the greatest use, as illustrated in the schematic heat-map below.

[Figure: 3-Heatmap – schematic heat-map of data point usage across sources]

At this point the analysis is very rough-and-ready, hence the presentation of a rough impression rather than a detailed field-by-field analysis. The last thing to check was how much data was ‘left over’ and not captured in the standard. This was predominantly the case for the UK and USA datasets, where many highly specialised fields and flags were present in the datasets, indicating information that might be relevant to capture in local contract datasets, but which might be harder to find standard representations for across contracts.

[Figure: 4-Extra – data left over from each source that is not captured by the draft standard]

The next step was to check whether data that could go into the same fields could be easily harmonised: the existence of organisation details, dates and classifications of contracts across different datasets does not necessarily mean these are interoperable. Fields like dates and financial amounts appeared to be relatively easy to harmonise, but some elements present greater challenges, such as organisational identifiers, contact people, and the various codelists in use. However, some code-lists may be possible to harmonise: for example, when the ‘Category’ classifications from across the datasets were translated, grouped and aggregated, up to 92% of the original data in a sample was retained.
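A rough sketch of that kind of harmonisation step follows. It is illustrative only: the mapping table and field names are my assumptions, not the codelists actually used in the sprint.

```python
# Illustrative sketch: map source-specific category labels onto a common codelist
# and measure how much of the original data the mapping retains.
# The mapping below is invented for illustration, not the sprint's actual codelist.
CATEGORY_MAP = {
    "goods": "goods",
    "supplies": "goods",
    "services": "services",
    "consultancy": "services",
    "works": "works",
    "construction": "works",
}

def harmonise(records):
    """Return (harmonised records, share of the original records retained)."""
    kept = []
    for record in records:
        label = record.get("category", "").strip().lower()
        if label in CATEGORY_MAP:
            kept.append({**record, "category": CATEGORY_MAP[label]})
    retained = len(kept) / len(records) if records else 0.0
    return kept, retained

sample = [
    {"id": 1, "category": "Supplies"},
    {"id": 2, "category": "Consultancy"},
    {"id": 3, "category": "Unclassified"},
]
harmonised, share = harmonise(sample)
print(f"{share:.0%} of the sample retained after harmonisation")
```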

[Figure: 5-Sum and Group – harmonising ‘Category’ classifications across datasets]

Implications, gaps, next steps

This first iteration provides a basis for future discussions. There are, however, some important gaps. Most significant of all is that this initial development has been supply-side driven, based around the data that organisations are already publishing, rather than developed on the basis of the data that civil society organisations, or scrutiny bodies, are demanding in order to make sense of complex contract situations. It also omits certain kinds of contracts, such as complex extractives contracts (on which, see the fantastic work Revenue Watch have been doing with getting structured data from PDF contracts with Document Cloud), and Public Private Partnership (PPP) contracts. And it has not delved deeply into the data structures needed for properly capturing information that can aid in monitoring contract performance. These gaps will all need to be addressed in future work.

At the moment, this stands as a discrete project, and no next steps have been agreed as far as I’m aware. However, some of the ideas explored in the meeting on the 28th included:

  • A next iteration – focussed on the demand side – working with potential users of contracts data to work out how data needs to be shaped, and what needs to be in a standard to meet different data re-use needs. This could build towards version 0.02.
  • Testing against a wider range of datasets – either following, or in parallel with, a demand-driven iteration, to discover how the work done so far evolves when confronted with a larger set of existing contract datasets to synthesise.
  • Connecting with other standards. This first sprint took the IATI Standard as a reference point. There may be other standards to refer to in development. Discussions on the 28th with those involved in other standards highlighted an interest in more collaborative working to identify shared building blocks or common elements that might be re-used across standards, and to explore the practical and governance implications of this.
  • Working on complementary building blocks of a data standard – such as common approaches to identifying organisations and parties to a contract; or developing tools and platforms that will aggregate data and make data linkable. The experience of IATI, Open Spending and many other projects appears to be that validators, aggregation platforms and data-wrangling tools are important complements to standards for supporting effective re-use of open data.

Keep an eye on the Open Contracting website for more updates.

Open Data for Poverty Alleviation: Striking Poverty Discussion


[Summary: join an open discussion on the potential impacts of open data on poverty reduction]

Over the next two weeks, along with Tariq Kochar, Nitya V. Raman and Nathan Eagle, I’m taking part in an online panel hosted by the World Bank’s Striking Poverty platform to discuss the potential impacts of open data on poverty alleviation.

So far we’ve been asked to provide some starting statements on how we see open data and poverty might relate, and now there’s an open discussion where visitors to the site are invited to share their questions and reflections on the topic.

Here’s what I have down as my opening remarks:

Development is complex. No individual or group can process all the information needed to make sense of aid flows, trade patterns, government budgets, community resources and environmental factors (amongst other things) that affect development in a locality. That’s where data comes in: open datasets can be connected, combined and analysed to support debate, decision making and governance.

Projects like the International Aid Transparency Initiative (IATI) have sought to create the technical standards and political commitments for effective data sharing. IATI is putting together one corner of the poverty reduction jigsaw, with detailed and timely forward-looking information on aid. IATI open data can be used by governments to forecast spending, and by citizens to hold donors to account. This is the promise of open data: publish once, use many times and for many purposes.

But data does not use itself. Nor does it transcend political and practical realities. As the papers in a recent Journal of Community Informatics special issue show, open data brings both promise and perils. Mobilising open data for social change requires focus and effort.

We’re only at the start of understanding open data impacts. In the upcoming Exploring the Emerging Impacts of Open Data in Developing Countries (ODDC) project, the Web Foundation and partners will be looking at how open data affects governance in different countries and contexts across the world. Rather than look at open data in the abstract, the project will explore cases such as open data for budget monitoring in Brazil, or open data for poverty reduction in Uganda. This way it will build up a picture of the strategies that can be used to make a difference with data; it will analyse the role that technologies and intermediaries play in mobilising data; and it will also explore unintended consequences of open data.

I hope in this discussion we can similarly focus on particular places where open data has potential, and on the considerations needed to ensure the supply and use of open data has the best chance possible of improving lives worldwide.

What do you think? You can join the discussion for the next two weeks over on the Striking Poverty site…

Linked-Development: notes from Research to Impact at the iHub

[Summary: notes from a hackathon in Nairobi built around linked open data]

I’ve just got back from an energising week exploring open data and impact in Kenya, working with R4D and IDS at Nairobi’s iHub to run a three-day hackathon titled ‘Research to Impact’. You can read Pete Cranston’s blog posts on the event here (update: and iHub’s here). In this post, after a quick pre-amble, I reflect particularly on working with linked data as part of the event.

The idea behind the event was fairly simple: lots of researchers are producing reports and publications related to international development, and these are logged in catalogues like R4D and ELDIS, but often it stops there, and research doesn’t make it into the hands of those who can use it to bring about economic and social change. By opening up the data held on these resources, and then working with subject experts and developers, we were interested to see whether new ideas would emerge for taking research to where it is needed.

The Research to Impact hack focused in on ‘agriculture and nutrition’ research so that we could spend the first day working with a set of subject experts to identify the challenges research could help meet, and to map out the different actors who might be served by new digital tools. We were hosted for the whole event at the inspiring iHub and mLab venue by iHub Research. iHub provides a space for the growing Kenyan tech community, acting as a meeting space, incubator and workspace for developers and designers. With over 10,000 members of its network, iHub also helped us to recruit around 20 developers who worked over the second two days of the hackathon to build prototype applications responding to the challenges identified on day one, and to the data available from R4D and IDS.

A big focus of the hackathon development turned out to be on mobile applications, as in Kenya mobile phones are the primary digital tool for accessing information. On day four, our developers met again with the subject experts, and pitched their creations to a judging panel, who awarded first, second and third prizes. Many of the apps created had zeroed in on a number of key issues: working through intermediaries (in this case, the agricultural extension worker), rather than trying to use tech to entirely disintermediate information flows; embedding research information into useful tools, rather than providing it through standalone portals (for example, a number of teams built apps which allowed extension workers to keep track of the farmers they were interacting with, and that could then use this information to suggest relevant research); and, most challengingly, the need for research abstracts and descriptions to be translated into easy-to-understand language that can fit into SMS-size packages. Over the coming weeks IDS and R4D are going to be exploring ways to work with some of the hackathon teams to take their ideas further.

Linked-development: exploring the potential of linked data

The event also provided us with an opportunity to take forward explorations of how linked data might be a useful technology in supporting research knowledge sharing. I recently wrote a paper with Duncan Edwards of IDS exploring the potential of linked data for development communication, and I’ve been exploring linked data in development for a while. However, this time we were running a hackathon directly from a linked data source, which was a new experience.

Ahead of the event I set up linked-development.org as a way to integrate R4D data (already available in RDF), and ELDIS data (which I wrote a quick scraper for), both modelled using the FAO’s AGRIS model. In order to avoid having to teach SPARQL for access to the data, I also (after quite a steep learning curve) put together a very basic Puelia Linked Data API implementation over the top of the data. To allow for a common set of subject terms between the R4D and ELDIS data, I made use of the Maui NLP indexer to tag ELDIS agriculture and nutrition documents against the FAO’s Agrovoc (R4D already had editor assigned terms against this vocabulary), giving us a means of accessing the documents from the two datasets alongside each other.
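For a flavour of what this made possible, a query along the following lines could pull research documents on a given Agrovoc topic from both sources at once. This is a rough sketch only: the endpoint URL and the exact property paths are my assumptions rather than the precise linked-development.org configuration.

```python
# Sketch of querying the linked data store from Python, treating the SPARQL
# results as ordinary JSON. The endpoint URL and property paths are assumptions
# for illustration, not the exact linked-development.org configuration.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://linked-development.org/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT ?doc ?title WHERE {
        ?doc dcterms:title ?title ;
             dcterms:subject ?subject .
        ?subject skos:prefLabel ?label .
        FILTER(LCASE(STR(?label)) = "maize")
    }
    LIMIT 20
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding is plain JSON, easy to hand on to a PHP- or Python-based app.
for binding in results["results"]["bindings"]:
    print(binding["title"]["value"], "-", binding["doc"]["value"])
```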

The potential value of this approach became clear on the first day of the event, when one of the subject experts showed us their own repository of Kenyan-focussed agricultural research publications and resources, which was already modelled and theoretically accessible as RDF using the AGRIS model. Although our attempts to integrate this into our available dataset failed due to the Drupal site serving the data hitting memory limits (linked data still remains something that tends to need a lot of server power thrown at it, and that can have significant impacts where the relative cost of hosting and tech capacity is high), the potential to bring more local content into linked-development.org alongside data from R4D and ELDIS was noted by many of the developers taking part as something which would be likely to make their applications a lot more successful and useful: ensuring that the available information is built around users’ needs, not around organisational or project boundaries.

At the start of the developer days, we offered a range of ways for developers to access the research meta-data on offer. We highlighted the linked data API, the ELDIS API (although it only provided access to one of the datasets, I found it would be possible for us to create a compatible API speaking to the linked data in future), and SPARQL as means to work with the data. Feedback forms from the event suggest that formats like JSON were new to many of our participants, and linked data was a new concept to all. However, in the end, most teams chose to use some of the prepared SPARQL queries to access the data, returning results as JSON into PHP or Python. In practice, over the two days this did not end up realising the full value of linked data, as teams generally appeared to use code samples to pull SPARQL ‘SELECT’ result sets into relational databases, and then to build their applications from there (a common issue I’ve noted at hack days, where the first step of developers is to take data into the platform they use most). However, a number of teams were starting to think about how they could use more advanced queries or direct access to the linked data through code libraries in future, and most strikingly, were talking about how they might be able to write data back to the linked-development.org data store.

This struck me as particularly interesting. A lot of the problems teams faced in creating their application was that the research meta-data available was not customised to agricultural extension workers or farmers. Abstracts would need to be re-written and translated. Good quality information needed to be tagged. New classifications of the resources were needed, such as tagging research that is useful in the planting season. Social features on mobile apps could help discover who likes what and could be used to rate research. However, without a means to write back to the shared data store, all this added value will only ever exist in the local and fragmented ecosystems around particular applications. Getting feedback to researchers about whether their research was useful was also high on the priority list of our developers: yet without somewhere to put this feedback, and a commitment from upstream intermediaries like R4D and ELDIS to play a role feeding back to authors, this would be very difficult to do effectively.

This links to one of the points that came out in our early IKM Emergent work on linked data, noting that the relatively high costs and complexity of the technology, and the way in which servers and services are constructed, may lead to an information environment dominated by those with the capacity to publish; but that it has the potential, with the right platforms, configurations and outreach, to bring about a more pluralistic space, where the annotations from local users of information can be linked with, and equally accessible as, the research meta-data coming from government funded projects. I wish we had thought about this more in advance of the hackathon, and provided each team with a way to write data back to the linked-development.org triple store (e.g. giving them named graphs to write to; and providing some simple code samples or APIs), as I suspect this would have opened up a whole new range of spaces for innovation.
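For what that might have looked like, here is an entirely hypothetical sketch of a team writing an annotation back into a named graph of its own via SPARQL Update. The endpoint, graph URI, document URI and vocabulary are all invented for illustration; no such write API existed at the hackathon.

```python
# Hypothetical sketch of writing an annotation back to a per-team named graph
# using SPARQL Update over HTTP. The endpoint, graph URI, document URI and
# predicates are assumptions for illustration only.
import requests

UPDATE_ENDPOINT = "http://linked-development.org/update"  # assumed
TEAM_GRAPH = "http://linked-development.org/graphs/team-extension-app"  # assumed

update = f"""
PREFIX ex: <http://example.org/annotations#>
INSERT DATA {{
  GRAPH <{TEAM_GRAPH}> {{
    <http://linked-development.org/r4d/output/12345>
        ex:plainLanguageSummary "Short SMS-friendly summary of the research."@en ;
        ex:usefulInPlantingSeason true .
  }}
}}
"""

# SPARQL 1.1 protocol: updates can be sent as a form-encoded 'update' parameter.
response = requests.post(UPDATE_ENDPOINT, data={"update": update})
response.raise_for_status()
print("Annotation stored in", TEAM_GRAPH)
```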

Overall though, the linked-development.org prototype appears to have done some useful work, not least providing a layer to connect two DFID funded projects working on mobilising research. I hope it is something we can build upon in future.

Final papers in JCI Special Issue on Open Data

Earlier this year I blogged about the first release of papers on Open Data in a Special Issue of the Journal of Community Informatics that I had been co-editing with Zainab Bawa. A few days ago we added the last few papers to the issue, finalising it as a collection of critical thinking about the development of Open Government Data.

You can find the full table of contents below (new papers noted with (New)).

Table of Contents

Editorial

The Promises and Perils of Open Government Data (OGD), Tim G. Davies, Zainab Ashraf Bawa

Two Worlds of Open Government Data: Getting the Lowdown on Public Toilets in Chennai and Other Matters, Michael Gurstein

Articles

The Rhetoric of Transparency and its Reality: Transparent Territories, Opaque Power and Empowerment, Bhuvaneswari Raman

“This is what modern deregulation looks like” : co-optation and contestation in the shaping of the UK’s Open Government Data Initiative, Jo Bates

Data Template For District Economic Planning, Sharadini Rath

Guidelines for Designing Deliberative Digital Habitats: Learning from e-Participation for Open Data Initiatives, Fiorella De Cindio

(New) Unintended Behavioural Consequences of Publishing Performance Data: Is More Always Better?, Simon McGinnes, Kasturi Muthu Elandy

(New) Open Government Data and the Right to Information: Opportunities and Obstacles, Katleen Janssen

Notes from the field

Mapping the Tso Kar basin in Ladakh, Shashank Srinivasan

Collecting data in Chennai City and the limits of openness, Nithya V Raman

Apps For Amsterdam, Tom Demeyer

Open Data – what the citizens really want, Wolfgang Both

(New) Trustworthy Records and Open Data, Anne Catherine Thurston

(New) Exploring the politics of Free/Libre/Open Source Software (FLOSS) in the context of contemporary South Africa; how are open policies implemented in practice?, Asne Kvale Handlykken

Points of View

Some Observations on the Practice of “Open Data” As Opposed to Its Promise, Roland J. Cole

How might open data contribute to good governance?

[Summary: sharing an introductory article on open data and governance]

Thanks to an invite via the great folk at CYEC, earlier this year I was asked to write a contribution for the Commonwealth Governance Handbook around emerging technology trends, so I put down a few thoughts on how open data might contribute to good governance in a Commonwealth context. The book isn’t quite out yet, but as I’m preparing for the next few days I’ll be spending at an IDRC Information and Networks workshop with lots of open access advocates, talking about open data and governance, I thought I should at least get a pre-print uploaded. So here is the PDF for download.

The article starts:

Access to information is increasingly recognised as a fundamental component of good governance. Citizens need access to information on the decision-making processes of government, and on the performance of the state to be able to hold governments to account.

And ends by saying:

Whether open data initiatives will fully live up to the high expectations many have for them remains to be seen. However, it is likely that open data will come to play a part in the governance landscape across many Commonwealth countries in coming years, and indeed, could provide a much needed tool to increase the transparency of Commonwealth institutions. Good governance, pro-social and civic outcomes of open data are not inevitable, but with critical attention they can be realised.

The bit in-between tries to provide a short introduction to open data for beginners, and to consider some of the ways open data and governance meet, drawing particularly on examples from the Commonwealth.

Comments and feedback welcome.

Download paper: PDF (128Kb)

Opening the National Pupil Database?

[Summary: some preparatory notes for a response to the National Pupil Database consultation]

The Department for Education are currently consulting on changing the regulations that govern who can gain access to the National Pupil Database (NPD). The NPD holds detailed data on every student in England, going back over ten years, and covering topics from test and exam results, to information on gender, ethnicity, first language, eligibility for free school meals, special educational needs, and detailed information on absences or school exclusion. At present, only a specified list of government bodies are able to access the data, with the exception that it can be shared with suitably approved “persons conducting research into the educational achievements of pupils”. The DFE consultation proposed opening up access to a far wider range of users, in order to maximise the value of this rich dataset.

The idea that government should maximise the value of the data it holds has been well articulated in open data policies and in the white paper that suggests open data can be an “effective engine of economic growth, social wellbeing, political accountability and public service improvement”. However, the open data movement has always been pretty unequivocal on the claim that ‘personal data’ is not ‘open data’ – yet the DFE proposals seek to apply an open data logic to what is fundamentally a personal, private and sensitive dataset.

The DFE is not, in practice, proposing that the NPD is turned into an open dataset, but it is consulting on the idea that it should be available not only for a wider range of research purposes, but also to “stimulate the market for a broader range of services underpinned by the data, not necessarily related to educational achievement”. Users of the data would still go through an application process, with requests for the most sensitive data subject to additional review, and users agreeing to hold the data securely: but the data, including easily de-anonymised individual-level records, would still be given out to a far wider range of actors, with increased potential for data leakage and abuse.

Consultation and consent

I left school in 2001 and further education in 2003, so as far as I can tell, little of my data is captured by the NPD – but, if it was, it would have been captured based not on my consent to it being handled, but simply on the basis that it was collected as an essential part of running the school system. The consultation documents state that “The Department makes it clear to children and their parents what information is held about pupils and how it is processed, through a statement on its website. Schools also inform parents and pupils of how the data is used through privacy notices”, yet it would be hard to argue this constitutes informed consent for the data to now be shared with commercial parties for uses far beyond the delivery of education services.

In the case of the NPD, it would appear particularly important to consult with children and young people on their views of the changes – as it is, after all, their personal data held in the NPD. However the DFE website shows no evidence of particular efforts being taken to make the consultation accessible to under 18s. I suspect a carefully conducted consultation with diverse groups of children and young people would be very instructive to guide decision making in the DFE.

The strongest argument for reforming the current regulations in the consultation document is that, in the past, the DFE has had to turn down requests to use the data for research which appears to be in the interests of children and young people’s wellbeing. For example, “research looking at the lifestyle/health of children; sexual exploitation of children; the impact of school travel on the environment; and mortality rates for children with SEN”. It might well be that, consulted on whether they would be happy for their data to be used in such research, many children, young people and parents would be happy to permit a wider wording of the research permissions for the NPD, but I would be surprised if most would happily consent to just about anyone being able to request access to their sensitive data. We should also note that, whilst some of the research the DFE has turned down sounds compelling, this does not necessarily mean the research could not happen in any other way: nor that it could not be conducted by securing explicit opt-in consent. Data protection principles that require data to only be used for the purpose it was collected cannot just be thrown away because they are inconvenient, and even if consultation does highlight that people may be willing to see some wider sharing of their personal data for good, it is not clear this can be applied retroactively to data already collected.

Personal data, state data, open data

The NPD consultation raises an important issue about the data that the state has a right to share, and the data it holds in trust. Aggregate, non-disclosive information about the performance of public services is data the state has a clear right to share and is within the scope of open data. Detailed data on individuals that it may need to collect for the purpose of administration, and generating that aggregate data, is data held in trust – not data to be openly shared.
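To illustrate that distinction with a toy example (the fields, threshold and records below are invented, and bear no relation to the NPD’s actual structure or disclosure rules), the kind of aggregate, non-disclosive product the state could reasonably share might be produced along these lines:

```python
# Toy sketch: produce an aggregate, non-disclosive table from individual-level
# records, suppressing small cells. Fields and the threshold are invented for
# illustration and bear no relation to the NPD's actual structure or rules.
from collections import Counter

pupils = [
    {"local_authority": "Barnet", "fsm_eligible": True},
    {"local_authority": "Barnet", "fsm_eligible": False},
    {"local_authority": "Camden", "fsm_eligible": True},
    # ... individual-level records stay inside the secure environment
]

SUPPRESSION_THRESHOLD = 5  # suppress counts small enough to risk re-identification

counts = Counter((p["local_authority"], p["fsm_eligible"]) for p in pupils)

aggregate = []
for (authority, fsm), count in sorted(counts.items()):
    aggregate.append({
        "local_authority": authority,
        "fsm_eligible": fsm,
        "pupil_count": count if count >= SUPPRESSION_THRESHOLD else "suppressed",
    })

# Only this aggregate table, not the pupil-level records, would be a candidate
# for release as open data.
print(aggregate)
```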

However, there are many ways to aggregate or process a dataset – and many different non-personally identifying products that could be built from it. Many of these, government will never have the need to create – yet they could bring social and economic value. So perhaps there are spaces to balance the potential value in personally sensitive datasets with the necessary primacy of data protection principles.

Practice accommodations: creating open data products

In his article for the Open Data Special Issue of the Journal of Community Informatics I edited earlier this year, Rollie Cole talks about ‘practice accommodations’ between open and closed data. Getting these accommodations right for datasets like the NPD will require careful thought and could benefit from innovation in data governance structures. In early announcements of the Public Data Corporation (now the Public Data Group and Open Data User Group), there was a description of how the PDC could “facilitate or create a vehicle that can attract private investment as needed to support its operations and to create value for the taxpayer”. At the time I read this as exploring the possibility that a PDC could help private actors with an interest in public data products that were beyond the public task of the state, but were best gathered or created through state structures, to pool resources to create or release this data. I’m not sure that’s how the authors of the point intended it, but the idea potentially has some value around the NPD. For example, if there is a demand for better “demographic models [that can be] used by the public and commercial sectors to inform planning and investment decisions” derived from the NPD, are there ways in which new structures, perhaps state-linked co-operatives, or trusted bodies like the Open Data Institute, can pool investment to create these products, and to release them as open data? This would ensure access to sensitive personal data remained tightly controlled, but would enable more of the potential value in a dataset like NPD to be made available through more diverse open aggregated non-personal data products.

Such structures would still need good governance, including open peer-review of any anonymisation taking place, to ensure it was robust.

The counter argument to such an accommodation might be that it would still stifle innovation, by leaving some barriers to data access in place. However, the alternative, of DFE staff assessing each application for access to the NPD, and having to make a decision on whether a commercial re-use of the data is justified, and the requestor has adequate safeguards in place to manage the data effectively, also involves barriers to access – and involves more risk – so the counter argument may not take us that far.

I’m not suggesting this model would necessarily work – but I introduce it to highlight that there are ways to increase the value gained from data without just handing it out in ways that inevitably increase the chance it will be leaked or mis-used.

A test case?

The NPD consultation presents a critical test case for advocates of opening government data. It requires us to articulate more clearly the different kinds of data the state holds, to be much more nuanced about the different regimes of access that are appropriate for different kinds of data, and to consider the relative importance of values like privacy over ideas of exploiting value in datasets.

I can only hope DFE listen to the consultation responses they get, and give their proposals a serious rethink.

 

Further reading and action: Privacy International and Open Rights Group are both preparing group consultation inputs, and welcome input from anyone with views or expert insights to offer.

Complexity and complementarity – why more raw material alone won’t necessarily bring open data driven growth

[Summary: reflections on an open data hack day, complexity, and complements to open data for economic and social impact. Cross posted from Open Data Impacts blog.]

“Data is the raw material of the 21st Century”.

It’s a claim that has been made in various forms by former US CIO Vivek Kundra (PDF), by large consultancies and tech commentators, and that is regularly repeated in speeches by UK Cabinet Office Minister Francis Maude, mostly in relation to the drive to open up government data. This raw material, it is hoped, will bring about new forms of economic activity and growth. There is certainly evidence to suggest that for some forms of government data, particularly ‘infrastructural’ data, moving to free and open access can stimulate economic activity. But, for many open data advocates, the evidence is not showing the sorts of returns on investment, or even the ‘gold rush’ of developers picking over data catalogues to exploit newly available data that they had expected.

At a hack-event held at the soon-to-be-launched Open Data Institute in London this week, a number of speakers highlighted the challenge of getting open data used: the portals are built, but the users do not necessarily come. Data quality, poor meta-data, inaccessible language, and the difficulty of finding wheat amongst the chaff of data were all diagnosed as part of the problem, with some interesting interfaces and tools developed to try and improve data description and discovery. Yet these diagnoses and solutions are still based on linear thinking: when a dataset is truly accessible, then it will be used, and economic benefits will flow.

Owen Barder identifies the same sort of linear thinking in much macro-economic international development policy of the 70s and 80s in his recent Development Drums podcast lecture on complexity and development. The lecture explores the question of how countries with similar levels of ‘raw materials’, in terms of human and physical capital, could have had such different growth rates over the last half century. The answer, it suggests, lies in the complexity of economic development – where we need not just raw materials, but diverse sets of skills and supply chains, frameworks, cultures and practices. Making the raw materials available is rarely enough for economic growth. And this is something that open data advocates focussed on economic returns on data need to grapple with.

Thinking about open data use as part of a complex system involves paying attention to many different dimensions of the environment around data. Jose Alonso highlights “the political, legal, organisation, social, technical and economic” as all being important areas to focus on. One way of grounding notions of complexity in thinking about open data use, which I was introduced to when working on a paper with George Kuk last year, is through the concept of ‘complementarity’. Essentially, A complements B if A and B together are more than the sum of their parts. For example, a mobile phone application and an app store are complements: the software in one needs the business model and delivery mechanisms in the other in order to be used.

The challenge then is to identify all the things that may complement open data for a particular use; or, more importantly, to identify all those processes already out there in the economy to which certain open datasets are a complement. Whilst the example of complements above appears at first glance technological (apps and app stores), behind it are economic, social and legal complementarities, amongst others. Investors, payment processing services, app store business models, remittance to developers, and often-times stable jobs for developers in an existing buoyant IT industry that allow them either to work on apps for fun in spare time, or to leave work with enough capital to take a risk on building their own applications, are all part of the economic background. Developer meet-ups, online fora, clear licensing of data, no fear of state censorship of applications built, and so on, contribute to the social and legal background. These parts of the complex landscape generally cannot be centrally planned or controlled, but equally they cannot be ignored when we are asking why the provision of a raw material has not brought about anticipated use.

As I start work on the ‘Exploring the Emerging Impacts of Open Data in the South‘ project with the Web Foundation and IDRC, understanding the possible complements of open data for economic, political and social use may provide one route to explore which countries and contexts are likely to see strong returns from open data policy, and to see what sorts of strategies states, donors and communities can adopt to increase their opportunity to gain potential benefits and avoid possible pitfalls of greater access to open data. Perhaps for further Open Data Institute hack days, it can also encourage more action to address the complex landscape in which open data sits, rather than just linear extensions of data platforms that exist in the hope that the users will eventually come*.

Where co-operatives and open data meet…

[Summary: thoughts on ways in which co-operatives could engage with open data]

With the paper I worked on with Web Science Trust for Nominet Trust on ‘Open Data and Charities’ just released (find the PDF for download here), and this post on ‘Open Data and Co-operatives’, it might feel like I’m just churning through a formula of ‘organisation structure’ + ‘open data’ for writing articles and blog posts. It is, however, just a fortuitous coincidence of timing, thanks to a great event organised today by Open Data Manchester and Co-operatives UK.

The event was a workshop on ‘Co-operative business models for open data‘ and involved an exploration of some of the different ways in which co-operatives might have a role to play in creating, sharing and managing open data resources. Below are my notes from some of the presentations and discussions, and some added reflections jotted down during this write-up.

What are co-operatives?

Many people in the UK are familiar with the high-street retail co-operative; but there are thousands more co-operatives in the UK active in all sectors of the economy; and the co-operative is a business form established right across the world.

The co-operative is a model of business ownership and governance. Unlike limited or public companies, which are owned and essentially run in the interests of their shareholders, co-operatives are owned by their members, and are run in the interest of those members. Co-ops legal expert Ged explained this still leaves a vast range of possible governance models for co-ops, depending on who the members are, and how they are structured. For example, the retail coop is a ‘consumers’ co-operative’, where shoppers who use its services can become members and have a say in the governance of the institution. By contrast, the John Lewis Partnership is an employee owned, or ‘producer’, co-operative, which is run for the collective benefit of its staff. Some co-operatives are jointly owned by producers and consumers, and others, like Co-ops UK, are owned by their member organisations – existing to provide a service to other co-ops.

There’s been a lot of focus on co-ops in recent years. This year is UN Year of the Co-operative, and the current UK Government has talked a lot about mutualisation of public services.

What do co-operatives have to do with open data?

There are many different perspectives on what open data is, but at its most basic, open data involves making datasets accessible online, in standard formats, and under licenses that allow them to be re-used. In discussions we explored a range of ways in which co-operative structures might meet open data.

Share: Co-operatives sharing data

As businesses, co-operatives have a wide range of data they might consider making available as open data. Discussions in today’s workshop highlighted the wide variety of possible data: from locations of retail coop outlets, to energy usage data gathered by an energy co-operative, or turnstile data from a co-operative football club.

Co-operatives might also hold datasets that contain personal or commercially sensitive data, such as the records held by the co-operative bank, or the shopping data held by the retail co-operative, but that could be used to generate derived datasets that could be made openly available to support innovation, or to inform action on key social challenges.

There are a number of motivations for co-ops to release data as open data:

  • Firstly, releasing data may allow others to re-use it in a way that benefits the coop economy. For example, Co-operatives UK recently released a mobile app for locating a wide range of co-ops and retail outlets. If the data for this was also available, third parties could build information on coop services into their own apps, tools and services, potentially increasing awareness of co-operatives.
  • Secondly, sharing data might support the wider social aims of a co-operative. For example, an energy co-operative might have gathered lots of data on the sorts of renewable energy sources that work in different settings, and sharing this data openly would support other people working on sustainable energy to make better choices; or retail co-operatives might share information on the grants they give to community groups in a structured form in a way that would support them to better target resources on areas with the most impact.
  • Thirdly, transparency, accountability and trust might be important drivers for co-ops to release data – with open data supporting new models of co-operative governance. For example, co-ops might release detailed financial information as open data to allow their members to understand their performance, or to analyse staff remuneration. Or a coop might provide aggregate data on its supply chain to show how it is improving the percentage of supplies from other co-operatives or from Fairtrade suppliers. For public service co-operatives, like the Youth Mutual forming in Lambeth, it may be important to publish structured data on how public money is being spent, ensuring that the contracting out of services through co-operatives does not undermine the local authority spending transparency that has been established over recent years.

 

Collaborate: Co-operatives as data sharing clubs

Discussions also looked at how we can put data into co-operatives, rather than get data out. A lot of the open data agenda so far has focussed on open data from government (OGD), but often the data needed to answer key questions comes from a variety of stakeholders, including governments, community groups and individuals.

Co-operatives could provide a model to manage the ownership of shared data resources. Most open data licenses are still based upon data being owned somewhere (apart from CC-zero, and Public Domain Dedications which effectively waive ownership rights over a dataset). Co-operatives can provide a model for ownership of open data resources, giving different stakeholders a say in how shared data is managed. For example, if government releases a dataset of public transport provision, and invites citizens and organisations to take part in crowdsourced improvement of the data, people may be reluctant to contribute if the data is just going back into state ownership. However, if contributors to the improved dataset also gain a shared stake in ownership of that enhanced data, they may be more interested in giving their input. This was an issue that came up at the PMOD conference in Brussels last month.

We also discussed how co-operative structures could provide a vehicle for combining open and private data, or for the limited pooling of private data. For example, under the MiData programme, government is working to give citizens better access to their personal data from corporations, such as phone and energy companies. Pooling their personal data (in secure, non-open ways) could allow consumers to get better deals on products or to engage in collective purchasing. Undoubtedly private companies will emerge offering services based on pooled personal data, but where this sort of activity takes place through co-operative structures, consumers sharing their data can have a guarantee that the benefits of the pooled data are being shared amongst the contributors to it, not appropriated by some private party.

Create and curate: Co-operative governance of datasets and portals

Linked to the idea of co-operatives as data sharing clubs, Julian Tait highlighted the potential for co-operative governance of data portals – taking a mutual approach to managing the meta-data and services that they provide.

As I’ve argued elsewhere, open data portals need to go beyond just listing datasets, to also be a hub of engagement – building the capacity of diverse groups to make use of data.

Ideas of joint producer and consumer co-operatives might also provide a means to involve users of data in deciding how data is created and collected. Choices made about data schemas, frequency of update etc. can have a big impact on what can be done with data – yet users of data are rarely involved in these choices.

Mobilise: Collaborating to add value to data

The claim is often implicitly or explicitly made that publishing open data will lead to all sorts of benefits, from greater transparency, accountability and trust, to innovation and economic growth.

However, looked at in detail, we find that there are many elements to the value chain between raw open data and social or economic value. Data may need cleaning, linking, contextualising, analysing and interpreting before it can be effectively used. In talking about the Swirl business model for open data, Ric Roberts explained that if you charge too early on in the value chain for data, it will be underused. However, efforts to add value to data in the open can suffer a public good problem – everyone benefits, but no-one wants to cover the full cost alone. If everyone duplicates the tasks involved in adding value to data, less will be done – so establishing co-operative structures around data in particular areas or sectors might provide a means to pool efforts on improving data, adding value, and generating shared tools and services with data that can benefit all the members of a coop.

This might be something we explore in thinking about a ‘commissioning fund’ around the International Aid Transparency Initiative to help different stakeholders in IATI to pool resources to develop useful tools and services based on the data.

Where next?

We ended today’s workshop by setting up a Google Document to develop a short paper on co-operatives and open data. You can find the draft here, and join in to help fill out a map of all the different ways co-ops could engage with open data, and to develop plans for some pilots and shared activities to explore the co-operative-opendata connection more.

Keep an eye on the Co-operative News ‘Open’ pages for more on the co-operative open data journey.