Monthly Archives: May 2013

Open data in extractives: meeting the challenges

followthedatalinesmallerThere’s lots of interest building right now around how open data might be a powerful tool for transparency and accountability in the extractive industries sector. Decisions over where extraction should take place have a massive impact on communities and the environment, yet often decision making is opaque, with wealthy private interests driving exploitation of resources in ways that run counter the public interest. Whilst revenues from oil, gas and mineral resources have the potential to be a powerful tool for development, with a proportion channeled into public funds, massive quantities of revenue frequently ‘go missing’, lost in corruption, and
fuelling elements of a resource curse.

For the last ten years the Extractive Industries Transparency Initiative has been working to get companies to commit to ‘publish what they pay‘ to government, and for government to disclose receipts of finance, working to identifying missing money through a document-based audit process. Campaigning coalitions, watchdogs and global initiatives have focussed on increasing the transparency of the sector. Now, with a recognition that we need to link together information on different resources flows for development at all levels, potentially through the use of structured open data, and with an anticipated “data tsunami” of new information on extractives financials anticipated from the Dodd-Frank act in the US, and similar regulation in Europe, groups working on extractives transparency have been looking at what open data might mean for future work in this area.

8713819458_08a1bf9c10_zRight now, DFID are taking that exploration forward through a series of hack days with Rewired State under the ‘follow the data’ banner, with the first in London last weekend, and one coming up next week in Lagos, Nigeria. The idea of the events is to develop rapid prototypes of tools that might support extractives transparency, putting developers and datasets together over 24 hours to see what emerges. I was one of the judging panel at this weekends event, where the three developer teams that formed looked respectively at: making datasets on energy production and prices more accessible for re-use through an API; visualising the relationship between extractives revenues and various development indicators; and designing an interface for ‘nuggets’ of insight discovered through hack-days to be published and shared with useful (but minimal) meta-data.

In their way, these three projects highlight a range of the challenges ahead for the extractives sector in building capacity to track resource flows through open data:

  • Making data accessibleThe APIfy project sought to take a number of available datasets and aggregate them together in a database, before exposing a number of API endpoints that made machine-readable standardised data available on countries, companies and commodities. By translating the data access challenge from one or routing around in disparate datasets, to one of calling a standard API for key kinds of ‘objects’, the project demonstrated the need developers often have for clear platforms to build upon. However, as I’ve discovered in developing tools for the International Aid Transparency Initiative, building platforms to aggregate together data often turns out to be a non-trivial project: technically (it doesn’t take long to get to millions of data items when you are dealing with financial transactions), economically (as databases serving millions of records to even a small number of users need to be maintained and funded), socially (developers want to be able to trust the APIs they build against to be stable, and outreach and documentation are needed to support developers to engage with an API), and in terms of information architecture (as design choices over a dataset or API can have a powerful affect on downstream re-users).
  • Connecting datasets – none of the applications from the London hack-day were actually able to follow resource flows through the available data. Although visions of a coherent datasphere, in which the challenge is just making the connection between a transaction in one dataset, and a transaction in another, to see where money is flowing, are appealing – traceability in practice turns out to be a lot harder. To use the IATI example again, across the 100,000+ aid activities published so far less than 1% include traceability efforts to show how one transaction relates to another, and even here the relationships exist in the data because of conscious efforts by publishers to link transaction and activity identifiers. In following the money there will be many cases where people have an incentive not to make these linkages explicit. One of the issues raised by developers over the hack-day was the scattered nature of data, and the gaps across it. Yet – when it comes to financial transaction tracking, we’re likely to often be dealing with partial data, full of gaps, and it won’t be easy to tell at first glance when a mis-match between incoming and outgoing finances is a case of missing data or corruption. Right now, a lot of developers attack open data problems with tools optimised for complete and accurate data, yet we need to be developing tools, methods and visualisation approaches that deal with partial and uncertain data. This is developed in the next point.
  • Correlation, causation and investigation – The Compare the Map project developed on the hack day uses “scraped data from GapMinder and EITI to create graphical tools” that allow a user to eye-ball possible correlations between extractives data and development statistics. But of course, correlation is not causation – and the kinds of analysis that dig deeper into possible relationships are difficult to work through on a hack day. Indeed, many of the relationships mash-ups of this form can show have been written about in papers that control for many more variables, dealing carefully with statistically challenging issues of missing data and imperfectly matched datasets. Rather than simple comparison visualisations that show two datasets side by side, it may be more interesting to look for all the possible statistically significant correlations in a datasets with common reference points, and then to look at how human users could be supported in exploring, and giving feedback on, which of those might be meaningful, and which may or may not already be researched. Where research does show a correlation to exist, then using open data to present a visual narrative to users about this can have a place, though here the theory of change is very different – not about identifying connections – but about communicating them in interactive and engaging ways to those who may be able to act upon them.
  • Sharing and collaborating – The third project at the London hack-day was ‘Fact Cache‘ – a simple concept for sharing nuggets of information discovered in hack-day explorations. Often as developers work through datasets they may come across discoveries of interest, yet these are often left aside in the rush to create a prototype app or platform. Fact Cache focussed on making these shareable. However, when it was presented discussions also explored how it could make these nuggets of information into social objects, open to discussion and sharing. This idea of making open data findings more usable as social objects was also an aspect of the UN Global Pulse hunchworks project. That project is currently on hold (it would be interesting to know why…), but the idea of supporting collaboration around open data through online tools, rather than seeing apps that present data, or initial analysis as the end point, is certainly one to explore more in building capacity for open data to be used in holding actors to account.
  • Developing theories of change – as the judges met to talk about the projects, one of the key themes we looked at was whether each project had a clear theory of change. In some sense taken together they represent the complex chain of steps involved in an open data theory of change, from making data more accessible to developers, creating tools and platforms that let end users explore data, andthen allowing findings from data to be communicated and to shape discourses and action. Few datasets or tools are likely to be change-making on their own – but rather can play a key role in shifting the balance of power in existing networks or organisations, activists, companies and governments. Understanding the different theories of change for open data is one of the key themes in the ongoing Open Data in Developing Countries research, where we take existing governance arrangements as a starting point in understanding how open data will bring about impacts.

In a complex world, access to data, and the capacity to use it effectively, are likely to be essential parts of building more accountable governance across a wide range of areas, including in the extractives industry. Although there are many challenges ahead if we are to secure the maximum benefits from open data for transparent and accountable governance, it’s exciting and encouraging to see so many passionate people putting their minds early to tackling them, and building a community ready to innovate and bring about change.

Note: The usage of ‘follow the data’ in this DFID project is distinct from the usage in the work I’m currently doing to explore ‘follow the data’ research methods. In the former, the focus is really on following financial and resource flows through connecting up datasets; in the latter the focus is on tracing the way in which data artefacts have been generated, deployed, transferred and used in order to understand patterns of open data use and impact.