ODDC Update at Developers for Development, Montreal

[Summary: Cross posted from the Open Data Research Network website. Notes from a talk at OD4DC Montreal] 

I’m in Montreal this week for the Developers for Development hackathon and conference. Aside from having fun building a few things as part of our first explorations for the Open Contracting Data Standard, I was also on a panel with the fantastic Linda Raftree, Laurent Elder and Anahi Ayala Iacucci focussing on the topic of open data impacts in developing countries: a topic I spend a lot of time working on. We’re still in the research phase of the Emerging Impacts of Open Data in Developing Countries research network, but I tried to pull together a talk that would capture some of the themes that have been coming up in our network meetings so far. So – herewith the slides and raw notes from that talk.

Introduction

In this short presentation I want to focus on three things. Firstly, I want to present a global snapshot of open data readiness, implementation and impacts around the world.

Secondly, I want to offer some remarks on the importance of how research into open data is framed, and what social research can bring to our understanding of the open data landscape in developing countries.

Lastly, I want to share a number of critical reflections emerging from the work of the ODDC network.

Part 1: A global snapshot

I’ve often started presentations and papers about open data by commenting on how ‘it’s just a few short years since the idea of open data gained traction’, yet in 2014 that line is starting to get a little old. Data.gov launched in 2009, Kenya’s data portal in 2011. IATI has been with us for a while. Open data is no longer a brand new idea, just waiting to be embraced – it is becoming part of the mainstream discourse of development and government policy. The issue now is less about convincing governments to engage with the open data agenda than it is about discovering whether open data discourses are translating into effective implementation, and ultimately open data impacts.

Back in June last year, at the Web Foundation we launched a global expert survey to help address that question. All in all we collected data covering 77 countries, representing every region, type of government and level of development, asking about government, civil society and business readiness to secure benefits from open data, the actual availability of key datasets, and observed impacts from open data. The results were striking: over 55% of these diverse countries surveyed had some form of open data policy in place, many with high-level ministerial support.

The policy picture looks good. Yet, when it came to key datasets actually being made available as open data, the picture was very different. Less than 7% of the datasets surveyed in the Barometer were published both in bulk machine-readable forms, and under open licenses: that is, in ways that would meet the open definition. And much of this percentage is made up of the datasets published by a few leading developed states. When it comes to essential infrastructural datasets like national maps, company registers or land registries, data availability, even of non-open data, is very poor, and particularly bad in developing countries. In many countries, the kinds of cadastral records that are cited as a key to the economic potential of open data are simply not yet collected with full country coverage. Many countries have long-standing capacity building programmes to help them create land registries or detailed national maps – but many such programmes are years or even decades behind on delivering the required datasets.

The one exception where data was generally available and well curated, albeit not provided in open and accessible forms, was census data. National statistics offices have been the beneficiaries of years of capacity building support: yet the same programmes that have enabled them to manage data well have also helped them to become quasi-independent of governments, complicating whether or not they will easily be covered by government open data policies.

If the implementation story is disappointing, the impact story is even more so. In the Barometer survey we asked expert researchers to cite examples of where open data was reported in the media, or in academic sources, to have had impacts across a range of political, social and economic domains, and to score questions on a 10-point scale for the breadth and depth of impacts identified. The scores were universally low. Of course, whilst the idea of open data can no longer be claimed to be brand new, many country open data initiatives are – and so it is fair to say that outcomes and impacts take time, and are unlikely to be seen in any substantial way over the very short term. Yet, even in countries where open data has been present for a number of years, evidence of impact was light. The impacts cited were often hackathon applications, which, important as they are, generally only prototype and point to potential impacts. Without getting to scale, few demo applications alone can deliver substantial change.

Of course, some of this impact evidence gap may also be down to weaknesses in existing research. Some of the outcomes from open data publication are not easily picked up in visible applications or high profile news stories. That’s where the need for a qualitative research agenda really comes in.

Part 2: Researching open data in developing countries

The Open Data Barometer is just one part of a wider open data programme at the World Wide Web Foundation, including the Open Data in Developing Countries research project supported by Canada’s International Development Research Centre. The main focus of that project over the last 12 months has been on establishing a network of case study research partners based in developing countries, each responding to both local concerns, and a shared research agenda, to understand how open data can be put to use in particular decision making and governance situations.

Our case study partners are drawn from Universities, NGOs and independent consultancies, and were selected from responses to an open call for proposals issued in mid-2012. Interestingly, many of these partners were not open data experts, or already involved in open data – but were focussed on particular social and policy issues, and were interested in looking at what open data meant for these. Focus areas for the cases range from budget and aid transparency, to higher education performance, to the location of sanitation facilities in a city. Together, these foundations give the research network a number of important characteristics:

Firstly, whilst we have a shared research framework that highlights particular elements that each case study seeks to incorporate – from looking at the political, social and economic context of open data, through to the technical features of datasets and the actions of intermediaries – cases are also able to look at the different constraints exogenous to datasets themselves which affect whether or not data has a chance of making a difference.

Secondly, the research network works to build critical research capacity around open data – bringing new voices into the open data debate. For example, in Kenya, the Jesuit Hakimani Trust has an established record of working on citizens’ access to information, but until 2013 had not looked at the issue of open data in Kenya. By incorporating questions about open data in their large-scale surveys of citizen attitudes, they are starting to generate evidence that treats open data alongside other forms of access to information for poor and marginalised citizens, yielding new insights.

Thirdly, the research is open to unintended consequences of open data publication: good and bad – and can look for impacts outside the classic logic model of ‘data + apps = impact’. Indeed, as researchers in both Sao Paulo and Chennai have found, they have, as respected research intermediaries exploring open data use, been invited to get involved with shaping future government data collection practices. Gisele Craveiro from the University of Sao Paulo uses the metaphor of an iceberg to highlight the importance of looking below the surface. The idea that opening data ultimately changes what data gets collected, and how it is handled inside the state, should not be an alien idea for those involved in IATI – which has led to many aid agencies starting to geocode their data. But it is a route to effects often underplayed in explorations of the changes open data may be part of bringing about.

Part 3: Emerging findings

As mentioned, we’ve spent much of 2013 building up the Open Data in Developing Countries research network – and our case study partners are right now in the midst of their data collection and analysis. We’re looking forward to presenting full findings from this first phase of research towards the summer, but there are some emerging themes that I’ve been hearing from the network in my role as coordinator that I want to draw out. I should note that these points of analysis are preliminary, and are the product of conversations within the network, rather than being final statements, or points that I claim specific authorship over.

We need to unpack the definition of open data.

Open data is generally presented as a package with a formal definition. Open data is data that is proactively published, in machine-readable formats, and under open licenses. Without all of these, there isn’t open data. Yet, ODDC participants have been highlighting how the relative importance of these criteria varies from country to country. In Sierra Leone, for example, machine-readable formats might be argued to be less important right now than proactive publication, as for many datasets the authoritative copy may well be the copy on paper. In India, Nigeria or Brazil, the question of licensing may be moot: as it is either assumed that government data is free to re-use, regardless of explicit statements, or local data re-users may be unconcerned with violating licenses, based on a rational expectation that no-one will come after them.

Now – this is not to say that the Open Definition should be abandoned, but we should be critically aware of its primary strength: it helps to create a global open data commons, and to deliver on a vision of ‘Frictionless data’. Open data of this form is easier to access ‘top down’, and can more easily be incorporated into panopticon-like development dashboards, but the actual impact on ‘bottom up’ re-use may be minimal. Unless actors in a developing country are equipped with the skills and capacities to draw on this global commons, and to overcome other local ‘frictions’ to re-using data effectively, the direct ROI on the extra effort to meet a pure open definition might not accrue to those putting the effort in: and a dogmatic focus on strict definitions might even in some cases slow down the process of making data relatively more accessible. Understanding the trade-offs here requires more research and analysis – but the point at least is made that there can be differences of emphasis in opening data, and these prioritise different potential users.

Supply is weak, but so is demand.

Talking at the Philippines Good Governance Summit a few weeks ago, Michael Canares presented findings from his research into how the local government Full Disclosure Policy (FDP) is affecting both ‘duty bearers’ responsible for supplying information on local budgets, projects, spending and so on, and ‘claim holders’ – citizens and their associations who seek to secure good services from government. A major finding has been that, with publishers being in ‘compliance mode’, putting up the required information but not in readily accessible formats, citizen groups articulated very little demand for online access to Full Disclosure Policy information. Awareness that the information was available was low, interest in the particular data published was low (that is, the information made available did not match any specific demand), and where citizen groups were accessing the data they often found they did not have the knowledge to make sense of or use it. The most viewed and downloaded documents garnered no more than 43 visits in the period surveyed.

In open data, as we remove the formal or technical barriers to data re-use that come from licenses and non-standard formats, we encounter the informal hurdles, roadblocks and thickets that lie behind them. And even as those new barriers are removed through capacity building and intermediation, we may find that they were not necessarily holding back a tide of latent demand – but were rather theoretical barriers in the way of a progressive vision of an engaged citizenry and innovative public service provision. Beyond simply calling for the removal of barriers, this vision needs to be elaborated – whether through the designs of civic leaders, or through the distributed actions of a broad range of social activists and entrepreneurs. And the tricky challenge of culture change – changing expectations of who is, and can be, empowered – needs to be brought to the fore.

Innovative intermediation is about more than visualisation.

Early open data portals listed datasets. Then they started listing third party apps. Now, many profile interactive visualisations built with data, or provide visualisation tools. Apps and infographics have become the main thing people think of when it comes to ‘intermediaries’ making open data accessible. Yet, if you look at how information flows on the ground in developing countries, mobile messaging, community radio, notice boards, churches and chiefs’ centres are much more likely to come up as key sites of engagement with public information.

What might open data capacity building look like if we started with these intermediaries, and only brought technology in to improve the flow of data where that was needed? What does data need to be shaped like to enable these intermediaries to act with it? And how do the interests of these intermediaries, and the constituencies they serve, affect what will happen with open data? All these are questions we need to dig into further.

Summary

I said in the opening that this would be a presentation of critical reflections. It is important to emphasise that none of this constitutes an argument against open data. The idea that government data should be accessible to citizens retains its strong intrinsic appeal. Rather, in offering some critical remarks, I hope this can help us to consider different directions open data for development can take as it matures, and that ultimately we can move more firmly towards securing impacts from the important open data efforts so many parties are undertaking.

Joined Up Philanthropy data standards: seeking simplicity, and depth

[Summary: technical notes on work in progress for the Open Philanthropy data standard]

I’m currently working on sketching out an alpha version of a data standard for the Open Philanthropy project (soon to be 360giving). Based on work Pete Bass has done analysing the supply of data from trusts and foundations, a workshop on demand for the data, and a lot of time spent looking at existing standards at the content layer (eGrant/hGrant, IATI, Schema.org, GML etc.) and deeper technical layers (CSV, SDF, XML, RDF, JSON, JSON-Schema and JSON-LD), I’m getting closer to having a draft proposal. But – ahead of that – and spurred on by discussions at the Berkman Center this afternoon about the role of blogging in helping in the idea-formation process, here’s a rough outline of where it might be heading. (What follows is ‘thinking aloud’ from my work in progress, and does not represent any set views of the Open Philanthropy project.)

Building Blocks: Core data plus

Joined Up Data Components

There are lots of things that different people might want to know about philanthropic giving, from where money is going, to detailed information on the location of grant beneficiaries, information on the grant-making process, and results information. However, few trusts and foundations have all this information to hand, and very few are likely to have it in a single system such that creating a single open data file covering all these different areas of the funding process would be an easy task. And if presented with a massive spreadsheet with 100s of columns to fill in, many potential data publishers are liable to be put off by the complexity. We need a simple starting point for new publishers of data, and a way for those who want to say more about their giving to share deeper and more detailed information.

The approach to that should be a modular, rather than monolithic standard: based on common building blocks. Indeed, in line with the Joined Up Data efforts initiated by Development Initiatives, many of these building blocks may be common across different data standards.

In the Open Philanthropy case, we’ve sketched out seven broad building blocks, in addition to the core “who, what and how much” data that is needed for each of the ‘funding activities’ that are at the heart of an open philanthropy standard. These are:

  • Organisations – names, addresses and other details of the organisations funding, receiving funds and partnering in a project
  • Process – information about the events which take place during the lifetime of a funding activity
  • Locations – information about the geography of a funded activity – including the location of the organisations involved, and the location of beneficiaries
  • Transactions – information about pledges and transfers of funding from one party to another
  • Results – information about the aims and targets of the activity, and whether they have been met
  • Classifications – categorisations of different kinds that are applied to the funded activity (e.g. the subject area), or to the organisations involved (e.g. audited accounts?)
  • Documents – links to associated documents, and more in-depth descriptions of the activity

Some of these may provide more in-depth information about some core field (e.g. ‘Total grant amount’ might be part of the core data, but individual yearly breakdowns could be expressed within the transactions building block), whilst others provide information that is not contained in the core information at all (results or documents for example).
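To make the modular idea concrete, here is a rough sketch in Python of how a single funding activity might carry the core ‘who, what and how much’ data plus a couple of optional building blocks. The field names and identifier are my own illustrative assumptions, not the draft standard:

```python
# Illustrative only: core "who, what and how much" fields plus optional
# building blocks for one funding activity. Names and values are hypothetical.
activity = {
    # Core data: who, what and how much
    "id": "EXAMPLE-GRANT-0001",
    "title": "Community sanitation mapping",
    "fundingOrganisation": "Example Foundation",
    "recipientOrganisation": "Example Community Trust",
    "totalGrantAmount": 30000,
    "currency": "GBP",

    # Optional building blocks, present only if the publisher can supply them.
    # Yearly payments break down the core totalGrantAmount figure.
    "transactions": [
        {"type": "payment", "amount": 10000, "date": "2014-04-01"},
        {"type": "payment", "amount": 20000, "date": "2015-04-01"},
    ],
    "locations": [
        {"role": "beneficiaryLocation", "name": "Chennai", "countryCode": "IN"},
    ],
    "classifications": [
        {"vocabulary": "subject", "code": "sanitation"},
    ],
}
```

A publisher with only the core fields could stop there; richer blocks such as transactions, results or documents would simply be absent rather than required.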

An ontological approach: flat > structured > linked

One of the biggest challenges with sketching out a possible standard data format for open philanthropy is in balancing the technical needs of a number of different groups:

  • Publishers of the data need it to be as simple as possible to share their information. Publishing open philanthropy data must be simple, with a minimum of technical skills and resources required. In practice, that means flat, spreadsheet-like data structures.
  • Analysts like flat spreadsheet-style data too – but often want to be able to cut it in different ways. Standards like IATI are based on richly structured XML data, nested a number of levels deep, which can make flattening the data for analysts to use very challenging.
  • Coders prefer structured data. In most cases for web applications that means JSON. Whilst some expressive path languages for JSON are emerging, ideally a JSON structure should make it easy for a coder to simply drill down in the tree to find what they want, so being able to look for activity.organisations.fundingOrganisation[0] is better than having to iterate through all the activity.organisation nodes to find the one which has "type": "fundingOrganisation" (see the sketch after this list).
  • Data integrators want to read data into their own preferred database structures, from noSQL to relational databases. Those wanting to integrate heterogeneous data sources from different ‘Joined Up Data’ standards might also benefit from Linked Data approaches, and graph-based data using cross-mapped ontologies.
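As a quick illustration of the coders’ point above, here is a minimal sketch contrasting the two JSON shapes; the field names are hypothetical and not drawn from any existing standard:

```python
# Shape 1: organisations keyed by role - a coder can drill straight down.
activity_keyed = {
    "organisations": {
        "fundingOrganisation": [{"name": "Example Foundation"}],
        "recipientOrganisation": [{"name": "Example Community Trust"}],
    }
}
funder_keyed = activity_keyed["organisations"]["fundingOrganisation"][0]

# Shape 2: a flat list of typed nodes - every consumer has to iterate and filter.
activity_typed = {
    "organisation": [
        {"type": "fundingOrganisation", "name": "Example Foundation"},
        {"type": "recipientOrganisation", "name": "Example Community Trust"},
    ]
}
funder_typed = next(o for o in activity_typed["organisation"]
                    if o["type"] == "fundingOrganisation")

print(funder_keyed["name"], funder_typed["name"])
```

The second shape mirrors the typed-node pattern described above, and pushes filtering work onto every consumer of the data.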

It’s pretty hard to see how a single format for representing data can meet the needs of all these different parties: if we go with a flat structure it might be easier for beginners to publish, but the standard won’t be very expressive, and will be limited to use in a small niche. If we go with richer data structures, the barriers to entry for newcomers will be too high. Standards like IATI have faced challenges through the choice of an expressive XML structure which, whilst able to capture much of the complexity of information about aid flows, is both tricky for beginners, and programmatically awkward to parse for developers. There are a lot of pitfalls an effective, and extensible, open philanthropy data standard will have to avoid.

In considering ways to meet the needs of these different groups, the approach I’ve been exploring so far is to start from a detailed, ontology based approach, and then to work backwards to see how this could be used to generate JSON and CSV templates (and as JSON-LD context), allowing transformation between CSV, JSON and Linked Data based only on rules taken from the ontology.

In practice that means I’ve started sketching out an ontology using Protege in which there are top entities for ‘Activity’, ‘Organisation’, ‘Location’, ‘Transaction’, ‘Documents’ and so-on (each of the building blocks above), and more specific sub-classed entities like ‘fundedActivity’, ‘beneficiaryOrganisation’, ‘fundingOrganisation’, ‘beneficiaryLocation’ and so-on. Activities, Organisations, Locations etc. can all have many different data properties, and there are then a range of different object properties to relate ‘fundedActivities’ to other kinds of entity (e.g. a fundedActivity can have a fundingOrganisation and so-on). If this all looks very rough right now, that’s because it is. I’ve only built out a couple of bits in working towards a proof-of-concept (not quite there yet): but from what I’ve explored so far it looks like building a detailed ontology should also allow mappings to other vocabularies to be easily managed directly in the main authoritative definition of the standard, and should mean that, when converted into Linked Data, heterogeneous data using the same or cross-mapped building blocks can be queried together. Now – from what I’ve seen, ontologies can tend to get out of hand pretty quickly – so as a rule I’m trying to keep things as flat as possible: ideally just relationships between Activities and the other entities, and then data properties.
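As a very rough illustration of the shape just described – and emphatically not the actual ontology, which is still being sketched in Protege – the same classes, sub-classes and properties could be built up programmatically with rdflib along these lines. The namespace URI and property names are my own placeholders:

```python
# A minimal sketch of the building-block classes described above, using rdflib.
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

OPHIL = Namespace("http://example.org/open-philanthropy#")  # hypothetical namespace
g = Graph()
g.bind("ophil", OPHIL)

# Top-level entities, one per building block
for cls in ["Activity", "Organisation", "Location", "Transaction", "Document"]:
    g.add((OPHIL[cls], RDF.type, OWL.Class))

# More specific sub-classed entities
subclasses = {
    "fundedActivity": "Activity",
    "fundingOrganisation": "Organisation",
    "beneficiaryOrganisation": "Organisation",
    "beneficiaryLocation": "Location",
}
for child, parent in subclasses.items():
    g.add((OPHIL[child], RDF.type, OWL.Class))
    g.add((OPHIL[child], RDFS.subClassOf, OPHIL[parent]))

# An object property relating a fundedActivity to its fundingOrganisation,
# and a data property carrying a core "how much" field, with a human-readable label.
g.add((OPHIL.hasFundingOrganisation, RDF.type, OWL.ObjectProperty))
g.add((OPHIL.hasFundingOrganisation, RDFS.domain, OPHIL.fundedActivity))
g.add((OPHIL.hasFundingOrganisation, RDFS.range, OPHIL.fundingOrganisation))

g.add((OPHIL.totalGrantAmount, RDF.type, OWL.DatatypeProperty))
g.add((OPHIL.totalGrantAmount, RDFS.domain, OPHIL.fundedActivity))
g.add((OPHIL.totalGrantAmount, RDFS.label, Literal("Total grant amount", lang="en")))

print(g.serialize(format="turtle"))
```

Keeping the hierarchy this shallow – activities related to a handful of other entities, each with flat data properties – is exactly the discipline mentioned above to stop the ontology getting out of hand.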

What I’ve then been looking at is how that ontology could be programmatically transformed:

  • (a) Into a JSON data structure (and JSON-LD Context)
  • (b) Into a set of flat tables (possibly described with Simple Data Format if there are tools for which that is useful)

In this way, using the ontology, it should be possible to take a set of flat tables and turn them into structured JSON and, via JSON-LD, into Linked Data. If the translation to CSV takes place using the labels of ontology entities and properties rather than their IDs as column names, then localisation of spreadsheets should also be in reach.
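Here is a minimal sketch of what that flat-to-structured step might look like, assuming a simple mapping from column labels to paths in the JSON structure; in practice the mapping would be generated from the ontology (and the labels localised), and all column names and paths below are illustrative:

```python
# Sketch: turn flat spreadsheet rows into nested JSON using a label -> path map.
import csv
import io
import json

# Column label -> dotted path in the target JSON structure (illustrative only;
# in practice this mapping would be derived from ontology labels).
LABEL_TO_PATH = {
    "Title": "title",
    "Total grant amount": "totalGrantAmount",
    "Funding organisation": "organisations.fundingOrganisation.name",
    "Beneficiary location": "locations.beneficiaryLocation.name",
}

def row_to_json(row):
    """Turn one flat spreadsheet row into a nested activity object."""
    activity = {}
    for label, value in row.items():
        path = LABEL_TO_PATH.get(label)
        if not path or value == "":
            continue
        keys = path.split(".")
        target = activity
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = value
    return activity

flat_csv = io.StringIO(
    "Title,Total grant amount,Funding organisation,Beneficiary location\n"
    "Community sanitation mapping,30000,Example Foundation,Chennai\n"
)
activities = [row_to_json(row) for row in csv.DictReader(flat_csv)]
print(json.dumps(activities, indent=2))
```

A JSON-LD context generated from the same ontology could then bind those keys to their Linked Data terms, completing the CSV to JSON to Linked Data chain described above.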

Rough work in progress: from ontology to JSON structure (and then onwards to a flat CSV model). Full worked example coming soon…

I hope to have a more detailed worked example of this to post shortly, or, indeed, a post detailing the dead-ends I came to when working this through further. But – if you happen to read this in the next few weeks, before that occurs – and have any ideas, experience or thoughts on this approach – I would be really keen to hear your ideas. I have been looking for any examples of this being done already – and have not come across anything: but that’s almost certainly because I’m looking in the wrong places. Feel free to drop in a comment below, or tweet @timdavies with your thoughts.