Over the horizons: reflections from a week discussing the State of Open Data

[Summary: thinking aloud with five reflections on future directions for open data related work, following discussions around the US east coast]

Over the last week I’ve had the opportunity to share findings from The State of Open Data: Histories and Horizons in a number of different settings: from academic roundtables, to conference presentations, and discussion panels.

Each has been an opportunity not only to promote the rich open access collection of essays just published, but also a chance to explore the many and varied chapters of the book as the starting point for new conversation about how to take forward an open approach to data in different settings and societies.

In this post I’m going to try to reflect on a few themes that have struck me during the week. (Note: These are, at this stage, just my initial and personal reflections, rather than a fully edited take on discussions arising from the book.)

Panel discussion at the GovLab with Tariq Khokhar, Adrienne Schmoeker and Beth Noveck.

Renewing open advocacy in a changed landscape

The timeliness of our look at the Histories and Horizons of open data was underlined on Monday when a tweet from Data.gov announced this week as their 10th anniversary, and the Open Knowledge Foundation also celebrated their 15th birthday with a return to their old name, a re-focussed mission to address all forms of open knowledge, and an emphasis on creating “a future that is fair, free and open.” As they put it:

  “…in 2019, our world has changed dramatically. Large unaccountable technology companies have monopolised the digital age, and an unsustainable concentration of wealth and power has led to stunted growth and lost opportunities.”

going on to say:

“we recognise it is time for new rules for this new digital world.”

Not only is this a welcome and timely example of the kind of “thinking politically” we call for in the State of Open Data conclusion, but it chimes with many of the discussions this week, which have focussed as much on the ways in which private sector data should be regulated as they have on opening up government data.

While, in tools like the Open Data Charter’s Open Up Guides, we have been able to articulate a general case for opening up data in a particular sector, and then to enumerate ‘high value’ datasets that efforts should attend to, future work may need to go even deeper into analysing the political economy around individual datasets, and to show how a mix of voluntary data sharing, and hard and soft regulation, can be used to more directly address questions about how power is created, structured and distributed through control of data.

As one attendee at our panel at the GovLab put it, right now, open data is still often seen as a “perk not a right”. And although ‘right to data’ advocacy has an important role, it is by linking access to data to other rights (to clean air, to health, to justice etc.) that a more sophisticated conversation can develop around improving openness of systems as well as datasets (a point I believe Adrienne Schmoeker put in summing up a vision for the future).

Policy enables, problems drive

So does a turn towards problem-focussed open data initiatives mean we can put aside work on developing open data policies or readiness assessments? In short, no.

In a lunchtime panel at the World Bank, Anat Lewin offered an insightful reflection on The State of Open Data from a multilateral’s perspective, highlighting the continued importance of developing a ‘whole of government’ approach to open data. This was echoed in Adrienne Schmoeker’s description at the GovLab of the steps needed to create a city-wide open data capacity in New York. In short, without readiness assessments and open data policies in place, initiatives that use open data as a strategic tool are likely to rub up against all sorts of practical implementation challenges.

Where in the past, government open data programmes have often involved going out to find data to release, the increasing presence of data science and data analytics teams in government means the emphasis is shifting onto finding problems to solve. Provided data analytics teams recognise the idea of ‘data as a team sport’, requiring not just technical skills, but also social science, civic engagement and policy development skill sets – and provided professional values of openness are embedded in such teams – then we may be moving towards a model in which ‘vertical’ work on open data policy works alongside ‘horizontal’ problem-driven initiatives that may make less use of the language of open data, but which still benefit from a framework of openness.

Chapter discussions at the OpenGovHub, Washington DC

Political economy really matters

It’s been really good to see the insights that can be generated by bringing different chapters of the book into conversation. For example, at the Berkman-Klein Centre, comparing and contrasting attitudes in North America vs. North Africa towards the idea that governments might require transport app providers like Uber to share their data with the state revealed the different layers of concern, from differences in the market structure in each country, to different levels of trust in the state. Or as danah boyd put it in our discussions at Data and Society, “what do you do when the government is part of your threat model?” This presents interesting challenges for the development of transnational (open) data initiatives and standards – calling for a recognition that the approach that works in one country (or even one city) may not work so well in others. Research still does too little to take into account the particular political and market dynamics that surround successful open data and data analytic projects.

A comparison across sectors, emerging from our ‘world cafe’ with State of Open Data authors at the OpenGovHub, also shows the trade-offs to be made when designing transparency, open data and data sharing initiatives. For example, where the extractives transparency community has the benefit of hard law to mandate certain disclosures, such law is comparatively brittle, and does not always result in the kind of structured data needed to drive analysis. By contrast, open contracting, in relying on a more voluntary and peer-pressure model, may be able to refine its technical standards more iteratively, but perhaps at the cost of weaker mechanisms to enforce comprehensive disclosure. As Noel Hidalgo put it, there is a design challenge in making a standard that is a baseline, on top of which more can be shared, rather than one that becomes a ceiling, where governments focus on minimal compliance.

It is also important to recognise that when data has power, many different actors may seek to control, influence and ultimately mess with it. As data systems become more complex, the vectors for attack can increase. In discussions at Data & Society, we briefly touched on one case where a government institution has had to take considerable steps to correct for external manipulation of its network of sensors. When data is used to trigger direct policy response (e.g. weather data triggering insurance payouts, or crime data triggering policing action), then the security and scrutiny of that data becomes even more important.

Open data as a strategic tool for data justice

I heard the question “Is open data dead?” a few times over this week. As the introductory presentation I gave for a few talks noted, we are certainly beyond peak open data hype. But, the jury is, it seems, still very much out on the role that discourses around open data should play in the decade ahead. At our Berkman-Klein Centre roundtable, Laura Bacon shared work by Omidyar/Luminate/Dalberg that offered a set of future scenarios for work on open data, including the continued existence of a distinct open data field, and an alternative future in which open data becomes subsumed within some other agenda such as ‘data rights’. However, as we got into discussions at Data & Society of data on police violence, questions of missing data, and debates about the balancing act to be struck in future between publishing administrative data and protecting privacy, the language of ‘data justice’ (rather than data rights) appeared to offer us the richest framework for thinking about the future.

Data justice is broader than open data, yet open data practices may often be a strategic tool in bringing it about. I’ve been left this week with a sense that we have not done enough to date to document and understand ways of drawing on open data production, consumption and standardisation as a form of strategic intervention. If we had a better language here, better documented patterns, and a stronger evidence base on what works, it might be easier to both choose when to prioritise open data interventions, and to identify when other kinds of interventions in a data ecosystem are more appropriate tools of social progress and justice.

Ultimately, a lot of discussions the book has sparked have been less about open data per se, and much more about the shape of data infrastructures, and questions of data interoperability. In discussions of Open Data and Artificial Intelligence at the OpenGovHub, we explored the failure of many efforts to develop interoperability within organisations and across organisational boundaries. I believe it was Jed Miller who put the challenge succinctly: to build interoperable systems, you need to “think like an organiser” – recognising data projects also as projects of organisational change and mass collaboration. Although I think we have mostly moved past the era in which civic technologists were walking around with an open data hammer, and seeing every problem as a nail, we have some way to go before we have a full understanding of the open data tools that need to be in everyone’s toolbox, and those that may still need a specialist.

Reconfiguring measurement to focus on openness of infrastructure

One way to support advocacy for openness, whilst avoiding reifying open data, and integrating learning from the last decade on the need to embed open data practices sector-by-sector, could be found in an updated approach to measurement. David Eaves made the point in our Berkman-Klein Centre roundtable that the number of widely adopted standards, as opposed to the number of data portals or datasets, is a much better indicator of progress.

As resource for monitoring, measuring or benchmarking open data per se becomes more scarce, there is an opportunity to look at new measurement frames that look at the data infrastructure and ecosystem around a particular problem, and ask about the extent of openness, not only of data, but also of governance. A number of conversations this week have illustrated the value of shifting the discussion onto data infrastructure and interoperability: yet (a) the language of data infrastructure has not yet taken hold, and can be hard to pin down; and (b) there is a risk of openness being downplayed in favour of a focus on centralised data infrastructures. Updating open data measurement tools to look at infrastructures and systems rather than datasets may be one way to intervene in this unfolding space.

Exploring Arts Engagement with (Open) Data

[Summary: Over the next few months I’m working with Create Gloucestershire with a brief to catalyse a range of organisational data projects. Amongst these will be a hackathon of sorts, exploring how artists and analysts might collaborate to look at the cultural education sector locally. The body of this post shares some exploratory groundwork. This is a variation cross-posted from the Create Gloucestershire website.]

Update: For part 2 – about the event we held, see Creative Lab Report: Data | Culture | Learning.

Pre-amble…

Create Gloucestershire have been exploring data for a while now, looking to understand what the ever-increasing volume of online forms, data systems and spreadsheets arts organisations encounter every day might mean for the local cultural sector. For my part, I’ve long worked with data-rich projects, focussing on topics from workers co-operatives and youth participation, to international aid and corruption in government contracting, but the cultural sector is a space I’ve not widely explored.

Often, the process of exploring data can feel like a journey into the technical: where data stands in opposition to all things creative. So, as I join CG for the next three months as a ‘digital catalyst’, working on the use of data within the organisation, I wanted to start by stepping back, and exploring the different places at which data, art and creativity meet with an exploratory blog post.

…and a local note on getting involved…

In a few weeks (late February 2019) we’ll be exploring these issues through a short early-evening workshop in Stroud: with a view to hosting a day-long data-&-art hackathon in late Spring. If you would like to find out more, drop me a line.

Post: Art meets data | Data meets art

For some, data and art are diametrically opposed. Data is about facts. Art about feelings.

Take a look at writings from the data visualisation community [1], and you will see some suggest that data art is just bad visualisation. Data visualisation, the argument runs, uses graphical presentation to communicate information concisely and clearly. Data art, by contrast, places beauty before functionality. Aesthetics before information.

Found on Flickr: “I’m not even sure what this chart says … but I think its gorgeous!” (Image CC-BY Carla Gates / Original image source: ZSL)

I prefer to see data, visualisation and art all as components of communication. Communication as the process of sharing information, knowledge and wisdom.

The DIKW pyramid proposes a relationship between Data, Information, Knowledge and Wisdom, in which information involves the representation of data into ‘knowing that’, whilst knowledge requires experience to ‘know how’, and wisdom requires perspective and trained judgement in order to ‘know why’. (Image CC BY-SA. Wikimedia Commons)

Turning data into information requires a process of organisation and contextualisation. For example, a collection of isolated facts may be made more informative when arranged into a table. That table may be made more easily intelligible when summarised through counts and averages. And it may communicate more clearly when visualisation is included.
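As a minimal sketch of that journey from isolated facts to a summarised table, the snippet below groups some rows and reports counts and averages. The school names, subjects and entry numbers are invented for illustration, and `summarise` is a hypothetical helper rather than anything used in the analysis pictured here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical isolated facts: (school, subject, exam entries).
# All values are invented for illustration only.
facts = [
    ("School A", "Art & Design", 42),
    ("School A", "Music", 15),
    ("School B", "Art & Design", 30),
    ("School B", "Music", 9),
    ("School B", "Drama", 21),
]


def summarise(rows):
    """Group rows by subject and return a count and an average for each."""
    by_subject = defaultdict(list)
    for _school, subject, entries in rows:
        by_subject[subject].append(entries)
    return {
        subject: {"schools": len(values), "mean_entries": mean(values)}
        for subject, values in by_subject.items()
    }


summary = summarise(facts)

# Arrange the summary as a small table, one line per subject.
for subject, stats in sorted(summary.items()):
    print(f"{subject:<14} schools={stats['schools']} mean={stats['mean_entries']:.1f}")
```

The same summary could then be handed to a charting tool; the point is only that each step adds organisation and context to the raw facts.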

An Information -> Data -> Information journey. GCSE Results in Arts Subjects. (Screenshots & own analysis)

But when seeking to communicate a message from the data, there is another contextualisation that matters: contextualising to the recipient: to what they already know, or what you may want them to come to know. Here, the right tools may not only be those of analysis and visualisation, but also those of art: communicating a message shaped by the data, though not entirely composed of it.

Artistic expression could focus on a finding, or a claim from the data, or may seek to support a particular audience to explore, interrogate and draw interpretations from a dataset. (Image CC BY-SA Toby Oxborrow)

In our upcoming workshop, we’ll be taking a number of datasets about the state of cultural education in Gloucestershire, and asking what they tell us. We’ll be thinking about the different ways to make sense of the data, and the ways to communicate messages from it. My hope is that we will find different ways to express the same data, looking at the same topic from a range of different angles, and bringing in other data sources of our own. In that way, we’ll be able to learn together both about practical skills for working with data, and to explore the subjects the data represents.

In preparing for this workshop I’ve been looking at ways different practitioners have connected data and art, through a range of media, over recent years.

The Open Data Institute: Data as Culture

Since its inception, The Open Data Institute in London has run a programme called ‘Data as culture’, commissioning artists to respond to the increasing datification of society.

Some works take a relatively direct approach to representation, selecting particular streams of data from the web and using different media to represent them. Text trends, for example, selects and counterposes different Google search trends on a simple graph over time. And the ODI’s infamous vending machine provides free crisps in response to news media mentions of recession.

Text Trends. From ODI Website and Data Soliloquies book.

In representative works, the artist has chosen the signal to focus on, and the context in which it is presented. However, the underlying data remains more or less legible, and depending on the contextual media and the literacies of the ‘reader’, certain factual information can also be extracted from the artwork. Whilst it might be more time-consuming to read, the effort demanded by both the act of creation, and the act of reading, may invite a deeper engagement with the phenomena described by the data. London EC2 explores this idea of changing the message through changing the media: by woodblock printing Twitter messages, thus slowing down the pace of social media, encouraging the viewer to rethink otherwise ephemeral information.

In other works that are directly driven by datasets, data is used more to convey an impression than to convey specific information. In the knitted Punchcard Economy banners, a representation of working hours is combined with a pre-defined message, resulting in data that can be read as texture, more than it can be read as pattern. In choosing how far to ‘arrange’ the data, the work finds its place on a spectrum between visualisation and aesthetic organisation.

Punchcard Economy, Sam Meech, 2013. ODI: 3.5 x 0.5m knitted banner, FutureEverything: 5 x 3m knitted banner & knitting machines.

Other works in the data as culture collection start not from datasets, but from artists’ responses to wider trends of datification. Works such as metography, flipped clock and horizon respond to forms of data and its presentation in the modern world, raising questions about data and representation – but not necessarily about the specific data which happens to form part of the work.

Flipped Clock, Thomson & Craighead, 2008. ODI Data as Culture.

Other works still look for the data within art, such as pixelquipu, which takes its structure from pre-Columbian quipu (necklace-shaped, knotted threads from the Inca empire, that are thought to contain information relating to calendars and accounting in the empire). In these cases, turning information into data, and then representing it back in another way, is used to explore patterns that might not have otherwise been visible.

YoHa: Invisible Airs

Although it has also featured in the ODI’s Data as Culture collection, I want to draw out and look specifically at YoHa’s ‘Invisible Airs’ project. Not least because it was the first real work of ‘open data art’ I encountered, stumbling across it at an event in Bristol.

As newly released public spending records appear on screen, a pneumatically powered knife stabs a library book, sending a message about budget cuts, and inviting scrutiny of the data on screen.

It is a hard project to describe, but fortunately YoHa have a detailed project description and video on their website, showing the contraptions (participatory kinetic sculptures?) they created in 2014, driven by pneumatic tubes and actuated by information from Bristol City Council’s database of public spending.


In the video, Graham Harwood describes how their different creations (from a bike seat that rises up in response to spending transactions, to a pneumatic knife stabbing a book to highlight library service cuts) seek to ‘de-normalise’ data, not in the database designer’s sense of finding a suitable level of data abstraction, but in the sense of engaging the participant to understand otherwise dry data in new ways. The learning from the project is also instructive: in terms of exploring how far the works kept the attention of those engaging with them, or how far they were able to communicate only a conceptual point, before viewers’ attention fell away, and messages from the underlying data were lost.

Ultimately though, Invisible Airs (and other YoHa works engaging with the theme of data) are not so much communicating data, as communicating again about the role, and power, of data in our society. Their work seeks to bring databases, rather than the individual data items they contain, into view. As project commissioner Prof Jon Dovey puts it, “If you are interested in the way that power works, if you are interested in the way that local government works, if you are interested in the way that corporations work, if you are interested in the way that the state works, then data is at the heart of it…. The way your council tax gets calculated… the way your education budget gets calculated, all these things function through databases.”

Everyday data arts

Data as art need not involve costly commissions. For example, the media recently picked up on the story of a German commuter who had knitted a ‘train-delay scarf’, with choice of wool and colour representing length of delays. The act of creating was both a means to record, and to communicate, and in the process communicated much more effectively than the same data might have done if simply recorded in a spreadsheet, or even placed onto a chart with data visualisation.

‘Train Delay Scarf’ – a Twitter sensation in January 2019.

Data sculpture and data-driven music

In a 2011 TED Talk, Nathalie Miebach explored how weather data can be turned into a work of art through sculpture and music, and questioned how the setting in which the resulting work is shown affects how it is perceived.

She describes the creation of a vocabulary for turning the data into a creative work, but also the choice of a medium that is not entirely controlled by the data, such that the resulting work is shaped not only by the data, but also by its interaction with other environmental factors.

Dance your PhD, and dancing data

When reflecting on data and art, I was reminded of the annual Dance your PhD competition. Although the focus is more on expressing algorithms and research findings, than underlying datasets, it offers a useful way to reflect on ways to explain data, not only express what it contains.

In a similar vein, AlgoRythmics explain sorting algorithms using folk dance – a playful way of explaining what’s going on inside the machine when processing data.

There is an interesting distinction though between these two. Whilst Dance your PhD entries generally ‘annotate’ the dance with text to explain the phenomena that the dance engages the audience with, in AlgoRythmics, the dance itself is the entirety of the explanation.

Visualisation

The fields of InfoViz and DataViz have exploded over the last decade. Blogs such as InformationIsBeautiful, Flowing Data and Visualising Data provide a regular dose of new maps, charts and novel presentation of data. However, InfoViz and DataViz are not simply synonyms: they represent work that starts from different points of a Data/Information/Knowledge model, and often with different goals in mind.

Take, for example, David McCandless’ work in the ‘Information is Beautiful’ book (also presented in this TED Talk). The images, although often based on data, are not a direct visualisation of the data, but an editorialised story. The data has already been analysed to identify a message before it is presented through charts, maps and diagrams.

By contrast, in Edward Tufte’s work on data visualisation, or even statistical graphics, the role of visualisation is to present data in order to support the analytical process and the discovery of information. Tufte talks of ‘the thinking eye’, highlighting the way in which patterns that may be invisible when data is presented numerically, can become visible and intelligible when the right visual representation is chosen. However, for Tufte, the idea of the correct approach to visualisation is important: presenting data effectively is both an art and a technical skill, informed by insights and research from art and design, but fundamentally something that can be done right, or done wrong.

Graphical Practices: Page 14 of Edward Tufte ‘The Visual Display of Quantitative Information’

Other data visualisation falls somewhere between the extremes I’ve painted here. Exploratory data visualisations can seek both to support analysis and to tell a particular story through their selection of visualisation approach. A look at the winners of the recent 360 Giving Data Visualisation Challenge illustrates this well. Each of these visualisations draws on the same open dataset about grant making, but where ‘A drop in the bucket’ uses a playful animation to highlight the size of grants from different funders, Funding Themes extracts topics from the data and presents an interactive visualisation, inviting users to ‘drill down’ into the data and explore it in more depth. Others, like trend engine, use more of a dashboard approach to present data, allowing the user to skim through and find, if not complete answers, at least refined questions that they may want to ask of the raw dataset.

Funding Trends for a ‘cluster’ of arts-related grants, drawing on 360 Giving data. Creator: Xavi Gimenez

Arts meet data | Data meet arts | Brokering introductions

Writing this post has given me a starting point to explore some data-art-dichotomies and to survey and link to a range of shared examples that might be useful for conversations in the coming weeks.

It’s also sparked some ideas for workshop methods we might be able to use to keep analytical, interpretative and communicative modes in mind when planning for a hackathon later this year. But that will have to wait for a future post…

 

Footnotes

[1]: I am overstating the argument in the blog post on art and data visualisation slightly for effect. The post, and comments in fact offer a nuanced dialogue worth exploring on the relationship of data visualisation and art, although still seeking to draw a clear disjunct relationship.

On the journey: five reflections from #ocglobal17 (Open Contracting Global)

At its heart, open contracting is a simple idea: whenever public money and resources are at stake through contracting processes, transparency and participation should be the norm.

Yet, as the Open Contracting Global Summit (#ocglobal17) in Amsterdam this week has demonstrated, it’s also an idea that brings together a very wide community. Reflecting on conversations from the week, I’ve tried here to capture five key reflections on where we are at, and where we might be heading:

(1) It’s not just procurement

Although the open contracting emphasis is often on the way governments buy goods and services, there are many other contracts where public resources are at stake: from licenses and concessions, to Public Private Partnership deals and grant agreements.

These each have different dynamics, and different approaches might be needed to open up each kind of process.

The Open Contracting Data Standard (OCDS) is primarily designed around procurement processes, although at OCGlobal we gave the first public preview of the OCDS for PPPs profile, that extends the OCDS data model to provide a structured way of recording in-depth disclosures for Public Private Partnership deals.

(2) It’s not just JSON

Thanks to Article 19, the corridors at OCGlobal had been turned into a ‘gallery of redaction’. Copies of contracting documents obtained through FOI requests provided tantalising hints of government and private sector deals: yet with all the key facts blacked out. These stood as a reminder of how many times the public are kept in the dark over contracts.

Neither documents, nor data, on their own will answer all the questions citizens or companies might have about contracting. Nor will they automatically spark the kinds of participation, scrutiny and engagement that are the essential complement of transparency.

Although publication of standardised data might be the most concrete manifestation of open contracting, it’s problematic to conflate transparency or open contracting with use of the OCDS JSON schema. Indeed, the 5-star model published as part of the guidance for OCDS 1.0 highlights that governments can take their first steps towards open contracting data by publishing any contracting information on the web, stepping up to machine-readability and standardised data as capacity allows.

Any other approach risks making the perfect into the enemy of the good: preventing publication until data is perfect.

The challenge ahead is in designing and refining the incentive structures that make sure open contracting efforts do not stop at getting a few documents online, or some fields in a JSON dataset – but instead that over time they broaden and deepen both disclosure, and effective use of the information that has been made available.

(3) It’s an iterative journey

There’s a much refreshed implementation section on the Open Contracting website, curating a range of guidance and tools to help put open contracting ideas into practice. The framing of a linear ‘seven steps’ journey towards open contracting is replaced with a ‘hopscotch’ presentation of the steps involved: with interlocking cycles of development and use.

This feels much closer to the reality I’ve experienced supporting open contracting implementations, which involve a dance back and forward between a vision for disclosure, and the reality of getting data and documents published from legacy systems, transparency features added to systems that are in development, or policies and practice changed to support greater citizen engagement in the contracting process.

There was a lot of talk at OC Global about e-procurement systems as the ideal source of open contracting data: yet for many countries, effective e-procurement deployments are a long way off, and so it’s important to keep in mind different ways tools like OCDS can be used:

  • Based-on – OCDS can provide a guide for reviewing and reflecting on current levels of disclosure, and for turning unstructured information into data to analyse. This is the approach pioneered by projects like Budeshi, who started out transcribing documents to data to demonstrate the value that a more data-driven approach could have to procurement monitoring.

  • Bolt-on – OCDS can be used as the target format when exporting data from existing contracting data systems. These might be reporting systems that capture regular monitoring returns on the contracting process, or transactional systems through which procurement is run. Here, the process of mapping existing data to OCDS can often reveal data quality issues in the source systems – and with the right feedback loops, this can lead to not only data publication, but also changes to improve data in future.

  • Built-in – OCDS can be used to inform the design of new systems – providing common shared data models, and a community where extended modelling of data can be discussed. However, it’s important to remember that building new systems is not just about data structures – it’s also about user experience, and right now, the OCDS doesn’t address this.
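As a rough sketch of the ‘bolt-on’ pattern above, the snippet below maps one row from an imagined legacy system into a simplified, OCDS-style release, and records data quality issues as it goes. The legacy field names, the `to_release` helper and the `ocds-xxxxxx` prefix are all hypothetical, and the release is deliberately partial: the field names follow my understanding of the OCDS release structure, but the output has not been validated against the schema and a real release would need more detail.

```python
# Hypothetical row exported from a legacy procurement system.
# Field names and values are invented for illustration.
legacy_row = {
    "ref": "2017-0042",
    "title": "School roof repairs",
    "dept": "Education Department",
    "amount": "125000",
    "currency": "",  # left blank in the source system
    "published": "2017-11-01",
}

# Placeholder only: real publishers register their own ocid prefix.
OCID_PREFIX = "ocds-xxxxxx"


def to_release(row):
    """Map a legacy row to a simplified OCDS-style release.

    Returns the release together with a list of data quality issues
    found in the source row, so they can be fed back to its owners.
    """
    issues = []

    try:
        amount = float(row["amount"])
    except (KeyError, ValueError):
        amount = None
        issues.append("amount missing or not numeric")

    currency = row.get("currency") or None
    if currency is None:
        issues.append("currency missing")

    release = {
        "ocid": f"{OCID_PREFIX}-{row['ref']}",
        "id": f"{row['ref']}-tender",
        "date": f"{row['published']}T00:00:00Z",
        "tag": ["tender"],
        "initiationType": "tender",
        "buyer": {"name": row["dept"]},
        "tender": {
            "id": row["ref"],
            "title": row["title"],
            "value": {"amount": amount, "currency": currency},
        },
    }
    return release, issues


release, issues = to_release(legacy_row)
print(release["ocid"], issues)
```

Returning the issues alongside the release is the design choice that matters here: it keeps the feedback loop to the source system visible, rather than quietly publishing incomplete data.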

To my mind, OCDS provides a structured framework that should support use in all these different ways. As we iterate on the standard itself, it’s important we don’t undermine this flexibility – but that instead we use it to establish common ground on which publishers and users can debate issues of data quality. With the standard, those debates should be actionable: but it’s not up to the standard itself to settle them.

(4) Contracting is core: but it doesn’t start or end there

Contracting is just one of the government processes that affects how resources are allocated and used. Before contracting starts, budgets are often set, or wide-reaching procurement plans established. During contract implementation, payment processes kick in. And for the private companies involved in public contracts, there are all sorts of interlocking processes of registration, financing and taxation.

From an architectural perspective it’s important for us to understand the boundaries of the open contracting process, and how it can link up with other processes. For example, whilst OCDS can capture budget information as part of a contracting process (e.g. the amount of budget allocated to that process), it starts stretching the data model to represent a budget process nested within a contracting process.

As one of the break-out groups looking at budget, contract and spend integration noted, the key to joining up data is not putting everything in the same dataset or system, but comes from establishing business processes that ensure common identifiers are used to join up the systems that manage parallel processes.
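As a minimal sketch of that point, assume three separately managed systems that each record a shared identifier. The `process_id` field name, the record layouts and the values here are all invented for illustration, and are not taken from OCDS or any real system:

```python
# Hypothetical records from three separate systems; the only thing they
# share is a common identifier ("process_id" is an assumed field name).
budget_lines = [
    {"process_id": "P-001", "budget_allocated": 100000},
    {"process_id": "P-002", "budget_allocated": 50000},
]
contracts = [
    {"process_id": "P-001", "contract_value": 95000},
]
payments = [
    {"process_id": "P-001", "amount_paid": 40000},
    {"process_id": "P-001", "amount_paid": 30000},
    {"process_id": "P-003", "amount_paid": 5000},  # no matching budget line
]


def join_by_identifier(budget_lines, contracts, payments):
    """Build one summary per identifier without merging the source systems."""
    summary = {}

    def entry(pid):
        # Create an empty summary the first time an identifier is seen.
        return summary.setdefault(
            pid,
            {"budget_allocated": None, "contract_value": None, "amount_paid": 0},
        )

    for line in budget_lines:
        entry(line["process_id"])["budget_allocated"] = line["budget_allocated"]
    for contract in contracts:
        entry(contract["process_id"])["contract_value"] = contract["contract_value"]
    for payment in payments:
        entry(payment["process_id"])["amount_paid"] += payment["amount_paid"]
    return summary


joined = join_by_identifier(budget_lines, contracts, payments)
for pid in sorted(joined):
    print(pid, joined[pid])
```

The join itself is trivial once the identifier exists in every system. A payment with no matching budget line (P-003 here) surfaces straight away, which is the kind of gap that only a change in business process can close.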

There’s a lot of work to do before we have easy interoperability between different parts of an overall accountability architecture – but the biggest issues are of data production and use, rather than of standards and schemas.

(5) It’s time to tidy our terminology

The open contracting community is broad, and, as I recently wrote over here, “the word ‘standard’ means different things to different people.” So does contracting. And tender. And validation. And assessment. And so on.

Following recent workshops in London and Argentina, the OCDS team have been thinking a lot about how we tighten up our use of key terms and concepts, establishing a set of draft translation principles and policies, and we’ve been reflecting more on how to also be clearer on ideas like data validity, quality and feedback.

But we also have to recognise that debates over language are laden with power dynamics: and specialist language can be used to impose or exclude. Open contracting should not be about dumbing down complex processes of contracting, but nor should it be about requiring every citizen to learn procurement-speak. Again, for OCDS and other tools designed to support open contracting, we have a balancing act: creating boundary objects that help different communities meet in the middle.

The first step towards this is just working out how we’re using words at the moment: checking on current practice, before working out how we can improve.

Gratitude

Aside from sparking a wealth of learning, the other thing an event like #OCGlobal17 does is remind me just how fortunate I am to get to work with such an inspiring network of people: exploring challenging issues with a great collaborative spirit. Thanks all!

The reflections above are more or less fragmentary, and I’m looking forward to working with many of the folk in the picture below to see where the journey takes us next.

Open data for tax justice: the real design challenge is social

[Summary: Thinking aloud about a pragmatic / humanist approach to data infrastructure building]

Stephen Abbott Pugh of Open Knowledge International has just blogged about the Open Data for Tax Justice ‘design sprint’ that took place in London on Monday and Tuesday. I took part in the first day and a half of the workshop, and found myself fairly at odds with the approach being taken, which focussed narrowly on the data-pipelines based creation of a centralised dataset, and which appeared to create barriers rather than bridges between data and domain experts. Rather than rethink the approach, as I would argue is needed, the Open Knowledge write-up appears to show the Open Data for Tax Justice project heading further down this flawed path.

In this post, I’m offering an (I hope) constructive critique of the approach, trying to draw out some more general principles that might inform projects to create more participatory data infrastructures.

The context

As the OKI post relates:

“Country-by-country reporting (CBCR) is a transparency mechanism which requires multinational corporations to publish information about their economic activities in all of the countries where they operate. This includes information on the taxes they pay, the number of people they employ and the profits they report.”

Country by Country reporting has been a major ask of tax justice campaigners since the early 2000s, in order to address tax avoidance by multi-national companies who shift their profits around the world through complex corporate structures and internal transfers. CBCR got a major boost in 2013 with the launch of reporting requirements for EU Banks to publicly disclose Country by Country reports under the CRD IV regulations. In the extractives sector, campaigners have also secured regulations requiring disclosure of tax and licensing payments to government on a project-by-project basis.

Although in the case of UK extractives firms, reporting is taking place to Companies House as structured data, with an API available to access reports, for EU Banks, reporting is predominantly in the form of tables at the back of PDF format company reports.

If campaigners are successful, public reporting will be extended to all EU multinationals, holding out the prospect of up to 6000 more annual reports that can provide a breakdown of turnover, profit, tax and employees country-by-country. If the templates for disclosure are based on existing OECD models for private exchange between tax authorities, the data may also include information on the different legal entities that make up a corporate group, important for public understanding of the structure of the corporate world.

Earlier this year, a report from Alex Cobham, Jonathan Gray and Richard Murphy set out a number of use-cases for such data, making the case that “a global public database on the tax contributions and economic activities of multinational companies” would be an asset for a wide range of users, from journalists to civil society and investors.

Sprinting with a data-pipelines hammer

This week’s design sprint focussed particularly on ‘data extraction’, developing a set of data pipeline scripts and processes that involve downloading a report PDF, marking up the tables where Country by Country data is stored, describing what each column contains using YAML, and then committing this to GitHub where the process can then be replicably run using datapipeline commands. Then, with the data extracted, it can be loaded into an SQL database, and explored by writing queries or building simple charts. It’s a technically advanced approach, and great for ensuring replicability of data extraction.

But it’s also an approach that ultimately entirely misses the point, ignoring the social process of data production, creating technical barriers instead of empowering contributors and users, and offering nothing for campaigners who want to ensure that better data is produced ‘at source’ by companies.

Whilst the OKI blog post reports that “The Open Data for Tax Justice network team are now exploring opportunities for collaborations to collect and process all available CRD IV data via the pipeline and tools developed during our sprint”, I want to argue for a refocussed approach, based around a much closer look at the social dynamics of data creation and use.

An alternative approach: crafting collaborations

I’ve tried below to unpack a number of principles that might guide that alternative approach:

Principle 1: Letting people use their own tools

Any approach that involves downloading, installing, signing-up to, configuring or learning new software in order to create or use data is likely to exclude a large community of potential users. If the data you are dealing with is tabular: focus on spreadsheets.

More technical users can transform data into database formats when the questions they want to answer require the additional power that brings, but it is better if the starting workflow is configured to be accessible to the largest number of likely users.

Back in October I put together a rough prototype of a Google spreadsheets based transcription tool for Country by Country reports, that needed just copy-and-paste of data, and a few selections from validated drop-down lists to go from PDFs to normalised data – allowing a large user community to engage directly with the data, with almost zero learning curve.

The only tool this approach needs to introduce is something like tabula or PDFTables to convert from PDF to Excel or CSV: but in this workflow the data comes right back to the user to be able to work with it after it has been converted, rather than being taken away from them into a longer processing pipeline. Plus, it brings the benefit of raising awareness of data extraction from PDF that the user can adopt for other projects in future, and allowing the user to work-around failed conversions using a manual transcription approach if they need to.

(Sidenote: from discussions, I understand that one of the reasons the OKI team made their technical choice was from envisaging the primary users as ‘non-experts’ who would engage in crowdsourcing transcriptions of PDF reports. I think this is highly optimistic, and that it relies on a flawed analysis both of the relatively small scale of the crowdsourcing task, at a few thousand reports a year, and of the potential benefits of involving a more engaged group of contributors in creating a civil society database.)

Principle 2: Aim for instant empowerment

One of the striking things about Country by Country reporting data is how simple it ultimately is. The CRD IV disclosures contain just a handful of measures (turnover, pre-tax profits, tax paid, number of employees), a few dimensions (company name, country, year), and a range of annotations in footnotes or explanations. The analysis that can be done with this data is similarly simple – yet also very powerful. Being able to go from a PDF table of data, to a quick view of the ratios between turnover and tax, or profit and employees for a country can quickly highlight areas to investigate for profit-shifting and tax-avoidance behaviour.

Calculating these ratios is possible almost as soon as you have data in a spreadsheet form. In fact, a well set up template could calculate them directly, or the user with basic ability to write formula could fill in the columns they need.
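To give a sense of how little is involved, here is a rough sketch of the same calculation outside a spreadsheet. The figures are invented purely for illustration (they are not from any real CRD IV disclosure), and the column names are my own assumptions:

```python
# Illustrative figures only: these rows are invented, not taken from any
# real disclosure. Column names are assumptions made for this sketch.
rows = [
    {"country": "Country A", "turnover": 1000.0, "profit": 100.0,
     "tax": 25.0, "employees": 500},
    {"country": "Country B", "turnover": 200.0, "profit": 150.0,
     "tax": 1.5, "employees": 2},
]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero or missing."""
    if not denominator:
        return None
    return numerator / denominator


def add_ratios(row):
    """Return a copy of the row with the two ratios discussed above added."""
    enriched = dict(row)
    enriched["tax_to_turnover"] = safe_ratio(row["tax"], row["turnover"])
    enriched["profit_per_employee"] = safe_ratio(row["profit"], row["employees"])
    return enriched


results = [add_ratios(row) for row in rows]
for result in results:
    print(result["country"], result["tax_to_turnover"], result["profit_per_employee"])
```

In this made-up example, Country B reports very high profit per employee alongside very little tax relative to turnover: exactly the kind of pattern that would prompt a closer look at the footnotes and the wider context.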

Many of the use-cases for Country by Country reports are based not on aggregation across hundreds of firms, but on simply understanding the behaviour of one or two firms. Investigators and researchers often have firms they are particularly interested in, and where the combination of simple data, and their contextual knowledge, can go a long way.

Principle 3: Don’t drop context

On the topic of context: all those footnotes and explanations in company reports are an important part of the data. They might not be computable, or easy to query against, but in the data explorations that took place on Monday and Tuesday I was struck by how much the tax justice experts were relying not only on the numerical figures to find stories, but also on the explanations and other annotations from reports.

The data pipelines approach dropped these annotations (and indeed dropped anything that didn’t fit into its schema). An alternative approach would work from the principle that, as far as possible, nothing of the source should be thrown away – and structure should be layered on top of the messy reality of accounting judgements and decisions.

Principle 4: Data making is meaning-making

A lot of the analysis of Country by Country reporting data is about looking for outliers. But data outliers and data errors can look pretty similar. Instead of trying to separate the process of data preparation and analysis, these two need to be brought closer together.

Creating a shared database of tax disclosures will involve not only processes of data extraction, but also processes of validation and quality control. It will require incentives for contributors, and will require attention to building a community of users.

Some of the current structured data available from Country by Country reports has been transcribed by University students as part of their classes – where data was created as a starting point for a close feedback loop of data analysis. The idea of ‘frictionless data’ makes sense when it comes to getting a list of currency codes, but when it comes to understanding accounts, some ‘friction’ of social process can go a long way to getting reliable data, and building a community of practice who understand the data in more depth.

Principle 5: Standards support distributed collaboration

One of the difficulties in using the data mentioned above, prepared by a group of students, was that it had been transcribed and structured to solve the particular analytical problem of the class, and not against any shared standard for identifying countries, companies or the measures being transcribed.

The absence of agreement on key issues such as codelists for tax jurisdictions, company identifiers, codes and definitions of measures, and how to handle annotations and missing data means that the data that is generated by different researchers, or even different regulatory regimes, is not comparable, and can’t be easily combined.
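As a rough sketch of what even a very small shared codelist makes possible, the example below pools rows from two hypothetical transcription efforts by mapping the jurisdiction labels each used onto ISO 3166-1 alpha-2 codes. The codelist contents, field names and figures are all illustrative assumptions; a real codelist would need to be agreed and maintained collaboratively:

```python
# A tiny, hand-written codelist mapping labels seen in reports to
# ISO 3166-1 alpha-2 codes. Illustrative only, and far from complete.
JURISDICTION_CODES = {
    "united kingdom": "GB",
    "uk": "GB",
    "luxembourg": "LU",
    "ireland": "IE",
    "republic of ireland": "IE",
}


def normalise_jurisdiction(label):
    """Return the shared code for a label, or None if the codelist lacks it."""
    return JURISDICTION_CODES.get(label.strip().lower())


def combine(*datasets):
    """Pool rows from several transcribers, keeping each original label as context."""
    pooled, unmatched = [], []
    for dataset in datasets:
        for row in dataset:
            code = normalise_jurisdiction(row["jurisdiction"])
            record = {**row, "jurisdiction_code": code}
            # Rows without a code are set aside for review, not thrown away.
            (pooled if code else unmatched).append(record)
    return pooled, unmatched


group_a = [{"jurisdiction": "UK", "tax": 10}]
group_b = [
    {"jurisdiction": "United Kingdom ", "tax": 12},
    {"jurisdiction": "Other Europe", "tax": 3},
]
pooled, unmatched = combine(group_a, group_b)
print(pooled)
print(unmatched)
```

Labels like "Other Europe", which companies do use in their reports, have no single code. Deciding how to handle them is precisely the kind of question a standards discussion needs to settle, rather than leaving it to each transcriber.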

The data pipelines approach is based on rendering data comparable through a centralised infrastructure. In my experience, such approaches are brittle, particularly in the context of voluntary collaboration, and they tend to create bottlenecks for data sharing and innovation. By contrast, an approach based on building light-weight standards can support a much more distributed collaboration approach – in which different groups can focus first on the data that is of most interest to them (for example, national journalists focussing on the tax record of the top-10 companies in their jurisdiction), easily contributing data to a common pool later when their incentives are aligned.

Campaigners also need to be armed with use-case backed proposals for how disclosures should be structured in order to push for the best quality disclosure regimes.

What’s the difference?

Depending on your viewpoint, the approach I’ve started to set out above might look more technically ‘messy’ – but I would argue it is more in-tune with the social realities of building a collaborative dataset of company tax disclosures.

Fundamentally (with the exception perhaps of standard maintenance, although that should be managed as a multi-stakeholder project long-term) – it is much more decentralised. This is in line with the approach in the Open Contracting Data Standard, where the Open Contracting Partnership have stuck well to their field-building aspirations, and where many of the most interesting data projects emerge organically at the edge of the network, only later feeding into cross-collaboration.

Even then, this sketch of an alternative technical approach is only part of the story in building a better data-foundation for action to address corporate tax avoidance. There will still be a lot of labour to create incentives, encourage co-operation, manage data quality, and build capacity to work with data. But better we engage with that labour, than spend our efforts chasing after frictionless dreams of easily created perfect datasets.

Following the money: preliminary remarks on IATI Traceability

[Summary: Exploring the social and technical dynamics of aid traceability: let’s learn what we can from distributed ledgers, without thinking that all the solutions are to be found in the blockchain.]

My colleagues at Open Data Services are working at the moment on a project for UN Habitat around traceability of aid flows. With an increasing number of organisations publishing data using the International Aid Transparency Initiative data standard, and increasing amounts of government contracting and spending data available online, the theory is that it should be possible to track funding flows.

In this blog post I’ll try and think aloud about some of the opportunities and challenges for traceability.

Why follow funds?

I can envisage a number of hypothetical use cases for traceability of aid.

Firstly, donors want to be able to understand where their money has gone. This is important for at least three reasons:

  1. Effectiveness & impact: knowing which projects and programmes have been the most effective;
  2. Understanding and communication: being able to see more information about the projects funded, and to present information on projects and their impacts to the public to build support for development;
  3. Addressing fraud and corruption: identifying leakage and mis-use of funds.

Traceability is important because the relationship between donor and delivery is often indirect. A grant may pass through a number of intermediary organisations before it reaches the ultimate beneficiaries. For example, a country donor may fund a multi-lateral fund, which in turn commissions an international organisation to deliver a programme, and they in turn contract with country partners, who in turn buy in provision from local providers.

Secondly, communities where projects are funded, or where funds should have been received, may want to trace funding upwards: understanding the actors and policy agendas affecting their communities, and identifying when funds they are entitled to have not arrived (see the investigative work of Follow The Money Nigeria for a good example of this latter use case).

Short-circuiting social systems

It is important to consider the ways in which work on the traceability of funds potentially bypasses, ‘routes around’ or disrupts* (*choose your own framing) existing funding and reporting relationships – allowing donors or communities to reach beyond intermediaries to exert such authority and power over outcomes as they can exercise.

Take the example given above. We can represent the funding flows in a diagram as below:

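[Diagram: funding flowing downwards through the chain]

But there are more than one-way flows going on here. Most of the parties involved will have some sort of reporting responsibility to those giving them funds, and so we also have reports flowing back up the chain:

[Diagram: reporting flowing upwards through the chain]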

By the time reporting gets to the donor, it is unlikely to include much detail on the work of the local partners or providers (indeed, the multilateral, for example, may not report specifically on this project, just on the development co-operation in general). The INGO may even have very limited information about what happens just a few steps down the chain on the ground, having to trust intermediary reports.

In cases where there isn’t complete trust in this network of reporting, and clear mechanisms to ensure each party is exercising its responsibility to ensure the most effective, and corruption-free, use of resources by the next party down, the case for being able to see through this chain, tracing funds and having a direct ability to assess impacts and risks, is clear.

Yet – it also needs to be approached carefully. Each of the relationships in this funding chain is about more than just passing on some clearly defined packet of money. Each party may bring specific contextual knowledge, skills and experience. Enabling those at the top of a funding chain to leap over intermediaries doesn’t inevitably have a positive impact: particularly given what the history of development co-operation has to teach about how power dynamics and the imposition of top-down solutions can lead to substantial harms.

None of this is a case against traceability – but it is a call for consideration of the social dynamics of traceability infrastructures – and of how to ensure contextual knowledge is kept accessible when it becomes possible to traverse the links of a funding chain.

The co-ordination challenge of traceability

Right now, the IATI data standard has support for traceability at the project and transaction level.

  • At the project level the related-activity field can be used to indicate parent, child and co-funded activities.
  • At the transaction level, data on incoming funds can specify the activity-id used by the upstream organisation to identify the project the funds come from, and data on outgoing funds can specify the activity-id used by the downstream organisation.

This supports both upwards and downwards linking (e.g. a funder can publish the identifier of the funded project, or a recipient can publish the identifier of the donor project that is providing funds), but is based on explicit co-ordination and the capture of additional data.
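To make these links concrete, here is a rough sketch that reads them out of a small, invented IATI-style XML fragment. The element and attribute names (`related-activity`, `provider-org/@provider-activity-id`, `receiver-org/@receiver-activity-id`) follow my understanding of the IATI activity schema and should be checked against the published standard; the identifiers are made up, and real files carry much more detail:

```python
import xml.etree.ElementTree as ET

# Invented sample data, heavily simplified compared with a real IATI file.
SAMPLE = """
<iati-activities>
  <iati-activity>
    <iati-identifier>XX-DONOR-1</iati-identifier>
    <related-activity ref="XX-DONOR-PARENT" type="1"/>
    <transaction>
      <receiver-org receiver-activity-id="XX-NGO-7"/>
    </transaction>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-NGO-7</iati-identifier>
    <transaction>
      <provider-org provider-activity-id="XX-DONOR-1"/>
    </transaction>
  </iati-activity>
</iati-activities>
"""


def extract_links(xml_text):
    """Return (activity, link type, linked activity) tuples found in the XML."""
    links = []
    root = ET.fromstring(xml_text)
    for activity in root.iter("iati-activity"):
        activity_id = activity.findtext("iati-identifier", default="").strip()
        # Project-level links: parent, child and co-funded activities.
        for related in activity.findall("related-activity"):
            links.append(
                (activity_id, "related:" + related.get("type", ""), related.get("ref"))
            )
        # Transaction-level links: upstream and downstream activity identifiers.
        for transaction in activity.findall("transaction"):
            provider = transaction.find("provider-org")
            if provider is not None and provider.get("provider-activity-id"):
                links.append(
                    (activity_id, "funded-by", provider.get("provider-activity-id"))
                )
            receiver = transaction.find("receiver-org")
            if receiver is not None and receiver.get("receiver-activity-id"):
                links.append(
                    (activity_id, "funds", receiver.get("receiver-activity-id"))
                )
    return links


links = extract_links(SAMPLE)
for link in links:
    print(link)
```

In this sample the donor and the NGO each name the other's activity, so the chain can be followed in either direction. Nothing in the data format itself guarantees that both sides will do so.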

As a distributed approach to the publication of open data, there are no consistency checks in IATI to ensure that providers and recipients agree on identifiers, and often there can be practical challenges to capture this data, not least that:

  • A) Many of the accounting systems in which transaction data is captured have no fields for upstream or downstream project identifier, nor any way of conceptually linking transactions to these externally defined projects;
  • B) Some parties in the funding chain may not publish IATI data, or may do so in forms that do not support traceability, breaking the chain;
  • C) The identifier of a downstream project may not be created at the time an upstream project assigns funds – exchanging identifiers can create a substantial administrative burden.

At the last IATI TAG meeting in Ottawa, this led to some discussion of other technologies that might be explored to address issues of traceability.

Technical utopias and practical traceability

Let’s start with a number of assorted observations:

  • UPS can track a package right around the world, giving me regular updates on where it is. The package has a barcode on, and is being transferred by a single company.
  • I can make a faster-payments bank transfer in the UK with a reference number that appears in both my bank statements, and the recipient’s statements, travelling between banks in seconds. Banks leverage their trust, and use centralised third-party providers as part of data exchange and reconciling funding transfers.
  • When making some international transfers, the money has effectively disappeared from view for quite a while, with lots of time spent on the phone to sender, recipient and intermediary banks to track down the funds. Trust, digital systems and reconciliation services function less well across international borders.
  • Transactions on the BitCoin Blockchain are, to some extent, traceable. BitCoin is a distributed system. (Given any BitCoin ‘address’ it’s possible to go back into the public ledger and see which addresses have transferred an amount of bitcoins there, and to follow the chain onwards. If you can match an address to an identity, the currency, far from being anonymous, is fairly transparent*. This is the reason for BitCoin mixer services, designed to remove the trackability of coins.)
  • There are reported experiments with using BlockChain technologies in a range of different settings, including for land registries.
  • There’s a lot of investment going into FinTech right now – exploring ways to update financial services.

All of this can lead to some excitement about the potential of new technologies to render funding flows traceable. If we can trace parcels and BitCoins, the argument goes, why can’t we have traceability of public funds and development assistance?

Although I think such an argument falls down in a number of key areas (which I’ll get to in a moment), it does point towards a key component missing from the current aid transparency landscape – in the form of a shared ledger.

One of the reasons IATI is based on a distributed data publishing model, without any internal consistency checks between publishers, is prior experience in the sector of submitting data to centralised aid databases. However, peer-to-peer and block-chain like technologies now offer a way to separate out co-ordination and the creation of consensus on the state of the world, from the centralisation of data in a single database.

It is at least theoretically possible to imagine a world in which the data a government publishes about its transactions is only considered part of the story, and in which the recipient needs to confirm receipt in a public ledger to complete the transactional record. Transactions ultimately have two parts (sending and receipt), and open (distributed) ledger systems could offer the ability to layer an auditable record on top of the actual transfer of funds.

However (as I said, there are some serious limitations here), such a system is only an account of the funding flows, not the flows themselves (unlike BitCoin), which still leaves space for corruption through maintaining false information in the ledger. Although trusted financial intermediaries (banks and others) could be brought into the picture, as others responsible for confirming transactions, it’s hard to envisage how adoption of such a system could be brought about over the short and medium term (particularly globally). Secondly, although transactions between organisations might be made more visible and traceable in this way, the transactions inside an organisation remain opaque. Working out which funds relate to which internal and external projects is still a matter of the internal business processes in organisations involved in the aid delivery chain.

There may be other traceability systems we should be exploring as inspirations for making aid and public money traceable. What my brief look at BitCoin leads me to reflect on is the potential role over the short-term of reconciliation services that can, at the very least, report on the extent to which different IATI publishers are mutually confirming each other’s information. Over the long-term, a move towards more real-time transparency infrastructures, rather than periodic data publication, might open up new opportunities – although with all sorts of associated challenges.
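A reconciliation report of this kind could start very simply. The sketch below assumes that funder-to-recipient declarations have already been pulled out of each publisher’s data as (publishing activity, direction, counterpart activity) tuples; that tuple format, the direction labels and the identifiers are all hypothetical, and not part of IATI:

```python
# Each tuple is (publishing activity, direction, counterpart activity).
# The format and identifiers are invented for this sketch.
declared = [
    ("XX-DONOR-1", "funds", "XX-NGO-7"),
    ("XX-NGO-7", "funded-by", "XX-DONOR-1"),
    ("XX-DONOR-1", "funds", "XX-NGO-9"),      # recipient has not confirmed
    ("XX-LOCAL-3", "funded-by", "XX-NGO-7"),  # funder has not confirmed
]


def reconcile(declared):
    """Report which funding links both sides declare, and which only one does."""
    # Normalise every declaration to a (funder, recipient) pair.
    by_funder = {(a, b) for a, direction, b in declared if direction == "funds"}
    by_recipient = {(b, a) for a, direction, b in declared if direction == "funded-by"}
    return {
        "confirmed": sorted(by_funder & by_recipient),
        "funder_only": sorted(by_funder - by_recipient),
        "recipient_only": sorted(by_recipient - by_funder),
    }


report = reconcile(declared)
for status, pairs in report.items():
    print(status, pairs)
```

A report like this settles nothing by itself about which side is right. What it does is point people to the links where a conversation between funder and recipient is needed.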

Ultimately – creating traceable aid still requires labour to generate shared conceptual understandings of how particular transactions and projects relate.

How much is enough?

Let’s loop back round. In this post (as in many of the conversations I’ve had about traceability), we started with some use cases for traceability; we saw some of the challenges; we got briefly excited about what new technologies could do to provide traceability; we saw the opportunities, but also the many limitations. Where do we end up then?

I think what’s important is to loop back to our use cases, and to consider how technology can help, but not completely solve, the problems set out. Knowing which provider organisations might have been funded through a particular donor’s money could be enough to help that donor target investigations in cases of fraud. Or knowing all the funders who have a stake in projects in a particular country, sector and locality can be enough for communities on the ground to do further research to identify the funders they need to talk to.

Rather than searching after a traceability data panopticon, can we focus traceability-enabling practices on breaking down the barriers to specific investigatory processes?

Ultimately, in the IATI case, getting traceability to work at the project level alone could be a big boost. But doing this will require a lot of social coordination, as much as technical innovation. As we think about tools for traceability, thinking about tools that support this social process may be an important area to focus on.

Where next

Steven Flower and the rest of the Open Data Services team will be working over the coming weeks on a deeper investigation of traceability issues – with the goal of producing a report and toolkit later this year. They’ve already been digging into IATI data to look for the links that exist so far and building on past work testing the concept of traceability against real data.

Drop in comments below, or drop Steven a line, if you have ideas to share.

Data, openness, community ownership and the commons

[Summary: reflections on responses to the GODAN discussion paper on agricultural open data, ownership and the commons – posted ahead of Africa Open Data Conference GODAN sessions]

Photo Credit - CC-BY - South Africa Tourism

Key points

  • We need to distinguish between claims to data ownership, and claims to be a stakeholder in a dataset;
  • Ownership is a relevant concept for a limited range of datasets;
  • Openness can be a positive strategy, empowering farmers vis-a-vis large corporate interests;
  • Openness is not universally good: it can also be used as a ‘data grab’ strategy;
  • We need to think critically about the configurations of openness we are promoting;
  • Commons and cooperative based strategies for managing data and open data are a key area for further exploration.

Open or owned data?

Following the publication of a discussion paper by the ODI for the Global Open Data for Agriculture and Nutrition initiative, putting forward a case for how open data can help improve agriculture, food and nutrition, debate has been growing about how open data should be approached in the context of smallholder agriculture. In this post, I explore some provisional reflections on that debate.

Respondents to the paper have pointed to the way in which, in situations of unequal power, and in complex global markets, greater accessibility of data can have substantial downsides for farmers. For example, commodity speculation based on open weather data can drive up food prices, or open data on soil profiles can be used in order to extract greater margins from farmers when selling fertilizers. A number of responses to the ODI paper have noted that much of the information that feeds into emerging models of data-driven agriculture is coming from small-scale farmers themselves: whether through statistical collection by governments, or hoovered up by providers of farming technology, all aggregated into big datasets that are practically inaccessible to local communities and farmers.

This has led to some focussing in response on the concept of data ownership: asserting that more emphasis should be placed on community ownership of the data generated at a local level. Equally, it has led to the argument that “opening data without enabling effective, equitable use can be considered a form of piracy”, making direct allusions to the biopiracy debate and the consequent responses to such concerns in the form of interventions such as the International Treaty on Plant Genetic Resources.

There are valid concerns here. Efforts to open up data must be interrogated to understand which actors stand to benefit, and to identify whether the configuration of openness sought is one that will promote the outcomes claimed. However, claims of data ownership and data sovereignty need to be taken as a starting point for designing better configurations of openness, rather than as a blocking counter-claim to ideas of open data.

Community ownership and openness

My thinking on this topic is shaped, albeit not to a set conclusion, by a debate that took place last year at a Berkman Centre Fellows Hour based on a presentation by Pushpa Kumar Lakshmanan on the Nagoya Protocol which sets out a framework for community ownership and control over genetic resources.

The debate raised the tension between the rights of communities to gain benefits from the resources and knowledge that they have stewarded, potentially over centuries, with an open knowledge approach that argues social progress is better served when knowledge is freely shared.

It also raised important questions of how communities can be demarcated (a long-standing and challenging issue in the philosophy of community rights) – and whether drawing a boundary to protect a community from external exploitation risks leaving internal patterns of power and exploitation within the community unexplored. For example, does community ownership of data really lead to certain elites in the community controlling it?

Ultimately, the debate taps into a conflict between those who see the greatest risk as being the exploitation of local communities by powerful economic actors, and those who see the greater risk as a conservative hoarding of knowledge in local communities in ways that inhibit important collective progress.

Exploring ownership claims

It is useful to note that much of the work on the Nagoya Protocol that Pushpa described was centred on controlling borders to regulate the physical transfer of plant genetic material. Thinking about rights over intangible data raises a whole new set of issues: ownership cannot just be filtered through a lens of possession and physical control.

Much data is relational. That is to say that it represents a relationship between two parties, or represents objects that may stand in ownership relationships with different parties. For example, in his response to the GODAN paper, Ajit Maru reports how “John Deere now considers its tractors and other equipment as legally ‘software’ and not a machine… [and] claims [this] gives them the right to use data generated as ‘feedback’ from their machinery”. Yet, this data about a tractor’s operation is also data about the farmer’s land, crops and work. The same kinds of ‘trade data for service’ concerns that have long been discussed with reference to social media websites are becoming an increasing part of the agriculture world. The concern here is with a kind of corporate data-grab, in which firms extract data, asserting their absolute ownership over something which is primarily generated by the farmer, and which is at best a co-production of farmer and firm.

It is in response to this kind of situation that grassroots data ownership claims are made.

These ownership claims can vary in strength. For example:

  • The stronger claim runs that ‘this is my data’, and as a consequence, I should have ultimate control over how it is used, and the ability to treat it as a personally held asset;

  • The weaker claim runs that ‘I have a stake in this data’, and as a consequence, I should have access to it, and a say in how it is used.

Which claim is relevant depends very much on the nature of the data. For example, we might allow ownership claims over data about the self (personal data), and the direct property of an individual. For datasets that are more clearly relational, or collectively owned (for example, local statistics collected by agricultural extension workers, or weather data funded by taxation), the stakeholding claim is the more relevant.

It is important at this point to note that not all (perhaps even not many) concerns about the potential misuse of data can be dealt with effectively through a property right regime. Uses of data to abuse privacy, or to speculate and manipulate markets may be much better dealt with by regulations and prohibitions on those activities, rather than attempts to restrict the flow of data through assertions of data ownership.

Openness as a strategy

Once we know whether we are dealing with ownership claims, or stakeholding claims, in data, we can start thinking about different strategic configurations of openness, that take into account power relationships, and that seek to balance protection against exploitation, with the benefits that can come from collaboration and sharing.

For example, each farmer on their own has limited power vis-a-vis a high-tech tractor maker like John Deere. Even if they can assert a right to access their own data, John Deere will most likely retain the power to aggregate data from 1000s of farmers, maintaining an inequality of access to data vis-a-vis the farmer. If the farmer seeks to deny John Deere the right to aggregate their data with that of others, the chances are that (a) they will be unsuccessful, as making an absolute ownership claim here is difficult – using the tractor was a choice after all; and (b) they will potentially inhibit useful research and use of data that could improve cropping (even if some of the other uses of the data may run counter to the farmer’s interest). Some have suggested that creating a market in the data, where the data aggregator would pay the farmers for the ability to use their data, offers an alternative path here: but it is not clear that the price would compensate the farmer adequately, or lead to an efficient re-use of data.

However, in this setting openness potentially offers an alternative strategy. If farmers argue that they will only give data to John Deere if John Deere makes the aggregated data open, then they have the chance to challenge the asymmetry of power that otherwise develops. A range of actors and intermediaries can then use this data to provide services in the interests of the farmers. Both the technology provider, and the farmer, get access to the data in which they are both stakeholders.

This strategy (“I’ll give you data only if you make the aggregate set of data you gather open”), may require collective action from farmers. This may be the kind of arrangement GODAN can play a role in brokering, particularly as it may also turn out to be in the interest of the firm as well. Information economics has demonstrated how firms often under-share information which, if open, could lead to an expansion of the overall market and better equilibria in which, rather than a zero-sum game, there are benefits to be shared amongst market actors.

There will, however, be cases in which the power imbalances between data providers and those who could exploit the data are too large. For example, the above discussion assumes intermediaries will emerge who can help make effective use of aggregated data in the interests of farmers. Sometimes (a) the greatest use will need to be based on analysis of disaggregated data, which cannot be released openly; and (b) data providers need to find ways to work together to make use of data. In these cases, there may be a lot to learn from the history of commons and co-operative structures in the agricultural realm.

Co-operative and commons based strategies

Many discussions of openness conflate the concept of openness, and the concept of the commons. Yet there is an important distinction. Put crudely:

  • Open = anyone is free to use/re-use a resource;
  • Commons = mutual rights and responsibilities towards the resource;

In the context of digital works, Creative Commons provide a suite of licenses for content, some of which are ‘open’ (they place no responsibilities on users of a resource, but grant broad rights), and others of which adopt a more regulated commons approach, placing certain obligations on re-users of a document, photo or dataset, such as the responsibility to attribute the source, and share any derivative work under the same terms.

The Creative Commons draws upon an imagery from the physical commons. These commons were often in the form of land over which farmers held certain rights to graze cattle, or of fisheries in which each fisher took shared responsibility for avoiding overfishing. Such commons are, in practice, highly regulated spaces – but ones that seek to pursue an approach based on sharing and stakeholding in resources, rather than absolute ownership claims. As we think about data resources in agriculture, reflecting more on learning from the commons is likely to prove fruitful. Of course, data, unlike land, is not finite in the same ways, nor does it have the same properties of excludability and rivalrousness.

In thinking about how to manage data commons, we might look towards another feature prevalent in agricultural production: that of the cooperative. The core idea of a data cooperative is that data can be held in trust by a body collectively owned by those who contribute the data. Such data cooperatives could help manage the boundary between data that is made open at some suitable level of aggregation, and data that is analysed and used to generate products of use to those contributing the data.

With Open Data Services Co-operative I’ve just started to dig more into learning about the cooperative movement: co-founding a workers cooperative that supports open data projects. However, we’ve also been thinking about how data cooperatives might work – and I’m certain there is scope for a lot more work in this area, helping deal with some of the critical questions that have come up for open data from the GODAN discussion paper.

Getting the incentives right: an IATI enquiry service?

[Summary: Brief notes exploring a strategic and service-based approach to improve IATI data quality]

Filed under: rough ideas

At the International Aid Transparency Initiative (IATI) Technical Advisory Group meeting (#tag2015) in Ottawa last week I took part in two sessions exploring the need for Application Programming Interfaces (APIs) onto IATI data. It quickly became clear that there were two challenges to address:

(1) Many of the questions people around the table were asking were complex queries, not the simple data retrieval kinds of questions that an API is well suited to;

(2) ‘Out of the box’ IATI data is often not able to answer the kinds of questions being asked, either because

  • (a) the quality and consistency of data from distributed sources means that there are a range of special cases to handle when performing cross-donor analysis;
  • (b) the questions asked invite additional data preparation, such as currency conversion, or identifying a block of codes that relate to a particular sector (e.g. identifying all the Water and Sanitation related codes).

These challenges also underlie the wider issue explored at TAG2015: that even though five years of effort have gone into data supply, few people are actually using IATI data day-to-day.

If the goal of the International Aid Transparency Initiative as a whole, distinct from the specific goal of securing data, is more informed decision making in the sector, then this got me thinking about the extent to which what we need right now is a primary focus on services rather than data and tools. And from that, thinking about whether intelligent funding of such services could lead to the right kinds of pressures for improving data quality.

Improving data through enquiries

Using any dataset to answer complex questions takes both domain knowledge, and knowledge of the data. Development agencies might have lots of one-off and ongoing questions, from “Which donors are spending on Agriculture and Nutrition in East Africa?”, to “What pipeline projects are planned in the next six months affecting women and children in Least Developed Countries?”. Against a suitably cleaned up IATI dataset, reasonable answers to questions like these could be generated with carefully written queries. Authoritative answers might require further cleaning and analysis of the data retrieved.
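To give a rough feel for what a ‘carefully written query’ might involve, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the flat record layout, the donor names, the sector code prefix, the country grouping and the exchange rates are all made up, and real IATI data is considerably richer and messier than this.

```python
from collections import defaultdict

# Hypothetical, heavily simplified activity records (not real IATI data).
ACTIVITIES = [
    {"donor": "Donor A", "country": "KE", "sector": "31110", "value": 100.0, "currency": "USD"},
    {"donor": "Donor B", "country": "TZ", "sector": "31120", "value": 80.0, "currency": "EUR"},
    {"donor": "Donor A", "country": "KE", "sector": "14010", "value": 50.0, "currency": "USD"},
    {"donor": "Donor C", "country": "FR", "sector": "31110", "value": 70.0, "currency": "EUR"},
]

# Assumptions for illustration only: which codes count as the sector of
# interest, which countries make up the region, and the conversion rates.
SECTOR_PREFIX = "311"
REGION = {"KE", "TZ", "UG"}
USD_RATES = {"USD": 1.0, "EUR": 1.25}


def spend_by_donor(activities, sector_prefix, countries, rates):
    """Total spend per donor, in USD, for one sector block and one region."""
    totals = defaultdict(float)
    for activity in activities:
        in_sector = activity["sector"].startswith(sector_prefix)
        in_region = activity["country"] in countries
        if in_sector and in_region:
            # Currency conversion is part of the data preparation step.
            totals[activity["donor"]] += activity["value"] * rates[activity["currency"]]
    return dict(totals)


result = spend_by_donor(ACTIVITIES, SECTOR_PREFIX, REGION, USD_RATES)
print(result)
```

Even in this toy form, the answer depends on choices (which codes, which countries, which rates) that someone with knowledge of both the domain and the data has to make.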

For someone working with a dataset every day, such queries might take anything from a few minutes to a few hours to develop and execute. Cleaning data to provide authoritative answers might take a bit longer.

For a programme officer, who has the question, but not the knowledge of the data structures, working out how to answer these questions might take days. In fact, the learning curve will mean often these questions are simply not asked. Yet, having the answers could save months, and $millions.

So – what if key donors sponsored an enquiries service that could answer these kinds of queries on demand? With the right funding structure, it could have incentives not only to provide better data on request, but also to put resources into improving data quality and tooling. For example: if there is a set price paid per enquiry successfully answered, and the cost of answering that enquiry is increased by poor data quality from publishers, then there can be an incentive on the service to invest some of its time in improving incoming data quality. How to prioritise such investments would be directly connected to user demand: if all the questions are made trickier to answer because of a particular donor’s data, then focussing on improving that data first makes most sense. This helps escape the current situation in which the goal is to seek perfection for all data. Beyond a certain point, the political pressures to publish may cease to work to increase data quality, whereas requests to improve data that are directly connected to user demand and questions may have greater traction.
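As a back-of-the-envelope sketch of that prioritisation, an enquiry service could log how much of the time spent on each enquiry went on working around problems in a particular publisher’s data, and then rank publishers by the total. The log format, publisher names and hours below are entirely hypothetical.

```python
from collections import Counter

# Hypothetical enquiry log: for each answered enquiry, the hours lost to
# cleaning up each publisher's data before the question could be answered.
ENQUIRY_LOG = [
    {"question": "Q1", "cleaning_hours": {"Publisher X": 1.5, "Publisher Y": 0.5}},
    {"question": "Q2", "cleaning_hours": {"Publisher X": 1.0}},
    {"question": "Q3", "cleaning_hours": {"Publisher Z": 0.5, "Publisher Y": 1.0}},
]


def cleaning_cost_by_publisher(log):
    """Rank publishers by total cleaning hours across all enquiries."""
    totals = Counter()
    for enquiry in log:
        for publisher, hours in enquiry["cleaning_hours"].items():
            totals[publisher] += hours
    # Highest cost first: the data that user demand says to fix first.
    return totals.most_common()


ranking = cleaning_cost_by_publisher(ENQUIRY_LOG)
print(ranking)
```

With a fixed price per answered enquiry, the publisher at the top of that list is the one whose data is eating most into the service’s margin.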

Of course, the incentive structures here are subtle: the quickest solution for an enquiry service might be to clean up data as it comes into its own data store rather than trying to improve data at source – and there remains a desire in open data projects to avoid creating single centralised databases, and to increase the resiliency of the ecosystem by improving original open data, which would oppose this strategy. This would need to be worked through in any full proposal.

I’m not sure what appetite there would be for a service like this – but I’m certain that in what are ultimately niche open data ecosystems, like IATI, strategic interventions will be needed to build the markets, services and feedback loops that lead to their survival.

Comments and reflection welcome

OCDS – Notes on a standard

Today sees the launch of the first release of the Open Contracting Data Standard (OCDS). The standard, as I’ve written before, brings together concrete guidance on the kinds of documents and data that are needed for increased transparency in processes of public contracting, with a technical specification describing how to represent contract data and meta-data in common ways.

The video below provides a brief overview of how it works (or you can read the briefing note), and you can find full documentation at http://standard.open-contracting.org.

When I first jotted down a few notes on how to go forward from the rapid prototype I worked on with Sarah Bird in 2012, I didn’t realise we would actually end up with the opportunity to put some of those ideas into practice. However: we did – and so in this post I wanted to reflect on some aspects of the standard we’ve arrived at, some of the learning from the process, and a few of the ideas that have guided at least my inputs into the development process.

As, hopefully, others pick up and draw upon the initial work we’ve done (in addition to the great inputs we’ve had already), I’m certain there will be much more learning to capture.

(1) Foundations for ‘open by default’

Early open data advocacy called for ‘raw data now’, asking for governments to essentially export and dump online existing datasets, with issues of structure and regular publishing processes to be sorted out later. Yet, as open data matures, the discussion is shifting to the idea of ‘open by default’. Taken seriously, this means more than making open licensing the default for whatever data dumps are created: it should mean that data is released from government systems as a matter of course, as part of their day-to-day operation.

The full OCDS model is designed to support this kind of ‘open by default’, allowing publishers to provide small releases of data every time some event occurs in the lifetime of a contracting process. A new tender is a release. An amendment to that tender is a release. The contract being awarded, or then signed, are each releases. These data releases are tied together by a common identifier, and can be combined into a summary record, providing a snapshot view of the state of a contracting process, and a history of how it has developed over time.
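The idea of combining releases into a record can be sketched in a few lines of Python. This is only an illustration of the principle, not the merge routine from the specification: the field names, identifiers and the simple ‘later values win’ rule are all assumptions.

```python
import copy
from collections import defaultdict

# Hypothetical, simplified releases; field names are illustrative only.
RELEASES = [
    {"ocid": "ocds-example-001", "date": "2014-01-10", "tag": "tender",
     "tender": {"title": "Road repairs", "value": 1000}},
    {"ocid": "ocds-example-001", "date": "2014-02-01", "tag": "tenderAmendment",
     "tender": {"value": 1200}},
    {"ocid": "ocds-example-002", "date": "2014-01-20", "tag": "tender",
     "tender": {"title": "School meals"}},
    {"ocid": "ocds-example-001", "date": "2014-03-15", "tag": "award",
     "award": {"supplier": "Example Ltd"}},
]


def overlay(snapshot, update):
    """Recursively lay `update` over `snapshot`; later values win."""
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(snapshot.get(key), dict):
            overlay(snapshot[key], value)
        else:
            # Copy so that building the snapshot never alters a release.
            snapshot[key] = copy.deepcopy(value)
    return snapshot


def build_records(releases):
    """Group releases by their shared identifier and summarise each group."""
    grouped = defaultdict(list)
    for release in releases:
        grouped[release["ocid"]].append(release)

    records = {}
    for ocid, items in grouped.items():
        # ISO-style date strings sort correctly as plain text.
        history = sorted(items, key=lambda r: r["date"])
        snapshot = {}
        for release in history:
            overlay(snapshot, release)
        # The record keeps both the history and the current snapshot.
        records[ocid] = {"ocid": ocid, "releases": history, "snapshot": snapshot}
    return records


records = build_records(RELEASES)
print(records["ocds-example-001"]["snapshot"])
```

The record for the first contracting process holds three releases as its history, while its snapshot shows the amended tender value alongside the award.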

This releases and records model seeks to combine together different user needs: from the firm seeking information about tender opportunities, to the civil society organisation wishing to analyse across a wide range of contracting processes. And by allowing core stages in the business process of contracting to be published as they happen, and then joined up later, it is oriented towards the development of contracting systems that default to timely openness.

As I’ll be exploring in my talk at the Berkman Centre next week, the challenge ahead for open data is not just to find standards to make existing datasets line-up when they get dumped online, but is to envisage and co-design new infrastructures for everyday transparent, effective and accountable processes of government and governance.

(2) Not your minimum viable product

Different models of standard

Many open data standard projects adopt either a ‘Minimum Viable Product’ approach, looking to capture only the few most common fields between publishers, or are developed through focussing on the concerns of a single publisher or user. Whilst MVP models may make sense for small building blocks designed to fit into other standardisation efforts, when it came to OCDS there was a clear user demand to link up data along the contracting process, and this required an overarching framework into which simple components could be placed, or from which they could be extracted, rather than the creation of ad-hoc components, with the attempt to join them up made later on.

Whilst we didn’t quite achieve the full abstract model + idiomatic serialisations proposed in the initial technical architecture sketch, we have ended up with a core schema, and then suggested ways to represent this data in both structured and flat formats. This is already proving useful for example in exploring how data published as part of the UK Local Government Transparency Code might be mapped to OCDS from existing CSV schemas.

(3) The interop balancing act & keeping flex in the framework

OCDS is, ultimately, not a small standard. It seeks to describe the whole of a contracting process, from planning, through tender, to contract award, signed contract, and project implementation. And at each stage it provides space for capturing detailed information, linking to documents, tracking milestones and tracking values and line-items.

This shape of the specification is a direct consequence of the method adopted to develop it: looking at a diverse set of existing data, and spending time exploring the data that different users wanted, as well as looking at other existing standards and data specifications.

However, OCDS by no means covers all the things that publishers might want to state about contracting, nor all the things users may want to know. Instead, it focusses on achieving interoperability of data in a number of key areas, and then providing a framework into which extensions can be linked as the needs of different sub-communities of open data users arise.

We’re only in the early stages of thinking about how extensions to the standard will work, but I suspect they will turn out to be an important aspect: allowing different groups to come together to agree (or contest) the extra elements that are important to share in a particular country, sector or context. Over time, some may move into the core of the standard, and potentially elements that appear core right now might move into the realm of extensions, each able to have their own governance processes if appropriate.

As Urs Gasser and John Palfrey note in their work on Interop, the key in building towards interoperability is not to make everything standardised and interoperable, but is to work out the ways in which things should be made compatible, and the ways in which they should not. Forcing everything into a common mould removes the diversity of the real world, yet leaving everything underspecified means no possibility to connect data up. This is both a question of the standards, and the pressures that shape how they are adopted.

(4) Avoiding identity crisis

Data describes things. To be described, those things need to be identified. When describing data on the web, it helps if those things can be unambiguously identified and distinguished from other things which might have the same names or identification numbers. This generally requires the use of globally unique identifiers (guid): some value which, in a universe of all available contracting data, for example, picks out a unique contracting process; or, in the universe of all organizations, uniquely identifies a specific organization. However, providing these identifiers can turn out to be both a politically and technically challenging process.

The Open Data Institute have recently published a report on the importance of identifiers that underlines how important identifiers are to processes of opening data. Yet, consistent identifiers often have key properties of public goods: everyone benefits from having them, but providing and maintaining them has some costs attached, which no individual identifier user has an incentive to cover. In some cases, such as goods and service identifiers, projects have emerged which take a proprietary approach to fund the maintenance of those identifiers, selling access to the lookup lists which match the codes for describing goods and services to their descriptions. This clearly raises challenges for an open standard, as when proprietary identifiers are incorporated into data, then users may face extra costs to interpret and make sense of data.

In OCDS we’ve sought to take as distributed an approach to identifiers as possible, only requiring globally unique identifiers where absolutely necessary (identifying contracts, organizations and goods and services), and deferring to existing registration agencies and identity providers, with OCDS maintaining, at most, code lists for referring to each identity ‘scheme’.

In some cases, we’ve split the ‘scheme’ out into a separate field: for example, an organization identifier consists of a scheme field with a value like ‘GB-COH’ to stand for UK Companies House, and then the identifier given in that scheme, like ‘5381958’. This approach allows people to store those identifiers in their existing systems without change (existing databases might hold national company numbers, with the field assumed to come from a particular register), whilst making explicit the scheme they come from in the OCDS. In other cases, however, we look to create new composite string identifiers, combining a prefix, and some identifier drawn from an organization’s internal system. This is particularly the case for the Open Contracting ID (ocid). By doing this, the identifier can travel between systems more easily as a guid – and could even be incorporated in unstructured data as a key for locating documents and resources related to a given contracting process.
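A small Python sketch may help show the difference between the two approaches. The ‘GB-COH’ scheme and the ‘5381958’ identifier come from the example above; the second scheme code, the ocid prefix, the internal reference and the hyphen used to join them are hypothetical and chosen purely for illustration.

```python
def organization_identifier(scheme, identifier):
    """Keep the register (scheme) and the number it issued as separate fields."""
    return {"scheme": scheme, "id": identifier}


def same_organization(a, b):
    """Two identifiers match only if both scheme and id agree."""
    return (a["scheme"], a["id"]) == (b["scheme"], b["id"])


def make_ocid(prefix, internal_id):
    """Compose one string from a publisher prefix and an internal reference."""
    return f"{prefix}-{internal_id}"


companies_house = organization_identifier("GB-COH", "5381958")
# Hypothetical second register that happens to issue the same number.
other_register = organization_identifier("XX-REG", "5381958")

# Hypothetical prefix and internal contracting process reference.
ocid = make_ocid("ocds-a1b2c3", "2014-0042")

print(same_organization(companies_house, other_register))
print(ocid)
```

In the first case the number stored in an existing system is left untouched and the scheme is added alongside it at the point of publication; in the second, the prefix only needs to be known to whatever system exposes the data.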

However, recent learning from the project is showing that many organisations are hesitant about the introduction of new IDs, and that adoption of an identifier schema may require as much advocacy as adoption of a standard. At a policy level, bringing some external convention for identifying things into a dataset appears to be seen as affecting the, for want of a better word, sovereignty of a specific dataset: even if in practice the prefix approach of the ocid means it only needs to be hard coded in the systems that expose data to the world, not necessarily stored inside organizations’ databases. However, this is an area I suspect we will need to explore more, and keep tracking, as OCDS adoption moves forward.

(5) Bridging communities of practice

If you look closely you might in fact notice that the specification just launched in Costa Rica is actually labelled as a ‘release candidate‘. This points to another key element of learning in the project, concerning the different processes and timelines of policy and technical standardisation. In the world of funded projects and policy processes, deadlines are often fixed, and the project plan has to work backwards from there. In a technical standardisation process, there is no ‘standard’ until a specification is in use: and has been robustly tested. The processes for adopting a policy standard, and setting a technical one, differ – and whilst perhaps we should have spoken from the start of the project of an overall standard, embedding within it a technical specification, we were too far down the path towards the policy launch before this point. As a result, the Release Candidate designation is intended to suggest the specification is ready to draw upon, but that there is still a process to go (and future governance arrangements to be defined) before it can be adopted as a standard per-se.

(6) The schema is just the start of it

This leads to the most important point: that launching the schemas and specification is just one part of delivering the standard.

In a recent e-mail conversation with Greg Bloom about elements of standardisation, linked to the development of the Open Referral standard, Greg put forward a list of components that may be involved in delivering a sustainable standards project, including:

  • The specification (with its various components and subcomponents);
  • Tools that assess compliance according to the spec (e.g. validation tools, and more advanced assessment tools);
  • Some means of visualizing a given set of data’s level of compliance;
  • Incentives of some kind (whether positive or negative) for attaining various levels of compliance;
  • Processes for governing all of the above;
  • and of course the community through which all of this emerges and sustains;

To this we might also add elements like documentation and tutorials, support for publishers, catalysing work with tool builders, guidance for users, and so-on.

Open government standards are not something to be published once, and then left, but require labour to develop and sustain, and involve many social processes as much as technical ones.

In many ways, although we’ve spent a year of small development iterations working towards this OCDS release, the work now is only just getting started, and there are many technical, community and capacity-building challenges ahead for the Open Contracting Partnership and others in the open contracting movement.

Two senses of standard

[Summary: technical standards play a role in both interoperability, and in target-setting for policy.]

I’ve been doing lots of thinking about standardisation recently, particularly as part of work on the Open Contracting Data Standard (feedback invited on the latest draft release…), and thanks to the opportunity to work with Samuel Goëta on a paper around data standards (hopefully out some time next year).

One of the themes I’ve been seeking to explore is how standards play both a technical and a political role, and how standards processes (at least at the level of content standards) can sensitively engage with this. Below is a repost of my earlier contribution to a GitHub thread discussing some of this in the context of Open Contracting.

Two senses of standard

In Open Contracting I believe we’re dealing with two different senses of ‘standard’, and two purposes which we need to keep in balance. Namely:

  • Standards as a basis for interoperability – as in, “their data complies with the standard, and can be used by standards-compliant tools”.
  • Standards as targets – as in, “they have achieved a high standard of disclosure”.

To unpack these a bit:

(Note: the arguments below are predominantly theoretical, and so some of the edge cases considered may not come up at all in practice in the Open Contracting Data Standard, but considering them is a useful exercise to test the intuitions and principles directing our action.)

Standards as interoperability

We’re interested in interoperability in two directions: vertical (can a single dataset be used by other actors and tools in a value-chain of re-use), and horizontal (can two datasets from different publishers be easily analysed alongside one another).

Where data is already published, then the goal should be to achieve the largest possible set of data publishers who can richly represent their data in the standard, and of data users who can draw on data in the standard to meet their needs. This supports the idea that for any element in the standard where (a) data already exists; and (b) use cases already exist; we should be looking for reference implementations to test that data can be rendered in the standard, and that users (or tools they create) can read, analyse and use that data effectively.

However, it is important that we look at both horizontal and vertical interoperability in making this judgement. E.g. there could be a country as the sole publisher of a field that is used by 5 different users in their country. This should clearly not be a required field in a standard, but articulating how it is standardised is useful to this community of users (one way to accommodate such cases may be in extensions, although the judgement on whether or not to move something to an extension might come down to whether it is likely that other publishers could be providing this data in future).

In many cases, underlying data from different sources is not perfectly interoperable, or there is a mismatch between the requirements of users, and the requirements of data holders. In these cases, the way a standard is designed affects the distribution of labour between publishers and users with respect to rendering data interoperable. For example, a use case might involve ‘Identifying which different government agencies, each publishing data independently, have contracts with a particular firm’. In this case, a standard could require all publishers, who may store different identifiers in their systems, to map these to a common identifier, or a standard could allow publishers to use whatever identifier they hold, leaving the costs of reconciling these on the user. Making things interoperable can then involve a process of negotiation, and this process may play out differently in different places at different times, leaving certain elements of a standard less stable than others. The concept of ‘designing for the tussle’ (PDF) may be relevant here, thinking about how we can modularise stable (or ‘neutral’) and unstable elements of a standard (this is what the proposed Organisation ID standard does, by having a common way to represent identifiers, but separating this off from the choice of identifier itself, and then allowing for the emergence of a set of third-party tools and validation routines to help manage the tussle).

In seeking to maximise the set of publishers and users interoperable through the standard we need to be critically aware of both short-term and long-term interoperability, as organisations modify their practices in order to be able to publish to, or draw upon, a common standard. We need to balance out a ‘Lowest Common Denominator’ (LCD) or ‘Minimum Viable Product’ (MVP) approach that means that the majority of publishers can achieve substantial coverage of the standard, with a richer standard that supports the greatest chance of different producer and consumer groups being able to exchange data through the standard.


(Initial attempt to sketch distinction between maximising set of common fields across publisher and users, and maximising set of publishers and users)

Standards as targets

Open Contracting is a political process. The Open Contracting Partnership have articulated a set of Global Principles which set out the sorts of information about contracting that governments and other parties should disclose, and they are working to secure government sign-up to these principles. In policy circles, a standard is often seen as a form of measure, qualitative or quantitative, against which progress towards some policy goal is measured. Some targets might be based on ‘best practice’, others are based on ‘stretch goals’: things which perhaps no-one is yet doing particularly well, but which a community of actors agree are worth aiming for. A standard, whether specified in terms of indicators and measures, or in terms of fields and formats, provides a means of agreeing what meeting the target will look like.

The Open Contracting Principles call for a lot of things which no governments yet appear to be publishing in machine-readable forms. In many cases we’ve not touched the standardisation of these right now (e.g. “Risk assessments, including environmental and social impact assessments”), recognising either that standards for these will exist in different domains that can be linked or embedded into our standard, or that interoperability of such information is hard to achieve and ultimately what is needed for most use cases may be legal text or plain language documents, rather than structured data. However, there may be cases where something is a strong candidate for standardisation, having both the potential to be published (i.e. this is something which evidence suggests governments either do, or could, capture in their existing information systems), and clearly articulated use cases. In these cases a proposed field-level standard can act as an important target for those seeking to provide this data to move towards. It also acts to challenge unwarranted ‘first mover advantage’ where the first person to publish, even if publishing less than an ideal target would require, gets to set the standard, and instead makes the ‘target’ subject to community discussion.

Clearly any ‘aspirational’ elements of a standard should not predominate or make up the majority of a standard if it seeks to effectively support interoperability, but in standards that play a part in policy and political processes (as, in practice, all standards do to some extent; c.f. Lessig) some aspirational elements may have a legitimate place.

Implications for Open Contracting Data Standard

There are a number of ways we might respond to a recognition of the dual role that standardisation plays in Open Contracting.

Purposes and validation sets

One approach, suggested in the early technical scoping, is to identify different sets of users, or ‘purposes’ for the standard, and for each of these to identify the kinds of fields (subset of the data) these purposes require. As Jeni Tennison’s work on the scoping describes “…each purpose can have a status (eg proposed vs implemented) and … purposes are only marked as implemented when there are implementations that use the given subset of data for the specified purpose”.

If there are neither purposes requiring a field, nor datasets providing a field, then it would not be suitable for inclusion in a standard. And if a purpose either went unimplemented for a long period, or required a field that no supplier could publish, then careful evaluation would be needed of whether to remove that purpose (or remove that field from the purpose) from the set against which elements of the standard are evaluated for relevance to remain in the model.

Purposes could also be used to validate datasets, identifying how many datasets are fit for which purpose.
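To make this a little more concrete, here is a minimal sketch in Python of what purpose-based validation might look like, assuming a purpose can be reduced to a set of required field names. The purposes, datasets and field names below are hypothetical, invented only for illustration.

```python
# Hypothetical purposes, each reduced to the set of fields it requires.
purposes = {
    "value-for-money": {"buyer", "supplier", "amount", "award_date"},
    "supplier-tracking": {"buyer", "supplier"},
    "risk-review": {"buyer", "risk_assessment"},
}

# Hypothetical datasets, described by the fields they actually provide.
datasets = {
    "country-a": {"buyer", "supplier", "amount", "award_date"},
    "country-b": {"buyer", "supplier"},
}


def fit_for(purposes, datasets):
    """For each purpose, list the datasets that provide every field it requires."""
    return {
        name: sorted(d for d, fields in datasets.items() if required <= fields)
        for name, required in purposes.items()
    }


def unsupported_fields(purposes, datasets):
    """Fields that some purpose requires but that no dataset yet provides."""
    published = set().union(*datasets.values())
    required = set().union(*purposes.values())
    return required - published


print(fit_for(purposes, datasets))
# → {'value-for-money': ['country-a'], 'supplier-tracking': ['country-a', 'country-b'], 'risk-review': []}
print(unsupported_fields(purposes, datasets))
# → {'risk_assessment'}
```

A purpose with no fitting datasets (here ‘risk-review’) is the kind of case that would prompt the careful evaluation described above, rather than automatic removal.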

Stable, ordinary and target elements

We could maintain a distinction in how the standard is described between fields and elements which are ‘stable’ (and thus very unlikely to change), ‘ordinary’ elements (which may have reference implementations, but could change if there was some majority interest amongst those governing a standard in seeing changes), and ‘target’ elements, which may lack any reference implementations, but which are considered useful to help publishers move towards implementing a political commitment to publish.

Q: Could we build this information into the schema meta-data somehow?

We might need to have quite a long time horizon for keeping target elements provisionally in the standard, and to only remove them if there is agreement that no-one is likely to publish to them. However, being able to represent them visually as distinct in the schema, and clearly documenting the distinction may be valuable.
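One possible answer to the question of schema meta-data, sketched here in Python, is to annotate each field with its status and let documentation tools read that annotation. This is only an assumption about how it could work: the `x-status` key, the field names and the default of ‘ordinary’ are all hypothetical, and nothing here reflects a decision about the actual schema.

```python
# A schema fragment written as a Python dict. The "x-status" annotation
# and all of the field names are hypothetical.
schema = {
    "properties": {
        "buyer": {"type": "object", "x-status": "stable"},
        "amount": {"type": "number", "x-status": "ordinary"},
        "riskAssessment": {"type": "string", "x-status": "target"},
        "notes": {"type": "string"},  # no status given
    }
}


def fields_by_status(schema, default="ordinary"):
    """Group field names by their status annotation, for use in documentation."""
    grouped = {}
    for name, spec in schema["properties"].items():
        grouped.setdefault(spec.get("x-status", default), []).append(name)
    return grouped


print(fields_by_status(schema))
# → {'stable': ['buyer'], 'ordinary': ['amount', 'notes'], 'target': ['riskAssessment']}
```

A grouping like this could drive the visual distinction in the rendered schema, without changing what counts as valid data.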

Extensions

Some ‘target’ elements may best belong in extensions, with some process for merging extensions into the core standard if they are widely enough adopted.

Regular implementation monitoring

The IATI Team run a dashboard which tracks use of particular fields in the data. Doing something similar for Open Contracting would be valuable, and it may even be useful to feed such information into the display of the schema or documentation (or at least to make it easy for publishers and users to look up who is implementing a given property).
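The core of such monitoring is simple counting. The following is a minimal Python sketch, assuming each publisher’s data can be read as a list of flat records; it is not how the IATI dashboard is built, and the publishers, fields and values are invented.

```python
from collections import Counter

# Invented example data: a list of flat records per publisher.
releases = {
    "publisher-a": [
        {"buyer": "Ministry X", "amount": 1000},
        {"buyer": "Ministry Y", "amount": 250, "award_date": "2014-01-10"},
    ],
    "publisher-b": [
        {"buyer": "Agency Z", "amount": None},
    ],
}


def fields_used(records):
    """Fields holding a non-empty value in at least one record."""
    used = set()
    for record in records:
        used.update(k for k, v in record.items() if v not in (None, "", [], {}))
    return used


def field_usage(releases):
    """Count, for each field, how many publishers use it at least once."""
    usage = Counter()
    for records in releases.values():
        usage.update(fields_used(records))
    return usage


def publishers_using(releases, field_name):
    """Look up who is implementing a given field."""
    return sorted(p for p, records in releases.items() if field_name in fields_used(records))


print(field_usage(releases)["buyer"])          # → 2
print(publishers_using(releases, "amount"))    # → ['publisher-a']
```

Counting publishers rather than records avoids one very large publisher dominating the picture of which fields are really in use.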

Implementation schedules

Another approach IATI uses for ‘target elements’ is to ask publishers to prepare ‘Implementation Schedules’ which outline which fields they expect to be able to publish by when. This allows an indication of whether there is political will to reach some of the ‘stretch targets’ that might be involved in a standard, and holds out the potential to convene those who are most likely to publish that data in the near to medium term, in order to define and refine target standardisations together.

Discussion

What theoretical writing on standardisation could I be drawing on here?

What experience from other standards could we be drawing upon in Open Contracting and in other standard processes?

Exploring Wikidata

[Summary: thinking aloud – brief notes on learning about the wikidata project, and how it might help address the organisational identifiers problem]

I’ve spent a fascinating day today at the Wikimania Conference at the Barbican in London, mostly following the programme’s ‘data’ track in order to understand in more depth the Wikidata project. This post shares some thinking aloud to capture some learning, reflections and exploration from the day.

As the Wikidata project manager, Lydia Pintscher, framed it, right now access to knowledge on wikipedia is highly skewed by language. The topics of articles you have access to, the depth of meta-data about them (such as the locations they describe), the detail of those articles, and their likelihood of being up to date, are all greatly affected by the language you speak. Italian or Greek wikipedia may have great coverage of places in Italy or Greece, but go wider and their coverage drops off. In terms of seeking more equal access to knowledge, this is a problem. However, whilst the encyclopedic narrative of a French, Spanish or Catalan page about the Barbican Centre in London will need to be written by someone in command of that language, many of the basic facts that go into an article are language-neutral, or translatable as small units of content, rather than sentences and paragraphs. The date the building was built, the name of the architect, the current capacity of the building – all the kinds of things which might appear in infoboxes – are all things that could be made available to bootstrap new articles, or that, when changed, could have their changes cascaded across all the different language pages that draw upon them.

That is one of the motivating cases for Wikidata: separating out ‘items’ and their ‘properties’ that might belong in Wikipedia from the pages, making this data re-usable, and using it to build a better encyclopedia.

However, wikidata is also generating much wider interest – not least because it is taking on a number of problems that many people want to see addressed. These include:

  • Somewhere ‘institutional’ and well governed on the web to put data – and where each data item also gains the advantage of a discussion page.
  • The long-term preservation, and versioning, of data;
  • Providing common identifiers on the web for arbitrary things – and providing URIs for these things that can be looked up (building on the idea of DBPedia as a crystallisation point for the web of linked data);
  • Providing a data model that can cope with change over time, and with data from heterogeneous sources – all of the properties in wikidata can have qualifiers, such as when the statement is true from, or until, source information, and other provenance data.

Wikidata could help address these issues on two levels:

  • By allowing anyone to add items and properties to the central wikidata instance, and making these available for re-use;
  • By providing an open source software platform for anyone to use in managing their own corpus of wikified, versioned data*;

A particular use case I’m interested in is whether it might help in addressing the perennial Organisational Identifiers problem faced by data standards such as IATI and Open Contracting, where it turns out that having shared identifiers for government agencies, and lots of existing, but non-registered, entities like charities and associations that give and receive funds, is really difficult. Others at Wikimania spoke of potential use cases around maintaining national statistics, and archiving the datasets underlying scientific publications.

However, in thinking about the use cases wikidata might have, it’s important to keep in mind its current scope:

  • It is a store of ‘items’ and then ‘statements’ about them (essentially a graph store). This is different from being a place to store datasets (as you might want to do with the archival of the dataset used in a scientific paper), and it means that, once created, items are the first class entities of wikidata, able to exist in multiple collections.
  • It currently inherits Wikipedia’s notability criteria for items. That is, the basic building blocks of wikidata – the items that can be identified and described, such as the Barbican, Cheese or Government of Grenada – can only be included in the main wikidata instance if they have a corresponding wikipedia page in some language wikipedia (or similar: this requirement is a little more complex).
  • It can be edited by anyone, at any time. That is, systems that rely on the data need to consider what levels of consistency they need. Of course, as wikipedia has shown, editability is often a great strength – and as Rufus Pollock noted in the ‘data roundtable’ session, updating and versioning of open data are currently big missing parts of our data infrastructures.

Unlike the entirely distributed open world assumption on the web of data, where the AAA assumption holds (Anyone can say Anything about Anything), wikidata brings both a layer of regulation to the statements that can be made, and the potential of community driven editorial control. It sits somewhere between the controlled description sets of Schema.org, and an entirely open proliferation of items and ontologies to describe them.

Can it help the organisational identifiers problem?

I’ve started to carry out some quick tests to see how far wikidata might be a resource to help with the aforementioned organisational identifiers problem.

Using Kasper Brandt’s fantastically useful linked data rendering of IATI, I queried for the names of a selection of government and non-government organisations occurring in the International Aid Transparency Initiative data. I then used Open Refine to look up a selection of these on the DBPedia endpoint (which it seems now incorporates wikidata info as well). This was very rough-and-ready (just searching for full name matches), but by cross-checking negative results (where there were no matches) by searching wikipedia manually, it’s possible to get a sense of how many organisations might be identifiable within Wikipedia.

So far I’ve only tested the method, and haven’t run a large scale test – but I found around half the organisations I checked had a Wikipedia entry of some form, and thus would currently be eligible to be Wikidata items right away. For others, Wikipedia pages would need to be created, and whether or not all the small voluntary organisations that might occur in an IATI or Open Contracting dataset would be notable for inclusion is something that would need to be explored more.
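For anyone wanting to repeat this kind of rough test, the matching step can be sketched in a few lines of Python. This is not the Open Refine workflow I used: it assumes you already have a list of organisation names and a list of page titles to hand, it adds case and whitespace normalisation on top of plain full-name matching, and the names below are invented.

```python
def normalise(name):
    """Lower-case a name and collapse runs of whitespace, so trivial differences are ignored."""
    return " ".join(name.casefold().split())


def match_names(org_names, page_titles):
    """Split organisation names into those with, and without, a full-name match among the titles."""
    titles = {normalise(t) for t in page_titles}
    matched = [n for n in org_names if normalise(n) in titles]
    unmatched = [n for n in org_names if normalise(n) not in titles]
    return matched, unmatched


# Invented names, for illustration only.
org_names = ["Example Ministry of Finance", "Example  Health Trust", "Small Village Association"]
page_titles = ["Example ministry of finance", "Example Health Trust"]

matched, unmatched = match_names(org_names, page_titles)
print(matched)    # → ['Example Ministry of Finance', 'Example  Health Trust']
print(unmatched)  # → ['Small Village Association']
```

The `unmatched` list is the one to cross-check by hand, since a missing full-name match does not mean there is no page under a different name.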

Exploring the Wikidata pages for some of the organisations I did find threw up some interesting additional possibilities to help with organisation identifiers. A number of pages were linked to identifiers from Library Authority Files, including VIAF identifiers such as this set of examples returned for a search on Malawi Ministry of Finance. Library Authority Files would tend to only include entries when a government agency has a publication of some form in that library, but at a quick glance coverage seems pretty good.

Now, as Chris Taggart would be quick to point out, neither wikipedia pages, nor library authority file identifiers, act as a registry of legal entities. They pick out everyday concepts of an organisation, rather than the legally accountable body which enters into contracts. Yet, as they become increasingly backed by data, these identifiers do provide access to look up lots of contextual information that might help in understanding issues like organisational change over time. For example, the Wikipedia page for the UK’s Department for Education includes details on the departments that preceded it. In wikidata form, a statement like this could even be qualified to say if that relationship of being a preceding department is one that passes legal obligations from one to the other.
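To picture what such a qualified statement might look like, here is a minimal Python sketch. It is a simplification rather than the Wikidata data model itself: the department names, the ‘replaces’ property and the ‘transfers legal obligations’ qualifier are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A claim about an item, with optional qualifiers and sources attached."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)
    sources: list = field(default_factory=list)


# Invented departments and relationships, for illustration only.
statements = [
    Statement("Department C", "replaces", "Department A",
              qualifiers={"start": "2010", "transfers legal obligations": True}),
    Statement("Department C", "replaces", "Department B",
              qualifiers={"start": "2010"}),
]


def predecessors(org, statements, legal_only=False):
    """Find the organisations that `org` replaces, optionally only where obligations passed on."""
    return sorted(
        s.value
        for s in statements
        if s.subject == org
        and s.prop == "replaces"
        and (not legal_only or s.qualifiers.get("transfers legal obligations") is True)
    )


print(predecessors("Department C", statements))                   # → ['Department A', 'Department B']
print(predecessors("Department C", statements, legal_only=True))  # → ['Department A']
```

The everyday relationship and the legal one sit in the same statement, and a user chooses which reading they need.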

I’ve still got to think about this a lot more, but it seems that:

  • There are many things it might be useful to know about organisations, but which are not going to be captured in official registries anytime soon. Some of these things will need to be subject to discussion, and open to agreement through dialogue. Wikidata, as a trusted shared space with good community governance practices, might be a good place to keep these things, albeit recognising that in its current phase it has no goal of being a comprehensive repository of records about all organisations in the world (and other spaces such as Open Corporates are already solving the comprehensive coverage problem for particular classes of organisation).

  • There are some organisations for which, in many countries, no official registry exists (particularly Government Departments and Agencies). Many of these things are notable (Government Departments for example), and so even if no Wikipedia entry yet exists, one could and should. A project to manage and maintain government agency records and identifiers in Wikidata may be worth exploring.

Whether a shift from seeking to solve some aspects of the organisational identifiers problem through finding some authority to provide master lists, to developing a distributed best-efforts community approach is one that would make sense to the open government community is something yet to be explored.

Notes

*I here acknowledge SJ Klein’s counsel that this (encouraging multiple domain specific instances of a wikidata platform) is potentially a very bad idea, as the ‘forking’ of wiki-projects has rarely been a successful journey: particularly with respect to the sustainability of forked content. As SJ outlined, even though there may be technical and social challenges to a mega graph store, these could be compared to the apparent challenges of making the first encyclopedias (the idea of a 50,000 page book must have seemed crazy at first), or the social challenges envisioned to Wikipedia at its genesis (‘how could non-experts possibly edit an encyclopedia?’). On this view, it is only by setting the ambition of a comprehensive shared store of the world’s propositional data (with the qualifiers that Wikidata supports to make this possible without a closed world assumption) that such limits might be overcome. Perhaps with data there is a greater possibility to support forking, and remerging, of wikidata instances, permitting short-term pragmatic creation of datasets outside the core wikidata project, which can later be brought back in if they are considered, as a set, notable (although this still carries risks that forked projects diverge in their values, governance and structure so far that re-connecting later is made prohibitively difficult).