Open Data

How data.gov.uk is laying foundations for open data engagement

Originally posted as a Guest Post on data.gov.uk

When the first data.gov.uk platform was launched, it was a great example of the ‘rewired state’ spirit: pioneering the rapid development of a new digital part of government using open source code, and developed through fluid collaboration between government staff, academics, open source developers, and open data activists from outside government. But essentially, the first data.gov.uk was bolted onto the existing machinery of government: a data outpost scraping together details of datasets from across departments, and acting as the broker providing the world with information on where to access that data. And it is fair to say data.gov.uk was designed by data-geeks, for data-geeks.

Tom Steinberg has argued that data portals need not appeal to the masses, and that most people will access government data through apps, but there are thousands of citizens who want direct access to data, and it is vital that data portals don’t exclude those unfamiliar with the design metaphors of source and software repositories. That’s why it is great to see a redesign of data.gov.uk that takes steps to simplify the user experience for anyone seeking out data, whether as a techie, or not.

The most interesting changes to data.gov.uk though are more subtle than the cleaner navigation and unexpected (but refreshing) green colour scheme. Behind the scenes Antonio Acuna and his team have been overhauling the admin system where data records are managed, with some important implications. Firstly, the site includes a clear hierarchy of publishing organisations (over 700 of them) and somewhere in each hierarchy there is a named contact to be found. That means that when you’re looking at any dataset it’s now easier to find out who you can contact to ask questions about it, or, if the data doesn’t tell you what you want, the new data.gov.uk lets you exercise your Right to Information (and hopefully soon Right to Data) and points you to how you can submit a Freedom of Information request.

Whilst at first most of these enquiries will go off to the lead person in each publishing organisation who updates their records on data.gov.uk, the site allows contact details to be set at the dataset level, moving towards the idea of data catalogues not as a firewall sitting between government and citizens, but as the starting point of a conversation between data owners/data stewards and citizens with an interest in the data. Using data to generate conversation, and more citizen-state collaboration, is one of the key ideas in the 5 stars for open data engagement, drafted at this year’s UKGovCamp.

The addition of a Library section with space for detailed documentation on datasets, including space to share the PDF handbooks that often accompany complex datasets and that share lots of the context that can’t be reduced down into neat meta-data, is a valuable addition too. I hope we’ll see a lot more of the ‘social life’ of the datasets that government holds becoming apparent on the new site over time – highlighting that not only can data be used to tell stories, but that there is a story behind each dataset too.

Open data portals have a hard balance to strike – between providing ‘raw’ datasets and disintermediating data, separating data from the analysis and presentation layers government often fixes on top – and becoming new intermediaries, giving citizens and developers the tools they need to effectively access data. Data portals take a range of approaches, and most are still a long way from striking the perfect balance. But the re-launched data.gov.uk lays some important foundations for a continued focus on user needs, and making sure citizens get the data they need, and, in the future, access to all the tools and resources that can help them make sense of it, whether those tools come from government or not.

What does Internet Governance have to do with open data?

[Summary: What do Internet Governance and Open Data have to do with each other?]

A proposal I worked on for a workshop at this year’s Internet Governance Forum, on the Internet Governance issues of Open Government Data, has been accepted, so I’ve been starting to think through the different issues that the background paper for that session will need to cover. This week I took advantage of a chance to guest blog over on the Commonwealth IGF website to start setting them out.

It started with high profile Open Government Data portals like Data.gov in the US, and Data.gov.uk in the UK giving citizens access to hundreds of government datasets. Now, open data has become a key area of focus for many countries across the world, forming a core element of the Open Government Partnership agenda, and sparking a plethora of international conferences, events and online communities. Proponents of open data argue it has the potential to stimulate economic growth, promote transparency and accountability of governments, and to support improved delivery of public services. This year’s Internet Governance Forum in Baku will see a number of open data focussed workshops, following on from open data and PSI panels in previous years. But when it comes to Open Data and Internet Governance, what are the issues we might need to explore? This post is a first attempt to sketch out some of the possible areas of debate.

In 2009 David Eaves put forward ‘three laws of open government data’ that describe what it takes for a dataset to be considered effectively open. They boil down to requirements that data should be accessible online, machine readable, and under licenses that permit re-use. Exploring these three facets of open data offers one route into potential internet governance issues that need to be critically discussed if the potential benefits of open data are to be secured in equitable ways.

1) Open Data as data accessible online

Online accessibility does not equate to effective access, and we should be attentive to new data divides. We also need to address bandwidth for open data, the design of open data platforms, cross-border cloud hosting of open data, and to connect open data and internet freedom issues. Furthermore, the online accessibility of public data may create or compound privacy and security issues that need addressing.

Underlying the democratic arguments for open data is the idea that citizens should have access to any data that affects their lives, to be able to use and analyse it for themselves, to critique official interpretations, and to offer policy alternatives. Economic growth arguments for open data often note the importance of a reliable, timely supply of data on which innovative products and services can be built. But being able to use data for democratic engagement, to support economic activity, is not just a matter of having the data – it also requires the skills to use it. Michael Gurstein has highlighted the risk that open data might ‘empower the empowered’ creating a new ‘data divide’. Addressing grassroots skills to use data, ensuring countries have capacity to exploit their own national open data, and identifying the sorts of intermediary institutions and capacity building to ensure citizens can make effective use of open data is a key challenge.

There are also technical dimensions of the data divide. Many open data infrastructures have developed in environments of virtually unlimited bandwidth, and are based on the assumption that transferring large data files is not problematic: an assumption that cannot be made everywhere in the world. Digital interfaces for working with data often rely on full size computers, and large datasets can be difficult to work with on mobile platforms. As past IGF cloud computing discussions have highlighted, where data is hosted may also matter. Placing public data, albeit openly licensed so sidestepping some of the legal issues, into cloud hosting, could have impacts on the accessibility, and the costs of access, to that data. How far this becomes an issue may depend on the scale of open data programmes, which as yet can only constitute a very small proportion of Internet traffic in any country. However, when data that matters to citizens is hosted in a range of different jurisdictions, Internet Freedom and filtering issues may have a bearing on who really has access to open data. As Walid Al-Saqaf’s powerful presentation at the Open Government Partnership highlighted, openness in public debate can be dramatically restricted when governments have arbitrary Internet filtering powers.

Last, but not least, in the data accessibility issues: most advocates of open data explicitly state that they are concerned only with public data, and exclude personal data from the discussion, but the boundaries between these two categories are often blurred (for example, court records are about individuals, but might also be a matter of public record). With many independently published open datasets based on aggregated or anonymised personal data, and with large-scale datasets harvested from social media and held by companies, ‘jigsaw identification’, in which machines can infer lots of potentially sensitive and personal facts about individuals, becomes a concern. As Cole outlines, in the past we have dealt with some of these concerns by ad-hoc limitations and negotiated access to data. Unrestricted access to open data online removes these strategies, and highlights the importance of finding other solutions that protect key dimensions of individual privacy.

2) Open data as machine readable

Publishing datasets involves selecting formats and standards which impact on what the data can express and how it can be used. Often standard setting can have profound political consequences, yet it can be treated as a purely technical issue.

Standards are developing for everything from public transport timetables (GTFS), to data on aid projects (IATI). These standards specify the format data should be shared in, and what the data can express. If open data publishers want to take advantage of particular tools and services, they may be encouraged to choose particular data standards. In some areas, no standards exist, and competing open and non-open standards are developing. Sometimes, because of legacy systems, datasets are tied into non-open standards, creating a pressure to develop new open alternatives.

Some data formats offer more flexibility than others, but usually with a connected increase in complexity. The common CSV format of flat data, accessible in spreadsheet software, does not make it easy to annotate or extend standardised data to cope with local contexts. eXtensible Markup Language makes extending data easier, and Linked Data offers the possibility of annotating data, but these formats often present barriers for users without specialist skills or training. As a whole web of new standards, code lists and identifiers is developed to represent growing quantities of open data, we need to ask who is involved in setting standards, and how we can make sure that global standards for open data promote, rather than restrict, the freedom of local groups to explore and address the diverse issues that concern them.
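The trade-off between flat and extensible formats can be sketched with Python's standard library. The bus stop record, the field names and the `local_note` extension below are all hypothetical, invented for illustration rather than taken from any real standard: a CSV writer with a fixed header rejects a row carrying an extra local field, while an XML element can take one more child without disturbing the standard fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record: field names are illustrative, not from a real standard.
FIELDS = ["stop_id", "name", "lat", "lon"]
record = {"stop_id": "S1", "name": "Market Square", "lat": "51.73", "lon": "0.47"}

# Flat CSV: every row has to fit the same fixed set of columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

# A local annotation does not fit the fixed header, so DictWriter raises
# ValueError (its default behaviour for unexpected keys).
try:
    writer.writerow({**record, "local_note": "step-free access"})
    csv_accepts_extra_field = True
except ValueError:
    csv_accepts_extra_field = False

# XML: the standard fields are unchanged; the local extension is one more child.
stop = ET.Element("stop", attrib={"id": record["stop_id"]})
for field in ("name", "lat", "lon"):
    ET.SubElement(stop, field).text = record[field]
ET.SubElement(stop, "local_note").text = "step-free access"
xml_text = ET.tostring(stop, encoding="unicode")

print(csv_text)
print(xml_text)
```

The flexibility comes at a price: the XML version needs a parser and some knowledge of its structure, whereas the CSV opens directly in a spreadsheet.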

3) Open data as licensed for re-use

Many use cases for open data rely on the ability to combine datasets, and this makes compatible licenses a vital issue. In developing license frameworks, we should engage with debates over who benefits from open data and how norms and licenses can support community claims to benefit from their data.

Open Source and Creative Commons licenses often include terms such as a requirement to ‘Share Alike’, or a Non-Commercial clause prohibiting profit making use of the content. These place restrictions on re-users of the content: for example, if you use Share Alike licensed content in your work, you must share your work under the same license. However, open data advocates argue that terms like this quickly create challenges for combining different datasets, as differently licensed data may be incompatible, and many of the benefits of having access to the data will be lost when it can’t be mashed up and remixed using both commercial and non-commercial tools. The widely cited OpenDefinition.org states that at most, licenses can require attribution of the source, but cannot place any other restrictions on data re-use. Developing a common framework for licensing has been a significant concern in many past governance discussions of open data.
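How Share Alike and Non-Commercial terms can block a mash-up can be shown with a deliberately simplified model. The license names and the two-flag rule below are assumptions made for illustration; real compatibility depends on the exact license texts, which this toy check does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Toy model of a license: only two restrictive terms are tracked."""
    name: str
    share_alike: bool = False
    non_commercial: bool = False


def can_combine(a: License, b: License) -> bool:
    """Assumed rule: a Share Alike license forces the combined work under its
    own terms, so the other dataset must not carry a different Share Alike
    license, nor a Non-Commercial term the Share Alike license lacks."""
    for first, second in ((a, b), (b, a)):
        if first.share_alike and first != second:
            if second.share_alike:
                return False
            if second.non_commercial and not first.non_commercial:
                return False
    return True


def commercial_use_allowed(licenses) -> bool:
    """A combined dataset is only commercially usable if no source forbids it."""
    return not any(lic.non_commercial for lic in licenses)


# Hypothetical license labels for the example.
ATTRIBUTION = License("Attribution")
BY_SA = License("Attribution-ShareAlike", share_alike=True)
BY_NC = License("Attribution-NonCommercial", non_commercial=True)

print(can_combine(ATTRIBUTION, BY_SA))
print(can_combine(BY_SA, BY_NC))
print(commercial_use_allowed([ATTRIBUTION, BY_NC]))
```

Under this model, attribution-only data combines with anything, which is one way of reading why open data advocates favour attribution as the maximum restriction.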

These discussions of common licenses have connections to past Access to Knowledge (A2K) debates over the rights of communities to govern access to traditional knowledges, or to gain a return from use of traditional knowledge. An open licensing framework creates the possibility that, without a level playing field of access to resources to use data (i.e. data divides), some powerful actors might exploit open data to their advantage, and to the loss of those who have stewarded that data in the past. Identifying community norms, and other responses to address these issues, is an area for discussion.

Further issues?

I’ve tried to set out some of the areas where debates on open data might connect with existing or emerging internet governance debates. In the workshop I’m planning for this year’s IGF I am hoping we will be able to dig into these issues in more depth to identify how far they are issues for the IGF, or for other fora, and to develop ideas on different constructive approaches to support equitable outcomes from open data. I’m sure the issues above don’t cover all those we might address, so do drop in a comment below to share your suggestions for other areas we need to discuss…

Further reading:

Michael Gurstein’s First Monday paper on Open Data explores who might benefit from open data. The theme is picked up in a number of articles in the Open Government Data special issue of the Journal of Community Informatics.
Christophe Gueret’s paper at the recent W3C Using Open Data Workshop argued that we need to look at ways to decentralise open data – creating infrastructures that do not rely on a constant Internet connection or keyboard and screen, but that link open data up with voice interfaces, local radio, and peer-to-peer infrastructures on low cost hardware like the XO laptops.
Kieron O’Hara’s review for the UK Government on the potential privacy impacts of open data explores key elements of the debate and highlights that this is an area in need of much more research.
The Critical Development Perspectives on Open Data research project will be looking at a number of these issues in the coming months, and has published a draft research outline for consultation at http://www.opendataresearch.net

(Other suggested references welcome too…)

Addition: over on the CIGF post Andrew has already suggested an extra reference to Tom Slee’s thought provoking blog post on ‘Seeing like a geek’ that emphasises the importance of putting licensing issues very much on the table in governance debates.

Open data: embracing the tough questions – new publications

[Summary: launching open data special issue of Journal of Community Informatics, and a new IKM Emergent paper] (Cross posted from Open Data Impacts blog)

Two open data related publications I’ve been working on have made it to the web in the last few days. Having spent a lot of the last few years working to support organisations to explore the possibilities of open data, these feel like they represent a more critical strand of exploring OGD, trying to embrace and engage with, rather than to avoid the tough questions. I’m hoping, however, they both offer something to the ongoing and unfolding debate about how to use open data in the interests of positive social change.

Special Issue of JoCI on Open Government Data
The first is a Special Issue of the Journal of Community Informatics on Open Government Data (OGD) bringing together four new papers, five field notes, and two editorials that critically explore how Open Government Data policies and practices are playing out across the world. All the papers and notes draw upon empirical study and grassroots experiences in order to explore key challenges of, and challenges to, OGD.

Nitya Raman’s note on “Collecting data in Chennai City and the limits of Openness” and Tom Demeyer’s account of putting together an application competition in Amsterdam explore some of the challenges of accessing and opening up government datasets in very different contexts, highlighting the complex realities involved in securing ongoing access to reliable government data. Papers from Sharadini Rath (on using government data to influence local planning in India), and Fiorella De Cindo (on designing deliberative digital spaces), explore the challenges of taking open data into civic discussions and policy making – recognising the role that platforms, politics and social dynamics play in enabling, and putting the brakes on, open data as a tool to drive change. A field note from Wolfgang Both and a point of view note from Rolie Cole on “The practice of open data as opposed to its promise” highlight that any OGD initiative involves choices about the data to prioritise, and the compromises to make between competing agendas when it comes to opening data. Shashank Srinivasan’s note on Mapping the Tso Kar basin in Ladakh, using GIS systems to represent the Changpa tribal people’s interaction with the land, also draws attention to the key role that technical systems and architectures play in making certain information visible, and the need to look for the data that is missing from official records.

Unlike many reports and white papers on OGD out there, which focus solely on potential positive benefits, a number of the papers in the issue also take the important step of looking at the potential for OGD to cause harm, or for OGD agendas to be co-opted against the interests of citizens and communities. Bhuvaneswari Raman’s paper “The Rhetoric of Transparency and its Reality: Transparent Territories, Opaque Power and Empowerment” puts power front and centre of an analysis of how the impacts of open data may play out, and Jo Bates’ “‘This is what modern deregulation looks like’: co-optation and contestation in the shaping of the UK’s Open Government Data Initiative” questions whether UK open data policy has become a fig-leaf for marketisation of public services and neoliberal reforms in the state.

These challenges to open government data, questioning whether OGD does (or even can?) deliver on promises to promote democratic engagement and citizen empowerment are, well, challenging. Advocates of OGD may initially want to ignore these critical cases, or to jump straight to sketching ‘patches’ and pragmatic fixes that route around these challenges. However, I suspect the positive potential of OGD will be closer when we more deeply engage with these critiques, and when in the advocacy and architecture of OGD we find ways to embrace tough questions of power and local context.

(Zainab and I have tried to provide a longer summary weaving together some of these issues in our editorial essay here, although we see this very much as the start, rather than end-point, of an exploration…)

More to come: I’ve been working on the journal issue for just over a year with my co-editor Zainab Bawa, and at the invitation of Michael Gurstein, who has also been fantastically supportive in us publishing this as a ‘rolling issue’. That means we’re going to be adding to the issue over the coming months, and this is just the first batch of papers available to start feeding into discussions and debates now, particularly ahead of the Open Government Partnership meeting in Brasilia next week where IDRC, Berkman Centre and the World Wide Web Foundation are hosting a discussion to develop future research agendas on the impacts of Open Government Data.

ICT for or against development? Exploring linked and open data in development

The second publication is a report I worked on last year with Mike Powel and Keisha Taylor for the IKM Emergent programme, under the title: “ICT for or against development? An introduction to the ongoing case of Web 3” (PDF). The paper asks whether the International Development sector has historically adopted ICT innovations in ways that empower the subjects of development and deliver sustainable improvements for those whose lives “are blighted by poverty, ill-health, insecurity and lack of opportunity”, and looks at where the opportunities and challenges might lie in the adoption of open and linked data technologies in the development sector. It’s online as a PDF here, and summaries are available in English, Spanish and French.

Untangling the data debate

[Cross posted from my PhD blog where I’m trying to write a bit more about issues coming up in my current research…]

This post is also available as a two-page PDF here.

Untangling the data debate: definitions and implications

Data is a hot topic right now: from big data, to open data and linked data, entrepreneurs and policy makers are making big claims about ‘data revolutions’. But, not all ‘data’ are the same, and good decision making about data involves knowing the differences.

Big data

Definition: Data that requires ‘massive’ computing power to process (Crawford & Boyd, 2011).

Massive computing power, originally only available on supercomputers, is increasingly available on desktop computers or via low cost cloud computing.

Implications: Companies and researchers can ‘data mine’ vast data resources, to identify trends and patterns. Big data is often generated by combining different datasets.

Digital traces from individuals and companies are increasingly captured and stored for their potential value as ‘big data’.

Raw data

Definition: Primary data, as collected or measured direct from the source. Or: data in a form that allows it to be easily manipulated, sorted, filtered and remixed.

Implications: Access to raw data can allow journalists, researchers and citizens to ‘fact check’ official analysis. Programmers are interested in building innovative services with raw data.

Real-time data

Definitions: Data measured and made accessible with minimal delay. Often accessed over the web as a stream of data through APIs (Application Programming Interfaces).

Implications: Real-time data supports rapid identification of trends. Data can support the development of ‘early warning systems’ (e.g. Google Flu Trends; Ushahidi). ‘Smart systems’ and ‘smart cities’ can be configured to respond to real-time data and adapt to changing circumstances.

Open data

Definition: Datasets that are made accessible in non-proprietary formats under licenses that permit unrestricted re-use (OKF – Open Knowledge Foundation, 2006). Open government data involves governments providing many of their datasets online in this way.

Implications: Third-parties can innovate with open data, generating social and economic benefits. Citizens and advocacy groups can use open government data to hold state institutions to account. Data can be shared between institutions with less friction.

Personal/ private data

Definitions: Data about an individual that they have a right to control access to. Such data might be gathered by companies, governments or other third-parties in order to provide a service to someone, or as part of regulatory and law-enforcement activities.

Implications: Many big and raw datasets are based on aggregating personal data, and combining them with other data. Effective anonymisation of personal data is difficult, particularly when open data provides the pieces for ‘jigsaw identification’ of private facts about people (Ohm, 2009).
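A minimal sketch of how ‘jigsaw identification’ can work in practice, assuming two small invented datasets: an ‘anonymised’ one with names removed, and a public register that shares the same indirect identifiers. All records, names and field names here are hypothetical.

```python
from collections import defaultdict

# Invented records for illustration only.
anonymised = [
    {"postcode": "CM1", "birth_year": 1950, "gender": "F", "condition": "diabetes"},
    {"postcode": "CM1", "birth_year": 1982, "gender": "M", "condition": "asthma"},
    {"postcode": "CM2", "birth_year": 1982, "gender": "M", "condition": "none"},
]
public_register = [
    {"name": "A. Example", "postcode": "CM1", "birth_year": 1950, "gender": "F"},
    {"name": "B. Example", "postcode": "CM1", "birth_year": 1982, "gender": "M"},
    {"name": "C. Example", "postcode": "CM1", "birth_year": 1982, "gender": "M"},
]

# Fields that are not names, but that narrow down who a row is about.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(row):
    """Combine the indirect identifiers of a row into one lookup key."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymised_rows, named_rows):
    """Return {name: condition} wherever exactly one named person matches."""
    names_by_key = defaultdict(list)
    for row in named_rows:
        names_by_key[key(row)].append(row["name"])

    matches = {}
    for row in anonymised_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match puts a name back on the row
            matches[names[0]] = row["condition"]
    return matches


matches = reidentify(anonymised, public_register)
print(matches)
```

Only the first row is re-identified here: the second matches two people, and the third matches nobody. The more datasets are available to join, the more rows tend to become unique.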

Linked data

Definitions: Datasets are published in the RDF format using URIs (web addresses) to identify the elements they contain, with links made between datasets (Berners-Lee, 2006; Shadbolt, Hall, & Berners-Lee, 2006).

Implications: A ‘web of linked data’ emerges, supporting ‘smart applications’ (Allemang & Hendler, 2008) that can follow the links between datasets. This provides the foundations for the Semantic Web.
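The idea of following links between datasets can be sketched without any RDF tooling. In this illustration plain Python tuples stand in for RDF triples, and the `example.org` URIs, vocabulary terms and records are all hypothetical: because both datasets use the same URI for the ward, merging them is just a matter of putting the triples together.

```python
# Plain (subject, predicate, object) tuples standing in for RDF triples.
# All URIs and values are invented for illustration.
EX = "http://example.org/"

schools = [
    (EX + "school/1", EX + "vocab/name", "Hilltop School"),
    (EX + "school/1", EX + "vocab/inWard", EX + "ward/7"),
]
wards = [
    (EX + "ward/7", EX + "vocab/name", "Riverside"),
    (EX + "ward/7", EX + "vocab/population", 12000),
]


def objects(triples, subject, predicate):
    """Return every object recorded for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def ward_names_for(school_uri):
    """Follow the inWard link from the schools data into the wards data."""
    merged = schools + wards  # shared URIs make the merge trivial
    ward_uris = objects(merged, school_uri, EX + "vocab/inWard")
    return [
        name
        for ward_uri in ward_uris
        for name in objects(merged, ward_uri, EX + "vocab/name")
    ]


print(ward_names_for(EX + "school/1"))
```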

More dimensions of data:

These are just a few different types of data commonly discussed in policy debates. There are many other data-distinctions we could also draw. For example: we can look at whether data was crowd-sourced, statistically sampled, or collected through a census. The content of a dataset also has important influence on the implications that working with that data will have: an operational dataset of performance statistics is very different from a geographical dataset describing the road network for example.

Crossovers and conflicts:

Almost all of the above types of data can be found in combination: you can have big linked raw data; real-time open data; raw personal data; and so-on.

There are some combinations that must be addressed with care. For example, ‘open data’ and ‘personal data’ are two categories that are generally kept apart for good reason: open data involves giving up control over access to a dataset, whilst personal data is the data an individual has the right to control access over.

These can be found in combination on platforms like Twitter, when individuals choose to give wider access to personal information by sharing it in a public space, but this is different from the controller of a dataset of personal data making that whole dataset openly available.

A nuanced debate:

It’s not uncommon to see claims and anecdotes about the impacts of ‘big data’ use in companies like Amazon, Google or Twitter being used to justify publishing ‘open’ and ‘raw data’ from governments, drawing on aggregating ‘personal data’. This sort of treatment glosses over the difference between types of data, the contents of the datasets, and the contexts they are used in. Looking to the potential of data use from different contexts, and looking to transfer learning between sectors can support economic and social innovation, but it also needs critical questions to be asked, such as:

What kind of data is this case describing?
Does the data I’m dealing with have similar properties?
Can the impacts of this data apply to the data I’m dealing with?
What other considerations apply to the data I’m dealing with?

Bibliography/further reading:

See http://www.opendataimpacts.net for ongoing work.

Allemang, D., & Hendler, J. A. (2008). Semantic web for the working ontologist: modeling in RDF, RDFS and OWL. Morgan Kaufmann.

Berners-Lee, T. (2006, July). Linked Data – Design Issues. Retrieved from http://www.w3.org/DesignIssues/LinkedData.html

Crawford, K., & Boyd, D. (2011). Six Provocations for Big Data.

Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. Practical Participation. Retrieved from http://www.practicalparticipation.co.uk/odi/report

OKF – Open Knowledge Foundation. (2006). Open Knowledge Definition. Retrieved March 4, 2010, from http://www.opendefinition.org/

Ohm, P. (2009). Broken promises of privacy: Responding to the surprising failure of anonymization. Retrieved from http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=1450006

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web Revisited. IEEE intelligent systems, 21(3), 96–101.

Focussing on open data where it matters: accountability and action

A lot of talk of open data proceeds as if all data is equal, and a government dataset is a government dataset. Some open data advocates fall into the trap of seeing databases as collections of ‘neutral facts’, without recognising the many political and practical judgements that go into the collection and modelling of data. But, increasingly, an awareness is growing that datasets are not a-political, and that not all datasets are equal when it comes to their role in constituting a more open government.

Back in November 2010 I started exploring whether the government’s ‘Public Sector Information Unlocking Service’ actually worked by asking for open data access to the dataset underlying the Strategic Export Controls: Reports and Statistics Website. Data on where the UK has issued arms export licenses is clearly important data for accountability, and yet, the data is kept obfuscated in an inaccessible website. 14 months on, and my various requests for the data have seen absolutely zero response. Not even an acknowledgement.

However, today Campaign Against the Arms Trade have managed to unlock the Export License dataset, after painstakingly extracting inaccessible statistics from the official government site, and turning this into an open dataset and providing an online application to explore the data. They explain:

Until now the data, compiled by the Export Control Organisation (ECO) in the Department for Business, Innovation and Skills (BIS), was difficult to access, use and understand. The new CAAT app, available via CAAT’s website, transforms the accessibility of the data.

The salient features are:

Open access – anyone can view data without registering and can make and refine searches in real time.
Data has been disaggregated, providing itemised licences with ratings and values.
Comprehensive searchability (including of commonly-required groupings, for example by region of the world or type of weaponry).
Graphs of values of items licensed are provided alongside listings of licences.
Revoked licences are identified with the initial licence approvals.
Individual pages/searches (unique urls) can be linked to directly.
The full raw data is available as csv files for download.

And as Ian Prichard, CAAT Research Co-ordinator, puts it:

It is hard to think of an area of government activity that demands transparency more than arms export licensing.

The lack of access to detailed, easy-to-access information has been a barrier to the public, media and parliamentarians being able to question government policies and practices. These practices include routine arming of authoritarian regimes such as Saudi Arabia, Bahrain and Egypt.

As well as providing more information in and of itself, we hope the web app will prompt the government to apply its own open data policies to arms exports, and substantially increase the level and accessibility of information available.

Perhaps projects like CAAT’s can help bring back the ‘hard political edge’ Robinson and Yu describe in the heritage of ‘open government’. They certainly emphasise the need for a ‘right to data’ rather than just access to data existing as a general policy subject to the decisions of those in power.

NT Open Data Days: Exploring data flow in a VCO

[Summary: A practical post of notes from a charity open data day. Part reflective learning; part brain-dump; part notes for ECDP]

Chelmsford was my destination this morning for a Nominet Trust funded ‘Open Data Day’ with the Essex Coalition of Disabled People (ECDP). The Open Data Days are part of an action research exploration of how charities might engage with the growing world of open data, both as data users and publishers. You can find a bit more of the context in this post on my last Open Data Day with the Nominet Trust team.

This (rather long and detailed) post provides a run down of what we explored on the day as a record for the ECDP team, and as a resource for wider shared learning.

Seeking structures and managing data

For most small organisations, data management often means Excel spreadsheets, and ECDP is no exception. In fact, ECDP has a lot of spreadsheets on the go. Different teams across the organisation maintain lists of volunteers, records about service users, performance data, employment statistics, and a whole lot more, in individual Excel workbooks. Bringing that data together to publish the ‘Performance Dashboards’ that ECDP built for internal management, but that have also been shared in the open data area of the ECDP website, is a largely manual task. Across these spreadsheets it’s not uncommon to see the information on a particular topic (e.g. volunteers), spread across different tabs, or duplicated into different spreadsheets where staff have manually copied filtered extracts for particular reports. The challenge with this is that it leads the organisation’s information to fragment, and makes pulling together both internal and open data and analysis tricky. Many of the spreadsheets we found during the open day mix the ‘data layer’ with ‘presentation’ and ‘analysis’ layers, rather than separating these out.

What can be done?

Before getting started with open data, we realised that we needed to look at the flow of data inside the organisation. So, we looked at what makes a good data layer in a spreadsheet, such as:

Keeping all the data of one type in a single worksheet. For example, if you have data on volunteers all the data should be in a single sheet. Don’t start new sheets for ‘Former volunteers’, or ‘Volunteers interested in sports’ – as this fragments the data. If you need to know about a volunteer’s interests, or whether they are active or not, add a column to your main sheet, and use filters (see below).
Having one header row of columns. You can use merged cells, sub-headings and other formatting when you present data – but when you use these in the master spreadsheet where you collect and store your data you make life trickier for the computer to understand what your data is, and to support different analysis of the data in future.
Including validation… Excel allows you to define a list of possible values for a cell, and provides users entering data with a drop-down box to select from instead of them typing values in by hand. This really helps increase the consistency of data. You can also validate to be sure the entry in a cell is a number, or a date, and so on. In working on some ECDP prototypes we ran up against a problem where our list of possible valid entries for a cell was too long, and we didn’t want to keep the master-list of valid values in the Excel sheet our data was on, but Wizard of Excel has documented a workaround for that.
…but keeping some flexibility. Really strict validation has its own problems, as it can force people to twist what they wanted to record to fit in a structure that doesn’t make sense, or that distorts the data. For example, in some spreadsheets we found the ‘Staff member responsible’ column often had more than one name in. We had to explore why that was, and whether the data structure needed to accommodate more than one staff member linked to a particular row in the spreadsheet. Keeping a spreadsheet structure flexible can be a matter of providing free text areas where users are not constrained in the detail they provide, and of having a flexible process to revise and update structures according to demand.

Once you have a well structured spreadsheet (see the open data cookbook section on preparing your data if you still need to get a sense of what well structured data might look like), then you can do a lot more with it. For example:

Creating a pivot chart. Pivot charts are a great way to analyse data, and are well worth spending time to explore. Many of the reporting requirements an organisation has can be met using a pivot chart. For ECDP we created an example well-structured dataset of ‘Lived Experience Feedback’ – views and insights provided by service users and recorded with detailed descriptions, dates when the feedback was given, and categories highlighting the topical focus of the views expressed. We made all this data into an Excel list, which allowed us to add a formula that would apply to every row and that used the =MONTH() function to extract the month from the dates given in each row. Creating a pivot chart from this list, we could then drill down to find figures such as the number of Lived Experience reports provided to the Insight team and relating to ‘Employment’ in any given month.
Creating filtered lists and dashboards. It can seem counterintuitive to an organisation which mostly wants to see data in separate lists by area, or organisational team, to put all the data for those areas and teams into one spreadsheet, with just a column to flag up which team or area a row relates to. That’s why spreadsheets often end up with different tabs for different teams – where the same sort of data is spread across them. Using formulae to create thematic lists and dashboards can be a good way to keep teams happy, whilst getting them to contribute to a single master list of data. (We spent quite a lot of time on the open data day thinking about the importance of motivating staff to provide good quality data, and the need to make the consequences of providing good data visible.) Whilst the ‘Autofilter’ feature in Excel can be used to quickly sub-set a dataset to get just the information you are interested in, when we’re building a spreadsheet to be stored on a shared drive, and used by multiple teams, we want to avoid confusion when the main data sheet ends up with filters applied. So instead we used simple cross-sheet formulae (e.g. if your main data sheet is called ‘Data’, then put ='Data'!A1 in the top-left cell of a new sheet, and then drag it out) to create copies of the master sheet, and then applied the filters to these. We included a big note on each of these extra sheets to remind people that any edits should be made to the master data, not these lists.
Linking across spreadsheets. Excel formulae can be used to point to values not just in other sheets, but also to values in other files. This makes it possible to build a dashboard that automatically updates by running queries against other sheets on a shared drive. Things get even more powerful when you are able to publish datasets to the web as open data, when tools like Google Docs have the ability to pull in values and data across the web, but even with non-open data in an organisation, there should be no need to copy and paste values that could be transferred dynamically and automatically.
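
For anyone who prefers to see the logic written out, the month-by-category pivot described in the list above can be sketched in a few lines of Python. This is only a rough illustration: the column names (`date`, `category`) and the sample rows are made up, and it is not the ECDP workbook itself.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows standing in for a 'Lived Experience Feedback' list.
feedback = [
    {"date": date(2012, 1, 14), "category": "Employment"},
    {"date": date(2012, 1, 30), "category": "Employment"},
    {"date": date(2012, 2, 3), "category": "Transport"},
    {"date": date(2012, 2, 21), "category": "Employment"},
]


def pivot_by_month(rows):
    """Count rows for each (month, category) pair, like a simple pivot table."""
    counts = Counter()
    for row in rows:
        # Equivalent of Excel's =MONTH(): take the month number from the date.
        counts[(row["date"].month, row["category"])] += 1
    return counts


pivot = pivot_by_month(feedback)
print(pivot[(1, "Employment")])  # number of 'Employment' reports in January
```

The point is the same as with the Excel list: because every row sits in one table with a date and a category column, any month or topic breakdown is just a different way of counting the same rows.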

Of course, when you’ve got lots of legacy spreadsheets around, then making the shift to more structured data, separating the data, analysis and presentation layers, can be tricky. Fortunately, some of the common tools in the open data wrangler’s toolbox come in handy here.

To move from a spreadsheet with similar data spread across lots of different tabs (one for each team that produced that sort of data), to one with consistent and standardised data, we copied all the data into a single sheet with one header row, and a new column indicating the ‘team’ that row was from (we did this by saving each of the sheets as .csv files, and using the ‘cat’ command on Mac OSX to combine these together, but the same effect can be got with copy and paste).
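
As a rough sketch of that combining step (not the exact commands we ran on the day), the same thing can be done with Python’s built-in csv module. The team names, column names and sample values below are all hypothetical, and the sketch assumes every tab shares the same header row.

```python
import csv
import io


def combine_team_sheets(sheets):
    """Merge per-team CSV text into one list of rows with a 'team' column.

    `sheets` maps a team name to the CSV text exported from that team's tab.
    All sheets are assumed to share the same header row.
    """
    combined = []
    for team, csv_text in sheets.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["team"] = team  # record which tab this row came from
            combined.append(row)
    return combined


# Hypothetical exports from two team tabs.
sheets = {
    "Insight": "name,area\nA. Example,Chelmsford\n",
    "Volunteering": "name,area\nB. Example,Colchester\n",
}
rows = combine_team_sheets(sheets)
print(rows)
```

One difference from plain concatenation is that each file’s header row is read as a header rather than copied into the data, so there are no stray header rows to delete afterwards.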

We then turned to the open data wrangler’s power tool Google Refine (available as a free download) to clean up the data. We used ‘text facets’ to see where people had entered slightly different names for the same area or theme, and made bulk edits to these, and used some replacement patterns to tidy up date values.
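
To give a feel for what spotting ‘slightly different names for the same area’ involves, here is a minimal Python sketch that groups values which differ only in case, punctuation or word order. It is an illustration of the general idea, not Google Refine’s own code, and the normalisation rule and sample area names are assumptions made for the example.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a key that ignores case, punctuation and word order."""
    no_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(no_punctuation)
    return " ".join(sorted(set(cleaned.split())))


def group_variants(values):
    """Group raw values whose keys match, so they can be bulk-edited together."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only keys with more than one spelling need attention.
    return {key: found for key, found in groups.items() if len(found) > 1}


# Hypothetical entries from an 'area' column.
areas = ["Mid Essex", "mid essex", "Essex, Mid", "South East Essex"]
variants = group_variants(areas)
print(variants)
```

A rule like this will miss genuine typos (it only catches differences in case, punctuation and word order), which is why eyeballing the facet list still matters.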

We then took this data back into Excel to build a master spreadsheet, with one single ‘Data’ sheet, and separate sheets for pivot chart reports and filtered lists.

The whole process once started took an hour or so, but once complete, we had a dataset that could be analysed in many more ways than before, and we had the foundations for building both better internal data flows, and for extracting open data to share.

Heading towards a CRM

As much as, with the appropriate planning, discipline and stewardship, Excel can be used to manage a lot of the data an organisation needs, we also explored the potential to use a fully-featured ‘Contact Relationship Management’ system (CRM) to record information right across the organisation.

Even when teams and projects in an organisation are using well structured spreadsheets, there are likely to be overlaps and links between their datasets that are hard to make unless they are all brought into one place. For example, two teams might be talking to the same person, but if one knows the person as Mr Rich Watts, and the other records R. Watts, bringing together this information is tricky. A CRM is a central database (often now accessed over the web) which keeps all this information in one place.

Modern CRM systems can be set up to track all sorts of interactions with volunteers, customers or service users, both to support day to day operations, and to generate management information. We looked at the range of CRM tools available, from the Open Source ‘CiviCRM’ which has case tracking modules that may be useful to an organisation like ECDP, through to tools like Salesforce, which offer discounts to non-profits. Most CRM solutions have free online trials. LASA’s ICT Knowledge Base is a great place to look for more support on exploring options for CRM systems.

In our open data day we discussed the importance of thinking about the ‘user journey’ that any database needs to support, and ensuring the databases enable, rather than constrain, staff. Any process of implementing a new database is likely to involve some changes in staff working practices too, so it’s important to look at the training and culture change components as well as the technical elements. This is something true of both internal data, and open data, projects.

When choosing CRM tools it’s important to think about how a system might make it possible to publish selected information as open data directly in future, and how they might be able to pull in open data.

Privacy Matters

Open data should not involve the release of people’s personal data. To make open data work, a clear line needs to be drawn between data that identifies and is about individuals, and the sorts of non-personal data that an organisation can release as open data.

Taking privacy seriously matters:

Data anonymisation cannot be relied upon. Studies conclusively show that we should not put our faith in anonymisation to protect individuals’ identities in published datasets. It’s not enough to simply remove names or dates of birth from a dataset before publishing it.
Any release of data drawn from personal data needs to follow from a clear risk assessment. It’s important to consider what harm could result from the release of any dataset. For example, if publishing a dataset that contains information on reported hate crime by post-code area, could a report being traced back to an individual lead to negative consequences for them?
It’s important to be aware of jigsaw re-identification risks. Jigsaw re-identification is the risk that putting together two open datasets will allow someone to unlock previously anonymised personal data. For example, if you publish one open dataset that maps where users of your service are, and includes data on types of disability, and you publish another dataset that lists reports of hate-crime by local area, could these be combined to discover the disability of the person who reported hate crime in a particular area, and then, perhaps combined with some information from a social network like Facebook or Twitter, to identify that person?

Privacy concerns don’t mean that it’s impossible to produce open data from internal datasets of personal information, but care has to be taken. There can be tension between the utility of open data, and the privacy of personal data in a dataset. Organisations need to be careful to ensure privacy concerns and the rights of service users always come first.

With the ECDP data on ‘Lived Experience’ we looked at how Google Refine could be used to extract from the data a list of ‘PCT Areas’ and ‘Issue Topics’ reported by service users, to map where the hot-spots were for particular issues at the PCT level. Whilst drawn from a dataset with personal information, this dataset would not include any Personally Identifying Information, and may be possible to publish as open data.
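
We did this extraction in Google Refine, but the shape of it can be sketched in Python: keep only the area and topic from each record, and publish counts rather than rows. The field names (`pct_area`, `issue_topic`) and the sample records here are hypothetical.

```python
from collections import Counter


def area_topic_counts(records):
    """Count reports per (PCT area, issue topic), discarding all other fields."""
    return Counter((r["pct_area"], r["issue_topic"]) for r in records)


# Hypothetical records; the 'name' field never reaches the output.
records = [
    {"name": "Person A", "pct_area": "Mid Essex", "issue_topic": "Employment"},
    {"name": "Person B", "pct_area": "Mid Essex", "issue_topic": "Employment"},
    {"name": "Person C", "pct_area": "South East Essex", "issue_topic": "Transport"},
]
counts = area_topic_counts(records)
for (area, topic), total in sorted(counts.items()):
    print(area, topic, total)
```

Dropping the personal fields is only the first step. Very small counts in a small area could still point towards an individual, so an output like this would still need the risk assessment described above before being published.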

Open data fusion

Whilst a lot of our ‘open data day’ was spent on the foundations for open data work, rather than open data itself, we did work on one small project which had an immediate open data element.

Rich Watts brought to the session a spreadsheet of 250 Disabled People’s User Led Organisations (DPULOs), and wanted to find out (a) how many of these organisations were charities; and (b) what their turnover was. Fortunately, Open Charities has gathered exactly the data needed to answer that question as open data, and so we ran through how Google Fusion Tables could be used to merge Rich’s spreadsheet with existing charity data (see this How To for an almost identical project with Esmee Fairbairn grants data), generating the dataset needed to answer these questions in just under 10 minutes.
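
The merge itself is a simple join on a shared column. As a minimal sketch of the idea in Python (rather than Fusion Tables), and assuming made-up field names that are not the real Open Charities schema, it might look like this:

```python
def normalise(name):
    """Lower-case a name and collapse repeated spaces so near-identical names match."""
    return " ".join(name.lower().split())


def merge_with_charities(organisations, charities):
    """Keep every organisation, adding charity details where the names match."""
    by_name = {normalise(c["title"]): c for c in charities}
    merged = []
    for org in organisations:
        match = by_name.get(normalise(org["name"]))
        merged.append({
            "name": org["name"],
            "is_charity": match is not None,
            "charity_number": match["charity_number"] if match else None,
            "income": match["income"] if match else None,
        })
    return merged


# Hypothetical rows: a list of organisations and an extract of charity data.
organisations = [{"name": "Example DPULO"}, {"name": "Another  Group"}]
charities = [{"title": "example dpulo", "charity_number": "0000001", "income": 12000}]
merged = merge_with_charities(organisations, charities)
print(merged)
```

Matching on names is fragile, so where a spreadsheet already holds registered charity numbers, joining on those would be the safer choice.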

We discussed how Rich might want to publish his spreadsheet of DPULOs as open data in future, or to contribute information on the fact that certain charities are Disabled People’s User Led Organisations back to an open data source like Open Charities.

Research Resources

The other element to our day was an exploration of online data sources useful in researching a local area, led fantastically by Matthew of PolicyWorks.

Many of the data sources Matthew was able to point to for finding labour market information, health statistics, demographic information and other stats provide online access to datasets, but don’t offer this as ‘open data’ that would meet the OKF’s open definition requirements, raising some interesting questions about the balance between a purist approach to open data, or an approach that looks for data which is ‘open enough’ for rough-and-ready research.

Where next?

Next week Nominet, NCVO and Big Lottery Fund are hosting a conference to bring together learning from all the different Open Data Days that have been taking place. The day will also see the release of a report on the potential of open data in the charity sector.

For me, today’s open data day has shown that we need to recognise some of the core data skills that organisations will need to benefit from open data. Not just skills to use new online tools, but skills to manage the flow of data internally, and to facilitate good data management. Investment in these foundations might turn out to be pivotal for realising open data’s third-sector potential…

5-Stars of Open Data Engagement?

[Summary: Notes from a workshop at UKGovCamp that led to sketching what a framework to encourage engagement and impact of open data initiatives might contain]

Update: The 5 Stars of Open Data Engagement now have their own website at http://www.opendataimpacts.net/engagement/.

In short

* Be demand driven

* * Provide context

* * * Support conversation

* * * * Build capacity & skills

* * * * * Collaborate with the community

The Context

I’ve spent the last two days at UKGovCamp, an annual open-space gathering of people from inside and around local and national government passionate about using digital technologies for better engagement, policy making and practice. This year’s event was split over two days: Friday for conversations and short open-space slots; Saturday for more hands-on discussions and action. Suffice to say, there were plenty of sessions on open data on both days – and this afternoon we tried to take forward some of the ideas from Day 1 about open data engagement in a practical form.

There is a general recognition of the gap between putting a dataset online, and seeing data driving real social change. In a session on Day 1 led by @exmosis, we started to dig into different ways to support everyday engagement with data, leading to Antonio from Data.gov.uk suggesting that open data initiatives really needed to have some sort of ‘Charter of engagement’ to outline ways they can get beyond simply publishing datasets, and get to supporting people to use data to create social, economic and administrative change. So, we took that as a challenge for Day 2, and in a session on ‘designing an engaging open data portal’ a small group of us (including Liz Stevenson, Anthony Zacharzewski, Jon Foster and Jag Goraya) started to sketch what a charter might look like.

You can see the (still developing) charter draft in this Google Doc. However, it was Jag Goraya’s suggestion that the elements of a charter we were exploring might also be distilled into a ‘5 Stars’ that seemed to really make some sense of the challenge of articulating what it means to go beyond publishing datasets to do open data engagement. Of course, 5-star rating scales have their limitations, but I thought it worth sharing the draft that was emerging.

What is Open Data Engagement?

We were thinking about open data engagement as the sorts of things an open data initiative should be doing beyond just publishing datasets. The engagement stars don’t relate to the technical openness or quality of the datasets (there are other scales for that), and are designed to be flexible to be able to apply to a particular dataset, a thematic set of datasets, or an open data initiative as a whole.

We were also thinking about open government data in our workshop; though hopefully the draft has wider applicability. The ‘overarching principles’ drafted for the Charter might also help put the stars in context:

Key principles of open government data: “Government information and data are common resources, managed in trust by government. They provide a platform for public service provision, democratic engagement and accountability, and economic development and innovation. A commitment to open data involves making information and data resources accessible to all without discrimination; and actively engaging to ensure that information and data can be used in a wide range of ways.”

Draft sketch of five stars of Open Data Engagement

The names and explanatory text of these still need a lot of work; you can suggest edits as comments in the Google Doc where they were drafted.

* Be demand driven

Are your choices about the data you release, how it is structured, and the tools and support provided around it based on community needs and demands? Have you got ways of listening to people’s requests for data, and responding with open data?

** Provide good meta-data; and put data in context

Does your data catalogue provide clear meta-data on datasets, including structured information about frequency of updates, data formats and data quality? Do you include qualitative information alongside datasets such as details of how the data was created, or manuals for working with the data? Do you link from data catalogue pages to analysis your organisation, or third-parties, have already carried out with the data, or to third-party tools for working with the data?

Often organisations already have detailed documentation of datasets (e.g. analysis manuals and How To’s) which could be shared openly with minimal edits. It needs to be easy to find these when you find a dataset. It’s also common that governments have published analysis of the datasets (they collected it for a reason), or used it in some product or service, and so linking to these from the dataset (and vice-versa) can help people to engage with it.

*** Support conversation around the data

Can people comment on datasets, or create a structured conversation around data to network with other data users? Do you join the conversations? Are there easy ways to contact the individual ‘data owner’ in your organisation to ask them questions about the data, or to get them to join the conversation? Are there offline opportunities to have conversations that involve your data?

**** Build capacity, skills and networks

Do you provide or link to tools for people to work with your datasets? Do you provide or link to How To guidance on using open data analysis tools, so people can build their capacity and skills to interpret and use data in the ways they want to? Are these links contextual (e.g. pointing people to GeoData tools for a geo dataset, and to statistical tools for a performance monitoring dataset)? Do you go out into the community to run skill-building sessions on using data in particular ways, or using particular datasets? Do you sponsor or engage with community capacity building?

When you give people tools – you help them do one thing. When you give people skills, you open the possibility of them doing many things in future. Skills and networks are more empowering than tools.

***** Collaborate on data as a common resource

Do you have feedback loops so people can help you improve your datasets? Do you collaborate with the community to create new data resources (e.g. derived datasets)? Do you broker or provide support to people to build and sustain useful tools and services that work with your data?

It’s important for all the stars that they can be read not just with engaging developers and techies in mind, but also community groups, local councillors, individual non-techie citizens etc. Providing support for collaboration can range from setting up source-code sharing space on GitHub, to hanging out in a community centre with print-outs and post-it notes. Different datasets, and different initiatives will have different audiences and so approaches to the stars – but hopefully there is a rough structure showing how these build to deeper levels of engagement.

Where next?

Hopefully Open Data Sheffield will spend some time looking at this framework at a future meeting – and all comments are welcome on the Google doc. Clearly there’s a lot to be done to make these more snappy, focussed and neat – but if we do find there’s a fairly settled sense of a five stars of engagement framework (if not yet good language to express it) then it would be interesting to think about whether we have the platforms and processes in place anywhere to support all of this: finding the good practice to share. Of course, there might already be a good engagement framework out there we missed when sketching this all out – so comments to that effect welcome too…

Updates:

Amended 22nd January to properly credit Antonio of Data.gov.uk as originator of the Charter idea

Exploring Open Charity Data with Nominet Trust

[Summary: notes from a pilot one-day workshop on open data opportunities in third-sector organisations]

On Friday I spent the day with Nominet Trust for the second of a series of charity ‘Open Data Days’ exploring how charities can engage with the rapidly growing and evolving world of open data. The goal of these hands-on workshops is to spend just one working day looking at what open data might have to offer to a particular organisation and, via some hands-on prototyping and skill-sharing, to develop an idea of the opportunities and challenges that the charity needs to explore to engage more with open data.

The results of ten open data days will be presented at a Nominet Trust, NCVO and Big Lottery Fund conference later in the year, but for now, here’s a quick run-down / brain-dump of some of the things explored with the Nominet Trust team.

What is Open Data anyway?

Open data means many different things to different people – so it made sense to start the day looking at different ways of understanding open data, and identifying the ideas of open data that chimed most with Ed and Kieron from the Nominet Trust Team.

The presentation below runs through five different perspectives on open data, from understanding open data as a set of policies and practices, to looking at how open data can be seen as a political movement or a movement to build foundations of collaboration on the web.

Reflecting on the slides with Ed and Kieron highlighted that the best route into exploring open data for Nominet Trust was looking at the idea that ‘open data is what open data does’ which helped us to set the focus for the day on exploring practical ways to use open data in a few different contexts. However, a lot of the uses of open data we went on to explore also chime in with the idea of a technical and cultural change that allows people to perform their own analysis, rather than just taking presentations of statistics and data at face value.

Mapping opportunities for open data

Even in a small charity there are many different places open data could have an impact. With Nominet Trust we looked at a number of areas where data is in use already:

Informing calls for proposals – Nominet Trust invite grant applications for ideas that use technology for disruptive innovation in a number of thematic areas, with two main thematic areas of focus live at any one time. New thematic areas of focus are informed by ‘State of the Art’ review reports. Looking at one of these it quickly becomes clear these are data-packed resources, but that the data, analysis and presentation are all smushed together.
Throughout the grant process – Nominet Trust are working not only to fund innovative projects, but also to broker connections between projects and to help knowledge and learning flow between funded projects. Grant applications are made online, and right now, details of successful applicants are published on the Trust’s websites. A database of grant investment is used to keep track of ongoing projects.
Evaluation – the Trust are currently looking at new approaches to evaluating projects, and identifying ways to make sure evaluation contributes not only to an organisation’s own reflections on a project, but also to wider learning about effective responses to key social issues.

With these three areas of data focus, we turned to identify three data wishes to guide the rest of the open data day. These were:

Being able to find the data we need when we need it
Creating actionable tools that can be embedded in different parts of the grant process – and doing this with open platforms that allow the Nominet Trust team to tweak and adapt these tools.
Improving evaluation – with better data in, and better data out

Pilots, prototypes and playing with data

The next part of our Open Data Day was to roll up our sleeves and to try some rapid experiments with a wide range of different open data tools and platforms. Here are some of the experiments we tried:

Searching for data

We imagined a grant application looking at ways to provide support to young people not in education, employment or training in the Royal Borough of Kensington and Chelsea, and set the challenge of finding data that could support the application, or that could support evaluation of it. Using the Open Data Cook Book guide to sourcing data, Ed and Kieron set off to track down relevant datasets, eventually arriving at a series of spreadsheets on education stats in London on the London Skills and Employment Observatory website via the London Datastore portal. Digging into the spreadsheets allowed the team to put claims that could be made about levels of education and employment exclusion in RBKC in context, looking at the different interpretations that might be drawn from claims made about trends and percentages, and claims about absolute numbers of young people affected.

Learning: The data is out there; and having access to the raw data makes it possible to fact-check claims that might be made in grant applications. But, the data still needs a lot of interpretation, and much of the ‘open data’ is hidden away in spreadsheets.

Publishing open data

Most websites are essentially databases of content with a template to present them to human readers. However, it’s often possible to make the ‘raw data’ underlying the website available as more structured, standardised open data. The Nominet Trust website runs on Drupal and includes a content type for projects awarded funding which includes details of the project, its website address, and the funding awarded.

Using a demonstration Drupal website we explored how, with the Drupal Views and Views Bonus Pack open source modules, it was easy to create a ‘CSV’ open data download of information in the website.

The sorts of ‘projects funded’ open data this would make available from Nominet Trust might be of interest to sites like OpenlyLocal.com which are aggregating details of funding to many different organisations.

Learning: You can become an open data publisher very easily, and by hooking into existing places where ‘datasets’ are kept, keeping your open data up-to-date is simple.

Mashing-up datasets

Because open datasets are often provided in standardised forms, and the licenses under which data is published allow flexible re-use of the data, it becomes easy to mash-up different datasets, generating new insights by combining different sources.

We explored a number of mash-up tools. Firstly, we looked at using Google Spreadsheets and Yahoo Pipes to filter a dataset ready to combine it with other data. The Open Data Cook Book has a recipe that involves scraping data with Google Spreadsheets, and a Yahoo Pipes recipe on combining datasets.

Then we turned to the open data powertool that is Google Refine. Whilst Refine runs in a web browser, it is software you install on your own computer, and it keeps the data on your machine until you publish it – making it a good tool for a charity to use to experiment with their own data, before deciding whether it will be published as open data or not.

We started by using Google Refine to explore data from OpenCharities.org – taking a list of all the charities with the word ‘Internet’ in their description that had been exported from the site, and using the ‘Facets’ feature (and a Word Facet) in Google Refine to look at the other terms they used in their descriptions. Then we turned to a simple dataset of organisations funded by Nominet Trust, and explored how by using API access to OpenlyLocal.com’s spending dataset we could get Google Refine to fetch details of which Nominet Trust funded organisations had also received money from particular local authorities or big funders like Big Lottery Fund and the Arts Council. This got a bit technical, so a step-by-step How To will have to wait – but the result was an interesting indication of some of the organisations that might turn out to be common co-funders of projects with Nominet Trust – a discovery enabled by those funders making their funding information available as open data.

Learning: Mash-ups can generate new insights – although many mash-ups still involve a bit of technical heavy-lifting and it can take some time to really explore all the possibilities.

Open data for evaluation

Open data can be both an input and an output of evaluation. We looked at a simple approach using Google Spreadsheets to help a funder create online evaluation tools for funded projects.

With a Google Docs account, we looked at creating a new ‘Form’. Google Forms are easy to create, and let you design a set of simple survey elements that a project can fill in online, with the results going directly into an online Google Spreadsheet. In the resulting spreadsheet, we added an extra tab for ‘Baseline Data’, and explored how the =ImportData() formula in Google Spreadsheets can be used to pull in CSV files of open data from a third party, keeping a sheet of baseline data up-to-date. Finally, we looked at the ‘Publish as a Web Page’ feature of Google Spreadsheets which makes it possible to provide a simple CSV file output from a particular sheet.

In this way, we saw that a funder could create an evaluation form template for projects in a Google Form/Spreadsheet, and with shared access to this spreadsheet, could help funded projects to structure their evaluations in ways that helped cross-project comparison. By using formulae to move a particular sub-set of the data to a new sheet in the Spreadsheet, and then using the ‘Publish as a Web Page’ feature, non-private information could be directly published as open data from here.
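
The ‘move a sub-set to a new sheet, then publish it’ step is really a whitelist of columns. As a hedged sketch of the same idea in Python (the column names and the sample response are hypothetical, and this is not how Google Spreadsheets does it internally):

```python
import csv
import io

# Hypothetical list of columns judged safe to publish.
PUBLIC_COLUMNS = ["project", "theme", "outcome_score"]


def public_csv(rows, columns=PUBLIC_COLUMNS):
    """Write only the whitelisted columns of each row out as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop any column not in the whitelist
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical evaluation response, including a private contact field.
responses = [
    {
        "project": "Demo project",
        "theme": "Digital inclusion",
        "outcome_score": 4,
        "contact_email": "someone@example.org",
    },
]
output = public_csv(responses)
print(output)
```

Listing the columns that may be published, rather than the ones that must be hidden, means a newly added private field stays out of the open data by default.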

Learning: Open data can be both an input to, and an output from, evaluation.

Embeddable tools and widgets

Working with open data allows you to present one interpretation or analysis of some data, but also allow users of your website or resources to dig more deeply into the data and find their own angles, interpretations, or specific facts.

When you add a ‘Gadget’ chart to a Google Spreadsheet of data you can often turn it into a widget to embed in a third party website. Using some of the interactive gadgets allows you to make data available in more engaging ways.

Platforms like IBM’s Many Eyes also let you create interactive graphs that users can explore.

Sometimes, interactive widgets might already be available, as in the case of Interactive Population pyramids from ONS. The Nominet Trust state of the art review on Aging and use of the Internet includes a static image of a population pyramid, but many readers could find the interactive version more useful.

Learning: If you have data in a report, or on a web page, you can make it interactive by publishing it as open data, and then using embeddable widgets.

Looking ahead

The Open Data Day ended with a look at some of the different ways to take forward learning from our pilots and prototypes. The possibilities included:

Sooner

Quick wins: Making funded project data available as structured open data. As this information is already published online, there are no privacy issues with making it available in a more structured format.
Developing small prototypes: taking the very rough proof-of-concept ideas from the Open Data Day a stage further, and using this to inform plans for future developments. Some of the prototypes might be interactive widgets.
A ‘fact check’ experiment: taking a couple of past grant applications, and using open data resources to fact-check the claims made in those applications. Reflecting on whether this process offers useful insights and how it might form part of future processes.
Commissioning open data along with research: when Nominet Trust commissions future State of the Art reviews it could include a request for the researcher to prepare a list of relevant open datasets as well, or to publish data for the report as open data.

Later

Explore open data standards such as the International Aid Transparency Initiative Standard for publishing project data in a more detailed form.
Building our own widgets and tools: for example, tools to help applicants find relevant open data to support their application, or tools to give trustees detailed information on applicant organisations to help their decision making.
Building generalisable tools and contributing to the growth of a common resource of software and tools for working with open data, as well as just building things for direct organisational use.

Where next?

This was just the second of a series of Open Data Days supported by Nominet Trust. I’m facilitating one more next month, and there is a team of other consultants working with varied other charities over the coming weeks. So far I’ve been getting a sense of the wide range of possible areas open data can fit into charity work (it feels quite like exploring the ways social media could work for charities did back in 2007/8…), but there’s also much work to be done identifying some of the challenges that charities might face, and sustainable ways to overcome them. Lots more to learn….

Evaluating the Autumn Statement Open Data Measures

[Summary: Is government meeting the challenge of building an open data infrastructure for the UK? A critical look at the Autumn Statement Open Data Measures.]

For open data advocates, the Chancellor’s Autumn Statement, published on Tuesday, underlined how far open data has moved from a small geeks’ issue to an increasingly common element in Government policy. The statement itself included a section announcing new data, and renewing the argument that Public Sector Information (PSI) can play a role in both economic growth and public service standards.

1.125 Making more public sector information available will help catalyse new markets and innovative products and services as well as improving standards and transparency in public services. The Government will open up access to core public datasets on transport, weather and health, including giving individuals access to their online GP records by the end of this Parliament. The Government will provide up to £10 million over five years to establish an Open Data Institute to help industry exploit the opportunities created through release of this data

And accompanying this the Cabinet Office published a paper of Further Detail on Open Data Measures in the Autumn Statement, including an update on the fate of the proposed Public Data Corporation consulted on earlier in the year. Although this paper includes a number of positive announcements when it comes to the release of new datasets such as detailed transport and train timetable data, the overall document shows that government continues to fudge key reforms to bring the UK’s open data infrastructure into the 21st Century, and displays some worrying (though perhaps unsurprising) signs of open data rhetoric being hijacked to advance non-open personal data sharing projects, and highly political uses of selective open data release.

In order to put forward a constructive critique, let us take the government’s intent at face value (the intent to use PSI and open data to promote economic growth, and to improve standards in public services), and then suggest where the Open Data Measures either fall short of this, or where they should otherwise give cause for concern.

A strategic approach to data?

Firstly, let’s consider the particular datasets being made available: there are commitments to provide train and bus timetable information, highways and traffic data, Land Registry ‘price paid’ data, Met Office weather data and Companies House datasets, all under some form of open license. However, the commitments to other datasets, such as key Ordnance Survey mapping data, train ticket price data, and the national address gazetteer, are much more limited, with only a limited ‘developers preview’ of the gazetteer being suggested. There appears to be little coherence to what is being made available as open data, nor a clear assessment of how the particular datasets in question will support economic development and public accountability. If we take seriously the idea that open government data provides key elements of infrastructure for both enterprise and civic engagement in a digital economy, then we need a clear strategic approach to build and invest in that infrastructure: focussing attention on the datasets that matter most rather than seeing piecemeal release of data [1].

Clear institutional arrangements and governance?

Secondly, although the much disliked ‘Public Data Corporation’ proposal, to integrate the main trading funds and establish a common (and non-open) regime for their data, has disappeared from the Measures, the alternative institutional arrangements right now appear inadequate to meet key goals of releasing infrastructure data to support economic development, and removing the inefficiencies in the current system, which has government buying data off itself, reducing usage and limiting innovation.

The Open Data Measures propose the creation of a ‘Public Data Group (PDG)’ to include the trading funds who retain their trading role, selling core data and value-added services, although with a new responsibility to better collaborate and drive efficiency. The responsibility to promote availability of open data is split off to a ‘Data Strategy Board (DSB)’, which, in the current proposal, will receive a subsidy in its first year to ‘buy’ data from the PDG for the public, and will in future years rely for its funding on a proportion of the dividends paid from the PDG. It is notable that the DSB is only responsible for ‘commissioning and purchasing of data for free release’ and not for ‘open’ release (the difference is in the terms of re-use of the data), which may mean in effect the DSB is only able to ‘rent’ data from the PDG, or that any data it is able to release will be a snapshot in time extract of core reference data, not a sustainable move of core reference data into the public domain.

So – in effect whilst the PDC has disappeared, and there is a split between the bodies with an interest in maximising return on data (PDG), and a body increasing supply of public data (DSB) – the body seeking public data will be reliant upon the profitability of the PDG in order to have the funding it needs to secure the release of data that, if properly released in free forms, would likely undermine the current trading revenue model of the PDG. That doesn’t look like the foundation for very independent and effective governance or regulation to open up core reference data!

Furthermore, whilst the proposed terms for the DSB state that “Data users from outside the public sector, including representatives of commercial re-users and the Open Data community, will represent at least 30% of the members of DSB”, there are also challenges ahead to ensure data users from civil society interests are represented on the board, including established civil society organisations from beyond the technology-centric element of the open data community (the local authority or government members of the board will not be ‘open data’ people, but simply data people – who want better access to the resources they may already be using; we should be identifying similar actors from civil society to participate – understanding the role of the DSB as one of data governance through the framework of an open data strategy).

Open data as a cloak for personal data projects and political agendas?

Thirdly, and turning to some of the other alarm bells that ring in the Open Data Measures, the first measures in the Cabinet Office’s paper are explicitly not about open data as public data, but are about the restricted sharing of personal medical records with life-science research firms – with the intent of developing this sector of the economy. With a small nod to “identifying specified datasets for open publication and linkage”, the proposals are more centrally concerned with supporting the development of a Clinical Practice Research Datalink (CPRD) which will contain interlinked ‘unidentifiable, individual level’ health records, by which I interpret the ability to identify a particular individual with some set of data points recorded on them in primary and secondary care data, without the identity of the person being revealed.

The place of this in open data measures raises a number of questions, such as whether the right constituencies have been consulted on these measures and why such a significant shift in how the NHS may be handling citizens’ personal data is included in proposals unlikely to be heavily scrutinised by patient groups? In the past, open data policies have been very clear that ‘personal data’ is out of scope – and the confusion here raises risks to public confidence in the open data agenda. Leaving this issue aside for the moment, we also need to critically explore the evidence that the release of detailed health data will “reinforce the UK’s position as a global centre for research and analytics and boost UK life sciences”. In theory, if life science data is released digitally and online, then the firms that can exploit it are not only UK firms – but the return on the release of UK citizens’ personal data could be gained anywhere in the world where the research skills to work with it exist.

When we look at the other administrative datasets proposed for release in the Measures the politicisation of open data release is evident: Fit Note Data; Universal Credit Data; and Welfare Data (again discussed for ‘linking’ implying we’re not just talking about aggregate statistics) are all proposed for increased release, with specific proposals to “increase their value to industry”. By contrast, there is no mention of releasing more details on the tax share paid by corporations, where the UK issues arms export licenses, or which organisations are responsible for the most employment law violations. Although the stated aims of the Measures include increasing “transparency and accountability”, it would not be unreasonable to read the detail of the measures as very one-sided on this point, emphasising industry exploitation of data far more than good governance and citizen rights with respect to data.

The blurring of the line between ‘personal data’ and ‘open data’, and the state’s assumption of the right to share personal data for industrial gain, should give cause for concern, and highlight the need to build a stronger constituency scrutinising government open data action.

Building capacity to use data?

Fourthly, and perhaps most significantly if we are taking seriously the goal of seeing open data not only lead to economic development, but also to better public services, the Measures offer little funding or support for the sorts of skills development and organisational change that will be needed for effective use of open data in the UK.

The Measures announce the creation of an Open Data Institute, with the possibility of £10m match funding over 5 years, to “help business exploit the opportunities created by release of public data”, which does have the potential to direct much-needed research at the gap in understanding and practice on how to build sustainable enterprise with open data. However, beyond this, there is little in the measures to foster the development of data skills more widely in government, in the economy and in civil society.

We know that open data alone is not enough to drive innovation: it’s a raw material to be combined with others in an information economy and information society. There are significant skills development needs to equip the UK to make the most of open data – and the Measures fall short on meeting that challenge.

A constructive critique?

Many of the detailed measures from the Autumn Statement are still draft – subject to further consultation. As a package, it’s not one to be accepted or rejected out of hand. Rather – there is a need for continued engagement by a broad constituency, including members of the broad based ‘open data community’ to address the measures one-by-one as government works to fill in the details over coming months.

Footnotes

[1] An open data infrastructure: The idea of open data as digital infrastructure for the nation has a number of useful consequences. It can help us to develop our thinking about the state’s responsibility with respect to datasets. Just as in the development of our physical infrastructure the state has invested directly in provision of roads and railways, adopted previously privately created infrastructure (the turnpikes, for example), and encouraged private investment within frameworks of government regulation, a strategic approach to public data infrastructure would not just be about pre-existing datasets having an open license slapped on them – but would involve looking at a range of strategies to provide the open data foundations for economic and civic activity. Government may need to act as guarantor of specific datasets, if not core provider. When we think about infrastructure projects, we can think critically about who benefits from particular projects, and can have an open debate about where limited state resources to support a sustainable open data infrastructure should go. The infrastructure metaphor also helps us start to distinguish different sorts of government data, recognising that performance data and personal data may need to be handled within different arrangements and frameworks from core reference data like mapping and transport systems information. In the latter case, there is a strong argument to secure a guarantee of the continued funding of these resources as public goods, free at the point of use, kept in public trust, and maintained to high standards of consistency. Other arrangements are likely to lead to over-charging and under-use of core reference datasets, with deadweight loss of benefit – and particularly excluding civic uses and benefits.
In the case of other datasets generated by government in the day to day conduct of business (performance data; aggregate medical records, etc.), it may be more appropriate to recognise that while there is benefit to be gained from the open release of these (a) for civic use; and (b) for commercial use, this will vary significantly on a case-by-case basis, and the release of the data should not create an ongoing obligation on government to continue to collect and produce the data once it is no longer useful for government’s primary purpose.

Open Personae: a step towards user-centred data developments?

[Summary: reflections on data-shaped design, and adding user personae as a new raw material in working with open data]

A lot has been written recently about the fact that open data alone is not enough to make a difference. Data needs to be put into the hands of those who can use it to make a difference, and if the only way to do that is as a programmer, or someone with the resources to hire one, we end up with a bigger, rather than narrower, data divide.

Infomediaries, with the technical skills to take data and create accessible interfaces onto it; to integrate it into existing systems; and to make it accessible to be communicated to those who need it, are a key part of the solution. However, unlike common software and resource development challenges, which often start from a clearly articulated problem and user needs, and then work backwards to source data and information, open data projects often have a different structure. A need is recognized; data is identified; data is opened; and then from the data applications and resources are built. The advantage of open data is that, rather than data being accessed just to solve one particular problem, it is now available to be used in a wide range of problem solving. But there is a risk that the structure of the open data process introduces a disconnect: specific problems drive demands for open data, but open data offers general solutions – and those with the skills to work with data may not be aware of, or connected with, the specific problems that motivated the desire to open the data in the first place; nor with other specific problems which the data, now it is open, can be part of solving.

When open data is the primary raw material for a project, that data can exert a powerful influence in shaping the design of the project and its outputs. The limitations of the data quickly become accepted as limitations of the application; the structure of the data is often presented to the user, regardless of whether this is the structure of information they need to be able to use the application effectively. Data-shaped design is not necessarily good design. But finding ways to put users back at the heart of projects, and adopt user-centered design approaches to working with data can be a challenge.

The frictionless nature of accessing data contrasts heavily with the friction involved in identifying and working with potential users of a data-driven application. For technical developers interested in experimenting with data in hack-day contexts*, or working in small, time and resource-limited, projects, the overheads of user engagement are a big ask. It’s an even bigger challenge in projects like the International Aid Transparency Initiative (IATI), where with aidinfo labs I’m currently trying to support development of infomediary apps and resources for users spread across the globe: users who might be in low-bandwidth/limited Internet access environments, or in senior governmental positions, where engagement in a user-workshop is not easy to secure.

So – without ignoring the need to have real user engagement in a project – one of the things we’re just starting to experiment with in the aidinfo labs project, is adding another raw material alongside our open data. We’re creating a set of ‘open personae’ – imaginary profiles of potential users of applications and resources built with IATI data, designed to help techies and developers gain insights into the people who might benefit from the data, and to help provide a clearer idea of some of the challenges applications need to meet.

So far we’ve created four personae (borrowing one from another project), simply working in open Google Docs so that we can collaboratively draft them, and leave them open to comment to help them develop. And we’re planning to create lots more over the coming months (with fantastic support from Tara Burke who is researching and writing a lot of the profiles), created as an open resource so others can use them too.

I’m keen to explore how these personae can provide a first step to greater user-centered design in data use – and how we can use them as an intuitive tool for us to explore who is being best served by the eco-system of applications and infomediaries around IATI data. I’m also curious about the potential for a wider library of open personae to be used to help other open data projects include users as a key raw material for app building.

If ‘Data + Data-use skills + Involvement of Users’ is a part of ‘effective use’ of open data, then ‘Data + Skills + Understanding of users’ must be a step in the right direction…