Untangling the data debate

[Cross posted from my PhD blog where I’m trying to write a bit more about issues coming up in my current research…]

This post is also available as a two-page PDF here.

Untangling the data debate: definitions and implications

Data is a hot topic right now: from big data, to open data and linked data, entrepreneurs and policy makers are making big claims about ‘data revolutions’. But, not all ‘data’ are the same, and good decision making about data involves knowing the differences.

Big data

Definition: Data that requires ‘massive’ computing power to process (Crawford & Boyd, 2011).

Massive computing power, originally only available on supercomputers, is increasingly available on desktop computers or via low cost cloud computing.

Implications: Companies and researchers can ‘data mine’ vast data resources, to identify trends and patterns. Big data is often generated by combining different datasets.

Digital traces from individuals and companies are increasingly captured and stored for their potential value as ‘big data’.

Raw data

Definition: Primary data, as collected or measured direct from the source. Alternatively, data in a form that allows it to be easily manipulated, sorted, filtered and remixed.

Implications: Access to raw data allows journalists, researchers and citizens to ‘fact check’ official analysis. Programmers are interested in building innovative services with raw data.
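As a rough illustration of what ‘fact checking’ from raw data can look like in practice, the sketch below recomputes a headline total from row-level data and compares it with a published figure. The CSV content, the column names and the published total are all invented for the example; a real check would read the publisher's actual files.

```python
import csv
import io

# Hypothetical raw data, invented purely for illustration: one row per service.
RAW_CSV = """service,visits
Library,120
Leisure centre,80
Advice desk,45
"""

# Hypothetical headline figure from an official summary (also invented).
PUBLISHED_TOTAL = 250


def total_from_raw(csv_text: str, column: str) -> int:
    """Sum one numeric column across every row of the raw CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row[column]) for row in reader)


def check_claim(csv_text: str, column: str, claimed: int) -> int:
    """Return the gap between the claimed total and the recomputed one."""
    return claimed - total_from_raw(csv_text, column)


if __name__ == "__main__":
    gap = check_claim(RAW_CSV, "visits", PUBLISHED_TOTAL)
    print(f"Published figure differs from the recomputed total by {gap}")
```

A non-zero gap does not show which figure is wrong; it only flags that the official analysis and the raw data disagree and that a question is worth asking.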

Real-time data

Definition: Data measured and made accessible with minimal delay. Often accessed over the web as a stream of data through APIs (Application Programming Interfaces).

Implications: Real-time data supports rapid identification of trends. Data can support the development of ‘early warning systems’ (e.g. Google Flu Trends; Ushahidi). ‘Smart systems’ and ‘smart cities’ can be configured to respond to real-time data and adapt to changing circumstances.
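To make the ‘early warning’ idea a little more concrete, here is a minimal sketch of one possible rule: flag any reading that is more than twice the average of the few readings before it. The rule, the window size and the sample numbers are assumptions chosen for illustration, not a description of how Google Flu Trends or Ushahidi work.

```python
from collections import deque
from typing import Iterable, Iterator


def early_warnings(
    readings: Iterable[float], window: int = 3, factor: float = 2.0
) -> Iterator[int]:
    """Yield the position of each reading that exceeds `factor` times
    the average of the previous `window` readings."""
    recent = deque(maxlen=window)  # holds only the most recent readings
    for index, value in enumerate(readings):
        # Only judge a reading once a full window of history exists.
        if len(recent) == window and value > factor * (sum(recent) / window):
            yield index
        recent.append(value)


if __name__ == "__main__":
    # Invented stream of counts arriving one at a time.
    stream = [10, 12, 11, 13, 40, 12]
    print(list(early_warnings(stream)))
```

Because the function consumes readings one at a time, it could sit on top of any stream of values, for instance numbers pulled from an API as they arrive.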

Open data

Definition: Datasets that are made accessible in non-proprietary formats under licenses that permit unrestricted re-use (OKF – Open Knowledge Foundation, 2006). Open government data involves governments providing many of their datasets online in this way.

Implications: Third-parties can innovate with open data, generating social and economic benefits. Citizens and advocacy groups can use open government data to hold state institutions to account. Data can be shared between institutions with less friction.

Personal/private data

Definition: Data about an individual that they have a right to control access to. Such data might be gathered by companies, governments or other third-parties in order to provide a service to someone, or as part of regulatory and law-enforcement activities.

Implications: Many big and raw datasets are based on aggregating personal data, and combining them with other data. Effective anonymisation of personal data is difficult, particularly when open data provides the pieces for ‘jigsaw identification’ of private facts about people (Ohm, 2009).
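A small sketch may help show why ‘jigsaw identification’ is hard to prevent. Below, an ‘anonymised’ dataset with names removed is matched against a separate public list using two shared fields. Every record, name and field here is invented for illustration; the point is only that a unique match on ordinary-looking fields can be enough to re-attach a name to a private fact.

```python
from collections import defaultdict

# Invented 'anonymised' records: names removed, other fields kept.
ANONYMISED = [
    {"postcode_area": "OX1", "birth_year": 1980, "condition": "asthma"},
    {"postcode_area": "OX1", "birth_year": 1975, "condition": "diabetes"},
    {"postcode_area": "SO17", "birth_year": 1980, "condition": "migraine"},
]

# Invented public list sharing the same two fields, plus names.
PUBLIC = [
    {"name": "A. Example", "postcode_area": "OX1", "birth_year": 1980},
    {"name": "B. Example", "postcode_area": "OX1", "birth_year": 1975},
    {"name": "C. Example", "postcode_area": "SO17", "birth_year": 1980},
    {"name": "D. Example", "postcode_area": "SO17", "birth_year": 1980},
]

KEYS = ("postcode_area", "birth_year")


def reidentify(anonymised, public, keys=KEYS):
    """Map a name to an anonymised record wherever the shared fields
    match exactly one person in the public list."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for row in anonymised:
        names = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match gives the jigsaw its last piece
            matches[names[0]] = row
    return matches


if __name__ == "__main__":
    for name, row in reidentify(ANONYMISED, PUBLIC).items():
        print(name, "->", row["condition"])
```

In this invented data two of the three records can be tied to a single name, while the third stays ambiguous because two people share the same postcode area and birth year.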

Linked data

Definition: Datasets published in the RDF format, using URIs (web addresses) to identify the elements they contain, with links made between datasets (Berners-Lee, 2006; Shadbolt, Hall, & Berners-Lee, 2006).

Implications: A ‘web of linked data’ emerges, supporting ‘smart applications’ (Allemang & Hendler, 2008) that can follow the links between datasets. This provides the foundations for the Semantic Web.
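The idea of following links between datasets can be sketched without any special tooling. Below, two tiny datasets are written as subject-predicate-object triples, the basic shape of RDF, and a helper follows a link from one into the other. All the example.org URIs, the vocabulary terms and the data are hypothetical, and real applications would normally use an RDF library and published vocabularies rather than plain tuples.

```python
# Hypothetical vocabulary terms (invented for this sketch).
NAME = "http://example.org/vocab/name"
LOCATED_IN = "http://example.org/vocab/locatedIn"

SCHOOL = "http://example.org/schools/1"
PLACE = "http://example.org/places/oxford"

# Dataset A describes a school and links out to a place by its URI.
DATASET_A = [
    (SCHOOL, NAME, "Hilltop School"),
    (SCHOOL, LOCATED_IN, PLACE),
]

# Dataset B, published separately, describes the place itself.
DATASET_B = [
    (PLACE, NAME, "Oxford"),
]


def objects(triples, subject, predicate):
    """Return every object attached to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def place_names_for(school_uri, triples):
    """Follow the locatedIn link from a school, then read the place's name."""
    names = []
    for place_uri in objects(triples, school_uri, LOCATED_IN):
        names.extend(objects(triples, place_uri, NAME))
    return names


if __name__ == "__main__":
    # With only dataset A the link leads nowhere; with both it resolves.
    print(place_names_for(SCHOOL, DATASET_A))
    print(place_names_for(SCHOOL, DATASET_A + DATASET_B))
```

Because both datasets use the same URI for the place, combining them is just a matter of putting the triples side by side; no bespoke matching step is needed.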

More dimensions of data:

These are just a few different types of data commonly discussed in policy debates. There are many other data-distinctions we could also draw. For example: we can look at whether data was crowd-sourced, statistically sampled, or collected through a census. The content of a dataset also has an important influence on the implications that working with that data will have: an operational dataset of performance statistics is very different from a geographical dataset describing the road network, for example.

Crossovers and conflicts:

Almost all of the above types of data can be found in combination: you can have big linked raw data; real-time open data; raw personal data; and so on.

There are some combinations that must be addressed with care. For example, ‘open data’ and ‘personal data’ are two categories that are generally kept apart for good reason: open data involves giving up control over access to a dataset, whilst personal data is the data an individual has the right to control access to.

These can be found in combination on platforms like Twitter, when individuals choose to give wider access to personal information by sharing it in a public space, but this is different from the controller of a dataset of personal data making that whole dataset openly available.

A nuanced debate:

It’s not uncommon to see claims and anecdotes about the impacts of ‘big data’ use in companies like Amazon, Google or Twitter being used to justify publishing ‘open’ and ‘raw data’ from governments, drawing on aggregated ‘personal data’. This sort of treatment glosses over the differences between types of data, the contents of the datasets, and the contexts they are used in. Looking to the potential of data use in different contexts, and transferring learning between sectors, can support economic and social innovation, but it also requires critical questions to be asked, such as:

  • What kind of data is this case describing?
  • Does the data I’m dealing with have similar properties?
  • Can the impacts of this data apply to the data I’m dealing with?
  • What other considerations apply to the data I’m dealing with?

Bibliography/further reading:

See http://www.opendataimpacts.net for ongoing work.

Allemang, D., & Hendler, J. A. (2008). Semantic web for the working ontologist: modeling in RDF, RDFS and OWL. Morgan Kaufmann. Retrieved from

Berners-Lee, T. (2006, July). Linked Data – Design Issues. Retrieved from http://www.w3.org/DesignIssues/LinkedData.html

Crawford, K., & Boyd, D. (2011). Six Provocations for Big Data.

Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. Practical Participation. Retrieved from http://www.practicalparticipation.co.uk/odi/report

OKF – Open Knowledge Foundation. (2006). Open Knowledge Definition. Retrieved March 4, 2010, from http://www.opendefinition.org/

Ohm, P. (2009). Broken promises of privacy: Responding to the surprising failure of anonymization. Retrieved from http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=1450006

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web Revisited. IEEE intelligent systems, 21(3), 96–101.

Open Rights Group 2012 Conference

[Summary: A quick plug for the upcoming Open Rights Group conference on March 24th 2012.]

The Open Rights Group is a campaigning organisation focussed on protecting citizens’ rights in the digital age. From advocating for a proportional copyright system that doesn’t lead to rights-holders dictating the terms of Internet access, to scrutinising government policies on Internet filtering and blocking, protecting online freedoms, and digging into the detail of open data to balance benefits for society and individuals’ privacy, the Open Rights Group (ORG) is active on issues that are increasingly important to all of us. I recently joined the ORG Advisory Council to support work on open data, and have been really impressed to find an organisation committed to improving policy so that key digital (and thus, in a digital age, general) freedoms are not undermined by narrow or special-interest-driven policy making.

In a few Saturdays’ time ORG are holding their annual conference in London, and tickets are still on sale here.

There’s a great line-up of speakers and workshops, including a rare UK appearance by Lawrence Lessig, and keynotes from Cory Doctorow and Wendy Seltzer.

Plus, in one of the workshops I’m going to be putting some key questions to open data advocates Rufus Pollock and Chris Taggart, and another guest panelist, asking: Raw, big, linked and open: is all this data doing us, our economy and our democracy any good?

It would be great to see you there if you can make it…

Focussing on open data where it matters: accountability and action

A lot of talk of open data proceeds as if all data is equal, and a government dataset is a government dataset. Some open data advocates fall into the trap of seeing databases as collections of ‘neutral facts’, without recognising the many political and practical judgements that go into the collection and modelling of data. But, increasingly, an awareness is growing that datasets are not apolitical, and that not all datasets are equal when it comes to their role in constituting a more open government.

Back in November 2010 I started exploring whether the government’s ‘Public Sector Information Unlocking Service’ actually worked by asking for open data access to the dataset underlying the Strategic Export Controls: Reports and Statistics Website. Data on where the UK has issued arms export licenses is clearly important data for accountability, and yet the data is kept obfuscated in an inaccessible website. 14 months on, and my various requests for the data have seen absolutely zero response. Not even an acknowledgement.

However, today Campaign Against the Arms Trade have managed to unlock the Export License dataset, after painstakingly extracting inaccessible statistics from the official government site, turning them into an open dataset, and providing an online application to explore the data. They explain:

Until now the data, compiled by the Export Control Organisation (ECO) in the Department for Business, Innovation and Skills (BIS), was difficult to access, use and understand. The new CAAT app, available via CAAT’s website, transforms the accessibility of the data.

The salient features are:

    • Open access – anyone can view data without registering and can make and refine searches in real time.
    • Data has been disaggregated, providing itemised licences with ratings and values.
    • Comprehensive searchability (including of commonly-required groupings, for example by region of the world or type of weaponry).
    • Graphs of values of items licensed are provided alongside listings of licences.
    • Revoked licences are identified with the initial licence approvals.
    • Individual pages/searches (unique urls) can be linked to directly.
    • The full raw data is available as csv files for download.

And as Ian Prichard, CAAT Research Co-ordinator, puts it:

It is hard to think of an area of government activity that demands transparency more than arms export licensing. 

The lack of access to detailed, easy-to-access information has been a barrier to the public, media and parliamentarians being able to question government policies and practices. These practices include routine arming of authoritarian regimes such as Saudi Arabia, Bahrain and Egypt.

As well as providing more information in and of itself, we hope the web app will prompt the government to apply its own open data policies to arms exports, and substantially increase the level and accessibility of information available.

Perhaps projects like CAAT’s can help bring back the ‘hard political edge’ Robinson and Yu describe in the heritage of ‘open government’. They certainly emphasise the need for a ‘right to data’ rather than just access to data existing as a general policy subject to the decisions of those in power.

A commonwealth of skills and capabilities

Cross-posted from a guest blog post on the Commonwealth Internet Governance Forum Website.

[Summary: Creating cultures of online collaboration, and skills for online safety, is tougher than building platforms or creating technical controls, but without a participation-centred approach we will lose out on the benefits of the net]

“We want to encourage more knowledge sharing, learning and collaboration within our network. Let’s create an online platform.”

“These online spaces contain dangerous content. We need to restrict access to them.”

These sorts of thoughts are incredibly common when it comes to engagement with the Net, whether as a space of opportunity, or as a space of risk and danger. I’m sure you will have encountered them. For example, from a committee focussing on the provision of new online tools and services, forums and websites to improve communication within a group. Or perhaps from institutions and governments arguing for more powers or tools to control Internet access, whether filtering Internet access in schools, or domain seizure requests to take websites offline at the DNS level in the interests of protecting students or citizens. However, these lines of reasoning are deeply problematic if we believe in the Internet as a democratic tool, and a space of active citizenship. In this post I’ll try to explain why, and to argue that our energy should go primarily into sharing skills and capabilities rather than solely into building platforms or creating controls.

The protection paradox

In the UK there has recently been a vigorous debate over whether the police should be able to ask the national domain name registry, Nominet, to block certain .uk DNS entries (domain names) if a website is found to contain malware or to be selling counterfeit goods. Much of the debate has been over whether the police should have a court order before making their requests, or whether the DNS can be altered on law-enforcement request without judicial authorisation. Creating new powers to allow authorities to act against cybercrime by adding blocks within the network can certainly seem like an appealing option when confronted with a multitude of websites with malicious intent, but these approaches to protection can create a number of unintended results.

Blocks within the network can create a false sense of security: users feel that someone else is taking care of security for them, and so have even less motivation to act on security for themselves, creating increased risks when malicious sites inevitably slip through the cracks. Strategies of control and filtering in schools and educational institutions also remove the incentives for educators to support young people to develop the digital skills they need to navigate online risks safely. But the potential for control-based protection policies to limit individuals’ ability to protect themselves is just one of the paradoxes. Protection measures placed in the network itself can centralise power over Internet content, creating threats to the open nature of the Internet, and putting in place systems and powers which could be used to limit democratic freedoms.

If we put restriction and control strategies of protection aside, there are still options open to us – and options that better respect democratic traditions. On the one hand, we can ensure that laws and effective judicial processes are in place to address abuses of the openness of the Internet; and, on the other, we can focus on individuals’ skills and capabilities to manage their own online safety. Often these skills are very practical. As young people at last year’s Internet Governance Forum explained in a session on challenging myths about young people and the Internet (LINK), young people do care about privacy: they don’t need to be given scare stories about privacy dangers, but they do want help to use complicated social network privacy settings, and opportunities to talk with friends and colleagues about norms of sharing personal information online.

Of course, the work involved in spreading practical digital skills can look like an order of magnitude greater than the work involved in implementing network-level controls. But that doesn’t mean it’s not the right approach. It might be argued that, for some countries, spreading the digital literacy needed for people to participate in their own protection from cybercrime is simply too complicated right now – and it can wait until later, whilst rolling out Internet access can’t wait. In ‘Development as Freedom’ Amartya Sen counters a similar argument about democratic freedoms and economic development, where some theorists suggest democratic rights are a luxury that can only be afforded once economic development is well progressed. Sen responds that democratic rights and freedoms are a constitutive part of development, not some add-on. In the same vein we might argue that being able to be an autonomous user of an Internet endpoint, with as much control as possible over any controls that might be placed on your Internet access, is constitutive both of having effective Internet access, and of being able to use the Internet as a tool to promote freedom and development. The potential challenges of prioritising a skills-based and user-empowerment approach to cyber-security should not be something we shy away from.

The problem with a platform focus

When we look at a successful example of online collaboration the most obvious visible element of it is often the platform being used: whether it’s a Facebook group, or a custom-built intranet. Projects to support online learning, knowledge sharing or dialogue can quickly get bogged down in developing feature-lists for the platform they think they need – articulating grand architectural visions of a platform which will bring disparate conversations together, and which will resolve information-sharing bottlenecks in an organisation or network. But when you look closer at any successful online collaboration, you will see that it’s not the platform, but the people, that make it work.

People need opportunities, capabilities and supportive institutional cultures to make the most of the Internet for collaboration. The capabilities needed range from technical skills (and, on corporate networks, the permission) to install and use programs like Skype, to Internet literacies for creating hyper-links and sharing documents, and the social and media literacy to participate in horizontal conversations across different media. But even skills and capabilities of the participants are not enough to make online collaboration work: there also needs to be a culture of sharing, recognising that the Internet changes the very logic of organisational structures, and means individuals need to be trusted and empowered to collaborate and communicate across organisational and national boundaries in pursuit of common goals.

Online collaboration also needs facilitation: from animateurs who can build community and keep conversations flowing, to technology stewards who can help individuals and groups to find the right ad-hoc tools for the sorts of sharing they are engaged in at that particular time. Online facilitators also need to work to ensure dialogues are inclusive – and to build bridges between online and offline dialogue. In my experience facilitating an online community of youth workers in the UK, or supporting social reporting at the Internet Governance Forum, the biggest barriers to online collaboration have been people’s lack of confidence in expressing themselves online, or easily addressed technical skill shortages for uploading and embedding video, or following a conversation on Twitter.

Building the capacity of people and institutions, and changing cultures, so that online collaboration can work is far trickier than building a platform. But, it’s the only way to support truly inclusive dialogue and knowledge-sharing. Plus, when we focus on skills and capabilities, we don’t limit the sorts of purposes they can be put to. A platform has a specific focus and a limited scope: sharing skills lays the foundation for people to participate in a far wider range of online opportunities in the future.

Culture, capability and capacity building in the Commonwealth

So what has this all got to do with the Commonwealth? And with Internet governance? Hopefully the connections are clear. Sharing knowledge across boundaries is at the heart of a Commonwealth vision, and cybercrime is one area where the Commonwealth has agreed to focus collaboration (LINK). Projects like the Commonwealth Youth Exchange Council’s Digital Guyana project, and numerous other technical skills exchanges provide strong examples of how the Commonwealth can build digital skills and capabilities – but as yet, we’ve only scratched the surface of social media, online collaboration and digital skill-sharing in the Commonwealth. It would be great to think that we can switch from the statements this post opened with, to finding that statements like those below are more familiar:

“We want to encourage more knowledge sharing, learning and collaboration within our network. Let’s invest in sharing the skills to engage, building the culture of openness, and involving the technology stewardship and facilitation we need to do it.”

“These online spaces contain dangerous content. Let’s use shared knowledge across the Commonwealth to build the capacity of communities and individuals to actively participate in their own protection, and in having a safer experience of the Internet.”

Going beyond simply sharing legal codes and practices, or building platforms, to sharing skills and co-creating programmes to build individual and community capability is key for us to meet the collaboration and IG challenges of the future.



Footnote: 3Ps, with participation as the foundation

In thinking about how to respond to the range of Internet Governance issues, I’m increasingly turning to a model drawn from the UN Convention on the Rights of the Child (UNCRC), which turns out to have far wider applicability than just to youth issues. There is a customary division of the UNCRC rights into three categories: protection rights, provision rights, and participation rights. Rather than being in tension, these can be seen as mutually reinforcing, and represented with a triangle of rights. Remove any side of the triangle, and the whole structure collapses.

How can this be used? Think of this triangle as a guide to what any effective policy and practice response to use of the Internet needs to involve. When your concern is protection (e.g. in addressing cybercrime), the solutions don’t only involve ‘protective’ measures, but need components involving the provision of education, support or remedial action in cases of harm, and components that promote the participation of individuals, both to develop skills to navigate online risks, and to be active stakeholders in their own protection. When your concern is promoting online participation and collaboration then, as well as developing participative cultures and skills, you need to look to the provision of spaces and tools for dialogue, and to making sure those spaces do not create unnecessary risks for participants. A balanced response to the Net can identify how it addresses each of protection, provision and participation.

However, we can go one step further by positing Participation as the foundation of this triangle (in the UNCRC Participation rights are arguably a key foundation for the others). Any policy or intervention which undermines people’s capacity to freely participate online undermines the validity of this intervention as a whole.

You can find more on the application of this model to young people’s online lives in this paper, or share your reflections on the model on this blog post.