Sourcing raw data… (drafting the open data cook book)

I’m at the Local by Social South West ‘Apps for Communities’ event in Bristol today, doing some prototyping work on the Open Data Cook Book. Listening to people working through how to find data – and trying to search for data myself, I thought I would try and map out all the different places I’ve been looking to track down different open datasets. So – with a sprinkling of recipe book metaphors – here’s a draft for comment of key places to track down open data (focussed on UK government data)…

Sourcing raw data

Finding the right ingredients for your data creation is often the hardest part. You will often have to mix-and-match from the approaches below to get all the data and information you need.

1) Search the supermarkets – the data catalogues & data stores

There are a growing number of data catalogues that bring together listings of published open data (and there are also now data marketplaces that can help you find commercially licensed data as well – so be sure to check the details of the data you find).

Data catalogues often have a particular focus – and no one catalogue can tell you about all the data out there.

CKAN.net is a catalogue of data from many different sources. If you are not quite sure where the dataset you want might be found, it is worth checking here to see if someone has already created a ‘packaged’ version of it.

Data.gov.uk is the UK Government’s data catalogue, which aims to include listings of all open datasets in the public sector. It’s early days yet, but it boasts over 4,600 dataset listings, many of which link direct to spreadsheets and data downloads.

Guardian World Data Store makes it easy to search across a range of different government open data catalogues – browsing data by country and format.

Your local authority might have a data store, or at least a data page on their website. London has http://data.london.gov.uk and you can find a list of other local open data web pages through the ‘All Councils’ listing at OpenlyLocal.com.

Publicdata.eu is a new catalogue bringing together data from right across Europe.

2) Specialist independents – data stores

Where the supermarkets are stacking the datasets high, and sharing them free – there might be a specialist in your area of interest – working hard to source and bring together the finest data they can. Fortunately, most of them provide the data for free too.

OpenlyLocal.com is focussed on making local council information accessible. You can find details of local council spending for many authorities alongside details of council meetings and councillors that have been scrumped and scraped from the respective websites for you. Most of the raw data is available through an API, so you might need to explore a few new skills to get at it.

Timetric.com are specialists when it comes to time series data. If you can plot it on a graph over time, chances are they’ve taken the dataset, tidied it up, and provided ways to search and browse for it – with CSV spreadsheet downloads of the raw data.

Do you have a specialist independent you go to for data? Tell us about them in the comments.

3) Foraging – searching for the data

If the data you want isn’t available pre-packaged and catalogued, you might need to head out foraging across the Internet. There is a lot of open data in the wild – you just need to know how to spot it.

GetTheData.org makes a great first port of call to see if other data-foragers have already found a good spot to get the data you are after. It’s a community website full of requests for data, and conversations about good places to find it. Plus, if your own foraging doesn’t turn up anything, you can come back and pose your question to the community here later.

Search – try searching the web for the topic you are interested in. Perhaps add ‘data’ as an extra key word. When you read news articles or web pages that appear to be based on data, take note of the names of the data sources they mention and plug those back into a search. Oftentimes that will lead you to some data you might be able to use.

Think-tank websites, academic researcher web pages and even newspaper sites can all host lots of datasets. Just make sure you find out all you can about the provenance of the information before you use it!

Deep searching – you can use a standard Google Search to look for data published in common office formats hosted on a particular web domain: your local council or university for example. All you need are two handy operators:

  • The ‘site:’ operator on Google restricts searches to only show results from a particular domain;
  • The ‘filetype:’ operator only returns files of a particular type.

Using those together you can construct searches like ‘filetype:xls site:oxford.gov.uk’ to find all the Excel spreadsheets that Google has indexed on the Oxford City Council website.
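If you find yourself running this kind of search across several domains or file types, the query can be assembled with a few lines of Python. This is only a rough sketch: the function name build_data_search is made up for illustration, and the https://www.google.com/search?q= address is an assumption about how the search page accepts a query, not something covered by the recipe above.

```python
from urllib.parse import urlencode


def build_data_search(domain, filetype, keywords=""):
    """Combine the 'site:' and 'filetype:' operators into one search.

    Returns the plain query text and a search URL for it.
    The URL pattern is an assumption, not an official API.
    """
    parts = []
    if keywords:
        parts.append(keywords)
    parts.append(f"filetype:{filetype}")  # only files of this type
    parts.append(f"site:{domain}")        # only results from this domain
    query = " ".join(parts)
    url = "https://www.google.com/search?" + urlencode({"q": query})
    return query, url


# The Oxford City Council example from the text above.
query, url = build_data_search("oxford.gov.uk", "xls")
print(query)
print(url)
```

Passing a third argument, for example build_data_search("oxford.gov.uk", "csv", "libraries"), adds ordinary key words in front of the two operators.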

4) Scrumping – screen-scrape the data

It’s not uncommon to find the data you need… only it’s just out of reach. Perhaps it’s in a table on a web page when you want it in the sort of table you can load into a spreadsheet to sort and chart. Or it might be spread across lots of different web pages and files. That’s where screen-scraping comes in – creating small computer scripts that turn structured information on a website into raw data.

There are recipes that explain the details of screen-scraping coming in the cook book, and you can go screen-scrape scrumping with a variety of different tools.

Google Spreadsheets – using a special formula you can grab tables and lists from other websites direct into your spreadsheet (recipe).

Scraper Wiki – helps you get started creating advanced scrapers, which they will run every day to grab information from websites and turn it into accessible raw data (recipe).
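To give a flavour of what a screen-scraper actually does, here is a minimal sketch in Python that turns an HTML table into rows of raw data and then into CSV. It is not how Google Spreadsheets or Scraper Wiki work internally. The class and function names, and the sample page with its made-up library names and postcodes, are all assumptions for illustration; a real scraper would fetch the page from a website first and would need to cope with messier markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page standing in for a table found on a website.
SAMPLE_PAGE = """
<table>
  <tr><th>Library</th><th>Postcode</th></tr>
  <tr><td>Example Library A</td><td>XX1 1AA</td></tr>
  <tr><td>Example Library B</td><td>XX2 2BB</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table rows in the page as a list of lists of strings."""
    scraper = TableScraper()
    scraper.feed(html)
    return scraper.rows


def rows_to_csv(rows):
    """Turn scraped rows into CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = scrape_table(SAMPLE_PAGE)
print(rows_to_csv(rows))
```

The scraper only uses Python’s standard library, which keeps the sketch self-contained; dedicated scraping tools handle far more awkward pages than this.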

5) Special order – FOI

Perhaps you have found that no-one stocks the data you need – not even in places you can forage or scrump for it. If the data comes from a public body, then it might be time to explore putting in a special request for it using the Freedom of Information Act.

WhatDoTheyKnow.com is a service that makes it easy to submit a Freedom of Information Act request to a local authority, government department or other public body. You have a right to ask authorities for a copy of the information and data they hold, and you can ask for it to be returned as raw data. Search WhatDoTheyKnow to see if anyone has requested the data you want already, and if not, put in your request. (Often if data is available on WhatDoTheyKnow it will be locked up in PDFs. You might need to crowd-source the process of turning it into structured raw data, although there are a few tools and approaches that might help turn PDFs into data programmatically.)

The Public Sector Information Unlocking Service available at http://unlockingservice.data.gov.uk/ provides a route for requesting that data is opened up by the Data.gov.uk team. It’s not backed by the legal framework of FOI, but may play a role in data requests under the currently debated ‘Right to Data’ legislation.

IsItOpenData.org provides a useful tool for asking non-public bodies to share their data as open data, or to clarify the licensing.

6) Home grown – research and crowdsourcing

Some data simply doesn’t exist yet – but you can create a raw dataset through research, and through crowd-sourcing, inviting others to help you research.

Simple spreadsheets – if you are systematically working through a research task, keep your results in a spreadsheet. See the section on raw data for ideas about how to structure it well.

Google Forms – available through http://docs.google.com – allows you to create an online form that anyone can fill in, with all the responses going direct into a spreadsheet for you to use. You might be able to get supporters to research for you and collaboratively build up a useful dataset.


Always check the label

Is the data you have found licensed for re-use? Whilst you might get away with cooking up some foraged raw data for your own consumption without checking out the details – when you re-publish data and share it with others you need to be sure you have permission to do so.

Remember as well to keep a list of the ingredients you use, and where you got them from, so you can publish a full list of sources along with your creation.
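One lightweight way to keep that list of ingredients is a small sources file that records the licence alongside each dataset. The sketch below is only a suggestion: the column names, the helper functions and the example entries (including the example.org addresses, licence text and dates) are all made up for illustration.

```python
import csv
import io

# Hypothetical columns for a sources log; adapt them to your own project.
FIELDS = ["dataset", "source_url", "licence", "date_accessed"]


def write_source_log(sources, out):
    """Write the list of sources to an open file as CSV, with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(sources)


def missing_licences(sources):
    """Name the datasets whose licence has not been checked yet."""
    return [s["dataset"] for s in sources if not s.get("licence")]


# Made-up entries purely to show the shape of the log.
sources = [
    {
        "dataset": "Example spending data",
        "source_url": "https://example.org/spending.csv",
        "licence": "Example open licence",
        "date_accessed": "2011-01-01",
    },
    {
        "dataset": "Example library list",
        "source_url": "https://example.org/libraries",
        "licence": "",  # not checked yet
        "date_accessed": "2011-01-02",
    },
]

log = io.StringIO()  # stands in for a real file such as open("sources.csv", "w")
write_source_log(sources, log)
print(log.getvalue())
print("Check the label for:", missing_licences(sources))
```

Running the licence check before you re-publish anything gives a quick reminder of which labels still need reading.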

Worked example: A simple search, with many steps

Sadly we’re not yet at the stage where you can easily get all the data you need delivered to your door – so most projects will involve some searching around.

For example: I was recently looking for data on library locations in Bristol. I started at the data supermarkets, searching data.gov.uk for ‘libraries’. I found a few datasets listed, but the links were broken, so I ended up at a dead end. Next I turned to the Guardian datastore, but that wasn’t very helpful either – so I looked at GetTheData.org to see if anyone else had been looking for library data. Fortunately they had, and their conversations pointed me towards a few possible data sources. Again though, I ended up almost at a dead end – I could find a list of planned library closures, but not a dataset of all the libraries. However, I did find a link to the Bristol Council website, and on browsing the site I came across a listing of libraries in a web-page – so I turned to a little scrumping – using Google Spreadsheets to import the web-page table into a spreadsheet table that I could manipulate and work with. Working through the list of data sources above I was searching for about 15 minutes – following my nose to finally get to the raw ingredients I needed for some data creations.

Pareto Problems for Digital Innovation?

Photo Credit: http://www.flickr.com/photos/pigpencole/1264620687/
Going for the High Hanging Fruit?

[Summary: Local by Social author Andy Gibson is working on a new paper for NESTA on how digital innovation can save public services, and has asked for reflections on ‘obstacles and their solutions’ to adoption of more social technology. I’ve written on practical barriers to digital technology in government before, but here I’m exploring an economic argument that sets out a potential challenge to many digital-social innovation projects*.]

The Pareto Problem
The Pareto Principle (named after the famous Italian Economist, but often known just as the 80-20 rule) suggests that in many real-world situations 80% of the features required in a project can be gained with just 20% of the effort**.

In software development and much of the business world, focussing on the 80% of features you can build easily makes sense. For each bit of effort put in at the start there is a large marginal return and benefit; but as you get to the trickier bits of a project, the marginal benefit (the number of people who will use a feature; how much benefit each new feature will bring etc.) relative to effort put in falls. The last 20% of features might cost four times as much as the first 80%, and in many cases, implementing them simply isn’t cost effective. So, the rational developer or manager never provides them.
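The arithmetic behind that ‘four times as much’ can be made explicit with a short sketch. The function name and its parameters are my own, and it rests on the simplifying assumption that the split is exact: the easy share of features takes the stated share of effort, and everything else takes the rest.

```python
def pareto_costs(easy_features_pct, easy_effort_pct):
    """Compare the cost of the tricky features with the easy ones.

    easy_features_pct: percentage of features that are easy (e.g. 80)
    easy_effort_pct:   percentage of total effort they take (e.g. 20)

    Returns two ratios:
      - total cost of the tricky features relative to the easy ones
      - cost per tricky feature relative to cost per easy feature
    """
    hard_features_pct = 100 - easy_features_pct
    hard_effort_pct = 100 - easy_effort_pct
    total_ratio = hard_effort_pct / easy_effort_pct
    per_feature_ratio = (hard_effort_pct / hard_features_pct) / (
        easy_effort_pct / easy_features_pct
    )
    return total_ratio, per_feature_ratio


total_ratio, per_feature_ratio = pareto_costs(80, 20)
print(total_ratio)        # the tricky 20% costs 4 times the easy 80% in total
print(per_feature_ratio)  # and 16 times as much per feature
```

On these assumptions, each feature in the tricky 20% costs sixteen times as much as each feature in the easy 80%, which is why leaving them out looks so attractive on a spreadsheet.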

Public Services don’t work like that. The tricky 20% of a service to provide is often the service to the most in need. Into that tricky 20% might fall providing services in remote rural areas; educating children from more challenging backgrounds; providing transport services for the elderly; making sure education classes are accessible to those with additional needs and so on. When social innovators hold up technology driven innovations – new ways of providing public services – we have to ask: are they just solving the easy 80% and ignoring the tough cases?

Is the promise of more efficient and cheaper digital services simply the result of a sleight of hand – measuring the costs of a service based on its provision in the easy cases and bracketing out the tough cases which would require re-engineering systems and adding significant cost and effort if a digital service were to be a universal service?

Possible Solutions
The Pareto Problem isn’t an argument against digital innovation per se. Innovation can shift where the Pareto Problem kicks in (e.g. can we serve 90% of the people on 10% of the cost and make savings that way?) and innovation can help the public sector to challenge the frequent over-design of processes and systems around the tough cases. However, the Pareto Problem is significant. A few possible ways to address it in thinking about digital innovation are set out below.

  • Account for a universal service – any digital innovation needs to show its costs and benefits not just in the easy pilot cases – but also if it were to provide a universal service. Or, if it can’t provide a universal service, it needs to explain its limitations, and allow the public sector to properly cost provision to those the innovation will not work for.
  • Take the tough cases into account – Conventional design of services in the public sector often starts with tough cases. Staff have in mind the cases they faced recently where a service user had complex needs – and they design from the tricky cases first – building all sorts of processes and systems to cope with the complexities. Agile developers often start with the easy cases – and far too often the tough cases get ignored. For example, how does your service work for young people who need additional privacy because of a custody battle currently taking place? Or how does your service work for people with learning difficulties and other additional needs? Finding the balance between over-engineering processes and having processes that work for those with the greatest needs is the key challenge for social innovators.
  • Design with social justice in mind – digital innovation in the public sector shouldn’t just be about creating ‘better stuff’ and ‘better services’ for individuals to consume: it should be about creating a ‘better society’ – and that involves thinking about the distribution of benefits from innovation as well as the nature of the innovation itself.
  • Collaborate and listen – the most important way to make sure social innovations don’t fall into a Pareto Problem trap is to design with the people working at the frontline.

A metaphorical summary
I started writing this post a while back under the title ‘What happens when we’ve picked all the low hanging fruit?’. Many digital innovations come showing a basket full of the low hanging fruit and explaining how easy it was to pick. The key is asking – how are you planning to get the stuff from the top of the tree as well?



* I’m posting this very tentatively, not sure that I’ve quite managed to express the idea I’ve been reflecting on – but aware that Andy’s paper is currently in progress and that working on the last 20% of tweaks to get this blog post spot on is, um, well, going to take at least four times as long as what’s been written so far… (#paretopost)

** Pareto’s original observations concerned the distribution of wealth in Italy, but the principle has been applied much more widely since. The actual numbers don’t matter here. The 80-20 ratio is simply used because Pareto observed it as a ratio that applied in many real-world situations. Take any ratio in the region of 70-30 towards 99-1 and you will see the argument above still broadly holds.