Defining raw data

[Summary: explaining what raw data is and why it matters]

On the Friday of last week’s Open Government Data Camp, in a discussion on how to empower non-technical citizens, civil servants and community activists to make use of open government data, we hit upon the idea of an ‘Open Data Cook Book’ of simple recipes for working with data. The recipe analogy also emerged (via @exmosis) in a Twitter discussion on Monday about ‘machine-readable data’ – and a bit of cook-book drafting later, here’s my attempt at describing good open data, whilst avoiding as far as possible any technical terms or getting caught up in the ambiguity of machine-readability.

Sourcing your ingredients for a raw data project:

For all of the recipes in the forthcoming open data cook book you will need access to some raw data to work with. You might already have the data you want to hand, or you might have ideas for a great project but no idea where to get the data you need. In the cook book we will outline a range of places you can source your data, and how to prepare it so it is ready to be part of your data-creations.

Identifying raw data

You can find data all over the place when you start looking, but all too often the data you want has been pre-prepared, locked down in written reports, or is only available through complicated website interfaces that let you glimpse just a small bit of the data at any one time.

Raw data is easier to manipulate with a computer. When you have raw data you can sort it, edit it and remix it in new ways with the tools you want to use.
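As a rough taste of what that looks like in practice, here is a minimal sketch in Python. The library names and visit figures are invented purely for illustration, and the function names are my own; the point is only that once data is in a raw format like CSV, re-sorting it takes a couple of lines.

```python
import csv
import io

# Hypothetical raw data: a tiny CSV of library visits (invented for illustration).
RAW_CSV = """library,year,visits
Central,2009,120500
Eastside,2009,43200
Northgate,2009,87650
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def sort_by_visits(rows):
    """Return the rows ordered from most visits to fewest."""
    return sorted(rows, key=lambda row: int(row["visits"]), reverse=True)

rows = load_rows(RAW_CSV)
busiest_first = sort_by_visits(rows)
for row in busiest_first:
    print(row["library"], row["visits"])
```

The same rows could just as easily be filtered, totalled or combined with another dataset, which is exactly what is hard to do when the figures are locked inside a report.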

Locked, raw, linked

We can think of data on a continuum.

At one end is locked-up data. This is the sort of data you find in reports, charts and maps. Someone has interpreted what the data means and has pinned it down in a particular context. To use this data in new ways you will probably have to spend time converting it into a raw format through scraping, crowd-sourcing, or lots of manual work.

In the middle is raw data. This is data that is available in a structured form, so you can load it into the software or online tools of your choice and then explore, manipulate and remix it. Raw data is ready for use in open data recipes.

However, to make use of any raw dataset you will need to know what it contains. Often raw data can contain cryptic headings, titles and codes for columns, rows or other elements of the dataset, so you will need to make sure you have access to meta-data which tells you what all the things in your raw dataset are, and how the data was generated (sort of like the ingredients list, and list of additives and preservatives on the back of any food packet).
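To make the ingredients-list idea concrete, here is a small sketch, again in Python. The column codes, area codes and figures are all made up for illustration, and the "codebook" is a stand-in for whatever meta-data document a real publisher provides. It shows how meta-data lets you swap cryptic headings for readable ones.

```python
# Hypothetical raw rows with cryptic column codes (invented for illustration).
raw_rows = [
    {"AREA_CD": "A001", "POP_T": "7400", "HH_N": "4400"},
    {"AREA_CD": "A002", "POP_T": "185900", "HH_N": "69700"},
]

# Hypothetical meta-data: what each column code means.
codebook = {
    "AREA_CD": "Area code",
    "POP_T": "Total population",
    "HH_N": "Number of households",
}

def label_rows(rows, codebook):
    """Replace each column code with its readable label.

    Codes that are missing from the codebook are kept unchanged,
    so nothing is silently dropped.
    """
    return [
        {codebook.get(code, code): value for code, value in row.items()}
        for row in rows
    ]

labelled = label_rows(raw_rows, codebook)
print(labelled[0])
```

Without the codebook there would be no reliable way to tell that "HH_N" counts households, which is why the meta-data needs to travel with the dataset.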

Linked data and RDF provide a way for the meta-data to be transferred along with the raw data, and for connections to be made between different datasets that make it possible to discover even more context about something in your data. Linked data can make it easier to integrate different datasets when they use the same ways of representing different parts of the data. The tools for working with linked data aren’t quite as widespread yet as the tools for working with standard raw data formats, so often linked data is transformed into a common raw data format like CSV (spreadsheets/tabular data), or JSON and XML (flexible structures for different sorts of data).
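Here is a simplified sketch of that last step, flattening linked-style data into a CSV table. It is only an illustration under some loud assumptions: real linked data uses full web addresses as identifiers and is usually handled with a dedicated RDF library, whereas here the "triples" are plain Python tuples with invented school names and pupil numbers.

```python
import csv
import io

# Hypothetical triples in the form (subject, property, value).
# Real linked data would use full web addresses rather than short names.
triples = [
    ("school/1", "name", "Hilltop Primary"),
    ("school/1", "pupils", "210"),
    ("school/2", "name", "Riverside High"),
    ("school/2", "pupils", "950"),
]

def triples_to_rows(triples):
    """Group triples by subject so that each subject becomes one table row."""
    rows = {}
    for subject, prop, value in triples:
        rows.setdefault(subject, {"id": subject})[prop] = value
    return list(rows.values())

def rows_to_csv(rows, columns):
    """Write the rows out as CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

table = triples_to_rows(triples)
csv_text = rows_to_csv(table, ["id", "name", "pupils"])
print(csv_text)
```

Notice what is lost on the way: the table keeps the values, but the links that said what "pupils" means, or which other datasets mention the same school, no longer travel with it.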

I’ve still some more work to do tidying up these definitions – and I hope in the cook book we can make use of a few more visual metaphors to show the difference between locked-up, raw and linked information. Thinking through the relationship between raw and linked data as defined above, in conjunction with the DIKW model, also seems to hint at a useful point I’ve not yet found a good way of articulating: in most mash-up creation and data use, human understanding of both data and context (meta-data) as separate elements is important. So whilst linked data helps context travel with data, when it comes to working with data, most users need to decompose it back into raw data, with data and context held separately.

2 thoughts on “Defining raw data”

  1. Nice post!

    I would just encourage referring to RDF formatted data as one example of Linked Data since RDF and Linked Data (the concept) aren’t inextricably bound 🙂

    There are other formats that enable us to fashion hypermedia-based, self-describing structured data.

    I advocate against exclusivity re. RDF and Linked Data since RDF formats can, and should, succeed on their own merits re. options for hypermedia based structured data.

