NT Open Data Days: Exploring data flow in a VCO

[Summary: A practical post of notes from a charity open data day. Part reflective learning; part brain-dump; part notes for ECDP]

Chelmsford was my destination this morning for a Nominet Trust funded ‘Open Data Day’ with the Essex Coalition of Disabled People (ECDP). The Open Data Days are part of an action research exploration of how charities might engage with the growing world of open data, both as data users and publishers. You can find a bit more of the context in this post on my last Open Data Day with the Nominet Trust team.

This (rather long and detailed) post provides a run-down of what we explored on the day, as a record for the ECDP team and as a resource for wider shared learning.

Seeking structures and managing data

For most small organisations, data management often means Excel spreadsheets, and ECDP is no exception. In fact, ECDP has a lot of spreadsheets on the go. Different teams across the organisation maintain lists of volunteers, records about service users, performance data, employment statistics, and a whole lot more, in individual Excel workbooks. Bringing that data together to publish the ‘Performance Dashboards’ that ECDP built for internal management, but that have also been shared in the open data area of the ECDP website, is a largely manual task. Across these spreadsheets it’s not uncommon to see the information on a particular topic (e.g. volunteers) spread across different tabs, or duplicated into different spreadsheets where staff have manually copied filtered extracts for particular reports. The trouble is that this fragments the organisation’s information, and makes pulling together both internal and open data and analysis tricky. Many of the spreadsheets we found during the open day mix the ‘data layer’ with the ‘presentation’ and ‘analysis’ layers, rather than separating these out.

What can be done?

Before getting started with open data, we realised that we needed to look at the flow of data inside the organisation. So, we looked at what makes a good data layer in a spreadsheet, such as:

  • Keeping all the data of one type in a single worksheet. For example, if you have data on volunteers, all of it should be in a single sheet. Don’t start new sheets for ‘Former volunteers’ or ‘Volunteers interested in sports’, as this fragments the data. If you need to know about a volunteer’s interests, or whether they are active or not, add a column to your main sheet and use filters (see below).
  • Having one header row of columns. You can use merged cells, sub-headings and other formatting when you present data – but using these in the master spreadsheet where you collect and store your data makes it trickier for the computer to understand what your data is, and harder to support different analyses of the data in future.
  • Including validation… Excel allows you to define a list of possible values for a cell, and provides users entering data with a drop-down box to select from instead of them typing values in by hand. This really helps increase the consistency of data. You can also validate to be sure the entry in a cell is a number, or a date, and so on. In working on some ECDP prototypes we ran up against a problem where our list of possible valid entries for a cell was too long, and we didn’t want to keep the master list of valid values in the Excel sheet our data was on, but Wizard of Excel has documented a workaround for that. (A minimal code sketch of this kind of consistency check follows this list.)
  • …but keeping some flexibility. Really strict validation has its own problems, as it can force people to twist what they wanted to record to fit a structure that doesn’t make sense, or that distorts the data. For example, in some spreadsheets we found the ‘Staff member responsible’ column often contained more than one name. We had to explore why that was, and whether the data structure needed to accommodate more than one staff member linked to a particular row in the spreadsheet. Keeping a spreadsheet structure flexible can be a matter of providing free-text areas where users are not constrained in the detail they provide, and of having a flexible process to revise and update structures according to demand.
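
For readers who prefer code to spreadsheet menus, here is a minimal sketch of the same validation idea, assuming a hypothetical volunteers.csv with a ‘Status’ column; the file name, column name and allowed values are all invented for illustration, not taken from the ECDP spreadsheets.

```python
# Minimal consistency check: flag any row whose 'Status' value is not one
# of an agreed list. File, column and values are illustrative assumptions.
import csv

ALLOWED_STATUS = {"Active", "Former", "On hold"}  # assumed categories

with open("volunteers.csv", newline="", encoding="utf-8") as f:
    # start=2 because row 1 of the file is the header row
    for row_number, row in enumerate(csv.DictReader(f), start=2):
        status = (row.get("Status") or "").strip()
        if status not in ALLOWED_STATUS:
            print(f"Row {row_number}: unexpected Status value {status!r}")
```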

Once you have a well structured spreadsheet (see the open data cookbook section on preparing your data if you still need to get a sense of what well structured data might look like), then you can do a lot more with it. For example:

  • Creating a pivot chart. Pivot charts are a great way to analyse data, and are well worth spending time to explore. Many of the reporting requirements an organisation has can be met using a pivot chart. For ECDP we created an example well-structured dataset of ‘Lived Experience Feedback’ – views and insights provided by service users, recorded with detailed descriptions, dates when the feedback was given, and categories highlighting the topical focus of the views expressed. We turned all this data into an Excel list, which allowed us to add a formula that would apply to every row and that used the =MONTH() function to extract the month from the date in each row. Creating a pivot chart from this list, we could then drill down to find figures such as the number of Lived Experience reports provided to the Insight team and relating to ‘Employment’ in any given month. (A rough code equivalent of this drill-down is sketched after this list.)
  • Creating filtered lists and dashboards. It can seem counterintuitive to an organisation which mostly wants to see data in separate lists by area, or by organisational team, to put all the data for those areas and teams into one spreadsheet, with just a column to flag which team or area a row relates to. That’s why spreadsheets often end up with different tabs for different teams, with the same sort of data spread across them. Using formulae to create thematic lists and dashboards can be a good way to keep teams happy, whilst getting them to contribute to a single master list of data. (We spent quite a lot of time on the open data day thinking about the importance of motivating staff to provide good quality data, and the need to make the consequences of providing good data visible.) Whilst the ‘Autofilter’ feature in Excel can be used to quickly sub-set a dataset to get just the information you are interested in, when we’re building a spreadsheet to be stored on a shared drive and used by multiple teams, we want to avoid the confusion that arises when the main data sheet ends up with filters applied. So instead we used simple cross-sheet formulae (e.g. if your main data sheet is called ‘Data’, put =’Data’!A1 in the top-left cell of a new sheet, and then drag it out) to build copies of the master sheet, and then applied filters to these. We included a big note on each of these extra sheets to remind people that any edits should be made to the master data, not these lists.
  • Linking across spreadsheets. Excel formulae can point not just to values in other sheets, but also to values in other files. This makes it possible to build a dashboard that automatically updates by running queries against other sheets on a shared drive. Things get even more powerful when you are able to publish datasets to the web as open data, when tools like Google Docs have the ability to pull in values and data from across the web; but even with non-open data in an organisation, there should be no need to copy and paste values that could be transferred dynamically and automatically.
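
The pivot chart drill-down described above can be approximated in code. The sketch below assumes a hypothetical lived_experience.csv export with ‘Date received’ and ‘Topic’ columns and UK-style dates; all of these names and formats are assumptions for illustration, not the actual ECDP file.

```python
# Rough code equivalent of the =MONTH() helper column plus pivot chart:
# count Lived Experience reports per (month, topic).
import csv
from collections import Counter
from datetime import datetime

counts = Counter()
with open("lived_experience.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # assumes dates like 14/03/2012; adjust the format string as needed
        received = datetime.strptime(row["Date received"], "%d/%m/%Y")
        counts[(received.strftime("%Y-%m"), row["Topic"])] += 1

# e.g. how many 'Employment' reports were logged in March 2012?
print(counts[("2012-03", "Employment")])
```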

Of course, when you’ve got lots of legacy spreadsheets around, making the shift to more structured data – separating the data, analysis and presentation layers – can be tricky. Fortunately, some of the common tools in the open data wrangler’s toolbox come in handy here.

To move from a spreadsheet with similar data spread across lots of different tabs (one for each team that produced that sort of data) to one with consistent and standardised data, we copied all the data into a single sheet with one header row, and a new column indicating the ‘team’ each row was from (we did this by saving each of the sheets as .csv files and using the ‘cat’ command on Mac OS X to combine them, but the same effect can be achieved with copy and paste).
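
If you’d rather script that combining step than use ‘cat’ or copy and paste, a sketch like the one below does the same job: read one CSV per team, add a ‘Team’ column, and write a single combined file. The file and column names here are placeholders, not the actual ECDP files, and the sketch assumes each team’s file shares the same header row.

```python
# Combine per-team CSV exports into one master file with a 'Team' column.
# File names and the 'Team' column are illustrative assumptions.
import csv

team_files = {"Insight": "insight.csv", "Employment": "employment.csv"}

with open("combined.csv", "w", newline="", encoding="utf-8") as out:
    writer = None
    for team, path in team_files.items():
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["Team"] = team
                if writer is None:
                    # take the header from the first file (assumed identical across files)
                    writer = csv.DictWriter(out, fieldnames=list(row.keys()))
                    writer.writeheader()
                writer.writerow(row)
```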

We then turned to the open data wrangler’s power tool, Google Refine (available as a free download), to clean up the data. We used ‘text facets’ to see where people had entered slightly different names for the same area or theme, made bulk edits to these, and used some replacement patterns to tidy up date values.
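
Google Refine’s text facets did the heavy lifting for us, but the same kind of tidy-up can be expressed in code. The toy sketch below, with invented variant spellings rather than the real ECDP values, simply maps inconsistent names onto one agreed form.

```python
# Toy illustration of the clean-up that Refine's text facets and bulk edits
# handled: map variant spellings onto a single canonical name.
CANONICAL = {
    "chelmsford": "Chelmsford",
    "chelmsford borough": "Chelmsford",
    "chelmsford bc": "Chelmsford",
}

def tidy_area(value: str) -> str:
    """Return the agreed spelling of an area name, or the trimmed original."""
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())
```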

We then took this data back into Excel to build a master spreadsheet, with one single ‘Data’ sheet, and separate sheets for pivot chart reports and filtered lists.

The whole process took an hour or so, but once complete we had a dataset that could be analysed in many more ways than before, and we had the foundations both for building better internal data flows and for extracting open data to share.

Heading towards a CRM

As much as Excel, with the appropriate planning, discipline and stewardship, can be used to manage a lot of the data an organisation needs, we also explored the potential to use a fully-featured ‘Contact Relationship Management’ (CRM) system to record information right across the organisation.

Even when teams and projects in an organisation are using well-structured spreadsheets, there are likely to be overlaps and links between their datasets that are hard to make unless they are all brought into one place. For example, two teams might be talking to the same person, but if one knows the person as Mr Rich Watts and the other records R. Watts, bringing this information together is tricky. A CRM is a central database (often now accessed over the web) which keeps all this information in one place.

Modern CRM systems can be set up to track all sorts of interactions with volunteers, customers or service users, both to support day-to-day operations and to generate management information. We looked at the range of CRM tools available, from the open source CiviCRM, which has case-tracking modules that may be useful to an organisation like ECDP, through to tools like Salesforce, which offer discounts to non-profits. Most CRM solutions have free online trials. LASA’s ICT Knowledge Base is a great place to look for more support on exploring options for CRM systems.

In our open data day we discussed the importance of thinking about the ‘user journey’ that any database needs to support, and of ensuring that databases enable, rather than constrain, staff. Any process of implementing a new database is likely to involve some changes in staff working practices too, so it’s important to look at the training and culture-change components as well as the technical elements. This is true of both internal data projects and open data projects.

When choosing CRM tools it’s also important to think about how a system might make it possible to publish selected information as open data directly in future, and how it might be able to pull in open data.

Privacy Matters

Open data should not involve the release of people’s personal data. To make open data work, a clear line needs to be drawn between data that identifies and is about individuals, and the sorts of non-personal data that an organisation can release as open data.

Taking privacy seriously matters:

  • Data anonymisation cannot be relied upon. Studies conclusively show that we should not put our faith in anonymisation to protect individuals’ identities in published datasets. It’s not enough simply to remove names or dates of birth from a dataset before publishing it.
  • Any release of data drawn from personal data needs to follow from a clear risk assessment. It’s important to consider what harm could result from the release of any dataset. For example, if publishing a dataset that contains information on reported hate crime by postcode area, could a report be traced back to an individual, and could that lead to negative consequences for them?
  • It’s important to be aware of jigsaw re-identification risks. Jigsaw re-identification is the risk that putting together two open datasets will allow someone to unlock previously anonymised personal data. For example, if you publish one open dataset that maps where users of your service are and includes data on types of disability, and you publish another dataset that lists reports of hate crime by local area, could these be combined to discover the disability of the person who reported hate crime in a particular area, and then, perhaps with some information from a social network like Facebook or Twitter, used to identify that person? (A deliberately tiny, made-up illustration follows this list.)
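
To make the jigsaw risk concrete, here is an entirely fictional illustration of how two apparently harmless ‘anonymised’ datasets can be joined on a shared field (here, a postcode area) to narrow a report down to one individual. All the data values are invented.

```python
# Fictional jigsaw re-identification example: where a postcode area appears
# only once in each dataset, joining them pins a report to one person.
service_users = [
    {"area": "CM1", "disability": "Visual impairment"},
    {"area": "CM2", "disability": "Mobility impairment"},
]
hate_crime_reports = [
    {"area": "CM2", "month": "2012-03"},
]

for report in hate_crime_reports:
    matches = [u for u in service_users if u["area"] == report["area"]]
    if len(matches) == 1:
        print("Report in", report["area"], "links to someone with",
              matches[0]["disability"])
```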

Privacy concerns don’t mean that it’s impossible to produce open data from internal datasets of personal information, but care has to be taken. There can be a tension between the utility of open data and the privacy of personal data in a dataset. Organisations need to be careful to ensure that privacy concerns and the rights of service users always come first.

With the ECDP data on ‘Lived Experience’ we looked at how Google Refine could be used to extract from the data a list of ‘PCT Areas’ and ‘Issue Topics’ reported by service users, to map where the hot-spots were for particular issues at the PCT level. Whilst drawn from a dataset containing personal information, this extract would not include any personally identifying information, and might be suitable to publish as open data.
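
A code sketch of that kind of extraction might look like the following, assuming a hypothetical lived_experience.csv with ‘PCT Area’ and ‘Issue Topic’ columns (invented names): only aggregate counts survive into the output, never names or free-text detail. The sketch also drops very small counts, which is a common extra precaution for this kind of release rather than something we settled on during the day.

```python
# Extract a publishable, aggregate dataset from personal data: count reports
# by PCT area and issue topic, keeping nothing that identifies a person.
# File and column names are illustrative assumptions.
import csv
from collections import Counter

counts = Counter()
with open("lived_experience.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[(row["PCT Area"], row["Issue Topic"])] += 1

with open("issues_by_pct_area.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["PCT Area", "Issue Topic", "Reports"])
    for (area, topic), n in sorted(counts.items()):
        if n >= 5:  # suppress very small counts as an extra privacy precaution
            writer.writerow([area, topic, n])
```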

Open data fusion

Whilst a lot of our ‘open data day’ was spent on the foundations for open data work, rather than open data itself, we did work on one small project which had an immediate open data element.

Rich Watts brought to the session a spreadsheet of 250 Disabled People’s User Led Organisations (DPULOs), and wanted to find out (a) how many of these organisations were charities; and (b) what their turnover was. Fortunately, Open Charities has gathered exactly the data needed to answer that question as open data, and so we ran through how Google Fusion Tables could be used to merge Rich’s spreadsheet with existing charity data (see this How To for an almost identical project with Esmee Fairbairn grants data), generating the dataset needed to answer these questions in just under 10 minutes.
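
Fusion Tables handled the merge through a web interface, but the underlying operation is just a join on a shared key such as the registered charity number. The sketch below is a hypothetical code equivalent, with invented file and column names, matching a DPULO list against an export of charity data to pull across an income figure.

```python
# Hypothetical equivalent of the Fusion Tables merge: join a DPULO list to
# a charity register export on charity number and pull across the income.
# File and column names are invented for illustration.
import csv

with open("charity_register.csv", newline="", encoding="utf-8") as f:
    income_by_number = {row["Charity number"]: row["Income"]
                        for row in csv.DictReader(f)}

with open("dpulos.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        number = row.get("Charity number", "").strip()
        if number in income_by_number:
            print(row["Name"], "is a registered charity, income:",
                  income_by_number[number])
        else:
            print(row["Name"], "not matched to the charity register")
```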

We discussed how Rich might want to publish his spreadsheet of DPULOs as open data in future, or to contribute information on the fact that certain charities are Disabled People’s User Led Organisations back to an open data source like Open Charities.

Research Resources

The other element of our day was an exploration of online data sources useful in researching a local area, led fantastically by Matthew of PolicyWorks.

Many of the data sources Matthew was able to point to for finding labour market information, health statistics, demographic information and other stats provide online access to datasets, but don’t offer this as ‘open data’ that would meet the OKF’s open definition requirements. That raises some interesting questions about the balance between a purist approach to open data and an approach that looks for data which is ‘open enough’ for rough-and-ready research.

Where next?

Next week Nominet, NCVO and Big Lottery Fund are hosting a conference to bring together learning from all the different Open Data Days that have been taking place. The day will also see the release of a report on the potential of open data in the charity sector.

For me, today’s open data day has shown that we need to recognise some of the core data skills that organisations will need in order to benefit from open data: not just skills to use new online tools, but skills to manage the flow of data internally and to facilitate good data management. Investment in these foundations might turn out to be pivotal for realising open data’s third-sector potential…