Aligning Insight: standardisation and data collection for COVID-19 responses

[Summary: a brain-dump of thoughts on approaches to data standardisation relevant in the current coronavirus context.]

Over the last few weeks I’ve talked with a number of initiatives that are seeking to bring greater coherence to data collection on the impacts that coronavirus is having on their constituencies. Thousands of organisations, from chambers of commerce, to charity networks, and international agencies, are sending out surveys, or soliciting inputs, to help them understand the social, economic, organisational and operational impacts of the current pandemic – and to start charting ways forward in response.

This has led to a number of conversations asking how data standards could help. Common fears of wasted effort in duplicate data collection, missed insights from siloed data, or confusion created by incompatible categorisations, are all being compounded by the rapid data collection needs in this crisis. Yet, creating new standards can be a time-consuming process: involving in-depth negotiation of different user needs and capacities, careful drafting of definitions, and rigorous testing of schemas, in order to develop something that can function as an equitable tool for long-term communication and collaboration. That doesn’t mean, however, that it’s not possible to iterate towards more aligned and standardised data right now.

In this post I’ll try and set out a few (non-exhaustive) considerations on where some of the data standardisation practices I’ve engaged with over recent years fit in the current landscape, and some approaches to move towards aligning data collection initiatives.

Documentation, documentation, documentation

There are a couple of different parts to a data standard: definitions, which describe what the data should cover and what each field is about, and schemas, which determine how the data should be encoded, serialised and shared. But it is documentation that brings these together, and makes them widely usable.

Good documentation should allow people designing data collection instruments (surveys, studies etc.) to quickly identify the building blocks of standardisation that they can draw upon, and should make following the standard the path of least resistance, rather than an uphill struggle.

Ideally documentation should be clearly versioned, and, if intended for global use, published in ways that support language translation.

Start from user needs

It’s easy to fall into the trap of being ‘data driven’, and trying to work out ways to bring together ’all the data’ by imposing top-down structures on data collection or aggregation. But, in working out where to prioritise alignment of definitions and structures, it’s crucial to be driven by user need. In a crisis context, it may help to identify the primary user need that data pipelines are being built to meet (e.g. a dashboard for operational decision making), and secondary user needs that it is desirable to meet too (e.g. evaluating whether support has been provided equitably; gathering baselines for future research; supporting advocacy for funding certain needs). This will help guide decisions on…

…’just enough standardisation’

Standards are about the distribution of costs and benefits between data producers, intermediaries and data users. Without any standards, data users wanting to draw on data from different sources have to do all the work of reconciling differences and inconsistencies – and sometimes find different datasets are simply irreconcilable. Where multiple datasets have compatible definitions, but different schemas, it may be possible for intermediaries to do the work of creating a consistent dataset by standardising non-standard data. Where data producers are made responsible for data standardisation, they have to do the work of reconciling their own business needs and local definitions with the definitions and structures provided by a standard.

In the early stages of a crisis, the focus should be on what intermediaries can do: keeping the burden on data producers and users as low as possible, and focussing only on essential standardisation (guided by an understanding of user needs). By seeking to reconcile data from different sources, intermediaries will quickly learn which gaps in data alignment or standardisation are most costly to creating interoperable datasets.

Whilst adopting standards like the Open Contracting Data Standard or Beneficial Ownership Data Standard involves working with organisations over many months and even years to align their data (and in some cases, underlying business processes) with a shared model – in a crisis response, data producers need light-weight building blocks that make their job easier – giving them content to copy and paste into surveys, or data structures that can be easily implemented.

One well-developed approach for alignment in a crisis context comes from HXL – the Humanitarian eXchange Language – which provides a simple approach to marking up columns in spreadsheets using a collection of known hashtags, and then provides tools to combine and filter tagged data.
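
As a rough illustration of the idea (a minimal sketch: the spreadsheet contents are invented, though #org, #adm1, #affected and #date are standard HXL hashtags), columns can be selected by hashtag rather than by their locally varying header text:

    # A survey export with human-readable headers in row 1 and an HXL hashtag row beneath.
    rows = [
        ["Organisation", "Region", "People reached", "Date of report"],   # headers
        ["#org", "#adm1", "#affected", "#date"],                          # HXL hashtag row
        ["Community Kitchen", "North", "120", "2020-04-02"],
        ["Food Network", "South", "85", "2020-04-03"],
    ]

    headers, hashtags, data = rows[0], rows[1], rows[2:]
    col = {tag: i for i, tag in enumerate(hashtags)}   # hashtag -> column position

    # Filtering and combining by hashtag works even when two sources use different headers.
    for row in data:
        print(row[col["#org"]], row[col["#affected"]])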

(For more on ‘just enough’ thinking see Rachel Coldicutt’s post on just enough internet)

(Critically) re-use existing standards

It’s rare that you will need to ‘invent’ any standards from scratch: standardisation is often an assembly job: working out which existing standards to align with, and which pieces are aligned enough to work together. As a starting point I often turn to schema.org, the ad-hoc effort by search engines to create a common (and relatively loose) vocabulary of terms to describe everything from people, local businesses and books, to pandemic related data, or I look at conventions in use in existing datasets in the domain I’m helping create data models for.

Certain lower-level conventions, like using ISO dates, Unicode for text, and ISO language and country codes, are also worth encouraging and documenting: although in most cases, as long as a data source is internally consistent in how it encodes countries, dates, languages and so on, intermediaries will be able to more-or-less map the data to common codes over the short-term.
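
For instance, a minimal sketch of what that intermediary mapping might look like (the source’s date format and the country lookup table here are invented for illustration):

    from datetime import datetime

    # Hypothetical source: dates are always day/month/year, countries are English names.
    SOURCE_DATE_FORMAT = "%d/%m/%Y"
    COUNTRY_CODES = {"United Kingdom": "GB", "Kenya": "KE", "Indonesia": "ID"}

    def normalise(row):
        """Map one internally consistent source row onto common codes."""
        return {
            "date": datetime.strptime(row["date"], SOURCE_DATE_FORMAT).date().isoformat(),  # ISO 8601
            "country": COUNTRY_CODES[row["country"]],   # ISO 3166-1 alpha-2
            "responses": int(row["responses"]),
        }

    print(normalise({"date": "03/04/2020", "country": "Kenya", "responses": "42"}))
    # {'date': '2020-04-03', 'country': 'KE', 'responses': 42}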

I say that one should ‘critically’ re-use existing standards, because, as the fantastic Data Feminism book underscores, definitions of data are about power: about whose lived experience and accounts of the world will be represented and shared. There is often a balance to strike between adopting common ways of representing the world, and challenging oppressive and problematic representations.

Particularly when building standards for use across national and cultural boundaries, this calls for an awareness of the many falsehoods embedded in data models, and consideration of the embedded assumptions in off-the-shelf data models. It can also call for a sensitivity to when standards, even in a crisis, should not take the path of least resistance, but should introduce some friction in deciding which categories to use, or how to disaggregate data. For example, where user needs (and here is where considering diverse secondary user needs can be important, as ‘primary user needs’ may often represent dominant power perspectives) require an understanding of how data varies by gender, or the ability to provide intersectional disaggregation, then standards should make clear how this should be recorded and shared.

Look for the keys

One way to lower the burden on data collectors is to look for the keys that unlock additional existing open datasets. For example:

  • Postcodes in many countries allow data to be geocoded, and allow you to integrate a range of local classifications and statistics. In the UK, collecting the postcode of where a service is delivered allows you to look up the socio-economic status of the area, the local authority responsible for service delivery there, and a whole host of other information (see the sketch after this list). In other countries, it may be possible to match location data with satellite observation data to infer other relevant classifications for a survey respondent.
  • Organisation identifiers – which, if collected and well validated, can be reconciled against public databases to find information on companies, charities and other entities. In the UK, a Charity number can be used to look up classification data on the organisation’s beneficiaries taken from annual charity returns. For many nations, company numbers can be reconciled against OpenCorporates to provide detailed corporate information.
  • URLs and Social Media IDs can be useful in some use-cases to crawl web pages and social networks to find signals about the networks an organisation is part of, or the topics they work on.
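
For example, a sketch of the postcode case using the free postcodes.io API for UK postcodes (the requests library is assumed, and the exact response fields should be checked against the API documentation):

    import requests

    def local_authority(postcode):
        """Look up the local authority district for a UK postcode via postcodes.io."""
        resp = requests.get("https://api.postcodes.io/postcodes/" + postcode.replace(" ", ""))
        resp.raise_for_status()
        result = resp.json()["result"]
        # 'admin_district' holds the local authority; other fields in the response
        # cover statistical geographies that unlock further open datasets.
        return result["admin_district"]

    print(local_authority("SW1A 1AA"))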

Each sector and domain is also likely to have some of its own ‘keys’ that can hook into existing datasets (e.g. the Common Procurement Vocabulary for classifying public procurements in Europe). If you are lucky, they will be attached to relevant open datasets.

Care still needs to be taken to consider gaps in the lookup data (e.g. some countries lack open corporate register data; satellite data coverage varies; not all organisations have websites), and to avoid introducing biases through faulty assumptions (e.g. if assuming the ‘registered office’ postcode of UK charities is where their beneficiaries are, then it looks like London gets more funding than it does). It’s also important to consider how easy it will be for those providing data to enter it. For example, do organisations know their registration number? (On the organisation identifiers point, this is one of the reasons I was involved in creating org-id.guide and there remains a lot still to do in this area).

Decide on your approach to categories

At the heart of many standardisation processes is classification: sorting needs, organisations, events or people into categories. Standardising categories can be notoriously difficult: and is often hard to do in a rush. You might find there are existing classification schemes you can draw upon, or you might find a need to create your own (or, as LandVoc has done, albeit over a number of years, to engage with an existing classification scheme to get the elements you need included).

Good documentation of the boundaries of a category (ideally with examples) is vital if categories are to be used in interoperable ways.

Many of the standards I’ve worked on have stepped back from settling categorisation debates, instead representing classification elements in terms of:

  • A vocabulary – to allow different datasets to use different classification schemes
  • A code – that stays constant across languages
  • A label – that can be translated into local languages

This offers a way to at least avoid two people talking about different things with the same terms, but leaves the alignment problem to later.
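
In practice that pattern looks something like the sketch below (the scheme names, codes and crosswalk here are purely illustrative):

    # The same structural pattern, two different classification schemes.
    need_local = {"scheme": "local-crisis-needs", "code": "FOOD-01", "label": "Aide alimentaire"}
    need_core  = {"scheme": "core-categories",    "code": "FOOD",    "label": "Food security"}

    # Codes stay constant across languages; labels are for people and can be translated.
    # A later alignment step can map codes between schemes without touching the source data.
    crosswalk = {("local-crisis-needs", "FOOD-01"): ("core-categories", "FOOD")}
    print(crosswalk[(need_local["scheme"], need_local["code"])])   # ('core-categories', 'FOOD')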

In an ideal world, a rapid standardisation project might be able to provide ‘good enough’ categories for data collectors to start with, but then offer them some level of flexibility so that individual data collection exercises can address their local user needs by adapting core categorisations.

Semantic standards such as SKOS have a lot to offer to efforts to bring together data using heterogeneous classification schemes: allowing not only hierarchical relationships (i.e. the ability to add a ‘narrower’ concept under a headline category), but also broad and narrow matches between neighbouring concepts. However, tools and skills for working well with this kind of data and classification structure are, in my experience, quite scarce.
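
As a rough sketch of the underlying idea, using plain data structures rather than RDF tooling (the concept identifiers are invented), broader/narrower links plus cross-scheme matches let an intermediary roll local codes up to shared headline categories:

    # Local scheme: each concept points to its broader concept, if any
    # (in the spirit of skos:broader).
    broader = {"food-parcels": "food-aid", "community-meals": "food-aid"}

    # Cross-scheme alignment, in the spirit of skos:closeMatch / skos:broadMatch.
    close_match = {"food-aid": "FOOD"}   # local concept -> headline category code

    def headline_category(concept):
        """Walk up 'broader' links until a concept with a cross-scheme match is found."""
        while concept is not None:
            if concept in close_match:
                return close_match[concept]
            concept = broader.get(concept)
        return None

    print(headline_category("food-parcels"))   # FOOD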

Meta-data matters

One of the most important things to help intermediaries align different datasets is ‘data about the data’. Knowing who collected a dataset (ideally with the ability to contact them), knowing when and where it was collected, and ideally having pointers to the survey forms or data collection instruments used, can make the process of ingesting and reconciling disparate datasets a lot, lot easier.
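
A minimal sketch of the sort of ‘data about the data’ that helps (the fields here are illustrative; established meta-data standards in your domain will be more complete):

    dataset_metadata = {
        "title": "Member impact survey, wave 2",
        "collected_by": "Example Chamber of Commerce",          # who collected it
        "contact": "data@example.org",                          # how to reach them
        "collection_period": {"from": "2020-04-01", "to": "2020-04-14"},   # when
        "coverage": "South West England",                       # where
        "instrument": "https://example.org/survey-wave-2.pdf",  # the survey form used
        "licence": "https://creativecommons.org/licenses/by/4.0/",
    }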

Conventions like MetaTab provide an easy way to get started providing standardised meta-data when circulating spreadsheets, and there are well established standards for meta-data in most domains.

Meta-data should also include clear information on restrictions or permissions that apply to re-use of a dataset, which brings me onto:

Don’t forget standards of data governance

The first question to ask before making use of any dataset that might contain sensitive information from individuals or organisations is: do I have the right to use this data? Does using or sharing this data (or analysis based on it), put anyone at risk?

As the responsible data initiative puts it, there is a:

…collective duty to account for unintended consequences of working with data by:

1) prioritising people’s rights to consent, privacy, security and ownership when using data in social change and advocacy efforts,

2) implementing values and practices of transparency and openness.

Working out early on a set of shared procedures for assessing the need for, obtaining and recording consents from data subjects for data sharing and re-use can avoid hitting barriers later on. This might take a number of forms, such as:

  • Suggested privacy policy terms that describe how data might be shared and re-used;
  • Identifying the different states that consent might take (e.g. consent for data to be ‘shared’ with identified partners, or consent for non-personal data to be ‘open’ – drawing on the ODI’s data spectrum) and how these should be encoded in each relevant row of a dataset (a sketch follows this list);
  • Adding a section to meta-data templates for those sharing data to indicate who else data can be shared with, and if any fields should be masked from an open version of a dataset.
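
A sketch of how row-level consent states might be recorded and acted on (the state names and fields are illustrative, loosely inspired by the ODI data spectrum):

    # Illustrative consent states agreed across a data collection initiative.
    CONSENT_STATES = {
        "open": "non-personal data may be published openly",
        "shared": "may be shared with named partner organisations",
        "internal": "may only be used by the collecting organisation",
    }

    responses = [
        {"org_name": "Community Kitchen", "staff_count": 4, "consent": "open"},
        {"org_name": "Helpline Trust", "staff_count": 9, "consent": "internal"},
    ]

    # Only rows whose recorded consent permits it make it into the shared extract.
    shareable = [r for r in responses if r["consent"] in ("open", "shared")]
    print([r["org_name"] for r in shareable])   # ['Community Kitchen']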

Standards are about people

Lastly, but by no means least – it is important to think of standards as a process, not a product. That documentation I mentioned at the start? That’s not for users: that’s for you. Because most of the time people don’t read documentation: they don’t have the time, or don’t know where to start. In reality, most of the standards I’ve worked on require conversations, engagement and feedback to help people align their data with them.

If someone is designing a data collection survey, the prime opportunity for standardisation is between their first draft, and it going out in the field. If you can get into a conversation then, and provide prioritised feedback on how it can align more with the documented standard, how it could incorporate some ‘key fields’ that will unlock other data, or how the consent questions could be worded to be compatible with shared data governance, then you have a good chance that the data which flows from that collection can be brought together as part of a wider, aligned insight dataset.

In all the standards I’ve worked on, the ‘Helpdesk’ team have been as vital as the documentation and schema to making standards truly work as tools of coordination and collaboration.

 

 

Three cross-cutting issues that UK data sharing proposals should address

[Summary: an extended discussion of issues arising from today’s UK data sharing open policymaking discussions]

I spend a lot of time thinking and writing about open data. But, as has often been said, not all of the data that government holds should be published as open data.

Certain registers and datasets managed by the state may contain, or be used to reveal, personally identifying and private information – justifying strong restrictions on how they are accessed and used. Many of the datasets governments collect, from tax records to detailed survey data collected for policy making and monitoring, fall into this category. However, the principle that data collected for one purpose might have a legitimate use in another context still applies to this data: one government department may be able to pursue its public task with data from another, and there are cases where public benefit is to be found from sharing data with academic and private sector researchers and innovators.

However, in the UK, the picture of which departments, agencies and levels of government can share which data with others (or outside of the state) is complex to say the least. When it comes to sharing personally identifying datasets, agencies need to rely on specific ‘legal gateways’, with certain major data holders such as HM Revenue and Customs bound by restrictive rules that may require explicit legislation to pass through parliament before specific data shares are permitted.

That’s ostensibly why the UK Government has been working for a number of years now on bringing forward new data sharing proposals – creating ‘permissive powers’ for cross-departmental and cross-agency data sharing, increasing the ease of data flows between national and local government, whilst increasing the clarity of safeguards against data mis-use. Up until just before the last election, an Open Policy Making process, modelled broadly on the UK Open Government Partnership process, was taking place – resulting in a refined set of potential proposals relating to identifiable data sharing, data sharing for fraud reduction, and use of data for targeted public services. Today that process was re-started, with a view to a public consultation on updated proposals in the coming months.

However, although much progress has been made in refining proposals based on private sector and civil society feedback, from the range of specific and somewhat disjointed proposals presented for new arrangements in today’s workshop, it appears the process is a way off from providing the kinds of clarification of the current regime that might be desirable. Missing from today’s discussions were clear cross-cutting mechanisms to build trust in government data sharing, and establish the kind of secure data infrastructures that are needed for handling personal data sharing.

I want to suggest three areas that need to be more clearly addressed – all of which were raised in the 2014/15 Open Policymaking process, but which have been somewhat lost in the latest iterations of discussion.

1. Maximising impact, minimising the data shared

One of the most compelling cases for data sharing presented in today’s workshop was work to address fuel poverty by automatically giving low-income pensioners rebates on their fuel bills. Discussions suggested that since the automatic rebate was introduced, 50% more eligible recipients are getting the rebates – with the most vulnerable, who were far less likely to apply for the rebates they were entitled to, the biggest beneficiaries. With every degree drop in the temperature of a pensioner’s home correlating with increased hospital admissions, the argument for allowing the data share, and indeed establishing the framework for current arrangements to be extended to others in fuel poverty (the current powers are specific to pensioners’ data in some way), is clear.

However, this case is also one where the impact is accompanied by a process that results in minimal data actually being shared from government to the private companies who apply the rebates to individuals’ energy bills. All that is shared in response to energy companies’ queries for each candidate on their customer list is a flag for whether the individual is eligible for the rebate or not.

This kind of approach does not require the sharing of a bulk dataset of personally identifying information – it requires a transactional service that can provide the minimum certification required to indicate, with some reasonable level of confidence, that an individual has some relevant credentials. The idea of privacy protecting identity services which operate in this way is not new – yet the framing of the current data sharing discussion has tended to focus on ‘sharing datasets’ instead of constructing processes and technical systems which can be well governed, and still meet the vast majority of use-cases where data shares may be required.
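
A hypothetical sketch of the shape of such a transactional service, in which the supplier submits a customer it already holds details for and gets back only a yes/no flag (all names and fields are invented; the real data matching arrangements behind the rebate scheme will differ in detail):

    def check_eligibility(query, register):
        """Answer a supplier's query with a flag only; the underlying record never leaves."""
        matched = any(
            person["name"] == query["name"] and person["postcode"] == query["postcode"]
            for person in register
        )
        return {"eligible": matched}   # the only data that crosses the boundary

    # Stand-in for the government-held dataset, which stays behind the service.
    GOVERNMENT_REGISTER = [{"name": "A N Example", "postcode": "EX1 1AA"}]

    print(check_eligibility({"name": "A N Example", "postcode": "EX1 1AA"}, GOVERNMENT_REGISTER))
    # {'eligible': True}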

For example, when the General Records Office representative today posed the question of “In what circumstances would it be appropriate to share civil registration (e.g. Birth, Adoption, Marriage and Death) information?”, the use-cases that surfaced were all to do with verification of identity: something that could be achieved much more safely by providing a digital service than by handing over datasets in bulk.

Indeed, approached as a question of systems design, rather than data sharing, the fight against fraud may in practice be better served by allowing citizens to digitally access their own civil registration information and to submit that as evidence in their transactions with government, helping narrow the number of cases where fraud may be occurring – and focussing investigative efforts more tightly, instead of chasing after problematic big data analysis approaches.

(Aside #1: As one participant in today’s workshop insightfully noted, there are thousands of valid marriages in the UK which are not civil marriages and so may not be present in civil registers. A big data approach that seeks to match records of who is married to records of households who have declared they are married, to identify fraudulent claims, is likely to flag these households wrongly, creating new forms of discrimination. By contrast, an approach that helps individuals submit their evidence to government allows such ‘edge cases’ to be factored in – recognising that many ‘facts’ about citizens are not easily reduced to simple database fields, and that giving an account of one’s self to the state is a performative act which should not be too readily sidelined.)

(Aside #2: The case of civil registers also illustrates an interesting and significant qualitative difference between public records, and a bulk public dataset. Births, marriages and deaths are all ‘public events’: there is no right to keep them private, and they have long been recorded in registers which are open to inspection. However, when the model of access to these registers switches from focussed inspection, looking for a particular individual, to bulk access, they become possible to use in new ways – for example, creating a ‘primary key’ of individuals to which other data can be attached, eroding privacy in ways which were not possible when each record needed to be explored individually. The balance of benefits and harms from this qualitative change will vary from dataset to dataset. For example, I would strongly advocate the open sharing of company registers, including details of beneficial owners, both because of the public benefit of this data, and because registering a company is a public act involving a certain social contract. By contrast, I would be more cautious about the full disclosure of all civil registers, due to the different nature of the social contract involved, and the greater risk of vulnerable individuals being targeted through intentional or unintentional misuse of the data.)

All of which is a long way to say:

  • Where the cross-agency or cross-departmental use-cases for access to a particular dataset can be reduced to sharing assertions about individuals, rather than bulk datasets, this route should be explored first.

This does not remove the need for governance of both access and data use. However, it does ease the governance of access, and audit logs of access to a service are easier to manage than audit logs of what users in possession of a dataset have done.

Even the sharing of a ‘flag’ that can be applied to an individual’s data record needs careful thought: and those in receipt of such flags need to ensure they govern the use of that data carefully. For example, as one participant today noted, pensioners have raised fears that energy companies may use a ‘fuel poverty’ flag in their records to target them with advertising. Ensuring that analysts in the company do not later stumble upon the rebate figures in invoices, and feed this into profiling of customers, will require very careful data governance – and it is not clear that companies’ practices are robust enough to protect against this right now.

2. Algorithmic transparency

Last year the Detroit Digital Justice Coalition produced a great little zine called ‘Opening Data’ which takes a practical look at some of the opportunities and challenges of open data use. They look at how data is used to profile communities, and how the classifications and clustering approaches applied to data can create categories that may be skewed and biased against particular groups, or that reinforce rather than challenge social divides (see pg 30 onwards). The same issues apply to data sharing.

Whilst current data protection legislation gives citizens a right to access and correct information about themselves, the algorithms used to process that data, and derive analysis from it are rarely shared or open to adequate scrutiny.

In the process of establishing new frameworks for data sharing, the algorithms used to process that data should be brought into view as much as the datasets themselves.

If, for example, someone is offered a targeted public service, or targeted in a fraud investigation, there is a question to be explored of whether they should be told which datasets, and which algorithms, led to them being selected. This, and associated transparency, could help to surface unseen biases which might otherwise lead to particular groups being unfairly targeted (or missed) by analysis. Transparency is no panacea, but it plays an important role as a safeguard.

3. Systematic transparency of sharing arrangements

On the theme of transparency, many of the proposals discussed today mentioned oversight groups, Privacy Impact Assessments, and publication of information on either those in receipt of shared data, or those refused access to datasets – yet across the piece no systematic framework for this was put forward.

This is an issue Reuben Binns and I wrote about in 2014, putting forward a proposal for a common standard for disclosure of data sharing arrangements that, in its strongest form, would require (see the sketch after this list):

  • Structured data on origin, destination, purpose, legal framework and timescales for sharing;
  • Publication of Privacy Impact Assessments and other associated documents;
  • Notices published through a common venue (such as the Gazette) in a timely fashion;
  • Consultation windows where relevant before a power comes into force;
  • Sharing to only be legally valid when the notice has been published.
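
A sketch of what a single disclosure notice under such a framework might contain (every field name and value below is invented for illustration):

    data_share_notice = {
        "origin": "Department A",
        "destination": "Department B",
        "purpose": "Identifying households eligible for a fuel poverty rebate",
        "legal_framework": "Hypothetical legal gateway reference",
        "data_fields": ["name", "address", "eligibility-flag"],
        "start_date": "2016-04-01",
        "end_date": "2018-03-31",
        "privacy_impact_assessment": "https://example.org/pia/1234.pdf",
        "consultation_closes": "2016-03-15",
        "notice_published": "2016-02-15",   # sharing only legally valid once published
    }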

Without such a framework, we are likely to end up with the current confused system in which no-one knows which shares are in place, how they are being used, and which legal gateways are functioning well or not. With a scattered set of spreadsheets and web pages listing approved sharing, citizens have no hope of understanding how their data is being used.

If only one of the above issues can be addressed in the upcoming consultation on data sharing, then I certainly hope progress is made on this missing piece: a robust common framework for putting the transparency principles of data sharing into practice.

Towards a well governed infrastructure?

Ultimately, the discussion of data sharing is a discussion about one aspect of our national data infrastructure. There has been a lot of smart work going on, both inside and outside government, on issues such as identity assurance, differential privacy, and identifying core derived datasets which should be available as open data to bypass the need for sharing gateways. A truly effective data sharing agenda needs to link with these to ensure it is neither creating over-broad powers which are open to abuse, nor establishing a new web of complex and hard to operate gateways.

Further reading

My thinking on these issues has been shaped in part by inputs from the following:

Data & Discrimination – Collected Essays

White House Report on Big Data, and associated papers/notes from The Social, Cultural & Ethical Dimensions of “Big Data.” conference

Slow down with the standards talk: it’s interoperability & information quality we should focus on

[Summary: cross-posting a contribution to the discussions on the International Open Data Conference blog]

There is a lot of focus on standards in the run up to the International Open Data Conference in Ottawa next week. Two of the Action Area workshops on Friday are framed in terms of standards – at the level of data publication best practices, and collaboration between the standards projects working on thematic content standards at the global level.

It’s also a conversation of great relevance to local initiatives, with CTIC writing on the increasing tendency of national open data regulations to focus on specific datasets that should be published, and to prescribe data standards to be used. This trend is mirrored in the UK Local Government Transparency Code, accompanied by schema guidance from the Local Government Association, and even where governments are not mandating standards, community efforts have emerged in the US and Australia to develop common schemas for publication of local data – covering topics from budgets to public toilet locations.

But – is all this work on standards heading in the right direction? In his inimitable style, Friedrich Lindenberg has offered a powerful provocation, challenging those working on standards to consider whether the lofty goal of creating common ways of describing the world so that all our tools just seamlessly work together is really a coherent or sensible one to be aiming for.

As Friedrich notes, there are many different meanings of the word ‘standard’, and often multiple versions of the word are in play in our discussions and our actions. Data standards like the General Transit Feed Specification, International Aid Transparency Initiative Schema, or Open Contracting Data Standard are not just technical descriptions of how to publish data: they are also rhetorical and disciplinary interventions, setting out priorities about what should be published, and how it should be represented. The long history of (failed) attempts to find general logical languages to describe the world across different contexts should tell us that data standards are always going to encode all sorts of social and cultural assumptions – and that the complexity of our real-world relationships, and all that we want to know about the different overlapping institutional domains that affect our lives, will never be easily rendered into a single set of schema.

This is not to say we should not pursue standardisation: standards are an important tool. But I want to suggest that we should embed our talk of standards within a wider discussion about interoperability, and information quality.

An interop approach

I had the chance to take a few minutes out of IODC conference preparations last week to catch up with Urs Gasser, co-author of Interop: The Promise and Perils of Highly Interconnected Systems, and one of the leaders of the ongoing interop research effort. As Urs explained, an interoperability lens provides another way of thinking about the problem standards are working to address.

Where a focus on standards leads us to focus on getting all data represented in a common format, and on using technical specifications to pursue policy goals – an interoperability focus can allow us to incorporate a wider range of strategies: from allowing the presence of translation and brokering layers between different datasets, to focussing on policy problems directly to secure the collection and disclosure of important information.

And even more importantly, an interop approach allows us to discuss what the right level of interoperability to aim for is in any situation: recognising, for example, that as standards become embedded, and sunk into our information infrastructures, they can shift from being a platform for innovation, to a source of inertia and constraints on progress. Getting the interoperability level right in global standards is also important from a power perspective: too much interoperability can constrain the ability of countries and localities to adapt how they express data to meet their own needs.

For example, looked at through a standards lens, the existence of different data schemas for describing the location of public toilets in Sydney, Chennai and London is a problem. From the standards perspective we want everyone to converge on the same schema and to use the same file formats. For that we’re going to need a committee to manage a global standard, and an in-depth process of enrolling people in the standard. And the result will almost undoubtedly be just one more standard out there, rather than one standard to rule them all, as the obligatory XKCD cartoon contends.

But through an interoperability lens, the first question is: what level of interoperability do we really need? And what are the consequences of the level we are striving for? It invites us to think about the different users of data, and how interoperability affects them. For example, a common data schema used by all cities might allow a firm providing a loo-location app in Ottawa to use the same technical framework in Chennai – but is this really the ideal outcome? The consequences of this could be to crowd out local developers who could build something much more culturally contextualised. And there is generally nothing to stop the Ottawa firm from building a translation layer between the schema used in their app, and the data disclosed in other cities – as long as the disclosures of data in each context include certain key elements, and are internally consistent.
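
To make the translation-layer point concrete, here is a toy sketch (both city schemas and all field names are invented) in which the app keeps a small adapter from each local schema to its own internal model:

    # Two cities publish loo locations with different (invented) schemas.
    ottawa_record = {"facility_name": "City Hall WC", "lat": 45.42, "lng": -75.69, "accessible": "Y"}
    chennai_record = {"name": "Beach toilet", "location": {"latitude": 13.05, "longitude": 80.28},
                      "wheelchair_access": True}

    def from_ottawa(r):
        return {"name": r["facility_name"], "lat": r["lat"], "lon": r["lng"],
                "accessible": r["accessible"] == "Y"}

    def from_chennai(r):
        return {"name": r["name"], "lat": r["location"]["latitude"],
                "lon": r["location"]["longitude"], "accessible": r["wheelchair_access"]}

    # As long as each source is internally consistent and includes the key elements,
    # the app's internal model stays the same whichever city the data comes from.
    toilets = [from_ottawa(ottawa_record), from_chennai(chennai_record)]
    print([t["name"] for t in toilets])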

Secondly, an interoperability lens encourages us to consider a whole range of strategies: from regulations that call for consistent disclosure of certain information without going as far as specifying schemas, to programmes to develop common identification infrastructures, to the development and co-funding of tools that bridge between data captured in different countries and contexts, and the fostering of collaborations between organisations to work together on aggregating heterogeneous data.

As conversations develop around how to enable collaboration between groups working on open aid data, public contracts, budgets, extractives and so-on, it is important to keep the full range of tools on the table for how we might enable users to find connections between data, and how the interoperability of different data sources might be secured: from building tools and platforms, working together on identifiers and small building-blocks of common infrastructure, to advocating for specific disclosure policies and, of course, discussing standards.

Information quality

When it comes down to it – for many initiatives, standards and interoperability are only a means to another end. The International Aid Transparency Initiative cares about giving aid receiving governments a clear picture of the resources available to them. The Open Contracting Partnership want citizens to have the data they need to be more engaged in contracting, and for corruption in procurement to be identified and stopped. And the architects of public loo data standards don’t want you to get caught short.

Yet often our information quality goals can get lost as we focus on assessing and measuring the compliance of data with schema specs. Interoperability and quality are distinct concepts, although they are closely linked. Having standardised, or at least interoperable, data makes it easier to build tools which go some of the way to assessing information quality, for example.

[Figure: interoperability and information quality]

But assessing information quality goes beyond this. Assessments need to take place from the perspective of real use-cases. Whilst often standardisation aims at abstraction, our work on promoting the quality, relevance and utility of data sharing – at both the local and global levels – has to be rooted in very grounded problems and projects. Some of the work Johanna Walker and Mark Frank have started on user-centered methods for open data assessment, and Global Integrity’s bottom-up Follow The Money work starts us down this path, but we’ve much more work to do to make sure our discussions of data quality are substantive as well as technical.

Thinking about assessing information quality distinct from interoperability can also help us to critically analyse the interoperability ecosystems that are being developed. We can look at whether an interoperability approach is delivering information quality for a suitably diverse range of stakeholders, or whether the costs of getting information to the required quality for use are falling disproportionately on one group rather than another, or are leading to certain use-cases for data being left unrealised.

Re-framing the debate

I’m not calling for us to abandon a focus on standards. Indeed, much of the work I’m committed to in the coming year is very much involved in rolling out data standards. But I do want to invite us to think about framing our work on standards within a broader debate on interoperability and information quality (and ideally to embed this conversation within the even broader context of thinking on Information Justice, and an awareness of critical information infrastructure studies, and work on humanistic approaches to data).

Exactly what shape that debate takes: I don’t know yet… but I’m keen to see where it could take us…

OCDS – Notes on a standard

Today sees the launch of the first release of the Open Contracting Data Standard (OCDS). The standard, as I’ve written before, brings together concrete guidance on the kinds of documents and data that are needed for increased transparency in processes of public contracting, with a technical specification describing how to represent contract data and meta-data in common ways.

The video below provides a brief overview of how it works (or you can read the briefing note), and you can find full documentation at http://standard.open-contracting.org.

When I first jotted down a few notes on how to go forward from the rapid prototype I worked on with Sarah Bird in 2012, I didn’t realise we would actually end up with the opportunity to put some of those ideas into practice. However: we did – and so in this post I wanted to reflect on some aspects of the standard we’ve arrived at, some of the learning from the process, and a few of the ideas that have guided at least my inputs into the development process.

As, hopefully, others pick up and draw upon the initial work we’ve done (in addition to the great inputs we’ve had already), I’m certain there will be much more learning to capture.

(1) Foundations for ‘open by default’

Early open data advocacy called for ‘raw data now‘, asking governments essentially to export and dump online existing datasets, with issues of structure and regular publishing processes to be sorted out later. Yet, as open data matures, the discussion is shifting to the idea of ‘open by default’, and taken seriously this means more than just openly licensing the data dumps that are created: it should mean that data is released from government systems as a matter of course as part of their day-to-day operation.

The full OCDS model is designed to support this kind of ‘open by default’, allowing publishers to provide small releases of data every time some event occurs in the lifetime of a contracting process. A new tender is a release. An amendment to that tender is a release. The contract being awarded, or then signed, are each releases. These data releases are tied together by a common identifier, and can be combined into a summary record, providing a snapshot view of the state of a contracting process, and a history of how it has developed over time.
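
In outline, the model looks something like the sketch below (a simplified illustration only; the ocid prefix and values are invented, and the OCDS documentation remains the authoritative source for the schema):

    # Two releases from the same contracting process, tied together by a shared ocid.
    tender_release = {
        "ocid": "ocds-abc123-0001",     # publisher prefix + internal identifier (illustrative)
        "id": "0001-tender",
        "date": "2014-10-01T00:00:00Z",
        "tag": ["tender"],
        "tender": {"title": "Office refurbishment", "value": {"amount": 100000, "currency": "GBP"}},
    }
    award_release = {
        "ocid": "ocds-abc123-0001",
        "id": "0001-award",
        "date": "2014-11-15T00:00:00Z",
        "tag": ["award"],
        "awards": [{"id": "award-1", "value": {"amount": 95000, "currency": "GBP"}}],
    }

    # A record is a snapshot compiled from all the releases sharing an ocid.
    releases = [tender_release, award_release]
    record = {"ocid": releases[0]["ocid"], "releases": releases}
    print(len(record["releases"]), "releases for", record["ocid"])   # 2 releases for ocds-abc123-0001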

This releases and records model seeks to combine together different user needs: from the firm seeking information about tender opportunities, to the civil society organisation wishing to analyse across a wide range of contracting processes. And by allowing core stages in the business process of contracting to be published as they happen, and then joined up later, it is oriented towards the development of contracting systems that default to timely openness.

As I’ll be exploring in my talk at the Berkman Centre next week, the challenge ahead for open data is not just to find standards to make existing datasets line-up when they get dumped online, but is to envisage and co-design new infrastructures for everyday transparent, effective and accountable processes of government and governance.

(2) Not your minimum viable product

Different models of standard

Many open data standard projects adopt either a ‘Minimum Viable Product‘ approach, looking to capture only the few most common fields between publishers, or are developed by focussing on the concerns of a single publisher or user. Whilst MVP models may make sense for small building blocks designed to fit into other standardisation efforts, when it came to OCDS there was a clear user demand to link up data along the contracting process, and this required an overarching framework into which simple components could be placed, or from which they could be extracted, rather than the creation of ad-hoc components, with the attempt to join them up made later on.

Whilst we didn’t quite achieve the full abstract model + idiomatic serialisations proposed in the initial technical architecture sketch, we have ended up with a core schema, and then suggested ways to represent this data in both structured and flat formats. This is already proving useful, for example, in exploring how data published as part of the UK Local Government Transparency Code might be mapped to OCDS from existing CSV schemas.

(3) The interop balancing act & keeping flex in the framework

OCDS is, ultimately, not a small standard. It seeks to describe the whole of a contracting process, from planning, through tender, to contract award, signed contract, and project implementation. And at each stage it provides space for capturing detailed information, linking to documents, tracking milestones, and tracking values and line-items.

This shape of the specification is a direct consequence of the method adopted to develop it: looking at a diverse set of existing data, and spending time exploring the data that different users wanted, as well as looking at other existing standards and data specifications.

However, OCDS by no means covers all the things that publishers might want to state about contracting, nor all the things users may want to know. Instead, it focusses on achieving interoperability of data in a number of key areas, and then providing a framework into which extensions can be linked as the needs of different sub-communities of open data users arise.

We’re only in the early stages of thinking about how extensions to the standard will work, but I suspect they will turn out to be an important aspect: allowing different groups to come together to agree (or contest) the extra elements that are important to share in a particular country, sector or context. Over time, some may move into the core of the standard, and potentially elements that appear core right now might move into the realm of extensions, each able to have their own governance processes if appropriate.

As Urs Gasser and John Palfrey note in their work on Interop, the key in building towards interoperability is not to make everything standardised and interoperable, but is to work out the ways in which things should be made compatible, and the ways in which they should not. Forcing everything into a common mould removes the diversity of the real world, yet leaving everything underspecified means no possibility to connect data up. This is both a question of the standards, and the pressures that shape how they are adopted.

(4) Avoiding identity crisis

Data describes things. To be described, those things need to be identified. When describing data on the web, it helps if those things can be unambiguously identified and distinguished from other things which might have the same names or identification numbers. This generally requires the use of globally unique identifiers (guid): some value which, in a universe of all available contracting data, for example, picks out a unique contracting process; or, in the universe of all organizations, uniquely identifies a specific organization. However, providing these identifiers can turn out to be both a politically and technically challenging process.

The Open Data Institute have recently published a report on the importance of identifiers that underlines how important identifiers are to processes of opening data. Yet, consistent identifiers often have key properties of public goods: everyone benefits from having them, but providing and maintaining them has some costs attached, which no individual identifier user has an incentive to cover. In some cases, such as goods and service identifiers, projects have emerged which take a proprietary approach to fund the maintenance of those identifiers, selling access to the lookup lists which match the codes for describing goods and services to their descriptions. This clearly raises challenges for an open standard, as when proprietary identifiers are incorporated into data, then users may face extra costs to interpret and make sense of data.

In OCDS we’ve sought to take as distributed an approach to identifiers as possible, only requiring globally unique identifiers where absolutely necessary (identifying contracts, organizations and goods and services), and deferring to existing registration agencies and identity providers, with OCDS maintaining, at most, code lists for referring to each identity ‘scheme’.

In some cases, we’ve split the ‘scheme’ out into a separate field: for example, an organization identifier consists of a scheme field with a value like ‘GB-COH’ to stand for UK Companies House, and then the identifier given in that scheme, like ‘5381958’. This approach allows people to store those identifiers in their existing systems without change (existing databases might hold national company numbers, with the field assumed to come from a particular register), whilst making explicit the scheme they come from in the OCDS. In other cases, however, we look to create new composite string identifiers, combining a prefix and some identifier drawn from an organization’s internal system. This is particularly the case for the Open Contracting ID (ocid). By doing this, the identifier can travel between systems more easily as a guid – and could even be incorporated in unstructured data as a key for locating documents and resources related to a given contracting process.
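
A small sketch of the two patterns side by side (the prefix and internal reference are invented; the company number is the illustrative one used above):

    # Pattern 1: the identifier scheme split out into its own field.
    supplier_identifier = {"scheme": "GB-COH", "id": "5381958"}   # UK Companies House number

    # Pattern 2: a composite string identifier (prefix plus internal reference),
    # which can travel between systems and unstructured documents as a single key.
    OCID_PREFIX = "ocds-abc123"             # illustrative publisher prefix
    internal_reference = "PROC-2014-0042"   # from the publisher's own system
    ocid = OCID_PREFIX + "-" + internal_reference
    print(ocid)   # ocds-abc123-PROC-2014-0042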

However, recent learning from the project is showing that many organisations are hesitant about the introduction of new IDs, and that adoption of an identifier schema may require as much advocacy as adoption of a standard. At a policy level, bringing some external convention for identifying things into a dataset appears to be seen as affecting the, for want of a better word, sovereignty of a specific dataset: even if, in practice, the prefix approach of the ocid means it only needs to be hard coded in the systems that expose data to the world, not necessarily stored inside organizations’ databases. However, this is an area I suspect we will need to explore more, and keep tracking, as OCDS adoption moves forward.

(5) Bridging communities of practice

If you look closely you might in fact notice that the specification just launched in Costa Rica is actually labelled as a ‘release candidate‘. This points to another key element of learning in the project, concerning the different processes and timelines of policy and technical standardisation. In the world of funded projects and policy processes, deadlines are often fixed, and the project plan has to work backwards from there. In a technical standardisation process, there is no ‘standard’ until a specification is in use, and has been robustly tested. The processes for adopting a policy standard, and setting a technical one, differ – and whilst perhaps we should have spoken from the start of the project of an overall standard, embedding within it a technical specification, we were too far down the path towards the policy launch before this point. As a result, the Release Candidate designation is intended to suggest the specification is ready to draw upon, but that there is still a process to go (and future governance arrangements to be defined) before it can be adopted as a standard per se.

(6) The schema is just the start of it

This leads to the most important point: that launching the schemas and specification is just one part of delivering the standard.

In a recent e-mail conversation with Greg Bloom about elements of standardisation, linked to the development of the Open Referral standard, Greg put forward a list of components that may be involved in delivering a sustainable standards project, including:

  • The specification – with its various components and subcomponents;
  • Tools that assess compliance according to the spec (e.g. validation tools, and more advanced assessment tools);
  • Some means of visualizing a given set of data’s level of compliance;
  • Incentives of some kind (whether positive or negative) for attaining various levels of compliance;
  • Processes for governing all of the above;
  • and of course the community through which all of this emerges and is sustained.

To this we might also add elements like documentation and tutorials, support for publishers, catalysing work with tool builders, guidance for users, and so-on.

Open government standards are not something to be published once, and then left, but require labour to develop and sustain, and involve many social processes as much as technical ones.

In many ways, although we’ve spent a year of small development iterations working towards this OCDS release, the work now is only just getting started, and there are many technical, community and capacity-building challenges ahead for the Open Contracting Partnership and others in the open contracting movement.

Two senses of standard

[Summary: technical standards play a role in both interoperability, and in target-setting for policy.]

I’ve been doing lots of thinking about standardisation recently, particularly as part of work on the Open Contracting Data Standard (feedback invited on the latest draft release…), and thanks to the opportunity to work with Samuel Goëta on a paper around data standards (hopefully out some time next year).

One of the themes I’ve been seeking to explore is how standards play both a technical and a political role, and how standards processes (at least at the level of content standards) can sensitively engage with this. Below is a repost of my earlier contribution to a GitHub thread discussing some of this in the context of Open Contracting.

Two senses of standard

In Open Contracting I believe we’re dealing with two different senses of ‘standard’, and two purposes which we need to keep in balance. Namely:

  • Standards as a basis for interoperability – as in, “their data complies with the standard, and can be used by standards-compliant tools.”
  • Standards as targets – as in, “they have achieved a high standard of disclosure”.

To unpack these a bit:

(Note: the arguments below are predominantly theoretical, and so some of the edge cases considered may not come up at all in practice in the Open Contracting Data Standard, but considering them is a useful exercise to test the intuitions and principles directing our action.)

Standards as interoperability

We’re interested in interoperability in two directions: vertical (can a single dataset be used by other actors and tools in a value-chain of re-use), and horizontal (can two datasets from different publishers be easily analysed alongside one another).

Where data is already published, then the goal should be to achieve the largest possible set of data publishers who can richly represent their data in the standard, and of data users who can draw on data in the standard to meet their needs. This supports the idea that for any element in the standard where (a) data already exists; and (b) use cases already exist; we should be looking for reference implementations to test that data can be rendered in the standard, and that users (or tools they create) can read, analyse and use that data effectively.

However, it is important that in this we look at both horizontal and vertical interoperability in making this judgement. E.g. there could be a country as the sole publisher of a field that is used by 5 different users in their country. This should clearly not be a required field in a standard, but articulating how it is standardised is useful to this community of users (one way to accommodate such cases may be in extensions, although the judgement on whether or not to move something to an extension might come down to whether it is likely that other publishers could be providing this data in future).

In many cases, underlying data from different sources is not perfectly interoperable, or there is a mismatch between the requirements of users, and the requirements of data holders. In these cases, the way a standard is designed affects the distribution of labour between publishers and users with respect to rendering data interoperable. For example, a use case might involve ‘Identifying which different government agencies, each publishing data independently, have contracts with a particular firm’. In this case, a standard could require all publishers, who may store different identifiers in their systems, to map these to a common identifier, or a standard could allow publishers to use whatever identifier they hold, leaving the costs of reconciling these on the user. Making things interoperable can then involve a process of negotiation, and this process may play out differently in different places at different times, leaving certain elements of a standard less stable than others. The concept of ‘designing for the tussle’ (PDF) may be relevant here, thinking about how we can modularise stable (or ‘neutral’) and unstable elements of a standard (this is what the proposed Organisation ID standard does, by having a common way to represent identifiers, but separating this off from the choice of identifier itself, and then allowing for the emergence of a set of third-party tools and validation routines to help manage the tussle).

In seeking to maximise the set of publishers and users interoperable through the standard we need to be critically aware of both short-term and long-term interoperability, as organisations modify their practices in order to be able to publish to, or draw upon, a common standard. We need to balance out a ‘Lowest Common Denominator’ (LCD) or ‘Minimum Viable Product’ (MVP) approach, which means that the majority of publishers can achieve substantial coverage of the standard, with a richer standard that supports the greatest chance of different producer and consumer groups being able to exchange data through the standard.


(Initial attempt to sketch the distinction between maximising the set of common fields across publishers and users, and maximising the set of publishers and users)

Standards as targets

Open Contracting is a political process. The Open Contracting Partnership have articulated a set of Global Principles which set out the sorts of information about contracting that governments and other parties should disclose, and they are working to secure government sign-up to these principles. In policy circles, a standard is often seen as a form of measure, qualitative or quantitative, against which progress towards some policy goal is measured. Some targets might be based on ‘best practice’, others are based on ‘stretch goals’: things which perhaps no-one is yet doing particularly well, but which a community of actors agree are worth aiming for. A standard, whether specified in terms of indicators and measures, or in terms of fields and formats, provides a means of agreeing what meeting the target will look like.

The Open Contracting Principles call for a lot of things which no governments appear to yet be publishing in machine-readable forms. In many cases we’ve not touched the standardisation of these right now (e.g. “Risk assessments, including environmental and social impact assessments”), recognising that standards for these will either exist in different domains that can be linked or embedded into our standard, or that interoperability of such information is hard to achieve and ultimately what is needed for most use cases may be legal text or plain language documents, rather than structured data. However, there may be cases where something is a strong candidate for standardisation, having both the potential to be published (i.e. this is something which evidence suggests governments either do, or could, capture in their existing information systems), and for which clearly articulated use cases exist. In these cases a proposed field-level standard can act as an important target for those seeking to provide this data to move towards. It also acts to challenge unwarranted ‘first mover advantage’, where the first person to publish, even if publishing less than an ideal target would require, gets to set the standard, and instead makes the ‘target’ subject to community discussion.

Clearly any ‘aspirational’ elements should not predominate or make up the majority of a standard if it seeks to effectively support interoperability. But in standards that play a part in policy and political processes (as, in practice, all standards do to some extent; c.f. Lessig), there is a place for a limited number of target elements.

Implications for Open Contracting Data Standard

There are a number of ways we might respond to a recognition of the dual role that standardisation plays in Open Contracting.

Purposes and validation sets

One approach, suggested in the early technical scoping, is to identify different sets of users, or ‘purposes’, for the standard, and for each of these to identify the kinds of fields (subset of the data) these purposes require. As Jeni Tennison’s work on the scoping describes, “…each purpose can have a status (eg proposed vs implemented) and … purposes are only marked as implemented when there are implementations that use the given subset of data for the specified purpose”.

If there are neither purposes requiring a field, nor datasets providing it, then the field would not be suitable for inclusion in the standard. And if a purpose either went unimplemented for a long period, or required a field that no publisher could provide, then careful evaluation would be needed of whether to remove that purpose (or remove that field from the purpose), since purposes give us something against which elements of the standard can be evaluated for their relevance to remain in the model.

Purposes could also be used to validate datasets, identifying how many datasets are fit for which purpose.
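
To sketch what that might look like in practice (a toy example only: the purpose names and required-field lists below are invented, and real purposes would need far more careful definition), a purpose can be treated as a named set of required fields that a dataset either does or does not satisfy:

```python
# Hypothetical sketch: a 'purpose' is a named set of required fields, and a
# dataset (a list of flat records) is fit for that purpose only if every
# record provides all of them. Purpose names and fields are invented.
PURPOSES = {
    "supplier-analysis": {"buyer", "supplier", "value"},
    "timeliness-monitoring": {"buyer", "award_date", "contract_start"},
}

def fit_for(dataset, purpose):
    """Return True if every record in the dataset provides the purpose's fields."""
    required = PURPOSES[purpose]
    return all(required <= set(record) for record in dataset)

dataset = [{"buyer": "Agency A", "supplier": "Firm X", "value": 10000}]
for purpose in PURPOSES:
    print(purpose, fit_for(dataset, purpose))
```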

Stable, ordinary and target elements

We could maintain a distinction in how the standard is described between fields and elements which are ‘stable’ (and thus very unlikely to change), ‘ordinary’ elements (which may have reference implementations, but could change if there was a majority interest amongst those governing the standard in seeing changes), and ‘target’ elements, which may lack any reference implementations, but which are considered useful in helping publishers move towards implementing a political commitment to publish.

Q: Could we build this information into the schema meta-data somehow?

We might need to have quite a long time horizon for keeping target elements provisionally in the standard, and only remove them if there is agreement that no-one is likely to publish to them. However, being able to represent them as visually distinct in the schema, and clearly documenting the distinction, may be valuable.
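
On the question above, one possibility (purely a sketch, using an invented, non-standard annotation key rather than anything agreed) would be to carry the stability status as metadata alongside each field in the schema, so that documentation and tooling could surface the distinction automatically:

```python
# Sketch of carrying stability status in schema metadata. The 'stability'
# key is a custom, non-standard annotation, and the field names are
# illustrative only.
schema = {
    "properties": {
        "contractID":     {"type": "string", "stability": "stable"},
        "description":    {"type": "string", "stability": "ordinary"},
        "riskAssessment": {"type": "string", "stability": "target"},
    }
}

def fields_with_status(schema, status):
    """List the fields in a schema marked with a given stability status."""
    return [name for name, spec in schema["properties"].items()
            if spec.get("stability") == status]

print(fields_with_status(schema, "target"))  # -> ['riskAssessment']
```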

Extensions

Some ‘target’ elements may best belong in extensions, with some process for merging extensions into the core standard if they are widely enough adopted.

Regular implementation monitoring

The IATI Team run a dashboard which tracks the use of particular fields in published data. Doing something similar for Open Contracting would be valuable, and it may even be useful to feed such information into the display of the schema or documentation (or at least to make it easy for publishers and users to look up who is implementing a given property).
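
The underlying calculation need not be complicated. Here is a rough sketch of the kind of field-usage count such a dashboard might rest on (the publisher names and fields are invented, and the real IATI dashboard does considerably more than this):

```python
# Rough sketch: for each field, count how many publishers include it at
# least once in their published records. Publishers and fields are invented.
from collections import Counter

def field_usage(datasets_by_publisher):
    """Count, for each field, how many publishers use it at least once."""
    usage = Counter()
    for records in datasets_by_publisher.values():
        fields_used = set()
        for record in records:
            fields_used.update(record.keys())
        usage.update(fields_used)
    return usage

datasets = {
    "publisher-a": [{"buyer": "X", "value": 100}],
    "publisher-b": [{"buyer": "Y"}],
}
print(field_usage(datasets))  # e.g. Counter({'buyer': 2, 'value': 1})
```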

Implementation schedules

Another approach IATI uses for ‘target elements’ is to ask publishers to prepare ‘Implementation Schedules’ which outline which fields they expect to be able to publish, and by when. This gives an indication of whether there is political will to reach some of the ‘stretch targets’ that might be involved in a standard, and holds out the potential to convene those most likely to publish that data in the near to medium term to define and refine target standardisations.

Discussion

What theoretical writing on standardisation could I be drawing on here?

What experience from other standards could we be drawing upon in Open Contracting and in other standard processes?

Five critical questions for constructing data standards

I’ve been spending a lot of time thinking about processes of standardisation recently (building on the recent IATI Technical Advisory Group meeting, working on two new standards projects, and conversations at today’s MIT Center for Civic Media & Berkman Center meet-up). One of the key strands in that thinking is around how the pragmatics and ethics of standards collide. Building a good standard involves practical choices based on the data that is available, the technologies that might use that data and what they expect, and the feasibility of encouraging parties who might communicate using that standard to adapt their practices (more or less minimally) in order to adopt it. But a standard also has ethical and political consequences, whether it is a standard deep in the Internet stack (as John Morris and Alan Davidson discuss in this paper from 2003[1]), or a standard at the content level, supporting the exchange of information in some specific domain.

The five questions below seek to (in a very provisional sense) capture some of the considerations that might go into an exploration of the ethical dimensions of standard construction[2].

(Thanks to Rodrigo Davies, Catherine D’Ignazio and Willow Brugh for the conversations leading to this post)

For any standard, ask:

Who can use it?

Practically, I mean: who, if data in this standard format was placed in front of them, would be able to do something meaningful with it? Who might want to use it? Are people who could benefit from this data excluded from using it by its complexity?

Many data standards assume that ‘end users’ will access the data through intermediaries (i.e. a non-technical user can only do anything with the data after it has been processed by some intermediary individual or tool) – but not everyone has access to intermediaries, or intermediaries may have their own agendas or understandings of the world that don’t fit with those of the data user.

I’ve recently been exploring whether it’s possible to turn this assumption around, and make simple versions of a data standard the default, with more expressive data models available to those with the skills to transform data into these more structured forms. For example, the Three Sixty Giving standard (warning: very draft/provisional technical docs) is based around a rich data model paired with a simple, flat-as-possible serialisation, which means most of the common forms of analysis someone might want to do with the data can be done in a spreadsheet; for 90%+ of cases, data can be exchanged in flat(ish) forms, with richer structures only used where needed.
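
To illustrate the general principle (this is not the Three Sixty Giving serialisation itself, just a sketch of the flattening idea, with an invented grant record), a nested record can be collapsed into spreadsheet-friendly columns, with dotted column names standing in for the structure:

```python
# Illustrative sketch of flattening a rich, nested record into flat columns.
# The record and its field names are invented for the example.
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

grant = {
    "id": "grant-001",
    "amount": {"value": 5000, "currency": "GBP"},
    "recipient": {"name": "Example Trust"},
}
print(flatten(grant))
# {'id': 'grant-001', 'amount.value': 5000, 'amount.currency': 'GBP',
#  'recipient.name': 'Example Trust'}
```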

What can be expressed?

Standards make choices about what can be expressed, usually at two levels:

  • Field choice
  • Taxonomies / codelists

Both involve making choices about how the world is sliced up, and what sorts of things can be represented and expressed.
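
A toy example of how those two levels bound expression (the fields and codes below are invented for illustration): anything that does not fit an available field, or a code on the list, simply has nowhere to go in the data.

```python
# Toy sketch: a 'standard' defined by a fixed set of fields and a codelist.
# Field names and codes are invented; real codelists are far larger.
FIELDS = {"title", "value", "procurement_method"}
PROCUREMENT_METHOD_CODES = {"open", "selective", "limited"}

def validate(record):
    """Return the parts of a record this toy standard cannot express."""
    problems = []
    for key in record:
        if key not in FIELDS:
            problems.append(f"no field for '{key}'")
    method = record.get("procurement_method")
    if method is not None and method not in PROCUREMENT_METHOD_CODES:
        problems.append(f"'{method}' is not on the codelist")
    return problems

print(validate({
    "title": "Road works",
    "procurement_method": "community-negotiated",
    "local_consultation_notes": "...",
}))
```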

A thought experiment: If I asked people in different social situations an open question inviting them to tell me about the things a standard is intended to be about (e.g. “Tell me about this contract?”) how much of what they report can be captured in the standard? Is it better at capturing the information seen as important to people in certain social positions? Are there ways it could capture information from those in other positions?

What social processes might it replace or disrupt?

Over the short-term, many data standards end up being fed by existing information systems – with data exported and transformed into the standard. However, over time, standards can lead to systems being re-engineered around them. And in shifting the flow of information inside and outside of organisations, standards processes can disrupt and shift patterns of autonomy and power.

Sometimes the ‘inefficient’ processes of information exchange, which open data standards seek to rationalise, can be full of all sorts of tacit information exchange, relationship building and so on, which the introduction of a standard could affect. Thinking about how the technical choices in a standard affect its adoption, and how far they allow for distributed patterns of data generation and management, may be important. (For example, which identifiers in a standard have to be maintained centrally, placing pressure on centralised information systems to maintain the integrity of data, and which can be managed locally, making it easier to create more distributed architectures. It’s not simply a case of what kinds of architectures a standard does or doesn’t allow, but which it makes easier or trickier: in budget-constrained environments, implementations will often go down the path of least resistance, even if it’s theoretically possible to build standard-using tools in ways that better respect the existing structures of an organisation.)

Which fields are descriptive? Which fields are normative?

There has recently been discussion of Facebook’s introduction of a wide range of options for describing Gender, with Jane Fae arguing in the Guardian that, rather than provide a restricted list of options, the field should simply be dropped altogether. Fae’s argument is about the way in which gender categories are used to target ads, and that gender has little value as a category otherwise.

Is it possible to look at a data standard and consider which proposed fields import strong normative worldviews with them? And then to consider omitting these fields?

It may be that, for some fields, silence is a better option than forcing people, organisations or events (or whatever it is that the standard describes) into boxes that don’t make sense for all the individuals/cases covered…

Does it permit dissent?

Catherine D’Ignazio suggested this question. How far does a standard allow itself to be disputed? What consequences are there to breaking the rules of a standard or remixing it to express ideas not envisaged by the original architects? What forms of tussle can the standard accommodate?

This is perhaps even more a question of the ecosystem of tools, validators and other resources around the standard than of the standard specification itself, but these are interrelated.

Footnotes

[1]: I’ve been looking for more recent work on the ‘public interest’ and the politics of standard creation. Academically I spend a lot of time going back to Bowker and Star’s work on ‘infrastructure’, but I’m on the lookout for other works I should be drawing upon in thinking about this.

[2]: I’m talking particularly about open data standards, and standards at the content level, like IATI, Open 311, GTFS etc.