Monthly Archives: August 2014

Exploring Wikidata

[Summary: thinking aloud – brief notes on learning about the Wikidata project, and how it might help in addressing the organisational identifiers problem]

I’ve spent a fascinating day today at the Wikimania Conference at the Barbican in London, mostly following the programme’s ‘data’ track in order to understand the Wikidata project in more depth. This post shares some thinking aloud to capture some learning, reflections and exploration from the day.

As the Wikidata project manager, Lydia Pintscher, framed it, right now access to knowledge on Wikipedia is highly skewed by language. The topics of articles you have access to, the depth of meta-data about them (such as the locations they describe), the detail of those articles, and their likelihood of being up to date, are all greatly affected by the language you speak. Italian or Greek Wikipedia may have great coverage of places in Italy or Greece, but go wider and their coverage drops off. In terms of seeking more equal access to knowledge, this is a problem. However, whilst the encyclopedic narrative of a French, Spanish or Catalan page about the Barbican Centre in London will need to be written by someone in command of that language, many of the basic facts that go into an article are language-neutral, or translatable as small units of content, rather than sentences and paragraphs. The date the building was built, the name of the architect, the current capacity of the building – all the kinds of things which might appear in infoboxes – are all things that could be made available to bootstrap new articles, or that, when changed, could have their changes cascaded across all the different language pages that draw upon them.

That is one of the motivating cases for Wikidata: separating out ‘items’ and their ‘properties’ that might belong in Wikipedia from the pages, making this data re-usable, and using it to build a better encyclopedia.

However, Wikidata is also generating much wider interest – not least because it is taking on a number of problems that many people want to see addressed. These include:

  • Somewhere ‘institutional’ and well governed on the web to put data – and where each data item also gains the advantage of a discussion page.
  • The long-term preservation, and versioning, of data;
  • Providing common identifiers on the web for arbitrary things – and providing URIs for these things that can be looked up (building on the idea of DBPedia as a crystallisation point for the web of linked data);
  • Providing a data model that can cope with change over time, and with data from heterogeneous sources – all of the properties in Wikidata can have qualifiers, such as when the statement is true from, or until, source information, and other provenance data.
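
To make the idea of qualified statements a little more concrete, here is a minimal sketch in Python. It is not Wikidata's actual data format or API: the class names, the `values_at` helper and the sample data are all made up for illustration. It only shows how 'true from' and 'true until' qualifiers let two different values for the same property sit side by side without contradicting each other.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Statement:
    """One claim about an item, with optional qualifiers and a source."""
    prop: str                           # property name, e.g. "capacity"
    value: str
    valid_from: Optional[date] = None   # qualifier: statement is true from
    valid_until: Optional[date] = None  # qualifier: statement is true until
    source: Optional[str] = None        # provenance information


@dataclass
class Item:
    """A thing being described, holding any number of statements."""
    label: str
    statements: List[Statement] = field(default_factory=list)

    def values_at(self, prop: str, when: date) -> List[str]:
        """Return values of `prop` whose qualifiers say they held on `when`."""
        return [
            s.value
            for s in self.statements
            if s.prop == prop
            and (s.valid_from is None or s.valid_from <= when)
            and (s.valid_until is None or when < s.valid_until)
        ]


# Made-up data, purely for illustration.
venue = Item("Example Hall")
venue.statements.append(
    Statement("capacity", "1500", valid_until=date(2000, 1, 1), source="hypothetical survey A")
)
venue.statements.append(
    Statement("capacity", "1900", valid_from=date(2000, 1, 1), source="hypothetical survey B")
)

print(venue.values_at("capacity", date(1995, 6, 1)))  # ['1500']
print(venue.values_at("capacity", date(2010, 6, 1)))  # ['1900']
```

Neither statement has to be deleted when the capacity changes: the older one simply gains an end date, which is what makes this kind of model workable for data that changes over time.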

Wikidata could help address these issues on two levels:

  • By allowing anyone to add items and properties to the central Wikidata instance, and making these available for re-use;
  • By providing an open source software platform for anyone to use in managing their own corpus of wikified, versioned data*.

A particular use case I’m interested in is whether it might help in addressing the perennial Organisational Identifiers problem faced by data standards such as IATI and Open Contracting, where it turns out that having shared identifiers for government agencies, and lots of existing, but non-registered, entities like charities and associations that give and receive funds, is really difficult. Others at Wikimania spoke of potential use cases around maintaining national statistics, and archiving the datasets underlying scientific publications.

However, in thinking about the use cases Wikidata might have, it’s important to keep in mind its current scope:

  • It is a store of ‘items’ and then ‘statements’ about them (essentially a graph store). This is different from being a place to store datasets (as you might want to do with the archival of the dataset used in a scientific paper), and it means that, once created, items are the first class entities of Wikidata, able to exist in multiple collections.
  • It currently inherits Wikipedia’s notability criteria for items. That is, the basic building blocks of Wikidata – the items that can be identified and described, such as the Barbican, Cheese or Government of Grenada – can only be included in the main Wikidata instance if they have a corresponding Wikipedia page in some language Wikipedia (or similar: this requirement is a little more complex).
  • It can be edited by anyone, at any time. That is, systems that rely on the data need to consider what levels of consistency they need. Of course, as Wikipedia has shown, editability is often a great strength – and as Rufus Pollock noted in the ‘data roundtable’ session, updating and versioning of open data are currently big missing parts of our data infrastructures.

Unlike the entirely distributed open world assumption on the web of data, where the AAA assumption holds (Anyone can say Anything about Anything), Wikidata brings both a layer of regulation to the statements that can be made, and the potential of community driven editorial control. It sits somewhere between the controlled description sets of Schema.org, and an entirely open proliferation of items and ontologies to describe them.

Can it help the organisational identifiers problem?

I’ve started to carry out some quick tests to see how far Wikidata might be a resource to help with the aforementioned organisational identifiers problem.

Using Kasper Brandt’s fantastically useful linked data rendering of IATI, I queried for the names of a selection of government and non-government organisations occurring in the International Aid Transparency Initiative data. I then used Open Refine to look up a selection of these on the DBPedia endpoint (which it seems now incorporates Wikidata info as well). This was very rough-and-ready (just searching for full name matches), but by cross-checking negative results (where there were no matches) by searching Wikipedia manually, it’s possible to get a sense of how many organisations might be identifiable within Wikipedia.
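
The matching step above can be sketched offline in a few lines of Python. This is not the Open Refine and DBPedia workflow itself: the function names and the sample names are hypothetical, the list of known labels stands in for whatever a real lookup would return, and the light normalisation (lower-casing, collapsing whitespace) is an assumption layered on top of plain full-name matching.

```python
def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_names(org_names, known_labels):
    """Split org_names into (matched, unmatched) using exact matches on normalised full names."""
    index = {normalise(label): label for label in known_labels}
    matched, unmatched = {}, []
    for name in org_names:
        hit = index.get(normalise(name))
        if hit is not None:
            matched[name] = hit
        else:
            unmatched.append(name)  # candidates for a manual cross-check
    return matched, unmatched


# Hypothetical inputs: names as they might appear in aid data,
# and labels as they might come back from a lookup source.
org_names = ["Department for  Education", "Example Village Association"]
known_labels = ["Department for Education", "Ministry of Finance"]

matched, unmatched = match_names(org_names, known_labels)
print(matched)
print(unmatched)
print(len(matched) / len(org_names))  # share of names with a match
```

The unmatched list is the useful output here: those are the names to search for by hand, since an exact-match miss says little about whether a page exists under a slightly different title.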

So far I’ve only tested the method, and haven’t run a large scale test – but I found that around half of the organisations I checked had a Wikipedia entry of some form, and thus would currently be eligible to be Wikidata items right away. For others, Wikipedia pages would need to be created, and whether or not all the small voluntary organisations that might occur in an IATI or Open Contracting dataset would be notable enough for inclusion is something that would need to be explored more.

Exploring the Wikidata pages for some of the organisations I did find threw up some interesting additional possibilities to help with organisation identifiers. A number of pages were linked to identifiers from Library Authority Files, including VIAF identifiers such as this set of examples returned for a search on Malawi Ministry of Finance. Library Authority Files would tend to only include entries when a government agency has a publication of some form in that library, but at a quick glance coverage seems pretty good.

Now, as Chris Taggart would be quick to point out, neither Wikipedia pages, nor library authority file identifiers, act as a registry of legal entities. They pick out everyday concepts of an organisation, rather than the legally accountable body which enters into contracts. Yet, as they become increasingly backed by data, these identifiers do provide access to look up lots of contextual information that might help in understanding issues like organisational change over time. For example, the Wikipedia page for the UK’s Department for Education includes details on the departments that preceded it. In Wikidata form, a statement like this could even be qualified to say if that relationship of being a preceding department is one that passes legal obligations from one to the other.

I’ve still got to think about this a lot more, but it seems that:

  • There are many things it might be useful to know about organisations, but which are not going to be captured in official registries anytime soon. Some of these things will need to be the subject of discussion, and open to agreement through dialogue. Wikidata, as a trusted shared space with good community governance practices, might be a good place to keep these things, albeit recognising that in its current phase it has no goal of being a comprehensive repository of records about all organisations in the world (and other spaces such as Open Corporates are already solving the comprehensive coverage problem for particular classes of organisation).

  • There are some organisations for which, in many countries, no official registry exists (particularly Government Departments and Agencies). Many of these things are notable (Government Departments for example), and so even if no Wikipedia entry yet exists, one could and should. A project to manage and maintain government agency records and identifiers in Wikidata may be worth exploring.

Whether it would make sense to the open government community to shift from trying to solve some aspects of the organisational identifiers problem by finding an authority to provide master lists, towards developing a distributed, best-efforts community approach, is something yet to be explored.

Notes

*I here acknowledge that SJ Klein’s counsel was that this (encouraging multiple domain specific instances of a Wikidata platform) is potentially a very bad idea, as the ‘forking’ of wiki-projects has rarely been a successful journey: particularly with respect to the sustainability of forked content. As SJ outlined, even though there may be technical and social challenges to a mega graph store, these could be compared to the apparent challenges of making the first encyclopedias (the idea of a 50,000 page book must have seemed crazy at first), or the social challenges envisioned for Wikipedia at its genesis (‘how could non-experts possibly edit an encyclopedia?’). On this view, it is only by setting the ambition of a comprehensive shared store of the world’s propositional data (with the qualifiers that Wikidata supports to make this possible without a closed world assumption) that such limits might be overcome. Perhaps with data there is a greater possibility to support forking, and remerging, of Wikidata instances, permitting short-term pragmatic creation of datasets outside the core Wikidata project, which can later be brought back in if they are considered, as a set, notable (although this still carries risks that forked projects diverge in their values, governance and structure so far that re-connecting later is made prohibitively difficult).

A Data Sharing Disclosure Standard?

[Summary: Iterations on a proposal for a public register of government data sharing arrangements, setting out options for a Data Sharing Disclosure Standard to be used whenever government shares personal data. Draft for interactive comments here (and PDF for those in govt without access to Google Docs)]

At the instigation of the UK Cabinet Office, an open policy making process is currently underway to propose new arrangements for data sharing in government. Data sharing arrangements are distinct from open data, as they may involve the limited exchange of personal and private data between government departments, or outside of government, with a specific purpose of data use in mind.

The idea that new measures are needed is based on a perception that many opportunities to make better use of data for research, addressing debt and fraud, or tailoring the design of public services, are missed because of either legal or practical barriers to data being exchanged or joined up between government departments. Some departments in particular, such as HMRC, require explicit legal permissions to share data, whereas in other departments and public bodies, a range of existing ‘legal gateways’ and powers support exchange of data.

I’ve been following the process from afar, but on Monday last week I had the chance to attend one of the open full-day workshops that Involve are facilitating as part of the open policy making process. This brought together representatives of a range of public bodies, including central government departments and local authorities, with members of the Cabinet Office team leading on data sharing reforms, and a small number of civil society organisations and individuals. Monday’s discussions centred on the introduction of new ‘permissive powers’ for data sharing to support tailored public services. For example, powers that would make it easier for local government to request and obtain HMRC data on 16 – 19 year olds in order to identify which young people in their area were already in employment or training, and so to target their resources on contacting those young people outside employment or training who they have a statutory obligation to support.

The exact wording of such a power, and the safeguards that need to be in place to ensure it is neither too broad, nor open to abuse, are being developed through the open policy making process. One safeguard I believe is important comes from introducing greater transparency into government data sharing arrangements.

A few months back, working with Reuben Binns, I put together a short note on a possible model for an ‘Open Register of Data Sharing’. In Monday’s open policy making meeting, the topic of transparency as an important aspect of tailored public service data sharing came up, and provided an opportunity to discuss many of the ideas that the draft proposal had contained. Through the discussions, however, it became clear that there were a number of extra considerations needed to develop the proposal further, in particular:

  • Noting that public disclosure of planned data sharing was not only beneficial for transparency and scrutiny, but also for efficiency, coordination and consistency of data sharing: by allowing public bodies to pool data sharing arrangements, and to easily replicate approved shares, rather than starting from scratch with every plan and business case.
  • Recognising the concerns of local authorities and other public bodies about a centralised register, and the need to accommodate shares that might take place between public bodies at a local level only, without involvement of central government.
  • Recognising the need for both human and machine-readable information on data sharing arrangements, so that groups with a specific interest in particular data (e.g. associations looking out for the rights of homeless people) could track proposed or enacted arrangements without needing substantial technical know-how.
  • Recognising the importance of documents like Privacy Impact Assessments and Business Cases, but also noting that mandatory publication of these during their drafting could distort the drafting process (with the risk they become more PR documents making the case for a share, than genuine critical assessments), suggesting a mix of proactive and reactive transparency may be needed in practice.

As a result of the discussions with local authorities, government departments and others, I took away a number of ideas about how the proposal could be refined, and so this Friday, at the University of Southampton Web and Internet Science group annual gathering and weekend of projects (known locally as WAISFest), I worked in a stream on personal data, and spent a morning updating the proposals. The result is a reframed draft that, rather than focusing on the Register, focuses on a Data Sharing Disclosure Standard emphasising the key information that needs to be disclosed about each data share, and discussing when disclosure should take place, whilst leaving open a range of options for how this might be technically implemented.

You can find the updated document here, as a Google Doc open to comments. I would really welcome comments and suggestions for how this could be refined further over the coming weeks. If you do leave a comment and want to be credited / want to join in future discussion of this proposal, please also include your name / contact details.

The Gazette provides semantically enriched public notices: readable by humans and machines.

A couple of things of particular note in the draft:

  • It is useful to identify (a) data controllers; (b) datasets; (c) legislation authorising data shares. Right now the Register of Data Controllers seems to provide a good resource for (a), and thanks to recent efforts at building out the digital information infrastructure of the UK, it turns out there are often good URLs that can be used as identifiers for datasets (data.gov.uk lists unpublished datasets from many central government departments) and legislation (through the data-all-the-way-down approach of legislation.gov.uk).
  • It considers how the Gazette might be used as a publication route for Data Sharing Disclosures. The Gazette is an official paper of record, established in 1665 but recently re-envisioned with a semantic publishing platform. Using such a route to publish notices of data sharing has the advantage that it combines the long-term archival of information in a robust source, with making enriched openly licensed data available for re-use. This potentially offers a more robust route to disclosures, in which the data version is a progressive enhancement on top of an information disclosure.
  • Based on feedback from Javier Ruiz, it highlights the importance of flagging when shared data is going to be processed using algorithms that will determine individuals’ eligibility for services or trigger interventions affecting citizens, and raises the question of whether the algorithms themselves should be disclosed as a matter of course.
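
As a rough illustration of what a single disclosure record might look like when it is both human and machine-readable, here is a small Python sketch. It is not the structure set out in the draft: the class, field names and placeholder identifiers are all assumptions made for the example, built only from the elements mentioned above (controllers, datasets, legislation, and a flag for algorithmic processing).

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DataSharingDisclosure:
    """Hypothetical record for one data share; field names are illustrative only."""
    controllers: List[str]       # identifiers for the data controllers involved
    datasets: List[str]          # URLs identifying the datasets being shared
    legislation: List[str]       # URLs of the legislation authorising the share
    purpose: str
    algorithmic_decisions: bool  # True if processing feeds decisions about individuals

    def to_json(self) -> str:
        """Machine-readable form of the disclosure."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def summary(self) -> str:
        """Short human-readable form of the disclosure."""
        flag = " Involves algorithmic decision-making." if self.algorithmic_decisions else ""
        return (
            f"{len(self.controllers)} controllers share {len(self.datasets)} dataset(s) "
            f"for: {self.purpose}.{flag}"
        )


# Placeholder identifiers, not real register entries or URLs.
disclosure = DataSharingDisclosure(
    controllers=["controller:EXAMPLE-A", "controller:EXAMPLE-B"],
    datasets=["https://example.org/dataset/placeholder"],
    legislation=["https://example.org/legislation/placeholder"],
    purpose="identifying young people outside employment or training",
    algorithmic_decisions=False,
)

print(disclosure.summary())
print(disclosure.to_json())
```

Keeping the identifiers as plain URLs or register references means the same record can be published as a notice for people to read, while still being easy to filter by dataset or by controller.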

I’ll be sharing a copy of the draft with the Data Sharing open policy process mailing list, and with the Cabinet Office team working on the data sharing brief. They are working to draft an updated paper on policy options by early September, with a view to a possible White Paper – so comments over the next few weeks are particularly valued.