A recent Open Data Institute (ODI) summit in London featured a number of talks in which a range of stakeholders discussed open data: how important it is, how it unleashes the true potential of data, what it means, what possibilities it offers, and where the future of open data lies. Open data should be accessible to all, and usable and shareable by all; as such, it is a key tool for advancing sustainable development and good governance.
However, despite more data being published in open formats, data scientists, journalists and analysts are often left with the daunting and time-consuming task of not only finding relevant data and discovering new datasets but, most importantly, understanding the data before any analysis can be done. The information needed for this should be found in the metadata that accompanies the published data.
Metadata is, in essence, structured information that makes it easier to retrieve, use or manage an information resource. In practice, metadata describes a dataset and its structure, and helps users discover it. The information usually includes such basic elements as the title, who published the dataset, when it was published, how often it is updated and what license is associated with it. These are classed as ‘descriptive metadata’, as opposed to ‘structural metadata’, which describes, for example, page layout or an object’s components and their relationships (such as chapters or tables in a book).
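To make the descriptive elements listed above concrete, the following minimal sketch represents one dataset record and checks which elements are missing. The field names and the example record are assumptions made for illustration; they are not taken from any particular portal or metadata standard.

```python
# Descriptive metadata elements named in the text.
# The key names themselves are illustrative assumptions, not a standard.
REQUIRED_FIELDS = ("title", "publisher", "issued", "update_frequency", "license")


def missing_fields(record):
    """Return the required descriptive fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical dataset record, invented for illustration only.
example_record = {
    "title": "Example air quality readings",
    "publisher": "Example City Council",
    "issued": "2015-01-01",
    "update_frequency": "monthly",
    "license": "",  # left blank to show an incomplete record
}

print(missing_fields(example_record))  # ['license']
```

A check of this kind is one simple way to measure how completely a record is described before any analysis of the data itself begins.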
Just as the number of open datasets has grown rapidly, so too has the number of open data portals and associated standards. There are currently 521 open data portals listed on Data Portals, a list curated by a group of experts from around the world. Strikingly, 197 are associated with Europe, 100 are registered in the USA, and only 33 in Africa. A simple analysis of the listing reveals that 118 of the portals are classed as Government (or ‘government’), 12 as Community, 5 as Institutional and 6 as Research. In total, 141 open data portals are assigned a publisher classification; the remaining 380 are not. Five of the portals are listed as ‘inactive’.
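The publisher figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text for the Data Portals listing.
total_portals = 521
classified = {
    "Government": 118,
    "Community": 12,
    "Institutional": 5,
    "Research": 6,
}

assigned = sum(classified.values())    # portals with a publisher classification
unassigned = total_portals - assigned  # portals without one
share_assigned = assigned / total_portals

print(assigned, unassigned, f"{share_assigned:.1%}")  # 141 380 27.1%
```

On these figures, roughly 27% of the listed portals carry a publisher classification.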
This information was drawn from the website’s metadata; however, although the metadata is present, only a quarter of it has annotated fields. This can stem from a variety of issues, but one stands out in particular: the data is incomplete because the data portals listed do not provide comprehensive metadata describing their own platforms. Of the listed portals, 385 have metadata associated with them, but only 8 provide links to full metadata downloads and only 12 provide a working API (application programming interface) endpoint.
The above problem, combined with an ever-increasing number of open data portals, raises the question: which metadata standards are used, if any, and which platforms are most commonly used for these portals?
This paper investigates how open data portals share their metadata and explores the most prevalent underlying metadata standards. It seeks to understand the extent to which the metadata standards used by the predominant open data platforms are interoperable. Interoperable metadata enables datasets to be discoverable, reusable and searchable across portals rather than ‘siloed’ within them (searching across portals in this way is called federated search).
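As a rough illustration of why interoperability matters for federated search, the sketch below searches two catalogues that describe datasets with different field names. The portals, records, field names and crosswalks are all hypothetical; real portals would expose their metadata through their own APIs and standards.

```python
# Hypothetical catalogues from two portals that use different field names.
PORTAL_A = [
    {"title": "Road traffic counts", "publisher": "Example Transport Agency"},
]
PORTAL_B = [
    {"name": "Traffic accident records", "organization": "Example Police Service"},
    {"name": "School enrolment", "organization": "Example Education Board"},
]

# Assumed crosswalks from each portal's field names to a shared vocabulary.
CROSSWALKS = {
    "portal_a": {"title": "title", "publisher": "publisher"},
    "portal_b": {"name": "title", "organization": "publisher"},
}


def normalise(record, crosswalk):
    """Rename a record's fields to the shared vocabulary, dropping unmapped ones."""
    return {crosswalk[key]: value for key, value in record.items() if key in crosswalk}


def federated_search(term, catalogues):
    """Return (portal, title) pairs whose normalised title contains the term."""
    term = term.lower()
    hits = []
    for portal, records in catalogues.items():
        for record in records:
            common = normalise(record, CROSSWALKS[portal])
            if term in common.get("title", "").lower():
                hits.append((portal, common["title"]))
    return hits


catalogues = {"portal_a": PORTAL_A, "portal_b": PORTAL_B}
print(federated_search("traffic", catalogues))
# [('portal_a', 'Road traffic counts'), ('portal_b', 'Traffic accident records')]
```

Without a crosswalk, or a shared standard that makes one unnecessary, the query would have to be rewritten for each portal, which is the ‘siloed’ situation described above.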