Open Data Infrastructure Challenge

Working with open data presents a series of challenges from the get-go. Some of these are related to the data analysis process, in particular discovering and wrangling data [1], the design of the platform and API usability issues [2], and, tangentially, the focus and competences available to the project [3]. These challenges are shared with most other data analysis projects that deal with external APIs. There are, however, special concerns in open data projects related to API usability and data wrangling: open data APIs are rarely on par with commercial APIs, and the data is often less well documented and significantly messier than data offered by data broker services and commercial APIs. In this post I discuss issues related to infrastructure and the particularities of specific open data platforms.

The following is based on my years of working with open data projects (i.e. open data service provider initiatives) and developing software components and services on top of open data. This has primarily involved a local Danish open data initiative (ODDK) based on the Open Knowledge Foundation's CKAN platform, the Danish parliament's open data platform (ODFT) based on the OData ecosystem, and the Danish statistical service (DST) based on the PxWeb infrastructure and API. I am not affiliated with any of these projects; my experience comes from developing various software components across different projects. Early in the development, I contributed to and acted as an advisor for the Aarhus CKAN platform (now merged with the national platform).

Usability and analysis challenges

The infrastructure challenges elaborated below constitute one of three categories of challenges in (open) data analysis:

  1. Infrastructure challenge
  2. Technical usability
  3. Data discovery and wrangling

Myers and Stylos [2] discuss API usability from a human-computer interaction perspective. They highlight that considering technical usability when designing an API (and not only performance and data modeling) is crucial to promoting use and avoiding unintended use that might impact API performance. In my experience, open data APIs are not as mature and well-documented as commercial APIs. Of the three platforms, both ODDK and ODFT (CKAN and OData) rely on external documentation of the platform API, which is often reduced to query examples rather than, for instance, examples involving tooling and appropriate frameworks. In the case of DST, the documentation is severely lacking and reduced to an API browser tool. One particularly relevant issue is that repository-like platforms such as CKAN, which are used to host all kinds of data, do not come with any interface (human or API) that introduces the developer to the data models of the hosted data. Rather, this is either provided by the organisation that uploads the data, which is rare, or expected to be "in" the raw data. I suspect that open data projects adopting e.g. CKAN expect technical usability to be a responsibility and concern of OKF, while the data providers expect it to be a feature of the local platform.
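
To make the gap concrete, here is a minimal sketch of querying a CKAN catalogue through its action API (the endpoint and response shape are standard for CKAN 2.x; the base URL is a placeholder, not one of the platforms discussed). Note how everything returned is catalogue metadata; nothing describes the data model inside the resources.

```python
import requests

# Placeholder CKAN instance -- substitute the portal you are targeting.
CKAN_BASE = "https://example-open-data-portal.dk"

def search_packages(query, rows=5):
    """Search a CKAN catalogue via the standard action API."""
    resp = requests.get(
        f"{CKAN_BASE}/api/3/action/package_search",
        params={"q": query, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    if not body.get("success"):
        raise RuntimeError(body.get("error"))
    return body["result"]["results"]

for pkg in search_packages("parkering"):
    # The API returns catalogue metadata (title, notes, resource links),
    # but nothing that documents the data model inside each resource.
    print(pkg["title"], "-", len(pkg.get("resources", [])), "resources")
```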

Kandel et al. [1] describe the data analysis process as consisting of five high-level tasks: discovery, wrangling, profiling, modeling and reporting. While all are present in open data analysis, discovery and wrangling represent particular challenges in open data projects. Unlike data lake approaches, large commercial APIs or internal database-based discovery, finding good open data can be difficult due to the fragmented nature of open data initiatives and the many different platforms. Lists and repositories are rarely comprehensive and contain many links to non-data sources (albeit presented as open data), such as PDFs and websites. The fitness and quality of the data are often not on the level of data collected through internal processes (where one might have access to the people overseeing the collection) or commercial APIs. Open data is rarely documented in sufficient detail (collection frequency, data model, values, types, error rates etc.), and it is often expected that the information about the data is in the data. This poses a significant challenge to wrangling the data for an analysis process, in particular on CKAN-based platforms, as mentioned above. In the case of ODFT and DST, the data is documented.
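
As a small illustration of the wrangling burden, the sketch below downloads a CSV resource and does a first profiling pass with pandas (the resource URL is hypothetical). When the documentation is expected to be "in" the data, inspecting inferred types and missing values is often the only way to learn what a dataset actually contains.

```python
import pandas as pd

# Hypothetical resource URL taken from a catalogue entry.
RESOURCE_URL = "https://example-open-data-portal.dk/dataset/resource.csv"

# Be defensive: delimiter and encoding are rarely documented, so let
# pandas sniff the separator instead of assuming a comma.
df = pd.read_csv(RESOURCE_URL, sep=None, engine="python")

# First profiling pass: which columns exist, which types were inferred,
# and how much is missing?
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head())
print(df.head())
```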

Infrastructure challenges

The three platforms above are all based on widely adopted open data platform software: CKAN seems to be the most popular in government and public sector open data projects, while PxWeb is the most specialised, used for statistical services primarily in the Nordic countries. Each of these platforms has particular issues, which we will return to below. For now, we will focus on the challenges that arise when an organisation appropriates one of these platforms for an open data project.

1) Hosting

All the platforms I have engaged with exhibit issues related to how the platform is hosted. Some of these issues are obvious, while other hosting decisions only become apparent after using the service for some time. For example, the DST API was until recently hosted on a non-HTTPS domain. Aside from the potential security issues, this made it impossible to use the API from HTTPS-based domains when developing visualisations for the web. Another example is that the ODFT platform is hosted on the same domain as the public website of the Danish parliament. This means that the API, which is meant to be queried somewhat frequently, is under the same DDoS protection as a content-based website. Finally, the server hardware (or VM) configuration often means that a) the site recommends not using the API directly in production and b) there is a relatively low limit on the number of records returned. In the case of ODFT, the initial record limit was 20 (increased to 100 upon my request). You can probably imagine what happens if someone tries to fetch 10,000 records from an API that is under DDoS protection.

It fails fast.
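
A defensive client therefore pages through the collection in small batches and throttles itself. The sketch below assumes an OData-style endpoint that honours $top and $skip and returns records under a value key; the base URL and entity name are placeholders.

```python
import time

import requests

# Placeholder OData endpoint with a server-side cap of 100 records per request.
BASE = "https://example-odata-service.dk/api/Sag"
PAGE_SIZE = 100

def fetch_all(max_records=1000):
    """Page through an OData collection politely instead of requesting
    everything at once and tripping the DDoS protection."""
    records, skip = [], 0
    while skip < max_records:
        resp = requests.get(
            BASE,
            params={"$top": PAGE_SIZE, "$skip": skip, "$format": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json().get("value", [])
        if not page:
            break
        records.extend(page)
        skip += PAGE_SIZE
        time.sleep(0.5)  # throttle so the traffic does not resemble an attack
    return records
```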

2) Platform configuration

Open data platform software comes in several versions with different configuration options. The current stable release of CKAN is 2.8.0, while the ODDK platform runs 2.7.5 at the time of writing. OData is working on implementing the OData 4.01 specification, and the API offers no way of determining which OData version an instance is running. This makes it difficult to determine which API documentation to consult without some trial and error; for newcomers to programming and open data, the myriad of guides and Stack Overflow answers can end up being misleading. The same goes for server settings, such as query limits (see above) and enabled plugins. Once one has learned the oddities of one platform, it is still cumbersome to scale applications developed for one instance or platform to include similar or additional data from another.
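
CKAN is the friendlier case here: its status_show action reports the running version, which tells you which documentation to read against. A quick probe might look like the sketch below (placeholder URL); OData instances offer no equivalent, so there one is left to trial and error.

```python
import requests

CKAN_BASE = "https://example-open-data-portal.dk"  # placeholder instance

# status_show is part of the standard CKAN action API and includes the
# version of the running instance in its result.
resp = requests.get(f"{CKAN_BASE}/api/3/action/status_show", timeout=30)
resp.raise_for_status()
print(resp.json()["result"]["ckan_version"])  # e.g. "2.7.5"
```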

3) Documentation and examples

Open data platforms rarely produce API documentation and examples tailored to their platform and local language (e.g. Danish). Examples are often bare query examples, and documentation is a high-level reference to the platform's site. ODFT does provide an overview of the underlying data model, but offers no examples of how to perform joined or sequential queries to obtain the data one needs. While this may not be a crucial infrastructure issue in a strict technical sense, it is a challenge when engaging with a particular platform for the first time, in particular for students and non-professional developers.
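
For illustration, here is roughly what such a missing example could look like: a joined query via OData's $expand option, and the sequential fallback of filtering a relation by key. The base URL, entity and field names below are illustrative stand-ins, not taken from the actual ODFT data model.

```python
import requests

# Placeholder base URL; entity and field names are illustrative stand-ins.
BASE = "https://example-odata-service.dk/api"

def get_json(path, params=None):
    params = {"$format": "json", **(params or {})}
    resp = requests.get(f"{BASE}/{path}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

# One round trip, if the instance's OData version supports $expand:
cases = get_json("Sag", {"$top": 5, "$expand": "SagDokument"})

# The sequential alternative: fetch an entity, then query the relation
# filtered on its key -- the pattern newcomers end up reverse-engineering
# from the data model diagram when no worked examples exist.
case = get_json("Sag", {"$top": 1})["value"][0]
docs = get_json("SagDokument", {"$filter": f"sagid eq {case['id']}"})
print(len(docs["value"]), "related documents")
```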

Additional challenges

There are two additional challenges worth mentioning here. The first is the very loose approach to licensing: a lot of open data is provided unlicensed or under a locally developed license. Not providing open data under a common open data license makes it highly problematic to develop software and applications for public use on top of it. And with no license at all, the data cannot be considered open data in the first place.
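
One way to gauge the scale of the problem is to tally the license field across a catalogue. The sketch below does this for a CKAN instance (placeholder URL); license_id is a standard CKAN package field, and "notspecified" is one of its stock values.

```python
from collections import Counter

import requests

CKAN_BASE = "https://example-open-data-portal.dk"  # placeholder instance

resp = requests.get(
    f"{CKAN_BASE}/api/3/action/package_search",
    params={"rows": 100},
    timeout=30,
)
resp.raise_for_status()

# Tally licenses across the first 100 datasets. Empty values, "notspecified"
# and home-grown licenses all undermine reuse in public applications.
licenses = Counter(
    pkg.get("license_id") or "(none)"
    for pkg in resp.json()["result"]["results"]
)
print(licenses.most_common())
```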

The second issue pertains specifically to CKAN and how this platform is used (in a Danish context). CKAN is used as a repository with a myriad of different datasets provided by multiple organisations (sometimes hosted on other CKAN instances and then linked). Whereas the CKAN platform itself is documented sufficiently (once the instance version is identified), the data is rarely documented in a useful way. This adds a significant overhead to discovering useful data and figuring out what it describes and what it does not include. It can also make open data analysis prone to reinforcing biases and amplifying issues known only to someone else.
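
That overhead shows up already when filtering catalogue entries down to those with machine-readable data attached, a step a sketch like the one below (placeholder URL, format cut-off of my own choosing) has to perform before any analysis can start.

```python
import requests

CKAN_BASE = "https://example-open-data-portal.dk"  # placeholder instance
DATA_FORMATS = {"CSV", "JSON", "GEOJSON", "XLSX", "XML"}  # my own cut-off

resp = requests.get(
    f"{CKAN_BASE}/api/3/action/package_search",
    params={"rows": 100},
    timeout=30,
)
resp.raise_for_status()

for pkg in resp.json()["result"]["results"]:
    formats = {(r.get("format") or "").upper() for r in pkg.get("resources", [])}
    if not formats & DATA_FORMATS:
        # Catalogue entries whose resources are all PDFs, websites or empty.
        print("No machine-readable resource:", pkg["title"])
```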

Some conclusions

The infrastructure challenge is difficult to address, as it is inherently local and changes depending on project and country. Given that Denmark is relatively mature in terms of its IT infrastructure, it is a bit surprising to see these issues arise. It is indicative of several underlying issues that I speculate will be part of most open data efforts. First, the resources and competences allocated to open data projects can be scarce. From what I have observed, open data initiatives are often added to the existing responsibilities of IT departments and, as such, prioritised alongside more urgent tasks and the maintenance of existing systems and infrastructure. Second, open data in Denmark is as much a branding project for the public sector as it is an open data effort. Substantial time and resources are allocated to promoting open data and developing success cases to merit the projects. This means that focusing on the crucial details – infrastructure and usability issues – becomes a secondary concern.

Open data initiatives are often driven by and branded in connection with a huge potential: making society more transparent, enabling local innovation and economic growth, and lowering costs related to data redundancy and maintenance within the public sector. I cannot see this potential as anything but hype until some of the issues above are addressed. If we want to make it possible for non-professional developers and others, such as journalists, educators and NGOs, to start working with open data, we need to look at all three areas from the perspective of people who are not data scientists or software developers.

Not surprisingly, I share Myers and Stylos's [2] argument that these issues are relevant to human-computer interaction. However, the issues also illustrate a need for informing and educating policymakers and open data platform project managers on the usability of their platforms and data.

References

[1] Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2012). Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12), 2917-2926.
[2] Myers, B. A., & Stylos, J. (2016). Improving API usability. Communications of the ACM, 59(6), 62-69.
[3] Choi, J., & Tausczik, Y. (2017, February). Characteristics of collaboration in the emerging practice of open data analysis. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (pp. 835-846). ACM.
CC BY-NC Henrik Korsgaard 2018