OAMI JATS-Con Submission, 2013


This is our original paper proposal, which we sent to Jeff on 5/28/2013.


Inconsistent XML as a barrier to reuse of Open Access Content

Brief description of the paper

In this paper, we will describe the current state of the tagging of articles within the PMC Open Access Subset. As a case study, we will use our experiences developing the Open Access Media Importer (OAMI), a tool that harvests content from the subset and automatically uploads it to Wikimedia Commons.

Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the MIME types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements in particular required us to implement text-mining algorithms in order to determine accurately whether or not specific content was compatible with reuse on Wikimedia Commons.

Besides presenting examples of incorrectly tagged XML from a range of publishers, we will explore past and current efforts towards standardization of license tagging, and we will describe a set of recommendations for content producers on how best to tag such data so that it is compatible with existing standards, internally consistent, and machine-readable.


The Open Access Media Importer is an automated bot, developed under a grant from Wikimedia Deutschland, that harvests multimedia content from open access repositories and uploads it to Wikimedia Commons. Once there, the multimedia files are easy to insert into Wikipedia pages and to reuse on a wide variety of other wikis.

Currently, the OAMI harvests content from the PMC Open Access Subset. Since most articles in the subset are actually not Open Access (in the sense of being free to reuse, revise, remix and redistribute), the bot scans the XML sources of new articles for those that are. It then checks whether these articles include multimedia files among their supplementary materials; if so, the files are downloaded, converted and uploaded to Wikimedia Commons.

We discovered that automatically discerning these licensing terms and reuse rights was non-trivial. In the XML available from PMC, licensing metadata is not expressed consistently across journals, and sometimes not even across different articles of the same journal. In many cases, for example, the URI identifying the Creative Commons license was contradicted by the human-readable license text. In others, the license text contained typographical errors, or was given in the <copyright-statement> element rather than in <license>.

These problems necessitated the development of a fairly sophisticated text-mining module that examined both the machine-readable URI in the <license> element and the human-readable expressions, so as to determine the publisher’s intent. Because Wikimedia Commons strictly enforces proper licensing, it was important that this module be conservative: false positives are much less acceptable than false negatives. The inconsistencies and errors that we found in the existing subset indicate that, under these constraints, a fair amount of content that could legitimately be uploaded to Wikimedia Commons cannot be reliably identified by automated tools like the OAMI. This does a disservice to those who released their material under reuse-friendly licenses, often in the hope of having it used as widely as possible.
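To illustrate the kind of conservative check involved, the following sketch accepts a license only when the machine-readable URI is on a whitelist of known reuse-friendly licenses and the human-readable text does not contradict it. The element and attribute names follow JATS (<license xlink:href="...">); the function name, whitelist contents, and restrictive-phrase patterns are illustrative assumptions, not the OAMI's actual module, which is considerably more sophisticated.

```python
import re
import xml.etree.ElementTree as ET

# xlink:href as ElementTree stores namespaced attributes
XLINK = "{http://www.w3.org/1999/xlink}href"

# Closed set of URIs known to permit reuse on Wikimedia Commons (illustrative).
FREE_LICENSE_URIS = {
    "http://creativecommons.org/licenses/by/3.0/",
    "http://creativecommons.org/licenses/by-sa/3.0/",
}

# Phrases in the human-readable text that contradict a free license.
RESTRICTIVE_MARKERS = re.compile(
    r"non[- ]?commercial|no derivative|all rights reserved", re.IGNORECASE
)

def free_license_uri(article_xml):
    """Return the license URI if the article is safely reusable, else None."""
    root = ET.fromstring(article_xml)
    license_el = root.find(".//license")
    if license_el is None:
        return None  # conservative: no <license> element, no upload
    uri = license_el.get(XLINK, "")
    text = " ".join(license_el.itertext())
    if uri not in FREE_LICENSE_URIS:
        return None  # URI missing or not in the closed set
    if RESTRICTIVE_MARKERS.search(text):
        return None  # human-readable text contradicts the URI
    return uri
```

Note that a URI/text mismatch yields None rather than a guess: given Wikimedia Commons' licensing policy, a contradictory article is simply skipped (a false negative) rather than risked as a false positive.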

There have been a number of efforts to standardize the expression of license metadata (e.g. Creative Commons REL and ONIX-PL), and we welcome these, as well as the ongoing NISO initiative with a similar scope. However, standards only function as such if they are implemented consistently.

In addition to the problems with communicating licensing terms, the MIME types of supplementary materials are frequently indicated incorrectly, and keywords are used inconsistently. Both again reduce discoverability and translate into extra effort before any reuse becomes possible.
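A minimal sketch of the kind of sanity check such MIME tagging forces on reusers: compare the type declared on a JATS <supplementary-material> element (@mimetype and @mime-subtype) with the type guessed from the file extension in @xlink:href. The attribute names are JATS; the function itself is a hypothetical illustration.

```python
import mimetypes
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}href"

def mimetype_mismatch(supp_xml):
    """Return (declared, guessed) when the declared MIME type disagrees
    with the one guessed from the file extension, else None."""
    el = ET.fromstring(supp_xml)
    declared = el.get("mimetype", "")
    subtype = el.get("mime-subtype")
    if subtype:
        declared = declared + "/" + subtype
    guessed, _ = mimetypes.guess_type(el.get(XLINK, ""))
    if guessed and declared and guessed != declared:
        return declared, guessed
    return None
```

A reuser's pipeline can surface such mismatches for manual review instead of trusting the declared type when, for instance, a video file is tagged application/octet-stream.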

We will thus end by discussing some longer-term prospects for enhancing the discoverability of reusable open-access content, combined with practical recommendations for how the current situation might be improved.

Additional information (optional)

  • The PMC Tagging Guidelines should be stricter, especially for articles included in the Open Access Subset:
    • Check license/@xlink:href against a closed set of recognized URIs, and issue a warning if there is no match.
    • Incorporate checks for consistency between human-readable and machine-readable license data.
  • PMC should provide a mechanism for feeding the information derived from this project back into their system (and into the search-by-license functionality in particular), so that others do not have to duplicate the effort.

  • If keywords are supplied (e.g. using the <subject> tag), current practice is to select them for the article as a whole. To increase the discoverability of non-text materials in both the articles and their supplementary materials, it would be desirable to expand this system such that individual components can be tagged separately.
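The two license checks recommended above could be sketched as a small validator that emits warnings rather than rejecting articles outright; the function name, the closed URI set, and the contradiction pattern are hypothetical assumptions, not part of any existing PMC tooling.

```python
import re
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}href"

# Closed set of recognized license URIs (illustrative).
RECOGNIZED_URIS = {
    "http://creativecommons.org/licenses/by/3.0/",
    "http://creativecommons.org/licenses/by-nc/3.0/",
}

def license_warnings(article_xml):
    """Return a list of warnings about the article's <license> element."""
    warnings = []
    license_el = ET.fromstring(article_xml).find(".//license")
    if license_el is None:
        return ["no <license> element"]
    uri = license_el.get(XLINK, "")
    text = " ".join(license_el.itertext()).lower()
    if uri not in RECOGNIZED_URIS:
        warnings.append("license/@xlink:href is not a recognized URI: %r" % uri)
    # Consistency check: a CC BY URI should not come with NC wording.
    if uri.endswith("/by/3.0/") and re.search(r"non[- ]?commercial", text):
        warnings.append("human-readable text mentions non-commercial use, "
                        "but the URI identifies CC BY")
    return warnings
```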


(Names, affiliations, and a short biography for authors).

  • Daniel Mietchen, Wikimedian in Residence on Open Science, Open Knowledge Foundation Germany
    Daniel Mietchen is a biophysicist and a consultant to Open Access publishers on matters of publishing technology, including XML-based workflows.

  • Chris Maloney, software developer at PMC (Contractor with A-Tek, Inc.)
    Chris Maloney is a web developer working for NCBI’s PMC and Bookshelf resources. He has worked with XML technologies for over ten years.

  • Nils Dagsson Moskopp
Nils Dagsson Moskopp is a philosophy student, blogger and programmer with a focus on web development.