Fedora Commons gets semantic with Apache Stanbol

Outside the institutional domain and the public sector, many people following and implementing Apache Stanbol will not be familiar with Fedora Commons or its common usage patterns. In this post we give an overview of the Fedora Commons repository and outline the work underway to integrate it with Apache Stanbol. In the Early Adopter phase, this work makes use of the ontology storage and reasoning facilities offered by the Knowledge Representation and Reasoning (KReS) layer of IKS.

The Fedora Commons Repository – Challenges and Opportunities with IKS

Fedora Commons is an open source digital object repository, licensed under the Apache 2.0 licence. It stores, manages, preserves and provides access to digital content using a quite abstract notion of a digital object. It is used across the world by university, national and public libraries and archives, broadcast organisations and government agencies, amongst others.

Fedora is more a framework than an out-of-the-box content management solution. Fedora adopters tend to expect to invest significant time in repository design and configuration prior to going into production. This is reflected in its typical role as a system capable of meeting preservation demands, and in situations that demand that identifiers for content, or specific versions of content, reliably resolve to exactly the content expected. Provided the appropriate organisational procedures are in place, this lends itself to situations demanding “persistent identifiers”: durable identifiers suitable for embedding into (digital and non-digital) scholarly papers, for example. Here, guarantees are put in place and implemented by Fedora so that when an identifier for a content object is de-referenced (which is not always done via the web, e.g. when using the Handle system), the exact contents expected are returned over a significantly long time frame.

Given the durability of this content and the informational infrastructure built to support it, challenges arise. The technology landscape moves on, as do user expectations around those repository contents surfaced on the web. One such growing area is the use of the semantic web to harmonise the meanings of often carefully curated metadata within repositories, and to connect that metadata to the growing cloud of Linked Data.

There is perhaps some irony here: repository software such as Fedora is frequently used within the digital libraries context. Such situations are backed by decades of investment in stable cataloguing practices to precisely locate key entities within semantically rich taxonomic networks of categories and controlled vocabularies, such as those published by the Library of Congress, Getty and the US National Library of Medicine. Established authorities exist to ensure durability of this kind of reference information, and some are now directly available to semantic web machinery, for instance through the Virtual International Authority File (VIAF).

So the challenges for implementers are these: how can this rich heritage best be put to work for repository users? What are the best ways to implement it, and what are the issues?

It is worth adding one observation at a technical level specific to Fedora. Fedora was designed to meet a number of design goals, one of which was the concept of an “information network overlay”. Digital objects are conceived as being connected in a graph structure, and an important innovation was to include the “Resource Index”, which is implemented using an RDF triple store, to allow location of contents within this structure. So on the one hand, the fact that an RDF triple store is used has given rise to expectations that Fedora is ready to use in a generic semantic web or linked data setting; but on the other, there are design issues in the way Fedora treats this data that first need addressing.

We’re hoping the work underway will both point the way forward for using Apache Stanbol with Fedora, and, through collaboration with other stakeholders in the community, will build a reusable and durable foundation for future work.

As for the first steps taken in the Early Adopters work, the repository content we’re working with currently consists of images catalogued using the VRA metadata format. This format includes controlled vocabulary elements from the Getty ULAN thesaurus, which we’ve converted to SKOS. One of our objectives is to align the semantics of the metadata schema, the SKOS thesaurus and the Fedora content model to provide a rich and consistent discovery experience. We’re including categories from both the metadata and the thesaurus, delivered through inference services provided by KReS.

As for Fedora community interest in the Apache Stanbol work so far, we received considerable interest when we presented at Open Repositories 2011 back in June, and made further contacts at the recent Red Island Repository Institute.

Overview of Fedora Digital Objects

In many ways the Fedora repository is similar to a web CMS in its role of storing and providing access to digital content, but with a greater focus on preservation and flexibility of the content model. Unlike the more usual CMS hierarchical content models, Fedora’s objects are structured as a graph of content nodes.

Fedora provides a specific kind of “resource-oriented” view of a networked and potentially very large repository of content. Contents can be accessed and transformed via services beneath that resource-oriented model. Contents themselves can be textual, audio-visual, or indeed any bitstream content (though textual and audio-visual content are probably the most common). The repository is essentially middleware managing contents that can be physically distributed.

Fedora is very flexible in its possibilities and not very prescriptive about the structural arrangement of objects. Fedora objects are typically a compound aggregation of one or more closely-related content items (datastreams). Datastreams can be of any format, and can be either stored locally within the repository, or stored externally and referenced by the digital object.

The Fedora model makes explicit the difference between a conceptual resource (“the object”) and its bitstream “representation” via datastreams (these are not actually the same thing as “representations” in the web architecture sense, however; for further details see Appendix B of the RIDIR report). An instance of a datastream can be thought of as a manifestation (serialisation) of some digital object or a manifestation of some metadata about the object.

For example, thumbnail and hi-res versions of an image would typically be arranged as datastreams of the same digital object. A metadata record in a certain format or structure (like XML) about the image would also be typically treated as a datastream attached to it.

Lastly, we should state that objects can be “typed” by specifying a “profile” object that defines the pattern of datastreams expected for objects expressing homogeneous content. These are known as Content Model objects.
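To make the object, datastream and Content Model ideas a little more concrete, here is a minimal sketch in Python. It is an illustration only and not Fedora’s actual API: the class names, fields, identifiers such as example:image1 and the datastream IDs are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Datastream:
    """One bitstream attached to a digital object (hypothetical model)."""
    dsid: str                # identifier, unique within its object
    mime_type: str
    location: str            # hypothetical: a local path or an external URL
    external: bool = False   # True when the content is referenced, not stored


@dataclass
class ContentModel:
    """A 'profile' naming the datastreams that conforming objects should carry."""
    pid: str
    required_dsids: frozenset


@dataclass
class DigitalObject:
    """A conceptual resource that aggregates closely related datastreams."""
    pid: str
    datastreams: dict = field(default_factory=dict)
    model: Optional[ContentModel] = None

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.dsid] = ds

    def missing_datastreams(self) -> set:
        """Datastream IDs the content model expects but this object lacks."""
        if self.model is None:
            return set()
        return set(self.model.required_dsids) - set(self.datastreams)


# An image object with a thumbnail and an externally referenced hi-res version.
image_model = ContentModel("example:imageCModel",
                           frozenset({"THUMBNAIL", "HIRES", "VRA"}))
image = DigitalObject("example:image1", model=image_model)
image.add(Datastream("THUMBNAIL", "image/jpeg", "thumb.jpg"))
image.add(Datastream("HIRES", "image/tiff",
                     "http://example.org/hires.tif", external=True))

# The VRA metadata record has not been attached yet, so it is reported missing.
print(sorted(image.missing_datastreams()))
```

The point of the sketch is that the image and its metadata record sit side by side as datastreams of one object, while the content model only describes the expected pattern.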

Fedora’s graph of content includes relationships between nodes representing the conceptual digital objects themselves, between objects and datastreams, and triples expressing properties of both objects and datastreams (Content Model objects are not shown).
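The graph described above can be pictured as a set of subject, predicate, object triples. The following Python sketch builds such triples for one object and filters them by pattern. It is an assumption-laden illustration: the predicate URIs are written in the style of Fedora’s system namespaces but should be checked against the Fedora documentation, the PIDs are invented, and a real deployment would query the Resource Index rather than a Python set.

```python
# Assumed URIs in the style of Fedora's system namespaces (verify before use).
FEDORA = "info:fedora/"
HAS_MODEL = "info:fedora/fedora-system:def/model#hasModel"
DISSEMINATES = "info:fedora/fedora-system:def/view#disseminates"
MIME_TYPE = "info:fedora/fedora-system:def/view#mimeType"
IS_MEMBER_OF = "info:fedora/fedora-system:def/relations-external#isMemberOf"


def object_triples(pid, model_pid, datastreams, member_of=None):
    """Return the triples for one object.

    `datastreams` maps a datastream ID to its MIME type.
    """
    subject = FEDORA + pid
    triples = {(subject, HAS_MODEL, FEDORA + model_pid)}
    for dsid, mime in datastreams.items():
        ds_uri = subject + "/" + dsid
        triples.add((subject, DISSEMINATES, ds_uri))   # object -> datastream
        triples.add((ds_uri, MIME_TYPE, mime))         # datastream property
    if member_of is not None:
        triples.add((subject, IS_MEMBER_OF, FEDORA + member_of))  # object -> object
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


graph = object_triples(
    "example:image1",
    "example:imageCModel",
    {"THUMBNAIL": "image/jpeg", "VRA": "text/xml"},
    member_of="example:collection1",
)

# List the datastreams the object exposes.
for triple in match(graph, p=DISSEMINATES):
    print(triple)
```

Note that every triple here relates Fedora artefacts to each other or to simple properties; nothing in it describes what the image depicts.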

The Integration: Issues and Possibilities

Fedora’s capabilities through its Resource Index are limited (by design) to expressing relationships between Fedora artefacts – objects and datastreams. So things like surfacing rich metadata records into graphs that describe those objects in semantic terms are not addressed, and Fedora does not perform reasoning. But as mentioned before, use of RDF as the implementation has already got the community excited about the possibilities of integrating with semantic web technologies, and both providing and consuming linked open data.

What Apache Stanbol offers is complementary functionality to that of Fedora, whilst the RDF capability in Fedora offers a useful and fairly direct route to integrating information. So what we’re building is a “bridge” between Fedora and Apache Stanbol in a similar manner to the CMS Adapter, though with some differences, as Fedora’s model is already structured in RDF and Fedora is neither a JCR nor a CMIS repository.

Central to the design is that Fedora content (as ontology) is transformed and made available in the IKS Persistence Store. When content in Fedora is updated, the Persistence Store should be updated in line with these changes.
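One way to picture this synchronisation requirement is a mirror that replaces, rather than appends to, the statements held for an object whenever that object changes. The sketch below uses a plain in-memory class as a stand-in; MirrorStore and its methods are hypothetical names for this example and are not the IKS Persistence Store API.

```python
class MirrorStore:
    """In-memory stand-in for a store that mirrors Fedora content (hypothetical)."""

    def __init__(self):
        # subject URI -> set of (predicate, object) statements
        self._by_subject = {}

    def replace(self, subject, statements):
        """Swap in the current statements for a subject, dropping stale ones."""
        self._by_subject[subject] = set(statements)

    def remove(self, subject):
        """Forget a subject entirely, e.g. after the object is purged."""
        self._by_subject.pop(subject, None)

    def statements(self, subject):
        """Return a copy of the statements currently held for a subject."""
        return set(self._by_subject.get(subject, ()))


store = MirrorStore()
subject = "info:fedora/example:image1"

# Initial transformation of the object into statements.
store.replace(subject, {("ex:title", "Portrait"), ("ex:creator", "ex:artist1")})

# The object is edited in Fedora: the creator attribution is withdrawn.
store.replace(subject, {("ex:title", "Portrait")})
print(store.statements(subject))

# The object is purged from Fedora.
store.remove(subject)
print(store.statements(subject))
```

Replacing the whole statement set per object keeps the mirror from accumulating assertions that are no longer true in the repository, at the cost of re-transforming the object on each change.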

To ensure a sustainable integration approach, it is vital that we adopt something that fits well with the way people are already integrating Fedora with other content services, which we’ve seen are increasingly based around messaging architectures. Fedora provides a couple of notification methods that can be used to trigger events when content in Fedora is updated, and thus could be used for updating the IKS Persistence Store. The first requires writing a custom Java “decorator” module, adding this to Fedora’s libraries, and configuring Fedora to make use of it. The second is via Fedora’s existing JMS capabilities. The latter fits in well with patterns that are commonly used in Fedora enterprise integrations and, most importantly, is the most common notification mechanism used when new content is acquired by the repository. It also simplifies compatibility issues around deployment: Fedora is usually deployed in a standard Tomcat servlet container, whereas Apache Stanbol is deployed as OSGi bundles, and the JMS message consumer is deployable within OSGi without placing any requirements on Fedora’s deployment.

However, Fedora’s current JMS messaging module produces messages that are tightly coupled to Fedora’s Java API methods, and in some cases are missing some information that could be useful or needed by message consumers, including for Apache Stanbol. This is a shortcoming that means message consumers must to some extent represent the Fedora digital object model in their own code.
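The coupling described above can be sketched as follows. A consumer that receives messages named after API methods has to carry its own table mapping those method names onto what actually happened to an object or datastream. The message shape here (a dict with 'method', 'pid' and 'dsid' keys) and the event names are assumptions for illustration, and the method names are an illustrative subset that should be checked against the Fedora API documentation.

```python
# Illustrative mapping from API method names to neutral event kinds.
# Each consumer needs a table like this when messages mirror the API.
EVENT_FOR_METHOD = {
    "ingest": "object-created",
    "modifyObject": "object-updated",
    "purgeObject": "object-deleted",
    "addDatastream": "datastream-created",
    "modifyDatastreamByValue": "datastream-updated",
    "modifyDatastreamByReference": "datastream-updated",
    "purgeDatastream": "datastream-deleted",
}


def to_event(message):
    """Translate a method-oriented message into a model-oriented event.

    `message` is a dict with hypothetical keys 'method', 'pid' and,
    for datastream operations, 'dsid'.
    """
    method = message["method"]
    if method not in EVENT_FOR_METHOD:
        raise ValueError("unrecognised API method: " + method)
    return {
        "kind": EVENT_FOR_METHOD[method],
        "pid": message["pid"],
        "dsid": message.get("dsid"),  # None for object-level operations
    }


event = to_event({
    "method": "modifyDatastreamByValue",
    "pid": "example:image1",
    "dsid": "VRA",
})
print(event["kind"], event["pid"], event["dsid"])
```

Two different API methods collapse into the same "datastream-updated" event here, which is the kind of knowledge about the Fedora model that a consumer would otherwise have to encode for itself.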

For these reasons we have invested resources into developing an enhanced Fedora messaging module, and we anticipate that the same module will be re-usable in other, related scenarios such as indexing and workflow services. Reuse will be important in terms of ensuring longer-term sustainability for the integration, and there is good interest already from others in the community in adopting and supporting it. The message consumer component to work against Apache Stanbol is under active development.

As mentioned, KReS is being used during our development for aligning VRA (expressed as an ontology) with SKOS thesaurus concepts, and for classifying digital content items according to the Fedora Resource Index RDF-S schema, with storage in the Ontology Network. We shall then look to apply these in the setting of a user’s browsing session for navigating content items.
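As a rough illustration of the kind of inference that can enrich browsing, the sketch below follows broader links in a small thesaurus so that an item catalogued under a narrow concept can also be found under its ancestors. This is plain Python and not how KReS performs reasoning; the concept identifiers and the dictionary representation are invented for this example. In SKOS terms, skos:broader is not itself transitive, so the closure computed here corresponds to skos:broaderTransitive.

```python
def broader_closure(concept, broader):
    """Return every concept reachable from `concept` via broader links.

    `broader` maps a concept to the set of its directly broader concepts.
    The `seen` set also guards against cycles in the thesaurus.
    """
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def categories_for(item_concepts, broader):
    """Concepts an item can be browsed under: its own plus all ancestors."""
    result = set()
    for concept in item_concepts:
        result.add(concept)
        result |= broader_closure(concept, broader)
    return result


# Hypothetical thesaurus fragment: painters -> artists -> people.
thesaurus = {
    "ex:painters": {"ex:artists"},
    "ex:artists": {"ex:people"},
}

# An image catalogued only under "ex:painters" also surfaces under
# the broader categories.
print(sorted(categories_for({"ex:painters"}, thesaurus)))
```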

We hope this overview gives a flavour of the work underway and the potential in store for Fedora and semantics!

Author: Wernher

Wernher Behrendt is senior researcher at Salzburg Research and the coordinator of the IKS project