Using custom vocabularies with Apache Stanbol

There are at least two major reasons why you may want to shy away from many of the popular text analysis and annotation services in an enterprise setting. Firstly, as hosted services they force you to send your content to a third party. Secondly, they are limited in the target datasets they can link to, typically suggesting links only to named entities available in public glossaries such as Wikipedia. In contrast, Apache Stanbol (incubating) gives you the freedom to work within your own IT environment and with your own business terminology as you see fit.

Apache Stanbol now enables you to upload your own custom vocabulary and use it to annotate unstructured text with related web documents indexed to that vocabulary. The enhancement engine that does this is called the “Keyword Linking Engine”. The engine, together with the “Entity Hub” for managing local terminologies, was designed and written by Rupert Westenthaler. The enhanced content, along with the entities, can then be used in more advanced semantic search applications. In this blog post I will show how to use the enhancement capabilities of Stanbol together with an inline annotation widget to enrich unstructured texts with images.

Enrich texts with images

In my example, I use metadata from the image archive of the Austrian National Library, which has been made available as Europeana Linked Open Data from CKAN (Europeana produces several datasets from various other European archives and museums). The special feature of this dataset is that its main entities are images and photographs from Austrian history, and not just descriptions of entities (persons, places, organisations, concepts) as in Wikipedia or other open data sources.

(1) Create a Solr index out of your custom vocabulary

To start, you need the indexing capabilities of the Stanbol Entity Hub, the component that caches indexes of linked data to be used as targets for the enhancement process. See the Readme, which describes the entire process in detail. For our example, the most important steps are:

  1. First, build the indexing tool itself by running $ mvn assembly:single in the directory genericrdf
  2. Then copy the resulting org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar from the target directory into a working directory and initialize the indexing process with $ java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init
  3. The indexing tool provides configuration options that you may want to use in order to get a proper index of your RDF input.
    1. Adjust the mappings configuration. This file defines which properties will be indexed. In our case, as the main namespaces such as Dublin Core, FOAF etc. are already present, you just need to add a few lines to the file mappings.txt:
      # - – Europeana / Austrian National Library
    2. Provide a name, a short description of the source and licence information, and choose the indexing strategy. In most cases you can simply keep the default values.
  4. Put the RDF source files into the indexing/resources folder and call $ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
As a result of the indexing process, you get an archive of the index and an OSGi bundle ready to be added to Apache Stanbol.
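
The steps above can be strung together in a small shell script. This is only a minimal sketch, not part of the Stanbol tooling: the working directory and the RDF file path are hypothetical placeholders, and it assumes that the same genericrdf jar is used for both the init and the index call. By default the script only prints the commands (dry run), so you can check them before anything is executed.

```shell
#!/bin/sh
# Sketch of the indexing workflow. DRY_RUN=1 (the default) only prints each
# command; run with DRY_RUN=0 from inside the genericrdf directory to execute.
set -eu

DRY_RUN="${DRY_RUN:-1}"
WORKDIR="${WORKDIR:-/tmp/stanbol-indexing}"      # hypothetical working directory
RDF_DUMP="${RDF_DUMP:-/path/to/vocabulary.rdf}"  # hypothetical RDF source (absolute path)
# Left unquoted below on purpose so that the shell expands the version wildcard.
JAR='org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar'

# Print a command and, unless this is a dry run, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

indexing_steps() {
  # 1. Build the indexing tool.
  run mvn assembly:single
  # 2. Copy the assembled jar into the working directory and initialise it.
  run mkdir -p "$WORKDIR"
  run cp target/$JAR "$WORKDIR"/
  run cd "$WORKDIR"
  run java -jar $JAR init
  # 3. Edit the generated configuration (mappings.txt etc.) by hand at this point.
  # 4. Add the RDF source and build the index.
  run cp "$RDF_DUMP" indexing/resources/
  run java -Xmx1024m -jar $JAR index
}

indexing_steps
```

In practice you would stop after the init call, adjust the configuration files, and only then run the last two commands.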

(2) Configure the keyword linking engine to work with your vocabulary

  1. Move the ZIP archive of the index into your {stanbol-root}/sling/datafiles directory.
  2. Install and start the bundle created by the indexing tool at the OSGi console.
  3. Deactivate all other enhancement engines and configure the KeywordLinkingEngine to use the index by specifying the referenced site.
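
Before configuring the engine, it can be useful to check that the new index actually answers queries. The following is a hedged sketch: it assumes that Stanbol runs on localhost:8080, that the Entity Hub exposes each referenced site under /entityhub/site/<id>/find (please verify this against the documentation of your version), and that the site id is "mysite", which is a hypothetical name.

```shell
#!/bin/sh
# Assumptions: Stanbol listens on localhost:8080 and the referenced site was
# registered under the id "mysite" (hypothetical; use the name you configured).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"
SITE_ID="${SITE_ID:-mysite}"

# URL of the label search for one referenced site of the Entity Hub.
find_endpoint() {
  printf '%s/entityhub/site/%s/find\n' "$STANBOL_URL" "$SITE_ID"
}

# Search the site for entities whose label matches $1 (at most $2 results).
find_entity() {
  curl -s -G -H "Accept: application/json" \
       --data-urlencode "name=$1" \
       --data-urlencode "limit=${2:-5}" \
       "$(find_endpoint)"
}

echo "Querying: $(find_endpoint)"
# With a running instance, for example:
#   find_entity "Wien"
```

If the lookup returns no results for a label you know is in your vocabulary, the bundle is probably not active or the index archive was not picked up.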

Configuring the Keyword Linking Engine via the Apache Felix Console

The user interface for configuring the Stanbol Keyword Linking Engine allows you, for example, to choose the target vocabulary, to set the number of suggestions and to restrict the engine to specific languages.

(3) Get (semi-)automatic depiction for articles of your domain

Paste an example text from Wikipedia about the Austrian Civil War of the 1930s into the system (the image library covers this period and region). Use the IKS annotate widget together with Apache Stanbol to get entity annotation suggestions for some occurrences within your text. With the annotate widget, designed and written by Szaby Grünwald, you can select, accept or decline annotations. When you accept an annotation, the entity link is stored as HTML/RDFa in a format readable by both humans and machines.
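
If you want to see the raw suggestions without the widget, you can send the text to the enhancer yourself. This is a minimal sketch under stated assumptions: Stanbol runs on localhost:8080, the enhancer accepts plain text via HTTP POST, and the endpoint path is /enhancer (older builds may expose it under /engines instead). The file name article.txt is a hypothetical placeholder for your example text.

```shell
#!/bin/sh
# Assumptions: host, port and endpoint path depend on your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"
ENHANCER_PATH="${ENHANCER_PATH:-/enhancer}"   # assumption; may be /engines on older builds

enhancer_endpoint() {
  printf '%s%s\n' "$STANBOL_URL" "$ENHANCER_PATH"
}

# POST plain text read from stdin and request the enhancements as RDF/XML.
enhance() {
  curl -s -X POST \
       -H "Content-Type: text/plain" \
       -H "Accept: application/rdf+xml" \
       --data-binary @- \
       "$(enhancer_endpoint)"
}

echo "Endpoint: $(enhancer_endpoint)"
# With a running instance, for example:
#   enhance < article.txt
```

The response contains the same kind of entity suggestions that the annotate widget presents for you to accept or decline.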

For all selected and accepted links, a slightly modified HTML view of annotate.js, built for this showcase, retrieves and displays the relevant images from the image repository. In this example, we retrieved the images directly from the Europeana library.
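
To illustrate how the accepted links can be picked up again for the image lookup, here is a small sketch that pulls the entity URIs out of an RDFa-annotated fragment. The fragment, the example.org URIs and the use of the "about" attribute are assumptions made for illustration; the markup actually written by annotate.js may look different.

```shell
#!/bin/sh
# Hypothetical RDFa fragment; the real markup written by annotate.js may differ.
html='<p>In <span about="http://example.org/entity/vienna" typeof="Place">Vienna</span>, the
<span about="http://example.org/entity/schutzbund" typeof="Organization">Schutzbund</span>
clashed with government forces.</p>'

# Print the unique entity URIs found in "about" attributes, one per line.
extract_entity_uris() {
  grep -o 'about="[^"]*"' | sed 's/^about="//; s/"$//' | sort -u
}

printf '%s\n' "$html" | extract_entity_uris
```

Each extracted URI could then be used to look up the matching image in the repository. A regular expression is enough for a sketch like this; for real pages an HTML or RDFa parser is the safer choice.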

Depiction of unstructured text with images from Europeana.

What could be done better?

What I’ve shown in this example is the ability of Apache Stanbol to easily handle local vocabularies and to use them in the enhancement process. The frontend widgets retrieve this information and support (semi-)automatic annotation of unstructured texts. The example is about the depiction of historical situations, but the system is not restricted to this use. One could also imagine taking a very specific product catalogue and using the engine to create a faceted semantic search over a repository of documents about those products, or using the same engine to classify incoming mails according to enterprise-specific keywords.

Still, some features are missing that would be needed to support more real-world implementations, such as

  • improved multilingual support for both the analysis engines and the frontend interaction widget,
  • better human and visual disambiguation support through the preview of entities,
  • the possibility to switch to an automatic annotation mode with a very high recall rate,
  • a broader connection from the frontend to the datastore in order to easily change views according to the client’s needs.

Try it and provide us with feedback!

The Apache Stanbol engine with a default configuration can be tested at our demo installation. If you want to work with it more intensively, use a software snapshot of the launcher and run Stanbol locally. In order to work with the software, please go directly to the Apache Stanbol (incubating) project and consult the documentation there. The annotate.js widget can be tested here; the source code is available here. What you also need in order to experiment further is an RDF dump of your target vocabulary, that is, the terminology you want to link your text to. If you don’t have such data available for your specific case, you may look for an editor to create a thesaurus of your terms, e.g. the PoolParty Thesaurus Manager (commercial) or TemaTres (open source).

Author: Wernher

Wernher Behrendt is senior researcher at Salzburg Research and the coordinator of the IKS project