This is the documentation page of the UIMA-Stanbol integration project, which I introduced in a blog post on the IKS blog: http://blog.iks-project.eu/uima-apache-stanbol-integration/
For details, read the blog post. The bottom line is that UIMA analysis engines, although a generation older than the Stanbol engines, are well worth integrating into Stanbol. The integration makes many publicly available UIMA analysis engines usable by every Stanbol user, and at the same time gives non-public UIMA deployments (e.g. at companies) a path to start using Stanbol on top of their existing infrastructure.
The use case/user story
The main use case of this project is when a Stanbol admin wants to use a given UIMA Analysis Engine with the least possible modification to either system. In an example user story, the user finds a PEAR file (the standard UIMA package format) on the internet that provides part-of-speech tagging based on a Hidden Markov Model. She is interested in the noun phrases for supporting entity recognition, and she wants to access those noun phrases as RDF triples. Moreover, she is interested in the verb tenses to guess whether a document is a news article (dominated by past tense) or a how-to (very little past tense and lots of imperatives).
On the UIMA architecture
In this section I will discuss only one characteristic of the UIMA framework that is central from the UIMA-Stanbol project’s point of view.
UIMA started as a system that is usable from both C++ and Java (though C++ seems to be less used nowadays), and the first versions were developed before generics were introduced in Java.
Therefore, the flexibility to provide custom annotation types was achieved by generating Java (or C++) code from a type definition.
This is what a simple type definition looks like:
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TutorialTypeSystem</name>
  <description>Type System Definition for the tutorial examples - as of Exercise 1</description>
  <vendor>Apache Software Foundation</vendor>
  <version>1.0</version>
  <types>
    <typeDescription>
      <name>org.apache.uima.tutorial.RoomNumber</name>
      <description></description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>building</name>
          <description>Building containing this room</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
From this definition the RoomNumber and RoomNumber_Type Java classes are generated, and these generated classes are shipped inside the PEAR file. When you run a PEAR file from e.g. the UIMA SimpleServlet, a PEAR loader kicks in and overrides the class loading mechanism in order to load the classes from the PEAR. This custom class loading is what interferes with OSGi.
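To make the structure of such a descriptor more tangible, here is a minimal sketch that reads a type system definition with the JDK's built-in DOM parser and lists the declared types with their features. It does not generate any classes and is not part of UIMA or of this project; the class name TypeSystemLister and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists the types declared in a UIMA type system descriptor. */
final class TypeSystemLister {

    /** Returns one line per type: "typeName extends supertype [feature:range, ...]". */
    static List<String> describeTypes(String descriptorXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the descriptor uses a default namespace
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(descriptorXml.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList types = doc.getElementsByTagNameNS("*", "typeDescription");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                List<String> features = new ArrayList<>();
                NodeList featureNodes = type.getElementsByTagNameNS("*", "featureDescription");
                for (int j = 0; j < featureNodes.getLength(); j++) {
                    Element feature = (Element) featureNodes.item(j);
                    features.add(childText(feature, "name") + ":"
                            + childText(feature, "rangeTypeName"));
                }
                result.add(childText(type, "name") + " extends "
                        + childText(type, "supertypeName") + " " + features);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse type system descriptor", e);
        }
    }

    /** Text of the first direct child element with the given local name, or "" if absent. */
    private static String childText(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && localName.equals(n.getLocalName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String xml =
              "<typeSystemDescription xmlns=\"http://uima.apache.org/resourceSpecifier\">"
            + "<types><typeDescription>"
            + "<name>org.apache.uima.tutorial.RoomNumber</name>"
            + "<supertypeName>uima.tcas.Annotation</supertypeName>"
            + "<features><featureDescription>"
            + "<name>building</name><rangeTypeName>uima.cas.String</rangeTypeName>"
            + "</featureDescription></features>"
            + "</typeDescription></types></typeSystemDescription>";
        for (String line : describeTypes(xml)) {
            System.out.println(line);
        }
    }
}
```

Reading the descriptor generically like this needs no generated classes, which is roughly the idea behind handling annotations without the UIMA libraries on the Stanbol side.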
The class loading problem in general is nicely solved by OSGi, but unfortunately the effort at IBM AlphaWorks to make UIMA OSGi compatible seems to have been discontinued. The current support only extends to OSGi buddies, which are usable in Eclipse but not in Apache Felix (see: http://uima.apache.org/staging/osgi.html).
There are still ways to make a particular UIMA component work in Stanbol, but they involve compiling the UIMA project (which requires the vast UIMA SDK) and also compiling a bundle that provides the necessary classes. This requires knowledge of both frameworks, so this solution falls short of the original use case.
Therefore, I will first concentrate on a solution that involves two Java virtual machines communicating over HTTP. One runs the standard SimpleServlet, which is part of the UIMA architecture and is able to load any PEAR file of a close-enough UIMA version without compilation. The other runs Stanbol with a bundle that also needs no compilation, only configuration and deployment. The name of this bundle is UIMA Remote Client. Besides being easy to get started with, this approach has another big advantage: UIMA can run on a different machine, which might be an endpoint of a larger cluster of annotators, e.g. an Asynchronous Scaleout deployment.
UIMA Remote Client
UIMA Remote Client is a Stanbol Enhancement Engine that accesses a UIMA SimpleServlet over HTTP. The SimpleServlet is able to load any UIMA PEAR file. The Remote Client depends on a library called CasLight (originally developed for the Sztakipedia project), which is a lightweight implementation for handling a core subset of UIMA FeatureStructures (like Annotations) using Java generics, without the need for the UIMA libraries. CasLight is also provided as an OSGi bundle as part of this project.
UIMA Remote Client does not add any RDF triples to the content. Instead, it creates a new, dedicated ContentPart and puts every annotation that comes from the UIMA engines there. This ContentPart can later be processed by other Enhancement Engines. Another Enhancement Engine provided by this project, called UIMA To Triples, can convert UIMA annotations to RDF triples following certain pre-configured rules. However, I expect that in most cases UIMA will play a bigger role as a pre-processor for other Enhancement Engines.
The configuration of UIMA Remote Client looks like this:
The stanbol.enhancer.engine.name.name field is the name of this Stanbol Enhancement Engine, and it is recommended to leave it unchanged. The things to configure are the following:
- UIMA source name + endpoint: In this field you can define where the endpoint of your UIMA analysis engine is. A name for this source must also be provided, for referring to it. The syntax of this field looks like this: sourcename;uri, e.g.
You can define multiple sources that will be called sequentially.
- Content Type URI reference: This will be the Uri Reference of the ContentPart that is created for UIMA Annotations. In this content part a single hu.sztaki.caslight.FeatureSetListHolder will be placed, from which you can access the annotations of every source. Normally there is no reason to change this setting.
- Supported MIME types: Here you can define which MIME types should be accepted for processing by the Remote Client. This depends on the capabilities of your UIMA engines. Most UIMA engines work only on plain text, though.
Once the UIMA Remote Client is configured properly and added to an Enhancement Chain, it will call the endpoints and store all the results in a FeatureSetListHolder object in the Content Part, which you can access by the UriRef configured above. You can use the results in your own Enhancement Engine by retrieving the FeatureSetListHolder (see the link to the javadoc below), or convert them to RDF triples using UIMA To Triples.
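The sourcename;uri syntax of the source setting can be illustrated with a small parser. This is only a sketch of how such entries could be read, not the Remote Client's actual code; the class name UimaSourceConfig, the source names and the localhost URLs are all assumptions made for the example. A LinkedHashMap keeps the configured order, since the sources are called sequentially.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses "sourcename;uri" configuration entries. */
final class UimaSourceConfig {

    /** Returns source name -> endpoint URI, in the order the entries were given. */
    static Map<String, URI> parseSources(List<String> entries) {
        Map<String, URI> sources = new LinkedHashMap<>();
        for (String entry : entries) {
            int sep = entry.indexOf(';');
            String name = sep < 0 ? "" : entry.substring(0, sep).trim();
            String uri = sep < 0 ? "" : entry.substring(sep + 1).trim();
            if (name.isEmpty() || uri.isEmpty()) {
                throw new IllegalArgumentException("Expected sourcename;uri but got: " + entry);
            }
            sources.put(name, URI.create(uri));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Source names and endpoints below are invented for the example.
        Map<String, URI> sources = parseSources(Arrays.asList(
                "postagger;http://localhost:8080/postagger/",
                "stemmer;http://localhost:8080/stemmer/"));
        for (Map.Entry<String, URI> source : sources.entrySet()) {
            System.out.println(source.getKey() + " -> " + source.getValue());
        }
    }
}
```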
Getting Started with PEAR files
If you have a PEAR file that contains a UIMA Analysis Engine or Aggregate Analysis Engine, all you have to do is package it together with a SimpleServlet and start it with Jetty, Tomcat or another servlet container.
- If you have Eclipse, the best way is to follow this guide: http://uima.apache.org/downloads/sandbox/simpleServerUserGuide/simpleServerUserGuide.html
- If you don’t want to use Eclipse, you can unpack any of the WAR files provided by this project (see below) and replace the PEAR file. If you use a different name for the PEAR file, you should adjust the web.xml file accordingly.
The UIMA To Triples Enhancement Engine
The UIMA To Triples Enhancement Engine provides a way to turn UIMA Annotations into RDF triples, which are appended to the result of the content processing. UIMA To Triples can be configured to filter annotations by feature values, and also to translate annotation type names and feature names to custom strings for the RDF output.
Moreover, UIMA To Triples converts UIMA’s coveredText to Stanbol’s selectedText (these hold the same content), and UIMA’s begin and end features to Stanbol’s start and end properties.
UIMA To Triples is useful when there is a straightforward plan for how UIMA Annotations should be turned directly into RDF triples.
The configurable properties are the following:
- stanbol.enhancer.engine.name.name: This is the name of the Enhancement Engine, as Stanbol refers to it. Normally there is no reason to change this setting.
- UIMA Source Names: the names of the UIMA sources to process, as configured in the UIMA Remote Client.
- Content Part URI Reference: the Uri Ref of the Content Part that holds the UIMA Annotations. It should be the same as in the UIMA Remote Client. Normally you can leave this on the default value.
- UIMA Annotations to Process: with this list you can filter which Annotations, and which features of the annotations, you want to convert to RDF. The format is: TypeName;featureName1=featureRegex1;featureName2=featureRegex2… The filters work like this:
- One line in this property list allows one annotation type to be transformed.
- If there are no entries in this list, nothing will be transformed.
- If you list only an annotation type name, e.g. TokenAnnotation, then all the TokenAnnotations will be converted.
- By appending features to the annotation type, e.g. TokenAnnotation;posTag, you specify that only those TokenAnnotations should be converted that have a posTag feature.
- You can further filter the results by providing value patterns (in Java regex syntax) for the features. E.g. TokenAnnotation;posTag=np will allow only those TokenAnnotations that have a posTag feature whose value contains np (noun phrase, by the way). Multiple features are in an AND relation, so TokenAnnotation;posTag=np;lemma=A.* will only allow those annotations that are NPs and whose lemma starts with an A.
- The lines in this property list are in an OR relation. So if you have filtered out a certain TokenAnnotation in the first line, you can add another line that allows it. E.g.
will give you the NPs and those TokenAnnotations whose lemma is ‘not’.
- UIMA type/feature name to RDF name mappings: Syntax: oldName;newName. Here you can provide a mapping according to which names should be translated for the RDF output. E.g. you might want a UIMA posTag feature to appear as sso:posTag. You can give mappings for type names as well as for feature names here.
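The filter semantics described above (AND between the features of one line, OR between lines, nothing passes when the list is empty) can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: the class AnnotationFilter is hypothetical, an annotation is simplified to a type name plus a map of string features rather than a CasLight object, and the regex is applied with a full match (Matcher.matches()). If the engine treats patterns as "contains", as the np example suggests, Matcher.find() would be the corresponding call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the "UIMA Annotations to Process" filter semantics. */
final class AnnotationFilter {

    /** One configuration line: TypeName;feature1=regex1;feature2=regex2... */
    static final class Rule {
        final String typeName;
        // feature name -> value pattern; a null pattern only requires the feature to exist
        final Map<String, Pattern> constraints = new LinkedHashMap<>();

        Rule(String line) {
            String[] parts = line.trim().split(";");
            typeName = parts[0].trim();
            for (int i = 1; i < parts.length; i++) {
                String[] kv = parts[i].split("=", 2);
                constraints.put(kv[0].trim(), kv.length == 2 ? Pattern.compile(kv[1]) : null);
            }
        }

        /** All constraints of a line must hold (AND relation). */
        boolean accepts(String type, Map<String, String> features) {
            if (!typeName.equals(type)) {
                return false;
            }
            for (Map.Entry<String, Pattern> c : constraints.entrySet()) {
                String value = features.get(c.getKey());
                if (value == null) {
                    return false; // feature is missing
                }
                // Assumption: full match; use find() for "contains" semantics.
                if (c.getValue() != null && !c.getValue().matcher(value).matches()) {
                    return false;
                }
            }
            return true;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    AnnotationFilter(List<String> lines) {
        for (String line : lines) {
            rules.add(new Rule(line));
        }
    }

    /** An annotation passes if any line accepts it (OR relation); no lines, no output. */
    boolean accepts(String type, Map<String, String> features) {
        for (Rule rule : rules) {
            if (rule.accepts(type, features)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rule lines inferred from the surrounding text, used here only as an example.
        AnnotationFilter filter = new AnnotationFilter(Arrays.asList(
                "TokenAnnotation;posTag=np",
                "TokenAnnotation;lemma=not"));
        Map<String, String> token = new HashMap<>();
        token.put("posTag", "rb");
        token.put("lemma", "not");
        System.out.println(filter.accepts("TokenAnnotation", token));
    }
}
```

In the example, the token is rejected by the first line because its posTag is not np, but the second line accepts it through its lemma, so the filter as a whole lets it through.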
You can try it out here:
This instance is configured as follows:
- There are three UIMA engines set up on the server: http://pedia2.sztaki.hu:8080/engpostagger/, http://pedia2.sztaki.hu:8080/snowball/, http://pedia2.sztaki.hu:8080/languagerec/ (feel free to try them out manually!)
- The UIMA Remote Client sends the plain text content to the engpostagger and snowball engines.
- UIMA To Triples filters the results and only processes those whose posTag matches n.* (noun types), or those whose lemma is ‘not’.
- TokenAnnotations are translated to Word type TextAnnotations in the output, and posTags are translated to sso:posTags.
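The name translation in the last item follows the oldName;newName mapping syntax. A minimal sketch of such a lookup is shown here, assuming that names without a mapping are passed through unchanged (the text does not say what the engine does in that case). The class RdfNameMapper and the TokenAnnotation;Word entry are invented for illustration; only posTag;sso:posTag is taken from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the "oldName;newName" type/feature name mapping. */
final class RdfNameMapper {

    private final Map<String, String> mappings = new HashMap<>();

    RdfNameMapper(List<String> lines) {
        for (String line : lines) {
            // Split on the first ';' only, so the new name may itself contain ';'
            String[] parts = line.split(";", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Expected oldName;newName but got: " + line);
            }
            mappings.put(parts[0].trim(), parts[1].trim());
        }
    }

    /** Returns the mapped name, or the original name if no mapping exists (assumption). */
    String toRdfName(String uimaName) {
        String mapped = mappings.get(uimaName);
        return mapped != null ? mapped : uimaName;
    }

    public static void main(String[] args) {
        RdfNameMapper mapper = new RdfNameMapper(Arrays.asList(
                "posTag;sso:posTag",      // feature name mapping mentioned in the text
                "TokenAnnotation;Word")); // type name mapping, assumed for the example
        System.out.println(mapper.toRdfName("posTag"));
        System.out.println(mapper.toRdfName("lemma"));
    }
}
```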
You can find the deliverables here:
The location contains:
- Source code
- UIMA Resources: pear files, war files
In August I will run further experiments with the UIMA Local Client solution, which might enable embedding UIMA engines in Stanbol without compilation. I will also provide a book recommender for Stanbol that is based on data from the British National Bibliography and the Open Library.