UIMA-Apache Stanbol Integration

The Unstructured Information Management Architecture (UIMA) is a content analytics framework initiated by IBM Research. Seven or eight years ago, its designers envisaged a system that would analyze every kind of unstructured information – text, rich text, pictures, video, audio, etc. – and enrich it with annotations.

"Objects represented in the Common Analysis Structure (CAS)" - Picture from the UIMA Overview & SDK Setup


They designed a system in which analysis engines share a standard interface and standard descriptors, so that they can be exchanged easily between users. The analysis engines can be organized in chains, even for distributed processing, and are packaged in PEAR files for interchange (later they even experimented with OSGi).
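To make the chaining idea concrete, here is a minimal sketch in plain Java. It is only an illustration of the concept, not the UIMA API: the names EngineChainSketch, Analysis, Engine, TokenEngine and CapitalizedEngine are all hypothetical, and Analysis is a much simplified stand-in for the shared analysis structure. The point is that every engine implements the same interface and reads and writes the same structure, so engines can be swapped or chained freely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the idea only; none of these types are the UIMA API.
final class EngineChainSketch {

    // One annotation: a type label plus the character span it covers.
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "]";
        }
    }

    // Stand-in for the shared analysis structure: the text plus all annotations so far.
    static final class Analysis {
        final String text;
        final List<Annotation> annotations = new ArrayList<Annotation>();

        Analysis(String text) {
            this.text = text;
        }
    }

    // The one interface every engine implements, which is what makes engines interchangeable.
    interface Engine {
        void process(Analysis analysis);
    }

    // Adds a Token annotation for every run of non-whitespace characters.
    static final class TokenEngine implements Engine {
        @Override
        public void process(Analysis analysis) {
            Matcher m = Pattern.compile("\\S+").matcher(analysis.text);
            while (m.find()) {
                analysis.annotations.add(new Annotation("Token", m.start(), m.end()));
            }
        }
    }

    // Builds on earlier output: marks tokens that start with an upper-case letter.
    static final class CapitalizedEngine implements Engine {
        @Override
        public void process(Analysis analysis) {
            // Iterate over a snapshot because new annotations are appended while looping.
            for (Annotation a : new ArrayList<Annotation>(analysis.annotations)) {
                if (a.type.equals("Token") && Character.isUpperCase(analysis.text.charAt(a.begin))) {
                    analysis.annotations.add(new Annotation("Capitalized", a.begin, a.end));
                }
            }
        }
    }

    // Runs the engines in order over one shared Analysis and returns it.
    static Analysis run(String text, Engine... chain) {
        Analysis analysis = new Analysis(text);
        for (Engine engine : chain) {
            engine.process(analysis);
        }
        return analysis;
    }

    public static void main(String[] args) {
        Analysis result = run("Watson played Jeopardy", new TokenEngine(), new CapitalizedEngine());
        for (Annotation a : result.annotations) {
            System.out.println(a + " -> " + result.text.substring(a.begin, a.end));
        }
    }
}
```

The second engine depends only on the annotations left behind by the first, not on the first engine's class, which is roughly what allows engines from different authors to be combined.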

Great idea, isn’t it?

In 2005 the US Government gave substantial support to the project through its Defense Advanced Research Projects Agency (DARPA). This led to the formation of the UIMA Working Group. Under the umbrella of this working group a couple of related technologies were adapted to UIMA (such as OpenNLP and GATE), and numerous new components were created on topics ranging widely from standard natural language processing tasks to bond yield recognition, chemical element annotation, protein and gene annotation, etc. UIMA became an Apache Incubator project in 2006, and as the framework matured a UIMA OASIS standard was created in 2009, the first of its kind.

Since then substantial work has been done on performance improvements and applications. UIMA Asynchronous Scaleout (AS) provides a way to distribute content analysis, as does Behemoth[9], which enables the execution of UIMA tasks on Hadoop clusters. DeepQA applies UIMA AS to question answering and was recently used in the famous Watson project to play the Jeopardy! game.


Watson in Jeopardy! Image credit: Carol Kaelson/Jeopardy Productions Inc., via Associated Press

This summer I will work under the umbrella of the IKS Early Adopter Program on the integration of UIMA with Apache Stanbol.

UIMA certainly has a heavy industry feeling about it. Every time I run UIMA I feel like I am driving a freight train or commanding a cargo ship – not to mention the Scaleout mode, which is rather like launching a space rocket. This is because of the high number of components and concepts, the complicated Eclipse-based tooling and the steep learning curve. Also, the software technology is from a previous generation, when you had to work hard to do things that are much easier today. I’m referring here mainly to the business of generating Java source code from XML during packaging. UIMA came before generics were introduced in Java (moreover, there is a C++ UIMA implementation as well), in the age of CORBA, RMI-IIOP, early SOAP technology that generated code from WSDL, etc.

It is very easy to run away from UIMA after reading the first pages of its tutorial, but there are still some very good annotators implemented in it.

So the plan is to make already written UIMA components usable in Stanbol. Let us imagine a Stanbol administrator or CMS developer who wants to use a UIMA component downloaded from the web or developed in-house. She wants to get annotations from the UIMA component through Stanbol with as little development effort as possible. This means avoiding modification and recompilation of either the UIMA component or Stanbol unless it is necessary. Therefore, the user configures a mapping from the UIMA TypeSystemDescription to RDF triples, and also configures a communication interface between UIMA and Stanbol. As soon as the communication interface and the type mapping are configured and the Stanbol adapter is started, the UIMA component is used to analyze incoming content.
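As a rough illustration of what such a configured type mapping could look like, here is a minimal sketch in plain Java. It deliberately uses neither the UIMA nor the Stanbol API: the classes TypeMappingSketch, SimpleAnnotation and TypeMapping, the type name example.Person and the example.org URIs are all assumptions made up for this example. The idea it shows is that the mapping is pure configuration (UIMA type name to RDF class, feature name to predicate), so supporting a new component means changing the configuration rather than the code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these classes belong to UIMA or Stanbol.
final class TypeMappingSketch {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Simplified stand-in for one UIMA annotation: its type name and feature values.
    static final class SimpleAnnotation {
        final String typeName;
        final Map<String, String> features;

        SimpleAnnotation(String typeName, Map<String, String> features) {
            this.typeName = typeName;
            this.features = features;
        }
    }

    // Configured mapping for one UIMA type: target RDF class plus one predicate per feature.
    static final class TypeMapping {
        final String rdfClass;
        final Map<String, String> featureToPredicate;

        TypeMapping(String rdfClass, Map<String, String> featureToPredicate) {
            this.rdfClass = rdfClass;
            this.featureToPredicate = featureToPredicate;
        }
    }

    // Emits N-Triples lines for one annotation; unmapped types and features are skipped.
    static List<String> toTriples(String subjectUri, SimpleAnnotation annotation,
                                  Map<String, TypeMapping> config) {
        List<String> triples = new ArrayList<String>();
        TypeMapping mapping = config.get(annotation.typeName);
        if (mapping == null) {
            return triples;
        }
        triples.add("<" + subjectUri + "> <" + RDF_TYPE + "> <" + mapping.rdfClass + "> .");
        for (Map.Entry<String, String> feature : annotation.features.entrySet()) {
            String predicate = mapping.featureToPredicate.get(feature.getKey());
            if (predicate != null) {
                triples.add("<" + subjectUri + "> <" + predicate + "> \""
                        + escapeLiteral(feature.getValue()) + "\" .");
            }
        }
        return triples;
    }

    // Escapes backslashes, double quotes and line breaks for use inside an N-Triples literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    public static void main(String[] args) {
        // The configuration a user would supply (made-up type name and URIs).
        Map<String, String> predicates = new LinkedHashMap<String, String>();
        predicates.put("name", "http://example.org/name");
        Map<String, TypeMapping> config = new LinkedHashMap<String, TypeMapping>();
        config.put("example.Person", new TypeMapping("http://example.org/Person", predicates));

        // An annotation as it might come out of a UIMA component.
        Map<String, String> features = new LinkedHashMap<String, String>();
        features.put("name", "Ada");
        SimpleAnnotation annotation = new SimpleAnnotation("example.Person", features);

        for (String triple : toTriples("urn:example:annotation-1", annotation, config)) {
            System.out.println(triple);
        }
    }
}
```

A real adapter would read the type names from the component's TypeSystemDescription and write into Stanbol's enhancement structure instead of printing strings, but the lookup-and-translate step would presumably look much the same.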

Now, there are basically two ways of doing this: using one common Java Virtual Machine, or using two separate ones. The advantage of a single JVM is that one could just deploy an Enhancement Engine as an OSGi bundle, configure the PEAR and the mapping, and then have fun.

However, separate deployments might be necessary because of the class loader tweaks that UIMA makes, and other issues might come up with older Java binaries. Also, separate deployments could even be desirable if someone wants to use UIMA’s distributed deployment capabilities or already has a working system.

In the coming weeks I will work on both embedded and REST-based UIMA-Stanbol integration, and will share the results with you. (For more technical details, see the preliminary discussions on the Stanbol-dev mailing list.)
