Apache Stanbol in 2013

After graduating from the Apache Incubator in 2012, the Apache Stanbol project has established itself as a top-level project at the Apache Software Foundation (ASF). It is time to take a look at the plans for 2013.

The first thing to establish in 2013 is a regular release cycle for all Apache Stanbol components. This is important because it makes it easier for users to rely on stable artifacts without having to check out the sources and compile the system on their own. The project should manage to cut a release at least every four months.

Closely related to releases is the unresolved licensing problem of some natural language processing models. For example, Apache Stanbol uses some models of the Apache OpenNLP project, and it is still unclear which license applies to those models. Making Stanbol more independent of compiled-in model files with problematic licenses is therefore another task for 2013.

Besides these tasks, there is room for improvement. The Entityhub will be updated to use a two-layered storage architecture (STANBOL-704) similar to the one proposed for the Contenthub in STANBOL-471. With this new storage architecture, Apache Stanbol will be able to store any content separately from the different semantic indexes used to access it. This allows users to create new types of views on their content based on different semantic indexes.

Additionally, the new storage architecture for the Entityhub makes it possible to improve entity disambiguation. Rupert Westenthaler reported on the Stanbol developer mailing list about three improvements for the entity disambiguation engine:

  1. It will make it possible to efficiently manage a “Shallow Knowledge Base (KB)” for vocabularies that are managed in the Entityhub or by a managed site (see STANBOL-673) as well, because batch processing with the Entityhub indexing tool will no longer be required to build a good “Shallow KB”.
  2. The separation of “entity store” and “entity index” (this is what “two-layered” means) will also allow several entity indexes for a single entity. This would, for example, make it possible to build special indexes (e.g. a temporal and a spatial index) that cover entities of several or all vocabularies. Those additional indexes could then be used to disambiguate along additional dimensions, which should improve the disambiguation results.
  3. Entity indexes could also collect entity information from different sources. This would make it possible to combine the information available for the entity in the vocabulary with additional information, e.g. mentions of the entity as collected by some feedback service or available via annotated documents in the Contenthub. Disambiguation could then work on “occurrence / mention-based” contexts.
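The separation of “entity store” and “entity index” described in the list above can be illustrated with a small sketch. It is not the Entityhub API and not the design from STANBOL-704: all class names, method names, and the urn:example entity URIs are hypothetical, and a plain "country" property stands in for a real spatial index. The sketch only shows the general idea that each entity is stored once while several independent indexes point at it and can be combined for disambiguation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these types exist in Apache Stanbol.
public class TwoLayerEntityhubSketch {

    /** Layer 1: the entity store, holding each entity exactly once, keyed by URI. */
    static final class EntityStore {
        private final Map<String, Map<String, String>> entities = new HashMap<>();

        void put(String uri, Map<String, String> properties) {
            entities.put(uri, properties);
        }

        Map<String, String> get(String uri) {
            return entities.get(uri);
        }
    }

    /** Layer 2: one of possibly many indexes; this one maps a property value to entity URIs. */
    static final class PropertyIndex {
        private final String property;
        private final Map<String, Set<String>> urisByValue = new HashMap<>();

        PropertyIndex(String property) {
            this.property = property;
        }

        void index(String uri, Map<String, String> properties) {
            String value = properties.get(property);
            if (value != null) {
                urisByValue.computeIfAbsent(normalize(value), k -> new TreeSet<>()).add(uri);
            }
        }

        Set<String> lookup(String value) {
            return urisByValue.getOrDefault(normalize(value), Collections.emptySet());
        }

        private static String normalize(String value) {
            return value.toLowerCase(Locale.ROOT);
        }
    }

    /** Ties both layers together: every stored entity is fed to all registered indexes. */
    static final class Hub {
        final EntityStore store = new EntityStore();
        final Map<String, PropertyIndex> indexes = new LinkedHashMap<>();

        // Simplification: indexes must be registered before entities are added (no backfill).
        void addIndex(String property) {
            indexes.put(property, new PropertyIndex(property));
        }

        void addEntity(String uri, Map<String, String> properties) {
            store.put(uri, properties);
            for (PropertyIndex index : indexes.values()) {
                index.index(uri, properties);
            }
        }
    }

    /** Disambiguates along a second dimension by intersecting the hits of two indexes. */
    static Set<String> disambiguate(Hub hub, String label, String country) {
        Set<String> candidates = new TreeSet<>(hub.indexes.get("label").lookup(label));
        candidates.retainAll(hub.indexes.get("country").lookup(country));
        return candidates;
    }

    /** Builds a tiny demo hub with two entities that share the same label. */
    static Hub demoHub() {
        Hub hub = new Hub();
        hub.addIndex("label");
        hub.addIndex("country");
        hub.addEntity("urn:example:paris-fr", Map.of("label", "Paris", "country", "France"));
        hub.addEntity("urn:example:paris-tx", Map.of("label", "Paris", "country", "United States"));
        return hub;
    }

    public static void main(String[] args) {
        Hub hub = demoHub();
        // The label index alone is ambiguous: both entities are returned.
        System.out.println(hub.indexes.get("label").lookup("Paris"));
        // Adding the second index narrows the candidates to one entity.
        System.out.println(disambiguate(hub, "Paris", "France"));
    }
}
```

In this sketch the label index alone returns both "Paris" entities, and only the intersection with the second index narrows the candidates down. A mention-based index as described in point 3 could be registered in the same way, without touching the stored entities.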

Besides those improvements, Apache Stanbol will have a close look at the technology introduced by the new incubating project Apache Marmotta. With Apache Marmotta, the Linked Data Server of the Linked Media Framework (LMF) becomes Apache software that can be used by Apache Stanbol. Additionally, the LMF already has an Apache Stanbol integration. The Linked Data Server could be an interesting component for the Factstore idea in Apache Stanbol, which never reached a stable state. The Factstore in its current state will be removed from Apache Stanbol, but with Marmotta a new approach may become possible in 2013.

Other improvements that make Apache Stanbol more usable in industrial and production contexts may also be coming in 2013. The Fusepool project announced that it will rely on Apache Stanbol and plans to contribute. We have already seen some discussions on the developer mailing list, and the first contribution for more security-aware coding in Apache Stanbol has already hit the trunk. We will see what the future brings. Exciting times for Stanbol!

Author: Wernher

Wernher Behrendt is a senior researcher at Salzburg Research and the coordinator of the IKS project.
