Lessons Learnt while working with Apache Stanbol

In this post I will summarize the experience I gained while developing various Enhancement Engines for Stanbol in the framework of an EAP project. If you follow this blog, you may have read a lot about the project in posts [1], [2], [3], [4], [5], so I’m not going to repeat it here. Instead, I will share my impressions of Stanbol in general. This post is a slightly extended version of the Lessons Learned section of the EAP validation form [6].

Encountering Stanbol

Overall, Stanbol is a very advanced and fairly large system. It relies on a deep stack of libraries and frameworks, and its development process uses state-of-the-art technology.

Therefore, developing modules for Stanbol requires understanding many technologies at once: Maven, OSGi and Felix with the corresponding Maven plugins, RDF formats and technologies, etc. Getting the first Enhancement Engine (EE) running and its results into the output is not easy. The fact that Stanbol is open source helps a lot, as one can investigate how others have written their EEs. I think that right now only people trained in software engineering or other IT fields can realistically be expected to produce Enhancement Engines.

But there is another group of potential EE authors: linguists, mathematicians, chemists, economists, physicists and, in general, experts of other domains. Today it is not hard to find hacker types in these communities who can turn their ideas into code, but when it comes to industrial software engineering tools they get stuck very easily. The same often applies to junior developers fresh out of IT education. To reach the mass of these potential contributors, I think code templates, Eclipse and NetBeans example projects, or even complete pre-configured virtual machines (à la the DBpedia appliance) should be provided, in which they only have to find the “//your code comes here” part.

Integration with other tools

Apart from UIMA, which I have worked with, there are other established communities in language processing. One is around the Python-based NLTK, another is around the NooJ tool, which is written in .NET. Their technology limits them in many respects when it comes to having nice web front ends, which are becoming more and more important. These communities could really benefit from Stanbol’s capabilities, e.g. its nice web presentation and its connections to many CMSs. In my opinion Stanbol should have a standardized REST interface not only at the “top” (for the CMSs) but also at the “bottom” (towards non-Java Enhancement Engines), similar to the Remote UIMA tool provided by my EAP project. Naturally, in this latter case Stanbol would be the HTTP client. The data format could be some serialization of RDF, similar to the output of Stanbol.

Right now, when integrating a system with Stanbol, the code that adapts the other system must live inside a Stanbol EE. The Zemanta and OpenCalais Engines are good examples: both systems had their own interfaces that were fixed prior to the Stanbol adaptation. These interfaces are used in an EE, and in theory no changes were needed on the other side. So far, so good. But what if I have a system written in language X that has no interface for collaboration with other systems yet? Or what if I just want to develop a brand new system? If only I had a RESTful specification in my hands that my system has to implement in order to be used by Stanbol! Then I could work in my language of choice and on the code base I’m already familiar with. No code changes or development would be necessary on Stanbol’s side, only some configuration.
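To make the idea concrete, here is a minimal sketch of what the engine side of such a “bottom” interface might look like in Python, using only the standard library. Everything in it is an assumption made for illustration: the /enhance path, the JSON field names and the toy enhance function are hypothetical and not part of Stanbol. A real specification would more likely exchange a serialization of RDF, as suggested above; plain JSON is used here only to keep the sketch short.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def enhance(text):
    """Toy engine: report every capitalised word as a candidate entity.

    Field names ("text", "start", "end", "confidence") are illustrative only.
    """
    annotations = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        offset = start + len(word)
        token = word.rstrip(".,;:!?")  # drop trailing punctuation
        if token[:1].isupper():
            annotations.append({
                "text": token,
                "start": start,
                "end": start + len(token),
                "confidence": 0.5,  # fixed dummy score
            })
    return annotations


class EngineHandler(BaseHTTPRequestHandler):
    """Hypothetical contract: POST plain text to /enhance, get JSON back."""

    def do_POST(self):
        if self.path != "/enhance":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(enhance(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8081):
    """Start the engine; the HTTP client (e.g. Stanbol) would call it."""
    HTTPServer(("localhost", port), EngineHandler).serve_forever()
```

Calling serve() starts the engine on localhost. In this picture the only thing left on Stanbol’s side would be configuration: the URL of the remote engine.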

OSGi, Felix and Stanbol

Once the initial hurdles are cleared, working with Stanbol is a really delightful experience. Its modular design, facilitated by OSGi and Felix, makes the development of custom Enhancement Engines really easy. The web interface of Felix is a great help, and the ability to install and remove bundles at run time helps a lot in iterative development. Also, the granularity of the modularization is optimal in Stanbol: in return for a small amount of work spent on a custom module, one gets a great deal of functionality from the system.

For the same reason, Stanbol would be an ideal teaching tool for software engineering students. Once the students have got started, they can write EEs that work immediately, and they can also practice collaboration. I can see myself having students work together on an enhancement chain, everyone contributing their own bundle. It is also ideal for gaining experience with the aforementioned software engineering tools.

Inter-EE communication

One small obstacle is that it is not easy to find out what exactly certain EEs produce. This is important if one wants to develop an EE that relies on the output of another EE. In UIMA, the Type System XML defines rather precisely what comes out of an AE. A similar standardized descriptor might be useful for Stanbol EEs as well. Also, as the number of EEs grows, a repository might be useful, similar to the UIMA Component Repository, which is out of order nowadays but contains many dozens of UIMA modules.
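As a rough illustration of what such a descriptor could enable, the Python sketch below lets each engine declare what it requires and what it produces, and then checks whether a given chain order satisfies every requirement. The descriptor layout and the engine names ("ner", "linker") are hypothetical; Stanbol does not define a descriptor like this, and the type names are used here only as labels.

```python
# Hypothetical descriptors: what each engine needs and what it emits.
DESCRIPTORS = {
    "ner": {"requires": set(), "produces": {"TextAnnotation"}},
    "linker": {"requires": {"TextAnnotation"}, "produces": {"EntityAnnotation"}},
}


def check_chain(chain, descriptors):
    """Return (engine, missing_types) pairs for unmet requirements."""
    available = set()
    problems = []
    for name in chain:
        descriptor = descriptors[name]
        missing = descriptor["requires"] - available
        if missing:
            problems.append((name, sorted(missing)))
        # Whatever this engine emits is visible to all later engines.
        available |= descriptor["produces"]
    return problems
```

With these descriptors, check_chain(["ner", "linker"], DESCRIPTORS) reports no problems, while the reversed order reports that "linker" is missing its input.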

Stanbol on a server

From a production point of view, the stability and performance of Stanbol seem to be good. Also, Felix and its run-time configuration capabilities are a great help in achieving good uptime. However, at some point systems integration tools will be necessary for managing and monitoring: System V init scripts and plugins for Munin, Nagios, Icinga, etc. Also, configurable timeouts for Enhancers in the definition of enhancement chains would be useful to ensure graceful degradation when a sub-system fails.
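The timeout idea can be sketched in a few lines of Python. This is not how Stanbol executes its chains; it only illustrates the behaviour I have in mind: each engine gets a time budget, and an engine that is too slow or fails is skipped while the results of the others are kept. The run_chain function and the two example engines are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def fast_engine(text):
    return [text.upper()]


def slow_engine(text):
    time.sleep(0.5)  # simulates a hanging sub-system
    return ["late"]


def run_chain(text, engines, timeout_s):
    """Run engines in order; skip any that fails or exceeds timeout_s."""
    results, skipped = [], []
    # One worker per engine, so a stuck engine cannot block the next one.
    pool = ThreadPoolExecutor(max_workers=len(engines) or 1)
    try:
        for name, engine in engines:
            future = pool.submit(engine, text)
            try:
                results.extend(future.result(timeout=timeout_s))
            except TimeoutError:
                skipped.append(name)  # too slow: degrade gracefully
            except Exception:
                skipped.append(name)  # engine crashed: degrade gracefully
    finally:
        # Do not wait for stuck engines; their threads finish on their own.
        pool.shutdown(wait=False)
    return results, skipped
```

Note that the timed-out engine is not killed here, only ignored. A production implementation would also have to decide what to do with the thread or connection that is still busy.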

Feedback

For training the Enhancement Engines, a common feedback REST interface could be helpful, through which the CMS could notify Stanbol about which Enhancements were accepted and which were rejected. Per-user, per-project or per-company learning would be a huge added value to any CMS. Olivier Grisel’s work on topic classification [7] shows how useful it is to have ways of training, and I think there should be a callback function in every EE through which it could be notified about what happened to the Enhancements it produced.
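A minimal sketch of such a callback, again in Python and purely hypothetical: the engine keeps accept and reject counts per scope (a user, a project or a company) and turns them into a smoothed confidence for future suggestions. The class name, the on_feedback signature and the smoothing formula are my assumptions; no such callback exists in the Stanbol API.

```python
from collections import defaultdict


class FeedbackAwareEngine:
    """Hypothetical engine that learns from accept/reject notifications."""

    def __init__(self):
        # Keys are (scope, label) pairs, e.g. ("project-a", "Paris").
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def on_feedback(self, scope, label, accepted):
        """Callback invoked when the CMS reports a user's decision."""
        if accepted:
            self.accepted[(scope, label)] += 1
        else:
            self.rejected[(scope, label)] += 1

    def confidence(self, scope, label):
        """Laplace-smoothed acceptance rate; 0.5 when nothing is known."""
        a = self.accepted[(scope, label)]
        r = self.rejected[(scope, label)]
        return (a + 1) / (a + r + 2)
```

Keying the counts by scope is what would make per-user or per-project learning possible: the same suggestion can be trusted in one project and distrusted in another.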

Final thoughts

Finally, it appears to be a strength of the system that most of the state-of-the-art technologies, from DBpedia to Zemanta and OpenCalais, are integrated (this also highlights the significance of the Early Adopter Programme). I expect this to be a strong argument in favor of Stanbol when technology decisions are made in many future projects.

References

[1] Running UIMA Engines in Stanbol using the same JVM: https://blog.iks-project.eu/running-uima-engines-in-stanbol-using-the-same-jvm/

[2] Introducing BookSpotter Enhancement Engine by Sztaki: https://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki

[3] Creating Enhancement Engines for Stanbol 0-10-0 incubating using netbeans 7-1-2: https://blog.iks-project.eu/creating-enhancement-engines-for-stanbol-0-10-0-incubating-using-netbeans-7-1-2/

[4] UIMA-Apache Stanbol Integration (REST): https://blog.iks-project.eu/uima-apache-stanbol-integration-2/

[5] UIMA-Apache Stanbol Integration (Project Introduction): https://blog.iks-project.eu/uima-apache-stanbol-integration/

[6] UIMA Proposal Validation: http://wiki.iks-project.eu/index.php/UIMA_Proposal/validation

[7] Topic Classification: http://dl.dropbox.com/u/5743203/IKS/ReviewMeeting2012/Topic-Classification.pdf
