IKS Blog – The Semantic CMS Community

Apache Stanbol now with multi-language support

Most small and medium-sized European companies manage digital content in multiple languages, so a semantic engine that can understand multiple languages would be an asset. Apache Stanbol (incubating) has made a start: it provides multilingual features for some European languages. These linguistic capabilities depend on those of OpenNLP (Apache Incubator), especially the availability of part-of-speech taggers. The Keyword Linking Engine of Apache Stanbol supports the following languages: English, German, Danish, Swedish, Dutch, and Portuguese. For these languages, OpenNLP models can be built and added to Apache Stanbol as bundles.

Unfortunately, there is currently no specific linguistic support for other languages. Linguistic support is necessary for better recall. For good recall, the choice of the target source is also crucial: if you work with a (big) DBpedia index, you will get the best recall for the major languages and probably fewer results for smaller languages. (Please be aware that the Apache Stanbol default installation works with a small index of only 43k entities and therefore might not provide enough results for you.)

Five steps to set up your own multilingual Apache Stanbol

  1. Run an Apache Stanbol instance on your server
  2. Add language models to your Apache Stanbol instance
  3. Activate language identification and the linking engine
  4. Configure the KeywordLinkingEngine
  5. (Optional: Use your own vocabulary as index)
  6. Run it.

Run Apache Stanbol

Get Apache Stanbol from source and install it on your server. You may also download a pre-built launcher (the “full launcher”). Either way, start it with the full launcher. Alternatively, test the multilingual features on our demo server (http://dev.iks-project.eu:8081/engines).

Build and add the necessary language bundles

To build the language bundles, go to {stanbol-root}/data/ and call mvn clean install -P opennlp. This enables the profile that builds the OpenNLP models for all languages. Afterwards, the bundles are available in the folder {stanbol-root}/data/opennlp/lang/{language}/target. The bundles are named org.apache.stanbol.data.opennlp.lang.{language}-*.jar. Add them to Apache Stanbol via the bundles tab of your OSGi admin console. The language bundles will fetch and install the relevant OpenNLP models for the languages you have selected.
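Once the Maven build has finished, it can be handy to check that a bundle was actually produced for every language. The following is a minimal Python sketch, not part of Stanbol: the helper names are hypothetical, and it assumes that the {language} directories are named after the two-letter codes (en, de, da, sv, nl, pt). Only the folder layout and the file name pattern are taken from the text above.

```python
from pathlib import Path

# Assumption: the {language} directories use two-letter language codes.
LANGUAGES = ["en", "de", "da", "sv", "nl", "pt"]


def find_language_bundles(stanbol_root, languages=LANGUAGES):
    """Map each language code to the bundle jars found in its target folder."""
    root = Path(stanbol_root)
    found = {}
    for lang in languages:
        target = root / "data" / "opennlp" / "lang" / lang / "target"
        pattern = f"org.apache.stanbol.data.opennlp.lang.{lang}-*.jar"
        # A language that was not built has no target folder yet.
        found[lang] = sorted(target.glob(pattern)) if target.is_dir() else []
    return found


def missing_languages(stanbol_root, languages=LANGUAGES):
    """Return the language codes for which no bundle jar was found."""
    bundles = find_language_bundles(stanbol_root, languages)
    return [lang for lang, jars in bundles.items() if not jars]
```

Calling missing_languages("/path/to/stanbol") after the build should return an empty list if all six bundles exist; any code it returns points to a language whose bundle still needs to be built.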

Activate the Language identification engine and the KeywordLinkingEngine

Go to the admin console and deactivate some of the available engines. You should deactivate the standard NER engine and the Entity Linking Engines, as they do not support multiple languages at the moment. At least two engines need to be activated:

  • The Language Identification Engine determines the language of the text you want to enhance and records it as a dc:terms language property.
  • The Keyword Linking Engine analyses the text and provides links to various data sources (enhancements) for multiple languages.

Configure the Keyword Linking Engine

The first option lets you declare the target site you want to use. If you want several targets, run several instances of the engine in parallel. The label field allows you to specify which properties to use for matching with your text. The redirect options are especially important for datasets where, for example, the main entities are redirects from acronyms. The suggestions option specifies how many suggestions you will get. Specify the languages you want to use, or leave the field blank to work with any language. The option to choose a default matching engine helps in cases where no appropriate language label exists for entity matching.

Configuration settings of the Keyword Linking Engine

Use your own (custom) index

If you want to use a custom index, ensure that it contains labels in all the languages you want to work with and that they are properly indexed. Create the index and add it to your Apache Stanbol instance. To do this, copy {yourindex}.solr.zip into the {stanbol-root}/sling/datafiles directory and install the respective OSGi bundle via your admin console. Finally, add a second, new instance of the Keyword Linking Engine to Apache Stanbol and configure it to work with the custom index (see configuration options above).
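The copy step can be scripted if you update the index regularly. This is only a sketch under assumptions: install_index is a hypothetical helper, it creates the datafiles directory if it does not exist yet, and it does not install the OSGi bundle, which still has to be done in the admin console.

```python
import shutil
from pathlib import Path


def install_index(index_zip, stanbol_root):
    """Copy a {yourindex}.solr.zip archive into the Stanbol datafiles folder."""
    source = Path(index_zip)
    if not source.name.endswith(".solr.zip"):
        raise ValueError(f"expected a *.solr.zip archive, got {source.name}")
    datafiles = Path(stanbol_root) / "sling" / "datafiles"
    # Assumption: creating the folder is harmless if Stanbol has not yet done so.
    datafiles.mkdir(parents=True, exist_ok=True)
    # copy2 returns the path of the copied file inside the target folder.
    return Path(shutil.copy2(source, datafiles))
```

The name check is just a guard against copying the wrong file; it says nothing about whether the archive contains a valid index.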

Run it

With Apache Stanbol, you can access the engines directly via its RESTful services, use the engines' web interface, or use the in-text annotation support via enhancer-vie.
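For the RESTful route, a request could look roughly like the Python sketch below. Treat it as an illustration under assumptions rather than the documented API: the endpoint is the demo server URL given earlier (it may not be reachable, and a local instance will use a different host and port), and the text/plain content type and the application/rdf+xml Accept header are assumptions that you should check against the documentation page of your own instance. The function names are hypothetical.

```python
import urllib.request

# Assumption: endpoint of the demo server mentioned earlier in this post.
DEFAULT_ENDPOINT = "http://dev.iks-project.eu:8081/engines"


def build_enhance_request(text, endpoint=DEFAULT_ENDPOINT,
                          accept="application/rdf+xml"):
    """Build (but do not send) a POST request carrying the plain text."""
    return urllib.request.Request(
        endpoint,
        data=text.encode("utf-8"),
        headers={
            # Assumption: the service takes plain text and negotiates
            # the response format via the Accept header.
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )


def enhance(text, **kwargs):
    """Send the text to the enhancer and return the raw response body."""
    request = build_enhance_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect the headers and body before anything goes over the network.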

A simple test

For my simple example, I took a press release about the 10th European Day of Languages, which is available in several translations, and pasted the lead paragraph into the stateless enhancer engine in the following languages: English (en), German (de), Dutch (nl), Danish (da), Portuguese (pt) and Swedish (sv).

Brussels, 23 September 2011 – ‘You live a new life for every new language you speak; if you know only one language, you live only once.’ This Czech proverb is one of the slogans for the 10th European Day of Languages, which will be marked on and around 26 September with events including conferences, quizzes, poetry readings and street games (full list here). The aim is to promote language learning and celebrate Europe’s linguistic diversity, from the 23 ‘official’ languages of the EU to its wealth of co-official, regional and minority languages and dialects. Androulla Vassiliou, European Commissioner for Education, Culture, Multilingualism and Youth, will sign a joint declaration with Thorbjørn Jagland, Secretary General of the Council of Europe, to re-affirm their commitment to multilingualism. The EU’s Polish Presidency has put language learning high on its agenda and is urging young people to learn two languages in addition to their mother-tongue to further their personal and professional goals.

Expected entities

Given Apache Stanbol's capability to detect named entities (persons, places, organisations) as well as common terms (keywords, concepts), one might expect to get the following results from DBpedia:

  1. Brussels (place)
  2. European Day of Languages (event)
  3. Androulla Vassiliou (person)
  4. Thorbjørn Jagland (person)
  5. European Commission(er) for Education, Culture, Multilingualism and Youth (organisation / role)
  6. (Secretary General of) Council of Europe (organisation / role)
  7. EU – European Union / Europe (place / organisation)
  8. several concepts, such as: language, language learning, mother-tongue, multilingualism etc.

I do not expect to get relations such as <person> <role> <organisation> or proper time ranges such as “26. September 2011”, although such enhancements would definitely be useful for describing the whole text as the real-world situation that the press release reports.

Enhancement results for various supported languages

Portuguese text example running in Stanbol


If you run Apache Stanbol with the language identification engine and the keyword linking engine, together with the appropriate language models and a DBpedia index of approx. 4 million entities, you will get positive results for the requested persons and places in all of the supported languages; organisations are not as well supported. In detail: you will get 5 of the 7 most wanted entities for English and Portuguese, 4 of 7 for German and Dutch, and only 3 of 7 for Danish and Swedish. For every language, you will get 10-20 additional terms describing the main topics of this press release, such as “multilinguality”.
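Expressed as recall over the seven expected main entities, these counts come to roughly 71% for English and Portuguese, 57% for German and Dutch, and 43% for Danish and Swedish. The short Python sketch below just does that arithmetic; the counts are the ones reported above, and the function name is made up for illustration.

```python
# Number of the seven expected main entities found per language,
# as reported in the paragraph above.
EXPECTED = 7
FOUND = {"en": 5, "pt": 5, "de": 4, "nl": 4, "da": 3, "sv": 3}


def recall_by_language(found=FOUND, expected=EXPECTED):
    """Return the fraction of expected entities found for each language."""
    return {lang: count / expected for lang, count in found.items()}


if __name__ == "__main__":
    for lang, recall in recall_by_language().items():
        print(f"{lang}: {recall:.0%}")
```

Note that this covers recall only. The test says nothing about precision, because the additional 10-20 terms per language were not judged for correctness.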

In comparison with other named entity recognition engines …

Apache Stanbol provides basic entity recognition and keyword extraction out of the box, at least for some European languages. Nonetheless, a lot of improvements are still required with respect to the recall and precision of the most wanted entities, such as persons, locations and organisations, as well as further entity types (e.g. time) and relations, which are necessary in order to infer “situations” from the detected entities.

A simple test, using the above text with comparable public services, shows:

  • DBpediaSpotlight finds 6 of the 7 main entities and several other terms, in English only; internationalisation is planned.
  • The RDFaEditor, which combines information from several sources (Open Calais, Alchemy, Evri, Ontos, Extractiv, DBpedia), detects 4 of the 7 main entities in the English version and no other terms. It has no support for other languages.
  • Zemanta, a commercial service, finds all 7 entities, in English only. It has no support for other languages.

It would be great if some independent testing and benchmarking could show the real strengths and limits of entity extraction services and software for content management systems. Metrics could include license issues and technological fit, language capabilities and extraction range, along with the classical recall and precision metrics, possibly refined with human expectations tested across domains.

For Apache Stanbol, apart from addressing the weaknesses mentioned above, one major improvement would be to add support for further major European languages such as French, Spanish, Italian and Polish, and for some world languages such as Arabic, Hindi or Mandarin. Contributions to Apache Stanbol and/or Apache OpenNLP to address these limits are very welcome!

