Apache Stanbol – Free and Ready to use Semantic Engine for Content Management

The IKS vision is to bring semantic technologies as open source components to small and medium sized CMS providers. A major contribution of IKS to the CMS space is Apache Stanbol, the open source semantic enhancements engine that is being developed as immediately usable software for existing content management systems.

While traditional metadata services are usually covered by CMSes, Apache Stanbol provides semantic lifting of the textual content: the automatic detection of “Named Entities” such as persons, organisations and locations, and their linking to external sources, e.g. to dbpedia descriptions of resources. The enhancement capability of Apache Stanbol is currently the most mature part of the engine, but the engine framework is not restricted to just this activity.

A RESTful API for content analysis and entity linking

We focus on linking named entities, because simple tagging with words is not enough to overcome ambiguity and complexity of meaning – tags don’t cut it. We acknowledge the huge amount of legacy data and unstructured text around and therefore provide a mechanism for automatic detection of entities and links. Our system works as locally running software: the content does not need to be posted elsewhere but is kept in house. This way, a content management system can make use of its own security and backup configuration. And last, but not least, it’s freely available under a permissive open source license.

Apache Stanbol uses a stateless interface to allow clients to submit content to the Enhancement Engines and get the resulting RDF enhancements at once without storing anything on the server-side. Its main mechanism has been developed especially to cater for the huge variety of running systems from simple web content management to all kinds of enterprise content management implemented in various programming languages and using various frameworks.

Getting Started – A simple example.

Just send the command below from your terminal or paste the simple sentence (or any other plain text) to the web interface of our stable demo installation.

curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
--data "Barack Obama is president of the United States of America." http://dev.iks-project.eu:8080/engines

You will get a response with two Text Enhancements – the entities “Barack Obama (Person)” and “United States of America (Location)” – as well as several so-called Entity Enhancements – links to the corresponding resources in Wikipedia.
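The same request can also be issued from a program instead of the terminal. The following is a minimal sketch in Python that mirrors the curl command above using only the standard library. The function names are made up for illustration, and the demo server address is simply copied from the curl example and may not be reachable.

```python
import urllib.request

# Endpoint copied from the curl example above; availability is not guaranteed.
ENGINES_URL = "http://dev.iks-project.eu:8080/engines"


def build_enhance_request(text, url=ENGINES_URL, accept="text/turtle"):
    """Build (but do not send) a POST request carrying plain text to the engines endpoint."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Accept": accept,  # RDF serialization requested from the server
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )


def enhance(text, url=ENGINES_URL, timeout=30):
    """Send the text and return the response body (the RDF enhancements) as a string."""
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `enhance("Barack Obama is president of the United States of America.")` should return the same Turtle document that curl prints. Because the interface is stateless, each call stands alone and nothing has to be cleaned up on the server afterwards.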

What happened? The natural language processing and recognition of named entities is provided by the first engine, which is powered by OpenNLP. Its stable version works with English texts only and detects the following entity types: Person, Organisation and Location. The next engine takes this as input and provides possible links to the resources at dbpedia.
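Since the linking engine offers possible links rather than a single answer, a client usually has to choose among several candidates per detected mention. The sketch below shows one way to do that, assuming each candidate carries a confidence score. The dictionary keys and the threshold are hypothetical and chosen for readability; they are not Stanbol's actual RDF vocabulary, so a real client would first map the returned graph into a structure like this.

```python
from collections import defaultdict


def best_links(mentions, candidates, min_confidence=0.5):
    """Pick the highest-confidence candidate URI for each detected mention.

    mentions:   list of dicts with hypothetical keys "id", "text", "type"
    candidates: list of dicts with hypothetical keys "mention_id", "uri", "confidence"
    """
    # Group the candidate links by the mention they belong to.
    by_mention = defaultdict(list)
    for candidate in candidates:
        by_mention[candidate["mention_id"]].append(candidate)

    result = {}
    for mention in mentions:
        # Drop weak suggestions, then keep the strongest remaining one.
        options = [
            c for c in by_mention[mention["id"]]
            if c["confidence"] >= min_confidence
        ]
        if options:
            best = max(options, key=lambda c: c["confidence"])
            result[mention["text"]] = best["uri"]
    return result
```

Mentions without any candidate above the threshold are simply left out of the result, so the calling code can decide whether to show them unlinked or to ignore them.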

How structured description and linking of content works.

The resulting RDF graph is the nugget you get for every text from Apache Stanbol. Here, the magic starts (with your help)!

  • You may simply use the links and show them beside your text,
  • you can use the entities to describe your entire repository and to search or browse it, if you store all enhancements,
  • you may also fetch additional information about these entities from their source, retrieving images of entities, further links etc.,
  • you may use the entity information to annotate your html with structured information, e.g. rich snippets descriptions, for better findability.
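The last option, annotating HTML with structured information, can be sketched as follows. This is an illustration under stated assumptions, not the way any of the integrations mentioned here actually work: it assumes you already have character offsets, an entity URI and a type URI for each mention (the tuple layout is invented for this example) and it wraps each mention in a span carrying the RDFa attributes `about` and `typeof`.

```python
import html


def annotate_html(text, mentions):
    """Wrap each mention in a <span> with RDFa 'about' and 'typeof' attributes.

    mentions: list of (start, end, uri, rdf_type) tuples holding character
              offsets into `text`; the spans must not overlap.
    """
    parts = []
    cursor = 0
    for start, end, uri, rdf_type in sorted(mentions):
        # Plain text between the previous mention and this one.
        parts.append(html.escape(text[cursor:start]))
        # The mention itself, wrapped in an annotated span.
        parts.append(
            '<span about="{}" typeof="{}">{}</span>'.format(
                html.escape(uri, quote=True),
                html.escape(rdf_type, quote=True),
                html.escape(text[start:end]),
            )
        )
        cursor = end
    # Remaining text after the last mention.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

All text and attribute values are escaped, so the function can be applied to plain text that contains characters such as `<` or `&`. Input that is already HTML would need a proper HTML parser instead of offset arithmetic.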

For all these cases, early adopters from the IKS consortium and the CMS community have already implemented solutions for their content management systems – see e.g. Nuxeo, and several early adopters for Alfresco, Drupal, Plone and other systems.

Try its basic features and create advanced demos.

Stanbol is an incubating Apache project, but its basic features are already stable enough to be used and showcased to customers of a CMS. In order to use it, set up your own installation from the source code. Follow the instructions of our technical documentation and connect the services to your CMS.

What do you get to play with? If you just want to use the basic features, then use the stable launcher to start your engine. Then you will get just the two most important engines – one for detecting Named Entities, one for linking to dbpedia together with a prepared index of 43k entities, so that you are not dependent on the availability of the dbpedia services. You can find a simple demo system here.

If you want to try the full power including all available engines, you need to use the full launcher. An advanced demo shows its components. You will get pre-processing engines for detecting the language of your content and the ability to work with several document formats (doc, pdf …). In the current configuration, it detects named entities from dbpedia and concepts from gemet, an environmental thesaurus.

You may use the engines endpoint for simply submitting text, or you may want to use the enhancerVIE for getting inline suggestions and RDFa annotation of the HTML content. While the “engines” and the “entityhub” are stable components, several experimental components are available, too. The “contenthub” is an experimental version of a stateful engine, which stores content and annotations. The “factstore” stores not only the entities but also the relations between them. The “ontonet” is dedicated to ontology management, and the “rules” component is used to refactor results from the engines in order to create special output formats (e.g. Google rich snippets). The SPARQL endpoint allows queries to the embedded triple store.
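To give an idea of what querying the triple store might look like, here is a minimal sketch that builds a query URL. It assumes the endpoint follows the standard SPARQL protocol, where a GET request carries the query text in a `query` parameter. The endpoint address below is a placeholder, not a documented Stanbol path, so check your own installation for the real location.

```python
import urllib.parse


def sparql_query_url(endpoint, query):
    """Return a GET URL that carries the query in the standard 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


# Hypothetical endpoint location; replace it with the one your installation exposes.
ENDPOINT = "http://localhost:8080/sparql"

# Generic query listing a few triples; it assumes nothing about the stored vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_query_url(ENDPOINT, QUERY)
```

The resulting URL can be opened with any HTTP client. The generic triple pattern is a convenient first query because it works regardless of which vocabularies the store contains.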

For detecting named entities you may use the NER engine together with one or more Entity Linking Engines in order to link not only to dbpedia, but other public resources such as geonames. You can configure every linking engine to point to a different site as target. This way you can make use of any publicly available corpus of entities defined in RDF.

Not just dbpedia, but also local vocabularies are supported.

A recent development – the Taxonomy Engine – makes it possible to create your own index on the basis of arbitrary RDF data available in your enterprise – ranging from a product catalogue to your CRM data. With such private indexes (“entity hubs”) installed, you can analyse documents and immediately find the links to the corresponding entities. It takes a developer a few hours to create such indexes for use by Apache Stanbol. With these links in place, you can then create advanced unified search solutions over disparate document spaces.
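The idea behind such a private index can be illustrated with a toy example. The sketch below is not how Apache Stanbol builds or queries its indexes; it only shows the principle of turning labelled RDF resources into a lookup table and matching it against a document. The use of `rdfs:label` as the label property and the naive substring matching are assumptions made for brevity.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def build_label_index(triples, label_predicate=RDFS_LABEL):
    """Map each lower-cased label to the set of subject URIs that carry it.

    triples: iterable of (subject, predicate, object) string tuples.
    """
    index = {}
    for subject, predicate, obj in triples:
        if predicate == label_predicate:
            index.setdefault(obj.lower(), set()).add(subject)
    return index


def find_links(text, index):
    """Return {label: sorted URIs} for every indexed label that occurs in the text.

    Matching is a plain case-insensitive substring test, so it ignores word
    boundaries and inflected forms.
    """
    lowered = text.lower()
    return {
        label: sorted(uris)
        for label, uris in index.items()
        if label in lowered
    }
```

Fed with triples from, say, a product catalogue, `find_links` returns the catalogue URIs whose labels appear in a document. A production index additionally has to cope with tokenization, multiple languages and ambiguous labels, which is what the real engine and its index are for.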

Categorization of documents

What about categorization of documents? This is a planned feature. The existing engines need to be reimplemented to use the entity hub index and to build predefined topic indexes out of the dbpedia SKOS hierarchy and the full text of the related articles (to be able to perform similarity queries using the MoreLikeThis feature of Solr). We also need to extend the Apache Stanbol vocabulary to handle topics that are not entities.

Depictions for entities

If you want to display more than the labels and links of an entity, you can also fetch depictions of it where they are available. This feature is implemented in the Apache Stanbol web interface, which then shows the corresponding images of persons, logos of companies or flags of countries. This picture-retrieving facility can also be used for other RDF data.

Rules and refactoring results

One early adopter has been using the rules engine to refactor Apache Stanbol enhancements according to the target structure of rich snippets annotation designed for search engine optimization.

Become an early adopter and/or participate in the UX challenge.

For European SME CMS providers or integrators we provide the opportunity to evaluate IKS technology within a contract of approx. €6,000. You’ll test and integrate our technology within your system, and you’ll get insight into development and contact with the developers, the chance to raise feature issues and the opportunity to create a compelling demo for your customers. Check out the guidelines for early adopters. If you are especially interested in compelling user interfaces and interaction, you may join the IKS UI/UX contest. The deadline is October 14th, 2011.

Author: Wernher

Wernher Behrendt is a senior researcher at Salzburg Research and the coordinator of the IKS project.