DBpedia Spotlight – Integration in Apache Stanbol

With the variety of content management systems (CMS) out there, it has become very easy to publish content on the web, or even manage a community website, with no programming or web design skills required. As a result, the volume of user- and publisher-generated content is increasing rapidly, and managing this data effectively while keeping it usable has become a challenge. Apache Stanbol is an open source technology stack designed to aid this process without requiring any structural changes in the CMS. Semantic enhancement of unstructured text (among many other useful features) is available over RESTful endpoints, which makes it very easy to integrate into existing workflows. The modular design, easy administration, extensibility and vibrant community make it a very powerful framework. The automatic processing, analysis and enrichment of text is not an easy task, so appropriate tools are needed that can be easily integrated into Apache Stanbol.

DBpedia Spotlight – Introduction
DBpedia Spotlight is open source software designed to step up to this task. It automatically annotates mentions of DBpedia resources in text and covers the whole analysis life cycle: entity detection (spotting), candidate selection (finding the DBpedia resources a mention might refer to) and, last but not least, disambiguation in case there are multiple candidates for a single mention. Pablo Mendes (co-founder of DBpedia Spotlight) and I, Iavor Jelev (CTO at babelmonkeys / GzEvD), were very happy to integrate the functionality of DBpedia Spotlight into Apache Stanbol as part of the early adopters programme. In this blog post we want to give you an overview of the integration, details on the new EnhancementEngines and EnhancementChains, as well as tips on how to use them. If you are not familiar with Apache Stanbol or the concept of an EnhancementEngine, please refer to the great post by Anuj Kumar, which covers these subjects in detail. Thanks to Anuj for this; I wish his post had been available when we started developing our EnhancementEngines. We actually intended to write a similar introduction to the development process of an engine, but Anuj has done a great job and we will build on his post. If you are already familiar with the concepts of EnhancementEngine and EnhancementChain, you should be able to follow this report easily.

DBpedia Spotlight – Changes to the REST endpoints
As mentioned in the introduction, DBpedia Spotlight performs the entire annotation process (spotting, candidate selection and disambiguation). At that time, three RESTful endpoints were available on our side:

  • annotate – spots the potential mentions, retrieves the candidate DBpedia resources, disambiguates them if needed, and links the mentions to the best one
  • candidates – same as annotate, but does not disambiguate the candidates for each mention. Rather it returns the top K ones.
  • disambiguate (soon to be deprecated) – does not do spotting, it just selects the candidates for the given mentions and does disambiguation.

As you have probably noticed, the endpoints on the DBpedia Spotlight side act like EnhancementChains, because multiple steps are performed in sequence within a single request (for more detailed information please refer to the documentation [1] and the user manual [2]). We wanted to do our integration “the Stanbol way”, so the process included adjustments and new implementations on our end as well, in order to separate the processing steps. This not only allows the different stages to be implemented as separate EnhancementEngines, which can then be executed as a DBpedia Spotlight EnhancementChain, but also makes it possible to substitute existing Stanbol EnhancementEngines for some of the steps. This way you can use only parts of the DBpedia Spotlight functionality if that suits your use case better.

One important addition we made on the DBpedia Spotlight side was a new REST endpoint that is solely responsible for spotting [3]. It takes a text as input and returns the discovered mentions of possible DBpedia resources, without selecting or disambiguating candidates. An example response to a query containing the text “President Obama met Angela Merkel in Berlin on Monday” looks as follows:

<annotation text="President Obama met Angela Merkel in Berlin on Monday">
        <surfaceForm name="President Obama" offset="0"/>
        <surfaceForm name="met" offset="16"/>
        <surfaceForm name="Angela Merkel" offset="20"/>
        <surfaceForm name="Berlin" offset="37"/>
        <surfaceForm name="Monday" offset="47"/>
</annotation>

Example 1: DBpedia Spotlight spot REST endpoint response

This is the endpoint we used in order to implement a DBpedia Spotlight spotter EnhancementEngine in Stanbol, which will be discussed in the next chapter.
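
To make the response format concrete, here is a minimal Python sketch that reads the XML from Example 1 and turns it into (surface form, start, end) triples. The function name parse_spot_response is our own choice for illustration and is not part of DBpedia Spotlight or Stanbol; the XML is copied from the example rather than fetched from the service.

```python
import xml.etree.ElementTree as ET

# Response body copied from Example 1 (not fetched from the live service).
SPOT_RESPONSE = """<annotation text="President Obama met Angela Merkel in Berlin on Monday">
    <surfaceForm name="President Obama" offset="0"/>
    <surfaceForm name="met" offset="16"/>
    <surfaceForm name="Angela Merkel" offset="20"/>
    <surfaceForm name="Berlin" offset="37"/>
    <surfaceForm name="Monday" offset="47"/>
</annotation>"""


def parse_spot_response(xml_text):
    """Return (text, [(surface_form, start, end), ...]) from a spot response."""
    root = ET.fromstring(xml_text)
    text = root.get("text")
    mentions = []
    for element in root.findall("surfaceForm"):
        name = element.get("name")
        start = int(element.get("offset"))
        # The end offset is derived from the length of the surface form.
        mentions.append((name, start, start + len(name)))
    return text, mentions


text, mentions = parse_spot_response(SPOT_RESPONSE)
for name, start, end in mentions:
    # Each offset should point at the surface form inside the original text.
    assert text[start:end] == name
    print(name, start, end)
```

The derived end offset for “Angela Merkel” is 33, which matches the "end" value of the TextAnnotation the engines store for this mention.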

Another step in separating the processing stages was to enable the annotate endpoint to consume surface forms as input, so that DBpedia Spotlight can be used with a different spotter. We decided that the most suitable implementation was to adopt the XML format shown in Example 1 and to accept it as input for the spotter in the annotate endpoint. This way, if you want to use, for instance, Stanbol named entity recognition (NER), all you have to do is transform its results into this XML format, and DBpedia Spotlight will skip its own spotting step. A new spotter implementation, SpotXmlParser, was added for parsing this input.
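
As a rough illustration of that transformation, the following Python sketch serializes mentions from some other spotter into the XML layout of Example 1. The helper build_spot_xml and the sample mention list are hypothetical. How the resulting XML is handed to the annotate endpoint is not covered here, so please check the user manual [2] for that part.

```python
import xml.etree.ElementTree as ET


def build_spot_xml(text, mentions):
    """Serialize (surface_form, offset) pairs into the Example 1 XML layout."""
    root = ET.Element("annotation", {"text": text})
    for name, offset in mentions:
        # Guard against offsets that do not match the text.
        if text[offset:offset + len(name)] != name:
            raise ValueError(f"{name!r} not found at offset {offset}")
        ET.SubElement(root, "surfaceForm", {"name": name, "offset": str(offset)})
    return ET.tostring(root, encoding="unicode")


text = "President Obama met Angela Merkel in Berlin on Monday"
# Hypothetical output of some other spotter (for instance a NER engine).
ner_mentions = [("Angela Merkel", 20), ("Berlin", 37)]
spot_xml = build_spot_xml(text, ner_mentions)
print(spot_xml)
```

The offset check is a design choice of this sketch: a mismatch between offsets and text is much easier to diagnose before the request is sent than afterwards.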

Apache Stanbol – Implementing the EnhancementEngines
Our first goal was to build and install Apache Stanbol as our development environment and demo system [4]. Once this was done, we implemented four EnhancementEngines, which will be discussed in this chapter. We will use $engineUrl [5] as an abbreviation of the full URL to improve readability.

$engineUrl/dbpspotlightannotate
The first EnhancementEngine we implemented was dbpspotlightannotate, which uses the annotate REST endpoint of DBpedia Spotlight. As mentioned in the previous chapter, this engine performs the whole annotation process in one run. It has a very flexible configuration, which can be customized via the Felix Admin interface.

Here is a brief description (for more details, please refer to the user manual [2]):

  • Spotlight URL – the URL which will be used for the request (default value: http://spotlight.dbpedia.org/rest/annotate). You would not want to change this parameter, unless you run a local installation of DBpedia Spotlight
  • Spotter – the algorithm which will be used for Spotting (aka term recognition). Currently available: NER, LingPipeSpotter, OpenNLPChunkerSpotter, Kea
  • Disambiguator – the algorithm used for ranking of senses based on context. Currently available: Document, Occurrences
  • Types Restriction – the DBpedia Ontology types you wish to restrict your results to (for instance “Person,Location”)
  • Sparql – restrict the results with a SPARQL query
  • Support – filter the results based on a support metric (default value is -1, which means no restriction)
  • Confidence – filter the results based on a confidence metric (default value is -1, which means no restriction)

You don’t have to change any of these settings before you start using the engine, as default values have been pre-configured. The options are intended for advanced users who want to explore the different possibilities in order to improve their results or performance.
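
The following Python sketch shows one way such settings could be collected into request parameters, leaving out every filter that is still at its “no restriction” default. The parameter names (spotter, disambiguator, types, sparql, support, confidence) are assumptions made for this illustration, and build_annotate_params is a hypothetical helper; the user manual [2] is the place to look up the names the service really expects.

```python
from urllib.parse import urlencode

# Default endpoint quoted in the configuration list above.
DEFAULT_SPOTLIGHT_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_params(text, spotter=None, disambiguator=None,
                          types=None, sparql=None, support=-1, confidence=-1):
    """Collect only the options that actually restrict the result.

    The parameter names are assumptions for this sketch; check the user
    manual for the names the service really expects.
    """
    params = {"text": text}
    optional = {
        "spotter": spotter,
        "disambiguator": disambiguator,
        "types": types,
        "sparql": sparql,
    }
    params.update({key: value for key, value in optional.items() if value})
    # -1 means "no restriction", so those filters are left out entirely.
    if support != -1:
        params["support"] = support
    if confidence != -1:
        params["confidence"] = confidence
    return params


params = build_annotate_params(
    "President Obama met Angela Merkel in Berlin on Monday",
    types="Person,Location",
    confidence=0.2,
)
print(DEFAULT_SPOTLIGHT_URL + "?" + urlencode(params))
```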

You can test the EnhancementEngine through the web interface or the console.

curl -X POST -H "Content-type: text/plain" \
     --data "President Obama met Angela Merkel in Berlin on Monday" \
     http://spotlight.dbpedia.org/stanbol/enhancer/engine/dbpspotlightannotate
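
If you would rather call the engine from a script, the curl command can be mirrored with the Python standard library. This is only a sketch: build_enhance_request is a name we made up, and the actual network call is commented out because it needs a reachable Stanbol instance.

```python
import urllib.request

ENGINE_URL = "http://spotlight.dbpedia.org/stanbol/enhancer/engine/dbpspotlightannotate"


def build_enhance_request(text, url=ENGINE_URL):
    """Prepare a POST request carrying plain text, like the curl call above."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_enhance_request("President Obama met Angela Merkel in Berlin on Monday")
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a reachable Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```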

The engine dbpspotlightannotate stores the surface forms identified by the spotter as TextAnnotations (shown in example 2) and the DBpedia resources as EntityAnnotations (shown in example 3).

{
    "@subject": "urn:enhancement-9e58724e-3a42-cb56-067f-7411d9eb9837",
    "@type": [
            "Enhancement",
            "TextAnnotation"
        ],
    "created": "2012-07-05T11:38:35.197Z",
    "creator": "org.apache.stanbol.enhancer.engines.dbpspotlightannotate.DBPSpotlightAnnotateEnhancementEngine",
    "end": 33,
    "extracted-from": "urn:content-item-sha1-43a1aa3144f4a5ee3dda4112fb7f89f80df4aa89",
    "selected-text": "Angela Merkel",
    "start": 20,
    "type": "DBpedia:OfficeHolder,DBpedia:Person,Schema:Person,Freebase:/people/person,Freebase:/people,Freebase:/book/author,Freebase:/book, Freebase:/government/politician,Freebase:/government, Freebase:/award/ranked_item,Freebase:/award,Freebase:/award/award_winner,Freebase:/film/person_or_entity_appearing_in_film,Freebase:/film"
}

Example 2: TextAnnotation for the surface form “Angela Merkel”

{
    "@subject": "urn:enhancement-26da0456-b36b-645d-c96f-229f7513f4cd",
    "@type": [
            "Enhancement",
            "EntityAnnotation"
        ],
    "created": "2012-07-05T11:38:35.197Z",
    "creator": "org.apache.stanbol.enhancer.engines.dbpspotlightannotate.DBPSpotlightAnnotateEnhancementEngine",
    "entity-label": {
            "@literal": "Angela Merkel",
            "@language": "en"
        },
    "entity-reference": "http://dbpedia.org/resource/Angela_Merkel",
    "entity-type": [
            "http://www.freebase.com/schema/film/person_or_entity_appearing_in_film",
            "OfficeHolder",
            "http://www.freebase.com/schema/award/award_winner",
            "http://www.freebase.com/schema/government",
            "http://www.schema.org/Person",
            "http://www.freebase.com/schema/book",
            "http://www.freebase.com/schema/film",
            "http://www.freebase.com/schema/people",
            "http://www.freebase.com/schema/people/person",
            "Person",
            "http://www.freebase.com/schema/award",
            "http://www.freebase.com/schema/award/ranked_item",
            "http://www.freebase.com/schema/book/author",
            "http://www.freebase.com/schema/government/politician"
        ],
    "extracted-from": "urn:content-item-sha1-43a1aa3144f4a5ee3dda4112fb7f89f80df4aa89",
    "relation": "urn:enhancement-9e58724e-3a42-cb56-067f-7411d9eb9837"
}

Example 3: EntityAnnotation for the surface form “Angela Merkel”
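
The "relation" field of an EntityAnnotation carries the "@subject" of the TextAnnotation it belongs to, which is how the two are tied together. The Python sketch below joins them on that field. It assumes the enhancer response has already been loaded into a plain list of dictionaries (the enclosing response structure is not shown in this post), and link_annotations is a helper name of our own.

```python
def link_annotations(enhancements):
    """Map each TextAnnotation to the EntityAnnotations that point at it."""
    texts = {}
    entities = {}
    for item in enhancements:
        types = item.get("@type", [])
        if "TextAnnotation" in types:
            texts[item["@subject"]] = item
        elif "EntityAnnotation" in types:
            # "relation" holds the @subject of the related TextAnnotation.
            entities.setdefault(item["relation"], []).append(item)
    return [
        (mention["selected-text"], mention["start"], mention["end"],
         [entity["entity-reference"] for entity in entities.get(subject, [])])
        for subject, mention in texts.items()
    ]


# Trimmed copies of Examples 2 and 3; only the fields used here are kept.
enhancements = [
    {
        "@subject": "urn:enhancement-9e58724e-3a42-cb56-067f-7411d9eb9837",
        "@type": ["Enhancement", "TextAnnotation"],
        "selected-text": "Angela Merkel",
        "start": 20,
        "end": 33,
    },
    {
        "@subject": "urn:enhancement-26da0456-b36b-645d-c96f-229f7513f4cd",
        "@type": ["Enhancement", "EntityAnnotation"],
        "entity-reference": "http://dbpedia.org/resource/Angela_Merkel",
        "relation": "urn:enhancement-9e58724e-3a42-cb56-067f-7411d9eb9837",
    },
]

linked = link_annotations(enhancements)
for row in linked:
    print(row)
```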

$engineUrl/dbpspotlightspot
We have described in detail how the full processing stack of DBpedia Spotlight integrates into Stanbol. The goal now is to split the different stages into separate EnhancementEngines, starting with dbpspotlightspot. This engine stores the surface forms as TextAnnotations, as discussed above. Besides NER, it also performs other kinds of phrase recognition [3].

$engineUrl/dbpspotlightdisambiguate
This engine reads TextAnnotations, stored by another Stanbol EnhancementEngine, and uses them as input for candidate selection, disambiguation and linking. The results are then stored as EntityAnnotations and linked to the TextAnnotations.

$engineUrl/dbpspotlightcandidates
This EnhancementEngine is equivalent to dbpspotlightannotate, except that all candidate resources for a TextAnnotation are returned, and not only the best one as in dbpspotlightannotate. Here is an example of a candidate resource:

{
    "@subject": "urn:enhancement-f1e7ced2-a738-5786-3d19-89e8cc71ddfa",
    "@type": [
            "Enhancement",
            "EntityAnnotation"
        ],
    "created": "2012-07-05T12:10:44.482Z",
    "creator": "org.apache.stanbol.enhancer.engines.dbpspotlightcandidates.DBPSpotlightCandidatesEnhancementEngine",
    "entity-label": {
            "@literal": "East Berlin",
            "@language": "en"
        },
    "entity-reference": "East_Berlin",
    "extracted-from": "urn:content-item-sha1-43a1aa3144f4a5ee3dda4112fb7f89f80df4aa89",
    "http://spotlight.dbpedia.org/ns/contextualScore": -1.0,
    "http://spotlight.dbpedia.org/ns/finalScore": 0.13204871,
    "http://spotlight.dbpedia.org/ns/percentageOfSecondRank": 0.82986754,
    "http://spotlight.dbpedia.org/ns/priorScore": 1.1408546E-5,
    "http://spotlight.dbpedia.org/ns/support": 796.0,
    "relation": "urn:enhancement-67f4dc7c-8d6d-951f-58d5-e8f0db55f72f"
}

Example 4: EntityAnnotation example of a candidate for the surface form “Berlin”
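
Since several candidates can relate to the same TextAnnotation, a client will usually want to group and order them. The Python sketch below does this using the finalScore property. It assumes that a higher finalScore means a better candidate, which this post does not state explicitly. The second candidate and its score are made up purely so that there is something to sort, and rank_candidates is a hypothetical helper.

```python
from collections import defaultdict

FINAL_SCORE = "http://spotlight.dbpedia.org/ns/finalScore"


def rank_candidates(entity_annotations):
    """Group candidates by the TextAnnotation they relate to, best score first.

    Assumption of this sketch: a higher finalScore means a better candidate.
    """
    grouped = defaultdict(list)
    for annotation in entity_annotations:
        grouped[annotation["relation"]].append(annotation)
    return {
        relation: sorted(candidates, key=lambda c: c[FINAL_SCORE], reverse=True)
        for relation, candidates in grouped.items()
    }


berlin = "urn:enhancement-67f4dc7c-8d6d-951f-58d5-e8f0db55f72f"
candidates = [
    # Values taken from Example 4.
    {"entity-reference": "East_Berlin", "relation": berlin, FINAL_SCORE: 0.13204871},
    # Made-up second candidate, included only so there is something to sort.
    {"entity-reference": "Berlin", "relation": berlin, FINAL_SCORE: 0.75},
]

ranked = rank_candidates(candidates)
print([c["entity-reference"] for c in ranked[berlin]])
```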

Apache Stanbol – Using the EnhancementEngines in EnhancementChains
If you want to use the full functionality of DBpedia Spotlight, you should use the EnhancementEngine dbpspotlightannotate. In case you are interested in using it only for spotting or disambiguation, we created three EnhancementChains to demonstrate how you can do that. We will use $chainURL [6] as an abbreviation for the full URL.

$chainURL/dbpspotlight
This chain replicates the functionality of dbpspotlightannotate by chaining dbpspotlightspot and dbpspotlightdisambiguate. Please note that langid is run first, and only English texts are processed. In the near future, DBpedia Spotlight will support multiple languages and this constraint will be adapted accordingly.

$chainURL/dbpspotlightonlyspot
Demonstrates the use of dbpspotlightspot with a different linker, in this case dbpediaLinking.

$chainURL/dbpspotlightonlydisambiguate
Demonstrates the use of dbpspotlightdisambiguate with a different NER engine, in this case ner.

Conclusion
We have described the integration of DBpedia Spotlight into Apache Stanbol. This integration enables the enhancement of text content through the recognition and disambiguation of up to 3.5 million entities and concepts of approximately 320 types from DBpedia. Through the EnhancementEngines and EnhancementChains we have presented, users can flexibly integrate alternative implementations of spotting and disambiguation, while chaining them together to better suit specific use cases. The default implementation is based on a freely available web service deployment. DBpedia Spotlight is also available as Apache-licensed open source and can be locally installed for higher reliability, lower response times as well as privacy reasons (no need to send content outside of an enterprise). For more information about DBpedia Spotlight, visit: https://github.com/dbpedia-spotlight/dbpedia-spotlight/

References
[1] http://dbpedia.org/spotlight
[2] http://wiki.dbpedia.org/spotlight/usersmanual?v=i0m
[3] https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Spotting
[4] http://spotlight.dbpedia.org/stanbol/
[5] http://spotlight.dbpedia.org/stanbol/enhancer/engine/
[6] http://spotlight.dbpedia.org/stanbol/enhancer/chain/

Author: Jelev

Iavor Jelev is a software developer specialized in database systems and NLP. He studied at the Freie Universität Berlin and has more than 12 years of experience in software development and project management. After working on a natural language processing based stock market trading recommendation system at the Fraunhofer ITWM for his diploma thesis, his passion for the semantic web was ignited, and he is currently a partner and CTO at babelmonkeys, a business branch of the GzEvD GmbH in Berlin. He currently leads the software development of RoboTagger, a tool similar to DBpedia Spotlight for the semantic enrichment of unstructured data.
