Entity Disambiguation project by Kritarth Anand accepted for Google Summer of Code 2012

Congratulations to Kritarth Anand: his entity disambiguation project with Apache Stanbol was accepted for this year’s Google Summer of Code.

Kritarth Anand’s work will bring Entity Disambiguation to Apache Stanbol, with these overall goals:

  1. Decide between Entity A and Entity B (in other words, rank them correctly)
  2. Provide a better confidence estimate (especially important if no human is in the loop)
  3. Group Entities (this could be interesting if the parsed content already contains some RDFa that we can exploit to detect further entities)
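As a rough illustration of the first two goals, the sketch below ranks two candidate entities and turns their raw scores into confidences that sum to 1. The scores, the function name `rank_with_confidence`, and the use of a softmax are assumptions made for this example only; the proposal does not prescribe a particular scoring or normalisation method.

```python
import math


def rank_with_confidence(scores):
    """Rank candidate entities by raw score and attach a softmax confidence.

    `scores` maps an entity label to a hypothetical raw score; higher is better.
    """
    if not scores:
        return []
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {entity: math.exp(score - top) for entity, score in scores.items()}
    total = sum(exps.values())
    # Best-scoring candidate first.
    ranked = sorted(scores, key=lambda entity: scores[entity], reverse=True)
    return [(entity, exps[entity] / total) for entity in ranked]


# Hypothetical raw scores for two candidates of the mention "Paris".
ranking = rank_with_confidence({"Paris, France": 2.0, "Paris, Texas": 0.5})
for entity, confidence in ranking:
    print(f"{entity}: {confidence:.2f}")
```

A large gap between the first and second confidence suggests the decision is safe to apply automatically, while a small gap is a hint that the result deserves review.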

The proposal is based on the Apache Stanbol issue (STANBOL 223), summarised below as originally posted by Olivier Grisel. Visit the Apache Stanbol site to follow the discussion.

Using the Stanbol Enhancer

Adding Disambiguation support to the Stanbol Enhancer includes the following points:

1. Dataset: Disambiguation needs not only a set of Entities but also additional data to disambiguate against

* This might need some pre-processing of the data (e.g. using mentions of the entity in sentences, or using data from linked Entities to create a context)
* This data needs to be accessible to the Stanbol Enhancer (e.g. by using the Entityhub, and our own SolrIndex or even other means)

2. Deciding on possible algorithms

* This Issue has two possible algorithms (see below and comments)

3. Workflow:

a) Disambiguate while linking (basically you have the String “Paris” and the Sentence/Document as context and want to know if you should link to Paris, France or Paris, Texas)
b) Disambiguate already linked Entities (you have 5 suggested Entities by two different Engines and you want to disambiguate (rank) them)

4. Validation of the Disambiguation: We need to compare enhancement quality with/without disambiguation

* The Benchmarking (enhancer/benchmark) tool could be used for that
* Question: How much time would be needed to create Benchmarking Examples?

5. What are the expected results?

* implementation of one (maybe more) disambiguation algorithm(s)
* integration into the Stanbol Enhancer as one or more EnhancementEngines
* management of the data needed for disambiguation (e.g. as part of the Entityhub)
* support (tools) for creating/extracting data needed for disambiguation
* validation results using the enhancer/benchmarking tool
* documentation on the Apache Stanbol Webpage
* simple Web interface showing the improved enhancement results (I am thinking of a single text box for the input text and two enhancement results, one with and one without entity disambiguation)

Optional – integration of user feedback to enhance learning/validation set
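To make workflow (a) and the validation step more concrete, here is a minimal sketch that picks between Paris, France and Paris, Texas by counting the words a document shares with a small context vocabulary per entity, then compares accuracy with and without disambiguation. Everything in it is an assumption for illustration: the `CONTEXTS` vocabularies, the two labelled examples, the helper names, and the word-overlap heuristic itself, which is neither of the algorithms discussed in the issue. It does not use the Stanbol API or the enhancer/benchmark tool.

```python
import re

# Hypothetical context data per entity; in practice this would be derived
# from the dataset described in point 1 (mentions, linked Entities, ...).
CONTEXTS = {
    "Paris, France": {"france", "seine", "eiffel", "louvre", "capital"},
    "Paris, Texas": {"texas", "lamar", "county", "dallas", "usa"},
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(text, contexts=CONTEXTS):
    """Return candidate entities ordered by word overlap with `text`, best first."""
    words = tokenize(text)
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(contexts, key=lambda entity: len(words & contexts[entity]), reverse=True)


def accuracy(examples, choose):
    """Fraction of (text, expected entity) pairs for which `choose` is right."""
    hits = sum(1 for text, expected in examples if choose(text) == expected)
    return hits / len(examples)


# Two hand-written benchmark examples (made up for this sketch).
EXAMPLES = [
    ("Paris is the capital of France, on the Seine.", "Paris, France"),
    ("Paris is the seat of Lamar County in Texas.", "Paris, Texas"),
]

# Without disambiguation: always take the first candidate.
baseline = accuracy(EXAMPLES, lambda text: next(iter(CONTEXTS)))
# With disambiguation: take the best-ranked candidate for the given context.
with_disambiguation = accuracy(EXAMPLES, lambda text: disambiguate(text)[0])

print(f"without disambiguation: {baseline:.2f}")
print(f"with disambiguation:    {with_disambiguation:.2f}")
```

The same `disambiguate` function also covers workflow (b) in spirit: given Entities already suggested by several Engines, it only needs their context data to re-rank them.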

 
