Hi everybody! We are Carlo and Umberto, Computer Science students at the University of Bologna. We want to share our experience developing a software project named BaMaNews during the Knowledge Management course taught by Valentina Presutti. We used Apache Stanbol, in particular the Topic Engine of the Enhancer component, to develop a new way of navigating newspaper websites.
At the end of the post we provide a link to download our installation guide for the Topic Classification Engine.
What is a Facet?
Faceted classification was originally conceived as a method that would provide not only a library catalog for consulting books, but also a way to arrange the books on the shelves in an order that lets users go straight to the documents relevant to them.
A facet is a particular aspect under which a topic can be treated, so the contents of a document can be described by a combination of values of these facets. With faceted classification, documents are not cataloged in a hierarchy; instead, they are described in terms of the properties they possess.
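To make the idea concrete, here is a minimal Python sketch of faceted filtering. It is only an illustration: the facet names, the sample documents and the `filter_by_facets` helper are hypothetical and are not taken from BaMaNews.

```python
from typing import Dict, List, Set

# Each document is described by a set of values for each facet.
# The sample data below is invented for illustration.
Document = Dict[str, Set[str]]

documents: Dict[str, Document] = {
    "doc1": {"people": {"Ada Lovelace"}, "places": {"London"}},
    "doc2": {"people": {"Alan Turing"}, "places": {"London", "Manchester"}},
    "doc3": {"people": {"Ada Lovelace"}, "places": {"Turin"}},
}


def filter_by_facets(docs: Dict[str, Document], selection: Dict[str, str]) -> List[str]:
    """Return the ids of documents that carry every selected facet value."""
    return sorted(
        doc_id
        for doc_id, facets in docs.items()
        if all(value in facets.get(facet, set()) for facet, value in selection.items())
    )


print(filter_by_facets(documents, {"places": "London"}))
# ['doc1', 'doc2']
print(filter_by_facets(documents, {"places": "London", "people": "Ada Lovelace"}))
# ['doc1']
```

Each extra facet value narrows the result set, with no hierarchy involved: the same document can be reached through any combination of its properties.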
What is BaMaNews?
The “BaMaNews” project applies faceted navigation to the reading of newspaper articles. BaMaNews offers an alternative to the usual navigation of newspaper websites (based on “sections” or “keywords”). In particular, its main purpose is to let readers navigate the news by searching for the entities that appear in the articles.
The entities we considered are people, places and organizations (called “Persone”, “Luoghi” and “Organizzazioni” in the system). These are the facets of our system. We recognized these entities using the Apache Stanbol enhancers.
Besides recognizing the entities in the documents, we also identified the topics the documents address, using the Stanbol Topic Engine. “Topic” is the fourth facet (called “Topic” in the system). A further facet is given by the categories under which the newspaper website itself has filed the news (called “Categorie” in the system).
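As a rough illustration of how text can be handed to the Stanbol Enhancer, the Python sketch below only builds the HTTP request, without sending it. It assumes a Stanbol launcher running locally with the Enhancer exposed at `http://localhost:8080/enhancer`; that URL, the `build_enhancer_request` helper and the sample sentence are our assumptions for this example, not the exact code used in BaMaNews.

```python
import urllib.request

# Assumption: a local Stanbol launcher with the Enhancer at this address.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text: str, url: str = STANBOL_ENHANCER_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying plain text to enhance."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # Ask for the enhancements as RDF in Turtle syntax.
            "Accept": "text/turtle",
        },
        method="POST",
    )


request = build_enhancer_request("The Guardian is a newspaper based in London.")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would send it; the response would then have to be parsed to pull out the entities and topics for each facet.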
For the project we targeted the Guardian website. The analysis required an indexing step: the system retrieves data through the Guardian API, fetches the body of each article, then analyzes it and selects and saves the entities and topics that make up and describe the article.
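The indexing step can be pictured with the following Python sketch. It is a simplified stand-in, not our actual implementation: the article dictionaries with `id` and `body` keys, the `build_index` and `count_relationships` helpers, and the toy keyword extractor (which takes the place of the Stanbol analysis) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A (facet, value) pair found in an article, e.g. ("Luoghi", "Rome").
Annotation = Tuple[str, str]


def build_index(
    articles: Iterable[Dict[str, str]],
    extract: Callable[[str], List[Annotation]],
) -> Dict[Annotation, Set[str]]:
    """Map every (facet, value) pair to the ids of the articles that mention it."""
    index: Dict[Annotation, Set[str]] = defaultdict(set)
    for article in articles:
        for annotation in extract(article["body"]):
            index[annotation].add(article["id"])
    return dict(index)


def count_relationships(index: Dict[Annotation, Set[str]]) -> int:
    """Count the entity-article links stored in the index."""
    return sum(len(article_ids) for article_ids in index.values())


# Toy extractor standing in for the real analysis: a plain keyword lookup.
KNOWN_ENTITIES = {"Rome": "Luoghi", "UN": "Organizzazioni"}


def keyword_extractor(body: str) -> List[Annotation]:
    return [(facet, name) for name, facet in KNOWN_ENTITIES.items() if name in body]


articles = [
    {"id": "a1", "body": "Rome hosts the UN summit"},
    {"id": "a2", "body": "Floods hit Rome"},
]
index = build_index(articles, keyword_extractor)
print(sorted(index[("Luoghi", "Rome")]))  # ['a1', 'a2']
print(count_relationships(index))         # 3
```

Keeping the extractor as a parameter means the same indexing loop works whether the annotations come from a toy lookup like this one or from a real enhancement service.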
If you want to try BaMaNews, you can find it at this link.
Knowledge Management course and Apache Stanbol
We created the “BaMaNews” project during the Knowledge Management course given by Valentina Presutti at the University of Bologna. In this course we learned the principles and components of the Knowledge Management discipline, what the Semantic Web is and, in particular, which technologies support KM and how to use them.
Apache Stanbol is one of the technologies we studied, used and finally built on to create our project. Besides using the official release of the tool, we also installed, configured and tested the Topic Classification Engine, which allowed us to classify the articles by topic.
For this step, the documentation published by Olivier Grisel was very useful to us. Below we propose a revision of this documentation, with a description and explanation of the steps to perform.
During our project we processed more than 5000 newspaper articles. These are the results obtained on a sample of 2500 articles:
- 2026 saved articles.
- 12183 different entities recognized (by our system):
  - 7697 People
  - 2191 Organizations
  - 1282 Places
  - 1013 Topics
- 34931 one-to-one relationships between entities and articles.
Topic Classification Engine
Here is the document describing the procedure we followed to build and deploy the Topic Engine.