Use Apache Stanbol to extend your CMS search and discovery experience

This article introduces new developments in Apache Stanbol that extend search and discovery over semantically indexed content. We have codenamed this work “index pipeline”. It brings together Apache Stanbol Contenthub, Apache Solr, the Linked Media Framework (LMF) and LDPath. By integrating these components, users can now build richer indexes that make use of content semantics such as named entities and concepts from thesauri. This approach extends the keyword-based indexing and search techniques available in most Content Management Systems without affecting the underlying search infrastructure.

The Contenthub of Apache Stanbol is a semantic store for content. It provides two main services. First, content items can be submitted through the semantic enhancement facilities of Stanbol. Second, the stored content items can be searched through a powerful search mechanism.

Contenthub makes use of Apache Solr[1] as its backend to store the content items. Solr provides powerful indexing and text-based search mechanisms. It supports a rich and highly flexible schema specification, and has an extensive search plugin API for developing custom search behavior. Contenthub interacts with an Embedded Solr Server through Solrj[2].

Solr indexes submitted documents according to the configuration within a Solr Core. A core is a running instance of a Solr index along with several configurations. In a Solr Core, apart from the linguistics-related configurations, “schema.xml”[3] is the main configuration file. This schema file contains all of the details about which parts of the content items should be stored, and how those parts should be handled during indexing and search.

The current implementation of Contenthub provides a default Solr core with a default schema file to index content items. This schema defines the following fields:

<!--
following fields are used in contenthub. These are default fields a content have
content is the raw text of document and mimeType is type of document
-->
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="stanbolreserved_content" type="text_general" indexed="true" stored="true"/>
<field name="stanbolreserved_mimetype" type="string" indexed="true" stored="true"/>
<field name="stanbolreserved_creationdate" type="tdate" indexed="true" stored="true" multiValued="false"/>
<field name="stanbolreserved_enhancementcount" type="long" indexed="false" stored="true"/>
<!-- Semantic fields -->
<field name="stanbolreserved_countries" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_imagecaptions" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_regions" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_governors" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_capitals" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_largestcities" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_leadernames" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_givennames" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_knownfors" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_birthplaces" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_placeofbirths" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_workinstitutions" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_captions" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_shortdescriptions" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stanbolreserved_fields" type="text_general" indexed="true" stored="true" multiValued="true"/>

The “Semantic fields” in the above definition map to properties extracted through the enhancements of the content. For instance, the “countries” field holds the countries of recognized cities (e.g. Istanbul → Turkey) if a location in the content is recognized by Stanbol.
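
As a rough illustration of how such a semantic field could be used at search time, the sketch below builds standard Solr query parameters that match content items by country and request facet counts on the same field. The base URL and the core name are assumptions made for the example only; Contenthub itself talks to an embedded Solr server through Solrj rather than over HTTP.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: Contenthub uses an embedded Solr server, so this URL
# and the core name "contenthub" are assumptions for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/contenthub/select"


def build_country_query(country, rows=10):
    """Return a Solr select URL that searches the semantic 'countries' field."""
    params = [
        ("q", 'stanbolreserved_countries:"%s"' % country),  # phrase match on the field
        ("rows", str(rows)),
        ("facet", "true"),                                  # turn faceting on
        ("facet.field", "stanbolreserved_countries"),       # count values per country
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_country_query("Turkey"))
```

The function only assembles the request; it does not escape quotes inside the country value and does not contact a server.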

Linked Media Framework (LMF)[4] is an outcome of the FP7 project “Knowledge in a Wiki (KiWi)”[5]. It provides implementations of central Semantic Web technologies to offer advanced services. LMF consists of the LMF Core and LMF Modules. LMF Semantic Search[6] is one of the LMF Modules; like Contenthub, it uses Apache Solr as its backend, and it creates a search index over selected properties of resources.

The ability to create Solr cores that direct how content items are indexed and searched is important, because the default index of Contenthub needs to be adjusted to different domains with different indexing and search criteria. LMF Semantic Search creates Solr indexes with the help of so-called “RDF Path Programs”[7]. Recently, the LMF team released a standalone library for evaluating RDF Path Programs, named “LDPath”[8]. LDPath is a simple path-based query language over RDF (similar to XPath or SPARQL Property Paths), designed in particular for querying the Linked Data Cloud by following RDF links between resources. LDPath programs are largely self-descriptive. A sample LDPath program is presented below:

@prefix foaf : <http://xmlns.com/foaf/0.1/> ;
@prefix geo : <http://www.w3.org/2003/01/geo/wgs84_pos#> ;
title = foaf:name :: xsd:string ;
summary = dc:description :: xsd:string ;
lng = foaf:based_near / geo:long :: xsd:double ;
friends = foaf:knows / (foaf:name | fn:concat(foaf:givenname," ",foaf:surname)) :: xsd:string ;
country_code = foaf:based_near / <http://www.geonames.org/ontology#countryCode> :: xsd:string ;
type = rdf:type :: xsd:anyURI ;
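
To make the path notation more concrete, here is a minimal sketch of how a “/”-separated path can be evaluated by following links in a small in-memory graph. It is not the LDPath library: it only supports plain property steps (no “|” alternatives, functions or type conversion), and the triples and the “ex:” resources are made up for the example.

```python
# Toy RDF graph as (subject, predicate, object) triples. All data is made up.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:based_near", "ex:istanbul"),
    ("ex:istanbul", "geo:long", 28.97),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def follow(nodes, predicate):
    """Return the objects reachable from any of `nodes` via `predicate`."""
    return [o for (s, p, o) in TRIPLES if p == predicate and s in nodes]


def evaluate_path(start, path):
    """Evaluate a '/'-separated property path starting at one resource."""
    nodes = [start]
    for predicate in (step.strip() for step in path.split("/")):
        nodes = follow(nodes, predicate)  # each step moves one link further
    return nodes


print(evaluate_path("ex:alice", "foaf:based_near / geo:long"))  # [28.97]
print(evaluate_path("ex:alice", "foaf:knows / foaf:name"))      # ['Bob']
```

Each step replaces the current set of nodes with the nodes one link away, which is the same traversal idea behind a rule such as `lng = foaf:based_near / geo:long`.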

In Contenthub, LDPath programs are used to build semantic search indexes so that indexing and search can be adjusted to user needs. The related services are currently under development. To illustrate the concept, consider the city → country relation. In the default index of Contenthub, the following SPARQL query is executed on the enhancements of a content item to retrieve the country information of the recognized entities, if any exists. This information is then indexed along with the content item itself, as configured in the default index.

PREFIX fise: <http://fise.iks-project.eu/ontology/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dbont: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?stanbolreserved_countries WHERE {
?enhancement a fise:EntityAnnotation.
?enhancement dc:relation ?textEnh.
?textEnh a fise:TextAnnotation.
?enhancement fise:entity-reference ?entity.
?entity dbont:country ?stanbolreserved_countries.
}
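
For readers less familiar with SPARQL, the graph pattern of this query can be mimicked with plain set lookups. The sketch below is only an illustration under made-up data: the enhancement URNs and the prefixed names stand in for what Stanbol would actually produce, and it is not how Contenthub executes the query.

```python
# Made-up enhancement triples in the shape the query expects.
TRIPLES = {
    ("urn:enh:1", "rdf:type", "fise:TextAnnotation"),
    ("urn:enh:2", "rdf:type", "fise:EntityAnnotation"),
    ("urn:enh:2", "dc:relation", "urn:enh:1"),
    ("urn:enh:2", "fise:entity-reference", "dbpedia:Istanbul"),
    ("dbpedia:Istanbul", "dbont:country", "dbpedia:Turkey"),
}


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for (s, p, o) in TRIPLES if s == subject and p == predicate}


def subjects_of_type(rdf_type):
    """All subjects declared to be of the given type."""
    return {s for (s, p, o) in TRIPLES if p == "rdf:type" and o == rdf_type}


def countries():
    """Mimic the graph pattern of the SPARQL query with plain set lookups."""
    text_annotations = subjects_of_type("fise:TextAnnotation")
    found = set()
    for enhancement in subjects_of_type("fise:EntityAnnotation"):
        # The entity annotation must relate to at least one text annotation.
        if not objects(enhancement, "dc:relation") & text_annotations:
            continue
        for entity in objects(enhancement, "fise:entity-reference"):
            found |= objects(entity, "dbont:country")
    return sorted(found)  # a set gives DISTINCT values; sorted for stable output


print(countries())  # ['dbpedia:Turkey']
```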

To overcome the limitations of the built-in “semantic” capabilities of Contenthub, LDPath programs can be used to create Solr cores on the fly and to index the parts of the content items that the LDPath programs point to. Users can build their own LDPath programs to expose their own semantics to Contenthub. For example, an LDPath program can identify the properties of the entities that should be indexed along with the content items and then made searchable through semantic search. In addition, the properties selected by the LDPath program indirectly determine the faceted search options. This allows a great deal of customization and personalization from the user's point of view.

To illustrate, the LDPath program presented above leads to a Solr schema including the following fields:

<field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="summary" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="lng" type="double" indexed="true" stored="true" multiValued="true"/>
<field name="friends" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="country_code" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true" multiValued="true"/>

This schema is generated automatically within Contenthub. The corresponding values are retrieved by evaluating the LDPath program on the enhancements of the content items and are then indexed in the Solr index.

Integrating LDPath programs into Contenthub places an additional requirement on Stanbol users: they should be able to write their own LDPath programs and submit them to Contenthub. As mentioned earlier, the integration is still under development, and several services for dealing with LDPath programs will become available. Although those services will ease adoption, users may still need a basic knowledge of LDPath programs.
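
The following is a minimal sketch of what such a derivation could look like: it reads the field rules of an LDPath program and emits one Solr field definition per rule. It is not Contenthub's implementation. The regular expression only handles simple one-line rules, and the mapping from XSD types to Solr types is an assumption chosen to match the listing above.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed mapping from XSD types to Solr field types, chosen to match the
# listing above; the real Contenthub mapping may differ.
XSD_TO_SOLR = {"xsd:string": "string", "xsd:double": "double", "xsd:anyURI": "string"}

# Matches one-line rules of the form:  name = <path> :: xsd:type ;
FIELD_RULE = re.compile(r"^\s*(\w+)\s*=\s*.+?::\s*(xsd:\w+)\s*;\s*$")


def solr_fields(ldpath_program):
    """Return one Solr <field> definition per field rule in the program."""
    fields = []
    for line in ldpath_program.splitlines():
        match = FIELD_RULE.match(line)
        if not match:  # skips @prefix declarations and blank lines
            continue
        name, xsd_type = match.groups()
        solr_type = XSD_TO_SOLR.get(xsd_type, "string")  # fall back to string
        fields.append(
            '<field name=%s type=%s indexed="true" stored="true" multiValued="true"/>'
            % (quoteattr(name), quoteattr(solr_type))
        )
    return fields


PROGRAM = """
@prefix geo : <http://www.w3.org/2003/01/geo/wgs84_pos#> ;
title = foaf:name :: xsd:string ;
lng = foaf:based_near / geo:long :: xsd:double ;
type = rdf:type :: xsd:anyURI ;
"""

for field in solr_fields(PROGRAM):
    print(field)
```

Every generated field is marked multi-valued because a path can yield several values for a single content item.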

[1] http://lucene.apache.org/solr/
[2] http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer
[3] http://wiki.apache.org/solr/SchemaXml
[4] http://code.google.com/p/kiwi/
[5] http://www.kiwi-project.eu/
[6] http://code.google.com/p/kiwi/wiki/SemanticSearch
[7] http://code.google.com/p/kiwi/wiki/RdfPathLanguage
[8] http://code.google.com/p/ldpath/
