This blog post describes the health-domain demonstration that we will present at the Extended Semantic Web Conference, to be held between the 27th and 31st of May in Heraklion, Greece. The demonstration shows the semantic enhancement, indexing and search of content, and of knowledge related to that content, using four components of Apache Stanbol: Contenthub, Enhancer, Entityhub and CMS Adapter.
This post describes the demonstration scenario in moderate technical detail, explaining how the Stanbol components are combined to provide health-domain semantic functionality for non-semantic content management systems.
Preparation of Health Related Indexes
In this demonstration, we use several health-related datasets for different purposes. Before explaining where the datasets are used, we describe the datasets themselves and how they are prepared for processing by Stanbol.
The original source of the datasets that we use in this demonstration is the National Center for Biomedical Ontology (BioPortal). From this portal, we use the following datasets.
- SNOMED/CT: This is a very comprehensive clinical healthcare terminology containing terms about diagnosis, clinical findings, body structures, procedures, etc.
- RxNORM: This dataset is about the generic and branded drugs and it aims to provide normalized names for those drugs. Also, it links the drug names to commonly used vocabularies in pharmacy management.
- Adverse Reaction Terminology (ART): This is a terminology aiming to provide a basis for coding of adverse reaction terms. It provides a hierarchical structure starting from body system/organ level for drug problems.
After transforming these datasets into RDF format, we used the indexing component of Entityhub to bundle each dataset as a separate Solr index so that it can be used during the enhancement and storage operations.
The indexing component of Entityhub produces two artifacts: a zip file containing the Solr index that represents the RDF dataset, and a jar file that installs the Solr index as an OSGi bundle in the OSGi environment where Stanbol runs. Once installed, the index can be used by the various components of Stanbol.
To recognize health-related named entities in the submitted documents, we need to configure the Enhancer component. This is done by assigning each Solr index created from an RDF dataset to a separate KeywordLinkingEngine. The Figure – Configuring a Keyword Linking Engine for RxNORM Dataset shows the configuration of the Keyword Linking Engine associated with the RxNORM dataset. By adding a new enhancement engine for each dataset, we make the Stanbol Enhancer look up the entities defined in the health-related datasets during the document enhancement process.
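Once the engines are configured, a document can be sent to the Enhancer over HTTP. The sketch below only builds such a request without sending it. It assumes a Stanbol launcher running locally at http://localhost:8080 with the Enhancer mounted at /enhancer, and the sample sentence is made up for illustration; the demonstration itself may have called the Enhancer differently.

```python
from urllib.request import Request

# Assumption: a Stanbol launcher running locally with the Enhancer at /enhancer.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain text to the Enhancer."""
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # Ask for the enhancement graph serialized as RDF/XML.
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


# Hypothetical sample sentence mentioning a drug and an adverse reaction.
request = build_enhancer_request("Avandia may cause headache in some patients.")
```

Passing the request to urllib.request.urlopen would submit it to a running instance; the response body would then contain the enhancements as RDF.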
Semantic Indexing of Documents
As the content management system to be enhanced with semantic functionalities, we used Adobe's CRX, a JCR-compliant content management system. To simulate a content management environment working on health-related documents, we first populated the system with documents on topics such as cancer, diabetes and eye-related diseases.
The next step after populating the system with health-related documents is to index them semantically within Stanbol’s Contenthub. To do this, we create a Solr index using LDPath. To make the index compatible with the external datasets, we analyzed the properties that entities of these datasets can have. Using those properties, we wrote an LDPath program and submitted it through the semantic index management functionality of Contenthub, which created the corresponding Solr index. The Figure – Submitting an LDPath program shows the screen for LDPath submission. The index created with LDPath is used to index the documents managed within CRX.
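To give an idea of what such an LDPath program looks like, here is a minimal sketch that renders one from a set of prefixes and fields. The field names has_finding_site, rxnorm_label and art_label are the ones used later in this post; the namespaces and property paths are placeholders, since the real ones depend on how the datasets were converted to RDF. The xsd prefix is assumed to be predefined by LDPath.

```python
# Hypothetical namespaces: the real property URIs depend on how the
# BioPortal datasets were converted to RDF.
PREFIXES = {
    "snomed": "http://example.org/snomed#",
    "rxnorm": "http://example.org/rxnorm#",
    "art": "http://example.org/art#",
}

# Field names come from the post; the property paths are illustrative guesses.
FIELDS = {
    "has_finding_site": "snomed:has_finding_site",
    "rxnorm_label": "rxnorm:label",
    "art_label": "art:label",
}


def build_ldpath_program(prefixes, fields, datatype="xsd:string"):
    """Render prefix and field declarations in LDPath syntax."""
    lines = [f"@prefix {name} : <{uri}> ;" for name, uri in prefixes.items()]
    lines += [f"{field} = {path} :: {datatype} ;" for field, path in fields.items()]
    return "\n".join(lines)


program = build_ldpath_program(PREFIXES, FIELDS)
print(program)
```

Each field declaration becomes a field of the Solr index, which is why the same names show up later as search facets.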
After creating the semantic index, we submit the documents from CRX to Contenthub using the CMS Adapter component. As shown in the Figure – Structure of the CRX, the health-related articles are collected under the articles node. Considering this structure, we configure the CMS Adapter for document submission as shown in the Figure – Document submission to the Contenthub. As a result, all documents under the articles path are submitted to the Solr index named healthcare.
During the document submission process, as soon as the content arrives in the Contenthub, before any indexing operation, it is sent to Stanbol Enhancer and its enhancements are obtained. Enhancements obtained in RDF are stored in a triple store abstracted by Apache Clerezza. Enhancements of all of the documents are collected in a single RDF graph so that a SPARQL query can be executed considering all documents.
As soon as the content enhancement process is completed, Contenthub performs one last knowledge-gathering step. In this step, Contenthub takes the named entities recognized during content enhancement and requests additional knowledge for each of them by querying the Entityhub with the same LDPath program that was used to create the healthcare index. As a result, only the entity information relevant to this use case is obtained. Furthermore, the acquired information is fully compatible with the healthcare index.
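The effect of reusing the LDPath program can be sketched in a few lines. In the sketch below, a plain dictionary stands in for the Entityhub, and the entity URIs and property values are hypothetical; it is not Contenthub's actual implementation, only an illustration of keeping the fields that the index defines and dropping everything else.

```python
# Hypothetical stand-in for the Entityhub: entity URI -> all known properties.
ENTITY_STORE = {
    "urn:example:snomed:diabetic-neuropathy": {
        "has_finding_site": ["nerve_structure"],
        "internal_note": ["not part of the index"],
    },
    "urn:example:rxnorm:avandia": {
        "rxnorm_label": ["Avandia"],
        "internal_note": ["not part of the index"],
    },
}

# Fields declared by the LDPath program behind the healthcare index.
INDEX_FIELDS = ("has_finding_site", "rxnorm_label", "art_label")


def enrich(document_id, entity_uris, store=ENTITY_STORE, fields=INDEX_FIELDS):
    """Collect, for the recognized entities, only the values of indexed fields."""
    index_doc = {"id": document_id}
    for uri in entity_uris:
        properties = store.get(uri, {})
        for field in fields:
            for value in properties.get(field, []):
                values = index_doc.setdefault(field, [])
                if value not in values:
                    values.append(value)
    return index_doc


enriched = enrich(
    "doc-1",
    ["urn:example:snomed:diabetic-neuropathy", "urn:example:rxnorm:avandia"],
)
```

Properties outside the index fields, such as internal_note here, never reach the index document, which keeps the stored knowledge aligned with the fields of the healthcare index.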
At the end of the indexing process, the healthcare index, which was created considering the health specific properties, is filled with semantically meaningful information obtained from external RDF datasets. The additional information obtained for submitted documents will be used to provide semantic search functionalities for the documents.
Semantic Search over the Documents
In our demonstration, we use the indexed content and knowledge to apply faceted search, so that documents can be retrieved in a semantically meaningful way.
First, we start with a keyword search for diabetes to get all documents that contain this keyword. The results are depicted in the Figure – Search results for the diabetes keyword. In addition to the matching documents, the facets that apply to the results are presented on the left-hand side. Each facet lists its possible values together with the number of documents that match each value. The facets correspond to the fields defined in the LDPath program that was used to create the healthcare index.
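The facet counts shown next to the results can be pictured as a simple tally over the matching documents. The following sketch uses a small in-memory list of made-up documents instead of the real Solr index, so the values and counts are purely illustrative.

```python
from collections import Counter

# Hypothetical search results; each document carries multi-valued index fields.
RESULTS = [
    {"id": "doc-1", "has_finding_site": ["nerve_structure"], "rxnorm_label": ["Avandia"]},
    {"id": "doc-2", "has_finding_site": ["nerve_structure", "eye_structure"]},
    {"id": "doc-3", "rxnorm_label": ["Avandia"]},
]


def facet_counts(documents, field):
    """Count how many documents carry each value of the given field."""
    counts = Counter()
    for doc in documents:
        # Use a set so a document is counted once per distinct value.
        for value in set(doc.get(field, [])):
            counts[value] += 1
    return dict(counts)


finding_site_facet = facet_counts(RESULTS, "has_finding_site")
```

In the running system, Solr computes these counts itself as part of the search response.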
Figure – Search results for the diabetes keyword
Facets related to all three datasets can be seen in the Figure – Facets related with datasets used.
In the first step of the faceted search, we constrain the documents according to the finding site of diseases. For this operation, we use a field related to the SNOMED/CT dataset. Disease entities in SNOMED/CT have a property named has_finding_site, which indicates where in the body structure a disease is found. We choose the nerve_structure value of this facet. After this constraint, the remaining results are those that mention the nerve structure of the body as the finding site of diabetes or of any other related disease mentioned in the documents.
In the second step of the faceted search, we further constrain the documents according to a specific drug or medication. Accordingly, we use a facet related to the RxNORM dataset: we filter the results with the rxnorm_label facet and choose its Avandia value. The results are depicted in the Figure – Constrained search results for Avandia constraint.
In the last step of the faceted search, the results are constrained according to a specific adverse reaction by using a facet related to the Adverse Reaction Terminology dataset. This time, we choose the art_label facet and its Headache value. As a result, only a single document remains that satisfies all chosen facet constraints, as seen in the Figure – Constrained search results after selecting headache constraint.
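Against a plain Solr index, the three steps above would roughly translate into a keyword query with one filter query per chosen facet value. The sketch below builds such a query string with the standard Solr parameters q, fq, facet and facet.field. Whether Contenthub's search interface issues exactly these parameters is an assumption on our side; the field names and values are the ones used in this post.

```python
from urllib.parse import parse_qs, urlencode


def build_solr_params(keyword, constraints, facet_fields):
    """Build a Solr query string: keyword search, facets, and one fq per constraint."""
    params = [("q", keyword), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    # Each chosen facet value becomes its own filter query.
    params += [("fq", f'{field}:"{value}"') for field, value in constraints]
    return urlencode(params)


query_string = build_solr_params(
    "diabetes",
    [
        ("has_finding_site", "nerve_structure"),
        ("rxnorm_label", "Avandia"),
        ("art_label", "Headache"),
    ],
    ["has_finding_site", "rxnorm_label", "art_label"],
)

# Decode the query string again to inspect the individual parameters.
decoded = parse_qs(query_string)
```

Since every filter query narrows the result set independently, the constraints can be added or removed in any order.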
The remaining document in the search results is a diabetes-related document that mentions the nerve structure of the body as the finding site of diabetes or a related disease, Avandia as a specific drug/medicament, and Headache as an adverse reaction to Avandia or a related drug.
In this way, we have demonstrated a semantically meaningful flow of document filtering from the health domain perspective: first a finding site of a disease, then a specific drug/medicament for the disease, and lastly an adverse reaction to the chosen drug. It is also possible to navigate the documents along a different path by choosing the facet constraints in another order.