What is BookSpotter EE
As you can guess from its name, the BookSpotter Enhancement Engine by Sztaki tries to find book occurrences in the processed content. To do that, it currently relies on the downloadable data sets of the British National Bibliography (BNB) and The Open Library (OL); however, it could potentially use any catalog whose title list is downloadable. Once BookSpotter finds something, it inserts an entity into the analysis result that points to the corresponding catalog item on BNB or OL.
How BookSpotter Works
BookSpotter loads a list of titles (currently we are working with a set of 5.6 million titles) into memory, and each time a text comes in it runs through it, checking its tokens against this in-memory database. Once a title mention is found, an SQL database is consulted to get the authors of the work in question. If one or more authors appear to be present in the text, it adds a TextAnnotation and an EntityAnnotation for it, with a confidence value generated heuristically from the author occurrences.
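The matching step can be sketched roughly as follows. This is a minimal Python illustration, not the actual engine (which is a Java OSGi bundle); all names, the in-memory dictionaries standing in for the title index and the SQL lookup, and the confidence formula are assumptions made for the sketch.

```python
# Illustrative sketch of the BookSpotter matching pipeline.
# The real engine is a Java OSGi bundle; everything here is hypothetical.

# In-memory title index: title -> work id (loaded once at startup)
TITLES = {
    "Brave New World": "work:1",
    "The Trial": "work:2",
}

# Stand-in for the SQL lookup of a work's authors
AUTHORS = {
    "work:1": ["Aldous Huxley"],
    "work:2": ["Franz Kafka"],
}

def spot_books(text):
    """Return (title, work_id, confidence) for each spotted title
    whose author also appears to be present in the text."""
    hits = []
    for title, work_id in TITLES.items():
        if title not in text:
            continue
        # Require at least part of an author name to be present, too
        author_tokens = [tok for a in AUTHORS[work_id] for tok in a.split()]
        present = [tok for tok in author_tokens if tok in text]
        if present:
            # Toy heuristic: confidence grows with author-token coverage
            confidence = len(present) / len(author_tokens)
            hits.append((title, work_id, confidence))
    return hits

text = "Huxley's Brave New World is often compared with Kafka's work."
print(spot_books(text))  # [('Brave New World', 'work:1', 0.5)]
```

In the real system each hit would become a TextAnnotation plus an EntityAnnotation pointing at the BNB or OL record, with the heuristic score as the confidence value.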
Let me share some considerations that led me to choose this architecture.
- The most frequently asked question is why the tool does not rely on the library APIs that many great libraries provide. The reason is that it would involve submitting too many search queries, which in turn would put too much pressure on the library systems. If we consider a typical portion of text of 1000-5000 words, we get approximately 10-50 phrases that could easily be book titles as far as the algorithm is concerned. Given our experience that you cannot expect better figures than 1-3 queries/sec from a library system, you would get about 30 seconds of processing time for a single text, and that only gets worse as the number of users increases. Of course, the number of candidates could in theory be cut back to a much smaller number, but that involves either restrictive heuristics (e.g. requiring certain verbs such as 'write' to be present) or language-dependent NLP. We have experimented with those in Sztakipedia, but without much luck. This doesn't mean that smart solutions to this problem won't be discovered in the future, but I am sure it will take a larger research effort.
- You may notice that, according to the algorithm sketched above, a book will only be spotted if at least part of the author string is present, too. Yes, there are plenty of cases where the author is not there, only the title. On the other hand, the majority of everyday words are also titles: if you look up the words of the previous sentence I've just written, you will get the idea. Therefore, this restriction is needed.
- We also rely on the capitalization (upper/lower case) of the words, but of course that is also language-dependent; in German, for example, it won't help a lot.
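To give a rough idea of the capitalization heuristic, here is a toy candidate extractor that keeps only runs of capitalized words. This is purely illustrative (the real engine matches tokens against its in-memory index), and it demonstrates the language-dependence problem: in German, where all nouns are capitalized, such a filter discards far less.

```python
import re

def title_candidates(text):
    """Very rough candidate extraction: runs of capitalized words.
    Illustrative only; the real engine works on its title index.
    Note this heuristic is weak for languages like German, where
    every noun is capitalized."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

print(title_candidates("I finished reading Brave New World yesterday."))
# ['Brave New World']
```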
Book Selection Experiments
Overall, the two data sets together contain more than 47 million titles. Of course, there are some errors and redundancies, some titles that are too specific and unlikely to be spotted for one reason or another, and a strong need to reduce the size of the data set.
The Early Adopter Programme (EAP) allowed me to hire some summer work hours from TU Budapest's Student Work Association for the manual evaluation of 5000 book records, from which we could get a picture of the whole data set. Together with the really helpful student, we learnt that the following problems need our attention:
- Extra things in the title
  - Author string in the title: …by SomeBody …/SomeBody, …(SomeBody)
  - Info about the edition in the title
  - Extra characters/encoding problems
  - Year of publication, place of publication, or affiliation in the title string
  - Special characters in the string
  - Very rarely, chapter titles or even complete sentences in the title string
- Things missing from the record
  - No author
  - The title is only a number (e.g. 1)
  - Unclosed parentheses in the title
  - Title appears unfinished
After filtering out the problematic records we had a smaller data set, which still contained a lot of entries we felt should be omitted. Think of titles that describe the statistics of national agriculture in a certain quarter of a certain year, or the log of a certain day of a certain parliament issued as a volume. True, one day someone might refer to just these items. However, since we needed a still smaller data set, we performed a further filtering step. After manually investigating lots of records, we capped the length of titles at 6 words and left out the rest. This reflects our admittedly questionable perception that titles longer than this are rarely discussed on the internet (of course, there are sad exceptions that have been filtered out this way). This is how we got the 5.6 million titles.
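The filtering rules above can be distilled into a small predicate. This is an illustrative Python sketch, not the actual cleanup code; the record fields and the exact checks are assumptions based on the problem list above.

```python
def keep_title(record):
    """Filtering rules distilled from the manual evaluation.
    Illustrative only; field names and rules are hypothetical."""
    title, authors = record["title"], record["authors"]
    if not authors:                             # no author: unusable for matching
        return False
    if title.strip().isdigit():                 # title is only a number, e.g. "1"
        return False
    if title.count("(") != title.count(")"):    # unclosed parentheses
        return False
    if len(title.split()) > 6:                  # cap title length at six words
        return False
    return True

records = [
    {"title": "Brave New World", "authors": ["Aldous Huxley"]},
    {"title": "1", "authors": ["Anonymous"]},
    {"title": "Report on the third quarter of national agriculture", "authors": ["X"]},
]
print([r["title"] for r in records if keep_title(r)])  # ['Brave New World']
```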
For the functional evaluation we used 25 texts containing book descriptions, collected from the internet.
You can find our test suite and the results here: http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf
These experiments show that the system is able to give good hits, but also false positives. Further filtering of the results cut back the false positives but also the good hits, so we decided to live with more false positives rather than throw out too many good results. (Naturally, the size of the test set is too small anyway; it was only intended for functional testing.)
This reflects our belief that this system (and in fact Stanbol in general) should be used interactively, relying on user confirmation. Subjectively, the system feels quite good: you will mostly get some good results for your text, although sometimes seemingly important titles are missing (e.g. 1984, which fell victim to a heuristic for filtering out the many, many yearbooks).
Trying it out
You can try BookSpotter out on our Stanbol instance.
Resources and Installation
Every deliverable of this work can be found here: http://pedia2.sztaki.hu/stanbol/
First, you will need the data set in SQL:
- http://pedia2.sztaki.hu/stanbol/bookspotter/bookspotter.sql.tar.gz
This contains a database with three tables:
- work identifiers, titles
- author identifiers, names
- author-work relations
This database is used to check whether the author candidates and the title candidates match. Altogether it contains about 13 million entries.
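The author-lookup step against these three tables might look like the following sketch. The table and column names are hypothetical (check the dump for the real schema), and sqlite3 stands in here for the MySQL database the dump actually targets.

```python
import sqlite3

# Stand-in schema for the three tables; names are hypothetical and
# sqlite3 replaces the real MySQL database for this self-contained demo.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE works       (work_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE authors     (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE author_work (author_id INTEGER, work_id INTEGER);
INSERT INTO works VALUES (1, 'Brave New World');
INSERT INTO authors VALUES (1, 'Aldous Huxley');
INSERT INTO author_work VALUES (1, 1);
""")

def authors_of(work_id):
    """Fetch the author names for a spotted title's work id,
    so they can be searched for in the input text."""
    rows = db.execute(
        "SELECT a.name FROM authors a "
        "JOIN author_work aw ON aw.author_id = a.author_id "
        "WHERE aw.work_id = ?", (work_id,))
    return [name for (name,) in rows]

print(authors_of(1))  # ['Aldous Huxley']
```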
On top of this, we also need a titles file:
This contains one work id/title pair per line. Why duplicate data that is already in the SQL database? Because all the titles will be loaded into memory, and that cannot be done directly from the SQL database without putting really serious stress on the database server.
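Loading that file into memory could look like this sketch. The tab separator and the id-first column order are assumptions; check the actual file before relying on them.

```python
def load_titles(path):
    """Load the one-pair-per-line titles file into a dict
    (title -> work id). A tab separator and id-first order
    are assumed here; the real file format may differ."""
    titles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            work_id, title = line.rstrip("\n").split("\t", 1)
            titles[title] = work_id
    return titles
```

With 5.6 million titles this dictionary is large but still fits comfortably in the memory of a typical server, which is exactly why the file exists alongside the SQL dump.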
Finally, there are three OSGi bundles:
- BookSpotterLib 0.5.2: http://pedia2.sztaki.hu/stanbol/bundles/BookSpotterLib-0.5.2.jar. This library does the main work; the BookSpotter bundle only wraps it as a Stanbol Enhancer Engine. (There is also an alternate, standalone, REST-enabled wrapper for the system.)
- CasLight 0.5.0: http://pedia2.sztaki.hu/stanbol/bundles/CasLight-0.5.0.jar. This is a really lightweight UIMA-like CAS implementation, used by the system internally. (Note: it is standalone, uses plain Java objects, and involves no UIMA dependency.)
- BookSpotter 0.5.0: http://pedia2.sztaki.hu/stanbol/bundles/BookSpotter-0.5.0.jar. This contains the Enhancer Engines.
If you want your own instance of BookSpotter, you can install it like this:
- Download the SQL dump and insert it into a MySQL database instance.
- Download the titles file and put it in a location on the server that Stanbol can reach.
- Install the CasLight, BookSpotterLib and BookSpotter bundles, plus the MySQL connector jar, into Stanbol through its system console.
- Configure BookSpotter with the database access information and the path to the book titles file.
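As a rough idea of what that configuration covers, the fragment below lists the kinds of values you will need to supply. The property names here are purely illustrative; the actual keys appear on the BookSpotter configuration screen in the Stanbol system console.

```
# Illustrative only – the real property keys are shown on the
# BookSpotter configuration screen in the Stanbol system console.
database.url      = jdbc:mysql://localhost:3306/bookspotter
database.user     = stanbol
database.password = secret
titles.file       = /opt/stanbol/data/titles.txt
```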
The following screenshot shows a sample configuration:
I hope you will enjoy using BookSpotter.