Monday, April 11, 2011

Named Entity Extraction with Java

In February I posted on embedding Weka in a Java application, based on content from this book by Mark Watson.

Recently I have been doing some work that requires entity extraction, and I found another great tool and example set in the same book.

Here is some simple code that uses the classes provided:

package com.irodata.entity_extraction;

import com.markwatson.nlp.propernames.Names;

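public class EntityExtraction {
    public static void main(String[] args) {
        // Names loads its lookup tables from the serialized data file shipped with the book's examples
        Names names = new Names();
        // A place name the lookup should recognize
        System.out.println("Hello World, Is New York a real place?");
        System.out.println("New York: " + names.isPlaceName("New York"));
        // A fictional place the lookup should reject
        System.out.println("Hello World, Is Oz a real place?");
        System.out.println("Oz: " + names.isPlaceName("Oz"));
    }
}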

The output is this:

Hello World, Is New York a real place?
New York: true
Hello World, Is Oz a real place?
Oz: false

I am going to use an expansion of this to identify places and proper names in some database text data to assist with analysis. If anything good and simple (and therefore appropriate for this blog) turns up, I will be sure to share. For now, all I can say is that this is a good set of classes if you need to do some quick text work in Java.
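To give a rough idea of what such an expansion might look like, here is a minimal sketch that scans free text for one- and two-word phrases and keeps the ones a lookup accepts. The class name PlaceScan, the method findPlaces, and the small demo set are all made up for illustration; the demo set only stands in for the book's lookup so the example runs on its own. Assuming isPlaceName takes a String and returns a boolean, as the code above suggests, names::isPlaceName could probably be passed in its place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class PlaceScan {
    // Returns every one- or two-word phrase in the text that the lookup accepts.
    // Two-word matches are tried first so that "New York" is not split up.
    static List<String> findPlaces(String text, Predicate<String> isPlace) {
        String[] words = text.split("[^\\p{L}]+"); // split on anything that is not a letter
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (words[i].isEmpty()) { // text started with punctuation or whitespace
                i++;
                continue;
            }
            if (i + 1 < words.length) {
                String pair = words[i] + " " + words[i + 1];
                if (isPlace.test(pair)) {
                    found.add(pair);
                    i += 2;
                    continue;
                }
            }
            if (isPlace.test(words[i])) {
                found.add(words[i]);
            }
            i++;
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real lookup, for illustration only
        Set<String> demoPlaces = new HashSet<>(Arrays.asList("New York", "Boston"));
        System.out.println(findPlaces("Is New York farther from Oz than Boston?", demoPlaces::contains));
    }
}
```

This is deliberately naive: it ignores sentence boundaries and never looks at phrases longer than two words, so treat it as a starting point rather than a finished extractor.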

I only had one small gotcha that I should mention for anyone who wants to try this out:

The text says this:

The “secret sauce” for identifying names and places in text is the data in the file test data/propername.ser – a serialized Java data file containing hash tables for human and place names.

When I first tried to build a test implementation, I kept getting this error: data/propername/propername.ser (No such file or directory)
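A relative path like the one in that error is resolved against the JVM's working directory, which for a default Eclipse launch is usually the project root. The small sketch below is not from the book; the class name WhereIsMyFile and the method describe are made up for illustration. It prints the working directory and the absolute location that the relative path resolves to, which makes it easier to see where the file needs to be.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WhereIsMyFile {
    // Shows the absolute location a (possibly relative) path resolves to,
    // and whether anything exists there.
    static String describe(String relativePath) {
        Path resolved = Paths.get(relativePath).toAbsolutePath().normalize();
        return resolved + (Files.exists(resolved) ? " (found)" : " (missing)");
    }

    public static void main(String[] args) {
        // Relative paths are resolved against this directory
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        // The path is copied from the error message above
        System.out.println(describe("data/propername/propername.ser"));
    }
}
```

Running this from the same launch configuration as the failing program should show exactly which directory the data folder has to sit in.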

After taking a look at the Names class, it appears that the location of the serialized data file is hard-coded. It is possible that I am missing a simple Java convention, but the easiest solution for me was to copy the .ser file to the location where Names was looking for it (/data/propername/) and then import this file system resource into the Eclipse project. There may be a better way to do this; if you know of one, please send me an email and I will update this post and give you credit. Otherwise, this worked. The resulting project looks like this: