Apache UIMA Example Wrappers for the OpenNLP Tools

Introduction

OpenNLP Tools is an open source package of natural language processing components written in pure Java. The tools are based on Adwait Ratnaparkhi's Ph.D. dissertation (UPenn, 1998), which shows how to apply Maximum Entropy models to various language ambiguity problems. The OpenNLP Tools rely on the OpenNLP MAXENT package, a mature Java package for training and using maximum entropy models.

The OpenNLP Tools package (as of Version 1.3) includes a sentence detector, tokenizer, part-of-speech tagger, noun phrase chunker, shallow parser, named entity detector, and co-reference resolver. Together, these tools provide a rich and powerful set of text analysis capabilities.

The Apache UIMA Example Wrappers for OpenNLP provides UIMA annotators for most of the OpenNLP Tools components, allowing you to run the OpenNLP Tools as UIMA annotators. The wrapper annotators were written to be very simple examples of how pre-existing analysis components can be deployed using the UIMA framework. The wrappers provide a thin layer over the OpenNLP classes and use the "outermost" APIs to those classes. As such, most of the work performed by the wrappers involves translating the contents of the CAS (i.e., the document and any annotations) into the input format required by the OpenNLP API, then translating the result returned by the OpenNLP API into new annotations in the CAS.
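
The translate, call, translate-back pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions: SpanSketch, WrapperPatternSketch, and the naive punctuation-based splitter are hypothetical stand-ins, not the UIMA CAS or the OpenNLP API. A real wrapper would read the document from the JCas, call the OpenNLP component, and add annotation objects to the CAS indexes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an annotation in the CAS: a typed span over the document text.
final class SpanSketch {
    final String type;
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset

    SpanSketch(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class WrapperPatternSketch {

    // Stand-in for the wrapped component (NOT the OpenNLP detector):
    // returns the exclusive end offset of each sentence, splitting after '.', '!' or '?'.
    static List<Integer> naiveSentenceEnds(String text) {
        List<Integer> ends = new ArrayList<Integer>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                ends.add(i + 1);
            }
        }
        // Treat any trailing text without terminal punctuation as a final sentence.
        if (ends.isEmpty() || ends.get(ends.size() - 1) < text.length()) {
            ends.add(text.length());
        }
        return ends;
    }

    // The wrapper step: pass the document text to the component, then translate
    // the offsets it returns into "Sentence" spans, trimming surrounding whitespace.
    static List<SpanSketch> annotateSentences(String documentText) {
        List<SpanSketch> annotations = new ArrayList<SpanSketch>();
        int begin = 0;
        for (int end : naiveSentenceEnds(documentText)) {
            int b = begin;
            int e = end;
            while (b < e && Character.isWhitespace(documentText.charAt(b))) {
                b++;
            }
            while (e > b && Character.isWhitespace(documentText.charAt(e - 1))) {
                e--;
            }
            if (b < e) {
                annotations.add(new SpanSketch("Sentence", b, e));
            }
            begin = end;
        }
        return annotations;
    }
}
```

For the text "Hello world. How are you?" this sketch yields two spans, [0, 12) and [13, 25). Most of the code is bookkeeping around offsets, which matches the observation that the wrappers spend most of their effort on translation.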

The wrappers are not meant to represent an optimal integration of the OpenNLP Tools into the UIMA framework. In fact, it is quite likely that a more efficient integration could be achieved, e.g., by moving some of the OpenNLP data structures into the CAS and avoiding much of the copying and translating performed by the current implementation.

This version of the example wrappers requires version 1.3.0 of the OpenNLP Tools and only supports the English version of the tools (and, correspondingly, the English version of the models).

The rest of this Readme will show you how to compile and use the OpenNLP Wrappers.

Prerequisites

To get started, you need to download OpenNLP Tools V1.3.0 from SourceForge.net, compile the OpenNLP Tools package, create or download from SourceForge.net the model files for the components you wish to run, and finally compile the UIMA Wrappers for OpenNLP.

1. Download OpenNLP Tools

opennlp.sourceforge.net

opennlp-tools-1.3.0.tgz

2. Compile OpenNLP Tools

opennlp-tools-1.3.0.jar

3. Download the Model files

opennlp.sourceforge.net

4. Compile the UIMA Wrappers for OpenNLP

UIMA SDK User's Guide and Reference

Open the Properties dialog for the uima_examples project. You can either right-click the example project and select "Properties" from the menu, or select (highlight) the example project and then click "Project->Properties" from the main menu.
Click on "Java Build Path" to open the build path panel.
Click on the "Source" tab to see the source folders on the build path.
Click "Add Folder..." and add "opennlp_wrappers/src" to the source folders build path.

maxent-2.4.0.jar
trove.jar
opennlp-tools-1.3.0.jar

At this point, your wrappers should compile and you are now ready to run the OpenNLP Tools as UIMA Annotators.
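
If the wrappers compile but fail at run time with missing classes, one way to narrow the problem down is to check whether the three jar files listed above appear on the runtime classpath. The following is a minimal sketch; ClasspathCheckSketch and missingJars are hypothetical names, and the check only compares file names, so it assumes the jars have not been renamed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ClasspathCheckSketch {

    // Jar file names taken from the build path list in this Readme.
    static final String[] REQUIRED_JARS = {
        "maxent-2.4.0.jar", "trove.jar", "opennlp-tools-1.3.0.jar"
    };

    // Returns the required jar names that are not the file name of any classpath entry.
    static List<String> missingJars(String classpath, String pathSeparator) {
        String[] entries = classpath.split(Pattern.quote(pathSeparator));
        List<String> missing = new ArrayList<String>();
        for (String jar : REQUIRED_JARS) {
            boolean found = false;
            for (String entry : entries) {
                if (new File(entry).getName().equals(jar)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Inspect the classpath of the running JVM.
        String classpath = System.getProperty("java.class.path");
        System.out.println("Missing jars: " + missingJars(classpath, File.pathSeparator));
    }
}
```

Running the main method from the same Eclipse project should show which entries, if any, did not make it from the build path onto the classpath of the launched JVM.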

Quick Test

For a quick test, open the descriptor file for the sentence detector wrapper

opennlp_wrappers/descriptors/OpenNLPSentenceDetector.xml

using the Component Descriptor Editor plugin for Eclipse (see Chapter 8 of the UIMA SDK User's Guide and Reference). Click on the "Parameter Settings" tab and set the value of the "ModelFile" parameter to point to the English sentence detector model you downloaded in the "Download the Model files" step above, e.g.:

C:\opennlp-models-1.3.0\english\sentdetect\EnglishSD.bin.gz

Save the descriptor. Start the UIMA Document Analyzer from Eclipse as described in Chapter 12 of the UIMA SDK User's Guide and Reference. Set the Input and Output directories as shown in Section 12.2. For the Location of TAE XML Descriptor, specify:

opennlp_wrappers/descriptors/OpenNLPSentenceDetector.xml

Note that the opennlp_wrappers folder is in the examples folder of the UIMA SDK. Leave the remaining input fields alone and press "Run". This will run the OpenNLP sentence detector on the UIMA SDK sample data.

Double click on a document in the results list to bring up the Java annotation viewer. You should see Sentence annotations (though since the spans are contiguous, it may appear that an entire paragraph is highlighted). Click on a Sentence annotation to see the annotation details in the right-hand pane. When you expand the details, you should see reasonable begin and end values.

Using the Example Wrappers

The OpenNLP Example Wrappers package includes source code for the wrapper annotator classes, source code for the JCasGen-generated type classes, and descriptor files for the analysis engines and type system.

The source code is in "opennlp_wrappers/src", which you should now be somewhat familiar with after following the instructions in the previous section to compile the code. The Analysis Engine descriptors are in "opennlp_wrappers/descriptors".

The following table summarizes the wrapper annotator classes and their corresponding descriptor files (note that all of the wrapper annotators are in the org.apache.uima.examples.opennlp.annotator package):

Java Class	Descriptor File	Description
NEDetector.java	OpenNLPNEDetector.xml	Named entity detector (called name finder in OpenNLP)
Parser.java	OpenNLPParser.xml	Shallow parser
POSTagger.java	OpenNLPPOSTagger.xml	Part-of-speech tagger
SentenceDetector.java	OpenNLPSentenceDetector.xml	Sentence detector
Tokenizer.java	OpenNLPTokenizer.xml	Tokenizer

The descriptors folder also contains an aggregate analysis engine descriptor, OpenNLPAggregate.xml, which can be used to run one or more wrapper components.
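
Because each wrapper lists required input annotations (see the Inputs of each component section in this Readme), the order of components in an aggregate matters. The sketch below checks a proposed order against those requirements. PipelineOrderSketch and firstProblem are hypothetical names; the requirement table is transcribed from this Readme, not read from the descriptors.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PipelineOrderSketch {

    private static final Map<String, List<String>> REQUIRES = new HashMap<String, List<String>>();
    private static final Map<String, List<String>> PRODUCES = new HashMap<String, List<String>>();

    private static void declare(String component, List<String> requires, List<String> produces) {
        REQUIRES.put(component, requires);
        PRODUCES.put(component, produces);
    }

    static {
        // Inputs and outputs as documented in the component sections of this Readme.
        List<String> none = Collections.<String>emptyList();
        declare("OpenNLPSentenceDetector", none, Arrays.asList("Sentence"));
        declare("OpenNLPTokenizer", Arrays.asList("Sentence"), Arrays.asList("Token"));
        // The POS tagger only updates Token.posTag, so it adds no new annotation type.
        declare("OpenNLPPOSTagger", Arrays.asList("Sentence", "Token"), none);
        declare("OpenNLPNEDetector", Arrays.asList("Sentence", "Token"), Arrays.asList("EntityAnnotation"));
        declare("OpenNLPParser", Arrays.asList("Sentence", "Token"), Arrays.asList("Phrase", "Clause"));
    }

    // Returns a description of the first unmet requirement, or null if the order is valid.
    static String firstProblem(List<String> order) {
        Set<String> available = new HashSet<String>();
        for (String component : order) {
            List<String> needs = REQUIRES.get(component);
            if (needs == null) {
                return "unknown component: " + component;
            }
            for (String type : needs) {
                if (!available.contains(type)) {
                    return component + " requires " + type;
                }
            }
            available.addAll(PRODUCES.get(component));
        }
        return null;
    }
}
```

For example, an order that starts with OpenNLPTokenizer is reported as "OpenNLPTokenizer requires Sentence", while sentence detector, tokenizer, POS tagger in that order passes the check.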

The type system descriptor, OpenNLPExampleTypes.xml, can be found in the org.apache.uima.examples.opennlp package in the "src" folder. The type system descriptor is located here so that the analysis engine descriptors can import it by name.

All of the annotators use the JCas interface to the CAS, so JCasGen has been run on the type system. All of the JCasGen-generated type classes are in the org.apache.uima.examples.opennlp package.

OpenNLP Wrapper Type System

The OpenNLP Wrapper type system defines UIMA annotation types for the various annotations produced by each of the OpenNLP Tools components. You can view the type system in detail by using the Component Descriptor Editor plug-in for Eclipse and loading the type system descriptor.

All of the types reside in the org.apache.uima.examples.opennlp namespace. The types are summarized in this table:

Sentence	Spans a sentence, produced by OpenNLPSentenceDetector.
Token	Spans a token, produced by OpenNLPTokenizer. If OpenNLPPOSTagger has been run, then the posTag field of the Token will contain the part-of-speech tag.
Person	Spans a Person entity, produced by OpenNLPNEDetector.
Organization	Spans an Organization entity, produced by OpenNLPNEDetector.
Time	Spans a Time entity, produced by OpenNLPNEDetector.
Date	Spans a Date entity, produced by OpenNLPNEDetector.
Location	Spans a Location entity, produced by OpenNLPNEDetector.
Percentage	Spans a Percentage entity, produced by OpenNLPNEDetector.
Money	Spans a Money entity, produced by OpenNLPNEDetector.
Clause	Supertype for all of the Clause annotations produced by OpenNLPParser.
Phrase	Supertype for all of the Phrase annotations produced by OpenNLPParser.

OpenNLPSentenceDetector

The OpenNLPSentenceDetector detects sentence boundaries and creates Sentence annotations that span these boundaries. The sentence detection is performed by opennlp.tools.lang.english.SentenceDetector.

Inputs
- none - The analysis engine operates directly on the document in the CAS
Outputs
- Sentence - one Sentence annotation for each detected sentence in the document.
Parameters

Name	Type	Description
ModelFile	String	Path to the OpenNLP model file for the English sentence detector

OpenNLPTokenizer

The OpenNLPTokenizer tokenizes the text and creates token annotations that span the tokens. The tokenization is performed with opennlp.tools.lang.english.Tokenizer, which tokenizes according to the Penn Tree Bank tokenization standard. In general, tokens are separated by white space, but punctuation marks (e.g., ".", ",", "!", "?") and apostrophe endings (e.g., "'s", "n't") are separate tokens.
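
To make the tokenization convention concrete, here is a rough rule-based approximation in plain Java. It is not the OpenNLP tokenizer, which uses a trained maximum entropy model; TreebankStyleTokenizerSketch is a hypothetical name, and the regular expression covers only whitespace splitting, single punctuation marks, and a handful of English clitic endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TreebankStyleTokenizerSketch {

    // Clitic endings that become their own token (an illustrative subset).
    private static final String CLITIC = "n't\\b|'(?:s|re|ve|ll|d|m)\\b";

    // A token is, in order of preference: a clitic ending, a word stem that stops
    // right before a clitic ending, a plain run of word characters, or any single
    // non-whitespace character (punctuation).
    private static final Pattern TOKEN = Pattern.compile(
        CLITIC + "|\\w+?(?=" + CLITIC + ")|\\w+|\\S",
        Pattern.CASE_INSENSITIVE);

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this sketch, "Don't stop, it's Bob's!" becomes the tokens Do, n't, stop, ",", it, 's, Bob, 's, "!". Cases such as abbreviations and decimal numbers are deliberately not handled, which is one reason a trained model is used in practice.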

Inputs
- Sentence - The analysis engine requires Sentence annotations in the CAS
Outputs
- Token - one Token annotation for each detected token in the document.
Parameters

Name	Type	Description
ModelFile	String	Path to the OpenNLP model file for the English tokenizer

OpenNLPPOSTagger

The OpenNLPPOSTagger assigns part-of-speech tags to tokens using opennlp.tools.lang.english.PosTagger. This annotator requires that sentence and token annotations have been created in the CAS. The annotator updates the posTag field of each token annotation with the part-of-speech tag.

Inputs
- Sentence - The analysis engine requires Sentence annotations in the CAS
- Token - The analysis engine requires Token annotations in the CAS
Outputs
- Token.posTag - the posTag field in each Token annotation is updated with the part-of-speech tag for the corresponding word.

Parameters

Name	Type	Description
ModelFile	String	Path to the OpenNLP model file for the English POS tagger. Note that as of OpenNLP Tools 1.3.0, the POS tagger model file can be found in the parser model files folder.

OpenNLPNEDetector

The OpenNLPNEDetector detects named entities in the text and creates corresponding entity annotations that span the found entities. The annotator uses opennlp.tools.lang.english.NameFinder, instantiating one NameFinder for each entity class to be detected. Each entity class has a separate MaxEnt model file. All model files must be stored in a single model file directory and use the following naming convention: "class.bin.gz", where "class" is the entity class name and ".bin.gz" must appear as shown, e.g., "person.bin.gz".
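
The naming convention can be illustrated with a small helper that derives entity class names from the files in a model directory. This is a sketch only; ModelFileNamingSketch and its methods are hypothetical and do not reproduce the wrapper's actual loading code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class ModelFileNamingSketch {

    static final String SUFFIX = ".bin.gz";

    // Returns the entity class encoded in a model file name such as "person.bin.gz",
    // or null if the name does not follow the "class.bin.gz" convention.
    static String entityClassOf(String fileName) {
        if (fileName.endsWith(SUFFIX) && fileName.length() > SUFFIX.length()) {
            return fileName.substring(0, fileName.length() - SUFFIX.length());
        }
        return null;
    }

    // Lists the entity classes for which a model file exists in the given directory.
    static List<String> entityClassesIn(File modelDirectory) {
        List<String> classes = new ArrayList<String>();
        File[] files = modelDirectory.listFiles();
        if (files == null) {
            return classes; // not a directory, or not readable
        }
        for (File f : files) {
            String entityClass = f.isFile() ? entityClassOf(f.getName()) : null;
            if (entityClass != null) {
                classes.add(entityClass);
            }
        }
        return classes;
    }
}
```

A helper like this makes it easy to confirm that every entity class you intend to map has a matching model file before starting the analysis engine.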

This analysis engine takes a parameter called "EntityTypeMappings" which maps each entity class name to an entity annotation type. The entity class name must match a model file in the model file directory, and the entity annotation type must be defined in the type system and have a corresponding JCas Java class. This allows the actual annotation types produced by the analysis engine to be specified as a run-time parameter.

Inputs
- Sentence - The analysis engine requires Sentence annotations in the CAS
- Token - The analysis engine requires Token annotations in the CAS
Outputs
- EntityAnnotation - The analysis engine creates an EntityAnnotation for each entity detected in the document. The actual annotation is typically a sub-type of EntityAnnotation specialized for the particular entity class found, e.g., Person, Organization, etc. See the EntityTypeMappings parameter for more details.

Parameters

Name	Type	Description
ModelDirectory	String	Path to the directory that contains the OpenNLP model files for the English name finder. All model files must be stored in a single model file directory and use the following naming convention: "class.bin.gz", where "class" is the entity class name and ".bin.gz" must appear as shown, e.g., "person.bin.gz".
EntityTypeMappings	String Array	Mapping from entity names (obtained from the model filename) to the JCas class for the corresponding annotation. Each mapping string is of the form "name,class", i.e., the entity type name followed by a comma followed by the annotation class.
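
The "name,class" mapping format is simple enough to show in a few lines. The sketch below parses an array of mapping strings into a lookup table; TypeMappingSketch and parseMappings are hypothetical names, and the error handling is an assumption rather than the wrapper's documented behavior. The same format applies to the "tag,class" strings used by the parser wrapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TypeMappingSketch {

    // Parses strings of the form "name,class" into a map from name to class name,
    // preserving the order in which the mappings were given.
    static Map<String, String> parseMappings(String[] mappings) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (String mapping : mappings) {
            int comma = mapping.indexOf(',');
            // Reject entries with no comma, an empty name, or an empty class.
            if (comma <= 0 || comma == mapping.length() - 1) {
                throw new IllegalArgumentException(
                    "Expected \"name,class\" but got: " + mapping);
            }
            String name = mapping.substring(0, comma).trim();
            String className = mapping.substring(comma + 1).trim();
            result.put(name, className);
        }
        return result;
    }
}
```

For instance, the mapping string "person,org.apache.uima.examples.opennlp.Person" associates the model file person.bin.gz with the Person annotation type from the wrapper type system.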

OpenNLPParser

The OpenNLPParser parses the document and creates phrasal and clausal annotations over the text using opennlp.tools.lang.english.TreebankParser.

This analysis engine takes a parameter called "ParseTagMappings" which maps each parse tag to a syntax annotation type. The parse tags come from the standard Penn Tree Bank phrase and clause tags (produced by the OpenNLP parser), and each syntax annotation type must be defined in the type system and have a corresponding JCas Java class.

Inputs
- Sentence - The analysis engine requires Sentence annotations in the CAS
- Token - The analysis engine requires Token annotations in the CAS
Outputs
- Phrase - The analysis engine creates a Phrase for each phrase tag produced by the TreebankParser. The actual annotations created are sub-types of Phrase, specific to the actual phrase tag. See the ParseTagMappings parameter for more details.
- Clause - The analysis engine creates a Clause for each clause tag produced by the TreebankParser. The actual annotations created are sub-types of Clause, specific to the actual clause tag. See the ParseTagMappings parameter for more details.

Parameters

Name	Type	Description
ModelDirectory	String	Path to the directory that contains the OpenNLP model files for the English parser.
UseTagDictionary	Boolean	Flag indicating whether or not to use the tag dictionary
CaseSensitiveTagDictionary	Boolean	Flag indicating whether or not the tag dictionary is case sensitive
BeamSize	Integer	The beam size for the parse search
AdvancePercentage	Float	The probability mass percentage threshold for advancing outcomes
ParseTagMappings	String Array	Mapping from parse result tags produced by the TreebankParser to the JCas class for the corresponding annotation. Each mapping string is of the form "tag,class", i.e., the tag name followed by a comma followed by the annotation class name.
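
The AdvancePercentage parameter can be read as a probability mass cutoff: at each decision, only the most probable outcomes are explored, up to the point where their combined probability reaches the threshold. The sketch below illustrates that reading. It is an assumption about how the threshold behaves, written with the hypothetical names AdvanceThresholdSketch and outcomesToAdvance, and it is not taken from the OpenNLP parser source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AdvanceThresholdSketch {

    // Returns the indices of the most probable outcomes, in descending order of
    // probability, stopping once their cumulative probability reaches the threshold.
    static List<Integer> outcomesToAdvance(final double[] probs, double threshold) {
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < probs.length; i++) {
            order.add(i);
        }
        // Sort outcome indices from most to least probable.
        Collections.sort(order, (a, b) -> Double.compare(probs[b], probs[a]));

        List<Integer> advanced = new ArrayList<Integer>();
        double mass = 0.0;
        for (int index : order) {
            if (mass >= threshold) {
                break;
            }
            advanced.add(index);
            mass += probs[index];
        }
        return advanced;
    }
}
```

With outcome probabilities 0.1, 0.6, and 0.3, a threshold of 0.5 advances only the second outcome, while a threshold of 0.8 advances the second and third. A higher threshold therefore explores more alternatives at a higher cost in time and memory; BeamSize separately limits how many partial parses are kept.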

Tips and Traps

The OpenNLP Tools can require a lot of Java heap memory, especially if you run multiple annotators simultaneously. You'll likely want to increase your maximum heap size with the -Xmx<size> command line argument to the JVM. Try -Xmx1024M just to be safe. If you are using an Eclipse run configuration for the UIMA SDK tools (Document Analyzer and CPE Configurator), you can specify this VM argument on the "Arguments" tab of the run configuration.
The jar files that come with the OpenNLP Tools package may have been compiled with Java 1.5. Although you can compile the UIMA wrappers with Java 1.4, if you try to run your UIMA application (e.g., the Document Analyzer) with Java 1.4 and you get a "java.lang.UnsupportedClassVersionError: ... (Unsupported major.minor version 49.0)", try running your application with Java 1.5.
To train new models for the OpenNLP components, see the README file distributed with the OpenNLP Tools package.
Note that OpenNLPTokenizer requires Sentence annotations, and OpenNLPPOSTagger, OpenNLPNEDetector, and OpenNLPParser require Sentence and Token annotations, so in most cases you will be running an aggregate that minimally includes OpenNLPSentenceDetector and OpenNLPTokenizer.
The models for the OpenNLP name finder and parser were created using a tokenization produced by the OpenNLP tokenizer. If you use a different sentence detector and tokenizer that produce a tokenization different from the Penn Tree Bank standard, you may not get the best possible performance from the name finder and parser.
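
For the heap size and Java version tips above, a quick way to see what the running JVM actually provides is to query it directly. This is a minimal sketch using standard JDK calls; RuntimeCheckSketch is a hypothetical name, and 49.0 is the class file version mentioned in the error message above.

```java
final class RuntimeCheckSketch {

    // Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // True if this JVM supports class files of the given major.minor version,
    // e.g. 49.0 for classes compiled with Java 1.5.
    static boolean supportsClassVersion(double requiredVersion) {
        String supported = System.getProperty("java.class.version");
        return Double.parseDouble(supported) >= requiredVersion;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("Supports class version 49.0: " + supportsClassVersion(49.0));
    }
}
```

Running this with the same VM arguments as the Document Analyzer run configuration shows whether the -Xmx setting took effect and whether the JVM can load the OpenNLP jar files.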

To Dos

Wrap the new OpenNLP co-reference resolution component.