com.basistech.rlp.lucene
Class RLPTokenizer

java.lang.Object
  extended by Tokenizer
      extended by com.basistech.rlp.lucene.RLPTokenizer

public class RLPTokenizer
extends Tokenizer

A Tokenizer that uses RLP Basis Linguistic language processors, such as BL1, CLA, JLA, KLA, and ARBL. Objects of this class hold a reference to a com.basistech.rlp.RLPContext object, and reuse that context for each new input. These objects close their context when they are closed.
The API of this class has changed from that of the RLPTokenizer previously provided in the samples directory. If you have an application that was written against the sample integration code, you may need to modify your application code to use this package.
This Tokenizer will convert various RLP result types into a TokenStream. setPostName(boolean) methods of this class control which of the results of RLP are included in the token stream. Alternatively, setPostTypes(java.util.EnumSet) can be used to turn on a set of result types. If you configure this class to include more than one RLP result type, this class will follow the Lucene convention for "synonym" tokens: it will set the position increment to 0 for all but the first Token.
The RLP result type is stored in the Type field of each Lucene Token as one of the short names listed in the table below. The table also lists the Name part of the setPostName(boolean) method that determines whether the result type should generate a token (the corresponding getter method is isPostName()).

RLP Result Type               setPostName          Short name
TOKEN                         Word                 word
NORMALIZED_TOKEN              NormalizedToken      norm
MANY_TO_ONE_NORMALIZED_TOKEN  M1NormalizedToken    m1norm
STEM                          Stem                 stem
LEMMA                         Lemma                lemma
COMPOUND                      CompoundComponents   comp
READING                       Readings             reading
ROOTS                         Root                 root
PART_OF_SPEECH                PartOfSpeech         (see below)

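The method root names usually follow the names of the RLP result types (the symbolic values found in com.basistech.rlp.RLPConstants). The exceptions, as the table shows, are Word for TOKEN, M1NormalizedToken for MANY_TO_ONE_NORMALIZED_TOKEN, CompoundComponents for COMPOUND, Readings for READING, and Root for ROOTS.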

For example, if an RLPTokenizer is created in this way:

 tok = new RLPTokenizer(new StringReader("She said he said."),
                        LanguageCode.ENGLISH, "/some/where/rlp-context-no-op.xml",
                        EnumSet.of(PostType.WORD, PostType.STEM));
 
Successive calls to next(Token) will return these Tokens:
  1. ("She", "word", +1)
  2. ("she", "stem", +0)
  3. ("said", "word", +1)
  4. ("say", "stem", +0)
  5. ("he", "word", +1)
  6. ("he", "stem", +0)
  7. ("said", "word", +1)
  8. ("say", "stem", +0)
  9. (".", "word", +1)
  10. (".", "stem", +0)
  11. null
where a token is represented by the triplet (text, type, position increment).
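
The position increments in the listing above translate into absolute token positions by a running sum, which is how the "synonym" tokens end up sharing a position. The following is a minimal sketch of that arithmetic only. Triplet and PositionSketch are illustrative stand-ins, not classes of this package or of Lucene, and counting positions from 1 is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Lucene Token: (text, type, position increment). */
final class Triplet {
    final String text;
    final String type;
    final int positionIncrement;

    Triplet(String text, String type, int positionIncrement) {
        this.text = text;
        this.type = type;
        this.positionIncrement = positionIncrement;
    }
}

final class PositionSketch {
    /** Returns the absolute position of each token: the running sum of the increments. */
    static List<Integer> absolutePositions(List<Triplet> tokens) {
        List<Integer> positions = new ArrayList<Integer>();
        int position = 0; // origin chosen for this sketch only
        for (Triplet t : tokens) {
            position += t.positionIncrement;
            positions.add(position);
        }
        return positions;
    }

    /** The first four tokens of the listing above. */
    static List<Triplet> sample() {
        return Arrays.asList(
            new Triplet("She", "word", 1),
            new Triplet("she", "stem", 0),
            new Triplet("said", "word", 1),
            new Triplet("say", "stem", 0));
    }

    public static void main(String[] args) {
        // Each word and its stem share a position.
        System.out.println(absolutePositions(sample())); // prints [1, 1, 2, 2]
    }
}
```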

Note that the Token pairs for "he" and "." have identical text; they differ only in type and position increment. To remove such duplicates, you can apply Solr's RemoveDuplicatesTokenFilter. (If you are using it from Solr, specify its factory class, RemoveDuplicatesTokenFilterFactory, in schema.xml instead.) You can also apply Lucene's standard org.apache.lucene.analysis.LowerCaseFilter (LowerCaseFilterFactory) before RemoveDuplicatesTokenFilter so that the near-duplicate "she" stem Token is removed as well.
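
The combined effect of lowercasing and then removing duplicates can be modeled without Lucene or Solr on the classpath. The sketch below is a simplified model of that filter chain, not Solr's implementation: it lowercases each token and drops any token whose text has already been seen at the same position. DedupSketch and Tok are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DedupSketch {
    /** Hypothetical stand-in for a Lucene Token: (text, type, position increment). */
    static final class Tok {
        final String text;
        final String type;
        final int posInc;

        Tok(String text, String type, int posInc) {
            this.text = text;
            this.type = type;
            this.posInc = posInc;
        }
    }

    /** Lowercases every token, then drops tokens whose text already occurred at the same position. */
    static List<Tok> lowercaseAndDedup(List<Tok> input) {
        List<Tok> output = new ArrayList<Tok>();
        Set<String> seenAtPosition = new HashSet<String>();
        for (Tok t : input) {
            String lower = t.text.toLowerCase(Locale.ROOT);
            if (t.posInc > 0) {
                seenAtPosition.clear(); // a positive increment starts a new position
            }
            if (seenAtPosition.add(lower)) {
                output.add(new Tok(lower, t.type, t.posInc));
            }
        }
        return output;
    }

    /** The first six tokens of the "She said he said." example. */
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("She", "word", 1), new Tok("she", "stem", 0),
            new Tok("said", "word", 1), new Tok("say", "stem", 0),
            new Tok("he", "word", 1), new Tok("he", "stem", 0));
    }

    public static void main(String[] args) {
        for (Tok t : lowercaseAndDedup(sample())) {
            System.out.println("(" + t.text + ", " + t.type + ", +" + t.posInc + ")");
        }
    }
}
```

In this model the "she" and "he" stem tokens disappear, while "say" survives because its text differs from "said".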

If you turn on compound handling (by calling setPostCompoundComponents(boolean) with true, or by including PostType.COMP in the EnumSet given to the constructor), note how the component Tokens are reported; the example below illustrates this.

Let's take the German compound word "Ferienwohnungen" (vacation apartments) as an example. Assuming only the word and compound components are turned on, the RLPTokenizer will generate these Tokens:
  1. ("Ferienwohnungen", "word", +1, 0, 15)
  2. ("Ferien", "comp", +0, 0, 15)
  3. ("Wohnung", "comp", +0, 0, 15)
Notice that "Wohnung" is reported as starting at offset 0 and ending at offset 15, although it has only 7 letters and its surface form actually starts at offset 6. Also notice that each component is given in the singular form, although the original word was plural, and that each component is capitalized.
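
One practical consequence of the shared offsets is that slicing the original text by a component Token's start and end offsets yields the whole compound, not the component. The sketch below shows this with plain strings; CompoundOffsetSketch and covered are hypothetical names used only for illustration.

```java
final class CompoundOffsetSketch {
    /** Returns the slice of the original text covered by the offsets [start, end). */
    static String covered(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Ferienwohnungen";
        // Offsets reported for the "Wohnung" component in the example above.
        System.out.println(covered(text, 0, 15)); // prints Ferienwohnungen
        // Where the second component's surface form actually sits in the text.
        System.out.println(covered(text, 6, 15)); // prints wohnungen
    }
}
```

Code that highlights matches by offset will therefore mark the entire compound when a component Token matches.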

Unlike other result types, com.basistech.rlp.RLPConstants#PART_OF_SPEECH does not generate its own Token. Instead, the part-of-speech tag is attached as a Payload to each Token generated from the word. aToken.getPayload().getData() returns the part-of-speech tag as a byte array of US-ASCII characters. To obtain it as a String, call new String(aToken.getPayload().getData(), "US-ASCII").
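
The decoding step can be tried on a plain byte array without a Token at hand. In the sketch below, PosPayloadSketch and decodeTag are hypothetical names, and "NOUN" is a made-up tag value used only as sample data. On Java 7 and later, java.nio.charset.StandardCharsets.US_ASCII avoids the checked UnsupportedEncodingException that the charset-name form can throw.

```java
import java.nio.charset.StandardCharsets;

final class PosPayloadSketch {
    /** Decodes payload bytes (as returned by getPayload().getData()) into a tag string. */
    static String decodeTag(byte[] payloadData) {
        return new String(payloadData, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Sample payload bytes; the tag value is an assumption, not a documented RLP tag.
        byte[] data = "NOUN".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeTag(data)); // prints NOUN
    }
}
```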

This class does not handle the result type from the RLP Stopwords language processor. Instead, use RLPPOSFilter of this package, the standard Lucene StopFilter, or the built-in stopword feature of some language processors such as JLA.

When writing code that inspects the type field of Tokens, use the constant fields of this class such as TOKEN_TYPE_STRING_WORD instead of the literal string "word", for example, so that misspellings can be caught by the compiler.
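
As an illustration of that advice, the sketch below selects tokens by type using named constants. TypeFilterSketch and its two constants are stand-ins for RLPTokenizer.TOKEN_TYPE_STRING_WORD and RLPTokenizer.TOKEN_TYPE_STRING_STEM; their values are assumed to equal the short names "word" and "stem" from the table above.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeFilterSketch {
    // Stand-ins for the RLPTokenizer constants; values assumed from the short-name table.
    static final String TYPE_WORD = "word";
    static final String TYPE_STEM = "stem";

    /** Each row is {text, type}. Returns the texts of the rows whose type matches. */
    static List<String> textsOfType(String[][] tokens, String wantedType) {
        List<String> texts = new ArrayList<String>();
        for (String[] token : tokens) {
            if (wantedType.equals(token[1])) {
                texts.add(token[0]);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"She", "word"}, {"she", "stem"}, {"said", "word"}, {"say", "stem"}
        };
        // A misspelled constant name fails to compile; a misspelled literal would not.
        System.out.println(textsOfType(tokens, TYPE_STEM)); // prints [she, say]
    }
}
```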


Nested Class Summary
static class RLPTokenizer.PostType
          This enum is used to specify a set of result types that this Tokenizer will use, in the form of EnumSet<PostType>.
 
Field Summary
static String TOKEN_TYPE_STRING_COMPOUND_COMPONENT
          The string that will be found in the type field of tokens that correspond to RLPConstants#COMPOUND.
static String TOKEN_TYPE_STRING_LEMMA
          The string that will be found in the type field of tokens that correspond to RLPConstants#LEMMA.
static String TOKEN_TYPE_STRING_MANY_TO_ONE_NORMALIZED_TOKEN
          The string that will be found in the type field of tokens that correspond to RLPConstants#MANY_TO_ONE_NORMALIZED_TOKEN.
static String TOKEN_TYPE_STRING_NORMALIZED_TOKEN
          The string that will be found in the type field of tokens that correspond to RLPConstants#NORMALIZED_TOKEN.
static String TOKEN_TYPE_STRING_READING
          The string that will be found in the type field of tokens that correspond to RLPConstants#READING.
static String TOKEN_TYPE_STRING_ROOT
          The string that will be found in the type field of tokens that correspond to RLPConstants#ROOTS.
static String TOKEN_TYPE_STRING_STEM
          The string that will be found in the type field of tokens that correspond to RLPConstants#STEM.
static String TOKEN_TYPE_STRING_WORD
          The string that will be found in the type field of tokens that correspond to RLPConstants#TOKEN.
 
Constructor Summary
RLPTokenizer(Reader input, LanguageCode lang, String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes)
          This is the only constructor.
 
Method Summary
 void close()
          For internal use.
 LanguageCode getDetectedLanguage()
          Returns the language detected by RLI, if it is enabled.
 LanguageCode getLanguage()
          Returns the language previously set by setLanguage(LanguageCode).
 EnumSet<RLPTokenizer.PostType> getPostTypes()
          Bulk getter of types used to generate tokens.
 boolean isPostCompoundComponents()
          Determines whether component tokens are generated for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).
 boolean isPostLemma()
          Determines whether a token is generated for each lemma (Arabic only).
 boolean isPostM1NormalizedToken()
          Determines whether a token is generated for each many-to-one normalized word (Japanese, and other languages).
 boolean isPostNormalizedToken()
          Determines whether a token is generated for each normalized word (Arabic only).
 boolean isPostPartOfSpeech()
          Determines whether a part-of-speech (POS) is stored in each Token's Payload field.
 boolean isPostReadings()
          Determines whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).
 boolean isPostRoot()
          Determines whether a token is generated for each root (Arabic only).
 boolean isPostStem()
          Determines whether a token is generated for each stem.
 boolean isPostWord()
          Determines whether a token is being generated for each word in its original form in the text.
 Token next(Token result)
          Returns the next token in the token stream.
 void reset()
          For internal use only.
 void reset(Reader input)
          For internal use only.
 void setLanguage(LanguageCode language)
          Set the language of the text.
 void setPostCompoundComponents(boolean b)
          Specifies whether component tokens are generated for each compound word.
 void setPostLemma(boolean b)
          Specifies whether a token is generated for each lemma (Arabic only).
 void setPostM1NormalizedToken(boolean b)
          Specifies whether a token is generated for each many-to-one normalized word (Japanese, and other languages).
 void setPostNormalizedToken(boolean b)
          Specifies whether a token is generated for each normalized word (Arabic only).
 void setPostPartOfSpeech(boolean b)
          Specifies whether a part-of-speech (POS) is stored in each Token's Payload field.
 void setPostReadings(boolean b)
          Specifies whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).
 void setPostRoot(boolean b)
          Specifies whether a token is generated for each root (Arabic only).
 void setPostStem(boolean b)
          Specifies whether a token is generated for each stem.
 void setPostTypes(EnumSet<RLPTokenizer.PostType> es)
          Bulk setter of token-generating features.
 void setPostWord(boolean b)
          Specifies whether to generate a token for each of the original words in the text.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOKEN_TYPE_STRING_WORD

public static final String TOKEN_TYPE_STRING_WORD
The string that will be found in the type field of tokens that correspond to RLPConstants#TOKEN.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_NORMALIZED_TOKEN

public static final String TOKEN_TYPE_STRING_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#NORMALIZED_TOKEN.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_MANY_TO_ONE_NORMALIZED_TOKEN

public static final String TOKEN_TYPE_STRING_MANY_TO_ONE_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#MANY_TO_ONE_NORMALIZED_TOKEN.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_STEM

public static final String TOKEN_TYPE_STRING_STEM
The string that will be found in the type field of tokens that correspond to RLPConstants#STEM.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_LEMMA

public static final String TOKEN_TYPE_STRING_LEMMA
The string that will be found in the type field of tokens that correspond to RLPConstants#LEMMA.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_COMPOUND_COMPONENT

public static final String TOKEN_TYPE_STRING_COMPOUND_COMPONENT
The string that will be found in the type field of tokens that correspond to RLPConstants#COMPOUND.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_READING

public static final String TOKEN_TYPE_STRING_READING
The string that will be found in the type field of tokens that correspond to RLPConstants#READING.

See Also:
Constant Field Values

TOKEN_TYPE_STRING_ROOT

public static final String TOKEN_TYPE_STRING_ROOT
The string that will be found in the type field of tokens that correspond to RLPConstants#ROOTS.

See Also:
Constant Field Values
Constructor Detail

RLPTokenizer

public RLPTokenizer(Reader input,
                    LanguageCode lang,
                    String rlpContextDef,
                    EnumSet<RLPTokenizer.PostType> postTypes)
This is the only constructor. This class does not have a default constructor.

Parameters:
input - The input reader.
lang - Two-letter language code, supported by RLP. Specify null to have RLI detect the language. (RLI feature must be licensed.)
rlpContextDef - Path to an RLP Context Definition file, or the definition itself (must start with "<").
postTypes - Result types that should generate tokens. If null is specified, only the "word" token will be generated unless setPostName() is called before the first next() call.
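
The rule stated for rlpContextDef (a path, unless the string starts with "<") can be mirrored by a small check on the caller's side. How RLPTokenizer itself makes the distinction is not documented here, so the helper below is only a sketch of the stated rule. ContextDefSketch and looksLikeInlineDefinition are hypothetical names, and the XML string is a placeholder rather than a valid context definition.

```java
final class ContextDefSketch {
    /** Returns true if the argument would be read as an inline definition under the stated rule. */
    static boolean looksLikeInlineDefinition(String rlpContextDef) {
        return rlpContextDef != null && rlpContextDef.startsWith("<");
    }

    public static void main(String[] args) {
        // A file path, as in the class-level example.
        System.out.println(looksLikeInlineDefinition("/some/where/rlp-context-no-op.xml")); // prints false
        // Placeholder text that merely starts with "<".
        System.out.println(looksLikeInlineDefinition("<?xml version=\"1.0\"?>")); // prints true
    }
}
```

Note that leading whitespace before the "<" would make the string look like a path under this rule.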
Method Detail

reset

public void reset(Reader input)
           throws IOException
For internal use only.

Parameters:
input - Input reader.
Throws:
IOException
See Also:
Tokenizer

reset

public void reset()
           throws IOException
For internal use only.

Throws:
IOException
See Also:
TokenStream

next

public Token next(Token result)
           throws IOException
Returns the next token in the token stream. Note: setLanguage(LanguageCode) must be called before the first call to this method unless RLI is in your context and included in your license.

Parameters:
result - For best performance use new Token() to begin, then the token returned by the previous call to this method.
Returns:
The next token (result), or null if at end of stream.
Throws:
IOException
See Also:
TokenStream
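
The reuse pattern described for the result parameter (start with a fresh token, then pass back the token returned by the previous call) can be sketched as a loop. The example uses minimal stand-in types so that it runs without Lucene: NextLoopSketch, Tok, and Source are hypothetical, and Source.next only imitates the shape of next(Token).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class NextLoopSketch {
    /** Minimal stand-in for a reusable Lucene Token. */
    static final class Tok {
        String text;
    }

    /** Stand-in for a token stream with a next(Token)-style method. */
    static final class Source {
        private final Iterator<String> texts;

        Source(List<String> texts) {
            this.texts = texts.iterator();
        }

        /** Fills and returns result, or returns null at end of stream. */
        Tok next(Tok result) {
            if (!texts.hasNext()) {
                return null;
            }
            result.text = texts.next();
            return result;
        }
    }

    /** Drains the source while reusing a single Tok instance; returns the token count. */
    static int drain(Source source) {
        int count = 0;
        for (Tok t = source.next(new Tok()); t != null; t = source.next(t)) {
            count++; // copy anything needed from t here; the next call overwrites it
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(drain(new Source(Arrays.asList("She", "said", "he", "said", "."))));
    }
}
```

Because one instance is reused, any data needed after the next call should be copied out of the token first.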

close

public void close()
           throws IOException
For internal use.

Throws:
IOException
See Also:
Tokenizer

isPostWord

public boolean isPostWord()
Determines whether a token is being generated for each word in its original form in the text.

Returns:
true if a token is generated for each of the original words in the text.

setPostWord

public void setPostWord(boolean b)
Specifies whether to generate a token for each of the original words in the text.

Parameters:
b - Pass true to generate a token for each of the original words in the text. (Initial value: true)

isPostNormalizedToken

public boolean isPostNormalizedToken()
Determines whether a token is generated for each normalized word (Arabic only).

Returns:
true if a token is generated for each normalized word.

setPostNormalizedToken

public void setPostNormalizedToken(boolean b)
Specifies whether a token is generated for each normalized word (Arabic only).

Parameters:
b - Pass true to generate a token for each normalized word.

isPostM1NormalizedToken

public boolean isPostM1NormalizedToken()
Determines whether a token is generated for each many-to-one normalized word (Japanese, and other languages).

Returns:
true if a token is generated for each many-to-one normalized word.

setPostM1NormalizedToken

public void setPostM1NormalizedToken(boolean b)
Specifies whether a token is generated for each many-to-one normalized word (Japanese, and other languages).

Parameters:
b - Pass true if a token is generated for each many-to-one normalized word.

isPostStem

public boolean isPostStem()
Determines whether a token is generated for each stem.

Returns:
true if a token should be generated for each stem.

setPostStem

public void setPostStem(boolean b)
Specifies whether a token is generated for each stem.

Parameters:
b - Pass true if a token should be generated for each stem.

isPostLemma

public boolean isPostLemma()
Determines whether a token is generated for each lemma (Arabic only).

Returns:
true if a token should be generated for each lemma.

setPostLemma

public void setPostLemma(boolean b)
Specifies whether a token is generated for each lemma (Arabic only).

Parameters:
b - Pass true if a token should be generated for each lemma.

isPostPartOfSpeech

public boolean isPostPartOfSpeech()
Determines whether a part-of-speech (POS) is stored in each Token's Payload field.

Returns:
true if a POS tag is stored in each Token's Payload field.

setPostPartOfSpeech

public void setPostPartOfSpeech(boolean b)
Specifies whether a part-of-speech (POS) is stored in each Token's Payload field.

Parameters:
b - Pass true if a POS tag is stored in each Token's Payload field.

isPostCompoundComponents

public boolean isPostCompoundComponents()
Determines whether component tokens are generated for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).

Returns:
true if component Tokens are generated for each compound word.

setPostCompoundComponents

public void setPostCompoundComponents(boolean b)
Specifies whether component tokens are generated for each compound word.

Parameters:
b - Pass true to generate component Tokens for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).

isPostReadings

public boolean isPostReadings()
Determines whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).

Returns:
true if reading Tokens are generated.

setPostReadings

public void setPostReadings(boolean b)
Specifies whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).

Parameters:
b - Pass true if reading Tokens should be generated.

isPostRoot

public boolean isPostRoot()
Determines whether a token is generated for each root (Arabic only).

Returns:
true if a token is generated for each root.

setPostRoot

public void setPostRoot(boolean b)
Specifies whether a token is generated for each root (Arabic only).

Parameters:
b - Pass true if a token is generated for each root.

getPostTypes

public EnumSet<RLPTokenizer.PostType> getPostTypes()
Bulk getter of types used to generate tokens.

Returns:
set of enabled token-generating features.

setPostTypes

public void setPostTypes(EnumSet<RLPTokenizer.PostType> es)
Bulk setter of token-generating features. Calling this is equivalent to calling setPostName(true or false) for every feature. To use RLPPOSFilter, you must pass an EnumSet that includes PostType.POS.

Parameters:
es - EnumSet of PostType values, e.g. EnumSet.of(PostType.WORD, PostType.STEM)
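
The stated equivalence between the bulk setter and the individual boolean setters can be modeled as follows. This is a sketch with stand-in types, not the class's actual implementation: PostTypeSketch is hypothetical, its PostType enum lists only the values named in this document, and setPost stands in for the family of setPostName(boolean) methods.

```java
import java.util.EnumSet;

final class PostTypeSketch {
    /** Stand-in for RLPTokenizer.PostType, limited to the values named in this document. */
    enum PostType { WORD, STEM, COMP, POS }

    // Word tokens are on initially, matching the documented initial value of setPostWord.
    private final EnumSet<PostType> enabled = EnumSet.of(PostType.WORD);

    /** Stand-in for one setPostName(boolean) call. */
    void setPost(PostType type, boolean on) {
        if (on) {
            enabled.add(type);
        } else {
            enabled.remove(type);
        }
    }

    /** Bulk setter: one individual call per feature, true if the feature is in the set. */
    void setPostTypes(EnumSet<PostType> es) {
        for (PostType type : PostType.values()) {
            setPost(type, es.contains(type));
        }
    }

    /** Returns a copy so that callers cannot change the internal state. */
    EnumSet<PostType> getPostTypes() {
        return EnumSet.copyOf(enabled);
    }

    public static void main(String[] args) {
        PostTypeSketch sketch = new PostTypeSketch();
        sketch.setPostTypes(EnumSet.of(PostType.STEM, PostType.POS));
        System.out.println(sketch.getPostTypes()); // prints [STEM, POS]
    }
}
```

In this model, a feature that is absent from the set is switched off, so WORD is disabled by the call in main even though it was initially on.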

getLanguage

public LanguageCode getLanguage()
Returns the language previously set by setLanguage(LanguageCode).

Returns:
The language code. Note that this method does not return the language detected by the RLI LP. Use getDetectedLanguage() for that purpose.

setLanguage

public void setLanguage(LanguageCode language)
Sets the language of the text.
Note: To convert a two-letter language code such as "en" (for English) to a com.basistech.util.LanguageCode, use com.basistech.util.LanguageCode#lookupByISO639(String).
To have RLI (Rosette Language Identifier) detect the language automatically, include the RLI LP in your RLP Context Definition, and either specify com.basistech.util.LanguageCode#UNKNOWN or do not call this method at all. RLI is licensed separately.

Parameters:
language - The language of the text.

getDetectedLanguage

public LanguageCode getDetectedLanguage()
Returns the language detected by RLI, if it is enabled. Otherwise, it returns the language set by the previous call to setLanguage(LanguageCode).

Returns:
The detected language.


Copyright © 2007-2008 Basis Technology Corporation. All Rights Reserved.