java.lang.Object
Tokenizer
com.basistech.rlp.lucene.RLPTokenizer
public class RLPTokenizer
A Tokenizer that uses RLP Basis Linguistic language processors, such as BL1, CLA, JLA, KLA, and ARBL.
Objects of this class hold a reference to a com.basistech.rlp.RLPContext object,
and reuse that context for each new input. These objects close their context when they are closed.
The API of this class has changed from that of the RLPTokenizer previously provided in the samples directory.
If you have an application that was written against the sample integration code, you may need to modify your
application code to use this package.
This Tokenizer will convert various RLP result types into a TokenStream.
The setPostName(boolean) methods of this class control which RLP results are included in the token stream.
Alternatively, setPostTypes(java.util.EnumSet) can be used to turn on a set of result types at once.
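As a rough illustration of the two configuration styles, the sketch below builds the same selection of result types either in bulk or one type at a time. The enum here is a stand-in that models only the WORD and STEM members named in this documentation; it is not the real RLPTokenizer.PostType, and no RLP or Lucene classes are used.

```java
import java.util.EnumSet;

class PostTypeSelectionDemo {
    // Stand-in for RLPTokenizer.PostType (assumption): only the two members
    // named in this documentation are modeled here.
    enum PostType { WORD, STEM }

    // Bulk style: name every wanted result type in one EnumSet.
    static EnumSet<PostType> bulk() {
        return EnumSet.of(PostType.WORD, PostType.STEM);
    }

    // Incremental style: start empty and switch types on one at a time,
    // which mirrors calling the individual boolean setters.
    static EnumSet<PostType> incremental() {
        EnumSet<PostType> types = EnumSet.noneOf(PostType.class);
        types.add(PostType.WORD);
        types.add(PostType.STEM);
        return types;
    }

    public static void main(String[] args) {
        // Both styles describe the same selection.
        System.out.println(bulk().equals(incremental()));
    }
}
```

An EnumSet of this shape is what the constructor's postTypes parameter and setPostTypes accept.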
If you configure this class to include more than one RLP result type,
this class will follow the Lucene convention for "synonym" tokens:
it will set the position increment to 0 for all but the first Token.
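To make the "synonym" convention concrete, here is a minimal sketch that turns position increments into absolute positions. The Tok class is a hypothetical stand-in for a Lucene Token, and the sample tokens are limited to the "he" and "." word/stem pairs that this page describes; nothing here calls RLP or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionIncrementDemo {
    // Hypothetical stand-in for a Lucene Token: term text, type, position increment.
    static final class Tok {
        final String term;
        final String type;
        final int posInc;
        Tok(String term, String type, int posInc) {
            this.term = term;
            this.type = type;
            this.posInc = posInc;
        }
    }

    // Accumulates increments into absolute positions; the first token lands at 0.
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;   // an increment of 0 keeps the token at the same position
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok("he", "word", 1),
            new Tok("he", "stem", 0),   // stacked on the word token
            new Tok(".", "word", 1),
            new Tok(".", "stem", 0));
        System.out.println(positions(toks)); // prints [0, 0, 1, 1]
    }
}
```

Tokens that share a position are treated by Lucene as alternatives for the same slot in the text, which is why a query for either form can match.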
The RLP result type is stored in the type field of each Lucene Token as one of the short names listed in the table below.
The table also lists the Name part of the setPostName(boolean) method that determines whether the
result type generates a token (the corresponding getter is isPostName()).
RLP Result Type | setPostName | Short name |
---|---|---|
TOKEN | Word | word |
NORMALIZED_TOKEN | NormalizedToken | norm |
MANY_TO_ONE_NORMALIZED_TOKEN | M1NormalizedToken | m1norm |
STEM | Stem | stem |
LEMMA | Lemma | lemma |
COMPOUND | CompoundComponents | comp |
READING | Readings | reading |
ROOTS | Root | root |
PART_OF_SPEECH | PartOfSpeech | (see below) |
com.basistech.rlp.RLPConstants
except:
For example, if an RLPTokenizer is created in this way:
tok = new RLPTokenizer(new StringReader("She said he said."),
LanguageCode.ENGLISH, "/some/where/rlp-context-no-op.xml",
EnumSet.of(PostType.WORD, PostType.STEM));
successive calls to next(Token) will return these Tokens:
Note that the Token pairs for "he" and "." are identical except for their position increment values.
To remove such duplicates, you can apply Solr's RemoveDuplicatesTokenFilter.
(If you are using it from Solr, specify its factory class, RemoveDuplicatesTokenFilterFactory, in schema.xml instead.)
You can also apply Lucene's standard org.apache.lucene.analysis.LowerCaseFilter (LowerCaseFilterFactory)
before RemoveDuplicatesTokenFilter so that the near-duplicate "she" stem Token is removed as well.
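The effect of chaining lower-casing and duplicate removal can be sketched as follows. This is a simplified model written in plain Java, not Solr's or Lucene's implementation: it assumes that a duplicate is a token with the same term text as an earlier token at the same position, and the Tok class and both helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DuplicateRemovalDemo {
    // Hypothetical stand-in for a Lucene Token: term text and position increment.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Models a lower-casing filter: same tokens, lower-cased term text.
    static List<Tok> lowerCase(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            out.add(new Tok(t.term.toLowerCase(Locale.ROOT), t.posInc));
        }
        return out;
    }

    // Models duplicate removal: drop a token whose term text was already
    // seen at the current position.
    static List<Tok> removeDuplicates(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                seenAtPosition.clear();   // a new position starts here
            }
            if (seenAtPosition.add(t.term)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok("She", 1), new Tok("she", 0),   // word and stem differ only in case
            new Tok("he", 1), new Tok("he", 0));    // word and stem are identical
        System.out.println(removeDuplicates(toks).size());            // prints 3
        System.out.println(removeDuplicates(lowerCase(toks)).size()); // prints 2
    }
}
```

Without lower-casing, only the exact "he" duplicate is dropped; lower-casing first makes "She" and "she" identical, so the stem token is dropped too.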
If you turn on compound handling (by calling setPostCompoundComponents(boolean) with true,
or by including PostType.COMP in the EnumSet given to the constructor), please note that:
Unlike other result types, com.basistech.rlp.RLPConstants#PART_OF_SPEECH does not generate its own Token.
Instead, the part-of-speech tag is attached as a Payload to each Token generated from the word.
aToken.getPayload().getData() returns the part-of-speech tag as a byte array of US-ASCII characters.
To get it as a String, call new String(aToken.getPayload().getData(), "US-ASCII").
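A minimal sketch of the decoding step, using only the JDK. The byte array stands in for what getPayload().getData() would return, and the tag value "NOUN" is a made-up example, not necessarily a tag that RLP emits. The sketch uses StandardCharsets.US_ASCII rather than the charset name, which avoids the checked UnsupportedEncodingException declared by the String-name constructor.

```java
import java.nio.charset.StandardCharsets;

class PosPayloadDemo {
    // Decodes payload bytes (assumed to hold US-ASCII characters) into a String.
    static String decodeTag(byte[] payloadData) {
        return new String(payloadData, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Hypothetical payload contents; a real payload comes from the Token.
        byte[] data = "NOUN".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeTag(data)); // prints NOUN
    }
}
```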
This class does not handle the result type from the RLP Stopwords language processor.
Instead, use the RLPPOSFilter of this package, the standard Lucene StopFilter,
or the built-in stopword feature of some language processors, such as JLA.
When writing code that inspects the type field of Tokens, use the constant fields of this class such as TOKEN_TYPE_STRING_WORD
instead of the literal string "word", for example, so that misspellings can be caught by the compiler.
Nested Class Summary | |
---|---|
static class |
RLPTokenizer.PostType
This enum is used to specify a set of result types that this Tokenizer will use, in the form of EnumSet<PostType>. |
Field Summary | |
---|---|
static String |
TOKEN_TYPE_STRING_COMPOUND_COMPONENT
The string that will be found in the type field of tokens that correspond to RLPConstants#COMPOUND . |
static String |
TOKEN_TYPE_STRING_LEMMA
The string that will be found in the type field of tokens that correspond to RLPConstants#LEMMA . |
static String |
TOKEN_TYPE_STRING_MANY_TO_ONE_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#MANY_TO_ONE_NORMALIZED_TOKEN . |
static String |
TOKEN_TYPE_STRING_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#NORMALIZED_TOKEN . |
static String |
TOKEN_TYPE_STRING_READING
The string that will be found in the type field of tokens that correspond to RLPConstants#READING . |
static String |
TOKEN_TYPE_STRING_ROOT
The string that will be found in the type field of tokens that correspond to RLPConstants#ROOTS . |
static String |
TOKEN_TYPE_STRING_STEM
The string that will be found in the type field of tokens that correspond to RLPConstants#STEM . |
static String |
TOKEN_TYPE_STRING_WORD
The string that will be found in the type field of tokens that correspond to RLPConstants#TOKEN . |
Constructor Summary | |
---|---|
RLPTokenizer(Reader input,
LanguageCode lang,
String rlpContextDef,
EnumSet<RLPTokenizer.PostType> postTypes)
This is the only constructor. |
Method Summary | |
---|---|
void |
close()
For internal use. |
LanguageCode |
getDetectedLanguage()
Returns the language detected by RLI, if it is enabled. |
LanguageCode |
getLanguage()
Returns the language previously set by setLanguage(LanguageCode) . |
EnumSet<RLPTokenizer.PostType> |
getPostTypes()
Bulk getter of types used to generate tokens. |
boolean |
isPostCompoundComponents()
Determines whether component tokens are generated for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean). |
boolean |
isPostLemma()
Determines whether a token is generated for each lemma (Arabic only). |
boolean |
isPostM1NormalizedToken()
Determines whether a token is generated for each many-to-one normalized word (Japanese, and other languages). |
boolean |
isPostNormalizedToken()
Determines whether a token is generated for each normalized word (Arabic only). |
boolean |
isPostPartOfSpeech()
Determines whether a part-of-speech (POS) is stored in each Token's Payload field. |
boolean |
isPostReadings()
Determines whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese). |
boolean |
isPostRoot()
Determines whether a token is generated for each root (Arabic only). |
boolean |
isPostStem()
Determines whether a token is generated for each stem. |
boolean |
isPostWord()
Determines whether a token is generated for each word in its original form in the text. |
Token |
next(Token result)
Returns the next token in the token stream. |
void |
reset()
For internal use only. |
void |
reset(Reader input)
For internal use only. |
void |
setLanguage(LanguageCode language)
Set the language of the text. |
void |
setPostCompoundComponents(boolean b)
Specifies whether component tokens are generated for each compound word. |
void |
setPostLemma(boolean b)
Specifies whether a token is generated for each lemma (Arabic only). |
void |
setPostM1NormalizedToken(boolean b)
Specifies whether a token is generated for each many-to-one normalized word (Japanese, and other languages). |
void |
setPostNormalizedToken(boolean b)
Specifies whether a token is generated for each normalized word (Arabic only). |
void |
setPostPartOfSpeech(boolean b)
Specifies whether a part-of-speech (POS) is stored in each Token's Payload field. |
void |
setPostReadings(boolean b)
Specifies whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese). |
void |
setPostRoot(boolean b)
Specifies whether a token is generated for each root (Arabic only). |
void |
setPostStem(boolean b)
Specifies whether a token is generated for each stem. |
void |
setPostTypes(EnumSet<RLPTokenizer.PostType> es)
Bulk setter of token-generating features. |
void |
setPostWord(boolean b)
Specifies whether to generate a token for each of the original words in the text. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
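public static final String TOKEN_TYPE_STRING_WORD
The string that will be found in the type field of tokens that correspond to RLPConstants#TOKEN.
public static final String TOKEN_TYPE_STRING_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#NORMALIZED_TOKEN.
public static final String TOKEN_TYPE_STRING_MANY_TO_ONE_NORMALIZED_TOKEN
The string that will be found in the type field of tokens that correspond to RLPConstants#MANY_TO_ONE_NORMALIZED_TOKEN.
public static final String TOKEN_TYPE_STRING_STEM
The string that will be found in the type field of tokens that correspond to RLPConstants#STEM.
public static final String TOKEN_TYPE_STRING_LEMMA
The string that will be found in the type field of tokens that correspond to RLPConstants#LEMMA.
public static final String TOKEN_TYPE_STRING_COMPOUND_COMPONENT
The string that will be found in the type field of tokens that correspond to RLPConstants#COMPOUND.
public static final String TOKEN_TYPE_STRING_READING
The string that will be found in the type field of tokens that correspond to RLPConstants#READING.
public static final String TOKEN_TYPE_STRING_ROOT
The string that will be found in the type field of tokens that correspond to RLPConstants#ROOTS.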
Constructor Detail |
---|
public RLPTokenizer(Reader input, LanguageCode lang, String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes)
input - The input reader.
lang - Two-letter language code supported by RLP. Specify null to have RLI detect the language. (The RLI feature must be licensed.)
rlpContextDef - Path to the RLP Context Definition file, or the definition itself (which must start with "<").
postTypes - Result types that should generate tokens. If null is specified, only the "word" token will be generated unless setPostName() is called before the first next() call.
Method Detail |
---|
public void reset(Reader input) throws IOException
input - Input reader.
IOException
Tokenizer
public void reset() throws IOException
IOException
TokenStream
public Token next(Token result) throws IOException
setLanguage(LanguageCode) must be called before the first call to this method unless RLI is in your context and included in your license.
result - For best performance, pass new Token() on the first call, then the token returned by the previous call to this method.
IOException
TokenStream
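The reuse pattern described for the result parameter can be sketched as a loop that hands each returned token back into the next call. The Token and FakeStream classes below are hypothetical stand-ins so that the example runs with the JDK alone; they only imitate the contract of next(Token) as described here (return the filled token, or null at end of input) and are not Lucene or RLP classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class TokenReuseDemo {
    // Hypothetical stand-in for a Lucene Token; holds only the term text.
    static final class Token {
        String term;
    }

    // Hypothetical stand-in for a token stream with a next(Token) method.
    static final class FakeStream {
        private final Iterator<String> terms;
        FakeStream(List<String> terms) {
            this.terms = terms.iterator();
        }
        // Fills and returns the caller's token, or returns null when exhausted.
        Token next(Token reusable) {
            if (!terms.hasNext()) {
                return null;
            }
            reusable.term = terms.next();
            return reusable;
        }
    }

    // Drains the stream while allocating a single Token up front.
    static List<String> drain(FakeStream stream) {
        List<String> out = new ArrayList<>();
        Token tok = new Token();
        while ((tok = stream.next(tok)) != null) {
            out.add(tok.term);   // copy what you need; the next call overwrites the token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new FakeStream(Arrays.asList("She", "said"))));
    }
}
```

Because the same object may be refilled on every call, any value needed later should be copied out before the next call.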
public void close() throws IOException
IOException
Tokenizer
public boolean isPostWord()
public void setPostWord(boolean b)
b - Pass true to generate a token for each of the original words in the text. (Initial value: true)
public boolean isPostNormalizedToken()
public void setPostNormalizedToken(boolean b)
b - Pass true to generate a token for each normalized word.
public boolean isPostM1NormalizedToken()
public void setPostM1NormalizedToken(boolean b)
b - Pass true to generate a token for each many-to-one normalized word.
public boolean isPostStem()
public void setPostStem(boolean b)
b - Pass true to generate a token for each stem.
public boolean isPostLemma()
public void setPostLemma(boolean b)
b - Pass true to generate a token for each lemma.
public boolean isPostPartOfSpeech()
public void setPostPartOfSpeech(boolean b)
b - Pass true to store a POS tag in each Token's Payload field.
public boolean isPostCompoundComponents()
public void setPostCompoundComponents(boolean b)
b - Pass true to generate component Tokens for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).
public boolean isPostReadings()
public void setPostReadings(boolean b)
b - Pass true to generate reading Tokens.
public boolean isPostRoot()
public void setPostRoot(boolean b)
b - Pass true to generate a token for each root.
public EnumSet<RLPTokenizer.PostType> getPostTypes()
public void setPostTypes(EnumSet<RLPTokenizer.PostType> es)
es - EnumSet of PostType values, e.g. EnumSet.of(PostType.WORD, PostType.STEM)
public LanguageCode getLanguage()
Returns the language previously set by setLanguage(LanguageCode).
To obtain the language detected by RLI, use getDetectedLanguage().
public void setLanguage(LanguageCode language)
Sets the language of the text.
To obtain a com.basistech.util.LanguageCode from an ISO 639 code, use com.basistech.util.LanguageCode#lookupByISO639(String).
To have RLI detect the language, pass com.basistech.util.LanguageCode#UNKNOWN, or do not call this method at all.
RLI is licensed separately.
language - The language of the text.
public LanguageCode getDetectedLanguage()
setLanguage(LanguageCode)
.