com.basistech.rlp.lucene
Class RLPJaAnalyzer

java.lang.Object
  extended by Analyzer
      extended by com.basistech.rlp.lucene.RLPAnalyzer
          extended by com.basistech.rlp.lucene.RLPJaAnalyzer

public final class RLPJaAnalyzer
extends RLPAnalyzer

An Analyzer for Japanese that uses RLP. To use this analyzer, you must have a valid RLP license that enables the JLA language processor. The default RLP context definition also requires RCLU to be licensed. RCLU is used for Form KC Normalization and lowercasing.
This Analyzer uses RLPTokenizer, LowerCaseFilter, and RLPPOSFilter (the last only if POS generation is turned on and the allowed POS tag list is provided).
Note: Although this is currently implemented as a subclass of RLPAnalyzer, this is regarded as an implementation detail, and it may change in the future. The eventual contract is that it is a subclass of Lucene Analyzer.


Constructor Summary
RLPJaAnalyzer()
          This default constructor uses the default RLP context, which includes the RCLU and JLA language processors (LPs).
RLPJaAnalyzer(String rlpContextDef)
          This constructor uses the default set of post types: STEM (which is actually a lemma), COMP (compound word decomposition), and POS (part-of-speech in the Token's payload field). Note that the output of the many-to-one normalizer (formerly called JON) and readings are not used.
RLPJaAnalyzer(String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes)
          This constructor uses the part-of-speech filter with the default part-of-speech tag set.
RLPJaAnalyzer(String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes, String[] allowedPOSTags)
          This constructor does not use default values.
 
Method Summary
static String[] getDefaultAllowedPOSTags()
          Gets the array of part-of-speech (POS) tags that is assumed when a constructor without such an argument is used.
static String getDefaultContextDefinition()
          Gets the context definition that is assumed when a constructor without such an argument is used.
static EnumSet<RLPTokenizer.PostType> getDefaultPostTypes()
          Gets the set of post types that is assumed when a constructor without such an argument is used.
static void main(String[] args)
          (Internal use only) Tokenizes a Japanese sentence, and displays the results.
 
Methods inherited from class com.basistech.rlp.lucene.RLPAnalyzer
getDetectedLanguage, tokenStream
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RLPJaAnalyzer

public RLPJaAnalyzer(String rlpContextDef,
                     EnumSet<RLPTokenizer.PostType> postTypes,
                     String[] allowedPOSTags)
This constructor does not use default values.

Parameters:
rlpContextDef - Context definition that RLP uses to process text: an XML string or path to XML file.
postTypes - RLP Result types for which the tokenizer will generate tokens.
allowedPOSTags - POSTagFilter will accept tokens with these POS tags.
See Also:
RLPAnalyzer.RLPAnalyzer(LanguageCode, String, EnumSet, String[])
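
To illustrate what the allowedPOSTags parameter controls, the following is a minimal, self-contained sketch of allow-list filtering by part-of-speech tag. It does not use RLP or Lucene: TaggedToken and PosAllowListSketch are hypothetical stand-ins, and the tag strings NOUN, VERB, and PARTICLE are placeholders rather than actual RLP tag names. The real filter may behave differently in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token carrying a POS tag; not an RLP or Lucene type.
final class TaggedToken {
    final String text;
    final String posTag;

    TaggedToken(String text, String posTag) {
        this.text = text;
        this.posTag = posTag;
    }
}

final class PosAllowListSketch {
    // Keeps only the text of tokens whose POS tag appears in allowedPOSTags.
    static List<String> filter(List<TaggedToken> tokens, String[] allowedPOSTags) {
        Set<String> allowed = new HashSet<>(Arrays.asList(allowedPOSTags));
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (allowed.contains(token.posTag)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder tags; the actual RLP tag set is assumed to differ.
        // The strings are "Tokyo", the particle "ni", and the verb "iku",
        // written as Unicode escapes.
        List<TaggedToken> tokens = Arrays.asList(
            new TaggedToken("\u6771\u4EAC", "NOUN"),
            new TaggedToken("\u306B", "PARTICLE"),
            new TaggedToken("\u884C\u304F", "VERB"));

        // Only the NOUN and VERB tokens survive; the particle is dropped.
        System.out.println(filter(tokens, new String[] {"NOUN", "VERB"}));
    }
}
```

Dropping function words such as particles is the usual motivation for an allow list of this kind: the index then holds only the tokens that are likely to be searched for.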

RLPJaAnalyzer

public RLPJaAnalyzer(String rlpContextDef,
                     EnumSet<RLPTokenizer.PostType> postTypes)
This constructor uses the part-of-speech filter with the default part-of-speech tag set.

Parameters:
rlpContextDef - Context definition that RLP uses to process text: an XML string or path to XML file.
postTypes - RLP Result types for which the tokenizer will generate tokens.
See Also:
RLPAnalyzer.RLPAnalyzer(LanguageCode, String, EnumSet)

RLPJaAnalyzer

public RLPJaAnalyzer(String rlpContextDef)
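This constructor uses the default set of post types: STEM (which is actually a lemma), COMP (compound word decomposition), and POS (part-of-speech in the Token's payload field). Note that the output of the many-to-one normalizer (formerly called JON) and readings are not used.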

Parameters:
rlpContextDef - Context definition that RLP uses to process text: an XML string or path to XML file.
See Also:
RLPAnalyzer.RLPAnalyzer(LanguageCode, String)

RLPJaAnalyzer

public RLPJaAnalyzer()
This default constructor uses the default RLP context, which includes the RCLU and JLA language processors (LPs). RCLU is used for NFKC normalization and for lowercasing Latin letters. JLA is used with the deep-decomposition option.

See Also:
RLPAnalyzer.RLPAnalyzer(LanguageCode)
Method Detail

getDefaultContextDefinition

public static String getDefaultContextDefinition()
Gets the context definition that is assumed when a constructor without such an argument is used. For the Japanese Analyzer, the context includes RCLU and JLA. RCLU is used to normalize the input to Unicode Normalization Form KC, which yields katakana letters in fullwidth form and Latin letters and symbols in halfwidth form, and to lowercase Latin letters.

Returns:
The context definition string in XML that the default constructor uses.
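
The effect of the normalization described above can be approximated with the standard Java library. The following minimal sketch uses java.text.Normalizer as a stand-in; it does not invoke RCLU, and the class name NfkcLowercaseSketch is hypothetical. RCLU's actual output is assumed to be similar but may differ in details.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NfkcLowercaseSketch {
    // Applies Unicode Normalization Form KC, then lowercases the result.
    static String normalize(String input) {
        String nfkc = Normalizer.normalize(input, Normalizer.Form.NFKC);
        return nfkc.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Halfwidth katakana KA-TA-KA-NA (U+FF76 U+FF80 U+FF76 U+FF85).
        String halfwidthKana = "\uFF76\uFF80\uFF76\uFF85";
        // Fullwidth Latin capitals R-L-P (U+FF32 U+FF2C U+FF30).
        String fullwidthLatin = "\uFF32\uFF2C\uFF30";

        // Halfwidth katakana becomes fullwidth katakana (U+30AB U+30BF U+30AB U+30CA).
        System.out.println(normalize(halfwidthKana));
        // Fullwidth Latin becomes halfwidth and is then lowercased to "rlp".
        System.out.println(normalize(fullwidthLatin));
    }
}
```

Normalizing both the indexed text and the query in the same way lets width and case variants of the same word match each other.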

getDefaultAllowedPOSTags

public static String[] getDefaultAllowedPOSTags()
Gets the array of part-of-speech (POS) tags that is assumed when a constructor without such an argument is used. The list consists of POS tags that represent essentially all content words. Fine-tuning this list is recommended for optimal search results.

Returns:
The default array of allowed part-of-speech tags that the default constructor uses.

getDefaultPostTypes

public static EnumSet<RLPTokenizer.PostType> getDefaultPostTypes()
Gets the set of post types that is assumed when a constructor without such an argument is used. For Japanese, this contains STEM, COMP (compound decomposition), and POS. WORD is turned off, as it would just double the number of Tokens. READING is not included, as it is not likely to be used in search. M1NORM is not included, as it is an optional feature.

Returns:
The default set of post types.
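
As a rough illustration of working with a post type set, the sketch below builds the documented default (STEM, COMP, POS) and derives a customized copy. The nested PostType enum is a hypothetical stand-in for RLPTokenizer.PostType that uses only the constant names mentioned above; the real enum may declare other constants or a different order.

```java
import java.util.EnumSet;

final class PostTypeSketch {
    // Hypothetical stand-in for RLPTokenizer.PostType; not the real enum.
    enum PostType { WORD, STEM, COMP, POS, READING, M1NORM }

    // Mirrors the documented default for Japanese: STEM, COMP, and POS.
    static EnumSet<PostType> defaultPostTypes() {
        return EnumSet.of(PostType.STEM, PostType.COMP, PostType.POS);
    }

    // Returns a copy of the defaults with READING added.
    static EnumSet<PostType> withReadings() {
        EnumSet<PostType> types = EnumSet.copyOf(defaultPostTypes());
        types.add(PostType.READING);
        return types;
    }

    public static void main(String[] args) {
        System.out.println(defaultPostTypes());
        System.out.println(withReadings());
    }
}
```

Copying the default set before modifying it keeps the customization local to the caller. With the real API, a set built this way would presumably be passed as the postTypes argument of the two- or three-argument constructor.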

main

public static void main(String[] args)
(Internal use only) Tokenizes a Japanese sentence, and displays the results.

Parameters:
args - A Japanese sentence (args[0]). If you do not supply an argument, a default sentence is processed.


Copyright © 2007-2008 Basis Technology Corporation. All Rights Reserved.