com.basistech.rlp.lucene
Class RLPAnalyzer

java.lang.Object
  extended by Analyzer
      extended by com.basistech.rlp.lucene.RLPAnalyzer
Direct Known Subclasses:
RLPEnAnalyzer, RLPJaAnalyzer

public class RLPAnalyzer
extends Analyzer

A generic (language-neutral) Analyzer that uses RLPTokenizer.
This Analyzer uses RLPTokenizer, LowerCaseFilter, and RLPPOSFilter (only if POS generation is turned on and the allowed POS tag list is provided).


Constructor Summary
RLPAnalyzer()
          Equivalent to RLPAnalyzer(LanguageCode#UNKNOWN).
RLPAnalyzer(LanguageCode lang)
          Same as the two-parameter constructor except that an RLP Context that includes RCLU and the base linguistic LPs for all supported languages will be used.
RLPAnalyzer(LanguageCode lang, String rlpContextDef)
          Same as the three-parameter constructor except that Tokens derived from all the available result types will be generated without POS tags in their payload.
RLPAnalyzer(LanguageCode lang, String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes)
          When this constructor is used, no POS filter will be used.
RLPAnalyzer(LanguageCode lang, String rlpContextDef, EnumSet<RLPTokenizer.PostType> postTypes, String[] allowedPOSTags)
          The most flexible constructor takes 4 arguments.
 
Method Summary
static String getDefaultContextDefinition()
          Gets the RLP Context Definition string that is assumed when a constructor that doesn't take one is used.
static EnumSet<RLPTokenizer.PostType> getDefaultPostTypes()
          Gets the set of post types that is assumed when a constructor that doesn't take a post types argument is used.
 LanguageCode getDetectedLanguage()
          Returns the language detected by RLI, if it is enabled.
 TokenStream tokenStream(String fieldName, Reader reader)
          An implementation of Analyzer#tokenStream(String, Reader).
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RLPAnalyzer

public RLPAnalyzer(LanguageCode lang,
                   String rlpContextDef,
                   EnumSet<RLPTokenizer.PostType> postTypes,
                   String[] allowedPOSTags)
The most flexible constructor takes 4 arguments.

Parameters:
lang - The language of the text. If it is LanguageCode#UNKNOWN, then RLI will be used to auto-detect the language. To use RLI, rlpContextDef must include RLI, and the RLI feature must be licensed.
rlpContextDef - Either the file path or the definition itself in XML form.
postTypes - A set of flags that selects which RLP result types generate tokens. Specify EnumSet.allOf(PostType.class) to enable all possible token types with part-of-speech tags.
allowedPOSTags - List of part-of-speech tags that should not be filtered out; null means no POS filtering. Note that part-of-speech tags are language dependent. This argument is ignored, and no POS filter is used, if postTypes does not include POS.

RLPAnalyzer

public RLPAnalyzer(LanguageCode lang,
                   String rlpContextDef,
                   EnumSet<RLPTokenizer.PostType> postTypes)
When this constructor is used, no POS filter will be used. For optimal performance, make sure that postTypes does not contain PostType.POS.

Parameters:
lang - The language of the text. If it is LanguageCode#UNKNOWN, then RLI will be used to auto-detect the language. To use RLI, rlpContextDef must include RLI, and the RLI feature must be licensed.
rlpContextDef - Either the file path or the definition itself in XML form.
postTypes - A set of flags that selects which RLP result types generate tokens. Specify EnumSet.allOf(PostType.class) to enable all possible token types with part-of-speech tags.

RLPAnalyzer

public RLPAnalyzer(LanguageCode lang,
                   String rlpContextDef)
Same as the three-parameter constructor except that Tokens derived from all the available result types will be generated without POS tags in their payload.

Parameters:
lang - The language of the text. If it is LanguageCode#UNKNOWN, then RLI will be used to auto-detect the language. To use RLI, rlpContextDef must include RLI, and the RLI feature must be licensed.
rlpContextDef - Either the file path or the definition itself in XML form.

RLPAnalyzer

public RLPAnalyzer(LanguageCode lang)
Same as the two-parameter constructor except that an RLP Context that includes RCLU and the base linguistic LPs for all supported languages will be used. The Context Def will also include RLI if lang == null. RCLU will be used for Unicode NFKC normalization.

Parameters:
lang - The language of the text. If it is LanguageCode#UNKNOWN, then RLI will be used to auto-detect the language. To use RLI, rlpContextDef must include RLI, and the RLI feature must be licensed.

RLPAnalyzer

public RLPAnalyzer()
Equivalent to RLPAnalyzer(LanguageCode#UNKNOWN). The RLI feature must be licensed.
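
The five constructors above form a chain in which each shorter form supplies a default for the argument it omits. The following is a minimal sketch of that defaulting pattern only. It does not use the RLP classes: ConstructorDefaultsSketch, its LanguageCode and PostType enums (apart from the names UNKNOWN and POS), and the DEFAULT_CONTEXT placeholder string are hypothetical stand-ins, and the real class may be implemented differently.

```java
import java.util.EnumSet;

class ConstructorDefaultsSketch {
    // Hypothetical stand-ins; only UNKNOWN and POS are named in this documentation.
    enum LanguageCode { UNKNOWN, ENGLISH, JAPANESE }
    enum PostType { TOKEN, STEM, POS }

    // Placeholder text, not a real RLP context definition.
    static final String DEFAULT_CONTEXT = "(default context: RLI, RCLU, base linguistics LPs)";

    final LanguageCode lang;
    final String contextDef;
    final EnumSet<PostType> postTypes;
    final String[] allowedPOSTags;

    // No arguments: language unknown, so detection would be needed.
    public ConstructorDefaultsSketch() {
        this(LanguageCode.UNKNOWN);
    }

    // One argument: fall back to the default context definition.
    public ConstructorDefaultsSketch(LanguageCode lang) {
        this(lang, DEFAULT_CONTEXT);
    }

    // Two arguments: every post type except POS.
    public ConstructorDefaultsSketch(LanguageCode lang, String contextDef) {
        this(lang, contextDef, EnumSet.complementOf(EnumSet.of(PostType.POS)));
    }

    // Three arguments: null allowed-tag list, meaning no POS filter.
    public ConstructorDefaultsSketch(LanguageCode lang, String contextDef,
                                     EnumSet<PostType> postTypes) {
        this(lang, contextDef, postTypes, null);
    }

    // Four arguments: everything is supplied by the caller.
    public ConstructorDefaultsSketch(LanguageCode lang, String contextDef,
                                     EnumSet<PostType> postTypes, String[] allowedPOSTags) {
        this.lang = lang;
        this.contextDef = contextDef;
        this.postTypes = postTypes;
        this.allowedPOSTags = allowedPOSTags;
    }
}
```

With this arrangement, a caller who passes only a language gets the default context, the default post types, and no POS filter.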

Method Detail

getDefaultContextDefinition

public static String getDefaultContextDefinition()
Gets the RLP Context Definition string that is assumed when a constructor that doesn't take one is used. The Context Definition contains RLI (for language detection), RCLU (for Unicode NFKC normalization), and all base linguistics language processors (LPs). For these LPs to work, your RLP license must cover them for the languages you use.

Returns:
The default context definition string.
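
To illustrate what Unicode NFKC normalization does to input text, here is a small sketch that uses the JDK's java.text.Normalizer rather than RCLU. It is an assumption that RCLU's output matches the JDK's for these inputs; the class name NfkcSketch is made up for the example.

```java
import java.text.Normalizer;

class NfkcSketch {
    // Applies Unicode Normalization Form KC (compatibility decomposition, then composition).
    public static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters A, B, C fold to their ASCII counterparts.
        System.out.println(nfkc("\uFF21\uFF22\uFF23"));
        // The single-character "fi" ligature expands to two letters.
        System.out.println(nfkc("\uFB01"));
        // Half-width katakana KA folds to the full-width form.
        System.out.println(nfkc("\uFF76"));
    }
}
```

Normalizing before tokenization means that visually equivalent forms, such as full-width and ASCII letters, produce the same index terms.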

getDefaultPostTypes

public static EnumSet<RLPTokenizer.PostType> getDefaultPostTypes()
Gets the set of post types that is assumed when a constructor that doesn't take a post types argument is used. It includes every post type except POS (part of speech).

Returns:
The default set of post-types.
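
A set described as "every post type except POS" can be expressed with the standard EnumSet operations. The sketch below uses a hypothetical PostType enum in place of RLPTokenizer.PostType; only the POS member is named in this documentation, and TOKEN and STEM are invented for illustration.

```java
import java.util.EnumSet;

class DefaultPostTypesSketch {
    // Hypothetical stand-in for RLPTokenizer.PostType.
    enum PostType { TOKEN, STEM, POS }

    // Every post type except POS, mirroring the documented default.
    public static EnumSet<PostType> defaultPostTypes() {
        return EnumSet.complementOf(EnumSet.of(PostType.POS));
    }

    // Returns a copy of the given set with POS added, leaving the original untouched.
    public static EnumSet<PostType> withPos(EnumSet<PostType> base) {
        EnumSet<PostType> copy = EnumSet.copyOf(base);
        copy.add(PostType.POS);
        return copy;
    }

    public static void main(String[] args) {
        EnumSet<PostType> defaults = defaultPostTypes();
        System.out.println(defaults);
        System.out.println(withPos(defaults));
    }
}
```

Starting from the default set and adding POS gives the same result as EnumSet.allOf, which is the value the constructor documentation suggests when part-of-speech tags are wanted.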

tokenStream

public TokenStream tokenStream(String fieldName,
                               Reader reader)
An implementation of Analyzer#tokenStream(String, Reader).

Parameters:
fieldName - Not used.
reader - Input reader.
Returns:
The TokenStream generated by RLPTokenizer and LowerCaseFilter. When the allowedPOSTags arg is specified and the postTypes argument includes POS, RLPPOSFilter is also used.
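
The chain described above (tokenizer, then lower-casing, then an optional POS filter) can be sketched in plain Java without Lucene or RLP. Everything in this example is a stand-in: the Tok class, the "word/TAG" input convention, the whitespace tokenizer, and the tag names DET, NOUN, and VERB are assumptions made for illustration and are not how RLPTokenizer, LowerCaseFilter, or RLPPOSFilter are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenStreamChainSketch {
    // Hypothetical token: surface text plus a part-of-speech tag.
    static final class Tok {
        final String text;
        final String pos;
        Tok(String text, String pos) { this.text = text; this.pos = pos; }
    }

    // Stand-in tokenizer: splits on whitespace and reads "word/TAG" pairs.
    public static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<Tok>();
        for (String piece : input.trim().split("\\s+")) {
            if (piece.isEmpty()) continue;
            int slash = piece.lastIndexOf('/');
            String text = slash < 0 ? piece : piece.substring(0, slash);
            String pos = slash < 0 ? "" : piece.substring(slash + 1);
            out.add(new Tok(text, pos));
        }
        return out;
    }

    // Stand-in lower-case stage: tags are left unchanged.
    public static List<Tok> lowerCase(List<Tok> in) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : in) out.add(new Tok(t.text.toLowerCase(Locale.ROOT), t.pos));
        return out;
    }

    // Stand-in POS filter: keeps only tokens whose tag is in the allowed set.
    public static List<Tok> posFilter(List<Tok> in, Set<String> allowed) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : in) if (allowed.contains(t.pos)) out.add(t);
        return out;
    }

    // The POS filter is applied only when POS is enabled and a tag list is given.
    public static List<String> analyze(String input, boolean posEnabled, String[] allowedPOSTags) {
        List<Tok> stream = lowerCase(tokenize(input));
        if (posEnabled && allowedPOSTags != null) {
            stream = posFilter(stream, new HashSet<String>(Arrays.asList(allowedPOSTags)));
        }
        List<String> texts = new ArrayList<String>();
        for (Tok t : stream) texts.add(t.text);
        return texts;
    }
}
```

For example, analyze("The/DET Cat/NOUN Sleeps/VERB", true, new String[] {"NOUN", "VERB"}) keeps only the noun and the verb, while passing null for the tag list, or false for posEnabled, leaves all three lower-cased tokens in place.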

getDetectedLanguage

public LanguageCode getDetectedLanguage()
Returns the language detected by RLI, if it is enabled. Otherwise, it returns the language set by the constructor.
This method only makes sense when used from RLPAnalyzer, not from a subclass specialized for a language.

Returns:
The detected language.


Copyright © 2007-2008 Basis Technology Corporation. All Rights Reserved.