com.basistech.rlp.solr
Class RLPTokenizerFactory

java.lang.Object
  extended by BaseTokenizerFactory
      extended by com.basistech.rlp.solr.RLPTokenizerFactory

public class RLPTokenizerFactory
extends BaseTokenizerFactory

Solr tokenizer factory for RLPTokenizer. Specify this class name in the <tokenizer> element in schema.xml. The following attributes are recognized:


Note that a postName attribute has an effect only if the underlying base linguistic processor supports the feature. For example, postCompoundComponents="true" has a real effect only for Japanese (lang="ja"), German (de), Dutch (nl), or Hungarian (hu). To get an effect close to indexing the dictionary form of each word, use postLemma="true" for Arabic and postStem="true" for other languages.
Most of the public setProperty, isProperty, and getProperty methods exist in anticipation of Solr's eventual adoption of Spring.
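The correspondence between a postName attribute on the <tokenizer> element and the matching setPostName method can be sketched as follows. This is a minimal illustration under stated assumptions, not the factory's actual implementation: StandInFactory is a hypothetical stand-in that models only three of the flags, applyPostAttributes is a hypothetical helper, and the initial values of postStem and postCompoundComponents are assumed to be false. Only the attribute names (lang, postCompoundComponents, postStem) and the initial value of postWord come from this page.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PostAttributeSketch {

    /** Hypothetical stand-in for RLPTokenizerFactory; it models three flags only. */
    public static class StandInFactory {
        private boolean postWord = true;         // initial value documented for setPostWord
        private boolean postStem;                // initial value assumed to be false
        private boolean postCompoundComponents;  // initial value assumed to be false

        public boolean isPostWord() { return postWord; }
        public void setPostWord(boolean b) { postWord = b; }
        public boolean isPostStem() { return postStem; }
        public void setPostStem(boolean b) { postStem = b; }
        public boolean isPostCompoundComponents() { return postCompoundComponents; }
        public void setPostCompoundComponents(boolean b) { postCompoundComponents = b; }
    }

    /** A <tokenizer> element as it might appear in schema.xml for German text. */
    public static final String TOKENIZER_XML =
        "<tokenizer class=\"com.basistech.rlp.solr.RLPTokenizerFactory\""
        + " lang=\"de\" postCompoundComponents=\"true\" postStem=\"true\"/>";

    /** Calls setPostName(value) on target for every postName attribute of the element. */
    public static void applyPostAttributes(Object target, String tokenizerXml) {
        try {
            NamedNodeMap attrs = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(tokenizerXml)))
                .getDocumentElement()
                .getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName();
                if (!name.startsWith("post")) {
                    continue; // class and lang are not boolean post flags
                }
                // postStem -> setPostStem
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method m = target.getClass().getMethod(setter, boolean.class);
                m.invoke(target, Boolean.parseBoolean(attr.getNodeValue()));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot apply attributes from: " + tokenizerXml, e);
        }
    }

    public static void main(String[] args) {
        StandInFactory factory = new StandInFactory();
        applyPostAttributes(factory, TOKENIZER_XML);
        System.out.println("postWord=" + factory.isPostWord());
        System.out.println("postStem=" + factory.isPostStem());
        System.out.println("postCompoundComponents=" + factory.isPostCompoundComponents());
    }
}
```

In this sketch an attribute with no matching setter on the stand-in class is rejected with an IllegalArgumentException; how the real factory treats unknown attributes is not stated on this page.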


Constructor Summary
RLPTokenizerFactory()
           
 
Method Summary
 TokenStream create(Reader input)
          Creates a token stream using RLPTokenizer.
 LanguageCode getLanguage()
          Retrieves the language code that has been set (LanguageCode.UNKNOWN if it hasn't been set).
 String getRLPContextDef()
          Retrieves the path to the RLP XML context definition file or string.
 void inform(ResourceLoader loader)
          An implementation of ResourceLoaderAware#inform(ResourceLoader) as required by the interface.
 boolean isPostCompoundComponents()
          Determines whether component tokens are generated for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).
 boolean isPostLemma()
          Determines whether a token is generated for each lemma (Arabic only).
 boolean isPostM1NormalizedToken()
          Determines whether a token is generated for each many-to-one normalized word (Japanese and other languages).
 boolean isPostNormalizedToken()
          Determines whether a token is generated for each normalized word (Arabic only).
 boolean isPostPartOfSpeech()
          Determines whether a part-of-speech (POS) is stored in each Token's Payload field.
 boolean isPostReadings()
          Determines whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).
 boolean isPostRoot()
          Determines whether a token is generated for each root (Arabic only).
 boolean isPostStem()
          Determines whether a token is generated for each stem.
 boolean isPostWord()
          Determines whether a token is generated for each word in its original form in the text.
 void setLanguage(LanguageCode language)
          Designates the language of the text to be processed.
 void setPostCompoundComponents(boolean b)
          Specifies whether component tokens are generated for each compound word.
 void setPostLemma(boolean b)
          Specifies whether a token is generated for each lemma (Arabic only).
 void setPostM1NormalizedToken(boolean b)
          Specifies whether a token is generated for each many-to-one normalized word (Japanese and other languages).
 void setPostNormalizedToken(boolean b)
          Specifies whether a token is generated for each normalized word (Arabic only).
 void setPostPartOfSpeech(boolean b)
          Specifies whether a part-of-speech (POS) is stored in each Token's Payload field.
 void setPostReadings(boolean b)
          Specifies whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).
 void setPostRoot(boolean b)
          Specifies whether a token is generated for each root (Arabic only).
 void setPostStem(boolean b)
          Specifies whether a token is generated for each stem.
 void setPostWord(boolean b)
          Specifies whether to generate a token for each of the original words in the text.
 void setRLPContextDef(String rlpContextDef)
          Sets the path to the RLP XML context definition file, or the context definition itself as a string (which must start with "<").
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RLPTokenizerFactory

public RLPTokenizerFactory()
Method Detail

create

public TokenStream create(Reader input)
Creates a token stream using RLPTokenizer.

Parameters:
input - The input reader.
Returns:
The token stream.

getLanguage

public LanguageCode getLanguage()
Retrieves the language code that has been set (LanguageCode.UNKNOWN if it hasn't been set).

Returns:
The language that has been set.

setLanguage

public void setLanguage(LanguageCode language)
Designates the language of the text to be processed.

Parameters:
language - The language of the text.

getRLPContextDef

public String getRLPContextDef()
Retrieves the path to the RLP XML context definition file or string.

Returns:
The RLP context definition.

setRLPContextDef

public void setRLPContextDef(String rlpContextDef)
Sets the path to the RLP XML context definition file, or the context definition itself as a string (which must start with "<").

Parameters:
rlpContextDef - Path to RLP context definition file or string.
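The rule that a context definition string must start with "<" suggests a simple way to tell the two kinds of value apart. The sketch below illustrates that check only; isInlineDefinition is a hypothetical helper, and both sample values are placeholders rather than a real RLP path or real RLP context XML.

```java
public class ContextDefSketch {

    /** True when the value would be read as an inline definition rather than a file path. */
    public static boolean isInlineDefinition(String rlpContextDef) {
        return rlpContextDef != null && rlpContextDef.startsWith("<");
    }

    public static void main(String[] args) {
        String asPath = "/path/to/rlp-context.xml";   // placeholder path
        String asInline = "<placeholder-context/>";   // placeholder, not real RLP context XML

        System.out.println(isInlineDefinition(asPath));
        System.out.println(isInlineDefinition(asInline));
    }
}
```

Whether leading whitespace before the "<" is tolerated is not stated here, so the sketch applies the rule literally and treats such a value as a path.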

isPostWord

public boolean isPostWord()
Determines whether a token is generated for each word in its original form in the text.

Returns:
true if a token is generated for each of the original words in the text.

setPostWord

public void setPostWord(boolean b)
Specifies whether to generate a token for each of the original words in the text.

Parameters:
b - Pass true to generate a token for each of the original words in the text. (Initial value: true)

isPostNormalizedToken

public boolean isPostNormalizedToken()
Determines whether a token is generated for each normalized word (Arabic only).

Returns:
true if a token is generated for each normalized word.

setPostNormalizedToken

public void setPostNormalizedToken(boolean b)
Specifies whether a token is generated for each normalized word (Arabic only).

Parameters:
b - Pass true to generate a token for each normalized word.

isPostM1NormalizedToken

public boolean isPostM1NormalizedToken()
Determines whether a token is generated for each many-to-one normalized word (Japanese and other languages).

Returns:
true if a token is generated for each many-to-one normalized word.

setPostM1NormalizedToken

public void setPostM1NormalizedToken(boolean b)
Specifies whether a token is generated for each many-to-one normalized word (Japanese and other languages).

Parameters:
b - Pass true if a token is generated for each many-to-one normalized word.

isPostStem

public boolean isPostStem()
Determines whether a token is generated for each stem.

Returns:
true if a token should be generated for each stem.

setPostStem

public void setPostStem(boolean b)
Specifies whether a token is generated for each stem.

Parameters:
b - Pass true if a token should be generated for each stem.

isPostLemma

public boolean isPostLemma()
Determines whether a token is generated for each lemma (Arabic only).

Returns:
true if a token should be generated for each lemma.

setPostLemma

public void setPostLemma(boolean b)
Specifies whether a token is generated for each lemma (Arabic only).

Parameters:
b - Pass true if a token should be generated for each lemma.
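The class description recommends postLemma for Arabic and postStem for other languages when something close to the dictionary form of each word is wanted. That choice can be sketched as a small helper; dictionaryFormAttribute is a hypothetical name, and the language is assumed to be given as a two-letter code such as those used for the lang attribute.

```java
public class DictionaryFormSketch {

    /** Returns the post attribute recommended for dictionary-like forms in the given language. */
    public static String dictionaryFormAttribute(String lang) {
        // Lemma tokens are documented as Arabic only; stems are the fallback elsewhere.
        return "ar".equals(lang) ? "postLemma" : "postStem";
    }

    public static void main(String[] args) {
        System.out.println(dictionaryFormAttribute("ar")); // prints postLemma
        System.out.println(dictionaryFormAttribute("de")); // prints postStem
    }
}
```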

isPostRoot

public boolean isPostRoot()
Determines whether a token is generated for each root (Arabic only).

Returns:
true if a token is generated for each root.

setPostRoot

public void setPostRoot(boolean b)
Specifies whether a token is generated for each root (Arabic only).

Parameters:
b - Pass true if a token is generated for each root.

isPostPartOfSpeech

public boolean isPostPartOfSpeech()
Determines whether a part-of-speech (POS) is stored in each Token's Payload field.

Returns:
true if a POS tag is stored in each Token's Payload field.

setPostPartOfSpeech

public void setPostPartOfSpeech(boolean b)
Specifies whether a part-of-speech (POS) is stored in each Token's Payload field.

Parameters:
b - Pass true if a POS tag is stored in each Token's Payload field.

isPostCompoundComponents

public boolean isPostCompoundComponents()
Determines whether component tokens are generated for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).

Returns:
true if component tokens are generated for each compound word.

setPostCompoundComponents

public void setPostCompoundComponents(boolean b)
Specifies whether component tokens are generated for each compound word.

Parameters:
b - Pass true to generate component tokens for each compound word (German, Dutch, Hungarian, Chinese, Japanese, and Korean).

isPostReadings

public boolean isPostReadings()
Determines whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).

Returns:
true if reading Tokens are generated.

setPostReadings

public void setPostReadings(boolean b)
Specifies whether reading Tokens are generated for each word that the language analyzer can predict (Chinese and Japanese).

Parameters:
b - Pass true if reading Tokens should be generated.

inform

public void inform(ResourceLoader loader)
An implementation of ResourceLoaderAware#inform(ResourceLoader) as required by the interface.

Parameters:
loader - See ResourceLoaderAware#inform(ResourceLoader).


Copyright © 2007-2008 Basis Technology Corporation. All Rights Reserved.