Class Summary | |
---|---|
RLPAnalyzer | A generic (language-neutral) Analyzer that uses RLPTokenizer. |
RLPAnalyzerDispatcher | Utility class to create an appropriate RLP based Analyzer for the given language. |
RLPEnAnalyzer | An Analyzer for English that uses RLP. |
RLPJaAnalyzer | An Analyzer for Japanese that uses RLP. |
RLPPOSFilter | This Filter removes tokens based on the POS (part-of-speech) tags that are stored in each token's payload by RLPTokenizer. |
RLPTokenizer | A Tokenizer that uses RLP Basis Linguistic language processors, such as BL1, CLA, JLA, KLA, and ARBL. |
Enum Summary | |
---|---|
RLPTokenizer.PostType | This enum is used to specify a set of result types that this Tokenizer will use, in the form of EnumSet<PostType>. |
Exception Summary | |
---|---|
RLPLuceneException | A RuntimeException from the RLP Lucene integration classes. |
This package contains a Tokenizer, Analyzers, and other classes for the integration of RLP with Lucene and Solr.
RLPTokenizer is the core of this package. It applies the results of RLP language analysis to Lucene through a flexible API.
The package provides a language-neutral Analyzer called RLPAnalyzer and some language-specific Analyzer classes that use RLPTokenizer and, optionally, RLPPOSFilter. Customers who write to the Lucene API (rather than Solr) can use one of these Analyzer classes, or write their own Analyzer using the provided source code as a reference.
These Analyzers may not work well with QueryParser, because query terms are usually too short to analyze accurately; in particular, the part-of-speech tags may be incorrect. Consider disabling part-of-speech filtering by specifying an EnumSet<PostType> that does not include POS. If your users usually enter words in their dictionary form separated by spaces, you might even use WhitespaceAnalyzer and skip RLP altogether.
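As a rough illustration of building an EnumSet that leaves out part-of-speech results, here is a minimal sketch in plain Java. The nested `ResultType` enum and its constants are hypothetical stand-ins for RLPTokenizer.PostType, whose actual constant names are not listed in this description, and the sketch does not show how the set is passed to RLPTokenizer.

```java
import java.util.EnumSet;

class ResultTypeSelection {
    // Hypothetical stand-in for RLPTokenizer.PostType; the real constants may differ.
    enum ResultType { TOKEN, STEM, POS }

    // Select every result type except part-of-speech, e.g. for short query strings.
    static EnumSet<ResultType> forQueries() {
        return EnumSet.complementOf(EnumSet.of(ResultType.POS));
    }

    public static void main(String[] args) {
        System.out.println(forQueries()); // prints "[TOKEN, STEM]"
    }
}
```

Using `EnumSet.complementOf` keeps the selection correct if more result types are added later; listing the wanted types explicitly with `EnumSet.of` would work equally well.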
Note about stop-word filtering: RLPAnalyzer and the language-specific RLPXxAnalyzer classes use RLPPOSFilter instead of the standard Lucene StopFilter. If word-based filtering is desired, use RLP's StopWord language processor (LP) by specifying an RLP Context Definition that includes the StopWord LP. For Chinese, Japanese, and Korean, use the stopword removal option of the CLA/JLA/KLA LP instead of the StopWord LP.
Note for Nutch developers: These Analyzers do not work at all with Nutch for query analysis of Chinese and Japanese text, because the NutchAnalysis class turns each Chinese or Japanese character into a separate token. To work around this, you would need to modify the source code of NutchAnalysis (NutchAnalysis.java).
RLPPOSFilter is a TokenFilter that removes Tokens based on the part-of-speech tags that RLPTokenizer attaches.
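The idea behind part-of-speech filtering can be sketched in plain Java without any Lucene or RLP dependency. This is only a conceptual illustration: the `TaggedToken` class, the `dropByPos` method, and the tag names (`DET`, `NOUN`, `PREP`) are made up for the example and are not the RLPPOSFilter API or RLP's tag set. In the real integration the tag travels in each token's payload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PosFilterSketch {
    // Hypothetical token: surface text plus a part-of-speech tag.
    static final class TaggedToken {
        final String text;
        final String pos;

        TaggedToken(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // Keep the text of every token whose tag is not in the set of tags to drop.
    static List<String> dropByPos(List<TaggedToken> tokens, Set<String> posToDrop) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (!posToDrop.contains(token.pos)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = List.of(
                new TaggedToken("the", "DET"),
                new TaggedToken("cat", "NOUN"),
                new TaggedToken("on", "PREP"),
                new TaggedToken("mat", "NOUN"));
        // Drop determiners and prepositions; the nouns remain in order.
        System.out.println(dropByPos(tokens, Set.of("DET", "PREP")));
    }
}
```

Filtering on tags rather than on a word list means a word is dropped only when it plays a given grammatical role, which is why the result depends on the tagger seeing enough context to assign the tag correctly.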
RLPAnalyzerDispatcher is provided primarily for the Lucene demo application, and is not meant to be used for other purposes.