Copyright © 2004-2009 Basis Technology Corporation. All rights reserved. This document is property of and is proprietary to Basis Technology Corporation. It is not to be disclosed or reproduced in whole or in part without the express written consent of Basis Technology Corporation.
Web: http://www.basistech.com
Support: ProductSupport@basistech.com
Initial Release: April 06, 2009
The Rosette Linguistics Platform (RLP) is the backbone of Basis Technology's text and language analysis technology. RLP provides advanced natural-language processing techniques to help your applications unlock information in unstructured text. RLP includes modules for language and encoding identification, converting text to Unicode, identifying basic linguistic features, and locating key entities like the names of people, places, and objects of interest. RLP supports English and a variety of Asian, European, and Middle Eastern languages. The detailed linguistic information provided by RLP can be used to increase the accuracy and depth of information-retrieval, text-mining, entity-extraction, and other text-analysis applications.
A single C++, C, Java, or .NET API provides the calling application access to all of the capabilities and all of the languages supported by RLP for which the customer has obtained a license. Additional features can be added by simply adding new license keys and possibly relinking with a later release of the platform.
RLP is packaged with modules for Named Entity Extraction (NE) and Base Linguistics (BL): base noun phrase detection, tokenization, sentence boundary detection, part-of-speech tagging, and morphological analysis including stemming, alternative readings (transcriptions), and compound analysis. These modules can process Arabic, Chinese, Czech, Dutch, English, Farsi (Persian), French, German, Greek, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Urdu.
Language support for each of these operations is indicated in the following table:
Table 1. RLP Language Support for Base Linguistics (BL) and Named Entity Extraction (NE)
Language | Tokenization | POS | SBD | BNP | Stemming | Compounds | Readings | NE |
---|---|---|---|---|---|---|---|---|
Arabic | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | |
Chinese (Simplified) | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓ | ✓ |
Chinese (Traditional) | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓ | ✓ |
Czech | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | ||
Dutch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
English[a] | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓ |
Farsi (Persian) | ✓ | ✓ | ✓ | n/a | ✓ | |||
French | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | |
German | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Greek | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | ||
Hungarian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Italian | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | |
Japanese | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Korean | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Polish | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | ||
Portuguese | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | |
Russian | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | ||
Spanish | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | ✓ | |
Urdu | ✓ | ✓ | ✓ | n/a | ✓ | |||
[a] RLP also provides specialized support for upper-case
English text. When processing English text that is entirely
upper case, specify the English Upper-Case language
code ( |
POS is part-of-speech tagging, SBD is sentence-boundary detection, and BNP is base-noun-phrase detection. For Chinese, the readings are pinyin transcriptions; for Japanese, the readings are Furigana transcriptions rendered in Hiragana. Blanks in this table indicate that the functionality is not available; n/a indicates that the feature does not apply to that language.
If you work with multilingual input data, RLP provides tools for locating regions of contiguous text in a single language, so that you can process each region with the appropriate language processors.
In addition to the languages listed above, the Rosette Language Identifier (RLI) can identify text in the following languages: Albanian, Transliterated Arabic, Bahasa Indonesia, Bahasa Malay, Bengali, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Gujarati, Hebrew, Hindi, Icelandic, Kannada, Kurdish, Latvian, Lithuanian, Malayalam, Norwegian, Pashto, Transliterated Pashto, Transliterated Farsi (Persian), Romanian, Serbian (Cyrillic and Latin), Slovak, Slovenian, Somali, Swedish, Tagalog, Telugu, Thai, Turkish, Ukrainian, Transliterated Urdu, Uzbek (Cyrillic and Latin), and Vietnamese.
Other key features:
RLP is written in a portable subset of ISO/ANSI C++.
C++, C, Java, and .NET APIs are available. The APIs do not vary from one human language to another.
Text is internally encoded in Unicode (UTF-16).
RLP operations are thread safe.
RLP's features are enabled by license keys issued by Basis Technology. Please contact us to obtain the required evaluation or production license file, and refer to Installing RLP for information about where to put the license file.
RLP 6.5.2 supports the following platforms. Tags identifying the operating system, CPU, and C++ compiler are embedded in the platform names. Java is supported except where noted otherwise.
Table 2. Supported Platforms
OS | CPU | Compiler | BT_BUILD[a] |
---|---|---|---|
AIX 5.2 | PowerPC | xlc 5.2 | ppc-aix52-xlc[b] |
FreeBSD 4.8 | IA32 | gcc 3.4 | ia32-freebsd48-gcc34[b] |
FreeBSD 6.0 | AMD64 | gcc 3.4.4 | amd64-freebsd6-gcc344[b] |
FreeBSD 6.0 | IA32 | gcc 3.4.4 | ia32-freebsd6-gcc344[b] |
HP-UX 11.00 | IA64 | HP aCC 5.41 | ia64-hpux11-aCC541[b] |
HP-UX 11.00 | PA-RISC32 | HP aCC A.03.33 | parisc-hpux11-aCC333-aa[b] |
Linux (glibc 2.2) | IA32 | gcc 3.2 | ia32-glibc22-gcc32 |
Linux (glibc 2.3) | AMD64 | gcc 3.4 | amd64-glibc23-gcc34 |
Linux (glibc 2.3) | AMD64 | gcc 4.0 | amd64-glibc23-gcc40 |
Linux (glibc 2.3) | IA32 | gcc 3.2 | ia32-glibc23-gcc32 |
Linux (glibc 2.3) | IA32 | gcc 3.4.4 | ia32-glibc23-gcc34 |
Linux (glibc 2.3) | IA32 | gcc 4.0 | ia32-glibc23-gcc40 |
Linux (glibc 2.4) | AMD64 | gcc 4.1 | amd64-glibc24-gcc41 |
Linux (glibc 2.4) | IA32 | gcc 4.1 | ia32-glibc24-gcc41 |
Linux (glibc 2.5) | AMD64 | gcc 4.1 | amd64-glibc25-gcc41 |
Linux (glibc 2.5) | AMD64 | gcc 4.2 | amd64-glibc25-gcc42 |
Linux (glibc 2.5) | IA32 | gcc 4.1 | ia32-glibc25-gcc41 |
Linux (glibc 2.5) | IA32 | gcc 4.2 | ia32-glibc25-gcc42 |
Mac OS 10.5 (Darwin 9) | 32-bit/64-bit Intel | gcc 4.0 | universal-darwin9-gcc40 |
Solaris 10 | AMD64 | CC 5.8 | amd64-solaris10-cc58 |
Solaris 10 | AMD64 | gcc 4.1.2 | amd64-solaris10-gcc41 |
Solaris 10 | IA32 | CC 5.8 | ia32-solaris10-cc58 |
Solaris 10 | IA32 | gcc 3.4 | ia32-solaris10-gcc34 |
Solaris 10 | SPARC32 | CC 5.8 | sparc-solaris10-cc58 |
Solaris 10 | SPARC64 | CC 5.8 | sparc-solaris10-cc58-64 |
Solaris 10 | SPARC64 | gcc 4.1.2 | sparc-solaris10-gcc412-64 |
Solaris 7-8 | SPARC32 | CC 5.2 (Forte Developer 6) | sparc-solaris28-cc52 |
Solaris 7-8 | SPARC64 | CC 5.2 (Forte Developer 6) | sparc-solaris28-cc52-64 |
Solaris 9 | IA32 | gcc 3.4.5 | ia32-solaris9-gcc34 |
Solaris 9 | SPARC32 | gcc 3.4 | sparc-solaris9-gcc34 |
Solaris 9 | SPARC32 | CC 5.8 (Sun Studio 11) | sparc-solaris9-cc58 |
Solaris 9 | SPARC64 | CC 5.8 (Sun Studio 11) | sparc-solaris9-cc58-64 |
Solaris 9 | SPARC64 | gcc 4.1 | sparc-solaris9-gcc41-64 |
Windows 32 | IA32 | Visual Studio 7.1 | ia32-w32-msvc71 |
Windows 32 | IA32 | Visual Studio 7.1 | ia32-w32-msvc71-static[c] |
Windows 32 | IA32 | Visual Studio 8.0 | ia32-w32-msvc80 |
Windows 32 | IA32 | Visual Studio 8.0 | ia32-w32-msvc80-static[c] |
Windows 64 | AMD64 | Visual Studio 8.0 | amd64-w64-msvc80 |
Windows 64 | AMD64 | Visual Studio 8.0 | amd64-w64-msvc80-static[c] |
[a] [b] Java not supported. [c] Built with a statically linked library. Does not include support for the Java API, the Core Library for Unicode (RCLU), iFilter, or HTML Stripper. |
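Because the BT_BUILD names in Table 2 embed the CPU, operating system, and compiler tags, a build script can split a name back into its parts. The following is a minimal sketch, not part of RLP: it assumes the tags appear in the order CPU, OS, compiler (as they do in the rows above) and treats anything after the third tag, such as `64` or `static`, as an optional variant. The `BuildTag` type and `parse_bt_build` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class BuildTag(NamedTuple):
    """Hypothetical container for the tags embedded in a BT_BUILD name."""
    cpu: str
    os: str
    compiler: str
    variant: Optional[str]  # e.g. "64" or "static"; None when absent


def parse_bt_build(name: str) -> BuildTag:
    """Split a BT_BUILD name, assuming the layout cpu-os-compiler[-variant]."""
    parts = name.split("-")
    if len(parts) < 3:
        raise ValueError(f"expected cpu-os-compiler, got {name!r}")
    cpu, os_tag, compiler = parts[:3]
    # Anything after the compiler tag is kept together as the variant.
    variant = "-".join(parts[3:]) or None
    return BuildTag(cpu, os_tag, compiler, variant)


if __name__ == "__main__":
    for build in ("amd64-glibc25-gcc41", "sparc-solaris10-cc58-64",
                  "ia32-w32-msvc80-static"):
        print(build, "->", parse_bt_build(build))
```

The footnote markers ([b], [c]) are not part of the name and should be removed before parsing.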
Java API updates in com.basistech.rlp.RLPResultAccess: Added access to NamedEntityData. Replaced getIntegerData() with getDetectedLanguage() and getDetectedScript(), which return the appropriate com.basistech.util Enum type (LanguageCode and ISO639).
Added an API to guarantee that multiple instances of a given entity are returned with the same entity type.
Added facility for creating a blacklist of entities that are not to be returned by the Named Entity Extractor for the specified entity type.
For the Base Linguistics Language Analyzer, added caching of morphological data about commonly used words in English and German to accelerate linguistic processing.
Named Entities. Expanded named entity support to include Russian and three new entity types: TITLE, NUMBER, and DISTANCE. For this release, we have acquired, annotated, and performed statistical training with new data; improved and added more regular expressions; and introduced internal language-specific binary gazetteers (created by Basis Technology). To improve performance, the PERCENT entity type has been removed.
Added Regular Expression support for naming and reusing regular expressions (including expression fragments).
Added Named Entity Redactor support for joining adjacent named entities into a single named entity. By default, adjacent TITLE and PERSON entities are joined into a PERSON entity.
Added the com.basistech.neredact.prefer_length property. When set to true (the default), this property instructs the Named Entity Redactor to resolve a conflict between overlapping candidate entities in favor of the longer candidate.
Added the com.basistech.neredact.max_entity_tokens property. When a named entity returned by NamedEntityExtractor contains more than this number of tokens (the default is 8), the Named Entity Redactor discards the entity.
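The effect of the two Named Entity Redactor properties described above can be pictured with a small model. This is a sketch under stated assumptions, not the Redactor's actual algorithm: candidates are represented as token spans, over-long candidates are dropped first, and overlaps are then resolved greedily in favor of the longer span. The `Candidate` type and `redact` function are hypothetical, and only the default behavior (prefer_length set to true) is modeled, since the behavior when it is false is not described here.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate entity, expressed as a token span."""
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    type: str


def redact(candidates: List[Candidate], max_entity_tokens: int = 8) -> List[Candidate]:
    """Model of max_entity_tokens plus prefer_length=true (an assumption)."""
    # Step 1: discard entities with more than max_entity_tokens tokens.
    kept = [c for c in candidates if c.end - c.start <= max_entity_tokens]
    # Step 2: visit longer candidates first; ties go to the earlier span.
    kept.sort(key=lambda c: (-(c.end - c.start), c.start))
    chosen: List[Candidate] = []
    for c in kept:
        # Keep a candidate only if it overlaps nothing already chosen.
        if all(c.end <= k.start or c.start >= k.end for k in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c.start)


if __name__ == "__main__":
    spans = [
        Candidate(0, 2, "TITLE"),      # overlaps the PERSON span below
        Candidate(1, 5, "PERSON"),     # longer, so it wins the conflict
        Candidate(6, 16, "FACILITY"),  # 10 tokens, over the default limit
    ]
    print(redact(spans))
```

In this model the 10-token FACILITY candidate is discarded by the length limit, and the TITLE candidate loses to the longer overlapping PERSON candidate.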
Added the FragmentBoundaryDetector, which uses whitespace to separate items in fragmentary text (such as lists and tables), so that the NamedEntityExtractor will not combine a series of fragments into a single entity. The Regular Expression processor also contains a new context property (com.basistech.regexp.respect_boundaries) that you can set to instruct the processor not to cross fragment boundaries when matching text.
Added support for language-specific text gazetteers.
For a problem and workaround applying regular expressions to Japanese text with "unit in square" characters, see Known Problems in 6.5.0.
Expanded the .NET API to provide complete access to RLP functionality. The .NET API is modeled on the Java API.
Enhanced the Base Linguistics (BL1) processor to run simultaneously in multiple threads.
Replaced the Japanese Orthographic Normalizer (JON) with the ManyToOneNormalizer, which provides a multi-language utility for using language-specific normalization dictionaries to provide normalized tokens. We continue to distribute a Japanese normalization dictionary. Users can add their own normalization dictionaries for any of the languages we support.
If you are using a context configuration file or string that contains <languageprocessor>JON</languageprocessor>, you must change this entry to <languageprocessor>ManyToOneNormalizer</languageprocessor>.
For each token, JON returned a NORMALIZED_TOKEN, whereas ManyToOneNormalizer returns a MANY_TO_ONE_NORMALIZED_TOKEN.
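For applications that keep their context configuration as a string, the JON-to-ManyToOneNormalizer rename can be applied programmatically. The following is a minimal sketch using Python's standard XML parser; the `migrate_jon` function is a hypothetical helper, not an RLP utility. Note the assumption that the input has no XML declaration or DOCTYPE: ElementTree does not preserve those on output, so a real configuration would need them added back.

```python
import xml.etree.ElementTree as ET


def migrate_jon(context_xml: str) -> str:
    """Rename every JON language processor entry to ManyToOneNormalizer."""
    root = ET.fromstring(context_xml)
    for lp in root.iter("languageprocessor"):
        if lp.text and lp.text.strip() == "JON":
            lp.text = "ManyToOneNormalizer"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    old = (
        "<contextconfig><languageprocessors>"
        "<languageprocessor>JLA</languageprocessor>"
        "<languageprocessor>JON</languageprocessor>"
        "</languageprocessors></contextconfig>"
    )
    print(migrate_jon(old))
```

Code that reads results must be updated separately, because the result type changes from NORMALIZED_TOKEN to MANY_TO_ONE_NORMALIZED_TOKEN.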
Moved sample code for integrating RLP with Lucene and Solr into a separate RLP For Lucene package (rlplucene-6.0.0-sdk-unix.tar.gz or rlplucene-6.0.0-sdk-win.zip).
The sample Japanese normalization dictionary has been updated to include place name variants.
Added support for instantiating multiple Environment objects in the same process. Each of these Environment objects is a wrapper for the same underlying Environment. Accordingly all Environment objects must be initialized with the same environment configuration (normally rlp-global.xml).
Replaced Unix Make files and Windows Visual Studio Solution and Project files with scripts for building the C++, C, and .NET sample applications. The Unix .sh scripts are designed to be run in a Bash shell. The Windows .bat scripts should be run in the Command Prompt. We continue to provide Ant scripts for building and running the Java sample applications, as well as Unix .sh scripts and Windows .bat scripts for running all the sample applications.
RLI can now detect UTF-16LE and UTF-16BE, even if the endianness does not match the endianness of the host operating system. If the endianness of the file matches the endianness of the host, RLI reports the encoding as UTF-16. If the endianness of the file and the host do not match, RLI reports the full encoding: UTF-16LE or UTF-16BE.
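The reporting rule for UTF-16 can be summarized in a few lines. This sketch only restates the rule described above; it is not RLI code and performs no detection itself. The `reported_encoding` function and its parameter names are hypothetical, and byte orders are expressed with the strings "little" and "big" that Python's `sys.byteorder` uses.

```python
import sys


def reported_encoding(file_byte_order: str,
                      host_byte_order: str = sys.byteorder) -> str:
    """Return the encoding name reported for a UTF-16 file, per the rule above."""
    for order in (file_byte_order, host_byte_order):
        if order not in ("little", "big"):
            raise ValueError(f"unknown byte order: {order!r}")
    if file_byte_order == host_byte_order:
        # Matching endianness is reported simply as UTF-16.
        return "UTF-16"
    # Mismatched endianness is reported with the explicit suffix.
    return "UTF-16LE" if file_byte_order == "little" else "UTF-16BE"


if __name__ == "__main__":
    print(reported_encoding("big", host_byte_order="little"))
```

An application that needs an explicit byte order in every case can therefore combine a plain UTF-16 report with the host's own endianness.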
Approximately 24,000 traditional Chinese words have been added to the Chinese dictionary.
Replaced the Mac OS 10.4 (Darwin 8.9.1) platform for 32-bit Intel platform with the universal Mac OS 10.5 (Darwin 9) platform for 32-bit and 64-bit Intel.
The sparc-solaris9-gcc345 platform has been renamed to sparc-solaris9-gcc34.
To obtain the Rosette Demo, download and install rlpdemo-6.0.3.4.msi.
RLP is no longer being built on the amd64-glibc23-gcc32 platform.
This release removes dynamic dependencies on libgcc_s.so and libstdc++.so for the following Linux and Solaris platforms. The first three platforms are new; the last three have been changed to eliminate these dependencies. The compressed software installation packages are
rlp-6.0.2-sdk-amd64-glibc25-gcc41.tar.gz
rlp-6.0.2-sdk-amd64-solaris10-gcc41.tar.gz
rlp-6.0.2-sdk-sparc-solaris9-gcc41-64.tar.gz
rlp-6.0.2-sdk-amd64-glibc23-gcc34.tar.gz
rlp-6.0.2-sdk-ia32-glibc23-gcc34.tar.gz
rlp-6.0.2-sdk-amd64-solaris10-gcc41 (amd64-solaris10-gcc41 was formerly called amd64-solaris10-gcc412)
The Windows 32-bit SDK packages are now .zip files in place of .msi files, and the Rosette Demo has been moved from the SDK to a separate package.
Added named entity support for Korean, Farsi (Persian), and Urdu.
Added support for new named entity types: RELIGION, NATIONALITY, GPE (a geo-political entity), and FACILITY (a man-made structure or architectural entity).
366 neologisms have been added to the Japanese dictionary.
171 neologisms have been added to the Chinese dictionary.
A preliminary .NET API that provides limited coverage of RLP functionality.
Enhancements to the Windows GUI Demo.
Merging of the Korean Hangul and compound noun dictionaries into a single compiled Korean user dictionary that users can edit and recompile.
The Rosette Language Identifier (RLI) returns DETECTED_SCRIPT, the ISO15924 code for the writing script of the text to be processed.
The Rosette Language Identifier (RLI) is able to detect UTF-16 encoding.
Moved the routine for scanning the RLP license and generating a list of supported features from the introductory RLP sample applications to separate C++, Java, and C sample applications.
Removed deprecated Java interface and classes from the com.basistech package: ISO15924Constants, BTLanguageCodes, and ISO15924. Use com.basistech.util.ISO15924 and com.basistech.util.LanguageCode.
Expanded the scope of Tokenizer to tokenize all languages. In a context configuration, Tokenizer should be placed after processors that provide their own language-specific tokenization (BL1, CLA, JLA, and KLA), and (along with Sentence Boundary Detector) before processors that use the tokenization it provides (ARBL, FABL, and URBL).
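The ordering rule for Tokenizer lends itself to a simple check on the processor list of a context configuration. The following is a sketch, not an RLP tool: `tokenizer_order_problems` is a hypothetical function, and the name SentenceBoundaryDetector is assumed to be the context name of the Sentence Boundary Detector.

```python
from typing import List

# Processors that provide their own language-specific tokenization.
OWN_TOKENIZATION = {"BL1", "CLA", "JLA", "KLA"}
# Processors that use the tokenization provided by Tokenizer.
NEEDS_TOKENIZER = {"ARBL", "FABL", "URBL"}


def tokenizer_order_problems(processors: List[str]) -> List[str]:
    """Report violations of the Tokenizer placement rule described above."""
    problems: List[str] = []
    tok = processors.index("Tokenizer") if "Tokenizer" in processors else None
    sbd_name = "SentenceBoundaryDetector"  # assumed context name
    sbd = processors.index(sbd_name) if sbd_name in processors else None
    for i, name in enumerate(processors):
        if tok is not None and name in OWN_TOKENIZATION and i > tok:
            problems.append(f"{name} should come before Tokenizer")
        if tok is not None and name in NEEDS_TOKENIZER and i < tok:
            problems.append(f"{name} should come after Tokenizer")
        if sbd is not None and name in NEEDS_TOKENIZER and i < sbd:
            problems.append(f"{name} should come after {sbd_name}")
    return problems


if __name__ == "__main__":
    chain = ["RLI", "Tokenizer", "JLA", "ARBL", "SentenceBoundaryDetector"]
    for problem in tokenizer_order_problems(chain):
        print(problem)
```

An empty result means the list satisfies the placement rule; it does not validate any other dependency between processors.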
Enabled use of the same RLP context with a Chinese Script converter (CSC) to convert simplified Chinese to traditional Chinese or vice versa.
The Arabic Base Linguistics processor (ARBL) returns a new result type: lemma, and also returns multiple analyses of each token. Each analysis contains a stem, lemma, normalized token, root, and part-of-speech tag. See also Known Problems in 5.4.1.
Sample code to integrate RLP with Lucene and Solr, for applications that index and search English and Japanese documents. This code provides a starting point for creating analyzers to index and search documents in any of the languages that RLP supports.
Base linguistics support for Farsi (Persian).
Base linguistics support for Urdu.
A C API.
Adoption of Java 1.5.
Java clients are no longer required to set the java.library.path system property. Additionally, this directory is no longer required on the Windows path.
Use of the BT_ROOT environment variable to set the root directory is no longer supported. New APIs to set the Basis root directory (the installation directory) have been introduced to replace deprecated APIs for setting the rlp root directory (a child of BT_ROOT).
Utilities that took a BT_ROOT argument interpreted as the rlp root directory now take a BT_ROOT argument meaning the Basis root directory.
Deprecated BT_RLP_Environment::InitializeFromBuffer in favor of BT_RLP_Environment::InitializeFromFile.
In the com.basistech package, we have deprecated the com.basistech.ISO15924 interface and the ISOLanguageCodes and ISO15924 classes in favor of the new com.basistech.util.ISO15924 Enum.
In the com.basistech.rlp package, we have deprecated the RLPNEConstants interface and RLPNEUtils class in favor of com.basistech.util.NEConstants.
Utilities shared by RLP and other Basis Technology products have been segregated from RLP. If using Java, you must include both btrlp.jar and btutil.jar on the classpath. As noted above, some Java APIs have been deprecated in favor of new APIs in the utilities package: com.basistech.util.
Support for Chinese user dictionaries has been extended to include part-of-speech tags, decomposition patterns, and the compilation of binary dictionaries.
The Japanese Language Analyzer (JLA) identifies the dictionary (the standard dictionary or a user dictionary) in which each token is found.
Special syntax for entering Private Use Area (PUA) Unicode characters in Japanese user dictionaries.
Addition of 1403 nouns, 2657 proper nouns, and 4 verbs to the Chinese dictionary used by the Chinese Language Analyzer (CLA).
Addition of 68 nouns, 579 proper nouns, 5 adjectives, and 2 verbs to the Japanese dictionary used by the Japanese Language Analyzer (JLA).
RCLU returns a new result type if com.basistech.rclu.mapoffsets is set to true. MAP_OFFSETS is an array of unsigned integers that maps the characters in normalized text to their location in the input text prior to normalization.
Optional builds with statically linked libraries for Windows.
Three processors have been removed from RLP: Arabic Transliteration/Transcription (artrans), Arabic Name Vocalizer (arvoc), and Arabic Name Variations (arvars). The functionality provided by these processors is in a separate product, the Rosette Name Translator (RNT).
Improved Named Entity accuracy for English.
RCLU support for mapping token offsets in text that undergoes character transformations (normalization) to offsets in the original text prior to the normalization. Normalization may transform one Unicode character into two or three characters or vice versa. If you set the com.basistech.rclu.mapoffsets property to "true" or "yes" and apply com.basistech.rclu.FormCNormalization, com.basistech.rclu.FormDNormalization, com.basistech.rclu.FormKDNormalization, or com.basistech.rclu.FormKCNormalization, RCLU returns an array of unsigned integers in addition to the transformed text. Each element in the array is the original text character index corresponding to the character of transformed text indicated by the element's position in the array. If com.basistech.rclu.mapoffsets is set to "false" or "no" (the default), or a transformation other than the ones listed above is performed, this result is null.
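To make the shape of the offset array concrete, the following sketch builds a comparable mapping with Python's standard `unicodedata` module. It illustrates the idea only and is not how RCLU computes its result: `map_offsets` is a hypothetical function, it normalizes the text one combining sequence at a time, and it indexes by code point, whereas RLP text is UTF-16 (the two agree for characters in the Basic Multilingual Plane). Inputs for which segment-wise normalization differs from whole-string normalization are rejected rather than guessed at.

```python
import unicodedata
from typing import List, Tuple


def map_offsets(text: str, form: str = "NFD") -> Tuple[str, List[int]]:
    """Normalize text and map each output character to an input index."""
    # Split the input into segments that each begin at a starter
    # (a character whose canonical combining class is 0).
    segments: List[List] = []
    for i, ch in enumerate(text):
        if unicodedata.combining(ch) == 0 or not segments:
            segments.append([i, ch])
        else:
            segments[-1][1] += ch
    pieces: List[str] = []
    offsets: List[int] = []
    for start, chunk in segments:
        out = unicodedata.normalize(form, chunk)
        pieces.append(out)
        # Every output character points back to the start of its segment.
        offsets.extend([start] * len(out))
    result = "".join(pieces)
    if result != unicodedata.normalize(form, text):
        # Some sequences (for example Hangul jamo under NFC) compose
        # across segments; this sketch does not handle them.
        raise ValueError("segment-wise normalization is not valid for this input")
    return result, offsets


if __name__ == "__main__":
    normalized, offsets = map_offsets("Caf\u00e9", "NFD")
    print([hex(ord(c)) for c in normalized], offsets)
```

With Form D, the single character U+00E9 becomes two characters (e followed by U+0301), and both entries in the array point back to index 3 of the input.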
Support for reading the RLP license directly from the RLP environment configuration, rather than from a separate file. The RLP environment configuration, which may be instantiated from a file or buffer, may now contain a license element with the license.
Support for Chinese user dictionaries has been extended to include part-of-speech tags, decomposition patterns, and the compilation of binary dictionaries.
Support for European language (BL1) user dictionaries.
Rosette Language Identifier (RLI) support for identifying a number of additional languages: Hindi, Tamil, Kannada, Telugu, Bengali, Gujarati, Malayalam, Ukrainian (Cyrillic), Serbian (Cyrillic and Latin), Uzbek (Cyrillic and Latin), transliterated Farsi, and transliterated Urdu (Farsi and Urdu in Arabic script already supported).
RLI support for the Traditional Chinese Big5-HKSCS encoding.
RLI can identify upper-case English as such, enabling more accurate processing of upper-case English when RLI identifies the language.
RLI has been retrained to improve accuracy detecting Russian and Bulgarian.
Named Entity Extractor accuracy improvements identifying named entities in Arabic, Chinese, English, and Japanese.
Addition of a C++ named entity iterator (BT_RLP_NE_Iterator) to streamline access to named entities.
Addition of 1,632 Chinese names and 89 non-Chinese names to the Chinese dictionary used by the Chinese Language Analyzer (CLA).
Addition of 1,758 nouns to the Japanese dictionary used by the Japanese Language Analyzer (JLA).
All language processors (including Unicode Converter, which used to be built in to the RLP environment) are dynamically loaded as needed.
C++ API changes for BT_RLP_Context. GetStringResult and GetUTF16StringResult now return pointers as constants. GetUTF16StringResult now takes a second argument, which returns the length of the non-null-terminated string. The new signatures are as follows:
BT_Char8 const *GetStringResult(BT_RLP_EntityType type)
BT_Char16 const *GetUTF16StringResult(BT_RLP_EntityType type, BT_UInt32& resultLength)
Improved performance and accuracy for Arabic base linguistics and named entity extraction. For efficiency, folded all 12 pronoun POS tags (PRON_XX) into a single POS tag (PRONOUN), and all 7 demonstrative pronoun tags (DEM_PRON_XX) into a single POS tag (DEM_PRON).
Added a context property (com.basistech.bl.query) for processing queries (search terms) more accurately and efficiently.
Enhanced support for user-defined named entity types.
Added API support for supplying MIME type and encoding when processing input (you can also include language processors in the RLP context that detect MIME type and encoding).
Refactoring of language processors.
The Raw Input processor no longer exists. In its place, you simply start your context with RLI and RCLU.
The formal distinction between processor types (input, language, output) no longer exists.
RDIF functionality for extracting plain text from marked-up input (such as HTML, XML, PDF, and Microsoft Office documents) has been folded into new language processors: mime_detector, iFilter, and HTML Stripper.
Addition of the Chinese Script Converter (CSC). CSC is a language processor that converts text in Simplified Chinese script (used in the People's Republic of China) to text in Traditional Chinese script (used in Taiwan, Hong Kong, and Macau), or vice versa.
RLI has added support for GB 18030.
To enhance performance, RLP has switched from the PCRE regular expression engine to the Tcl regular expression engine.
This release adds support for Mac OS 10.4. The compressed software and documentation installation packages for this platform are
rlp-5.0.4-sdk-ia32-darwin891-gcc40.tar.gz
rlp-5.0.4-doc-unix.tar.gz
In addition to the SDK (RLP) and release number (5.0.4), the software package filename identifies the architecture (IA32), the Unix core environment (Darwin 8.9.1), and the C++ compiler (gcc 4.0). Using the Basis Technology naming convention, BT_BUILD is ia32-darwin891-gcc40; see "Getting Started: Supported Platforms" in the RLP Application Developer's Guide.
Note: This platform uses .dylib instead of .so as the file extension for shared objects, and DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH as the environment variable for setting the load library path. See Known Problems in RLP 5.0.4.
If you are using the Java API, Java 1.5 is required.
Added C++ and Java API to enable an RLP application to ascertain license information.
C++ BT_RLP_Environment functions:
bool HasLicenses() const = 0;
bool HasLicenseForLanguage(BT_LanguageID lid, BT_UInt32 functionality) const = 0;
virtual bool HasLicenseForNamedFeature(const char *feature, BT_UInt32 functionality) const = 0;
For the arguments you can use for functionality and feature, see bt_rlp_license_types.h.
rlp_sample has been updated to illustrate the use of these new functions. See rlp_sample.cpp.
Java RLPEnvironment methods:
boolean hasLicenses() throws RLPException
boolean hasLicenseForLanguage(int language_id, int functionality) throws RLPException
boolean hasLicenseForNamedFeature(String feature, int functionality) throws RLPException
For the arguments you can use for functionality and feature, see RLPConstants.
RLPSample has been updated to illustrate the use of these new methods. See RLPSample.java.
Rosette Language Identifier (RLI) enhancements: improved support for identifying language with short text streams, ability to detect Shift-JIS-2004 encoding, and support for identifying the following additional languages: Pashto, Urdu, Kurdish, Somali, transliterated (Romanized) Arabic, and transliterated (Romanized) Farsi. Added language codes to distinguish transliterated Arabic and transliterated Farsi from Arabic and Farsi in Arabic script.
Streamlining of input processors. The Raw Input processor, in conjunction with the RLI and RCLU language processors, can now handle Unicode and non-Unicode input encodings. The Latin 1 input processor is no longer required to handle US-ASCII and ISO 8859-1, so it has been eliminated.
Lexicon updates for the Japanese Language Analyzer (JLA) and the Chinese Language Analyzer (CLA):
For Japanese, added 3 verbs, 2 given personal names, 3 foreign place names, 187 general proper nouns, 1731 common nouns, and 7 place names.
For Chinese, added 15 adjectives, 47 adverbs, 395 phrases, 35 abbreviations, 125 noun phrases, 105 numeric entries, 196 proper nouns, 78 pronouns, 34 temporal nouns, and 194 verbs.
Speed improvements for the NamedEntityExtractor, improved support for handling upper-case English input, and added support for traditional Chinese.
Enhanced API support for accessing RLP result data. The C++ BT_RLP_ResultIterator class now includes an At function, which provides random access to individual results.
The default logging level has been changed from "none" to "error". See "Capturing Log Output" in Chapter 3 of the RLP Application Developer's Guide.
RLP Demo (GUI) fixes/enhancements.
Removal of support for the .NET interface. In a subsequent release, we plan to reintroduce the .NET interface with additional API to provide complete access to RLP functionality.
Java API Changes
RLPResultRandomAccess, RLPIntegerPair, and RLPIntegerTriple have been deprecated. Use RLPResultAccess to access result data. Depending on the result type, RLPResultAccess returns a List, Map, Integer, or String.
It is no longer necessary to call RLPInitialize.init() to initialize the JNI. Accordingly, the RLPInitialize class has been removed.
Introduction of the EnvironmentParameters class for defining the environment configuration, handling the logging callback object and log level, and setting the RLP root directory. The empty constructor RLPEnvironment() is deprecated. Use the new constructor after you have set up the EnvironmentParameters object: RLPEnvironment(EnvironmentParameters params).
The performance of English, French, Italian, German, and Spanish named-entity processing has been improved by approximately 200%.
The Gazetteer XML DTD and corresponding parser have been updated to support new features of the Global Name Matcher. Note: Apart from this readme file, the RLP documentation has not changed since Release 4.3.
The Token Normalization processor is new in 4.3.2:
The token normalization processor, bt_lp_m1norm, is a many-to-one normalizer. It reads the normalized token vector, or the token vector if the normalized vector is not present. For each input token, it returns the normalized token, if one is found in the data source. Results are written to the normalized token vector.
RLP is preconfigured with a Japanese Orthographic Normalization data source for use with the token normalization processor. See the JON processor listing in the rlp-global.xml file.
The following processors are new in 4.3:
Raw Input Processor
To enable the processing of non-Unicode input data, the Raw Input processor takes any raw text data and performs no encoding detection or conversion. This is only relevant for use with the RCLU language processor, which can take non-Unicode data and convert them to UTF-16 for succeeding language processors to use. In that case, the context definition must begin with RCLU and an encoding must be passed to the ProcessBuffer method. If you do not wish to provide an encoding, RLI must precede RCLU to detect the encoding.
Core Library for Unicode Language Processor (RCLU)
The RCLU LP performs two functions. First, it will convert the input text to UTF-16 as required by other language processors. Second, it can perform certain character transformations on the input text. Encoding conversion will only occur if RCLU is used in conjunction with the Raw Input processor. Transformations are controlled using the context properties described above and are performed in the order they are listed in the context definition. If you do not provide an encoding, RLI must precede RCLU to detect the encoding. For a complete list of dependencies and context properties, please see the RLP Programming and API Guide.
Arabic Name Vocalizer Language Processor (ARVOC)
The ARVOC language processor attempts to vocalize Arabic words under the assumption that they are likely to be personal names or parts of personal names. The selected text will be replaced with its fully-vocalized version under the prejudice that each word is likely to be part of a personal name. Words for which the proper vocalization could not be determined will be left intact (i.e., in their nonvocalized form). Words are vocalized on a one-by-one (context-free) basis. See the RLP Programming and API Guide for further details.
Significant accuracy improvements have been attained for Arabic base linguistics and English named entity extraction.
Support has been added for Czech, Greek, Hungarian, and Polish. See the Features by Language table above, and the BL1 section and POS Tags Appendix of the RLP Programming and API Guide for more information. In addition, BNP and NE are now available for Portuguese, and NE is now available for Dutch.
A new sample program for Windows, rlpW32demo.exe, has been added. The previous demo, RLPDemo.exe, has been removed.
The following processors are new in 4.2.1:
bt_ip_htmlstripper, which does the following:
Reads an HTML document and identifies its encoding via RLI.
Converts the document to UTF-16 and transforms all SGML entities to UTF-16 characters.
Strips out HTML tags and posts the stripped text as a BT_RLP_RAW_TEXT notice.
Posts UTF-16 HTML document as BT_RLP_HTML_DOC notice on the blackboard.
Posts HTML tags information as BT_RLP_HTML_TAGS notice on the blackboard.
bt_lp_htmljoiner, which does the following:
Annotates the BT_RLP_HTML_DOC with Named Entity data.
If com.basistech.htmljoiner.do_artrans property is true, performs Arabic vocalization and transliteration of the Named Entity data.
Posts the annotated HTML document as BT_RLP_HTML_DOC notice on the blackboard.
Each annotated named entity in the output HTML has the following format:
<OBJECT title="NE type">Arabic NE Text<BDO dir="LTR">[transliterated text]</BDO></OBJECT>
In the BT_RLP_HTML_DOC notice, all meta tags with http-equiv="Content-Type" are converted to:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
bt_op_dumphtml, which
Reads the BT_RLP_HTML_DOC notice and dumps the output as UTF-8.
The following is a sample context configuration (included in the SDK package as samples/etc/rlp-html-context.xml). It defines a processing chain that consumes an HTML document and outputs an annotated HTML document via the DumpHTML output processor.
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE contextconfig SYSTEM "http://www.basistech.com/dtds/2003/contextconfig.dtd">
<contextconfig>
  <properties>
    <property name="com.basistech.dumphtml.output_pathname" value="rlp-output.htm"/>
    <property name="com.basistech.htmljoiner.do_artrans" value="true"/>
  </properties>
  <inputprocessor>HTML Stripper</inputprocessor>
  <outputprocessor>DumpHTML</outputprocessor>
  <languageprocessors>
    <languageprocessor>RLI</languageprocessor>
    <languageprocessor>BL1</languageprocessor>
    <languageprocessor>Tokenizer</languageprocessor>
    <languageprocessor>JLA</languageprocessor>
    <languageprocessor>CLA</languageprocessor>
    <languageprocessor>KLA</languageprocessor>
    <languageprocessor>SentenceBoundaryDetector</languageprocessor>
    <languageprocessor>ARBL</languageprocessor>
    <languageprocessor>Stopwords</languageprocessor>
    <languageprocessor>BaseNounPhrase</languageprocessor>
    <languageprocessor>NamedEntityExtractor</languageprocessor>
    <languageprocessor>Gazetteer</languageprocessor>
    <languageprocessor>RegExpLP</languageprocessor>
    <languageprocessor>NERedactLP</languageprocessor>
    <languageprocessor>HTML Joiner</languageprocessor>
  </languageprocessors>
</contextconfig>
Processing functionality for Dutch, Portuguese, and Russian has been added to the BL1 language processor. This new functionality includes changes to the bl1config.dtd file and new POS tags for each new language. See the RLP Programming and API Guide for full information.
Named Entity Extraction is now available for French, Italian, and Spanish.
Improvements were made for German decompounding and disambiguation.
Beta release of improved European language performance. See the description of the BL1 processor in the RLP Programming and API Guide. A side effect of the introduction of this processor is that base noun phrase functionality and named entity extraction for French, Italian, and Spanish are temporarily unavailable. They will be available in the next release of RLP.
Improved AR-BL performance (now approximately 30 times faster).
Improved logging functionality:
BT_RLP_Environment::SetLogLevel and BT_RLP_Environment::SetLogCallbackFunction can now be called multiple times.
The logging callback now takes a "channel" parameter, allowing the function to examine the channel number and decide how to handle the message based on its severity. The new prototype is:
typedef void (*BT_Log_callback_function)(void * callback_data, int channel, char const *string_to_log);
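As a rough illustration of the new prototype, the sketch below defines a callback that filters on the channel number. The typedef is copied from above; LogSink, max_channel, and collect_log are hypothetical names, and the idea that lower channel numbers mean higher severity is an assumption, not something these notes state. Registration through BT_RLP_Environment::SetLogCallbackFunction is omitted because its signature is not given here.

```cpp
#include <string>
#include <vector>

// Prototype as given in these release notes.
typedef void (*BT_Log_callback_function)(void *callback_data, int channel,
                                         char const *string_to_log);

// Hypothetical sink passed through the callback_data pointer.
struct LogSink {
    int max_channel;                 // assumed cutoff: keep channels <= this value
    std::vector<std::string> kept;   // messages that passed the filter
};

// Hypothetical callback matching BT_Log_callback_function.
// Assumption: lower channel numbers are more severe.
void collect_log(void *callback_data, int channel, char const *string_to_log) {
    LogSink *sink = static_cast<LogSink *>(callback_data);
    if (sink == nullptr || string_to_log == nullptr) {
        return;  // nothing to record
    }
    if (channel <= sink->max_channel) {
        sink->kept.push_back("[" + std::to_string(channel) + "] " + string_to_log);
    }
}
```

Passing state through callback_data keeps the callback free of globals, which matters if the callback is replaced or re-registered at run time.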
Improved gazetteer and regular expression (regex) files for demonstrating custom named entity extraction: The default configuration files shipped with RLP have been updated to facilitate experimenting with these features. The instructions below assume you are using the RLPDemo program that is part of the Windows distribution. The instructions concerning edits to the configuration files also apply if you are using the samples\scripts\go.sh script that comes with UNIX distributions or the samples\scripts\go.bat script that comes with Windows distributions.
How to demo the gazetteer
Edit the file rlp\samples\etc\rlpdemo-gazetteer.txt; insert any demo words or phrases at the end of the file
Open a text file with those phrases in the RLPDemo GUI, or type them in, and press the "Named Entities" button.
You should see your entries highlighted in grey as "GazDemo" type entities. The standard demo text that appears when you open RLPDemo has the words "identification", "morphological", and "extraction". These will be highlighted in grey.
How to demo the regex entity locator:
The standard demo has three types of regex entities defined to show up as "OTHER" in purple highlight.
Type in one of these three types:
A US Social Security number (xxx-xx-xxxx)
A credit card number (xxxx-xx-xxxx)
The emoticon (smiley) ":)"
Push "Named Entities."
You will see valid Social Security numbers, credit card numbers, and the smiley highlighted in purple.
You can edit the file rlp\etc\regex-config.xml to add your own regex entities. Look at the demo entity types near the top of the file.
Notes
This version will classify all regex entities as "NONE" which is shown as "OTHER" in the demo.
Regex entity definition will be simplified in a future release to make it easier to define entities.
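To give a feel for the kind of pattern a regex entity describes, here is a minimal standalone sketch that finds strings shaped like a US Social Security number (xxx-xx-xxxx). It uses the C++ standard library rather than RLP, the function name find_ssn_like is hypothetical, and the pattern is an assumption for illustration; it is not the expression shipped in regex-config.xml.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every substring shaped like ddd-dd-dddd, bounded by word boundaries.
// This checks the shape only; it does not validate the number.
std::vector<std::string> find_ssn_like(const std::string &text) {
    static const std::regex pattern(R"(\b\d{3}-\d{2}-\d{4}\b)");
    std::vector<std::string> hits;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        hits.push_back(it->str());
    }
    return hits;
}
```

The word boundaries keep the pattern from firing inside a longer digit run such as 1234-56-7890.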
JLA now includes the capability to create and use one or more user dictionaries for words specific to an industry or application. See the JLA Processor section of the RLP Programming and API Guide for full details.
All European language processors (English, French, German, Italian, and Spanish) have been replaced by a beta version of a single new language processor called Base Linguistics (BL1). This new processor renders the old POSTagger processor obsolete and has changed all POS tags for these languages. In addition, tokenizing functionality is built in, and thus the Tokenizer processor will produce no results when run with BL1. For complete specifications, see the RLP Programming and API Guide.
Support for French, Italian, and Spanish named entities.
Python is no longer used by RLP.
The distributed directory structure has been changed. All binaries are now in the top-level bin directory and all libraries are now in the top-level lib directory. Also, the rex directory has been renamed to rlp.
"REX" in the names of classes and structures has been replaced with "RLP". The include file bt_rex.h provides backward compatibility for the old names.
The Japanese lexicon has been improved. A comparison with the RLP 2.3 lexicon is given below.
The Chinese lexicon has been improved. A comparison with the RLP 2.3 lexicon is given below.
This release includes Basis Technology Chinese lexicons with data from new providers. This may affect the Chinese Language Analyzer tokenization results in your application. Basis internal tests show that overall, segmentation accuracy has improved compared to the lexicons used in the 3.x release of CLA. These tests also show that over a very large corpus, the segmentations do not differ at all in 88% of cases. Of the cases where there are differences, CLA 4.0 was correct 53% of the time and CLA 3.x was correct 34% of the time.
The following is a summary of the major differences.
Category | CLA 4.0.0 | CLA 3.x.x |
---|---|---|
verb phrases | Many treated as one token Examples: 坐在, 去了, Ŋ到, 抄起 | Many treated as multiple tokens Examples: 坐 在, 去 了, Ŋ 到, 抄 起 |
adverb + verb | Treated as one token Example: 也有 | Treated as multiple tokens Example: 也 有 |
popular noun phrases | Treated as one token Examples: 本报讯, 党支部书记, 资产阶级自由化 | Treated as multiple tokens Examples: 本报 讯, 党支部 书记, 资产阶级 自由化 |
abbreviations | Treated as one token Examples: 女足, 世锦赛 | Treated as multiple tokens Examples: 女 足, 世 锦 赛 |
not | Combinations treated as one token Examples: 不是, 不要, 不会 | Combinations treated as two tokens Examples: 不 是, 不 要, 不 会 |
book | Combinations treated as two tokens Examples: 华文 书, 英文 书 | Combinations treated as one token Examples: 华文书, 英文书 |
one, some | Combinations treated as two tokens Examples: 一 个 | Combinations treated as one token Example: 一个 |
的 (particle) | Not appended to some adjectives Example: 红红 | Appended to some adjectives Example: 红红的 |
biggest | Treated as one token Example: 最大 | Treated as two tokens Example: 最 大 |
In 4.0.0, idioms and proper nouns are considerably more up-to-date. The most significant decrease in version 4.0 accuracy is in the recognition of proper nouns, particularly Chinese personal names. Basis Technology is committed to making product improvements to deliver better accuracy, and a project is underway now to improve proper noun coverage in an upcoming release.
Japanese Lexicon Improvements:
New Dictionary: The Japanese dictionary has been revised extensively. New words have been added. Internal parts-of-speech have been updated.
Katakana segmentation: Unknown Katakana words will be segmented as one token.
Compound Verbs: Where possible, compound verbs will be segmented as one token.
SeparatePlaceNameFromSuffix Option: This option applies to Japanese Postal Address Level 1 and Level 2 words only.
IgnoreSeparators Option: In JLA 4.1, when set to false, JLA will always treat whitespace separators as token delimiters. In prior versions, this behavior was not always observed.
New Segmentation: Because of the major lexicon change and other significant changes in the segmentation of Katakana words and compound verbs, Basis highly recommends that existing customers consider re-segmenting existing data.
Arabic and Chinese statistical algorithms have been implemented for named-entity extraction.
Chinese base noun phrase identification has been added.
Chinese part-of-speech algorithms have been improved.
Four new entity types have been added to RLP (ACE/TIDES entity types): FACILITY (FAC), GEOPOLITICAL ENTITY (GPE), VEHICLE (VEH), WEAPON (WEA).
Entity subtypes are now allowed.
Korean dictionary comments which were previously only in Korean are now also in English. (Files: ham-cnn.dic, ham-rma.dic, ham-usr.dic)
For this release, there is a new component for named entity tagging. This is expected to increase the accuracy of entity labeling for the languages that are already covered: English, German and Japanese. It also lays the groundwork for the addition of new languages in the near future. Other than the accuracy improvements, there should be no visible differences between the previous output and the new output.
Delay loading enables more efficient loading of dynamic libraries, as well as the ability to locate dependent libraries without setting the PATH environment variable. This feature is for Windows environments with MSVC 7.1.
The document type definition neconfig.dtd has changed. It is now:
<!ELEMENT neconfig (ne-options)+>
<!ELEMENT ne-options (nerfeasturedatapath | labelweightsdatapath)>
<!ATTLIST ne-options language CDATA #REQUIRED>
<!ELEMENT env EMPTY>
<!ATTLIST env name CDATA #REQUIRED>
<!ELEMENT nerfeasturedatapath (#PCDATA | env)*>
<!ELEMENT labelweightsdatapath (#PCDATA | env)*>
featuredatapath and classdatapath are no longer used. labelweightsdatapath is used for named entity labeling. nerfeasturedatapath is used for named entity recognition (the spelling error is intentional for compatibility reasons).
Two new Chinese POS tags have been added to the Chinese Language Processor: NP - Noun Proper and U - Unknown.
The Gazetteer Text Processor Module allows users to create custom Named Entity definitions from their own data lists. See Chapter 4: RLP Processors of the RLP Programming and API Guide for documentation.
This release introduces two new APIs for setting and handling the root directory: BT_REX_Environment::SetRootDirectory and BT_REX_Environment::RootDirectory.
A new context property for the REXML Output Processor, suppress_header_comment, allows the option of turning off the header comment that includes the REX version used to generate the file.
RLP Arabic capabilities have been expanded to handle variant Arabic numeric orthography such as the use of 'reh' (U+0631) as the decimal (U+066B) separator.
This release adds support for Korean and Chinese to the Sentence Boundary Detector (SBD).
There is now a .NET API and a .NET sample program for RLP.
Three new POS tags appear in the Korean Language Processor: NN for numbers, FW for foreign (non-Korean) words and Unknown for words not recognized.
A new Japanese LP option, IgnoreSeparators (default TRUE), can be set in the jla-options.xml file.
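The variant Arabic numeric orthography mentioned above (reh, U+0631, written where the Arabic decimal separator, U+066B, is meant) can be pictured with a small standalone sketch. This is not RLP's implementation. The function names are hypothetical, and the rule used here, that a reh counts as a decimal separator only when it sits directly between two Arabic-Indic digits, is an assumption made for illustration.

```cpp
#include <cstddef>
#include <string>

// True for Arabic-Indic digits U+0660..U+0669.
bool is_arabic_indic_digit(char32_t c) {
    return c >= 0x0660 && c <= 0x0669;
}

// Replaces reh (U+0631) with the Arabic decimal separator (U+066B)
// wherever the reh has an Arabic-Indic digit on both sides.
std::u32string normalize_reh_decimal(std::u32string text) {
    const char32_t kReh = 0x0631;
    const char32_t kArabicDecimal = 0x066B;
    for (std::size_t i = 1; i + 1 < text.size(); ++i) {
        if (text[i] == kReh &&
            is_arabic_indic_digit(text[i - 1]) &&
            is_arabic_indic_digit(text[i + 1])) {
            text[i] = kArabicDecimal;
        }
    }
    return text;
}
```

A reh inside an ordinary word is left untouched because its neighbors are letters, not digits.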
Bug # | Component | Description |
---|---|---|
RLP-2449 | BL1 | Fixed stack overflow processing large sequences of space-delimited numbers (in some languages, spaces may be used as thousands separator). |
RLP-2455 | BL1 | Fixed processing error that disabled the handling of larger user dictionary entries. |
RLP-2459 | Utilities | If the input for a user dictionary contains invalid UTF-8, RLP now reports an error rather than truncating the input. |
RLP-2462 | Java | Implemented a workaround for Sun bug 6791815 introduced in Java 1.6.0_04. The bug can cause deadlock on non-Solaris platforms when multiple threads are attempting to initialize a direct buffer. The workaround is to create the direct buffer in a single thread before a race condition can arise. |
RLP-2426 | RLBL | In order to avoid inaccurate language identification based on very small script regions, changed the default minimum text region (com.basistech.lbd.min_region) that the LanguageBoundary processor will analyze from 0 to 20 Unicode characters. |
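RLP-2449 above concerns long runs of space-delimited numbers, where a space may act as a thousands separator. As an illustration only (this is not BL1's code), the sketch below checks whether a string is a single number grouped with single spaces. The function name and the grouping rule (one to three digits, then one or more groups of exactly three digits) are assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if s looks like "1 234 567": a first group of 1-3 digits followed by
// one or more groups of exactly 3 digits, separated by single spaces.
bool is_space_grouped_number(const std::string &s) {
    std::size_t i = 0;
    std::size_t groups = 0;
    while (i < s.size()) {
        std::size_t start = i;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
            ++i;
        }
        std::size_t len = i - start;
        // First group: 1-3 digits; every later group: exactly 3 digits.
        if (groups == 0 ? (len < 1 || len > 3) : (len != 3)) {
            return false;
        }
        ++groups;
        if (i == s.size()) {
            break;          // consumed the whole string
        }
        if (s[i] != ' ') {
            return false;   // unexpected separator
        }
        ++i;                // skip the single separator space
        if (i == s.size()) {
            return false;   // trailing space
        }
    }
    return groups >= 2;
}
```

The scan is a plain loop, so its stack use does not grow with the length of the input.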
Bug # | Component | Description |
---|---|---|
RLP-2312 | Regular Expressions | Fixed a segmentation fault in RegExpLP handling input that is empty or contains only whitespace. |
RLP-2410 | Regular Expressions | Eliminated a hang in RegExpLP that was sometimes triggered by a low-memory condition. |
RLP-2077 | Entity Redactor | Added rules for eliminating entities that contain non-alphanumeric data. |
RLP-2398 | BL1 | Improved handling of out-of-memory conditions. |
RLI-274 | RLI | Fixed error identifying upper-case English as Pashto. |
CLA-435 | CLA | Eliminated a logic error that led to the allocation of an extra zero-length token while segmenting a particular sequence of Chinese input characters. |
RLP-2421 | Java API | Fixed a NullPointerException that was thrown looking for part-of-speech tags in input that is empty or contains only whitespace. |
Bug # | Component | Description |
---|---|---|
RLP-2022 | Language Boundary Detector | Revised the algorithm to check script regions, and if ambiguous, text regions (sentences) within script regions. Problems reporting gaps between regions and dividing a single language region into multiple regions have been fixed. |
RLP-2030 | RLI | RLP now understands the correct ISO639 codes for Hebrew (he, not iw), Yiddish (yi, not ji), and Indonesian (id, not in) as returned by RLI. |
RLP-2141 | C++ API | BT_RLP_NE_Iterator has been fixed to handle language-neutral named entities when the language is not known. |
CLA-421 | CLA | Fixed corruption of surrogate pairs (a pair of 16-bit words representing a Unicode character) when handling user-dictionary entries. |
CLA-425, CLA-426, CLA-427 | CLA | Fixed an error in the Pinyin reading dictionary, which in a limited set of cases was producing some slightly incorrect readings. |
RLP-2170 | JLA | Fixed tokenizing error handling newline at end of a sentence. |
JLA-295 | JLA | Eliminated reporting of zero-length tokens. |
RLP-2140 | BL1 | To maintain consistency with other language processors, changed error message to a warning when the language is unknown. |
RLP-1833 | CSC | Fixed error reporting whitespace tokens. |
RLP-1829 | Java API | The Java command-line utility (com.basistech.rlp.RLP) now reports a meaningful message when the input file it has been told to process is missing. |
RLP-1792 | Named Entities | Fixed a boundary-case logic error on 64-bit platforms that in some instances caused a crash when posting Named Entity results from multiple processors. |
RLP-1775 | Tokenizer | Fixed error tokenizing a series of numbers separated by whitespace as separate numbers. In specific well-defined cases (such as a space used as a thousands separator in French) a single number may still include whitespace. |
REX-233 | Named Entity Redactor | When resolving conflicts between overlapping candidate entities, Named Entity Redactor was returning two entities: the winning entity and the portion of the losing candidate that was not in the overlap. Now only the winning entity is returned. |
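RLP-2030 in the table above pairs each current ISO639 code with the older code it replaces. An application that still stores the older codes might bridge the two with a small lookup like the one sketched here. The function name is hypothetical and the mapping covers only the three pairs listed in that entry; it is not part of the RLP API.

```cpp
#include <string>

// Maps the three legacy codes named in RLP-2030 to their current forms;
// any other code is returned unchanged.
std::string current_iso639_code(const std::string &code) {
    if (code == "iw") return "he";  // Hebrew
    if (code == "ji") return "yi";  // Yiddish
    if (code == "in") return "id";  // Indonesian
    return code;
}
```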
Bug # | Component | Description |
---|---|---|
RLP-2129 | NamedEntityExtractor | Corrected a memory management error extracting references to named entities. |
Bug # | Component | Description |
---|---|---|
8051 | CLA | Fixed CLA tokenization of non-Chinese text. |
8440 | Tokenizer | Fixed tokenization of numbers separated by white space. |
8475/8623 | RCLU | Fixed FormKCNormalization to correctly handle decomposition of characters that span a buffer boundary. |
8607 | JLA | Fixed normalization of dates. |
8669 | Language processors | Fixed handling of documents that only contain whitespace. |
8690 | Regular Expressions | Fixed boundary condition problem in RegExpLP which could cause a crash in NERedactLP. |
Bug # | Component | Description |
---|---|---|
8461 | Named Entities | Fixed handling of boundary case on 64-bit platforms when posting Named Entity results from multiple processors. |
8482 | Sample Applications | Fixed crash in C sample program when tokenization of the input does not occur because the user does not have the required license. |
8563 | Chinese Script Converter | Fixed error handling whitespace. |
8564 | Chinese Script Converter | Now reposts DETECTED_LANGUAGE and DETECTED_SCRIPT so subsequent processors have access to the language code and writing script code for the converted text. |
8574 | Java API | Fixed Java Native Implementation of log callbacks to avoid inappropriate messages when the -Xcheck:jni flag is passed to the JVM. |
8611 | Java API | Enabled Java users to catch a C++ Java Native Implementation exception by catching the exception ourselves and rethrowing it as a Java OutOfMemoryError or RuntimeException. |
Bug # | Component | Description |
---|---|---|
8238 | RLI | Fixed occasional error loading RLI language profiles. |
2988 | RegExpLP | Fixed detection of Japanese dates of format YYYY年MM月. |
7360 | Build System | Removed occasional requirement for JAVA_15_HOME environment variable. |
8255 | RCLU | Fixed normalization error handling last character in Korean text. |
6476 | CSC | CSC reposts RAW_TEXT, TOKEN, and TOKEN_OFFSET results to reflect the converted text. |
Bug # | Component | Description |
---|---|---|
8255 | RCLU | Fixed error normalizing final character in Korean text. |
7941 | ScriptBoundary Detector | Fixed error returning script for Simplified Chinese. |
7965 | CLA | Fixed out-of-memory error processing large tokens. |
8221 | KLA | Fixed error reporting offsets in Korean text. |
8236 | RCLU | Fixed offset map out-of-bounds bug. |
8251 | RCLU | Fixed error mapping offsets for Unicode surrogate pairs in UTF-16 input. |
7979 | RLI | Fixed error handling '~' at the end of input text in special circumstances. |
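Bug 8251 in the table above involves offsets for surrogate pairs in UTF-16 input. The sketch below shows, under stated assumptions, why such offsets need care: a character outside the Basic Multilingual Plane occupies two 16-bit code units but is one code point. The function names are hypothetical and this is not RCLU's offset map; it only illustrates the general technique.

```cpp
#include <cstddef>
#include <string>
#include <vector>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// For each UTF-16 code unit, returns the index of the code point it belongs to.
// Both halves of a well-formed surrogate pair map to the same code point index;
// an unpaired surrogate is counted as its own code point.
std::vector<std::size_t> code_point_index_map(const std::u16string &text) {
    std::vector<std::size_t> map(text.size());
    std::size_t code_point = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        map[i] = code_point;
        bool is_pair = is_high_surrogate(text[i]) &&
                       i + 1 < text.size() &&
                       is_low_surrogate(text[i + 1]);
        if (is_pair) {
            map[i + 1] = code_point;  // low half shares the same index
            ++i;                      // skip the low half
        }
        ++code_point;
    }
    return map;
}
```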
Bug # | Component | Description |
---|---|---|
6986 | BL1 | When BL1 reaches the memory limit defined by the bl1config memory_limit attribute (the default is 200,000,000 bytes), a warning is logged (not an error), BL1 memory is cleared, and BL1 is re-initialized. Accordingly, an application can continue processing new languages without restarting the process. |
5926 | JLA | Fixed normalization of Kanji numbers. |
6950 | Java API | Fixed handling of empty Map objects that was causing a NullPointerException when attempting to return compounds from Japanese text that contains no compounds. |
4264 | GUI Demo | Fixed support for using an IME (input method editor) to enter Chinese, Korean, or Japanese text into the Windows GUI Demo. |
6209 | JLA | Fixed problem decomposing compounds in text-format (uncompiled) user dictionaries. |
6451 | Named Entities | Fixed memory leak handling user-defined entity names. |
6550 | ARBL | Fixed bidirectional problem resolving numbers in Arabic script. |
6909 | KLA | Fixed misuse of copyright symbol (©). |
7325 | Gazetteer LP | Significantly improved performance for the Gazetteer language processor and the underlying Aho-Corasick dictionary. |
Bug # | Component | Description |
---|---|---|
7557 | BL1 | Fixed bug handling a token in German text containing 100 's' characters by setting a limit on the number of variants that are generated for lookup. |
Bug # | Component | Description |
---|---|---|
6771 | BL1 | Fixed stack overflow handling a very large token. |
6495 | BL1 | Fixed crash-bug handling large tokens in Greek. |
Bug # | Component | Description |
---|---|---|
5370 | C++ API | BT_RLP_Context contains two methods that should, but did not, return constant pointers: GetStringResult and GetUTF16StringResult. The second of these may return a non-null-terminated string, but did not return the length. This has been fixed. See 5.3.0 New Features: Context object. |
6292 | RCLU | You can use RCLU to apply context property transforms (e.g., FormCNormalization) to input text that is already encoded as UTF-16 with the correct byte order. |
3687 | REXML | Spelling correction of a context property name. com.basistech.rexml.supress_header_comment has been corrected to com.basistech.rexml.suppress_header_comment. The old spelling is still supported. |
Bug # | Component | Description |
---|---|---|
4425 | Named Entities | The absence of Named Entity Redactor in a context no longer causes a failure to report all the named entities found by Named Entity Extractor, Gazetteer, and Regular Expressions. |
4782 | Gazetteer | Gazetteer now returns named entities that match the text but not the whitespace in the Gazetteer entries. |
5270 | BL1 | Rare problem with French input that caused truncation of the input stream has been fixed. |
2838 | BNP | Eliminated the splitting of some Japanese base noun phrases into multiple noun phrases. |
3253 | Gazetteer | Fixed thread safety problem with use of the Aho-Corasick string matching algorithm to locate named entities. |
3254 | Gazetteer | Eliminated rereading of gazetteers when a context performs multiple process operations. |
3357 | CLA | Added five context properties to set Chinese Language Analyzer reading options. |
3404 | C++ API | Accelerated the disambiguation of Chinese POS tags. |
3788 | Gazetteer | Ignore diacritics in Gazetteer entries to improve matches with documents that do not contain diacritics. |
4126 | Gazetteer | Fixed handling of false gazetteer entry matches that do not respect token boundaries, i.e., that begin or end with a token fragment. |
4127 | Gazetteer | Expose gazetteer names to the public API. |
5977 | LBD | Fixed Java Language Boundary Detector crash on Unix platforms. |
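Bug 4126 in the table above describes false gazetteer matches that begin or end with a token fragment. One way to picture the check, as a sketch under assumptions rather than the Gazetteer's actual logic, is to accept a candidate match only if it starts where some token starts and ends where some token ends. Span and respects_token_boundaries are hypothetical names, and offsets are half-open character ranges.

```cpp
#include <cstddef>
#include <vector>

// Half-open character range [start, end).
struct Span {
    std::size_t start;
    std::size_t end;
};

// True if the match begins at the start of some token and ends at the end of
// some token, so that it never covers only a fragment of a token.
bool respects_token_boundaries(const Span &match, const std::vector<Span> &tokens) {
    bool starts_on_token = false;
    bool ends_on_token = false;
    for (const Span &token : tokens) {
        if (token.start == match.start) starts_on_token = true;
        if (token.end == match.end) ends_on_token = true;
    }
    return match.start < match.end && starts_on_token && ends_on_token;
}
```

For the text "New York City" with tokens at [0,3), [4,8), and [9,13), a match over [0,8) ("New York") is accepted, while a match over [4,7) ("Yor") is rejected.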
Bug # | Component | Description |
---|---|---|
6771 | BL1 | Fixed problem processing long strings of unbroken text. |
Bug # | Component | Description |
---|---|---|
6174 | JLA | Fixed multi-threading problem tokenizing Japanese text. |
Bug # | Component | Description |
---|---|---|
2798 | BL1 | Corrected errors in POS tagging of punctuation for English, French, Italian, German, and Spanish. |
3979 | BL1 | Corrected errors in stem results for Greek. |
3996 | BL1 | Corrected handling of special symbol "|" in Polish. |
4145 | C++ API | Added cdecl modifier to all public static members to avoid linking problems experienced in some environments. |
4572, 4684 | RLBL | Fixed errors in language code returned by Language Boundary LP. |
4786 | API | Logger reports RLP version. |
4891 | Command-line utility | Fixed -describe to return description of context configuration. |
4896 | Packaging | Windows binary files now contain version information. |
4908 | BL1 | Fixed errors in stem results. |
5034 | CLA and JLA | CLA and JLA now return normalized numerics as stem results. |
5058 | JLA, CLA | Fixed handling of stopwords. |
5075 | BL1 | Report clitics boundaries as spaces in stem results. |
5172 | BL1 | Enabled user to reset the BL1 memory limit. |
4308 | Regular Expressions | Updated regex-config.xml to improve named-entity recognition of URL domain names and country codes, and international telephone numbers. |
4448 | ARBL | Fixed problem handling input with whitespace. |
4985 | KLA | Fixed memory leak. |
5059 | ARBL | Fixed error in normalization of numbers in stem results. |
3497 | Core | Fixed performance problem processing input text on Solaris. |
4403 | JLA | Fixed problem in tokenization of words in Latin script. |
4841 | JLA | Fixed problem recognizing certain verb endings. |
4875 | JLA | Fixed utility for building user dictionary. |
4992 | JLA | Fixed behavior of recursive decomposition when deep_compound_decomposition is set to true. |
5064 | JLA | Fixed error normalizing numbers. |
5130 | JON | Provide more complete Kanji orthographic data to normalization dictionary. |
5264 | RLI | Improved ability to distinguish Indonesian text from Malay. |
5270 | BL1 | Prevented a fatal error during Named Entity processing when BL1 fails to compute a token offset in French text (happens rarely). |
Bug # | Component | Description |
---|---|---|
4213 | Language Identifier LP | Mis-identified English text files. |
4293 | Japanese LP | Add kanji variations to the Japanese Orthographic Analyzer dictionary. |
Bug # | Component | Description |
---|---|---|
3776 | Arabic LP | Recognize year-identification in Arabic date strings |
3845 | RLP Core | Remove function-level statics |
3953 | Arabic LP | Tokenization Errors when some non-printable characters are present |
3977 | Arabic LP | Numeric stems not stored in stem vector |
3991 | Japanese LP | Add normalization to user dictionary builder |
4070 | Regular Expressions LP | Memory leak in RegExpLP with multiple contexts |
Bug # | Component | Description |
---|---|---|
3840 | RLP Core | BT_ROOT directory name with a space causes an error |
3855 | Named Entity Extractor | NE Macro mismatches |
3865 | Arabic LP | Valgrind error in bt_lp_arla.cpp |
Bug # | Component | Description |
---|---|---|
2147 | Language Identifier LP | Remove static initialization |
3395 | RLP Core | ProcessBuffer not implemented for Latin1 IP |
3477 | Korean LP | Memory leaks. |
3501 | Korean LP | Noun and stem results don't line up in sample output. |
3612 | Chinese LP | AT&T not parsed correctly |
3646 | Chinese LP | Segmentation fault on dual CPU machines. |
Bug # | Component | Description |
---|---|---|
2180 | Japanese LP | All the prefectures have to be marked as CMPD. |
2503 | Java API | The DOM return API has been deprecated in favor of the Java API. |
2610 | Arabic LP | Arabic LP processes entire token stream before doing POS disambiguation. |
2702 | Python Interfaces | Environment init fails leads to python crash. |
2897 | Demo | RLPWinForms does not display whole raw data - when the input file is not short, not all input is displayed in RAW_TEXT. This will be fixed by producing a new GUI in the future. |
2903 | REXML Output Processor | REX ML OP bus error (segfault) with ARBL variations. |
2937 | NE LP | Rex crashes without SentenceBoundaryDetector in context. |
2942 | Japanese LP | "志し" in the rex sample Japanese text is not properly segmented as a single token. |
2957 | Japanese LP | できます (dekimasu) is incorrectly segmented to で#きます (de#kimasu) in certain cases . |
2974 | Demo | When the Java sample program is executed with Korean text, an error occurs. |
2982 | Java API | REX Java unit tests report "'import site' failed". |
2983 | Japanese LP | Katakana words in UDD didn't work in some situations (single Katakana words in isolation were being segmented incorrectly). |
2991 | Demo | "abnormal program termination" happens when rdif is executed. |
3023 | Rex Core | Setting the same context property more than once doesn't work. |
3039 | Packaging | Accommodate bt_rex_config_xml_parser.cpp in the source package makefiles. |
3053 | Japanese LP | 8 of 47 prefectures in Japan are tokenized wrongly. |
3056 | EFIGS LP | Wrong German POS tags from efigs on Solaris. |
3057 | Japanese LP | Disabling Decomposition of Japanese Characters. |
3074 | EFIGS LP | English possessive suffixes ('s or ') were not being split from noun. |
3113 | Misc. IP | Truncated output when using source\samples\go.bat. |
3137 | Packaging | Need msvcr71.dll in Windows package. |
3141 | SBD LP | Warning from Sentence Boundary Detector for Simplified Chinese |
3150 | C++ API | BT_REX_Token_Iterator should include IsStopword() predicate |
The Regular Expression processor as configured with BT_ROOT/rlp/etc/regex-config.xml does not recognize Japanese "unit in square" quasi-characters, such as "㌢" (centi of centimeter) or "㎝" (cm as one code point).
The recommended workaround is to include the RCLU processor in the RLP Context (in place of the Unicode Converter) with com.basistech.rclu.FormKCNormalization set to true. RCLU normalizes these characters to regular text ("センチ" or "cm", respectively, for the examples cited above).
For example:
<?xml version="1.0"?>
<!DOCTYPE contextconfig SYSTEM "contextconfig.dtd">
<contextconfig>
  <properties>
    <property name="com.basistech.rclu.FormKCNormalization" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>RLI</languageprocessor>
    <languageprocessor>RCLU</languageprocessor>
    <languageprocessor>JLA</languageprocessor>
    <languageprocessor>SentenceBoundaryDetector</languageprocessor>
    <languageprocessor>BaseNounPhrase</languageprocessor>
    <languageprocessor>NamedEntityExtractor</languageprocessor>
    <languageprocessor>Gazetteer</languageprocessor>
    <languageprocessor>RegExpLP</languageprocessor>
    <languageprocessor>NERedactLP</languageprocessor>
  </languageprocessors>
</contextconfig>
(7845) The Java API does not yet provide access to one of the new result types introduced in 5.4.1: ALTERNATIVE_PARTS_OF_SPEECH.
(7897) The statistical model used by Arabic Base Linguistics does not yet factor in the lemma. Accordingly, if you set com.basistech.arbl.lemmas to true, the disambiguator may not pick the most likely candidate for LEMMA. If you also set com.basistech.arbl.alternatives to true, ALTERNATIVE_LEMMAS should contain the correct lemma, but it may not be the first alternative, even if com.basistech.bl1.query is set to false.
If you use an RLP makefile in RLP/samples, and JAVA_HOME is defined, you may receive this error:
$ make all BT_BUILD=ia32-glibc23-gcc32
../../../build_system/confs/java.conf:7: *** JAVA_HOME defined but JAVA_15_HOME is not. Stop.
The workaround is to set JAVA_15_HOME to the value of JAVA_HOME. In the recent upgrade to Java 1.5, Basis used JAVA_HOME to point to Java 1.4 for internal compatibility, and JAVA_15_HOME to point to Java 1.5. This issue will be addressed in the next release.
The script files for running the RLP samples use LD_LIBRARY_PATH to set the load library path. As noted in New Features, the correct environment variable for setting the load library path on the Mac OS 10.4 platform is DYLD_LIBRARY_PATH.
Each of these scripts contains a single line setting and exporting LD_LIBRARY_PATH. Change LD_LIBRARY_PATH to DYLD_LIBRARY_PATH. The scripts are in rlp/samples/scripts/ia32-darwin891-gcc40, and their use is explained in the "Getting Started" chapter in the RLP Application Developer's Guide.
(4425) You must include Named Entity Redactor in the context if the context includes more than one processor that generates named entities (Named Entity Extractor, Regular Expressions, Gazetteer). If the Named Entity Redactor is not included, only named entities found by the first of these three processors in the context are reported.
Java is not supported for 64-bit platforms.
(4782) Gazetteer only finds named entities with exactly the same whitespace as in the Gazetteer entry.
(5270) If BL1 is unable to compute a token offset (happens rarely with French input), it reports an error, truncates the token stream, and processes the text up to the problem token.
The Windows GUI demo, btrlpdemo.exe, is a beta release. Display problems (e.g., missing columns in the analysis window) may occur after extended use. This can be worked around by restarting the demo. Korean text is not displayed properly when using the base linguistics view of a Korean analysis. There is no workaround for this. The GUI demo requires an RDIF license. If you don't have one, you can request an evaluation license for RDIF.
The new BL1 functionality has resulted in several new problems:
Configuration: All European language processors (DELA, ENLA, ESLA, FRLA, and ITLA) have been replaced by the single language processor, BL1, and thus these older processors no longer appear in configuration files such as rlp-global.xml, rlp-context.xml, and rlp-context-no-op.xml.
POS Tags: All POS tags for European languages have changed. See the RLP Programming and API Guide for full details.
Threads: Currently, you can only run one analysis for BL1 at a time.
For more information, please consult Basis Technology's Rosette Linguistics Platform documentation:
RLP Application Developer's Guide (titled the RLP Programming and API Guide prior to release 5.0.0)
API References: C++, Java, C, .NET
For support, email ProductSupport@basistech.com. Please include the product name (RLP) in the subject line of your email.