Rosette Linguistics Platform 6.5.2 - Release Notes


Table of Contents

Introduction
Key Features
Platforms Supported for 6.5.2
New Features
Bugs Fixed
Known Problems
More Information

Copyright © 2004-2009 Basis Technology Corporation. All rights reserved. This document is property of and is proprietary to Basis Technology Corporation. It is not to be disclosed or reproduced in whole or in part without the express written consent of Basis Technology Corporation.

Web:  http://www.basistech.com
Support:  ProductSupport@basistech.com
Initial Release:  April 06, 2009

Introduction

The Rosette Linguistics Platform (RLP) is the backbone of Basis Technology's text and language analysis technology. RLP provides advanced natural-language processing techniques to help your applications unlock information in unstructured text. RLP includes modules for language and encoding identification, converting text to Unicode, identifying basic linguistic features, and locating key entities like the names of people, places, and objects of interest. RLP supports English and a variety of Asian, European, and Middle Eastern languages. The detailed linguistic information provided by RLP can be used to increase the accuracy and depth of information-retrieval, text-mining, entity-extraction, and other text-analysis applications.

A single C++, C, Java, or .NET API provides the calling application access to all of the capabilities and all of the languages supported by RLP for which the customer has obtained a license. Additional features can be added by simply adding new license keys and possibly relinking with a later release of the platform.

Key Features

RLP is packaged with modules for Named Entity Extraction (NE) and Base Linguistics (BL): base noun phrase detection, tokenization, sentence boundary detection, part-of-speech tagging, and morphological analysis including stemming, alternative readings (transcriptions), and compound analysis. These modules can process Arabic, Chinese, Czech, Dutch, English, Farsi (Persian), French, German, Greek, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Urdu.

Language support for each of these operations is indicated in the following table:

Table 1. RLP Language Support for Base Linguistics (BL) and Named Entity Extraction (NE)

Columns: Language; Base Linguistics (Tokenization, POS, SBD, BNP, Stemming, Compounds, Readings); NE
Arabicn/a 
Chinese (Simplified)n/an/a
Chinese (Traditional)n/an/a
Czech n/a 
Dutch 
English[a]n/an/a
Farsi (Persian)  n/a 
Frenchn/a 
German 
Greek n/a 
Hungarian  
Italiann/a 
Japanese
Korean  
Polish n/a 
Portuguesen/a 
Russian n/a 
Spanishn/a 
Urdu  n/a 

[a] RLP also provides specialized support for upper-case English text. When processing English text that is entirely upper case, specify the English Upper-Case language code (en_uc).


POS is part-of-speech tagging, SBD is sentence-boundary detection, and BNP is base-noun-phrase detection. For Chinese, the readings are pinyin transcriptions; for Japanese, the readings are Furigana transcriptions rendered in Hiragana. Blanks in this table indicate that the functionality is not available; n/a indicates that the feature does not apply to that language.

If you work with multilingual input data, RLP provides tools for locating regions of contiguous text in a single language, so that you can process each region with the appropriate language processors.

In addition to the languages listed above, the Rosette Language Identifier (RLI) can identify text in the following languages: Albanian, Transliterated Arabic, Bahasa Indonesia, Bahasa Malay, Bengali, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Gujarati, Hebrew, Hindi, Icelandic, Kannada, Kurdish, Latvian, Lithuanian, Malayalam, Norwegian, Pashto, Transliterated Pashto, Transliterated Farsi (Persian), Romanian, Serbian (Cyrillic and Latin), Slovak, Slovenian, Somali, Swedish, Tagalog, Telugu, Thai, Turkish, Ukrainian, Transliterated Urdu, Uzbek (Cyrillic and Latin), and Vietnamese.

Other key features:

  • RLP is written in a portable subset of ISO/ANSI C++.

  • C++, C, Java, and .NET APIs are available. The APIs do not vary from one human language to another.

  • Text is internally encoded in Unicode (UTF-16).

  • RLP operations are thread safe.

Note

RLP's features are enabled by license keys issued by Basis Technology. Please contact us to obtain the required evaluation or production license file, and refer to Installing RLP for information about where to put the license file.

Platforms Supported for 6.5.2

RLP 6.5.2 supports the following platforms. Tags identifying the operating system, CPU, and C++ compiler are embedded in the platform names. Java is supported except where noted otherwise.

Table 2. Supported Platforms

OS                      CPU                  Compiler                    BT_BUILD [a]
AIX 5.2                 PowerPC              xlc 5.2                     ppc-aix52-xlc [b]
FreeBSD 4.8             IA32                 gcc 3.4                     ia32-freebsd48-gcc34 [b]
FreeBSD 6.0             AMD64                gcc 3.4.4                   amd64-freebsd6-gcc344 [b]
FreeBSD 6.0             IA32                 gcc 3.4.4                   ia32-freebsd6-gcc344 [b]
HP-UX 11.00             IA64                 HP aCC 5.41                 ia64-hpux11-aCC541 [b]
HP-UX 11.00             PA-RISC32            HP aCC A.03.33              parisc-hpux11-aCC333-aa [b]
Linux (glibc 2.2)       IA32                 gcc 3.2                     ia32-glibc22-gcc32
Linux (glibc 2.3)       AMD64                gcc 3.4                     amd64-glibc23-gcc34
Linux (glibc 2.3)       AMD64                gcc 4.0                     amd64-glibc23-gcc40
Linux (glibc 2.3)       IA32                 gcc 3.2                     ia32-glibc23-gcc32
Linux (glibc 2.3)       IA32                 gcc 3.4.4                   ia32-glibc23-gcc34
Linux (glibc 2.3)       IA32                 gcc 4.0                     ia32-glibc23-gcc40
Linux (glibc 2.4)       AMD64                gcc 4.1                     amd64-glibc24-gcc41
Linux (glibc 2.4)       IA32                 gcc 4.1                     ia32-glibc24-gcc41
Linux (glibc 2.5)       AMD64                gcc 4.1                     amd64-glibc25-gcc41
Linux (glibc 2.5)       AMD64                gcc 4.2                     amd64-glibc25-gcc42
Linux (glibc 2.5)       IA32                 gcc 4.1                     ia32-glibc25-gcc41
Linux (glibc 2.5)       IA32                 gcc 4.2                     ia32-glibc25-gcc42
Mac OS 10.5 (Darwin 9)  32-bit/64-bit Intel  gcc 4.0                     universal-darwin9-gcc40
Solaris 10              AMD64                CC 5.8                      amd64-solaris10-cc58
Solaris 10              AMD64                gcc 4.1.2                   amd64-solaris10-gcc41
Solaris 10              IA32                 CC 5.8                      ia32-solaris10-cc58
Solaris 10              IA32                 gcc 3.4                     ia32-solaris10-gcc34
Solaris 10              SPARC32              CC 5.8                      sparc-solaris10-cc58
Solaris 10              SPARC64              CC 5.8                      sparc-solaris10-cc58-64
Solaris 10              SPARC64              gcc 4.1.2                   sparc-solaris10-gcc412-64
Solaris 7-8             SPARC32              CC 5.2 (Forte Developer 6)  sparc-solaris28-cc52
Solaris 7-8             SPARC64              CC 5.2 (Forte Developer 6)  sparc-solaris28-cc52-64
Solaris 9               IA32                 gcc 3.4.5                   ia32-solaris9-gcc34
Solaris 9               SPARC32              gcc 3.4                     sparc-solaris9-gcc34
Solaris 9               SPARC32              CC 5.8 (Sun Studio 11)      sparc-solaris9-cc58
Solaris 9               SPARC64              CC 5.8 (Sun Studio 11)      sparc-solaris9-cc58-64
Solaris 9               SPARC64              gcc 4.1                     sparc-solaris9-gcc41-64
Windows 32              IA32                 Visual Studio 7.1           ia32-w32-msvc71
Windows 32              IA32                 Visual Studio 7.1           ia32-w32-msvc71-static [c]
Windows 32              IA32                 Visual Studio 8.0           ia32-w32-msvc80
Windows 32              IA32                 Visual Studio 8.0           ia32-w32-msvc80-static [c]
Windows 64              AMD64                Visual Studio 8.0           amd64-w64-msvc80
Windows 64              AMD64                Visual Studio 8.0           amd64-w64-msvc80-static [c]

[a] BT_BUILD is embedded in the name of the download package. It is also the subdirectory name used in various locations for platform-specific files, such as binary library files.

[b] Java not supported.

[c] Built with a statically linked library. Does not include support for the Java API, the Core Library for Unicode (RCLU), iFilter, or HTML Stripper.


New Features

New Features in 6.5.2

  • Java API Updates in com.basistech.rlp.RLPResultAccess: Added access to NamedEntityData. Replaced getIntegerData() with getDetectedLanguage() and getDetectedScript(), which return the appropriate com.basistech.util Enum type (LanguageCode and ISO15924, respectively).

  • Added an API to guarantee that multiple instances of a given entity are returned with the same entity type.

  • Added facility for creating a blacklist of entities that are not to be returned by the Named Entity Extractor for the specified entity type.

  • For the Base Linguistics Language Analyzer, added caching of morphological data about commonly used words in English and German to accelerate linguistic processing.

New Features in 6.5.0

  • Named Entities. Expanded named entity support to include Russian and three new entity types: TITLE, NUMBER, and DISTANCE. For this release, we have acquired, annotated, and performed statistical training with new data; improved and added more regular expressions; and introduced internal language-specific binary gazetteers (created by Basis Technology). To improve performance, the PERCENT entity type has been removed.

    Added Regular Expression support for naming and reusing regular expressions (including expression fragments).

    Added Named Entity Redactor support for joining adjacent named entities into a single named entity. By default, adjacent TITLE and PERSON entities are joined into a PERSON entity.

    Added the com.basistech.neredact.prefer_length property. When set to true (the default), this property instructs the Named Entity Redactor to resolve a conflict between overlapping candidate entities in favor of the longer candidate.

    Added the com.basistech.neredact.max_entity_tokens property. When a named entity returned by NamedEntityExtractor contains more than this number of tokens (the default is 8), Named Entity Redactor discards the entity.

    Added the FragmentBoundaryDetector, which uses whitespace to separate items in fragmentary text (such as lists and tables), so that the NamedEntityExtractor will not combine a series of fragments into a single entity. The Regular Expression processor also contains a new context property (com.basistech.regexp.respect_boundaries) that you can set to instruct the processor not to cross fragment boundaries when matching text.

    Added support for language-specific text gazetteers.

    For a problem and workaround applying regular expressions to Japanese text with "unit in square" characters, see Known Problems in 6.5.0.

  • Expanded the .NET API to provide complete access to RLP functionality. The .NET API is modeled on the Java API.

  • Enhanced the Base Linguistics (BL1) processor to run simultaneously in multiple threads.

  • Replaced the Japanese Orthographic Normalizer (JON) with the ManyToOneNormalizer, which provides a multi-language utility for using language-specific normalization dictionaries to provide normalized tokens. We continue to distribute a Japanese normalization dictionary. Users can add their own normalization dictionaries for any of the languages we support.

    Important

    If you are using a context configuration file or string that contains <languageprocessor>JON</languageprocessor>, you must change this entry to <languageprocessor>ManyToOneNormalizer</languageprocessor>.

    For each token, JON returned a NORMALIZED_TOKEN, whereas ManyToOneNormalizer returns a MANY_TO_ONE_NORMALIZED_TOKEN.

  • Moved sample code for integrating RLP with Lucene and Solr into a separate RLP For Lucene package (rlplucene-6.0.0-sdk-unix.tar.gz or rlplucene-6.0.0-sdk-win.zip).

  • The sample Japanese normalization dictionary has been updated to include place name variants.

  • Added support for instantiating multiple Environment objects in the same process. Each of these Environment objects is a wrapper for the same underlying Environment. Accordingly, all Environment objects must be initialized with the same environment configuration (normally rlp-global.xml).

  • Replaced Unix Make files and Windows Visual Studio Solution and Project files with scripts for building the C++, C, and .NET sample applications. The Unix .sh scripts are designed to be run in a Bash shell. The Windows .bat scripts should be run in the Command Prompt. We continue to provide Ant scripts for building and running the Java sample applications, as well as Unix .sh scripts and Windows .bat scripts for running all the sample applications.

  • RLI can now detect UTF-16LE and UTF-16BE, even if the endianness of the text does not match the endianness of the host operating system. If the endianness of the file matches the endianness of the host, RLI reports the encoding as UTF-16. If the endianness of the file and the host do not match, RLI reports the full encoding: UTF-16LE or UTF-16BE.

  • Approximately 24,000 traditional Chinese words have been added to the Chinese dictionary.

  • Replaced the Mac OS 10.4 (Darwin 8.9.1) platform for 32-bit Intel with the universal Mac OS 10.5 (Darwin 9) platform for 32-bit and 64-bit Intel.

  • The sparc-solaris9-gcc345 platform has been renamed to sparc-solaris9-gcc34.
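
The UTF-16 reporting rule described above for RLI can be sketched in a few lines of Java. This is an illustration only, not RLP code: the class and method names are hypothetical, and the sketch reads a byte-order mark to decide the byte order of the text, which is an assumption (these notes do not say how RLI detects endianness).

```java
import java.nio.ByteOrder;

// Hypothetical illustration; not part of the RLP API.
final class Utf16Report {
    /** Returns the byte order signalled by a UTF-16 byte-order mark, or null if none is present. */
    static ByteOrder bomOrder(byte[] data) {
        if (data.length < 2) {
            return null;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return ByteOrder.BIG_ENDIAN;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return ByteOrder.LITTLE_ENDIAN;
        }
        return null;
    }

    /** Plain "UTF-16" when text and host byte order agree; otherwise the explicit variant. */
    static String reportedEncoding(ByteOrder textOrder, ByteOrder hostOrder) {
        if (textOrder == hostOrder) {
            return "UTF-16";
        }
        return textOrder == ByteOrder.BIG_ENDIAN ? "UTF-16BE" : "UTF-16LE";
    }

    public static void main(String[] args) {
        // BOM followed by the letter "A", stored little-endian.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        ByteOrder textOrder = bomOrder(sample);
        // The printed value depends on the byte order of the machine running this.
        System.out.println(reportedEncoding(textOrder, ByteOrder.nativeOrder()));
    }
}
```

An application that stores the reported encoding and later reads the data on a different machine may want to resolve a plain UTF-16 result to an explicit UTF-16LE or UTF-16BE first, since the plain form is relative to the host that ran the detection.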

Changes in 6.0.3

  • To obtain the Rosette Demo, download and install rlpdemo-6.0.3.4.msi.

  • RLP is no longer being built on the amd64-glibc23-gcc32 platform.

New Features in 6.0.2

  • This release removes dynamic dependencies on libgcc_s.so and libstdc++.so for the following Linux and Solaris platforms. The first three platforms are new; the last three have been changed to eliminate these dependencies. The compressed software installation packages are:

    • rlp-6.0.2-sdk-amd64-glibc25-gcc41.tar.gz
    • rlp-6.0.2-sdk-amd64-solaris10-gcc41.tar.gz
    • rlp-6.0.2-sdk-sparc-solaris9-gcc41-64.tar.gz
    • rlp-6.0.2-sdk-amd64-glibc23-gcc34.tar.gz
    • rlp-6.0.2-sdk-ia32-glibc23-gcc34.tar.gz
    • rlp-6.0.2-sdk-amd64-solaris10-gcc41 (amd64-solaris10-gcc41 was formerly called amd64-solaris10-gcc412)

Changes in 6.0.1

The Windows 32-bit SDK packages are now .zip files in place of .msi files, and the Rosette Demo has been moved from the SDK to a separate package.

New Features in 6.0.0

  • Added named entity support for Korean, Farsi (Persian), and Urdu.

  • Added support for new named entity types: RELIGION, NATIONALITY, GPE (a geo-political entity), and FACILITY (a man-made structure or architectural entity).

  • 366 neologisms have been added to the Japanese dictionary.

  • 171 neologisms have been added to the Chinese dictionary.

  • A preliminary .NET API that provides limited coverage of RLP functionality.

  • Enhancements to the Windows GUI Demo.

  • Merging of the Korean Hangul and compound noun dictionaries into a single compiled Korean user dictionary that users can edit and recompile.

  • The Rosette Language Identifier (RLI) returns DETECTED_SCRIPT, the ISO15924 code for the writing script of the text to be processed.

  • The Rosette Language Identifier (RLI) is able to detect UTF-16 encoding.

  • Moved the routine for scanning the RLP license and generating a list of supported features from the introductory RLP sample applications to separate C++, Java, and C sample applications.

  • Removed deprecated Java interface and classes from com.basistech package: ISO15924Constants, BTLanguageCodes, and ISO15924. Use com.basistech.util.ISO15924 and com.basistech.util.LanguageCode.

  • Expanded the scope of Tokenizer to tokenize all languages. In a context configuration, Tokenizer should be placed after processors that provide their own language-specific tokenization (BL1, CLA, JLA, and KLA), and (along with Sentence Boundary Detector) before processors that use the tokenization it provides (ARBL, FABL, and URBL).

  • Enabled use of the same RLP context with a Chinese Script converter (CSC) to convert simplified Chinese to traditional Chinese or vice versa.

New Features in 5.4.1

  • The Arabic Base Linguistics processor (ARBL) returns a new result type: lemma, and also returns multiple analyses of each token. Each analysis contains a stem, lemma, normalized token, root, and part-of-speech tag. See also Known Problems in 5.4.1.

  • Sample code to integrate RLP with Lucene and Solr, for applications that index and search English and Japanese documents. This code provides a starting point for creating analyzers to index and search documents in any of the languages that RLP supports.

New Features in 5.4.0

  • Base linguistics support for Farsi (Persian).

  • Base linguistics support for Urdu.

  • A C API.

  • Adoption of Java 1.5.

  • Java clients are no longer required to set the java.library.path system property. Additionally, this directory is no longer required on the Windows path.

  • Use of the BT_ROOT environment variable to set the root directory is no longer supported. New APIs to set the Basis root directory (the installation directory) have been introduced to replace deprecated APIs for setting the rlp root directory (a child of BT_ROOT). Utilities that took a BT_ROOT argument interpreted as the rlp root directory now take a BT_ROOT argument meaning the Basis root directory.

  • Deprecated BT_RLP_Environment::InitializeFromBuffer in favor of BT_RLP_Environment::InitializeFromFile.

  • In the com.basistech package, we have deprecated the com.basistech.ISO15924 interface and the ISOLanguageCodes and ISO15924 classes in favor of the new com.basistech.util.ISO15924 Enum.

  • In the com.basistech.rlp package, we have deprecated the RLPNEConstants interface and RLPNEUtils class in favor of com.basistech.util.NEConstants.

  • Utilities shared by RLP and other Basis Technology products have been segregated from RLP. If using Java, you must include both btrlp.jar and btutil.jar on the classpath. As noted above, some Java APIs have been deprecated in favor of new APIs in the utilities package: com.basistech.util.

  • Support for Chinese user dictionaries has been extended to include part-of-speech tags, decomposition patterns, and the compilation of binary dictionaries.

  • The Japanese Language Analyzer (JLA) identifies the dictionary (the standard dictionary or a user dictionary) in which each token is found.

  • Special syntax for entering Private Use Area (PUA) Unicode characters in Japanese user dictionaries.

  • Addition of 1403 nouns, 2657 proper nouns, and 4 verbs to the Chinese dictionary used by the Chinese Language Analyzer (CLA).

  • Addition of 68 nouns, 579 proper nouns, 5 adjectives, and 2 verbs to the Japanese dictionary used by the Japanese Language Analyzer (JLA).

  • RCLU returns a new result type if com.basistech.rclu.mapoffsets is set to true. MAP_OFFSETS is an array of unsigned integers that map the characters in normalized text to their location in the input text prior to normalization.

  • Optional builds with statically linked libraries for Windows.

  • Three processors have been removed from RLP: Arabic Transliteration/Transcription (artrans), Arabic Name Vocalizer (arvoc), and Arabic Name Variations (arvars). The functionality provided by these processors is in a separate product, the Rosette Name Translator (RNT).

New Features in 5.3.3

Improved Named Entity accuracy for Arabic and English.

New Features in 5.3.2

  • Improved Named Entity accuracy for English.

  • RCLU support for mapping token offsets in text that undergoes character transformations (normalization) to offsets in the original text prior to the normalization. Normalization may transform one Unicode character into two or three characters or vice versa. If you set the com.basistech.rclu.mapoffsets property to "true" or "yes" and apply com.basistech.rclu.FormCNormalization, com.basistech.rclu.FormDNormalization, com.basistech.rclu.FormKDNormalization, or com.basistech.rclu.FormKCNormalization, RCLU returns an array of unsigned integers in addition to the transformed text. Each element in the array is the original text character index corresponding to the character of transformed text indicated by the element's position in the array. If com.basistech.rclu.mapoffsets is set to "false" or "no" (the default), or a transformation other than the ones listed above is performed, this result is null.

  • Support for reading the RLP license directly from the RLP environment configuration, rather than from a separate file. The RLP environment configuration, which may be instantiated from a file or buffer, may now contain a license element with the license.

  • Support for Chinese user dictionaries has been extended to include part-of-speech tags, decomposition patterns, and the compilation of binary dictionaries.
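
The offset array that RCLU returns when com.basistech.rclu.mapoffsets is enabled can be pictured with a small stand-alone sketch. The code below does not call RLP. It uses the standard java.text.Normalizer to decompose the input one code point at a time (Form D) and records, for each character of output, the index of the input character it came from. The class and method names are hypothetical. Counting indices in UTF-16 code units and ignoring canonical reordering across code points are simplifying assumptions.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an offset map; not part of the RLP API.
final class OffsetMapSketch {
    /**
     * Appends the decomposed (NFD) form of input to out and returns an array in
     * which element k is the input index that produced output character k.
     */
    static int[] mapOffsets(String input, StringBuilder out) {
        List<Integer> map = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int width = Character.charCount(input.codePointAt(i));
            // Normalize one code point at a time so each output unit has a known origin.
            String piece = Normalizer.normalize(input.substring(i, i + width), Normalizer.Form.NFD);
            out.append(piece);
            for (int k = 0; k < piece.length(); k++) {
                map.add(i);
            }
            i += width;
        }
        return map.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        // U+00E9 (e with acute) decomposes into "e" followed by U+0301.
        int[] map = mapOffsets("caf\u00E9s", out);
        System.out.println(out.length());          // prints 6
        System.out.println(Arrays.toString(map));  // prints [0, 1, 2, 3, 3, 4]
    }
}
```

With such an array, a token found at offsets 3 to 5 in the normalized text can be traced back to offsets 3 to 4 in the original input by looking up map[3] and map[5].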

New Features in 5.3.0

  • Support for European language (BL1) user dictionaries.

  • Rosette Language Identifier (RLI) support for identifying a number of additional languages: Hindi, Tamil, Kannada, Telugu, Bengali, Gujarati, Malayalam, Ukrainian (Cyrillic), Serbian (Cyrillic and Latin), Uzbek (Cyrillic and Latin), transliterated Farsi, and transliterated Urdu (Farsi and Urdu in Arabic script already supported).

  • RLI support for the Traditional Chinese Big5-HKSCS encoding.

  • RLI can identify upper-case English as such, enabling more accurate processing of upper-case English when RLI identifies the language.

  • RLI has been retrained to improve accuracy detecting Russian and Bulgarian.

  • Named Entity Extractor accuracy improvements identifying named entities in Arabic, Chinese, English, and Japanese.

  • Addition of a C++ named entity iterator (BT_RLP_NE_Iterator) to streamline access to named entities.

  • Addition of 1,632 Chinese names and 89 non-Chinese names to the Chinese dictionary used by the Chinese Language Analyzer (CLA).

  • Addition of 1,758 nouns to the Japanese dictionary used by the Japanese Language Analyzer (JLA).

  • All language processors (including Unicode Converter, which used to be built in to the RLP environment) are dynamically loaded as needed.

  • C++ API changes for BT_RLP_Context. GetStringResult and GetUTF16StringResult now return pointers as constants. GetUTF16StringResult now takes a second argument, which returns the length of the non-null-terminated string. The new signatures are as follows:

    • BT_Char8 const *GetStringResult(BT_RLP_EntityType type)

    • BT_Char16 const *GetUTF16StringResult(BT_RLP_EntityType type, BT_UInt32& resultLength)

New Features in 5.2.0

  • Improved performance and accuracy for Arabic base linguistics and named entity extraction. For efficiency, folded all 12 pronoun POS tags (PRON_XX) into a single POS tag (PRONOUN), and all 7 demonstrative pronoun tags (DEM_PRON_XX) into a single POS tag (DEM_PRON).

  • Added a context property (com.basistech.bl.query) for processing queries (search terms) more accurately and efficiently.

  • Enhanced support for user-defined named entity types.

  • Added API support for supplying MIME type and encoding when processing input (you can also include language processors in the RLP context that detect MIME type and encoding).

  • Refactoring of language processors.

    • The Raw Input processor no longer exists. In its place, you simply start your context with RLI and RCLU.

    • The formal distinction between processor types (input, language, output) no longer exists.

    • RDIF functionality for extracting plain text from marked-up input (such as HTML, XML, PDF, and Microsoft Office documents) has been folded into new language processors: mime_detector, iFilter, and HTML Stripper.

  • Addition of the Chinese Script Converter (CSC). CSC is a language processor that converts text in Simplified Chinese script (used in the People's Republic of China) to text in Traditional Chinese script (used in Taiwan, Hong Kong, and Macau), or vice versa.

  • RLI has added support for GB 18030.

  • To enhance performance, RLP has switched from the PCRE regular expression engine to the Tcl regular expression engine.
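
Applications that stored or compared the older, finer-grained Arabic POS tags may need to fold them the same way the first item in this list describes. The sketch below is one way to do that on the application side. The class and method names are hypothetical, and the tag suffixes used in the example (such as PRON_1S) are placeholders, since these notes give the old tags only as PRON_XX and DEM_PRON_XX.

```java
// Hypothetical helper for application code; not part of the RLP API.
final class ArabicPosFoldSketch {
    /** Maps a pre-5.2.0 style tag to the folded tag; other tags pass through unchanged. */
    static String fold(String tag) {
        if (tag.startsWith("DEM_PRON_")) {
            return "DEM_PRON";
        }
        if (tag.startsWith("PRON_")) {
            return "PRONOUN";
        }
        return tag;
    }

    public static void main(String[] args) {
        // The suffixes below are made-up placeholders for the XX part of the old tags.
        System.out.println(fold("PRON_1S"));     // prints PRONOUN
        System.out.println(fold("DEM_PRON_MS")); // prints DEM_PRON
        System.out.println(fold("NOUN"));        // prints NOUN
    }
}
```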

New Features in 5.0.4

  • This release adds support for Mac OS 10.4. The compressed software and documentation installation packages for this platform are:

    • rlp-5.0.4-sdk-ia32-darwin891-gcc40.tar.gz
    • rlp-5.0.4-doc-unix.tar.gz

    In addition to the SDK (RLP) and release number (5.0.4), the software package filename identifies the architecture (IA32), the Unix core environment (Darwin 8.9.1), and the C++ compiler (gcc 4.0). Using the Basis Technology naming convention, BT_BUILD is ia32-darwin891-gcc40; see "Getting Started: Supported Platforms" in the RLP Application Developer's Guide.

    Note: This platform uses .dylib instead of .so as the file extension for shared objects, and DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH as the environment variable for setting the load library path. See Known Problems in RLP 5.0.4.

  • If you are using the Java API, Java 1.5 is required.

New Features in 5.0.1

Added C++ and Java API to enable an RLP application to ascertain license information.

C++ BT_RLP_Environment functions:

  • virtual bool HasLicenses() const = 0;

  • virtual bool HasLicenseForLanguage(BT_LanguageID lid, BT_UInt32 functionality) const = 0;

  • virtual bool HasLicenseForNamedFeature(const char *feature, BT_UInt32 functionality) const = 0;

For the arguments you can use for functionality and feature, see bt_rlp_license_types.h.

rlp_sample has been updated to illustrate the use of these new functions. See rlp_sample.cpp.

Java RLPEnvironment methods:

  • boolean hasLicenses() throws RLPException

  • boolean hasLicenseForLanguage(int language_id, int functionality) throws RLPException

  • boolean hasLicenseForNamedFeature(String feature, int functionality) throws RLPException

For the arguments you can use for functionality and feature, see RLPConstants.

RLPSample has been updated to illustrate the use of these new methods. See RLPSample.java.

New Features in 5.0.0

  • Rosette Language Identifier (RLI) enhancements: improved support for identifying language with short text streams, ability to detect Shift-JIS-2004 encoding, and support for identifying the following additional languages: Pashto, Urdu, Kurdish, Somali, transliterated (Romanized) Arabic, and transliterated (Romanized) Farsi. Added language codes to distinguish transliterated Arabic and transliterated Farsi from Arabic and Farsi in Arabic script.

  • Streamlining of input processors. The Raw Input processor, in conjunction with the RLI and RCLU language processors, can now handle Unicode and non-Unicode input encodings. The Latin 1 input processor is no longer required to handle US-ASCII and ISO 8859-1, so it has been eliminated.

  • Lexicon updates for the Japanese Language Analyzer (JLA) and the Chinese Language Analyzer (CLA):

    For Japanese, added 3 verbs, 2 given personal names, 3 foreign place names, 187 general proper nouns, 1731 common nouns, and 7 place names.

    For Chinese, added 15 adjectives, 47 adverbs, 395 phrases, 35 abbreviations, 125 noun phrases, 105 numeric entries, 196 proper nouns, 78 pronouns, 34 temporal nouns, and 194 verbs.

  • Speed improvements for the NamedEntityExtractor, improved support for handling upper-case English input, and added support for traditional Chinese.

  • Enhanced API support for accessing RLP result data. The C++ BT_RLP_ResultIterator class now includes an At function, which provides random access to individual results.

  • The default logging level has been changed from "none" to "error". See "Capturing Log Output" in Chapter 3 of the RLP Application Developer's Guide.

  • RLP Demo (GUI) fixes/enhancements.

  • Removal of support for the .NET interface. In a subsequent release, we plan to reintroduce the .NET interface with additional API to provide complete access to RLP functionality.

Java API Changes

  • RLPResultRandomAccess, RLPIntegerPair, and RLPIntegerTriple have been deprecated. Use RLPResultAccess to access result data. Depending on the result type, RLPResultAccess returns a List, Map, Integer, or String.

  • It is no longer necessary to call RLPInitialize.init() to initialize the JNI. Accordingly, the RLPInitialize class has been removed.

  • Introduction of the EnvironmentParameters class for defining the environment configuration, handling the logging callback object and log level, and setting the RLP root directory. The empty constructor RLPEnvironment() is deprecated. Use the new constructor after you have set up the EnvironmentParameters object: RLPEnvironment(EnvironmentParameters params).

New Features in 4.3.7

  • The Raw Input Processor has been updated to handle Unicode encodings properly.

New Features in 4.3.4

  • The performance of English, French, Italian, German, and Spanish named-entity processing has been improved by approximately 200%.

New Features in 4.3.3

  • The Gazetteer XML DTD and corresponding parser have been updated to support new features of the Global Name Matcher. Note: Apart from this readme file, the RLP documentation has not changed since Release 4.3.

New Features in 4.3.2

The Token Normalization processor is new in 4.3.2:

  • The token normalization processor, bt_lp_m1norm, is a many-to-one normalizer. It reads the normalized token vector, or the token vector if the normalized vector is not present. For each input token, it returns the normalized token, if one is found in the data source. Results are written to the normalized token vector.

    RLP is preconfigured with a Japanese Orthographic Normalization data source for use with the token normalization processor. See the JON processor listing in the rlp-global.xml file.
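
The behavior described above for the many-to-one normalizer can be sketched as a dictionary lookup. This is an illustration under stated assumptions, not the bt_lp_m1norm implementation: the class and method names are hypothetical, the data source is modeled as a plain Map, the sample entries are English placeholders rather than Japanese data, and a token with no dictionary entry is assumed to pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of many-to-one token normalization; not RLP code.
final class ManyToOneSketch {
    /**
     * Reads the normalized token vector if present, otherwise the token vector,
     * and returns a new normalized vector built from dictionary lookups.
     */
    static List<String> normalize(List<String> tokens, List<String> normalized,
                                  Map<String, String> dictionary) {
        List<String> source = (normalized != null) ? normalized : tokens;
        List<String> result = new ArrayList<>(source.size());
        for (String token : source) {
            // Assumption: tokens missing from the dictionary are kept as they are.
            result.add(dictionary.getOrDefault(token, token));
        }
        return result;
    }

    public static void main(String[] args) {
        // Several spelling variants map to a single normalized form.
        Map<String, String> dictionary = Map.of("colour", "color", "grey", "gray");
        List<String> tokens = List.of("colour", "grey", "cat");
        System.out.println(normalize(tokens, null, dictionary)); // prints [color, gray, cat]
    }
}
```

Reading the normalized vector first means this step can be chained after another normalizer, with each stage refining the output of the one before it.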

New Features in 4.3

  • The following processors are new in 4.3:

    • Raw Input Processor

      To enable the processing of non-Unicode input data, the Raw Input processor takes any raw text data and performs no encoding detection or conversion. This is only relevant for use with the RCLU language processor, which can take non-Unicode data and convert them to UTF-16 for succeeding language processors to use. In that case, the context definition must begin with RCLU and an encoding must be passed to the ProcessBuffer method. If you do not wish to provide an encoding, RLI must precede RCLU to detect the encoding.

    • Core Library for Unicode Language Processor (RCLU)

      The RCLU LP performs two functions. First, it will convert the input text to UTF-16 as required by other language processors. Second, it can perform certain character transformations on the input text. Encoding conversion will only occur if RCLU is used in conjunction with the Raw Input processor. Transformations are controlled using the context properties described above and are performed in the order they are listed in the context definition. If you do not provide an encoding, RLI must precede RCLU to detect the encoding. For a complete list of dependencies and context properties, please see the RLP Programming and API Guide.

    • Arabic Name Vocalizer Language Processor (ARVOC)

      The ARVOC language processor attempts to vocalize Arabic words under the assumption that they are likely to be personal names or parts of personal names. The selected text will be replaced with its fully-vocalized version under the prejudice that each word is likely to be part of a personal name. Words for which the proper vocalization could not be determined will be left intact (i.e., in their nonvocalized form). Words are vocalized on a one-by-one (context-free) basis. See the RLP Programming and API Guide for further details.

  • Significant accuracy improvements have been attained for Arabic base linguistics and English named entity extraction.

  • Support has been added for Czech, Greek, Hungarian, and Polish. See the Features by Language table above, and the BL1 section and POS Tags Appendix of the RLP Programming and API Guide for more information. In addition, BNP and NE are now available for Portuguese, and NE is now available for Dutch.

  • A new sample program for Windows, rlpW32demo.exe, has been added. The previous demo, RLPDemo.exe, has been removed.

New Features in 4.2.1

The following processors are new in 4.2.1:

  • bt_ip_htmlstripper, which does the following:

    1. Reads the HTML document and identifies the encoding via RLI.

    2. Converts to UTF-16 and transforms all SGML entities to UTF-16 characters.

    3. Strips out HTML tags and posts the stripped text as a BT_RLP_RAW_TEXT notice.

    4. Posts UTF-16 HTML document as BT_RLP_HTML_DOC notice on the blackboard.

    5. Posts HTML tags information as BT_RLP_HTML_TAGS notice on the blackboard.

  • bt_lp_htmljoiner, which does the following:

    1. Annotates the BT_RLP_HTML_DOC with Named Entity data.

    2. If the com.basistech.htmljoiner.do_artrans property is true, performs Arabic vocalization and transliteration of the Named Entity data.

    3. Posts the annotated HTML document as BT_RLP_HTML_DOC notice on the blackboard.

    Annotated Named Entity in the output HTML has the following format:

    <OBJECT title="NE type">Arabic NE Text<BDO dir="LTR">[transliterated 
    text]</BDO></OBJECT>

    In the BT_RLP_HTML_DOC notice, all meta tags with http-equiv="Content-Type" are converted to:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  • bt_op_dumphtml, which reads the BT_RLP_HTML_DOC notice and dumps the output as UTF-8.

    The following is a sample context configuration (included in the SDK package as samples/etc/rlp-html-context.xml). It defines a processing chain that consumes an HTML document and outputs an annotated HTML document via the DumpHTML output processor.

           <?xml version="1.0" encoding="utf-8" standalone="no"?>
            <!DOCTYPE contextconfig SYSTEM "http://www.basistech.com/dtds/2003/contextconfig.dtd">
            <contextconfig>   
              <properties>
                <property name="com.basistech.dumphtml.output_pathname" value="rlp-output.htm"/>
                <property name="com.basistech.htmljoiner.do_artrans" value="true"/>
              </properties>
              <inputprocessor>HTML Stripper</inputprocessor>
              <outputprocessor>DumpHTML</outputprocessor>
              <languageprocessors>
                <languageprocessor>RLI</languageprocessor>
                <languageprocessor>BL1</languageprocessor>
                <languageprocessor>Tokenizer</languageprocessor>
                <languageprocessor>JLA</languageprocessor>
                <languageprocessor>CLA</languageprocessor>
                <languageprocessor>KLA</languageprocessor>
                <languageprocessor>SentenceBoundaryDetector</languageprocessor>
                <languageprocessor>ARBL</languageprocessor>
                <languageprocessor>Stopwords</languageprocessor>
                <languageprocessor>BaseNounPhrase</languageprocessor> 
                <languageprocessor>NamedEntityExtractor</languageprocessor>
                <languageprocessor>Gazetteer</languageprocessor>
                <languageprocessor>RegExpLP</languageprocessor>
                <languageprocessor>NERedactLP</languageprocessor>
                <languageprocessor>HTML Joiner</languageprocessor>
              </languageprocessors>
            </contextconfig>
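
As a quick sanity check on a context configuration such as the one above, the processing chain can be listed with a short script. The following is a minimal sketch in Python using only the standard library. It does not call RLP, the helper name list_chain is hypothetical, and the ordering it reports (input processor, then language processors, then output processor) is an assumption about how the chain runs rather than something read from the file. The embedded sample is an abbreviated copy of the configuration shown above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample context above (fewer language processors).
SAMPLE_CONTEXT = b"""<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE contextconfig SYSTEM "http://www.basistech.com/dtds/2003/contextconfig.dtd">
<contextconfig>
  <properties>
    <property name="com.basistech.dumphtml.output_pathname" value="rlp-output.htm"/>
    <property name="com.basistech.htmljoiner.do_artrans" value="true"/>
  </properties>
  <inputprocessor>HTML Stripper</inputprocessor>
  <outputprocessor>DumpHTML</outputprocessor>
  <languageprocessors>
    <languageprocessor>RLI</languageprocessor>
    <languageprocessor>NamedEntityExtractor</languageprocessor>
    <languageprocessor>HTML Joiner</languageprocessor>
  </languageprocessors>
</contextconfig>
"""


def list_chain(xml_bytes):
    """Return (properties, chain) for a contextconfig document.

    properties maps each property name to its value; chain lists the input
    processor, the language processors in file order, and the output processor.
    """
    root = ET.fromstring(xml_bytes)  # the external DTD is not fetched
    props = {p.get("name"): p.get("value") for p in root.iter("property")}
    chain = [root.findtext("inputprocessor")]
    chain += [lp.text.strip() for lp in root.iter("languageprocessor")]
    chain.append(root.findtext("outputprocessor"))
    # Drop missing entries so a context without an input/output processor still works.
    return props, [name for name in chain if name]


props, chain = list_chain(SAMPLE_CONTEXT)
print(" -> ".join(chain))
```

A check like this can catch a misspelled processor name or a missing property before the context is handed to RLP.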

New Features in 4.2

  • Processing functionality for Dutch, Portuguese, and Russian has been added to the BL1 language processor. This new functionality includes changes to the bl1config.dtd file and new POS tags for each new language. See the RLP Programming and API Guide for full information.

  • Named Entity Extraction is now available for French, Italian, and Spanish.

  • Improvements were made for German decompounding and disambiguation.

New Features in 4.1

  • Beta release of improved European language performance. See the description of the BL1 processor in the RLP Programming and API Guide. A side effect of the introduction of this processor is that base noun phrase functionality and named entity extraction for French, Italian, and Spanish are temporarily unavailable. They will be available in the next release of RLP.

  • Improved AR-BL performance (now approximately 30 times faster).

  • Improved logging functionality:

    • BT_RLP_Environment::SetLogLevel and BT_RLP_Environment::SetLogCallbackFunction can now be called multiple times.

    • The logging callback now takes a "channel" parameter, allowing the function to examine the channel number and decide how to handle the message based on its severity. The new prototype is:

      typedef void (*BT_Log_callback_function)(void * callback_data, int channel, char const *string_to_log);

  • Improved gazetteer and regular expression (regex) files for demonstrating custom named entity extraction: The default configuration files shipped with RLP have been updated to facilitate experimenting with these features. The instructions below assume you are using the RLPDemo program that is part of the Windows distribution. The instructions concerning edits to the configuration files also apply if you are using the samples\scripts\go.sh script that comes with UNIX distributions or the samples\scripts\go.bat script that comes with Windows distributions.

    How to demo the gazetteer:

    1. Edit the file rlp\samples\etc\rlpdemo-gazetteer.txt; insert any demo words or phrases at the end of the file.

    2. Open a text file with those phrases in the RLPDemo GUI, or type them in, and press the "Named Entities" button.

      You should see your entries highlighted in grey as "GazDemo" type entities. The standard demo text that appears when you open RLPDemo has the words "identification", "morphological", and "extraction". These will be highlighted in grey.

    How to demo the regex entity locator:

    The standard demo has three types of regex entities defined to show up as "OTHER" in purple highlight.

    1. Type in one of these three types:

      • A US Social Security number (xxx-xx-xxxx)

      • A credit card number (xxxx-xx-xxxx)

      • The emoticon (smiley) ":)"

    2. Push "Named Entities."

      You will see valid Social Security numbers, credit card numbers, and the smiley highlighted in purple.

    3. You can edit the file rlp\etc\regex-config.xml to add your own regex entities. Look at the demo entity types near the top of the file.

    Notes

    This version classifies all regex entities as "NONE", which is shown as "OTHER" in the demo.

    Regex entity definition will be simplified in a future release to make it easier to define entities.

  • JLA now includes the capability to create and use one or more user dictionaries for words specific to an industry or application. See the JLA Processor section of the RLP Programming and API Guide for full details.

  • All European language processors (English, French, German, Italian, and Spanish) have been replaced by a beta version of a single new language processor called Base Linguistics (BL1). This new processor renders the old POSTagger processor obsolete and has changed all POS tags for these languages. In addition, tokenizing functionality is built in, and thus the Tokenizer processor will produce no results when run with BL1. For complete specifications, see the RLP Programming and API Guide.
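
The regex demo entity types described above (a US Social Security number in the form xxx-xx-xxxx and the ":)" emoticon) can be illustrated outside RLP. The following Python sketch uses the standard re module. The patterns and names are illustrative assumptions; they are not copied from rlp\etc\regex-config.xml, whose syntax may differ.

```python
import re

# Hypothetical patterns written for illustration only; they are not the
# definitions shipped in regex-config.xml.
DEMO_PATTERNS = {
    # Three digits, two digits, four digits, not embedded in a longer digit run.
    "SSN": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    # The emoticon ":)".
    "SMILEY": re.compile(r":\)"),
}


def find_demo_entities(text):
    """Return (label, start, end, matched_text) tuples sorted by start offset."""
    hits = []
    for label, pattern in DEMO_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Call me :) my number is 123-45-6789."
for hit in find_demo_entities(sample):
    print(hit)
```

The lookbehind and lookahead keep the Social Security pattern from matching inside a longer run of digits, which is one common source of false positives when experimenting with custom regex entities.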

New Features in 4.0.0

  • Support for French, Italian, and Spanish named entities.

  • Python is no longer used by RLP.

  • The distributed directory structure has been changed. All binaries are now in the top-level bin directory and all libraries are now in the top-level lib directory. Also, the rex directory has been renamed to rlp.

  • "REX" in the names of classes and structures has been replaced with "RLP". The include file bt_rex.h provides backward compatibility for the old names.

  • The Japanese lexicon has been improved. A comparison with the RLP 2.3 lexicon is given below.

  • The Chinese lexicon has been improved. A comparison with the RLP 2.3 lexicon is given below.

    This release includes Basis Technology Chinese lexicons with data from new providers. This may affect the Chinese Language Analyzer tokenization results in your application. Basis internal tests show that overall, segmentation accuracy has improved compared to the lexicons used in the 3.x release of CLA. These tests also show that over a very large corpus, the segmentations do not differ at all in 88% of cases. Of the cases where there are differences, CLA 4.0 was correct 53% of the time and CLA 3.x was correct 34% of the time.

    The following is a summary of the major differences.

    Category | CLA 4.0.0 | CLA 3.x.x
    verb phrases | Many treated as one token. Examples: 坐在, 去了, Ŋ到, 抄起 | Many treated as multiple tokens. Examples: 坐 在, 去 了, Ŋ 到, 抄 起
    adverb + verb | Treated as one token. Example: 也有 | Treated as multiple tokens. Example: 也 有
    popular noun phrases | Treated as one token. Examples: 本报讯, 党支部书记, 资产阶级自由化 | Treated as multiple tokens. Examples: 本报 讯, 党支部 书记, 资产阶级 自由化
    abbreviations | Treated as one token. Examples: 女足, 世锦赛 | Treated as multiple tokens. Examples: 女 足, 世 锦 赛
    not | Combinations treated as one token. Examples: 不是, 不要, 不会 | Combinations treated as two tokens. Examples: 不 是, 不 要, 不 会
    book | Combinations treated as two tokens. Examples: 华文 书, 英文 书 | Combinations treated as one token. Examples: 华文书, 英文书
    one, some | Combinations treated as two tokens. Example: 一 个 | Combinations treated as one token. Example: 一个
    的 (particle) | Not appended to some adjectives. Example: 红红 | Appended to some adjectives. Example: 红红的
    biggest | Treated as one token. Example: 最大 | Treated as two tokens. Example: 最 大

    In 4.0.0, idioms and proper nouns are considerably more up-to-date. The most significant decrease in version 4.0 accuracy is in the recognition of proper nouns, particularly Chinese personal names. Basis Technology is committed to making product improvements to deliver better accuracy, and a project is underway now to improve proper noun coverage in an upcoming release.

  • Japanese Lexicon Improvements:

    • New Dictionary: The Japanese dictionary has been revised extensively. New words have been added. Internal parts-of-speech have been updated.

    • Katakana segmentation: Unknown Katakana words will be segmented as one token.

    • Compound Verbs: Where possible, compound verbs will be segmented as one token.

    • SeparatePlaceNameFromSuffix Option: This option applies to Japanese Postal Address Level 1 and Level 2 words only.

    • IgnoreSeparators Option: In JLA 4.1, when set to false, JLA will always treat whitespace separators as token delimiters. In prior versions, this behavior was not always observed.

    • New Segmentation: Because of the major lexicon change and other significant changes in the segmentation of Katakana words and compound verbs, existing customers are strongly advised to consider re-segmenting existing data.

New Features in 2.3.0

  • Arabic and Chinese statistical algorithms have been implemented for named-entity extraction.

  • Chinese base noun phrase identification has been added.

  • Chinese part-of-speech algorithms have been improved.

  • Four new entity types have been added to RLP (ACE/TIDES entity types): FACILITY (FAC), GEOPOLITICAL ENTITY (GPE), VEHICLE (VEH), WEAPON (WEA).

  • Entity subtypes are now allowed.

  • Korean dictionary comments which were previously only in Korean are now also in English. (Files: ham-cnn.dic, ham-rma.dic, ham-usr.dic)

New Features in 2.2.1

  • Named Entity Maximum Entropy Labeling has been implemented.

New Features in 2.2

  • For this release, there is a new component for named entity tagging. This is expected to increase the accuracy of entity labeling for the languages that are already covered: English, German and Japanese. It also lays the groundwork for the addition of new languages in the near future. Other than the accuracy improvements, there should be no visible differences between the previous output and the new output.

  • Delay loading enables more efficient loading of dynamic libraries, as well as the ability to locate dependent libraries without setting the PATH environment variable. This feature is for Windows environments with MSVC 7.1.

  • The document type definition neconfig.dtd has changed. It is now:

    <!ELEMENT neconfig (ne-options)+>
    <!ELEMENT ne-options (nerfeasturedatapath | labelweightsdatapath)>
    <!ATTLIST ne-options language CDATA #REQUIRED>
    <!ELEMENT env EMPTY>
    <!ATTLIST env name CDATA #REQUIRED>
    <!ELEMENT nerfeasturedatapath (#PCDATA | env)*>
    <!ELEMENT labelweightsdatapath (#PCDATA | env)*>

  • featuredatapath and classdatapath are no longer used. labelweightsdatapath is used for named entity labeling. nerfeasturedatapath is used for named entity recognition (the spelling error is intentional for compatibility reasons).

  • Two new Chinese POS tags have been added to the Chinese Language Processor: NP - Noun Proper and U - Unknown.

  • The Gazetteer Text Processor Module allows users to create custom Named Entity definitions from their own data lists. See Chapter 4: RLP Processors of the RLP Programming and API Guide for documentation.

  • This release introduces two new APIs for setting and handling the root directory: BT_REX_Environment::SetRootDirectory and BT_REX_Environment::RootDirectory.

  • A new context property for the REXML Output Processor, suppress_header_comment, allows the option of turning off the header comment that includes the REX version used to generate the file.

  • RLP Arabic capabilities have been expanded to handle variant Arabic numeric orthography such as the use of 'reh' (U+0631) as the decimal (U+066B) separator.

  • This release adds support for Korean and Chinese to the Sentence Boundary Detector (SBD).

  • There is now a .NET API and a .NET sample program for RLP.

  • Three new POS tags appear in the Korean Language Processor: NN for numbers, FW for foreign (non-Korean) words and Unknown for words not recognized.

  • A new Japanese LP option, IgnoreSeparators (default TRUE), can be set in the jla-options.xml file.
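
The variant numeric orthography mentioned above, in which 'reh' (U+0631) stands in for the Arabic decimal separator (U+066B), can be illustrated with a small normalization sketch. This Python example is an illustration under stated assumptions, not RLP's implementation: it treats a reh as a decimal separator only when it sits directly between two Arabic-Indic digits (U+0660 to U+0669), and the function name is hypothetical.

```python
import re

REH = "\u0631"                # ARABIC LETTER REH
DECIMAL_SEPARATOR = "\u066B"  # ARABIC DECIMAL SEPARATOR

# A reh flanked on both sides by Arabic-Indic digits (U+0660..U+0669).
_REH_BETWEEN_DIGITS = re.compile(
    "(?<=[\u0660-\u0669])" + REH + "(?=[\u0660-\u0669])"
)


def normalize_reh_decimal(text):
    """Replace a reh used as a decimal separator with U+066B."""
    return _REH_BETWEEN_DIGITS.sub(DECIMAL_SEPARATOR, text)


variant = "\u0663\u0631\u0665"  # digit three, reh, digit five
normalized = normalize_reh_decimal(variant)
print([f"U+{ord(ch):04X}" for ch in normalized])
```

A reh that is part of an ordinary word is left alone because it is not surrounded by digits.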

Bugs Fixed

Bugs fixed in 6.5.2

Bug # | Component | Description
RLP-2449 | BL1 | Fixed stack overflow processing large sequences of space-delimited numbers (in some languages, spaces may be used as a thousands separator).
RLP-2455 | BL1 | Fixed processing error that disabled the handling of larger user dictionary entries.
RLP-2459 | Utilities | If the input for a user dictionary is invalid UTF-8, RLP now reports an error rather than truncating the input.
RLP-2462 | Java | Implemented a workaround for Sun bug 6791815 introduced in Java 1.6.0_04. The bug can cause deadlock on non-Solaris platforms when multiple threads are attempting to initialize a direct buffer. The workaround is to create the direct buffer in a single thread before a race condition can arise.
RLP-2426 | RLBL | In order to avoid inaccurate language identification based on very small script regions, changed the default minimum text region (com.basistech.lbd.min_region) that the LanguageBoundary processor will analyze from 0 to 20 Unicode characters.

Bugs fixed in 6.5.1

Bug # | Component | Description
RLP-2312 | Regular Expressions | Fixed a segmentation fault in RegExpLP handling input that is empty or contains only whitespace.
RLP-2410 | Regular Expressions | Eliminated a hang in RegExpLP that was sometimes triggered by a low-memory condition.
RLP-2077 | Entity Redactor | Added rules for eliminating entities that contain non-alphanumeric data.
RLP-2398 | BL1 | Improved handling of out-of-memory conditions.
RLI-274 | RLI | Fixed error identifying upper-case English as Pashto.
CLA-435 | CLA | Eliminated a logic error that led to the allocation of an extra zero-length token while segmenting a particular sequence of Chinese input characters.
RLP-2421 | Java API | Fixed a NullPointerException that was thrown looking for part-of-speech tags in input that is empty or contains only whitespace.

Bugs fixed in 6.5.0

Bug # | Component | Description
RLP-2022 | Language Boundary Detector | Revised the algorithm to check script regions, and if ambiguous, text regions (sentences) within script regions. Problems reporting gaps between regions and dividing a single language region into multiple regions have been fixed.
RLP-2030 | RLI | RLP now understands the correct ISO639 codes for Hebrew (he, not iw), Yiddish (yi, not ji), and Indonesian (id, not in) as returned by RLI.
RLP-2141 | C++ API | BT_RLP_NE_Iterator has been fixed to handle language-neutral named entities when the language is not known.
CLA-421 | CLA | Fixed corruption of surrogate pairs (a pair of 16-bit words representing a Unicode character) when handling user-dictionary entries.
CLA-425, CLA-426, CLA-427 | CLA | Fixed an error in the Pinyin reading dictionary, which in a limited set of cases was producing some slightly incorrect readings.
RLP-2170 | JLA | Fixed tokenizing error handling newline at end of a sentence.
JLA-295 | JLA | Eliminated reporting of zero-length tokens.
RLP-2140 | BL1 | To maintain consistency with other language processors, changed error message to a warning when the language is unknown.
RLP-1833 | CSC | Fixed error reporting whitespace tokens.
RLP-1829 | Java API | The Java command-line utility (com.basistech.rlp.RLP) now reports a meaningful message when the input file it has been told to process is missing.
RLP-1792 | Named Entities | Fixed a boundary-case logic error on 64-bit platforms that in some instances caused a crash when posting Named Entity results from multiple processors.
RLP-1775 | Tokenizer | Fixed error tokenizing a series of numbers separated by whitespace as separate numbers. In specific well-defined cases (such as a space used as a thousands separator in French) a single number may still include whitespace.
REX-233 | Named Entity Redactor | When resolving conflicts between overlapping candidate entities, Named Entity Redactor was returning two entities: the winning entity and the portion of the losing candidate that was not in the overlap. Now only the winning entity is returned.
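
The RLP-2030 fix above concerns the older codes iw, ji, and in, which the entry lists alongside their correct forms he, yi, and id. An application that stores language codes from other sources may want to apply the same mapping before comparing them with codes returned by RLI. The following is a minimal Python sketch; the function name is hypothetical and no RLP API is called.

```python
# Older two-letter codes and their correct replacements, as listed in the
# RLP-2030 entry above.
LEGACY_ISO639 = {"iw": "he", "ji": "yi", "in": "id"}


def canonical_language_code(code):
    """Map an older ISO 639 code to its current form; pass others through."""
    code = code.strip().lower()
    return LEGACY_ISO639.get(code, code)


for legacy in ("iw", "ji", "in", "fr"):
    print(legacy, "->", canonical_language_code(legacy))
```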

Bug fixed in RLP 6.0.4

Bug # | Component | Description
RLP-2129 | NamedEntityExtractor | Corrected a memory management error extracting references to named entities.

Bugs fixed in 6.0.3

Bug # | Component | Description
8051 | CLA | Fixed CLA tokenization of non-Chinese text.
8440 | Tokenizer | Fixed tokenization of numbers separated by white space.
8475/8623 | RCLU | Fixed FormKCNormalization to correctly handle decomposition of characters that span a buffer boundary.
8607 | JLA | Fixed normalization of dates.
8669 | Language processors | Fixed handling of documents that only contain whitespace.
8690 | Regular Expressions | Fixed boundary condition problem in RegExpLP which could cause a crash in NERedactLP.

Bugs fixed in 6.0.1

Bug # | Component | Description
8461 | Named Entities | Fixed handling of boundary case on 64-bit platforms when posting Named Entity results from multiple processors.
8482 | Sample Applications | Fixed crash in C sample program when tokenization of the input does not occur because the user does not have the required license.
8563 | Chinese Script Converter | Fixed error handling whitespace.
8564 | Chinese Script Converter | Now reposts DETECTED_LANGUAGE and DETECTED_SCRIPT so subsequent processors have access to the language code and writing script code for the converted text.
8574 | Java API | Fixed Java Native Implementation of log callbacks to avoid inappropriate messages when the -Xcheck:jni flag is passed to the JVM.
8611 | Java API | Enabled Java users to catch a C++ Java Native Implementation exception by catching the exception ourselves and rethrowing it as a Java OutOfMemoryError or RuntimeException.
Bugs fixed in 6.0.0

Bug # | Component | Description
8238 | RLI | Fixed occasional error loading RLI language profiles.
2988 | RegExpLP | Fixed detection of Japanese dates of format YYYY年MM月.
7360 | Build System | Removed occasional requirement for JAVA_15_HOME environment variable.
8255 | RCLU | Fixed normalization error handling last character in Korean text.
6476 | CSC | CSC reposts RAW_TEXT, TOKEN, and TOKEN_OFFSET results to reflect the converted text.

Bugs fixed in 5.4.2

Bug # | Component | Description
8255 | RCLU | Fixed error normalizing final character in Korean text.
7941 | Script Boundary Detector | Fixed error returning script for Simplified Chinese.
7965 | CLA | Fixed out-of-memory error processing large tokens.
8221 | KLA | Fixed error reporting offsets in Korean text.
8236 | RCLU | Fixed offset map out-of-bounds bug.
8251 | RCLU | Fixed error mapping offsets for Unicode surrogate pairs in UTF-16 input.
7979 | RLI | Fixed error handling '~' at the end of input text in special circumstances.

Bugs fixed in 5.4.0

Bug # | Component | Description
6986 | BL1 | When BL1 reaches the memory limit defined by the bl1config memory_limit attribute (the default is 200,000,000 bytes), a warning is logged (not an error), BL1 memory is cleared, and BL1 is re-initialized. Accordingly, an application can continue processing new languages without restarting the process.
5926 | JLA | Fixed normalization of Kanji numbers.
6950 | Java API | Fixed handling of empty Map objects that was causing a NullPointerException when attempting to return compounds from Japanese text that contains no compounds.
4264 | GUI Demo | Fixed support for using an IME (input method editor) to enter Chinese, Korean, or Japanese text into the Windows GUI Demo.
6209 | JLA | Fixed problem decomposing compounds in text-format (uncompiled) user dictionaries.
6451 | Named Entities | Fixed memory leak handling user-defined entity names.
6550 | ARBL | Fixed bidirectional problem resolving numbers in Arabic script.
6909 | KLA | Fixed misuse of copyright symbol (©).
7325 | Gazetteer LP | Significantly improved performance for the Gazetteer language processor and the underlying Aho-Corasick dictionary.

Bugs Fixed in 5.3.4

Bug # | Component | Description
7557 | BL1 | Fixed bug handling a token in German text containing 100 's' characters by setting a limit on the number of variants that are generated for lookup.

Bugs Fixed in 5.3.3

Bug # | Component | Description
7095 | JLA | Fixed rare crash during Japanese tokenization.

Bugs Fixed in 5.3.1

Bug # | Component | Description
6771 | BL1 | Fixed stack overflow handling a very large token.
6495 | BL1 | Fixed crash handling large tokens in Greek.

Bugs Fixed in 5.3.0

Bug # | Component | Description
5370 | C++ API | BT_RLP_Context contained two methods that should, but did not, return constant pointers: GetStringResult and GetUTF16StringResult. The second of these may return a non-null-terminated string, but did not return the length. This has been fixed. See 5.3.0 New Features: Context object.
6292 | RCLU | You can use RCLU to apply context property transforms (e.g., FormCNormalization) to input text that is already encoded as UTF-16 with the correct byte order.
3687 | REXML | Spelling correction of a context property name. com.basistech.rexml.supress_header_comment has been corrected to com.basistech.rexml.suppress_header_comment. The old spelling is still supported.

Bugs Fixed in 5.2.0

Bug # | Component | Description
4425 | Named Entities | The absence of Named Entity Redactor in a context no longer causes a failure to report all the named entities found by Named Entity Extractor, Gazetteer, and Regular Expressions.
4782 | Gazetteer | Gazetteer now returns named entities that match the text but not the whitespace in the Gazetteer entries.
5270 | BL1 | Rare problem with French input that caused truncation of the input stream has been fixed.
2838 | BNP | Eliminated the splitting of some Japanese base noun phrases into multiple noun phrases.
3253 | Gazetteer | Fixed thread safety problem with use of the Aho-Corasick string matching algorithm to locate named entities.
3254 | Gazetteer | Eliminated rereading of gazetteers when a context performs multiple process operations.
3357 | CLA | Added five context properties to set Chinese Language Analyzer reading options.
3404 | C++ API | Accelerated the disambiguation of Chinese POS tags.
3788 | Gazetteer | Ignore diacritics in Gazetteer entries to improve matches with documents that do not contain diacritics.
4126 | Gazetteer | Fixed handling of false gazetteer entry matches that do not respect token boundaries, i.e., that begin or end with a token fragment.
4127 | Gazetteer | Expose gazetteer names to the public API.
5977 | LBD | Fixed Java Language Boundary Detector crash on Unix platforms.

Bugs fixed in 5.0.3

Bug # | Component | Description
6771 | BL1 | Fixed problem processing long strings of unbroken text.

Bugs fixed in 5.0.2

Bug # | Component | Description
6174 | JLA | Fixed multi-threading problem tokenizing Japanese text.

Bugs Fixed in 5.0

Bug # | Component | Description
2798 | BL1 | Corrected errors in POS tagging of punctuation for English, French, Italian, German, and Spanish.
3979 | BL1 | Corrected errors in stem results for Greek.
3996 | BL1 | Corrected handling of special symbol "|" in Polish.
4145 | C++ API | Added cdecl modifier to all public static members to avoid linking problems experienced in some environments.
4572, 4684 | RLBL | Fixed errors in language code returned by Language Boundary LP.
4786 | API | Logger reports RLP version.
4891 | Command-line utility | Fixed -describe to return description of context configuration.
4896 | Packaging | Windows binary files now contain version information.
4908 | BL1 | Fixed errors in stem results.
5034 | CLA and JLA | CLA and JLA now return normalized numerics as stem results.
5058 | JLA, CLA | Fixed handling of stopwords.
5075 | BL1 | Report clitics boundaries as spaces in stem results.
5172 | BL1 | Enabled user to reset the BL1 memory limit.
4308 | Regular Expressions | Updated regex-config.xml to improve named-entity recognition of URL domain names, country codes, and international telephone numbers.
4448 | ARBL | Fixed problem handling input with whitespace.
4985 | KLA | Fixed memory leak.
5059 | ARBL | Fixed error in normalization of numbers in stem results.
3497 | Core | Fixed performance problem processing input text on Solaris.
4403 | JLA | Fixed problem in tokenization of words in Latin script.
4841 | JLA | Fixed problem recognizing certain verb endings.
4875 | JLA | Fixed utility for building user dictionary.
4992 | JLA | Fixed behavior of recursive decomposition when deep_compound_decomposition is set to true.
5064 | JLA | Fixed error normalizing numbers.
5130 | JON | Provide more complete Kanji orthographic data to normalization dictionary.
5264 | RLI | Improved ability to distinguish Indonesian text from Malay.
5270 | BL1 | Prevented a fatal error during Named Entity processing when BL1 fails to compute a token offset in French text (happens rarely).

Bugs Fixed in 4.3.4

Bug # | Component | Description
4213 | Language Identifier LP | Mis-identified English text files.
4293 | Japanese LP | Add kanji variations to the Japanese Orthographic Analyzer dictionary.

Bugs Fixed in 4.3

Bug # | Component | Description
3776 | Arabic LP | Recognize year-identification in Arabic date strings.
3845 | RLP Core | Remove function-level statics.
3953 | Arabic LP | Tokenization errors when some non-printable characters are present.
3977 | Arabic LP | Numeric stems not stored in stem vector.
3991 | Japanese LP | Add normalization to user dictionary builder.
4070 | Regular Expressions LP | Memory leak in RegExpLP with multiple contexts.

Bugs Fixed in 4.2

Bug # | Component | Description
3840 | RLP Core | BT_ROOT directory name with a space causes an error.
3855 | Named Entity Extractor | NE Macro mismatches.
3865 | Arabic LP | Valgrind error in bt_lp_arla.cpp.

Bugs Fixed in 4.0

Bug # | Component | Description
2147 | Language Identifier LP | Remove static initialization.
3395 | RLP Core | ProcessBuffer not implemented for Latin1 IP.
3477 | Korean LP | Memory leaks.
3501 | Korean LP | Noun and stem results don't line up in sample output.
3612 | Chinese LP | AT&T not parsed correctly.
3646 | Chinese LP | Segmentation fault on dual CPU machines.

Bugs Fixed in 2.3

Bug # | Component | Description
2180 | Japanese LP | All the prefectures have to be marked as CMPD.
2503 | Java API | The DOM return API has been deprecated in favor of the Java API.
2610 | Arabic LP | Arabic LP processes entire token stream before doing POS disambiguation.
2702 | Python Interfaces | Environment init failure leads to Python crash.
2897 | Demo | RLPWinForms does not display whole raw data: when the input file is not short, not all input is displayed in RAW_TEXT. This will be fixed by producing a new GUI in the future.
2903 | REXML Output Processor | REX ML OP bus error (segfault) with ARBL variations.
2937 | NE LP | Rex crashes without SentenceBoundaryDetector in context.
2942 | Japanese LP | "志し" in the rex sample Japanese text is not properly segmented as a single token.
2957 | Japanese LP | できます (dekimasu) is incorrectly segmented to で#きます (de#kimasu) in certain cases.
2974 | Demo | When the Java sample program is executed with Korean text, an error occurs.
2982 | Java API | REX Java unit tests report "'import site' failed".
2983 | Japanese LP | Katakana words in UDD didn't work in some situations (single Katakana words in isolation were being segmented incorrectly).
2991 | Demo | "abnormal program termination" happens when rdif is executed.
3023 | Rex Core | Setting the same context property more than once doesn't work.
3039 | Packaging | Accommodate bt_rex_config_xml_parser.cpp in the source package makefiles.
3053 | Japanese LP | 8 of 47 prefectures in Japan are tokenized wrongly.
3056 | EFIGS LP | Wrong German POS tags from efigs on Solaris.
3057 | Japanese LP | Disabling Decomposition of Japanese Characters.
3074 | EFIGS LP | English possessive suffixes ('s or ') were not being split from noun.
3113 | Misc. IP | Truncated output when using source\samples\go.bat.
3137 | Packaging | Need msvcr71.dll in Windows package.
3141 | SBD LP | Warning from Sentence Boundary Detector for Simplified Chinese.
3150 | C++ API | BT_REX_Token_Iterator should include IsStopword() predicate.

Known Problems

Known Problem in 6.5.0

The Regular Expression processor as configured with BT_ROOT/rlp/etc/regex-config.xml does not recognize Japanese "unit in square" quasi-characters, such as "㌢" (centi of centimeter) or "㎝" (cm as one code point).

The recommended workaround is to include the RCLU processor in the RLP Context (in place of the Unicode Converter) along with com.basistech.rclu.FormKCNormalization set to true. RCLU normalizes these characters to regular text ("センチ" or "cm", respectively, for the examples cited above).

For example:

<?xml version="1.0"?>
<!DOCTYPE contextconfig SYSTEM "contextconfig.dtd">
<contextconfig>
  <properties>
    <property name="com.basistech.rclu.FormKCNormalization" value="true"/>
  </properties>
  <languageprocessors>
   <languageprocessor>RLI</languageprocessor>
   <languageprocessor>RCLU</languageprocessor>
   <languageprocessor>JLA</languageprocessor>
    <languageprocessor>SentenceBoundaryDetector</languageprocessor>
    <languageprocessor>BaseNounPhrase</languageprocessor> 
    <languageprocessor>NamedEntityExtractor</languageprocessor>
    <languageprocessor>Gazetteer</languageprocessor>
    <languageprocessor>RegExpLP</languageprocessor>
    <languageprocessor>NERedactLP</languageprocessor>
  </languageprocessors>
</contextconfig>
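
The effect of the normalization step can be checked independently of RLP. The sketch below uses Python's standard unicodedata module to apply Unicode NFKC normalization (the standard form that the FormKCNormalization property name appears to refer to) to the two characters cited above. It illustrates the Unicode transform only and makes no claim about how RCLU implements it; the helper name nfkc is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode Normalization Form KC (compatibility composition)."""
    return unicodedata.normalize("NFKC", text)


# The two "unit in square" characters cited in the known problem above.
for ch in ("㌢", "㎝"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {nfkc(ch)}")
```

After normalization, each square character becomes ordinary text that a regular expression written for plain Katakana or Latin letters can match.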

Known Problems in 5.4.1

(7845) The Java API does not yet provide access to one of the new result types introduced in 5.4.1: ALTERNATIVE_PARTS_OF_SPEECH.

(7897) The statistical model used by Arabic Base Linguistics does not yet factor in the lemma. Accordingly, if you set com.basistech.arbl.lemmas to true, the disambiguator may not pick the most likely candidate for LEMMA. If you also set com.basistech.arbl.alternatives to true, ALTERNATIVE_LEMMAS should contain the correct lemma, but it may not be the first alternative, even if com.basistech.bl1.query is set to false.

If you use an RLP makefile in RLP/samples, and JAVA_HOME is defined, you may receive this error:

$ make all BT_BUILD=ia32-glibc23-gcc32
  ../../../build_system/confs/java.conf:7: *** JAVA_HOME defined but JAVA_15_HOME is not.  Stop.

The workaround is to set JAVA_15_HOME to the value of JAVA_HOME. In the recent upgrade to Java 1.5, Basis used JAVA_HOME to point to Java 1.4 for internal compatibility, and JAVA_15_HOME to point to Java 1.5. This issue will be addressed in the next release.

Known Problems in 5.3.0 and 5.3.1

Java is not supported for 64-bit platforms.

Known Problems in 5.2.0

Java is not supported for 64-bit platforms.

Known Problems in 5.0.4

The script files for running the RLP samples use LD_LIBRARY_PATH to set the load library path. As noted in New Features, the correct environment variable for setting the load library path on the Mac OS 10.4 platform is DYLD_LIBRARY_PATH.

Each of these scripts contains a single line setting and exporting LD_LIBRARY_PATH. Change LD_LIBRARY_PATH to DYLD_LIBRARY_PATH. The scripts are in rlp/samples/scripts/ia32-darwin891-gcc40, and their use is explained in the "Getting Started" chapter in the RLP Application Developer's Guide.

  • go.sh
  • go-cpp-samples.sh
  • go-java-samples.sh
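
The edit described above can be scripted. The following Python sketch is one possible approach and not a tool shipped with RLP; the function name and the sample script line are hypothetical. It renames LD_LIBRARY_PATH to DYLD_LIBRARY_PATH in a script's text, using a lookbehind so that a script that has already been fixed is left unchanged.

```python
import re

# Match LD_LIBRARY_PATH only when it is a whole name, so that an existing
# DYLD_LIBRARY_PATH is not rewritten a second time.
_LD_PATH = re.compile(r"(?<![A-Za-z0-9_])LD_LIBRARY_PATH(?![A-Za-z0-9_])")


def use_dyld_library_path(script_text):
    """Return script_text with LD_LIBRARY_PATH renamed to DYLD_LIBRARY_PATH."""
    return _LD_PATH.sub("DYLD_LIBRARY_PATH", script_text)


# Hypothetical script line, for illustration only.
line = "LD_LIBRARY_PATH=$BT_ROOT/rlp/lib; export LD_LIBRARY_PATH\n"
fixed = use_dyld_library_path(line)
print(fixed, end="")
```

To apply it, read each of the three scripts listed above, pass the contents through the function, and write the result back. Because the substitution is idempotent, running it twice is harmless.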

Known Problems in 5.0.0

(4425) You must include the Named Entity Redactor in the context if the context includes more than one processor that generates named entities (Named Entity Extractor, Regular Expressions, Gazetteer). If the Named Entity Redactor is not included, only named entities found by the first of these three processors in the context are reported.

Java is not supported for 64-bit platforms.

(4782) Gazetteer only finds named entities with exactly the same whitespace as in the Gazetteer entry.

(5270) If BL1 is unable to compute a token offset (happens rarely with French input), it reports an error, truncates the token stream, and processes the text up to the problem token.

Known Problems in 4.3.0-4.3.4

The Windows GUI demo, btrlpdemo.exe, is a beta release. Display problems (e.g., missing columns in the analysis window) may occur after extended use. This can be worked around by restarting the demo. Korean text is not displayed properly when using the base linguistics view of a Korean analysis. There is no workaround for this. The GUI demo requires an RDIF license. If you don't have one, you can request an evaluation license for RDIF.

Known Problems in 4.2

The new BL1 functionality has resulted in several new problems:

  • Configuration: All European language processors (DELA, ENLA, ESLA, FRLA, and ITLA) have been replaced by the single language processor, BL1, and thus these older processors no longer appear in configuration files such as rlp-global.xml, rlp-context.xml, and rlp-context-no-op.xml.

  • POS Tags: All POS tags for European languages have changed. See the RLP Programming and API Guide for full details.

  • Threads: Currently, you can only run one analysis for BL1 at a time.

More Information

For more information, please consult Basis Technology's Rosette Linguistics Platform documentation:

  • RLP Application Developer's Guide (titled the RLP Programming and API Guide prior to release 5.0.0)

  • API References: C++, Java, C, .NET

For support, email ProductSupport@basistech.com. Please include the product name (RLP) in the subject line of your email.