BasisTechnology::RLP Namespace Reference

The RLP namespace contains the managed code API to the Rosette Language Processor.


Classes

class  Context
 The RLP Context class, which analyzes data using RLP processors.
class  Environment
 The RLP Environment class.
class  Error
 An exception type thrown by RLP methods.
class  IntegerVectorValue
 The return type for the RAW_TEXT result when the safeDataOnly flag is set is a boxed array of Chars.
class  LanguageNameUtils
 Some utility functions used for working with language identifiers.
class  NamedEntityData
 The context function GetNEREsults returns an array of NamedEntityData elements.
class  NamedEntityUtil
 Utilities for handling named entities.
class  SafeIntegerVectorValue
 The return type for all the result types that include groups of integers when the caller requests the use of only safe data.
class  SingletonStringValue
 The return type for the RAW_TEXT result, unless the 'safeDataOnly' flag is set.

Enumerations

enum  LanguageID {
  UNKNOWN = 0, ABKHAZIAN = 1, AFAN = 2, AFAR = 3,
  AFRIKAANS = 4, ALBANIAN = 5, AMHARIC = 6, ARABIC = 7,
  ARMENIAN = 8, ASSAMESE = 9, AYMARA = 10, AZERBAIJANI = 11,
  BASHKIR = 12, BASQUE = 13, BENGALI = 14, BHUTANI = 15,
  BIHARI = 16, BISLAMA = 17, BRETON = 18, BULGARIAN = 19,
  BURMESE = 20, BYELORUSSIAN = 21, CAMBODIAN = 22, CATALAN = 23,
  CHINESE = 24, CORSICAN = 25, CROATIAN = 26, CZECH = 27,
  DANISH = 28, DUTCH = 29, ENGLISH = 30, ESPERANTO = 31,
  ESTONIAN = 32, FAROESE = 33, FIJI = 34, FINNISH = 35,
  FRENCH = 36, FRISIAN = 37, GALICIAN = 38, GEORGIAN = 39,
  GERMAN = 40, GREEK = 41, GREENLANDIC = 42, GUARANI = 43,
  GUJARATI = 44, HAUSA = 45, HEBREW = 46, HINDI = 47,
  HUNGARIAN = 48, ICELANDIC = 49, INDONESIAN = 50, INTERLINGUA = 51,
  INTERLINGUE = 52, INUPIAK = 53, IRISH = 54, ITALIAN = 55,
  JAPANESE = 56, JAVANESE = 57, KANNADA = 58, KASHMIRI = 59,
  KAZAKH = 60, KINYARWANDA = 61, KIRGHIZ = 62, KOREAN = 63,
  KURDISH = 64, KURUNDI = 65, LAOTIAN = 66, LATIN = 67,
  LATVIAN = 68, LINGALA = 69, LITHUANIAN = 70, MACEDONIAN = 71,
  MALAGASY = 72, MALAY = 73, MALAYALAM = 74, MALTESE = 75,
  MAORI = 76, MARATHI = 77, MOLDAVIAN = 78, MONGOLIAN = 79,
  NAURU = 80, NEPALI = 81, NORWEGIAN = 82, OCCITAN = 83,
  ORIYA = 84, PASHTO = 85, PERSIAN = 86, POLISH = 87,
  PORTUGUESE = 88, PUNJABI = 89, QUECHUA = 90, RHAETOROMANCE = 91,
  ROMANIAN = 92, RUSSIAN = 93, SAMOAN = 94, SANGHO = 95,
  SANSKRIT = 96, SCOTSGAELIC = 97, SERBIAN = 98, SERBOCROATIAN = 99,
  SESOTHO = 100, SETSWANA = 101, SHONA = 102, SIMPLIFIEDCHINESE = 103,
  SINDHI = 104, SINGHALESE = 105, SISWATI = 106, SLOVAK = 107,
  SLOVENIAN = 108, SOMALI = 109, SPANISH = 110, SUNDANESE = 111,
  SWAHILI = 112, SWEDISH = 113, TAGALOG = 114, TAJIK = 115,
  TAMIL = 116, TATAR = 117, TELUGU = 118, THAI = 119,
  TIBETAN = 120, TIGRINYA = 121, TONGA = 122, TRADITIONALCHINESE = 123,
  TSONGA = 124, TURKISH = 125, TURKMEN = 126, TWI = 127,
  UKRAINIAN = 128, URDU = 129, UZBEK = 130, VIETNAMESE = 131,
  VOLAPUK = 132, WELSH = 133, WOLOF = 134, XHOSA = 135,
  YIDDISH = 136, YORUBA = 137, ZULU = 138, ENGLISH_UC = 139
}
 The enumeration of languages.
enum  ResultType {
  TOKEN = 0, PART_OF_SPEECH = 1, SENTENCE_BOUNDARY = 2, BASE_NOUN_PHRASE = 3,
  DETECTED_LANGUAGE = 4, DETECTED_ENCODING = 5, NAMED_ENTITY = 6, TOKEN_OFFSET = 7,
  STEM = 8, NORMALIZED_TOKEN = 9, COMPOUND = 10, READING = 11,
  SCRIPT_REGION = 12, LANGUAGE_REGION = 13, TEXT_BOUNDARIES = 14, RAW_TEXT = 15,
  STOPWORD = 16, TRANSCRIBED_TEXT = 17, TOKEN_VARIATIONS = 18, GAZETTEER_NAMES = 19,
  ROOTS = 20, FLAGS = 21, HTML_TAGS = 22, HTML_DOC = 23,
  RAW_DATA = 24, PATHNAME = 25, MIME_TYPE = 26, HTML_TABLES = 27,
  MAP_OFFSETS = 28, DETECTED_SCRIPT = 29, TOKEN_SOURCE_ID = 30, TOKEN_SOURCE_NAME = 31,
  LEMMAS = 32, ALT_NORMALIZED_TOKEN = 33, ALT_LEMMAS = 34, ALT_ROOTS = 35,
  ALT_STEMS = 36, ALT_PARTS_OF_SPEECH = 37, MANY_TO_ONE_NORMALIZED_TOKEN = 38, TOKEN_PREFIX_STEM_LENGTHS = 998
}
 The enumeration of the types of data that may be retrieved via Context.GetResultData after a Context has processed input data.

Functions

public delegate void LogCallback (int channel, String^ message)
 To take control of RLP logging, use this delegate to supply a sink for strings.


Detailed Description

RLP applications begin by creating an Environment object and initializing it. To process data, the application obtains a Context from the Environment. Each thread must have its own Context.
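
The threading rule above can be illustrated with a small standard C++ sketch. `DemoEnvironment`, `DemoContext`, `makeContext`, `process`, and `processAll` are hypothetical placeholders invented for this illustration; they are not the RLP classes or their methods, whose actual signatures are documented on the Environment and Context pages. The sketch only shows the shape of the pattern: one shared, initialized environment and a separate context created inside each worker thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins: these are NOT the RLP classes, only placeholders
// that mimic the "one Environment, one Context per thread" relationship.
struct DemoContext {
    // Pretend analysis: returns the number of characters in the input.
    std::size_t process(const std::string& text) const { return text.size(); }
};

struct DemoEnvironment {
    bool initialized = false;
    void initialize() { initialized = true; }
    // Each call hands out a fresh context owned by the caller.
    DemoContext makeContext() const {
        assert(initialized && "initialize the environment first");
        return DemoContext{};
    }
};

// Processes each document on its own thread. Every thread creates and uses
// its own context, while the environment is shared read-only.
std::vector<std::size_t> processAll(const DemoEnvironment& env,
                                    const std::vector<std::string>& docs) {
    std::vector<std::size_t> results(docs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        workers.emplace_back([&env, &docs, &results, i] {
            DemoContext ctx = env.makeContext();  // context private to this thread
            results[i] = ctx.process(docs[i]);    // each thread writes its own slot
        });
    }
    for (std::thread& w : workers) w.join();
    return results;
}
```

A caller would create and initialize one `DemoEnvironment`, then pass it with a list of documents to `processAll`. No context object is ever shared between threads.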

Enumeration Type Documentation

enum BasisTechnology::RLP::LanguageID

Enumerator:
UNKNOWN 
ABKHAZIAN 
AFAN 
AFAR 
AFRIKAANS 
ALBANIAN 
AMHARIC 
ARABIC 
ARMENIAN 
ASSAMESE 
AYMARA 
AZERBAIJANI 
BASHKIR 
BASQUE 
BENGALI 
BHUTANI 
BIHARI 
BISLAMA 
BRETON 
BULGARIAN 
BURMESE 
BYELORUSSIAN 
CAMBODIAN 
CATALAN 
CHINESE 
CORSICAN 
CROATIAN 
CZECH 
DANISH 
DUTCH 
ENGLISH 
ESPERANTO 
ESTONIAN 
FAROESE 
FIJI 
FINNISH 
FRENCH 
FRISIAN 
GALICIAN 
GEORGIAN 
GERMAN 
GREEK 
GREENLANDIC 
GUARANI 
GUJARATI 
HAUSA 
HEBREW 
HINDI 
HUNGARIAN 
ICELANDIC 
INDONESIAN 
INTERLINGUA 
INTERLINGUE 
INUPIAK 
IRISH 
ITALIAN 
JAPANESE 
JAVANESE 
KANNADA 
KASHMIRI 
KAZAKH 
KINYARWANDA 
KIRGHIZ 
KOREAN 
KURDISH 
KURUNDI 
LAOTIAN 
LATIN 
LATVIAN 
LINGALA 
LITHUANIAN 
MACEDONIAN 
MALAGASY 
MALAY 
MALAYALAM 
MALTESE 
MAORI 
MARATHI 
MOLDAVIAN 
MONGOLIAN 
NAURU 
NEPALI 
NORWEGIAN 
OCCITAN 
ORIYA 
PASHTO 
PERSIAN 
POLISH 
PORTUGUESE 
PUNJABI 
QUECHUA 
RHAETOROMANCE 
ROMANIAN 
RUSSIAN 
SAMOAN 
SANGHO 
SANSKRIT 
SCOTSGAELIC 
SERBIAN 
SERBOCROATIAN 
SESOTHO 
SETSWANA 
SHONA 
SIMPLIFIEDCHINESE 
SINDHI 
SINGHALESE 
SISWATI 
SLOVAK 
SLOVENIAN 
SOMALI 
SPANISH 
SUNDANESE 
SWAHILI 
SWEDISH 
TAGALOG 
TAJIK 
TAMIL 
TATAR 
TELUGU 
THAI 
TIBETAN 
TIGRINYA 
TONGA 
TRADITIONALCHINESE 
TSONGA 
TURKISH 
TURKMEN 
TWI 
UKRAINIAN 
URDU 
UZBEK 
VIETNAMESE 
VOLAPUK 
WELSH 
WOLOF 
XHOSA 
YIDDISH 
YORUBA 
ZULU 
ENGLISH_UC 
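
Because each enumerator has a fixed integer value (see the enumeration summary above), an integer such as the DETECTED_LANGUAGE result can be converted to a LanguageID by a cast. The following is a minimal standard C++ sketch of that conversion. It declares a local excerpt of the enumeration with values copied from this page; the helper names `toLanguageID` and `displayName`, and the choice to map unlisted values to UNKNOWN, are assumptions made for the illustration.

```cpp
#include <string>

// Local excerpt of the LanguageID values listed in this reference.
// The real enumeration has many more members.
enum class LanguageID : int {
    UNKNOWN = 0,
    ARABIC = 7,
    ENGLISH = 30,
    FRENCH = 36,
    JAPANESE = 56
};

// Converts a raw integer (for example a DETECTED_LANGUAGE value) to
// LanguageID. Values not in this excerpt fall back to UNKNOWN; that
// fallback is a choice made for this sketch.
LanguageID toLanguageID(int raw) {
    const LanguageID candidate = static_cast<LanguageID>(raw);
    switch (candidate) {
        case LanguageID::ARABIC:
        case LanguageID::ENGLISH:
        case LanguageID::FRENCH:
        case LanguageID::JAPANESE:
            return candidate;
        default:
            return LanguageID::UNKNOWN;
    }
}

// Returns a human-readable name for the languages in the excerpt.
std::string displayName(LanguageID id) {
    switch (id) {
        case LanguageID::ARABIC:   return "Arabic";
        case LanguageID::ENGLISH:  return "English";
        case LanguageID::FRENCH:   return "French";
        case LanguageID::JAPANESE: return "Japanese";
        default:                   return "Unknown";
    }
}
```

For instance, `displayName(toLanguageID(30))` yields `"English"`, matching `ENGLISH = 30` in the enumeration.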

enum BasisTechnology::RLP::ResultType

Enumerator:
TOKEN  The tokens.

PART_OF_SPEECH  The part-of-speech strings for the tokens.

SENTENCE_BOUNDARY  The indices of sentence-ending tokens.

BASE_NOUN_PHRASE  Start/end token indices for base noun phrases.

DETECTED_LANGUAGE  A single integer (castable to LanguageID) identifying the language of the text.

DETECTED_ENCODING  The MIME charset or MIME type of the original data.

NAMED_ENTITY  Start/end token indices and named entity types for named entities.

TOKEN_OFFSET  Start/end character indices for the tokens in the UTF-16 text.

STEM  The stemmed forms of the tokens.

NORMALIZED_TOKEN  Normalized forms of the tokens.

Some languages/processors provide normalized forms.

COMPOUND  Compound components of tokens that are compound words.

READING  For Asian languages, readings for the tokens.

SCRIPT_REGION  Regions determined by Unicode script properties.

LANGUAGE_REGION  Regions determined by intra-document language detection.

TEXT_BOUNDARIES  Boundaries detected by the Unicode text boundary algorithm.

RAW_TEXT  The original text, converted to UTF-16.

STOPWORD  Integer indices of tokens identified as stopwords.

TRANSCRIBED_TEXT  Raw text transcribed.

TOKEN_VARIATIONS  Tokens expanded as variations, such as alternative spellings.

GAZETTEER_NAMES  Names discovered via gazetteer.

ROOTS  Roots (for Semitic languages).

FLAGS  Not used.

HTML_TAGS  Data for tags stripped from HTML.

For internal use.

HTML_DOC  HTML document.

For internal use.

RAW_DATA  Raw data at the byte level.

Not necessarily Unicode. For internal use.

PATHNAME  Pathname of the file the RAW_DATA came from, if ProcessFile was called.

This is used by some of the input processors to help determine MIME types. For internal use.

MIME_TYPE  MIME type of original data.

HTML_TABLES  HTML tables data.

For internal use.

MAP_OFFSETS  Map offsets of transformed text to original text.

DETECTED_SCRIPT  The script of the data.

TOKEN_SOURCE_ID  The source ID of the token.

TOKEN_SOURCE_NAME  The source name of the token.

LEMMAS  Lemmas (for Semitic languages).

ALT_NORMALIZED_TOKEN  Alternative normalized tokens (for Semitic languages).

ALT_LEMMAS  Alternative lemmas (for Semitic languages).

ALT_ROOTS  Alternative roots (for Semitic languages).

ALT_STEMS  Alternative stems (for Semitic languages).

ALT_PARTS_OF_SPEECH  Alternative parts of speech (for Semitic languages).

MANY_TO_ONE_NORMALIZED_TOKEN  Many-to-one normalized forms of the tokens.

TOKEN_PREFIX_STEM_LENGTHS  Token prefix and stem lengths.

This result consists of two numbers: the length of the prefix(es) and the length of the stem. This is useful in Arabic, and of unknown relevance to any other language.
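
As an illustration of how index-based results relate to the text, the TOKEN_OFFSET entry above describes start/end character indices into the UTF-16 text that RAW_TEXT provides. The standard C++ sketch below recovers token strings from such offsets. The flat pair layout, the exclusive end index, and the helper name `sliceTokens` are assumptions made for this example; this page does not specify how the managed API lays out the offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed layout for this sketch only: offsets holds start/end pairs back
// to back (start0, end0, start1, end1, ...), counted in UTF-16 code units,
// with each end index exclusive.
std::vector<std::u16string> sliceTokens(const std::u16string& rawText,
                                        const std::vector<int>& offsets) {
    assert(offsets.size() % 2 == 0 && "offsets must come in start/end pairs");
    std::vector<std::u16string> tokens;
    for (std::size_t i = 0; i + 1 < offsets.size(); i += 2) {
        const int start = offsets[i];
        const int end = offsets[i + 1];
        // Skip pairs that do not describe a valid range inside the text.
        if (start < 0 || end < start ||
            static_cast<std::size_t>(end) > rawText.size()) {
            continue;
        }
        tokens.push_back(rawText.substr(static_cast<std::size_t>(start),
                                        static_cast<std::size_t>(end - start)));
    }
    return tokens;
}
```

With the text `u"Hello world"` and offsets `{0, 5, 6, 11}`, the function returns the two tokens `u"Hello"` and `u"world"`.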


Function Documentation

public delegate void BasisTechnology::RLP::LogCallback (int channel, String^ message)
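
The delegate receives an integer channel and a message string. A sink of that shape can be sketched in standard C++ as follows, with `std::function` standing in for the managed delegate. `LogSink` and `MemoryLog` are hypothetical names for this illustration, and the "[channel] message" line format is arbitrary. How a callback is registered with RLP is not shown on this page, so the sketch covers only the sink itself.

```cpp
#include <functional>
#include <string>
#include <vector>

// Standard C++ stand-in for the managed LogCallback delegate: same shape
// (int channel, string message), but not the RLP type itself.
using LogSink = std::function<void(int, const std::string&)>;

// Collects log lines in memory so the application decides where they go.
class MemoryLog {
public:
    // Returns a sink that appends "[channel] message" to this log.
    // The MemoryLog object must outlive the returned sink.
    LogSink sink() {
        return [this](int channel, const std::string& message) {
            lines_.push_back("[" + std::to_string(channel) + "] " + message);
        };
    }

    const std::vector<std::string>& lines() const { return lines_; }

private:
    std::vector<std::string> lines_;
};
```

Calling the sink with `(2, "loaded dictionary")` appends the line `[2] loaded dictionary` to the log. A real application might instead forward each message to a file or to its own logging framework.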


Copyright © 2004-2008 Basis Technology Corporation. All Rights Reserved.