118 | | * Url: /mpiwg-mpdl-lt-web/lt/Tokenize |
119 | | * Request parameters: |
120 | | * srcUrl |
121 | | * source URL of fulltext |
122 | | * unstructured text |
123 | | * XML fragment/document |
124 | | * language (if the XML document specifies xml:lang, that value is used; otherwise this value) |
125 | | * ISO 639-3 specifier |
126 | | * normalization (optional; default: without normalization) |
127 | | * "reg", "norm", "reg norm" |
128 | | * dictionary (optional; default: with dictionary) |
129 | | * "yes", "no" |
130 | | * stopElements (optional, default: empty) |
131 | | * elements which should not be analyzed and enriched (e.g. "lb") |
132 | | * outputFormat (optional; default: "xml") |
133 | | * "xml", "string" |
134 | | * Response output: |
135 | | * xml |
136 | | * document enriched by element <w> |
137 | | * e.g. <s><w lang="deu" reg="dies" norm="dies" forms="dies, dieser, dieses, diesen" lemmas="dieser" dictionaries="dwds">Dies</w> <w lang="deu" reg="ist" norm="ist" forms="bin, bist, ist, seid, sind, sein, war, warst, wart" lemmas="sein" dictionaries="dwds">ist</w> <w lang="deu" reg="ein" norm="ein" forms="ein, eines, einer" lemmas="ein" dictionaries="dwds">ein</w> <w lang="deu" reg="satz" norm="satz" forms="satz, sätze, satzes" lemmas="satz" dictionaries="dwds">Satz</w></s> |
138 | | * wordList |
139 | | * word tokens (separated by spaces) |
140 | | * word tokens (XML) |
141 | | |
| 207 | * Url: /mpiwg-mpdl-lt-web/text/Tokenize |
| 208 | * Request parameters: |
| 209 | * inputString or srcUrl (required) |
| 210 | * inputString |
| 211 | * string which should be tokenized |
| 212 | * unstructured text |
| 213 | * XML fragment/document |
| 214 | * srcUrl |
| 215 | * source URL |
| 216 | * unstructured text |
| 217 | * XML fragment/document |
| 218 | * language (optional) |
| 219 | * ISO 639-3 specifier |
| 220 | * if the input is XML and an element carries the attribute "xml:lang", that value is used for this element |
| 221 | * default: "eng" |
| 222 | * normalization (optional) |
| 223 | * "none" |
| 224 | * "norm" |
| 225 | * default: "norm" |
| 226 | * dictionary (optional) |
| 227 | * "yes" |
| 228 | * "no" |
| 229 | * default: "yes" |
| 230 | * stopElements (optional) |
| 231 | * list of XML element names that should not be tokenized (e.g. "lb pb") |
| 232 | * default: empty list |
| 233 | * outputFormat (optional) |
| 234 | * "xml" |
| 235 | * "string" |
| 236 | * default: "xml" |
| 237 | * Response output: |
| 238 | * outputFormat=xml |
| 239 | * tokenized inputString or document (enriched by element <w>) |
| 240 | * e.g. <s><w lang="deu" reg="dies" norm="dies" forms="dies, dieser, dieses, diesen" lemmas="dieser" dictionaries="dwds">Dies</w> <w lang="deu" reg="ist" norm="ist" forms="bin, bist, ist, seid, sind, sein, war, warst, wart" lemmas="sein" dictionaries="dwds">ist</w> <w lang="deu" reg="ein" norm="ein" forms="ein, eines, einer" lemmas="ein" dictionaries="dwds">ein</w> <w lang="deu" reg="satz" norm="satz" forms="satz, sätze, satzes" lemmas="satz" dictionaries="dwds">Satz</w></s> |
| 241 | * outputFormat=string |
| 242 | * word tokens of inputString or document (separated by spaces) |
| 243 | |
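The request parameters of the newer /mpiwg-mpdl-lt-web/text/Tokenize endpoint can be illustrated with a small sketch that assembles a request URL. This is only an illustration, not the service's own client. The host `http://localhost:8080`, the helper name `build_tokenize_url`, the use of a GET query string, and the rule that exactly one of inputString or srcUrl is passed are all assumptions. Only the path, the parameter names, their allowed values and their defaults come from the list above.

```python
from urllib.parse import urlencode

# Hypothetical host: the documentation only gives the path.
BASE_URL = "http://localhost:8080"
TOKENIZE_PATH = "/mpiwg-mpdl-lt-web/text/Tokenize"


def build_tokenize_url(input_string=None, src_url=None, language="eng",
                       normalization="norm", dictionary="yes",
                       stop_elements=(), output_format="xml"):
    """Build a query URL using the documented parameter names and defaults."""
    # Assumption: exactly one of inputString / srcUrl is supplied.
    if (input_string is None) == (src_url is None):
        raise ValueError("provide exactly one of input_string or src_url")

    params = {}
    if input_string is not None:
        params["inputString"] = input_string
    else:
        params["srcUrl"] = src_url
    params["language"] = language            # ISO 639-3 specifier
    params["normalization"] = normalization  # "none" or "norm"
    params["dictionary"] = dictionary        # "yes" or "no"
    if stop_elements:
        # Element names joined by a space, as in the example "lb pb".
        params["stopElements"] = " ".join(stop_elements)
    params["outputFormat"] = output_format   # "xml" or "string"
    return BASE_URL + TOKENIZE_PATH + "?" + urlencode(params)


url = build_tokenize_url(input_string="Dies ist ein Satz", language="deu",
                         stop_elements=("lb", "pb"), output_format="string")
print(url)
```

With outputFormat=string the documented response is the word tokens separated by spaces. With the default outputFormat=xml it is the input enriched by `<w>` elements, as in the example above.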