% texttool-concept/texttools.tex, CVS revision 1.1.1.1 (vendor branch), Mon Sep 15 08:13:25 2003 UTC, committed by dwinter

\documentclass[a4paper]{article}

\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
\usepackage{url}
\usepackage{graphicx}

\graphicspath{{graphics/}}

\title{Draft: Proposal for a text tool architecture for ECHO}
\author{Robert Casties}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}
In the context of ECHO, ``text'' represents scholarly metadata as well
as full texts of sources. As such, text forms the glue between the
different objects in the ECHO corpus. To fully exploit the potential
of text for semantic access and interlinking, tools have to support
the automatic or manual generation of links between different objects
within the ECHO corpus.

A viewing environment should present configurable views on all texts
that allow users to exploit relations to other texts and objects.

Four different fields can be identified for which tools have to be
developed:
\begin{itemize}
\item the generation of XML structures
\item the analysis of the corpora
\item the meaningful linking of texts
\item the generation of scholarly metadata.
\end{itemize}


\section{Requirements}
\label{sec:requirements}
The handling of large corpora makes it necessary to define a minimal
standard XML structure for these documents. This implies the
development of tools to convert existing document formats into
these standard formats. In addition, tools for editing documents in
these formats will have to be made available.

A prerequisite for generating links between documents is the
ability to analyse texts and to add the results of this analysis
to the document.
In general, two types of analysis can be distinguished: automatic
analysis following defined rules, and manual analysis in which words
are marked depending on their context.

The analysis of corpora is the basis for automatically generated
linking of documents. For example, the word lists generated by
morphological analysis can serve as a starting point for linking to a
dictionary or a grammar. Another example would be the use of a
word list of technical terms as a basis for linking to
an encyclopedia or glossary. Furthermore, such word lists can serve as
starting points for cross-linking within the text corpus, using the
word lists as a common anchor.

Beyond the automatically generated linking of documents, linking
that results from scholarly work has to be supported by text tools,
e.g. showing connections between different texts in the corpus,
combining sources with translations or secondary texts, linking
images to descriptive texts, or connecting full texts to images.

In particular, an open environment for adding comments and notes to
sources can be a test bed for how collaborative work on sources could
be encouraged by the ECHO project in order to build a virtual European
research area on cultural heritage.


\section{Technical issues}
\label{sec:formats}


\subsection{Granularity of reference}
\label{sec:gran-refer}

The basic layer of informational markup has to define the units of
reference for the higher layers. The granularity of these reference
units determines the amount of complexity needed for referencing in
the higher layers. The markup in the basic layer should also permit
changes in formatting and corrections in the source document to a
certain extent without losing referential integrity.

The proposed unit of reference in the basic layer is a \emph{word},
where \emph{word} means any sequence of characters between whitespace
or other special characters in the source document, excluding
formatting and markup. The word as a unit of reference is not meant to
be a semantic unit or even a morphological unit. It is only meant to
be the smallest easily recognizable unit used in the text.
Morphological, syntactic and semantic units can be assembled and
referenced on higher-level layers as \emph{terms} composed of one or
more, not necessarily adjacent, \emph{words}.


\subsection{Information layers}
\label{sec:information-layers}

The text tools operate according to the ``standoff
principle'' of XML markup. The basic text is marked up only to provide
the basis of raw data and reference. Additional syntactic and
semantic information -- be it automatically generated or scholarly
edited -- is provided in separate informational layers of
\emph{word lists} referencing other layers or the basic text.

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{word-termlists}
  \caption{Relation of basic text and term lists.}
  \label{fig:word-termlists}
\end{figure}

A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
  \emph{term list} will be used interchangeably in the following text
  since both forms should be functionally identical.} is a list of
\emph{words} or \emph{terms} that are each linked to a list of
references to \emph{words} or \emph{terms} in other \emph{word lists}
or to \emph{words} in basic texts.
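
As a rough sketch of how one list can reference another list instead
of the basic text, a second-level list entry might look as follows.
The element names follow the pseudo XML used for the proposed formats
in section~\ref{sec:proposed-formats}; the list identifier
\texttt{bello\_gallico\_words} and the entry numbers are hypothetical
and serve only as an illustration.

\begin{verbatim}
<list id="2">
  <list-entry id="1">
    <term>patria</term>
    <!-- hypothetical reference to entry 1 of a basic word list,
         not to a word in the basic text -->
    <word-ref>xlink:bello_gallico_words#1</word-ref>
  </list-entry>
</list>
\end{verbatim}

Resolving such a reference takes two steps: first to the entry in the
referenced list, then from that entry to the words in the basic text.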

An example of the informational layers in an English or Latin
text\footnote{English and Latin serve as examples of languages where
  sufficient morphological analysis can be based on single words.}
would be:

\begin{enumerate}
\item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}

\item \emph{Basic word list} layer, an automatically generated list of all
  unique words and references to their occurrences in the basic text
  (\ref{item:1}).\label{item:2}

\item \emph{Morphological term list} layer, an automatically generated list
  of the morphologically normalized forms of all words and references
  to their occurrences in the basic word list (\ref{item:2}).\label{item:4}

\item Scholarly edited \emph{term list} layer, a manually edited list of
  semantic units like technical terms used in the document,
  referring to the basic text (\ref{item:1}).\label{item:5}
\end{enumerate}

Additional annotation layers referencing the basic text or any other
layer could be produced and stored in the same text repository or on any
other server. Therefore, it has to be possible to reference any layer in
a unique and stable way across the net.

In languages with more complex morphological units, the morphological
analysis layer can be based on an intermediate term layer that joins
basic words into morphological units.



\subsection{Primary and secondary source texts}
\label{sec:backr-orig-source}

The text tool system should be easily adaptable to different
workflows dealing with text in the ECHO domain. There are two
basic types of text sources with different degrees of integration into
the central ECHO text corpus.

%% FIXME!!

The \emph{primary source text} is maintained in the basic word-tagged
form on a text corpus server.
Updates and changes have to be worked
into the word-tagged text without breaking the referential integrity.

For a \emph{secondary source text}, the basic word-tagged text is not
the primary source. A mapping file has to be provided
that maps the words in the basic text to other referenceable units in
the primary source documents. Updates and changes in the primary
document may be followed by updates to the mapping file or the basic
text to maintain referential integrity.

The distinction between these types of sources mainly concerns the
text cruncher, which produces the basic tagged text and possibly a
mapping file, and the presentation tools, which produce views of or
references to the original source texts.



\subsection{Support of additional markup}
\label{sec:supp-addit-mark}

The basic text tagging format should be transparent to additional
markup in the source text to enable the easy integration of the text
tools into existing formats and tools. The use of XML namespaces can
provide such transparency.

The common viewing environment cannot be completely
agnostic to additional markup. It must be able to interpret a common
set of minimal visual markup. Visual elements to be considered are:

\begin{itemize}
\item paragraphs and/or line breaks

\item page breaks

\item page images (coupled to page breaks)

\item inline images
\end{itemize}

When presenting text parts to the user as results of a search request,
it would be useful to have a general mechanism to select larger units
around the referenced word. Additional semantic units suitable for
this kind of reference would be sentences. The mechanism could try to
select the surrounding sentence and then fall back to larger units
like a paragraph, a page or the whole text.
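
To illustrate the fallback from sentence to paragraph, the following
sketch nests word tags inside sentence and paragraph elements. The
elements \texttt{v:p} and \texttt{v:s}, the prefix \texttt{v} and the
namespace URI are assumptions made for this example only; they are not
part of any agreed format.

\begin{verbatim}
<text lang="lat" xmlns:v="http://example.org/visual-markup">
  <v:p>    <!-- hypothetical paragraph element -->
    <v:s>  <!-- hypothetical sentence element -->
      <word id="1">omnia</word>
      <word id="2">gallia</word>
      <word id="3">est</word>
      <word id="4">divisa</word>
      <word id="5">in</word>
      <word id="6">partes</word>
      <word id="7">tres</word>.
    </v:s>
  </v:p>
</text>
\end{verbatim}

In this sketch a search hit on word 3 would be presented within its
enclosing \texttt{v:s} element. If no sentence element encloses the
word, the enclosing \texttt{v:p} element would be used instead.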

A translation scheme to map different existing visual markup tags into
the common set for the viewing environment should be implemented. The
translation could be done directly upon creation of secondary source
texts, as these texts are decoupled from the original source text.
The translation would have to be done on the fly for primary source
texts where markup different from the common set is used.


\section{Tools}
\label{sec:tools}


\subsection{Text cruncher}
\label{sec:text-cruncher}

The \emph{text cruncher} tool takes a text file and any available
information about a primary source and produces a \emph{basic word
  tagged text}, a \emph{basic word list} and, if the text is to be
considered a secondary source text, a \emph{mapping file}.


\subsection{Morphological analyzer}
\label{sec:morph-analys}

The \emph{morphological analyzer} tool for a given language takes a
word list or a term list of morphological units and
produces a \emph{morphological term list} of normalized forms, their
morphological description, and references to their occurrences in the
provided list.

A subfunction of the morphological analyzer should be a normalizer for
single words to be used in conjunction with the dictionary tool.


\subsection{Dictionary}
\label{sec:dictionary}

The \emph{dictionary analyzer} tool takes a morphologically normalized
term list and produces a term list with known terms,
references to their definitions and references to their occurrences in
the provided list.

A subfunction of the dictionary analyzer should be a lookup tool for
single normalized words or terms.
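
As a sketch only, the term list produced by the dictionary analyzer
described above could reuse the list structure of the proposed formats
in section~\ref{sec:proposed-formats}. The \texttt{def-ref} element
and the identifiers \texttt{latin\_dictionary} and
\texttt{bello\_gallico\_morph} are hypothetical names introduced for
this illustration.

\begin{verbatim}
<list id="3">
  <list-entry id="1">
    <term>patria</term>
    <!-- def-ref is a hypothetical element pointing to a definition -->
    <def-ref>xlink:latin_dictionary#patria</def-ref>
    <!-- reference into the provided morphological term list -->
    <word-ref>xlink:bello_gallico_morph#1</word-ref>
  </list-entry>
</list>
\end{verbatim}

Terms that are not found in the dictionary would simply have no entry
in such a list.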


\subsection{Cross referencer}
\label{sec:cross-referencer}

The \emph{cross referencer} tool takes a word list from one text
and a set of word lists from other texts and
produces a word list with words from the first list and
references into all of the lists.


\subsection{Display environment}
\label{sec:display-environment}

The \emph{display environment} should be able to display a text with
minimal visual markup and additional links defined by additional
word lists.

The set of necessary visual markup like page breaks, page images,
inline images or text formatting should follow an agreed standard.

The functionality provided by the links could be direct linking into
other texts, morphological analyses, or dictionary entries if the word
is only referenced by one word list. In the case of multiple
references to a word, a mechanism for selecting one of the
possible sources must be provided.


\subsection{List inverter}
\label{sec:list-inverter}

The \emph{list inverter} is a small auxiliary tool that takes a
normal word list that is ordered by unique words and produces an
\emph{inverted word list} that is ordered by word references.




\section{Use cases}
\label{sec:use-cases}


\subsection{Integration of Archimedes XML texts}
\label{sec:integr-arch-xml}

The XML texts of the Archimedes project could be integrated in two
different ways: either as primary source texts, adding basic word
tagging to the Archimedes markup, or as secondary source texts, by
providing mapping files to the unchanged source files.

In the first case, basic word tagging would be added to the XML
document by the text cruncher. The resulting documents could then be
further processed and edited, provided that word references are not
broken.
The text cruncher would produce a basic word list for use with
other text tools.

In the second case, only a secondary source text and a mapping file
would be produced by the text cruncher together with the basic word
list. The original source text would stay unchanged outside the text
repository.

Additional mappings would have to be generated to adapt the visual
markup used in the Archimedes XML to the common markup for the display
environment.



\subsection{Integration of existing webpages}
\label{sec:integr-exist-webp}



\subsection{Integration of raw OCR text}
\label{sec:integration-raw-ocr}

Raw OCR text, as generated by automatic OCR on digitized document
pages, could be considered original source material. The OCR produces
one plain text document per scanned image file. A suitable text
cruncher would produce a secondary source text for use in the
repository with a mapping file referencing the original text files.



\subsection{Full text search}
\label{sec:full-text-search}

(to be done)


\subsection{Cross linking of texts}
\label{sec:cross-linking-texts}

(to be done)


\section{Proposed formats}
\label{sec:proposed-formats}


\subsection{Basic document}
\label{sec:basic-docum-form}

The basic document format consists of word tags and, optionally,
language information for morphological analysis and basic visual markup.

An example in pseudo XML markup might look like this:

\begin{verbatim}
<text lang="lat">
  <word id="1">omnia</word>
  <word id="2">gallia</word>
  <word id="3">est</word>
  <word id="4">divisa</word>
  <word id="5">in</word>
  <word id="6">partes</word>
  <word id="7">tres</word>.
</text>
\end{verbatim}



\subsection{Basic wordlist}
\label{sec:wordlist}

The basic word list consists of all unique words and references to
their occurrences in the basic text.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <word>patria</word>
    <word-ref>xlink:bello_gallico#36</word-ref>
    <word-ref>xlink:bello_gallico#157</word-ref>
    <word-ref>xlink:bello_gallico#336</word-ref>
  </list-entry>
  <list-entry id="2">
    <word>bello</word>
    <word-ref>xlink:bello_gallico#189</word-ref>
    <word-ref>xlink:bello_gallico#236</word-ref>
    <word-ref>xlink:bello_gallico#557</word-ref>
    <word-ref>xlink:bello_gallico#1396</word-ref>
    <word-ref>xlink:bello_gallico#1450</word-ref>
  </list-entry>
</list>
\end{verbatim}


\subsection{Term list}
\label{sec:term-list}

A term groups one or more words into a semantic unit. A term list
contains chosen terms and references to their occurrences.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <term>patria nostra</term>
    <term-ref>
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word-ref>xlink:bello_gallico#37</word-ref>
    </term-ref>
    <word-ref>xlink:bello_gallico#36</word-ref>
    <term-ref>
      <word-ref>xlink:bello_gallico#155</word-ref>
      <word-ref>xlink:bello_gallico#157</word-ref>
    </term-ref>
  </list-entry>
  <list-entry id="2">
    <term>bello gallico</term>
    <term-ref>
      <word-ref>xlink:bello_gallico#12</word-ref>
      <word-ref>xlink:bello_gallico#13</word-ref>
    </term-ref>
  </list-entry>
</list>
\end{verbatim}


\subsection{Primary source mapping}
\label{sec:prim-source-mapp}

A primary source mapping maps every word of a basic document to its
equivalent in the primary source document.

\begin{verbatim}
<source-mapping>
  <map id="1">
    <word-ref>xlink:bello_gallico#1</word-ref>
    <ref>xlink:bello.txt(1235)</ref>
  </map>
  <map id="2">
    <word-ref>xlink:bello_gallico#2</word-ref>
    <ref>xlink:bello.txt(1245)</ref>
  </map>
  <map id="3">
    <word-ref>xlink:bello_gallico#3</word-ref>
    <ref>xlink:bello.txt(1257)</ref>
  </map>
</source-mapping>
\end{verbatim}



\section{Development priorities and time plan}
\label{sec:devel-prior-time}

(to be done)

\section{Glossary}
\label{sec:glossary}

\begin{description}
\item[word] In a basic text a word is any sequence of characters
  between whitespace or other delimiters. A word on this
  level is not a semantic unit, nor even a syntactic one.

\item[term] A term is a container for one or more not necessarily
  adjacent words. Terms can be syntactic or semantic units. Terms
  can be used and referenced like basic words.

\item[word reference] A word reference is an xlink or similar
  reference to a word or term in a word list or in a basic text.

\item[term reference] A term reference is a reference to a term and
  equivalent to a word reference.

\item[word list] A word list is a list containing elements consisting
  of a word and a list of word references.

\item[term list] A term list is equivalent to a word list. Its
  elements consist of a term and a list of word references.

\item[word occurrence list] A word occurrence list is a word list in
  which every element is treated as a type together with a list of all
  its instances -- occurrences -- in the text. The same word (type) can
  occur only once in an occurrence list, where it can reference many
  word instances.

\item[word instance list] A word instance list is a word list where
  every element is treated like a singular object (unlike a word
  occurrence list). The same word (type) can occur multiple times in an
  instance list where it can reference only one word or term instance.

\end{description}


\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: