Annotation of texttool-concept/texttools.tex, revision 1.1
1.1 ! dwinter 1: \documentclass[a4paper]{article}
! 2:
! 3: \usepackage[latin1]{inputenc}
! 4: \usepackage[T1]{fontenc}
! 5: \usepackage{ae}
! 6: \usepackage{url}
! 7: \usepackage{graphicx}
! 8:
! 9: \graphicspath{{graphics/}}
! 10:
! 11: \title{Draft: Proposal for a text tool architecture for ECHO}
! 12: \author{Robert Casties}
! 13: \date{\today}
! 14:
! 15: \begin{document}
! 16:
! 17: \maketitle
! 18:
! 19: \section{Introduction}
! 20: \label{sec:introduction}
! 21: In the context of ECHO ``text'' represents scholarly metadata as well
! 22: as full texts of sources. As such, text forms the glue between the
! 23: different objects in the ECHO corpus. To fully exploit the potential
! 24: of text for semantic access and interlinking, tools have to support
! 25: the automatic or manual generation of links between different objects
! 26: within the ECHO corpus.
! 27:
! 28: A viewing environment should present configurable views on all texts
! 29: that allow users to exploit relations to other texts and objects.
! 30:
! 31: Four different fields can be
! 32: identified, for which tools have to be developed:
! 33: \begin{itemize}
! 34: \item the generation of XML-structures
! 35: \item the analysis of the corpora
! 36: \item the meaningful linking of texts
! 37: \item the generation of scholarly metadata.
! 38: \end{itemize}
! 39:
! 40:
! 41: \section{Requirements}
! 42: \label{sec:requirements}
! 43: The handling of large corpora makes it necessary to define a minimal
! 44: standard XML structure for these documents. This implies the
! 45: development of tools to convert existing document formats into
! 46: these standard formats. In addition, tools for editing documents in
! 47: these formats will have to be made available.
! 48:
! 49: A prerequisite for generating links between documents is the
! 50: ability to analyse texts and to add the results of this analysis
! 51: to the document. In general, two different types of this analysis can
! 52: be distinguished: automatically generated analysis following defined
! 53: rules and the manual analysis by marking words depending on the
! 54: context.
! 55:
! 56: The analysis of corpora is the basis for automatically generated
! 57: linking of documents. For example, the word lists generated by
! 58: morphological analysis can serve as a starting point for linking to a
! 59: dictionary or a grammar. Another example would be the use of a
! 60: word list of technical terms as the basis for linking to
! 61: an encyclopedia or glossary. Furthermore, such word lists can serve as
! 62: starting points for cross-linking within the text corpus, using these
! 63: word lists as a common anchor.
! 64:
! 65: Beyond the automatically generated linking of documents, text tools
! 66: also have to support linking that results from scholarly work,
! 67: e.g. showing connections between different texts in the corpus,
! 68: combining sources with translations or secondary texts, linking
! 69: images to the texts that describe them, or
! 70: connecting full texts with images.
! 71:
! 72: In particular, an open environment for adding comments and notes to
! 73: sources can be a test bed for how collaborative work on sources could
! 74: be encouraged by the ECHO project in order to build a virtual European
! 75: research area on cultural heritage.
! 76:
! 77:
! 78: \section{Technical issues}
! 79: \label{sec:formats}
! 80:
! 81:
! 82: \subsection{Granularity of reference}
! 83: \label{sec:gran-refer}
! 84:
! 85: The basic layer of informational markup has to define the units of
! 86: reference for the higher layers. The granularity of these reference
! 87: units determines the amount of complexity needed for referencing in
! 88: the higher layers. The markup in the basic layer should also permit
! 89: changes in formatting and corrections in the source document to a
! 90: certain extent without losing referential integrity.
! 91:
! 92: The proposed unit of reference in the basic layer is a \emph{word},
! 93: where \emph{word} means any sequence of characters between whitespace
! 94: or other special characters in the source document, excluding
! 95: formatting and markup. The word as a unit of reference is not meant to
! 96: be a semantical unit or even a morphological unit. It is only meant to
! 97: be the smallest easily recognizable unit used in the text.
! 98: Morphological, syntactical and semantical units can be assembled and
! 99: referenced on higher level layers as \emph{terms} comprised of one or
! 100: more, not necessarily adjacent \emph{words}.
! 101:
! 102:
! 103: \subsection{Information layers}
! 104: \label{sec:information-layers}
! 105:
! 106: The text tools operate according to the ``standoff
! 107: principle'' of XML markup. The basic text is marked up only to provide
! 108: the basis of raw data and reference. Additional syntactical and
! 109: semantical information -- be it automatically generated or scholarly
! 110: edited -- is provided in separated informational layers of
! 111: \emph{word lists} referencing other layers or the basic text.
! 112:
! 113: \begin{figure}[htbp]
! 114: \centering
! 115: \includegraphics[width=0.8\textwidth]{word-termlists}
! 116: \caption{Relation of basic text and term lists. }
! 117: \label{fig:word-termlists}
! 118: \end{figure}
! 119:
! 120: A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
! 121: \emph{term list} will be used interchangeably in the following text
! 122: since both forms should be functionally identical.} is a list of
! 123: \emph{words} or \emph{terms} that are each linked to a list of
! 124: references to \emph{words} or \emph{terms} in other \emph{word lists}
! 125: or to \emph{words} in basic texts.
! 126:
! 127: An example of the informational layers in an English or Latin
! 128: text\footnote{English or Latin as examples for languages where
! 129: sufficient morphological analysis can be based on single words.}
! 130: would be:
! 131:
! 132: \begin{enumerate}
! 133: \item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}
! 134:
! 135: \item \emph{Basic word list} layer, an automatically generated list of all
! 136: unique words and references to their occurrence in the basic text
! 137: (\ref{item:1}).\label{item:2}
! 138:
! 139: \item \emph{Morphological term list} layer, an automatically generated list
! 140: of the morphologically normalized forms of all words and references
! 141: to their occurrence in the basic wordlist (\ref{item:2}).\label{item:4}
! 142:
! 143: \item Scholarly edited \emph{term list} layer, a manually edited list of
! 144: semantical units like technical terms used in the document,
! 145: referring to the basic text (\ref{item:1}).\label{item:5}
! 146: \end{enumerate}
! 147:
! 148: Additional annotation layers referencing the basic text or any other
! 149: layer could be produced and stored in the same text repository or on any
! 150: other server. Therefore it has to be possible to reference any layer in
! 151: a unique and stable way across the net.
! 152:
! 153: In languages with more complex morphological units the morphological
! 154: analysis layer can be based on an intermediate term layer that joins
! 155: basic words into morphological units.
! 156:
! 157:
! 158:
! 159: \subsection{Primary and secondary source texts}
! 160: \label{sec:backr-orig-source}
! 161:
! 162: The text tool system should be easily adaptable to different
! 163: workflows dealing with text in the ECHO domain. There are two
! 164: basic types of text sources with different degrees of integration in
! 165: the central ECHO text corpus.
! 166:
! 167: %% FIXME!!
! 168:
! 169: The \emph{primary source text} is maintained in the basic word tagged
! 170: form on a text corpus server. Updates and changes have to be worked
! 171: into the word tagged text without breaking the referential integrity.
! 172:
! 173: In the case of a \emph{secondary source text} the basic word tagged text is not
! 174: the primary source. A mapping file has to be provided,
! 175: that maps the words in the basic text to other referenceable units in
! 176: the primary source documents. Updates and changes in the primary
! 177: document may be followed by updates to the mapping file or the basic
! 178: text to maintain referential integrity.
! 179:
! 180: The distinction between these types of sources mainly concerns the
! 181: text cruncher, which produces the basic tagged text and possibly a mapping
! 182: file, and the presentation tools, which produce views of or references to the
! 183: original source texts.
! 184:
! 185:
! 186:
! 187: \subsection{Support of additional markup}
! 188: \label{sec:supp-addit-mark}
! 189:
! 190: The basic text tagging format should be transparent to additional
! 191: markup in the source text to enable the easy integration of the text
! 192: tools into existing formats and tools. The use of XML namespaces can
! 193: provide such transparency.
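
A minimal sketch of how namespaces could keep the word tagging
separate from existing markup is given below. The prefix \texttt{tt},
its namespace URI, and the surrounding \texttt{p} and \texttt{emph}
elements are assumptions made for illustration only; they are not part
of the proposed format.

\begin{verbatim}
<!-- hypothetical namespace URI and prefix -->
<p xmlns:tt="http://example.org/ns/texttools">
  <tt:word id="2">gallia</tt:word>
  <emph><tt:word id="3">est</tt:word></emph>
  <tt:word id="4">divisa</tt:word>
</p>
\end{verbatim}

A tool that only understands the source markup could ignore every
element in the \texttt{tt} namespace, while the text tools would read
only those elements.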
! 194:
! 195: The common viewing environment cannot be completely
! 196: agnostic to additional markup. It must be able to interpret a common
! 197: set of minimal visual markup. Visual elements to be considered are:
! 198:
! 199: \begin{itemize}
! 200: \item paragraphs and/or line breaks
! 201:
! 202: \item page breaks
! 203:
! 204: \item page images (coupled to page breaks)
! 205:
! 206: \item inline images
! 207: \end{itemize}
! 208:
! 209: When presenting text parts to the user as results of a search request
! 210: it would be useful to have a general mechanism to select larger units
! 211: around the referenced word. Additional semantical units suitable for
! 212: this kind of reference would be sentences. The mechanism could try to
! 213: select the surrounding sentence and then fall back to larger units
! 214: like a paragraph, a page or the whole text.
! 215:
! 216: A translation scheme to map different existing visual markup tags into
! 217: the common set for the viewing environment should be implemented. The
! 218: translation could be done directly upon creation of secondary source
! 219: texts as these texts are decoupled from the original source text.
! 220: The translation would have to be done on-the-fly for primary source
! 221: texts where markup different from the common set is used.
! 222:
! 223:
! 224: \section{Tools}
! 225: \label{sec:tools}
! 226:
! 227:
! 228: \subsection{Text cruncher}
! 229: \label{sec:text-cruncher}
! 230:
! 231: The \emph{text cruncher} tool takes a text file and, where available,
! 232: information about a primary source and produces a \emph{basic word
! 233: tagged text}, a \emph{basic word list}, and additionally a
! 234: \emph{mapping file} if the text is to be considered a secondary source
! 235: text.
! 236:
! 237:
! 238: \subsection{Morphological analyzer}
! 239: \label{sec:morph-analys}
! 240:
! 241: The \emph{morphological analyzer} tool for a given language takes a
! 242: word list or a term list of morphological units and
! 243: produces a \emph{morphological term list} of normalized forms, their
! 244: morphological description, and references to their occurrences in the
! 245: provided list.
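
As an illustration only, a morphological term list might take a shape
like the following sketch, which reuses the list structure of the
basic wordlist. The \texttt{morph} element, its content, and the
reference target \texttt{bello\_gallico\_words} (standing for a basic
word list) are hypothetical.

\begin{verbatim}
<list id="2">
  <list-entry id="1">
    <term>bellum</term>
    <!-- hypothetical element for the morphological description -->
    <morph>noun, neuter</morph>
    <!-- refers to the entry for "bello" in a basic word list -->
    <word-ref>xlink:bello_gallico_words#2</word-ref>
  </list-entry>
</list>
\end{verbatim}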
! 246:
! 247: A subfunction of the morphological analyzer should be a normalizer for
! 248: single words to be used in conjunction with the dictionary tool.
! 249:
! 250:
! 251: \subsection{Dictionary}
! 252: \label{sec:dictionary}
! 253:
! 254: The \emph{dictionary analyzer} tool takes a morphologically normalized
! 255: term list and produces a term list with known terms,
! 256: references to their definitions, and references to their occurrences in
! 257: the provided list.
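
A possible shape for the resulting term list is sketched below. The
\texttt{def-ref} element and the reference targets
\texttt{dictionary} and \texttt{bello\_gallico\_morph} are assumptions
introduced for this example; the proposal does not define them.

\begin{verbatim}
<list id="3">
  <list-entry id="1">
    <term>bellum</term>
    <!-- hypothetical reference to a dictionary definition -->
    <def-ref>xlink:dictionary#bellum</def-ref>
    <!-- refers to an entry of the morphological term list -->
    <word-ref>xlink:bello_gallico_morph#1</word-ref>
  </list-entry>
</list>
\end{verbatim}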
! 258:
! 259: A subfunction of the dictionary analyzer should be a lookup tool for
! 260: single normalized words or terms.
! 261:
! 262:
! 263: \subsection{Cross referencer}
! 264: \label{sec:cross-referencer}
! 265:
! 266: The \emph{cross referencer} tool takes a word list from one text
! 267: and a set of word lists from other texts and
! 268: produces a word list with words from the first list and
! 269: references into all of the lists.
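
The output of the cross referencer could look like the following
sketch. The list names \texttt{bello\_gallico\_words} and
\texttt{bello\_civili\_words} and the entry numbers are hypothetical
and serve only to show one word referring into two word lists.

\begin{verbatim}
<list id="4">
  <list-entry id="1">
    <word>bello</word>
    <!-- entry in the word list of the first text -->
    <word-ref>xlink:bello_gallico_words#2</word-ref>
    <!-- entry in the word list of another text (hypothetical) -->
    <word-ref>xlink:bello_civili_words#17</word-ref>
  </list-entry>
</list>
\end{verbatim}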
! 270:
! 271:
! 272: \subsection{Display environment}
! 273: \label{sec:display-environment}
! 274:
! 275: The \emph{display environment} should be able to display a text with
! 276: minimal visual markup and additional links defined by additional
! 277: wordlists.
! 278:
! 279: The set of necessary visual markup like page breaks, page images,
! 280: inline images or text formatting should follow an agreed standard.
! 281:
! 282: The functionality provided by the links could be direct linking into
! 283: other texts, morphological analyses, or dictionary entries if the word
! 284: is only referenced by one word list. In the case of multiple
! 285: references to a word a mechanism for the selection of one of the
! 286: possible sources must be provided.
! 287:
! 288:
! 289: \subsection{List inverter}
! 290: \label{sec:list-inverter}
! 291:
! 292: The \emph{list inverter} is a small auxiliary tool that takes a
! 293: normal word list that is ordered by unique words and produces an
! 294: \emph{inverted word list} that is ordered by word references.
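
For illustration, inverting the first references of a word list such
as the one in section~\ref{sec:wordlist} could give a list like the
sketch below. Reusing the \texttt{list-entry} structure with exactly
one reference per entry, and the list id shown, are assumptions; the
proposal does not fix a format for inverted lists.

\begin{verbatim}
<list id="1-inverted">
  <list-entry id="1">
    <word-ref>xlink:bello_gallico#36</word-ref>
    <word>patria</word>
  </list-entry>
  <list-entry id="2">
    <word-ref>xlink:bello_gallico#157</word-ref>
    <word>patria</word>
  </list-entry>
  <list-entry id="3">
    <word-ref>xlink:bello_gallico#189</word-ref>
    <word>bello</word>
  </list-entry>
</list>
\end{verbatim}

The same word may now appear in several entries, each pointing to a
single occurrence, which corresponds to the word instance list
described in the glossary.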
! 295:
! 296:
! 297:
! 298:
! 299: \section{Use cases}
! 300: \label{sec:use-cases}
! 301:
! 302:
! 303: \subsection{Integration of Archimedes XML texts}
! 304: \label{sec:integr-arch-xml}
! 305:
! 306: The XML texts of the Archimedes project could be integrated in two
! 307: different ways: either as primary source texts, adding basic word
! 308: tagging to the Archimedes markup or as secondary source texts by
! 309: providing mapping files to the unchanged source files.
! 310:
! 311: In the first case basic word tagging would be added to the XML
! 312: document by the text cruncher. The resulting documents could then be
! 313: further processed and edited, provided that word references are not
! 314: broken. The text cruncher would produce a basic word list for use with
! 315: other text tools.
! 316:
! 317: In the second case only a secondary source text and a mapping file
! 318: would be produced by the text cruncher together with the basic word
! 319: list. The original source text would stay unchanged outside the text
! 320: repository.
! 321:
! 322: Additional mappings would have to be generated to adapt the visual
! 323: markup used in the Archimedes XML to the common markup for the display
! 324: environment.
! 325:
! 326:
! 327:
! 328: \subsection{Integration of existing webpages}
! 329: \label{sec:integr-exist-webp}
! 330:
! 331:
! 332:
! 333: \subsection{Integration of raw OCR text}
! 334: \label{sec:integration-raw-ocr}
! 335:
! 336: Raw OCR text as it is generated by automatic OCR on digitized document
! 337: pages could be considered original source material. The OCR produces
! 338: one plain text document per scanned image file. A suitable text
! 339: cruncher would produce a secondary source text for use in the
! 340: repository with a mapping file referencing the original text files.
! 341:
! 342:
! 343:
! 344: \subsection{Full text search}
! 345: \label{sec:full-text-search}
! 346:
! 347: (to be done)
! 348:
! 349:
! 350: \subsection{Cross linking of texts}
! 351: \label{sec:cross-linking-texts}
! 352:
! 353: (to be done)
! 354:
! 355:
! 356: \section{Proposed formats}
! 357: \label{sec:proposed-formats}
! 358:
! 359:
! 360: \subsection{Basic document}
! 361: \label{sec:basic-docum-form}
! 362:
! 363: The basic document format consists of word tags, and optionally language information
! 364: for morphological analysis and basic visual markup.
! 365:
! 366: An example in pseudo XML markup might look like this:
! 367:
! 368: \begin{verbatim}
! 369: <text lang="lat">
! 370: <word id="1">omnis</word>
! 371: <word id="2">gallia</word>
! 372: <word id="3">est</word>
! 373: <word id="4">divisa</word>
! 374: <word id="5">in</word>
! 375: <word id="6">partes</word>
! 376: <word id="7">tres</word>.
! 377: </text>
! 378: \end{verbatim}
! 379:
! 380:
! 381:
! 382: \subsection{Basic wordlist}
! 383: \label{sec:wordlist}
! 384:
! 385: The basic wordlist consists of all unique words and references to
! 386: their occurrences in the basic text.
! 387:
! 388: \begin{verbatim}
! 389: <list id="1">
! 390: <list-entry id="1">
! 391: <word>patria</word>
! 392: <word-ref>xlink:bello_gallico#36</word-ref>
! 393: <word-ref>xlink:bello_gallico#157</word-ref>
! 394: <word-ref>xlink:bello_gallico#336</word-ref>
! 395: </list-entry>
! 396: <list-entry id="2">
! 397: <word>bello</word>
! 398: <word-ref>xlink:bello_gallico#189</word-ref>
! 399: <word-ref>xlink:bello_gallico#236</word-ref>
! 400: <word-ref>xlink:bello_gallico#557</word-ref>
! 401: <word-ref>xlink:bello_gallico#1396</word-ref>
! 402: <word-ref>xlink:bello_gallico#1450</word-ref>
! 403: </list-entry>
! 404: </list>
! 405: \end{verbatim}
! 406:
! 407:
! 408: \subsection{Term list}
! 409: \label{sec:term-list}
! 410:
! 411: A term groups one or more words into a semantical unit. A term list
! 412: contains chosen terms and references to their occurrences.
! 413:
! 414: \begin{verbatim}
! 415: <list id="1">
! 416: <list-entry id="1">
! 417: <term>patria nostra</term>
! 418: <term-ref>
! 419: <word-ref>xlink:bello_gallico#36</word-ref>
! 420: <word-ref>xlink:bello_gallico#37</word-ref>
! 421: </term-ref>
! 422: <word-ref>xlink:bello_gallico#36</word-ref>
! 423: <term-ref>
! 424: <word-ref>xlink:bello_gallico#155</word-ref>
! 425: <word-ref>xlink:bello_gallico#157</word-ref>
! 426: </term-ref>
! 427: </list-entry>
! 428: <list-entry id="2">
! 429: <term>bello gallico</term>
! 430: <term-ref>
! 431: <word-ref>xlink:bello_gallico#12</word-ref>
! 432: <word-ref>xlink:bello_gallico#13</word-ref>
! 433: </term-ref>
! 434: </list-entry>
! 435: </list>
! 436: \end{verbatim}
! 437:
! 438:
! 439: \subsection{Primary source mapping}
! 440: \label{sec:prim-source-mapp}
! 441:
! 442: A primary source mapping maps every word of a basic document to its
! 443: equivalent in the primary source document.
! 444:
! 445: \begin{verbatim}
! 446: <source-mapping>
! 447: <map id="1">
! 448: <word-ref>xlink:bello_gallico#1</word-ref>
! 449: <ref>xlink:bello.txt(1235)</ref>
! 450: </map>
! 451: <map id="2">
! 452: <word-ref>xlink:bello_gallico#2</word-ref>
! 453: <ref>xlink:bello.txt(1245)</ref>
! 454: </map>
! 455: <map id="3">
! 456: <word-ref>xlink:bello_gallico#3</word-ref>
! 457: <ref>xlink:bello.txt(1257)</ref>
! 458: </map>
! 459: </source-mapping>
! 460: \end{verbatim}
! 461:
! 462:
! 463:
! 464: \section{Development priorities and time plan}
! 465: \label{sec:devel-prior-time}
! 466:
! 467: (to be done)
! 468:
! 469: \section{Glossary}
! 470: \label{sec:glossary}
! 471:
! 472: \begin{description}
! 473: \item[word] In a basic text a word is any sequence of characters
! 474: delimited by whitespace or other delimiters. A word on this
! 475: level is not a semantical or even a syntactical unit.
! 476:
! 477: \item[term] A term is a container for one or more not necessarily
! 478: adjacent words. Terms can be syntactical or semantical units. Terms
! 479: can be used and referenced like basic words.
! 480:
! 481: \item[word reference] A word reference is an xlink or similar
! 482: reference to a word or term in a word list or in a basic text.
! 483:
! 484: \item[term reference] A term reference is a reference to a term and
! 485: equivalent to a word reference.
! 486:
! 487: \item[word list] A word list is a list containing elements consisting
! 488: of a word and a list of word references.
! 489:
! 490: \item[term list] A term list is equivalent to a word list. Its
! 491: elements consist of a term and a list of word references.
! 492:
! 493: \item[word occurrence list] A word occurrence list is a list where
! 494: every element is treated like a type and a list of all its instances
! 495: -- occurrences -- in the text. The same word (type) can occur only
! 496: once in an occurrence list where it can reference many word instances.
! 497:
! 498: \item[word instance list] A word instance list is a word list where
! 499: every element is treated like a singular object (unlike a word
! 500: occurrence list). The same word (type) can occur multiple times in an
! 501: instance list where it can reference only one word or term instance.
! 502:
! 503: \end{description}
! 504:
! 505:
! 506: \end{document}
! 507:
! 508: %%% Local Variables:
! 509: %%% mode: latex
! 510: %%% TeX-master: t
! 511: %%% End: