%% File: texttool-concept/texttools.tex
%% Revision 1.1.1.1 (vendor branch), Mon Sep 15 08:13:25 2003 UTC, by dwinter

    1: \documentclass[a4paper]{article}
    2: 
    3: \usepackage[latin1]{inputenc}
    4: \usepackage[T1]{fontenc}
    5: \usepackage{ae}
    6: \usepackage{url}
    7: \usepackage{graphicx}
    8: 
    9: \graphicspath{{graphics/}}
   10: 
   11: \title{Draft: Proposal for a text tool architecture for ECHO}
   12: \author{Robert Casties}
   13: \date{\today}
   14: 
   15: \begin{document}
   16: 
   17: \maketitle
   18: 
   19: \section{Introduction}
   20: \label{sec:introduction}
   21: In the context of ECHO ``text'' represents scholarly metadata as well
   22: as full texts of sources. As such, text forms the glue between the
   23: different objects in the ECHO corpus. To fully exploit the potential
   24: of text for semantic access and interlinking, tools have to support
   25: the automatic or manual generation of links between different objects
   26: within the ECHO corpus.
   27: 
   28: A viewing environment should present configurable views on all texts
   29: that allow users to exploit relations to other texts and objects.
   30: 
   31: Four different fields can be
   32: identified, for which tools have to be developed:
   33: \begin{itemize}
   34: \item the generation of XML structures
   35: \item the analysis of the corpora
   36: \item the meaningful linking of texts
   37: \item the generation of scholarly metadata.
   38: \end{itemize}
   39: 
   40: 
   41: \section{Requirements}
   42: \label{sec:requirements}
   43: The handling of large corpora makes it necessary to define a minimal
   44: standard XML structure for these documents. This implies the
   45: development of tools to convert existing document formats into
   46: these standard formats. In addition, tools for editing documents in
   47: these formats will have to be made available.
   48: 
   49: A prerequisite for generating links between documents is the
   50: ability to analyse texts and to add the results of this analysis
   51: to the document. In general, two types of analysis can
   52: be distinguished: automatically generated analysis following defined
   53: rules, and manual analysis by marking words depending on the
   54: context.
   55: 
   56: The analysis of corpora is the basis for automatically generated
   57: linking of documents. For example, the word lists generated by
   58: morphological analysis can serve as a starting point for linking to a
   59: dictionary or a grammar. Another example would be the use of a
   60: word list of technical terms as the basis for linking to
   61: an encyclopedia or glossary. Furthermore, such word lists can serve as
   62: starting points for cross-linking within the text corpus, using the
   63: word lists as a common anchor.
   64: 
   65: Beyond the automatically generated linking of documents, linking
   66: as a result of scholarly work has to be supported by text tools,
   67: e.g. showing connections between different texts in the corpus,
   68: combining sources with translations or secondary texts, linking
   69: images to the texts that describe them, or
   70: connecting full texts with images.
   71: 
   72: In particular, an open environment for adding comments and notes to
   73: sources can be a test bed for how collaborative work on sources could
   74: be encouraged by the ECHO project in order to build a virtual European
   75: research area on cultural heritage. 
   76: 
   77: 
   78: \section{Technical issues}
   79: \label{sec:formats}
   80: 
   81: 
   82: \subsection{Granularity of reference}
   83: \label{sec:gran-refer}
   84: 
   85: The basic layer of informational markup has to define the units of
   86: reference for the higher layers. The granularity of these reference
   87: units determines the amount of complexity needed for referencing in
   88: the higher layers. The markup in the basic layer should also permit
   89: changes in formatting and corrections in the source document to a
   90: certain extent without losing referential integrity.
   91: 
   92: The proposed unit of reference in the basic layer is a \emph{word},
   93: where \emph{word} means any sequence of characters between whitespace
   94: or other special characters in the source document, excluding
   95: formatting and markup. The word as a unit of reference is not meant to
   96: be a semantical unit or even a morphological unit. It is only meant to
   97: be the smallest easily recognizable unit used in the text.
   98: Morphological, syntactical and semantical units can be assembled and
   99: referenced on higher-level layers as \emph{terms} composed of one or
  100: more, not necessarily adjacent, \emph{words}.
  101: 
  102: 
  103: \subsection{Information layers}
  104: \label{sec:information-layers}
  105: 
  106: The text tools operate according to the ``standoff
  107: principle'' of XML markup. The basic text is marked up only to provide
  108: the basis of raw data and reference. Additional syntactical and
  109: semantical information, whether automatically generated or scholarly
  110: edited, is provided in separate informational layers of
  111: \emph{word lists} referencing other layers or the basic text.
  112: 
  113: \begin{figure}[htbp]
  114:   \centering
  115:   \includegraphics[width=0.8\textwidth]{word-termlists}
  116:   \caption{Relation of basic text and term lists. }
  117:   \label{fig:word-termlists}
  118: \end{figure}
  119: 
  120: A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
  121:   \emph{term list} will be used interchangeably in the following text
  122:   since both forms should be functionally identical.} is a list of
  123: \emph{words} or \emph{terms} that are each linked to a list of
  124: references to \emph{words} or \emph{terms} in other \emph{word lists}
  125: or to \emph{words} in basic texts.
  126: 
  127: An example of the informational layers in an English or Latin
  128: text\footnote{English or Latin as examples for languages where
  129:   sufficient morphological analysis can be based on single words.}
  130: would be:
  131: 
  132: \begin{enumerate}
  133: \item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}
  134:   
  135: \item \emph{Basic word list} layer, an automatically generated list of all
  136:   unique words and references to their occurrence in the basic text
  137:   (\ref{item:1}).\label{item:2}
  138:     
  139: \item \emph{Morphological term list} layer, an automatically generated list
  140:   of the morphologically normalized forms of all words and references
  141:   to their occurrence in the basic wordlist (\ref{item:2}).\label{item:4}
  142: 
  143: \item Scholarly edited \emph{term list} layer, a manually edited list of
  144:   semantical units like technical terms used in the document,
  145:   referring to the basic text (\ref{item:1}).\label{item:5}
  146: \end{enumerate}
  147: 
  148: Additional annotation layers referencing the basic text or any other
  149: layer could be produced and stored in the same text repository or on any
  150: other server. Therefore it has to be possible to reference any layer in
  151: a unique and stable way across the net.
  152: 
  153: In languages with more complex morphological units the morphological
  154: analysis layer can be based on an intermediate term layer that joins
  155: basic words into morphological units.
  156: 
  157: 
  158: 
  159: \subsection{Primary and secondary source texts}
  160: \label{sec:backr-orig-source}
  161: 
  162: The text tool system should be easily adaptable to different
  163: workflows dealing with text in the ECHO domain. There are two
  164: basic types of text sources with different degrees of integration into
  165: the central ECHO text corpus.
  166:  
  167: %% FIXME!!
  168: 
  169: The \emph{primary source text} is maintained in the basic word tagged
  170: form on a text corpus server. Updates and changes have to be worked
  171: into the word tagged text without breaking the referential integrity.
  172: 
  173: For a \emph{secondary source text}, the basic word tagged text is not
  174: the primary source. A mapping file has to be provided
  175: that maps the words in the basic text to other referenceable units in
  176: the primary source documents. Updates and changes in the primary
  177: document may be followed by updates to the mapping file or the basic
  178: text to maintain referential integrity.
  179: 
  180: The distinction between these types of sources mainly concerns the
  181: text cruncher, which produces the basic tagged text and possibly a mapping
  182: file, and the presentation tools, which produce views of or references to
  183: the original source texts.
  184: 
  185: 
  186: 
  187: \subsection{Support of additional markup}
  188: \label{sec:supp-addit-mark}
  189: 
  190: The basic text tagging format should be transparent to additional
  191: markup in the source text to enable the easy integration of the text
  192: tools into existing formats and tools. The use of XML namespaces can
  193: provide such transparency.
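
A minimal sketch of such transparency, assuming a hypothetical
namespace prefix \verb|tt| bound to the placeholder URI
\verb|http://example.org/texttools| and a host document that uses its
own \verb|<p>| and \verb|<emph>| elements (none of these names are
fixed by this proposal):

\begin{verbatim}
  <p xmlns:tt="http://example.org/texttools">
    <emph><tt:word id="1">omnia</tt:word></emph>
    <tt:word id="2">gallia</tt:word>
    <tt:word id="3">est</tt:word>
  </p>
\end{verbatim}

Tools that do not know the text tool namespace could ignore the
\verb|tt:word| elements, while the text tools could ignore the host
markup.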
  194: 
  195: The common viewing environment cannot be completely
  196: agnostic to additional markup. It must be able to interpret a common
  197: set of minimal visual markup. Visual elements to be considered are:
  198: 
  199: \begin{itemize}
  200: \item paragraphs and/or line breaks
  201: 
  202: \item page breaks
  203: 
  204: \item page images (coupled to page breaks)
  205: 
  206: \item inline images
  207: \end{itemize}
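
As an illustration only, the visual elements listed above could appear
in a basic document as in the following sketch. The element and
attribute names (\verb|pb|, \verb|lb|, \verb|p|, \verb|img|,
\verb|n|, \verb|image|, \verb|src|) and the file names are
hypothetical placeholders; the actual common set still has to be
agreed upon.

\begin{verbatim}
  <text lang="lat">
    <pb n="1" image="page0001.jpg"/>
    <p>
      <word id="1">omnia</word>
      <word id="2">gallia</word>
      <lb/>
      <word id="3">est</word>
    </p>
    <img src="figure01.jpg"/>
  </text>
\end{verbatim}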
  208: 
  209: When presenting text parts to the user as results of a search request,
  210: it would be useful to have a general mechanism to select larger units
  211: around the referenced word. Additional semantical units suitable for
  212: this kind of reference would be sentences. The mechanism could try to
  213: select the surrounding sentence and then fall back to larger units
  214: like a paragraph, a page or the whole text.
  215: 
  216: A translation scheme to map different existing visual markup tags into
  217: the common set for the viewing environment should be implemented. The
  218: translation could be done directly upon creation of secondary source
  219: texts, as these texts are decoupled from the original source text.
  220: The translation would have to be done on the fly for primary source
  221: texts where markup different from the common set is used.
  222: 
  223: 
  224: \section{Tools}
  225: \label{sec:tools}
  226: 
  227: 
  228: \subsection{Text cruncher}
  229: \label{sec:text-cruncher}
  230: 
  231: The \emph{text cruncher} tool takes a text file and optional
  232: information about a primary source and produces a \emph{basic word
  233:   tagged text}, a \emph{basic word list}, and additionally a
  234: \emph{mapping file} if the text is to be considered a secondary source
  235: text.
  236: 
  237: 
  238: \subsection{Morphological analyzer}
  239: \label{sec:morph-analys}
  240: 
  241: The \emph{morphological analyzer} tool for a given language takes a
  242: word list or a term list of morphological units and
  243: produces a \emph{morphological term list} of normalized forms, their
  244: morphological description, and references to their occurrences in the
  245: provided list.
  246: 
  247: A subfunction of the morphological analyzer should be a normalizer for
  248: single words to be used in conjunction with the dictionary tool.
  249: 
  250: 
  251: \subsection{Dictionary}
  252: \label{sec:dictionary}
  253: 
  254: The \emph{dictionary analyzer} tool takes a morphologically normalized
  255: term list and produces a term list with known terms,
  256: references to their definitions, and references to their occurrences in
  257: the provided list.
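
The output of the dictionary analyzer might look like the following
sketch. The \verb|<def-ref>| element, the dictionary target
\verb|dictionary#bellum| and the list name \verb|bello_gallico_morph|
are assumptions made for this illustration and are not defined
elsewhere in this proposal.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <term>bellum</term>
      <def-ref>xlink:dictionary#bellum</def-ref>
      <word-ref>xlink:bello_gallico_morph#1</word-ref>
    </list-entry>
  </list>
\end{verbatim}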
  258: 
  259: A subfunction of the dictionary analyzer should be a lookup tool for
  260: single normalized words or terms.
  261: 
  262: 
  263: \subsection{Cross referencer}
  264: \label{sec:cross-referencer}
  265: 
  266: The \emph{cross referencer} tool takes a word list from one text
  267: and a set of word lists from other texts and
  268: produces a word list with words from the first list and
  269: references into all of the lists.
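
A possible result of the cross referencer is sketched below. It
assumes two hypothetical basic word lists named
\verb|bello_gallico_words| and \verb|bello_civili_words|; the entry
numbers are invented for illustration.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <word>bello</word>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
      <word-ref>xlink:bello_civili_words#17</word-ref>
    </list-entry>
  </list>
\end{verbatim}

Each entry keeps the word from the first list and points to the
matching entries in all lists, so that a viewer could follow the
references from one text into the others.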
  270: 
  271: 
  272: \subsection{Display environment}
  273: \label{sec:display-environment}
  274: 
  275: The \emph{display environment} should be able to display a text with
  276: minimal visual markup and additional links defined by additional
  277: wordlists. 
  278: 
  279: The set of necessary visual markup like page breaks, page images,
  280: inline images or text formatting should follow an agreed standard.
  281: 
  282: The functionality provided by the links could be direct linking into
  283: other texts, morphological analyses, or dictionary entries if the word
  284: is referenced by only one word list. In the case of multiple
  285: references to a word, a mechanism for the selection of one of the
  286: possible sources must be provided.
  287: 
  288: 
  289: \subsection{List inverter}
  290: \label{sec:list-inverter}
  291: 
  292: The \emph{list inverter} is a small auxiliary tool that takes a
  293: normal word list that is ordered by unique words and produces an
  294: \emph{inverted word list} that is ordered by word references.
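
As a sketch, inverting the basic word list example from
Section~\ref{sec:wordlist} could yield a list like the following,
with one entry per word reference, ordered by position in the text.
Only the first four entries are shown. Placing the reference before
the word inside each entry is an assumption of this sketch, not a
fixed format.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="2">
      <word-ref>xlink:bello_gallico#157</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="3">
      <word-ref>xlink:bello_gallico#189</word-ref>
      <word>bello</word>
    </list-entry>
    <list-entry id="4">
      <word-ref>xlink:bello_gallico#236</word-ref>
      <word>bello</word>
    </list-entry>
  </list>
\end{verbatim}

In the terms of the glossary, the input is a word occurrence list and
the output is a word instance list.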
  295: 
  296: 
  297: 
  298: 
  299: \section{Use cases}
  300: \label{sec:use-cases}
  301: 
  302: 
  303: \subsection{Integration of Archimedes XML texts}
  304: \label{sec:integr-arch-xml}
  305: 
  306: The XML texts of the Archimedes project could be integrated in two
  307: different ways: either as primary source texts, adding basic word
  308: tagging to the Archimedes markup or as secondary source texts by
  309: providing mapping files to the unchanged source files.
  310: 
  311: In the first case basic word tagging would be added to the XML
  312: document by the text cruncher. The resulting documents could then be
  313: further processed and edited, provided that word references are not
  314: broken. The text cruncher would produce a basic word list for use with
  315: other text tools.
  316: 
  317: In the second case only a secondary source text and a mapping file
  318: would be produced by the text cruncher together with the basic word
  319: list. The original source text would stay unchanged outside the text
  320: repository.
  321: 
  322: Additional mappings would have to be generated to adapt the visual
  323: markup used in the Archimedes XML to the common markup for the display
  324: environment.
  325: 
  326: 
  327: 
  328: \subsection{Integration of existing webpages}
  329: \label{sec:integr-exist-webp}
  330: 
  331: 
  332: 
  333: \subsection{Integration of raw OCR text}
  334: \label{sec:integration-raw-ocr}
  335: 
  336: Raw OCR text as it is generated by automatic OCR on digitized document
  337: pages could be considered original source material. The OCR produces
  338: one plain text document per scanned image file. A suitable text
  339: cruncher would produce a secondary source text for use in the
  340: repository with a mapping file referencing the original text files.
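
Following the mapping format of Section~\ref{sec:prim-source-mapp},
such a mapping file could reference one plain text file per scanned
page, as sketched below. The document name \verb|ocr_text|, the file
names and the numbers in parentheses (read here as character offsets,
which is an assumption) are invented for illustration.

\begin{verbatim}
  <source-mapping>
    <map id="1">
      <word-ref>xlink:ocr_text#1</word-ref>
      <ref>xlink:page0001.txt(0)</ref>
    </map>
    <map id="2">
      <word-ref>xlink:ocr_text#2</word-ref>
      <ref>xlink:page0001.txt(6)</ref>
    </map>
    <map id="3">
      <word-ref>xlink:ocr_text#3</word-ref>
      <ref>xlink:page0002.txt(0)</ref>
    </map>
  </source-mapping>
\end{verbatim}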
  341: 
  342: 
  343: 
  344: \subsection{Full text search}
  345: \label{sec:full-text-search}
  346: 
  347: (to be done)
  348: 
  349: 
  350: \subsection{Cross linking of texts}
  351: \label{sec:cross-linking-texts}
  352: 
  353: (to be done)
  354: 
  355: 
  356: \section{Proposed formats}
  357: \label{sec:proposed-formats}
  358: 
  359: 
  360: \subsection{Basic document}
  361: \label{sec:basic-docum-form}
  362: 
  363: The basic document format consists of word tags, and optionally language information
  364: for morphological analysis and basic visual markup.
  365: 
  366: An example in pseudo XML markup might look like this:
  367: 
  368: \begin{verbatim}
  369:   <text lang="lat">
  370:     <word id="1">omnia</word>
  371:     <word id="2">gallia</word>
  372:     <word id="3">est</word>
  373:     <word id="4">divisa</word>
  374:     <word id="5">in</word>
  375:     <word id="6">partes</word>
  376:     <word id="7">tres</word>.
  377:   </text>
  378: \end{verbatim}
  379: 
  380: 
  381: 
  382: \subsection{Basic wordlist}
  383: \label{sec:wordlist}
  384: 
  385: The basic wordlist consists of all unique words and references to
  386: their occurrences in the basic text.
  387: 
  388: \begin{verbatim}
  389:   <list id="1">
  390:     <list-entry id="1">
  391:       <word>patria</word>
  392:       <word-ref>xlink:bello_gallico#36</word-ref>
  393:       <word-ref>xlink:bello_gallico#157</word-ref>
  394:       <word-ref>xlink:bello_gallico#336</word-ref>
  395:     </list-entry>
  396:     <list-entry id="2">
  397:       <word>bello</word>
  398:       <word-ref>xlink:bello_gallico#189</word-ref>
  399:       <word-ref>xlink:bello_gallico#236</word-ref>
  400:       <word-ref>xlink:bello_gallico#557</word-ref>
  401:       <word-ref>xlink:bello_gallico#1396</word-ref>
  402:       <word-ref>xlink:bello_gallico#1450</word-ref>
  403:     </list-entry>
  404:   </list>
  405: \end{verbatim}
  406: 
  407: 
  408: \subsection{Term list}
  409: \label{sec:term-list}
  410: 
  411: A term groups one or more words into a semantical unit. A term list
  412: contains chosen terms and references to their occurrences.
  413: 
  414: \begin{verbatim}
  415:   <list id="1">
  416:     <list-entry id="1">
  417:       <term>patria nostra</term>
  418:       <term-ref>
  419:         <word-ref>xlink:bello_gallico#36</word-ref>
  420:         <word-ref>xlink:bello_gallico#37</word-ref>
  421:       </term-ref>
  422:       <word-ref>xlink:bello_gallico#36</word-ref>
  423:       <term-ref>
  424:         <word-ref>xlink:bello_gallico#155</word-ref>
  425:         <word-ref>xlink:bello_gallico#157</word-ref>
  426:       </term-ref>
  427:     </list-entry>
  428:     <list-entry id="2">
  429:       <term>bello gallico</term>
  430:       <term-ref>
  431:         <word-ref>xlink:bello_gallico#12</word-ref>
  432:         <word-ref>xlink:bello_gallico#13</word-ref>
  433:       </term-ref>
  434:     </list-entry>
  435:   </list>
  436: \end{verbatim}
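
The morphological term list layer described in
Section~\ref{sec:information-layers} could reuse the same list
format, with normalized forms as terms and references into the basic
word list instead of the basic text. The following sketch assumes
that the basic word list of Section~\ref{sec:wordlist} is stored
under the hypothetical name \verb|bello_gallico_words| and that
\emph{bello} is read as a form of the noun \emph{bellum}. The
morphological description of each form is left out because no format
for it has been proposed yet.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <term>patria</term>
      <word-ref>xlink:bello_gallico_words#1</word-ref>
    </list-entry>
    <list-entry id="2">
      <term>bellum</term>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
    </list-entry>
  </list>
\end{verbatim}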
  437: 
  438: 
  439: \subsection{Primary source mapping}
  440: \label{sec:prim-source-mapp}
  441: 
  442: A primary source mapping maps every word of a basic document to its
  443: equivalent in the primary source document.
  444: 
  445: \begin{verbatim}
  446:   <source-mapping>
  447:     <map id="1">
  448:       <word-ref>xlink:bello_gallico#1</word-ref>
  449:       <ref>xlink:bello.txt(1235)</ref>
  450:     </map>
  451:     <map id="2">
  452:       <word-ref>xlink:bello_gallico#2</word-ref>
  453:       <ref>xlink:bello.txt(1245)</ref>
  454:     </map>
  455:     <map id="3">
  456:       <word-ref>xlink:bello_gallico#3</word-ref>
  457:       <ref>xlink:bello.txt(1257)</ref>
  458:     </map>
  459:   </source-mapping>
  460: \end{verbatim}
  461: 
  462: 
  463: 
  464: \section{Development priorities and time plan}
  465: \label{sec:devel-prior-time}
  466: 
  467: (to be done)
  468: 
  469: \section{Glossary}
  470: \label{sec:glossary}
  471: 
  472: \begin{description}
  473: \item[word] In a basic text a word is any sequence of characters
  474:   delimited by whitespace or other delimiters. A word on this
  475:   level is not a semantical or even a syntactical unit.
  476: 
  477: \item[term] A term is a container for one or more not necessarily
  478:   adjacent words. Terms can be syntactical or semantical units. Terms
  479:   can be used and referenced like basic words.
  480:   
  481: \item[word reference] A word reference is an xlink or similar
  482:   reference to a word or term in a word list or in a basic text.
  483:   
  484: \item[term reference] A term reference is a reference to a term and
  485:   equivalent to a word reference.
  486: 
  487: \item[word list] A word list is a list containing elements consisting
  488:   of a word and a list of word references.
  489: 
  490: \item[term list] A term list is equivalent to a word list. Its
  491:   elements consist of a term and a list of word references.
  492:   
  493: \item[word occurrence list] A word occurrence list is a list where
  494:   every element is treated like a type and a list of all its instances
  495:   -- occurrences -- in the text. The same word (type) can occur only
  496:   once in an occurrence list where it can reference many word instances.
  497:   
  498: \item[word instance list] A word instance list is a word list where
  499:   every element is treated like a singular object (unlike a word
  500:   occurrence list). The same word (type) can occur multiple times in an
  501:   instance list where it can reference only one word or term instance.
  502: 
  503: \end{description}
  504: 
  505: 
  506: \end{document}
  507: 
  508: %%% Local Variables: 
  509: %%% mode: latex
  510: %%% TeX-master: t
  511: %%% End: 
  512: 
