Annotation of texttool-concept/texttools.tex, revision 1.1.1.1

1.1       dwinter     1: \documentclass[a4paper]{article}
                      2: 
                      3: \usepackage[latin1]{inputenc}
                      4: \usepackage[T1]{fontenc}
                      5: \usepackage{ae}
                      6: \usepackage{url}
                      7: \usepackage{graphicx}
                      8: 
                      9: \graphicspath{{graphics/}}
                     10: 
                     11: \title{Draft: Proposal for a text tool architecture for ECHO}
                     12: \author{Robert Casties}
                     13: \date{\today}
                     14: 
                     15: \begin{document}
                     16: 
                     17: \maketitle
                     18: 
                     19: \section{Introduction}
                     20: \label{sec:introduction}
                     21: In the context of ECHO ``text'' represents scholarly metadata as well
                     22: as full texts of sources. As such, text forms the glue between the
                     23: different objects in the ECHO corpus. To fully exploit the potential
                     24: of text for semantic access and interlinking, tools have to support
                     25: the automatic or manual generation of links between different objects
                     26: within the ECHO corpus.
                     27: 
                     28: A viewing environment should present configurable views on all texts
                     29: that allow users to exploit relations to other texts and objects.
                     30: 
                     31: Four different fields can be
                     32: identified, for which tools have to be developed:
                     33: \begin{itemize}
                     34: \item the generation of XML-structures
                     35: \item the analysis of the corpora
                     36: \item the meaningful linking of texts
                     37: \item the generation of scholarly metadata.
                     38: \end{itemize}
                     39: 
                     40: 
                     41: \section{Requirements}
                     42: \label{sec:requirements}
                     43: The handling of large corpora makes it necessary to define a minimal
                     44: standard XML structure for these documents. This implies the
                     45: development of tools to convert existing document formats into
                     46: these standard formats. In addition, tools for editing documents in
                     47: these formats will have to be made available.
                     48: 
                     49: A prerequisite for generating links between documents is the
                     50: ability to analyse texts and to add the results of this analysis
                     51: to the document. In general, two types of analysis can be
                     52: distinguished: automatic analysis following defined rules, and
                     53: manual analysis in which words are marked depending on their
                     54: context.
                     55: 
                     56: The analysis of corpora is the basis for automatically generated
                     57: linking of documents. For example, the word lists generated by
                     58: morphological analysis can serve as a starting point for linking to a
                     59: dictionary or a grammar. Another example would be the use of a
                     60: word list of technical terms as the basis for linking to
                     61: an encyclopedia or glossary. Furthermore, such word lists can serve as
                     62: starting points for cross-linking within the text corpus, using the
                     63: word lists as a common anchor.
                     64: 
                     65: Beyond the automatically generated linking of documents, linking
                     66: as a result of scholarly work has to be supported by text tools,
                     67: e.g. showing connections between different texts in the corpus,
                     68: combining sources with translations or secondary texts, linking
                     69: images to the texts that describe them, or
                     70: connecting full texts to images.
                     71: 
                     72: In particular, an open environment for adding comments and notes to
                     73: sources can be a test bed for how collaborative work on sources could
                     74: be encouraged by the ECHO project in order to build a virtual European
                     75: research area on cultural heritage. 
                     76: 
                     77: 
                     78: \section{Technical issues}
                     79: \label{sec:formats}
                     80: 
                     81: 
                     82: \subsection{Granularity of reference}
                     83: \label{sec:gran-refer}
                     84: 
                     85: The basic layer of informational markup has to define the units of
                     86: reference for the higher layers. The granularity of these reference
                     87: units determines the amount of complexity needed for referencing in
                     88: the higher layers. The markup in the basic layer should also permit
                     89: changes in formatting and corrections in the source document to a
                     90: certain extent without losing referential integrity.
                     91: 
                     92: The proposed unit of reference in the basic layer is a \emph{word},
                     93: where \emph{word} means any sequence of characters between whitespace
                     94: or other special characters in the source document, excluding
                     95: formatting and markup. The word as a unit of reference is not meant to
                     96: be a semantical unit or even a morphological unit. It is only meant to
                     97: be the smallest easily recognizable unit used in the text.
                     98: Morphological, syntactical and semantical units can be assembled and
                     99: referenced on higher-level layers as \emph{terms} composed of one or
                    100: more, not necessarily adjacent, \emph{words}.
                    101: 
                    102: 
                    103: \subsection{Information layers}
                    104: \label{sec:information-layers}
                    105: 
                    106: The text tools operate according to the ``standoff
                    107: principle'' of XML markup. The basic text is marked up only to provide
                    108: the basis of raw data and reference. Additional syntactical and
                    109: semantical information -- be it automatically generated or scholarly
                    110: edited -- is provided in separate informational layers of
                    111: \emph{word lists} referencing other layers or the basic text.
                    112: 
                    113: \begin{figure}[htbp]
                    114:   \centering
                    115:   \includegraphics[width=0.8\textwidth]{word-termlists}
                    116:   \caption{Relation of basic text and term lists. }
                    117:   \label{fig:word-termlists}
                    118: \end{figure}
                    119: 
                    120: A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
                    121:   \emph{term list} will be used interchangeably in the following text
                    122:   since both forms should be functionally identical.} is a list of
                    123: \emph{words} or \emph{terms} that are each linked to a list of
                    124: references to \emph{words} or \emph{terms} in other \emph{word lists}
                    125: or to \emph{words} in basic texts.
                    126: 
                    127: An example of the informational layers in an English or Latin
                    128: text\footnote{English or Latin as examples of languages where
                    129:   sufficient morphological analysis can be based on single words.}
                    130: would be:
                    131: 
                    132: \begin{enumerate}
                    133: \item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}
                    134:   
                    135: \item \emph{Basic word list} layer, an automatically generated list of all
                    136:   unique words and references to their occurrence in the basic text
                    137:   (\ref{item:1}).\label{item:2}
                    138:     
                    139: \item \emph{Morphological term list} layer, an automatically generated list
                    140:   of the morphologically normalized forms of all words and references
                    141:   to their occurrence in the basic wordlist (\ref{item:2}).\label{item:4}
                    142: 
                    143: \item Scholarly edited \emph{term list} layer, a manually edited list of
                    144:   semantical units like technical terms used in the document,
                    145:   referring to the basic text (\ref{item:1}).\label{item:5}
                    146: \end{enumerate}
                    147: 
                    148: Additional annotation layers referencing the basic text or any other
                    149: layer could be produced and stored in the same text repository or on any
                    150: other server. Therefore it has to be possible to reference any layer in
                    151: a unique and stable way across the net.
                    152: 
                    153: In languages with more complex morphological units the morphological
                    154: analysis layer can be based on an intermediate term layer that joins
                    155: basic words into morphological units.
                    156: 
                    157: 
                    158: 
                    159: \subsection{Primary and secondary source texts}
                    160: \label{sec:backr-orig-source}
                    161: 
                    162: The text tool system should be easily adaptable to different
                    163: workflows dealing with text in the ECHO domain. There are two
                    164: basic types of text sources with different degrees of integration in
                    165: the central ECHO text corpus.
                    166:  
                    167: %% FIXME!!
                    168: 
                    169: The \emph{primary source text} is maintained in the basic word tagged
                    170: form on a text corpus server. Updates and changes have to be worked
                    171: into the word tagged text without breaking the referential integrity.
                    172: 
                    173: For a \emph{secondary source text}, the basic word tagged text is not
                    174: the primary source. A mapping file has to be provided
                    175: that maps the words in the basic text to other referenceable units in
                    176: the primary source documents. Updates and changes in the primary
                    177: document may be followed by updates to the mapping file or the basic
                    178: text to maintain referential integrity.
                    179: 
                    180: The distinction between these types of sources mainly concerns the
                    181: text cruncher, which produces the basic tagged text and possibly a mapping
                    182: file, and the presentation tools, which produce views of or references to the
                    183: original source texts.
                    184: 
                    185: 
                    186: 
                    187: \subsection{Support of additional markup}
                    188: \label{sec:supp-addit-mark}
                    189: 
                    190: The basic text tagging format should be transparent to additional
                    191: markup in the source text to enable the easy integration of the text
                    192: tools into existing formats and tools. The use of XML namespaces can
                    193: provide such transparency.
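
As a sketch of how this could look, the word tags might be placed in
their own namespace so that existing markup passes through unchanged.
The prefix \verb|tt|, the namespace URI and the surrounding
\verb|<p>| and \verb|<emph>| elements are assumptions made for
illustration only; they are not part of the proposal.

\begin{verbatim}
  <p xmlns:tt="http://example.org/texttools">
    <tt:word id="1">gallia</tt:word>
    <emph><tt:word id="2">est</tt:word></emph>
    <tt:word id="3">omnis</tt:word>
  </p>
\end{verbatim}

A tool that only knows the word namespace could then collect the
\verb|tt:word| elements and ignore everything else, while an existing
tool could ignore the word tags in the same way.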
                    194: 
                    195: The common viewing environment cannot be completely
                    196: agnostic to additional markup. It must be able to interpret a common
                    197: set of minimal visual markup. Visual elements to be considered are:
                    198: 
                    199: \begin{itemize}
                    200: \item paragraphs and/or line breaks
                    201: 
                    202: \item page breaks
                    203: 
                    204: \item page images (coupled to page breaks)
                    205: 
                    206: \item inline images
                    207: \end{itemize}
                    208: 
                    209: When presenting text parts to the user as results of a search request
                    210: it would be useful to have a general mechanism to select larger units
                    211: around the referenced word. Additional semantical units suitable for
                    212: this kind of reference would be sentences. The mechanism could try to
                    213: select the surrounding sentence and then fall back to larger units
                    214: like a paragraph, a page or the whole text.
                    215: 
                    216: A translation scheme to map different existing visual markup tags into
                    217: the common set for the viewing environment should be implemented. The
                    218: translation could be done directly upon creation of secondary source
                    219: texts as these texts are decoupled from the original source text.
                    220: The translation would have to be done on-the-fly for primary source
                    221: texts where markup different from the common set is used.
                    222: 
                    223: 
                    224: \section{Tools}
                    225: \label{sec:tools}
                    226: 
                    227: 
                    228: \subsection{Text cruncher}
                    229: \label{sec:text-cruncher}
                    230: 
                    231: The \emph{text cruncher} tool takes a text file and, optionally,
                    232: information about a primary source and produces a \emph{basic word
                    233:   tagged text}, a \emph{basic word list} and, if the text is to be
                    234: considered a secondary source text, a \emph{mapping file} that links
                    235: it to the primary source.
                    236: 
                    237: 
                    238: \subsection{Morphological analyzer}
                    239: \label{sec:morph-analys}
                    240: 
                    241: The \emph{morphological analyzer} tool for a given language takes a
                    242: word list or a term list of morphological units and
                    243: produces a \emph{morphological term list} of normalized forms, their
                    244: morphological description, and references to their occurrences in the
                    245: provided list.
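
In the pseudo XML of section~\ref{sec:proposed-formats}, an entry of
such a list might look roughly as follows. The element \verb|<morph>|
and the list name \verb|bello_gallico_words| (standing for the basic
word list of the text) are hypothetical and only serve as an
illustration.

\begin{verbatim}
  <list id="2">
    <list-entry id="1">
      <term>bellum</term>
      <morph>noun, dative or ablative singular</morph>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
    </list-entry>
  </list>
\end{verbatim}

In this sketch the reference points to an entry of the basic word
list (assumed here to be the entry for \emph{bello}) rather than to
the basic text, in line with the layering described in
section~\ref{sec:information-layers}.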
                    246: 
                    247: A subfunction of the morphological analyzer should be a normalizer for
                    248: single words to be used in conjunction with the dictionary tool.
                    249: 
                    250: 
                    251: \subsection{Dictionary}
                    252: \label{sec:dictionary}
                    253: 
                    254: The \emph{dictionary analyzer} tool takes a morphologically normalized
                    255: term list and produces a term list with known terms,
                    256: references to their definitions, and references to their occurrences in
                    257: the provided list.
                    258: 
                    259: A subfunction of the dictionary analyzer should be a lookup tool for
                    260: single normalized words or terms.
                    261: 
                    262: 
                    263: \subsection{Cross referencer}
                    264: \label{sec:cross-referencer}
                    265: 
                    266: The \emph{cross referencer} tool takes a word list from one text
                    267: and a set of word lists from other texts and
                    268: produces a word list with words from the first list and
                    269: references into all of the lists.
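
A possible result, again in pseudo XML, is sketched below. The list
names \verb|bello_gallico_words| and \verb|bello_civili_words| and the
entry numbers are invented for this example; they simply stand for
the word list of the first text and the word list of one other text.

\begin{verbatim}
  <list id="3">
    <list-entry id="1">
      <word>bello</word>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
      <word-ref>xlink:bello_civili_words#17</word-ref>
    </list-entry>
  </list>
\end{verbatim}

Each entry keeps the word from the first list and collects one
reference per list in which that word occurs.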
                    270: 
                    271: 
                    272: \subsection{Display environment}
                    273: \label{sec:display-environment}
                    274: 
                    275: The \emph{display environment} should be able to display a text with
                    276: minimal visual markup and additional links defined by additional
                    277: wordlists. 
                    278: 
                    279: The set of necessary visual markup like page breaks, page images,
                    280: inline images or text formatting should follow an agreed standard.
                    281: 
                    282: The functionality provided by the links could be direct linking into
                    283: other texts, morphological analyses, or dictionary entries if the word
                    284: is only referenced by one word list. In the case of multiple
                    285: references to a word a mechanism for the selection of one of the
                    286: possible sources must be provided.
                    287: 
                    288: 
                    289: \subsection{List inverter}
                    290: \label{sec:list-inverter}
                    291: 
                    292: The \emph{list inverter} is a small auxiliary tool that takes a
                    293: normal word list that is ordered by unique words and produces an
                    294: \emph{inverted word list} that is ordered by word references.
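
Applied to the basic wordlist example in section~\ref{sec:wordlist},
the beginning of an inverted list might look like this. The layout
with one entry per reference is an assumption, and only the first
three entries are shown.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="2">
      <word-ref>xlink:bello_gallico#157</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="3">
      <word-ref>xlink:bello_gallico#189</word-ref>
      <word>bello</word>
    </list-entry>
  </list>
\end{verbatim}

The same word now appears once per occurrence, which corresponds to
the \emph{word instance list} described in the glossary.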
                    295: 
                    296: 
                    297: 
                    298: 
                    299: \section{Use cases}
                    300: \label{sec:use-cases}
                    301: 
                    302: 
                    303: \subsection{Integration of Archimedes XML texts}
                    304: \label{sec:integr-arch-xml}
                    305: 
                    306: The XML texts of the Archimedes project could be integrated in two
                    307: different ways: either as primary source texts, adding basic word
                    308: tagging to the Archimedes markup or as secondary source texts by
                    309: providing mapping files to the unchanged source files.
                    310: 
                    311: In the first case basic word tagging would be added to the XML
                    312: document by the text cruncher. The resulting documents could then be
                    313: further processed and edited, provided that word references are not
                    314: broken. The text cruncher would produce a basic word list for use with
                    315: other text tools.
                    316: 
                    317: In the second case only a secondary source text and a mapping file
                    318: would be produced by the text cruncher together with the basic word
                    319: list. The original source text would stay unchanged outside the text
                    320: repository.
                    321: 
                    322: Additional mappings would have to be generated to adapt the visual
                    323: markup used in the Archimedes XML to the common markup for the display
                    324: environment.
                    325: 
                    326: 
                    327: 
                    328: \subsection{Integration of existing webpages}
                    329: \label{sec:integr-exist-webp}
                    330: 
                    331: 
                    332: 
                    333: \subsection{Integration of raw OCR text}
                    334: \label{sec:integration-raw-ocr}
                    335: 
                    336: Raw OCR text as it is generated by automatic OCR on digitized document
                    337: pages could be considered original source material. The OCR produces
                    338: one plain text document per scanned image file. A suitable text
                    339: cruncher would produce a secondary source text for use in the
                    340: repository with a mapping file referencing the original text files.
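
Following the mapping format of section~\ref{sec:prim-source-mapp},
such a mapping file might be sketched as follows. The document name
\verb|ocr_text|, the file names and the numbers in parentheses (read
here as positions within a page file) are invented for illustration.

\begin{verbatim}
  <source-mapping>
    <map id="1">
      <word-ref>xlink:ocr_text#1</word-ref>
      <ref>xlink:page0001.txt(0)</ref>
    </map>
    <map id="2">
      <word-ref>xlink:ocr_text#2</word-ref>
      <ref>xlink:page0001.txt(7)</ref>
    </map>
    <map id="3">
      <word-ref>xlink:ocr_text#312</word-ref>
      <ref>xlink:page0002.txt(0)</ref>
    </map>
  </source-mapping>
\end{verbatim}

The point of the sketch is that words of a single secondary source
text can refer to several original files, one per scanned page.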
                    341: 
                    342: 
                    343: 
                    344: \subsection{Full text search}
                    345: \label{sec:full-text-search}
                    346: 
                    347: (to be done)
                    348: 
                    349: 
                    350: \subsection{Cross linking of texts}
                    351: \label{sec:cross-linking-texts}
                    352: 
                    353: (to be done)
                    354: 
                    355: 
                    356: \section{Proposed formats}
                    357: \label{sec:proposed-formats}
                    358: 
                    359: 
                    360: \subsection{Basic document}
                    361: \label{sec:basic-docum-form}
                    362: 
                    363: The basic document format consists of word tags, and optionally language information
                    364: for morphological analysis and basic visual markup.
                    365: 
                    366: An example in pseudo XML markup might look like this:
                    367: 
                    368: \begin{verbatim}
                    369:   <text lang="lat">
                    370:     <word id="1">gallia</word>
                    371:     <word id="2">est</word>
                    372:     <word id="3">omnis</word>
                    373:     <word id="4">divisa</word>
                    374:     <word id="5">in</word>
                    375:     <word id="6">partes</word>
                    376:     <word id="7">tres</word>.
                    377:   </text>
                    378: \end{verbatim}
                    379: 
                    380: 
                    381: 
                    382: \subsection{Basic wordlist}
                    383: \label{sec:wordlist}
                    384: 
                    385: The basic wordlist consists of all unique words and references to
                    386: their occurrences in the basic text.
                    387: 
                    388: \begin{verbatim}
                    389:   <list id="1">
                    390:     <list-entry id="1">
                    391:       <word>patria</word>
                    392:       <word-ref>xlink:bello_gallico#36</word-ref>
                    393:       <word-ref>xlink:bello_gallico#157</word-ref>
                    394:       <word-ref>xlink:bello_gallico#336</word-ref>
                    395:     </list-entry>
                    396:     <list-entry id="2">
                    397:       <word>bello</word>
                    398:       <word-ref>xlink:bello_gallico#189</word-ref>
                    399:       <word-ref>xlink:bello_gallico#236</word-ref>
                    400:       <word-ref>xlink:bello_gallico#557</word-ref>
                    401:       <word-ref>xlink:bello_gallico#1396</word-ref>
                    402:       <word-ref>xlink:bello_gallico#1450</word-ref>
                    403:     </list-entry>
                    404:   </list>
                    405: \end{verbatim}
                    406: 
                    407: 
                    408: \subsection{Term list}
                    409: \label{sec:term-list}
                    410: 
                    411: A term groups one or more words into a semantical unit. A term list
                    412: contains chosen terms and references to their occurrences.
                    413: 
                    414: \begin{verbatim}
                    415:   <list id="1">
                    416:     <list-entry id="1">
                    417:       <term>patria nostra</term>
                    418:       <term-ref>
                    419:         <word-ref>xlink:bello_gallico#36</word-ref>
                    420:         <word-ref>xlink:bello_gallico#37</word-ref>
                    421:       </term-ref>
                    422:       <word-ref>xlink:bello_gallico#36</word-ref>
                    423:       <term-ref>
                    424:         <word-ref>xlink:bello_gallico#155</word-ref>
                    425:         <word-ref>xlink:bello_gallico#157</word-ref>
                    426:       </term-ref>
                    427:     </list-entry>
                    428:     <list-entry id="2">
                    429:       <term>bello gallico</term>
                    430:       <term-ref>
                    431:         <word-ref>xlink:bello_gallico#12</word-ref>
                    432:         <word-ref>xlink:bello_gallico#13</word-ref>
                    433:       </term-ref>
                    434:     </list-entry>
                    435:   </list>
                    436: \end{verbatim}
                    437: 
                    438: 
                    439: \subsection{Primary source mapping}
                    440: \label{sec:prim-source-mapp}
                    441: 
                    442: A primary source mapping maps every word of a basic document to its
                    443: equivalent in the primary source document.
                    444: 
                    445: \begin{verbatim}
                    446:   <source-mapping>
                    447:     <map id="1">
                    448:       <word-ref>xlink:bello_gallico#1</word-ref>
                    449:       <ref>xlink:bello.txt(1235)</ref>
                    450:     </map>
                    451:     <map id="2">
                    452:       <word-ref>xlink:bello_gallico#2</word-ref>
                    453:       <ref>xlink:bello.txt(1245)</ref>
                    454:     </map>
                    455:     <map id="3">
                    456:       <word-ref>xlink:bello_gallico#3</word-ref>
                    457:       <ref>xlink:bello.txt(1257)</ref>
                    458:     </map>
                    459:   </source-mapping>
                    460: \end{verbatim}
                    461: 
                    462: 
                    463: 
                    464: \section{Development priorities and time plan}
                    465: \label{sec:devel-prior-time}
                    466: 
                    467: (to be done)
                    468: 
                    469: \section{Glossary}
                    470: \label{sec:glossary}
                    471: 
                    472: \begin{description}
                    473: \item[word] In a basic text a word is any sequence of characters
                    474:   delimited by whitespace or other delimiters. A word on this
                    475:   level is not a semantical unit, not even a syntactical one.
                    476: 
                    477: \item[term] A term is a container for one or more not necessarily
                    478:   adjacent words. Terms can be syntactical or semantical units. Terms
                    479:   can be used and referenced like basic words.
                    480:   
                    481: \item[word reference] A word reference is an xlink or similar
                    482:   reference to a word or term in a word list or in a basic text.
                    483:   
                    484: \item[term reference] A term reference is a reference to a term and
                    485:   equivalent to a word reference.
                    486: 
                    487: \item[word list] A word list is a list containing elements consisting
                    488:   of a word and a list of word references.
                    489: 
                    490: \item[term list] A term list is equivalent to a word list. Its
                    491:   elements consist of a term and a list of word references.
                    492:   
                    493: \item[word occurrence list] A word occurrence list is a list where
                    494:   every element is treated like a type and a list of all its instances
                    495:   -- occurrences -- in the text. The same word (type) can occur only
                    496:   once in an occurrence list where it can reference many word instances.
                    497:   
                    498: \item[word instance list] A word instance list is a word list where
                    499:   every element is treated like a singular object (unlike a word
                    500:   occurrence list). The same word (type) can occur multiple times in an
                    501:   instance list where it can reference only one word or term instance.
                    502: 
                    503: \end{description}
                    504: 
                    505: 
                    506: \end{document}
                    507: 
                    508: %%% Local Variables: 
                    509: %%% mode: latex
                    510: %%% TeX-master: t
                    511: %%% End: 
