Annotation of texttool-concept/texttools.tex, revision 1.1

1.1     ! dwinter     1: \documentclass[a4paper]{article}
        !             2: 
        !             3: \usepackage[latin1]{inputenc}
        !             4: \usepackage[T1]{fontenc}
        !             5: \usepackage{ae}
        !             6: \usepackage{url}
        !             7: \usepackage{graphicx}
        !             8: 
        !             9: \graphicspath{{graphics/}}
        !            10: 
        !            11: \title{Draft: Proposal for a text tool architecture for ECHO}
        !            12: \author{Robert Casties}
        !            13: \date{\today}
        !            14: 
        !            15: \begin{document}
        !            16: 
        !            17: \maketitle
        !            18: 
        !            19: \section{Introduction}
        !            20: \label{sec:introduction}
        !            21: In the context of ECHO ``text'' represents scholarly metadata as well
        !            22: as full texts of sources. As such, text forms the glue between the
        !            23: different objects in the ECHO corpus. To fully exploit the potential
        !            24: of text for semantic access and interlinking, tools have to support
        !            25: the automatic or manual generation of links between different objects
        !            26: within the ECHO corpus.
        !            27: 
        !            28: A viewing environment should present configurable views of all texts
        !            29: that let users exploit relations to other texts and objects.
        !            30: 
        !            31: Four different fields can be identified for which tools have to be
        !            32: developed:
        !            33: \begin{itemize}
        !            34: \item the generation of XML structures
        !            35: \item the analysis of the corpora
        !            36: \item the meaningful linking of texts
        !            37: \item the generation of scholarly metadata.
        !            38: \end{itemize}
        !            39: 
        !            40: 
        !            41: \section{Requirements}
        !            42: \label{sec:requirements}
        !            43: The handling of large corpora makes it necessary to define a minimal
        !            44: standard XML structure for these documents. This implies the
        !            45: development of tools to convert existing document formats into
        !            46: these standard formats. In addition, tools for editing documents in
        !            47: these formats will have to be made available.
        !            48: 
        !            49: A prerequisite for generating links between documents is the
        !            50: ability to analyse texts and to add the results of this analysis
        !            51: to the document. In general, two types of analysis can
        !            52: be distinguished: automatic analysis following defined
        !            53: rules, and manual analysis in which words are marked up depending on
        !            54: their context.
        !            55: 
        !            56: The analysis of corpora is the basis for automatically generated
        !            57: links between documents. For example, the word lists generated by
        !            58: morphological analysis can serve as a starting point for linking to a
        !            59: dictionary or a grammar. Another example would be the use of a
        !            60: word list of technical terms as the basis for linking to
        !            61: an encyclopedia or glossary. Furthermore, such word lists can serve as
        !            62: starting points for cross-linking within the text corpus, using these
        !            63: word lists as a common anchor.
        !            64: 
        !            65: Beyond the automatically generated linking of documents, text tools
        !            66: also have to support linking that results from scholarly work,
        !            67: e.g. showing connections between different texts in the corpus,
        !            68: combining sources with translations or secondary texts, or linking
        !            69: images to the texts that describe them and full texts to
        !            70: images.
        !            71: 
        !            72: In particular, an open environment for adding comments and notes to
        !            73: sources can be a test bed for how collaborative work on sources could
        !            74: be encouraged by the ECHO project in order to build a virtual European
        !            75: research area on cultural heritage. 
        !            76: 
        !            77: 
        !            78: \section{Technical issues}
        !            79: \label{sec:formats}
        !            80: 
        !            81: 
        !            82: \subsection{Granularity of reference}
        !            83: \label{sec:gran-refer}
        !            84: 
        !            85: The basic layer of informational markup has to define the units of
        !            86: reference for the higher layers. The granularity of these reference
        !            87: units determines the amount of complexity needed for referencing in
        !            88: the higher layers. The markup in the basic layer should also permit
        !            89: changes in formatting and corrections in the source document to a
        !            90: certain extent without losing referential integrity.
        !            91: 
        !            92: The proposed unit of reference in the basic layer is a \emph{word},
        !            93: where \emph{word} means any sequence of characters between whitespace
        !            94: or other special characters in the source document, excluding
        !            95: formatting and markup. The word as a unit of reference is not meant to
        !            96: be a semantical unit or even a morphological unit. It is only meant to
        !            97: be the smallest easily recognizable unit used in the text.
        !            98: Morphological, syntactical and semantical units can be assembled and
        !            99: referenced on higher level layers as \emph{terms} comprised of one or
        !           100: more, not necessarily adjacent \emph{words}.
        !           101: 
        !           102: 
        !           103: \subsection{Information layers}
        !           104: \label{sec:information-layers}
        !           105: 
        !           106: The text tools operate according to the ``standoff
        !           107: principle'' of XML markup. The basic text is marked up only to provide
        !           108: the basis of raw data and reference. Additional syntactical and
        !           109: semantical information -- be it automatically generated or scholarly
        !           110: edited -- is provided in separate informational layers of
        !           111: \emph{word lists} referencing other layers or the basic text.
        !           112: 
        !           113: \begin{figure}[htbp]
        !           114:   \centering
        !           115:   \includegraphics[width=0.8\textwidth]{word-termlists}
        !           116:   \caption{Relation of basic text and term lists. }
        !           117:   \label{fig:word-termlists}
        !           118: \end{figure}
        !           119: 
        !           120: A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
        !           121:   \emph{term list} will be used interchangeably in the following text
        !           122:   since both forms should be functionally identical.} is a list of
        !           123: \emph{words} or \emph{terms} that are each linked to a list of
        !           124: references to \emph{words} or \emph{terms} in other \emph{word lists}
        !           125: or to \emph{words} in basic texts.
        !           126: 
        !           127: An example of the informational layers in an English or Latin
        !           128: text\footnote{English or Latin as examples for languages where
        !           129:   sufficient morphological analysis can be based on single words.}
        !           130: would be:
        !           131: 
        !           132: \begin{enumerate}
        !           133: \item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}
        !           134:   
        !           135: \item \emph{Basic word list} layer, an automatically generated list of all
        !           136:   unique words and references to their occurrences in the basic text
        !           137:   (\ref{item:1}).\label{item:2}
        !           138:     
        !           139: \item \emph{Morphological term list} layer, an automatically generated list
        !           140:   of the morphologically normalized forms of all words and references
        !           141:   to their occurrences in the basic word list (\ref{item:2}).\label{item:4}
        !           142: 
        !           143: \item Scholarly edited \emph{term list} layer, a manually edited list of
        !           144:   semantical units like technical terms used in the document,
        !           145:   referring to the basic text (\ref{item:1}).\label{item:5}
        !           146: \end{enumerate}
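
The relation between the first three layers can be illustrated with a
minimal Python sketch. It is not part of the proposed tools: the
function names \verb|build_word_list| and \verb|build_term_layer|, the
in-memory data layout, and the toy lemma table are assumptions made
only for this illustration.

\begin{verbatim}
from collections import defaultdict

def build_word_list(words):
    """Map each unique word to the ids of its occurrences in the basic text."""
    occurrences = defaultdict(list)
    for word_id, word in words:
        occurrences[word].append(word_id)
    return dict(occurrences)

def build_term_layer(word_list, normalize):
    """Map each normalized form to the word list entries it covers."""
    layer = defaultdict(list)
    for word in word_list:
        layer[normalize(word)].append(word)
    return dict(layer)

# Layer 1: basic text as (id, word) pairs.
basic_text = [(1, "bello"), (2, "patria"), (3, "bella"), (4, "bello")]
# Layer 2: basic word list referencing the basic text.
word_list = build_word_list(basic_text)
# Layer 3: morphological term list referencing the basic word list.
# The lemma table is a hypothetical stand-in for a real analyzer.
toy_lemmas = {"bello": "bellum", "bella": "bellum"}
morph_layer = build_term_layer(word_list, lambda w: toy_lemmas.get(w, w))
\end{verbatim}

Each layer only stores references to the layer below it, so the basic
text itself stays untouched when further layers are added.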
        !           147: 
        !           148: Additional annotation layers referencing the basic text or any other
        !           149: layer could be produced and stored in the same text repository or on any
        !           150: other server. Therefore it has to be possible to reference any layer in
        !           151: a unique and stable way across the net.
        !           152: 
        !           153: In languages with more complex morphological units the morphological
        !           154: analysis layer can be based on an intermediate term layer that joins
        !           155: basic words into morphological units.
        !           156: 
        !           157: 
        !           158: 
        !           159: \subsection{Primary and secondary source texts}
        !           160: \label{sec:backr-orig-source}
        !           161: 
        !           162: The text tool system should be easily adaptable to different
        !           163: workflows dealing with text in the ECHO domain. There are two
        !           164: basic types of text sources with different degrees of integration into
        !           165: the central ECHO text corpus.
        !           166:  
        !           167: %% FIXME!!
        !           168: 
        !           169: The \emph{primary source text} is maintained in the basic word tagged
        !           170: form on a text corpus server. Updates and changes have to be worked
        !           171: into the word tagged text without breaking the referential integrity.
        !           172: 
        !           173: For a \emph{secondary source text}, the basic word tagged text is not
        !           174: the primary source. A mapping file has to be provided
        !           175: that maps the words in the basic text to other referenceable units in
        !           176: the primary source documents. Updates and changes in the primary
        !           177: document may be followed by updates to the mapping file or the basic
        !           178: text to maintain referential integrity.
        !           179: 
        !           180: The distinction between these types of sources mainly concerns the
        !           181: text cruncher, which produces the basic tagged text and, if needed, a mapping
        !           182: file, and the presentation tools, which produce views of or references to the
        !           183: original source texts.
        !           184: 
        !           185: 
        !           186: 
        !           187: \subsection{Support of additional markup}
        !           188: \label{sec:supp-addit-mark}
        !           189: 
        !           190: The basic text tagging format should be transparent to additional
        !           191: markup in the source text to enable the easy integration of the text
        !           192: tools into existing formats and tools. The use of XML namespaces can
        !           193: provide such transparency.
        !           194: 
        !           195: The common viewing environment cannot be completely
        !           196: agnostic to additional markup. It must be able to interpret a common
        !           197: set of minimal visual markup. Visual elements to be considered are:
        !           198: 
        !           199: \begin{itemize}
        !           200: \item paragraphs and/or line breaks
        !           201: 
        !           202: \item page breaks
        !           203: 
        !           204: \item page images (coupled to page breaks)
        !           205: 
        !           206: \item inline images
        !           207: \end{itemize}
        !           208: 
        !           209: When presenting text parts to the user as results to a search request
        !           210: it would be useful to have a general mechanism to select larger units
        !           211: around the referenced word. Additional semantical units suitable for
        !           212: this kind of reference would be sentences. The mechanism could try to
        !           213: select the surrounding sentence and then fall back to larger units
        !           214: like a paragraph, a page or the whole text.
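
The fallback mechanism described above could be sketched in Python as
follows. The representation of units as triples of kind, first word id
and last word id is an assumption made for this illustration; the
proposal does not define how such units are stored.

\begin{verbatim}
# Assumption: each unit is a (kind, first_word_id, last_word_id) triple.
FALLBACK_ORDER = ["sentence", "paragraph", "page", "text"]

def select_context(word_id, units):
    """Return the first unit, in fallback order, that contains word_id."""
    for kind in FALLBACK_ORDER:
        for unit_kind, first, last in units:
            if unit_kind == kind and first <= word_id <= last:
                return (unit_kind, first, last)
    return None

# Hypothetical text with one marked sentence, one paragraph, no pages.
units = [("sentence", 1, 7), ("paragraph", 1, 40), ("text", 1, 900)]
in_sentence = select_context(5, units)
in_paragraph = select_context(20, units)
in_text = select_context(500, units)
\end{verbatim}

A word inside the marked sentence yields that sentence; a word outside
any sentence falls back to the paragraph and finally to the whole text.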
        !           215: 
        !           216: A translation scheme to map different existing visual markup tags into
        !           217: the common set for the viewing environment should be implemented. The
        !           218: translation could be done directly upon creation of secondary source
        !           219: texts as these texts are decoupled from the original source text.
        !           220: The translation would have to be done on-the-fly for primary source
        !           221: texts where markup different from the common set is used.
        !           222: 
        !           223: 
        !           224: \section{Tools}
        !           225: \label{sec:tools}
        !           226: 
        !           227: 
        !           228: \subsection{Text cruncher}
        !           229: \label{sec:text-cruncher}
        !           230: 
        !           231: The \emph{text cruncher} tool takes a text file and, optionally,
        !           232: information about a primary source and produces a \emph{basic word
        !           233:   tagged text}, a \emph{basic word list}, and a
        !           234: \emph{mapping file} if the text is to be considered a secondary source
        !           235: text.
        !           236: 
        !           237: 
        !           238: \subsection{Morphological analyzer}
        !           239: \label{sec:morph-analys}
        !           240: 
        !           241: The \emph{morphological analyzer} tool for a given language takes a
        !           242: word list or a term list of morphological units and
        !           243: produces a \emph{morphological term list} of normalized forms, their
        !           244: morphological description, and references to their occurrences in the
        !           245: provided list.
        !           246: 
        !           247: A subfunction of the morphological analyzer should be a normalizer for
        !           248: single words to be used in conjunction with the dictionary tool.
        !           249: 
        !           250: 
        !           251: \subsection{Dictionary}
        !           252: \label{sec:dictionary}
        !           253: 
        !           254: The \emph{dictionary analyzer} tool takes a morphologically normalized
        !           255: term list and produces a term list with known terms,
        !           256: references to their definitions, and references to their occurrences in
        !           257: the provided list.
        !           258: 
        !           259: A subfunction of the dictionary analyzer should be a lookup tool for
        !           260: single normalized words or terms.
        !           261: 
        !           262: 
        !           263: \subsection{Cross referencer}
        !           264: \label{sec:cross-referencer}
        !           265: 
        !           266: The \emph{cross referencer} tool takes a word list from one text
        !           267: and a set of word lists from other texts and
        !           268: produces a word list with words from the first list and
        !           269: references into all of the lists.
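
A minimal Python sketch of this step, assuming that a word list is held
in memory as a mapping from word to word ids. The function name
\verb|cross_reference| and the text names other than
\verb|bello_gallico| are hypothetical and serve only as illustration.

\begin{verbatim}
def cross_reference(source_name, source_list, other_lists):
    """For each word of the first list, collect references into all lists.

    source_list: dict word -> list of word ids
    other_lists: dict text name -> word list of the same shape
    Returns dict word -> list of (text name, word id) references.
    """
    result = {}
    for word, ids in source_list.items():
        refs = [(source_name, i) for i in ids]
        for name, word_list in other_lists.items():
            refs.extend((name, i) for i in word_list.get(word, []))
        result[word] = refs
    return result

caesar = {"bello": [189, 236], "patria": [36]}
# Hypothetical word lists of two other texts.
others = {"livius": {"bello": [7]},
          "tacitus": {"patria": [3, 90], "roma": [1]}}
links = cross_reference("bello_gallico", caesar, others)
\end{verbatim}

Words that occur only in the other lists, like \verb|roma| here, are
not carried over, since the result contains only words from the first
list.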
        !           270: 
        !           271: 
        !           272: \subsection{Display environment}
        !           273: \label{sec:display-environment}
        !           274: 
        !           275: The \emph{display environment} should be able to display a text with
        !           276: minimal visual markup and additional links defined by additional
        !           277: word lists.
        !           278: 
        !           279: The set of necessary visual markup like page breaks, page images,
        !           280: inline images or text formatting should follow an agreed standard.
        !           281: 
        !           282: The functionality provided by the links could be direct linking into
        !           283: other texts, morphological analyses, or dictionary entries if the word
        !           284: is only referenced by one word list. In the case of multiple
        !           285: references to a word, a mechanism for the selection of one of the
        !           286: possible sources must be provided.
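
The distinction between direct linking and selection could look like
the following Python sketch. The layer names, the link targets and the
function \verb|resolve_links| are assumptions for illustration only.

\begin{verbatim}
def resolve_links(word_id, layers):
    """Decide how the display should link a word.

    layers: dict layer name -> dict word id -> link target
    Returns ("direct", targets) for exactly one reference,
    ("choose", targets) for several, and ("none", []) otherwise.
    """
    targets = [(name, refs[word_id])
               for name, refs in layers.items() if word_id in refs]
    if len(targets) == 1:
        return ("direct", targets)
    if targets:
        return ("choose", targets)
    return ("none", [])

# Hypothetical layers referencing words of the basic text.
layers = {
    "dictionary": {36: "dict:patria", 189: "dict:bellum"},
    "morphology": {189: "morph:bellum"},
}
single = resolve_links(36, layers)
multiple = resolve_links(189, layers)
\end{verbatim}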
        !           287: 
        !           288: 
        !           289: \subsection{List inverter}
        !           290: \label{sec:list-inverter}
        !           291: 
        !           292: The \emph{list inverter} is a small auxiliary tool that takes a
        !           293: normal word list that is ordered by unique words and produces an
        !           294: \emph{inverted word list} that is ordered by word references.
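
As a sketch, assuming a word list is held as a mapping from word to
numeric word references, the inversion amounts to a few lines of
Python. The function name \verb|invert_word_list| is hypothetical.

\begin{verbatim}
def invert_word_list(word_list):
    """Turn word -> [references] into (reference, word) pairs
    ordered by reference."""
    inverted = [(ref, word)
                for word, refs in word_list.items()
                for ref in refs]
    return sorted(inverted)

word_list = {"patria": [36, 157], "bello": [12, 189]}
inverted = invert_word_list(word_list)
\end{verbatim}

The inverted list has one entry per word reference, so a word that
occurs several times appears several times.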
        !           295: 
        !           296: 
        !           297: 
        !           298: 
        !           299: \section{Use cases}
        !           300: \label{sec:use-cases}
        !           301: 
        !           302: 
        !           303: \subsection{Integration of Archimedes XML texts}
        !           304: \label{sec:integr-arch-xml}
        !           305: 
        !           306: The XML texts of the Archimedes project could be integrated in two
        !           307: different ways: either as primary source texts, adding basic word
        !           308: tagging to the Archimedes markup or as secondary source texts by
        !           309: providing mapping files to the unchanged source files.
        !           310: 
        !           311: In the first case basic word tagging would be added to the XML
        !           312: document by the text cruncher. The resulting documents could then be
        !           313: further processed and edited, provided that word references are not
        !           314: broken. The text cruncher would produce a basic word list for use with
        !           315: other text tools.
        !           316: 
        !           317: In the second case only a secondary source text and a mapping file
        !           318: would be produced by the text cruncher together with the basic word
        !           319: list. The original source text would stay unchanged outside the text
        !           320: repository.
        !           321: 
        !           322: Additional mappings would have to be generated to adapt the visual
        !           323: markup used in the Archimedes XML to the common markup for the display
        !           324: environment.
        !           325: 
        !           326: 
        !           327: 
        !           328: \subsection{Integration of existing webpages}
        !           329: \label{sec:integr-exist-webp}
        !           330: 
        !           331: 
        !           332: 
        !           333: \subsection{Integration of raw OCR text}
        !           334: \label{sec:integration-raw-ocr}
        !           335: 
        !           336: Raw OCR text as it is generated by automatic OCR on digitized document
        !           337: pages could be considered original source material. The OCR produces
        !           338: one plain text document per scanned image file. A suitable text
        !           339: cruncher would produce a secondary source text for use in the
        !           340: repository with a mapping file referencing the original text files.
        !           341: 
        !           342: 
        !           343: 
        !           344: \subsection{Full text search}
        !           345: \label{sec:full-text-search}
        !           346: 
        !           347: (to be done)
        !           348: 
        !           349: 
        !           350: \subsection{Cross linking of texts}
        !           351: \label{sec:cross-linking-texts}
        !           352: 
        !           353: (to be done)
        !           354: 
        !           355: 
        !           356: \section{Proposed formats}
        !           357: \label{sec:proposed-formats}
        !           358: 
        !           359: 
        !           360: \subsection{Basic document}
        !           361: \label{sec:basic-docum-form}
        !           362: 
        !           363: The basic document format consists of word tags and, optionally, language information
        !           364: for morphological analysis and basic visual markup.
        !           365: 
        !           366: An example in pseudo-XML markup might look like this:
        !           367: 
        !           368: \begin{verbatim}
        !           369:   <text lang="lat">
        !           370:     <word id="1">gallia</word>
        !           371:     <word id="2">est</word>
        !           372:     <word id="3">omnis</word>
        !           373:     <word id="4">divisa</word>
        !           374:     <word id="5">in</word>
        !           375:     <word id="6">partes</word>
        !           376:     <word id="7">tres</word>.
        !           377:   </text>
        !           378: \end{verbatim}
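
Markup of this shape could be produced by a text cruncher along the
lines of the following Python sketch. It assumes that a word is a run
of letters, digits or underscores and that all other characters are
delimiters kept between the tags; the real delimiter set and the
function name \verb|tag_words| are not part of this proposal.

\begin{verbatim}
import re
import xml.etree.ElementTree as ET

def tag_words(text, lang):
    """Wrap each word in a <word> element with a running id.
    Delimiters (whitespace, punctuation) stay between the tags."""
    root = ET.Element("text", {"lang": lang})
    last = None
    position = 0
    for word_id, match in enumerate(re.finditer(r"\w+", text), start=1):
        gap = text[position:match.start()]
        if last is None:
            root.text = gap      # delimiters before the first word
        else:
            last.tail = gap      # delimiters after the previous word
        last = ET.SubElement(root, "word", {"id": str(word_id)})
        last.text = match.group(0)
        position = match.end()
    if last is None:
        root.text = text
    else:
        last.tail = text[position:]
    return ET.tostring(root, encoding="unicode")

tagged = tag_words("in partes tres.", "lat")
\end{verbatim}

Because the delimiters are preserved as text between the elements, the
original character sequence can be recovered by dropping the tags.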
        !           379: 
        !           380: 
        !           381: 
        !           382: \subsection{Basic word list}
        !           383: \label{sec:wordlist}
        !           384: 
        !           385: The basic word list consists of all unique words and references to
        !           386: their occurrences in the basic text.
        !           387: 
        !           388: \begin{verbatim}
        !           389:   <list id="1">
        !           390:     <list-entry id="1">
        !           391:       <word>patria</word>
        !           392:       <word-ref>xlink:bello_gallico#36</word-ref>
        !           393:       <word-ref>xlink:bello_gallico#157</word-ref>
        !           394:       <word-ref>xlink:bello_gallico#336</word-ref>
        !           395:     </list-entry>
        !           396:     <list-entry id="2">
        !           397:       <word>bello</word>
        !           398:       <word-ref>xlink:bello_gallico#189</word-ref>
        !           399:       <word-ref>xlink:bello_gallico#236</word-ref>
        !           400:       <word-ref>xlink:bello_gallico#557</word-ref>
        !           401:       <word-ref>xlink:bello_gallico#1396</word-ref>
        !           402:       <word-ref>xlink:bello_gallico#1450</word-ref>
        !           403:     </list-entry>
        !           404:   </list>
        !           405: \end{verbatim}
        !           406: 
        !           407: 
        !           408: \subsection{Term list}
        !           409: \label{sec:term-list}
        !           410: 
        !           411: A term groups one or more words into a semantical unit. A term list
        !           412: contains chosen terms and references to their occurrences.
        !           413: 
        !           414: \begin{verbatim}
        !           415:   <list id="1">
        !           416:     <list-entry id="1">
        !           417:       <term>patria nostra</term>
        !           418:       <term-ref>
        !           419:         <word-ref>xlink:bello_gallico#36</word-ref>
        !           420:         <word-ref>xlink:bello_gallico#37</word-ref>
        !           421:       </term-ref>
        !           422:       <word-ref>xlink:bello_gallico#36</word-ref>
        !           423:       <term-ref>
        !           424:         <word-ref>xlink:bello_gallico#155</word-ref>
        !           425:         <word-ref>xlink:bello_gallico#157</word-ref>
        !           426:       </term-ref>
        !           427:     </list-entry>
        !           428:     <list-entry id="2">
        !           429:       <term>bello gallico</term>
        !           430:       <term-ref>
        !           431:         <word-ref>xlink:bello_gallico#12</word-ref>
        !           432:         <word-ref>xlink:bello_gallico#13</word-ref>
        !           433:       </term-ref>
        !           434:     </list-entry>
        !           435:   </list>
        !           436: \end{verbatim}
        !           437: 
        !           438: 
        !           439: \subsection{Primary source mapping}
        !           440: \label{sec:prim-source-mapp}
        !           441: 
        !           442: A primary source mapping maps every word of a basic document to its
        !           443: equivalent in the primary source document.
        !           444: 
        !           445: \begin{verbatim}
        !           446:   <source-mapping>
        !           447:     <map id="1">
        !           448:       <word-ref>xlink:bello_gallico#1</word-ref>
        !           449:       <ref>xlink:bello.txt(1235)</ref>
        !           450:     </map>
        !           451:     <map id="2">
        !           452:       <word-ref>xlink:bello_gallico#2</word-ref>
        !           453:       <ref>xlink:bello.txt(1245)</ref>
        !           454:     </map>
        !           455:     <map id="3">
        !           456:       <word-ref>xlink:bello_gallico#3</word-ref>
        !           457:       <ref>xlink:bello.txt(1257)</ref>
        !           458:     </map>
        !           459:   </source-mapping>
        !           460: \end{verbatim}
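
A presentation tool could read such a mapping as sketched below in
Python. The sketch treats the pseudo-XML above as if it were the final
format, which is an assumption, and it leaves the reference strings
uninterpreted since their syntax is not yet defined. The function name
\verb|load_mapping| is hypothetical.

\begin{verbatim}
import xml.etree.ElementTree as ET

MAPPING_XML = """
<source-mapping>
  <map id="1">
    <word-ref>xlink:bello_gallico#1</word-ref>
    <ref>xlink:bello.txt(1235)</ref>
  </map>
  <map id="2">
    <word-ref>xlink:bello_gallico#2</word-ref>
    <ref>xlink:bello.txt(1245)</ref>
  </map>
</source-mapping>
"""

def load_mapping(xml_text):
    """Return a dict from word reference to primary source reference."""
    mapping = {}
    for entry in ET.fromstring(xml_text.strip()).iter("map"):
        fields = {child.tag: child.text for child in entry}
        mapping[fields["word-ref"]] = fields["ref"]
    return mapping

mapping = load_mapping(MAPPING_XML)
source_ref = mapping["xlink:bello_gallico#2"]
\end{verbatim}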
        !           461: 
        !           462: 
        !           463: 
        !           464: \section{Development priorities and time plan}
        !           465: \label{sec:devel-prior-time}
        !           466: 
        !           467: (to be done)
        !           468: 
        !           469: \section{Glossary}
        !           470: \label{sec:glossary}
        !           471: 
        !           472: \begin{description}
        !           473: \item[word] In a basic text a word is any sequence of characters
        !           474:   between whitespace or other delimiters. A word on this
        !           475:   level is not a semantical unit, and not even a syntactical one.
        !           476: 
        !           477: \item[term] A term is a container for one or more not necessarily
        !           478:   adjacent words. Terms can be syntactical or semantical units. Terms
        !           479:   can be used and referenced like basic words.
        !           480:   
        !           481: \item[word reference] A word reference is an xlink or similar
        !           482:   reference to a word or term in a word list or in a basic text.
        !           483:   
        !           484: \item[term reference] A term reference is a reference to a term and
        !           485:   equivalent to a word reference.
        !           486: 
        !           487: \item[word list] A word list is a list containing elements consisting
        !           488:   of a word and a list of word references.
        !           489: 
        !           490: \item[term list] A term list is equivalent to a word list. Its
        !           491:   elements consist of a term and a list of word references.
        !           492:   
        !           493: \item[word occurrence list] A word occurrence list is a list where
        !           494:   every element is treated like a type and a list of all its instances
        !           495:   -- occurrences -- in the text. The same word (type) can occur only
        !           496:   once in an occurrence list where it can reference many word instances.
        !           497:   
        !           498: \item[word instance list] A word instance list is a word list where
        !           499:   every element is treated like a singular object (unlike a word
        !           500:   occurrence list). The same word (type) can occur multiple times in an
        !           501:   instance list where it can reference only one word or term instance.
        !           502: 
        !           503: \end{description}
        !           504: 
        !           505: 
        !           506: \end{document}
        !           507: 
        !           508: %%% Local Variables: 
        !           509: %%% mode: latex
        !           510: %%% TeX-master: t
        !           511: %%% End: 
