\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
\usepackage{url}
\usepackage{graphicx}
\graphicspath{{graphics/}}

\title{Draft: Proposal for a text tool architecture for ECHO}
\author{Robert Casties}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
\label{sec:introduction}

In the context of ECHO, ``text'' represents scholarly metadata as well
as full texts of sources. As such, text forms the glue between the
different objects in the ECHO corpus.

To fully exploit the potential of text for semantic access and
interlinking, tools have to support the automatic or manual generation
of links between different objects within the ECHO corpus. A viewing
environment should present configurable views of all texts that allow
users to exploit relations to other texts and objects.

Four fields can be identified for which tools have to be developed:
\begin{itemize}
\item the generation of XML structures
\item the analysis of the corpora
\item the meaningful linking of texts
\item the generation of scholarly metadata.
\end{itemize}

\section{Requirements}
\label{sec:requirements}

The handling of large corpora makes it necessary to define a minimal
standard XML structure for these documents. This implies the
development of tools to convert existing document formats into these
standard formats. In addition, tools for editing documents in these
formats will have to be made available.

A prerequisite for generating links between documents is the ability
to analyse texts and to add the results of this analysis to the
document. In general, two types of analysis can be distinguished:
automatically generated analysis following defined rules, and manual
analysis by marking words depending on the context.

The analysis of corpora is the basis for automatically generated
linking of documents.
For example, the word lists generated by morphological analysis can
serve as a starting point for linking to a dictionary or a grammar.
Another example would be the use of a word list of technical terms as
the basis for linking to an encyclopedia or glossary. Furthermore,
such word lists can serve as starting points for cross-linking within
the text corpus, using these word lists as a common anchor.

Beyond the automatically generated linking of documents, linking as a
result of scholarly work has to be supported by text tools, e.g.\
showing connections between different texts in the corpus, combining
sources with translations or secondary texts, linking images to the
texts that describe them, or connecting full texts and images.

In particular, an open environment for adding comments and notes to
sources can be a test bed for how collaborative work on sources could
be encouraged by the ECHO project in order to build a virtual European
research area on cultural heritage.

\section{Technical issues}
\label{sec:formats}

\subsection{Granularity of reference}
\label{sec:gran-refer}

The basic layer of informational markup has to define the units of
reference for the higher layers. The granularity of these reference
units determines the amount of complexity needed for referencing in
the higher layers. The markup in the basic layer should also permit,
to a certain extent, changes in formatting and corrections in the
source document without losing referential integrity.

The proposed unit of reference in the basic layer is a \emph{word},
where \emph{word} means any sequence of characters between whitespace
or other special characters in the source document, excluding
formatting and markup. The word as a unit of reference is not meant to
be a semantical unit or even a morphological unit. It is only meant to
be the smallest easily recognizable unit used in the text.
Morphological, syntactical and semantical units can be assembled and
referenced on higher-level layers as \emph{terms} composed of one or
more, not necessarily adjacent, \emph{words}.

\subsection{Information layers}
\label{sec:information-layers}

The text tools operate according to the ``standoff principle'' of XML
markup. The basic text is marked up only to provide the basis of raw
data and reference. Additional syntactical and semantical information
-- be it automatically generated or scholarly edited -- is provided in
separate informational layers of \emph{word lists} referencing other
layers or the basic text.

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{word-termlists}
  \caption{Relation of basic text and term lists.}
  \label{fig:word-termlists}
\end{figure}

A \emph{word list} or \emph{term list}\footnote{The terms \emph{word
    list} and \emph{term list} are used interchangeably in the
  following text, since both forms should be functionally identical.}
is a list of \emph{words} or \emph{terms} that are each linked to a
list of references to \emph{words} or \emph{terms} in other \emph{word
  lists} or to \emph{words} in basic texts.
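
As an illustration of these definitions, the following minimal Python
sketch builds a word list from a plain text. It is not part of the
proposal: the function names \texttt{tokenize} and
\texttt{build\_word\_list}, the set of delimiters, the numbering of
words starting at 1, and the reference syntax
\texttt{document\#number} are all assumptions made for this example.

\begin{verbatim}
import re

# Assumption: a word is any run of characters that is neither
# whitespace nor one of a few punctuation marks. The real set of
# delimiters is left open by the proposal.
WORD_PATTERN = re.compile(r"[^\s.,;:!?()]+")


def tokenize(text):
    """Return the words of text in document order."""
    return WORD_PATTERN.findall(text)


def build_word_list(doc_id, text):
    """Map each unique word to references to its occurrences.

    References are strings of the hypothetical form 'doc_id#n',
    where n is the 1-based position of the word in the text.
    """
    word_list = {}
    for position, word in enumerate(tokenize(text), start=1):
        word_list.setdefault(word, []).append(f"{doc_id}#{position}")
    return word_list


sample = "Gallia est omnis divisa. Gallia est magna."
word_list = build_word_list("sample", sample)
for word, references in word_list.items():
    print(word, " ".join(references))
\end{verbatim}

Each entry pairs a word with the list of its occurrences, which is the
shape a term list would share if its entries pointed at entries of
another list instead of positions in the basic text.
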

An example of the informational layers in an English or Latin
text\footnote{English and Latin are examples of languages where
  sufficient morphological analysis can be based on single words.}
would be:
\begin{enumerate}
\item \emph{Basic text} layer, marked up with
  \emph{words}.\label{item:1}
\item \emph{Basic word list} layer, an automatically generated list of
  all unique words and references to their occurrences in the basic
  text (\ref{item:1}).\label{item:2}
\item \emph{Morphological term list} layer, an automatically generated
  list of the morphologically normalized forms of all words and
  references to their occurrences in the basic word list
  (\ref{item:2}).\label{item:4}
\item Scholarly edited \emph{term list} layer, a manually edited list
  of semantical units like technical terms used in the document,
  referring to the basic text (\ref{item:1}).\label{item:5}
\end{enumerate}

Additional annotation layers referencing the basic text or any other
layer could be produced and stored in the same text repository or on
any other server. Therefore it has to be possible to reference any
layer in a unique and stable way across the net.

In languages with more complex morphological units, the morphological
analysis layer can be based on an intermediate term layer that joins
basic words into morphological units.

\subsection{Primary and secondary source texts}
\label{sec:backr-orig-source}

The text tool system should be easily adaptable to different workflows
dealing with text in the ECHO domain. There are two basic types of
text sources with a different degree of integration in the central
ECHO text corpus.
%% FIXME!!

The \emph{primary source text} is maintained in the basic word tagged
form on a text corpus server. Updates and changes have to be worked
into the word tagged text without breaking the referential integrity.

For a \emph{secondary source text}, the basic word tagged text is not
the primary source.
A mapping file has to be provided that maps the words in the basic
text to other referenceable units in the primary source documents.
Updates and changes in the primary document may be followed by updates
to the mapping file or the basic text to maintain referential
integrity.

The distinction between these types of sources mainly concerns the
text cruncher, which produces the basic tagged text and possibly a
mapping file, and the presentation tools, which produce views of or
references to the original source texts.

\subsection{Support of additional markup}
\label{sec:supp-addit-mark}

The basic text tagging format should be transparent to additional
markup in the source text to enable the easy integration of the text
tools into existing formats and tools. The use of XML namespaces can
provide such transparency.

The common viewing environment cannot be completely agnostic to
additional markup. It must be able to interpret a common set of
minimal visual markup. Visual elements to be considered are:
\begin{itemize}
\item paragraphs and/or line breaks
\item page breaks
\item page images (coupled to page breaks)
\item inline images
\end{itemize}

When presenting text parts to the user as results of a search request,
it would be useful to have a general mechanism to select larger units
around the referenced word. Additional semantical units suitable for
this kind of reference would be sentences. The mechanism could try to
select the surrounding sentence and then fall back to larger units
like a paragraph, a page or the whole text.

A translation scheme to map different existing visual markup tags into
the common set for the viewing environment should be implemented. The
translation could be done directly upon creation of secondary source
texts, as these texts are decoupled from the original source text. The
translation would have to be done on the fly for primary source texts
where markup different from the common set is used.
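
The fallback mechanism for selecting the context around a referenced
word could be sketched as follows. This is only an illustration in
Python: the \texttt{Unit} record, the unit names, the function
\texttt{select\_context}, and the representation of units as ranges of
word numbers are assumptions made for this example and are not
prescribed by the proposal.

\begin{verbatim}
from typing import NamedTuple


class Unit(NamedTuple):
    kind: str    # e.g. "sentence", "paragraph" or "page"
    first: int   # number of the first word in the unit
    last: int    # number of the last word in the unit (inclusive)


# Assumed order of preference, smallest unit first.
FALLBACK_ORDER = ("sentence", "paragraph", "page")


def select_context(word, units, text_length):
    """Return the smallest known unit that contains the word.

    Falls back from sentence to paragraph to page, and finally to a
    unit covering the whole text if no markup encloses the word.
    """
    for kind in FALLBACK_ORDER:
        for unit in units:
            if unit.kind == kind and unit.first <= word <= unit.last:
                return unit
    return Unit("text", 1, text_length)


units = [
    Unit("paragraph", 1, 40),
    Unit("sentence", 1, 7),
    Unit("page", 1, 300),
]
print(select_context(5, units, 300))    # word 5 lies in the sentence
print(select_context(20, units, 300))   # no sentence: paragraph
\end{verbatim}

Because the function only looks at unit kinds and word ranges, it does
not need to know which markup tags produced the units, which keeps it
independent of the translation scheme for visual markup.
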

\section{Tools}
\label{sec:tools}

\subsection{Text cruncher}
\label{sec:text-cruncher}

The \emph{text cruncher} tool takes a text file and, where available,
information about a primary source, and produces a \emph{basic word
  tagged text}, a \emph{basic word list}, and, if the text is to be
considered a secondary source text, a \emph{mapping file}.

\subsection{Morphological analyzer}
\label{sec:morph-analys}

The \emph{morphological analyzer} tool for a given language takes a
word list or a term list of morphological units and produces a
\emph{morphological term list} of normalized forms, their
morphological description, and references to their occurrences in the
provided list.

A subfunction of the morphological analyzer should be a normalizer for
single words, to be used in conjunction with the dictionary tool.

\subsection{Dictionary}
\label{sec:dictionary}

The \emph{dictionary analyzer} tool takes a morphologically normalized
term list and produces a term list with known terms, references to
their definitions, and references to their occurrences in the provided
list.

A subfunction of the dictionary analyzer should be a lookup tool for
single normalized words or terms.

\subsection{Cross referencer}
\label{sec:cross-referencer}

The \emph{cross referencer} tool takes a word list from one text and a
set of word lists from other texts and produces a word list with words
from the first list and references into all of the lists.

\subsection{Display environment}
\label{sec:display-environment}

The \emph{display environment} should be able to display a text with
minimal visual markup and additional links defined by additional word
lists. The set of necessary visual markup like page breaks, page
images, inline images or text formatting should follow an agreed
standard.

The functionality provided by the links could be direct linking into
other texts, morphological analyses, or dictionary entries if the word
is only referenced by one word list.
In the case of multiple references to a word, a mechanism for
selecting one of the possible sources must be provided.

\subsection{List inverter}
\label{sec:list-inverter}

The \emph{list inverter} is a small auxiliary tool that takes a normal
word list that is ordered by unique words and produces an
\emph{inverted word list} that is ordered by word references.

\section{Use cases}
\label{sec:use-cases}

\subsection{Integration of Archimedes XML texts}
\label{sec:integr-arch-xml}

The XML texts of the Archimedes project could be integrated in two
different ways: either as primary source texts, adding basic word
tagging to the Archimedes markup, or as secondary source texts, by
providing mapping files to the unchanged source files.

In the first case, basic word tagging would be added to the XML
document by the text cruncher. The resulting documents could then be
further processed and edited, provided that word references are not
broken. The text cruncher would produce a basic word list for use with
other text tools.

In the second case, only a secondary source text and a mapping file
would be produced by the text cruncher, together with the basic word
list. The original source text would stay unchanged outside the text
repository.

Additional mappings would have to be generated to adapt the visual
markup used in the Archimedes XML to the common markup for the display
environment.

\subsection{Integration of existing webpages}
\label{sec:integr-exist-webp}

\subsection{Integration of raw OCR text}
\label{sec:integration-raw-ocr}

Raw OCR text, as it is generated by automatic OCR on digitized
document pages, could be considered original source material. The OCR
produces one plain text document per scanned image file. A suitable
text cruncher would produce a secondary source text for use in the
repository, with a mapping file referencing the original text files.
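
For the OCR case, such a text cruncher would number the words across
all page files and record where each word was found. The following
Python sketch is a rough illustration only: the function name
\texttt{crunch\_pages}, the file names, the whitespace-only word
delimiter, and the use of a file name plus character offset as the
target of the mapping are assumptions made for this example.

\begin{verbatim}
import re


def crunch_pages(pages):
    """Number the words of several OCR pages and map them to files.

    pages is a list of (file_name, text) pairs, one per scanned image.
    Returns (words, mapping): words holds (number, word) pairs, and
    mapping holds (number, file_name, offset) triples, where offset
    is the character position of the word in the original file.
    """
    words, mapping = [], []
    number = 0
    for file_name, text in pages:
        # Assumption: words are separated by whitespace only.
        for match in re.finditer(r"\S+", text):
            number += 1
            words.append((number, match.group()))
            mapping.append((number, file_name, match.start()))
    return words, mapping


pages = [
    ("page001.txt", "Gallia est omnis"),
    ("page002.txt", "divisa in partes"),
]
words, mapping = crunch_pages(pages)
for number, file_name, offset in mapping:
    print(f"{number} -> {file_name}({offset})")
\end{verbatim}

Word numbers run continuously over all pages, so the secondary source
text can be referenced as one document while the mapping still points
back to the individual, unchanged OCR files.
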

\subsection{Full text search}
\label{sec:full-text-search}

(to be done)

\subsection{Cross linking of texts}
\label{sec:cross-linking-texts}

(to be done)

\section{Proposed formats}
\label{sec:proposed-formats}

\subsection{Basic document}
\label{sec:basic-docum-form}

The basic document format consists of word tags and, optionally,
language information for morphological analysis and basic visual
markup. An example in pseudo XML markup might look like this:
\begin{verbatim}
omnia gallia est divisa in partes tres.
\end{verbatim}

\subsection{Basic wordlist}
\label{sec:wordlist}

The basic wordlist consists of all unique words and references to
their occurrences in the basic text.
\begin{verbatim}
patria
  xlink:bello_gallico#36
  xlink:bello_gallico#157
  xlink:bello_gallico#336
bello
  xlink:bello_gallico#189
  xlink:bello_gallico#236
  xlink:bello_gallico#557
  xlink:bello_gallico#1396
  xlink:bello_gallico#1450
\end{verbatim}

\subsection{Term list}
\label{sec:term-list}

A term groups one or more words into a semantical unit. A term list
contains chosen terms and references to their occurrences.
\begin{verbatim}
patria nostra
  xlink:bello_gallico#36
  xlink:bello_gallico#37
  xlink:bello_gallico#36
  xlink:bello_gallico#155
  xlink:bello_gallico#157
bello gallico
  xlink:bello_gallico#12
  xlink:bello_gallico#13
\end{verbatim}

\subsection{Primary source mapping}
\label{sec:prim-source-mapp}

A primary source mapping maps every word of a basic document to its
equivalent in the primary source document.
\begin{verbatim}
xlink:bello_gallico#1 xlink:bello.txt(1235)
xlink:bello_gallico#2 xlink:bello.txt(1245)
xlink:bello_gallico#3 xlink:bello.txt(1257)
\end{verbatim}

\section{Development priorities and time plan}
\label{sec:devel-prior-time}

(to be done)

\section{Glossary}
\label{sec:glossary}

\begin{description}
\item[word] In a basic text, a word is any sequence of characters
  between whitespace or other delimiters. A word on this level is not
  a semantical unit, nor even a syntactical one.
\item[term] A term is a container for one or more not necessarily
  adjacent words. Terms can be syntactical or semantical units. Terms
  can be used and referenced like basic words.
\item[word reference] A word reference is an xlink or similar
  reference to a word or term in a word list or in a basic text.
\item[term reference] A term reference is a reference to a term and
  equivalent to a word reference.
\item[word list] A word list is a list containing elements consisting
  of a word and a list of word references.
\item[term list] A term list is equivalent to a word list. Its
  elements consist of a term and a list of word references.
\item[word occurrence list] A word occurrence list is a list where
  every element is treated like a type and a list of all its instances
  -- occurrences -- in the text. The same word (type) can occur only
  once in an occurrence list where it can reference many word
  instances.
\item[word instance list] A word instance list is a word list where
  every element is treated like a singular object (unlike a word
  occurrence list). The same word (type) can occur multiple times in
  an instance list where it can reference only one word or term
  instance.
\end{description}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: