\documentclass[a4paper]{article}

\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
\usepackage{url}
\usepackage{graphicx}

\graphicspath{{graphics/}}

\title{Draft: Proposal for a text tool architecture for ECHO}
\author{Robert Casties}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}
In the context of ECHO, ``text'' represents scholarly metadata as well
as full texts of sources. As such, text forms the glue between the
different objects in the ECHO corpus. To fully exploit the potential
of text for semantic access and interlinking, tools have to support
the automatic or manual generation of links between different objects
within the ECHO corpus.

A viewing environment should present configurable views on all texts
that allow users to exploit relations to other texts and objects.

Four different fields can be
identified, for which tools have to be developed:
\begin{itemize}
\item the generation of XML structures
\item the analysis of the corpora
\item the meaningful linking of texts
\item the generation of scholarly metadata.
\end{itemize}


\section{Requirements}
\label{sec:requirements}
The handling of large corpora makes it necessary to define a minimal
standard XML structure for these documents. This implies the
development of tools to convert existing document formats into
these standard formats. In addition, tools for editing documents in
these formats will have to be made available.

A prerequisite for generating links between documents is the
ability to analyse texts and to add the results of this analysis
to the document. In general, two types of analysis can
be distinguished: automatic analysis following defined
rules, and manual analysis by marking words depending on the
context.

The analysis of corpora is the basis for automatically generated
linking of documents. For example, the word lists generated by
morphological analysis can serve as a starting point for linking to a
dictionary or a grammar. Another example would be the use of a
word list of technical terms as a basis for linking to
an encyclopedia or glossary. Furthermore, such word lists can serve as
starting points for cross-linking within the text corpus, using the
word lists as a common anchor.

Beyond the automatically generated linking of documents, text tools
also have to support linking that results from scholarly work,
e.g. showing connections between different texts in the corpus,
combining sources with translations or secondary texts, linking
images to the texts that describe them, or connecting full texts
with images.

In particular, an open environment for adding comments and notes to
sources can be a test bed for how collaborative work on sources could
be encouraged by the ECHO project in order to build a virtual European
research area on cultural heritage. 


\section{Technical issues}
\label{sec:formats}


\subsection{Granularity of reference}
\label{sec:gran-refer}

The basic layer of informational markup has to define the units of
reference for the higher layers. The granularity of these reference
units determines the amount of complexity needed for referencing in
the higher layers. The markup in the basic layer should also permit
changes in formatting and corrections in the source document to a
certain extent without losing referential integrity.

The proposed unit of reference in the basic layer is a \emph{word},
where \emph{word} means any sequence of characters between whitespace
or other special characters in the source document, excluding
formatting and markup. The word as a unit of reference is not meant to
be a semantical unit or even a morphological unit. It is only meant to
be the smallest easily recognizable unit used in the text.
Morphological, syntactical and semantical units can be assembled and
referenced on higher level layers as \emph{terms} composed of one or
more not necessarily adjacent \emph{words}.


\subsection{Information layers}
\label{sec:information-layers}

The text tools operate according to the ``standoff
principle'' of XML markup. The basic text is marked up only to provide
the basis of raw data and reference. Additional syntactical and
semantical information -- be it automatically generated or scholarly
edited -- is provided in separate informational layers of
\emph{word lists} referencing other layers or the basic text.

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{word-termlists}
  \caption{Relation of basic text and term lists. }
  \label{fig:word-termlists}
\end{figure}

A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
  \emph{term list} will be used interchangeably in the following text
  since both forms should be functionally identical.} is a list of
\emph{words} or \emph{terms} that are each linked to a list of
references to \emph{words} or \emph{terms} in other \emph{word lists}
or to \emph{words} in basic texts.

An example of the informational layers in an English or Latin
text\footnote{English or Latin as examples of languages where
  sufficient morphological analysis can be based on single words.}
would be:

\begin{enumerate}
\item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}
  
\item \emph{Basic word list} layer, an automatically generated list of all
  unique words and references to their occurrence in the basic text
  (\ref{item:1}).\label{item:2}
    
\item \emph{Morphological term list} layer, an automatically generated list
  of the morphologically normalized forms of all words and references
  to their occurrence in the basic wordlist (\ref{item:2}).\label{item:4}

\item Scholarly edited \emph{term list} layer, a manually edited list of
  semantical units like technical terms used in the document,
  referring to the basic text (\ref{item:1}).\label{item:5}
\end{enumerate}

Additional annotation layers referencing the basic text or any other
layer could be produced and stored in the same text repository or on any
other server. Therefore it has to be possible to reference any layer in
a unique and stable way across the net.

In languages with more complex morphological units the morphological
analysis layer can be based on an intermediate term layer that joins
basic words into morphological units.
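
As an illustration, such an intermediate term layer might join the two
parts of a German separable verb. The following sketch assumes a
hypothetical basic text \texttt{beispiel} containing the sentence ``Er
steht um sechs Uhr auf'' with words numbered 1 to 6. The document name
and the numbering are invented for this example, and the markup follows
the term list format proposed in section~\ref{sec:term-list}.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <term>steht auf</term>
      <term-ref>
        <word-ref>xlink:beispiel#2</word-ref>
        <word-ref>xlink:beispiel#6</word-ref>
      </term-ref>
    </list-entry>
  </list>
\end{verbatim}

A morphological analyzer could then normalize the joined unit to
\emph{aufstehen} instead of treating \emph{steht} and \emph{auf}
separately.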


\subsection{Primary and secondary source texts}
\label{sec:backr-orig-source}

The text tool system should be easily adaptable to different
workflows dealing with text in the ECHO domain. There are two
basic types of text sources with different degrees of integration into
the central ECHO text corpus.
 
%% FIXME!!

The \emph{primary source text} is maintained in the basic word tagged
form on a text corpus server. Updates and changes have to be worked
into the word tagged text without breaking the referential integrity.

For a \emph{secondary source text}, the basic word tagged text is not
the primary source. A mapping file has to be provided
that maps the words in the basic text to other referenceable units in
the primary source documents. Updates and changes in the primary
document may be followed by updates to the mapping file or the basic
text to maintain referential integrity.

The distinction between these types of sources mainly concerns the
text cruncher, which produces the basic tagged text and possibly a
mapping file, and the presentation tools, which produce views of or
references to the original source texts.


\subsection{Support of additional markup}
\label{sec:supp-addit-mark}

The basic text tagging format should be transparent to additional
markup in the source text to enable the easy integration of the text
tools into existing formats and tools. The use of XML namespaces can
provide such transparency.
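
A minimal sketch of this approach is shown below. The prefix
\texttt{echo}, the namespace URI and the surrounding \texttt{p} and
\texttt{emph} elements are assumptions made for this example and not
part of any agreed format.

\begin{verbatim}
  <p xmlns:echo="http://example.org/echo/text">
    <echo:word id="4">divisa</echo:word>
    <echo:word id="5">in</echo:word>
    <emph>
      <echo:word id="6">partes</echo:word>
      <echo:word id="7">tres</echo:word>
    </emph>.
  </p>
\end{verbatim}

Tools that know only the source format could treat the elements in the
foreign namespace as transparent wrappers, while the text tools would
read only the \texttt{echo:word} elements.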

The common viewing environment cannot be completely
agnostic to additional markup. It must be able to interpret a common
set of minimal visual markup. Visual elements to be considered are:

\begin{itemize}
\item paragraphs and/or line breaks

\item page breaks

\item page images (coupled to page breaks)

\item inline images
\end{itemize}

When presenting text parts to the user as results of a search request,
it would be useful to have a general mechanism to select larger units
around the referenced word. Additional semantical units suitable for
this kind of reference would be sentences. The mechanism could try to
select the surrounding sentence and then fall back to larger units
like a paragraph, a page or the whole text.

A translation scheme to map different existing visual markup tags into
the common set for the viewing environment should be implemented. The
translation could be done directly upon creation of secondary source
texts as these texts are decoupled from the original source text.
The translation would have to be done on-the-fly for primary source
texts where markup different from the common set is used.


\section{Tools}
\label{sec:tools}


\subsection{Text cruncher}
\label{sec:text-cruncher}

The \emph{text cruncher} tool takes a text file and, where available,
information about a primary source and produces a \emph{basic word
  tagged text}, a \emph{basic word list}, and, if the text is to be
considered a secondary source text, a \emph{mapping file}.


\subsection{Morphological analyzer}
\label{sec:morph-analys}

The \emph{morphological analyzer} tool for a given language takes a
word list or a term list of morphological units and
produces a \emph{morphological term list} of normalized forms, their
morphological description, and references to their occurrences in the
provided list.

A subfunction of the morphological analyzer should be a normalizer for
single words to be used in conjunction with the dictionary tool.
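
The output of such an analyzer might look like the following sketch,
which builds on the two entries of the basic wordlist example in
section~\ref{sec:wordlist}. The \texttt{morph} element, the wording of
the morphological descriptions and the list name
\texttt{bello\_gallico\_words} are assumptions made for this example
only.

\begin{verbatim}
  <list id="2">
    <list-entry id="1">
      <term>patria</term>
      <morph>noun, feminine, first declension</morph>
      <word-ref>xlink:bello_gallico_words#1</word-ref>
    </list-entry>
    <list-entry id="2">
      <term>bellum</term>
      <morph>noun, neuter, second declension</morph>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
    </list-entry>
  </list>
\end{verbatim}

Here each reference points to an entry of the basic wordlist rather
than to a word in the basic text, so the occurrences in the text are
reached through one level of indirection.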


\subsection{Dictionary}
\label{sec:dictionary}

The \emph{dictionary analyzer} tool takes a morphologically normalized
term list and produces a term list with known terms,
references to their definitions and references to their occurrences in
the provided list.

A subfunction of the dictionary analyzer should be a lookup tool for
single normalized words or terms.
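
A possible shape of the resulting list is sketched below. The
\texttt{def-ref} element, the dictionary name \texttt{dictionary} and
the list name \texttt{bello\_gallico\_morph} are hypothetical and serve
only to show how a known term could carry both kinds of references.

\begin{verbatim}
  <list id="3">
    <list-entry id="1">
      <term>bellum</term>
      <def-ref>xlink:dictionary#bellum</def-ref>
      <word-ref>xlink:bello_gallico_morph#2</word-ref>
    </list-entry>
  </list>
\end{verbatim}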


\subsection{Cross referencer}
\label{sec:cross-referencer}

The \emph{cross referencer} tool takes a word list from one text
and a set of word lists from other texts and
produces a word list with words from the first list and
references into all of the lists.
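
As a sketch, a cross reference list for two texts might contain entries
like the following. The list names \texttt{bello\_gallico\_words} and
\texttt{bello\_civili\_words} and the entry number 17 are invented for
this example.

\begin{verbatim}
  <list id="4">
    <list-entry id="1">
      <word>bello</word>
      <word-ref>xlink:bello_gallico_words#2</word-ref>
      <word-ref>xlink:bello_civili_words#17</word-ref>
    </list-entry>
  </list>
\end{verbatim}

Only words that occur in the first list appear as entries, and each
entry collects the references into every list in which the word was
found.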


\subsection{Display environment}
\label{sec:display-environment}

The \emph{display environment} should be able to display a text with
minimal visual markup and additional links defined by additional
wordlists. 

The set of necessary visual markup like page breaks, page images,
inline images or text formatting should follow an agreed standard.

The functionality provided by the links could be direct linking into
other texts, morphological analyses, or dictionary entries if the word
is only referenced by one word list. In the case of multiple
references to a word a mechanism for the selection of one of the
possible sources must be provided.


\subsection{List inverter}
\label{sec:list-inverter}

The \emph{list inverter} is a small auxiliary tool that takes a
normal word list that is ordered by unique words and produces an
\emph{inverted word list} that is ordered by word references.
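
Applied to the basic wordlist example in section~\ref{sec:wordlist},
the inverted list might begin as sketched below. Placing the reference
before the word inside each entry is an assumption of this sketch, and
the format is otherwise that of the basic wordlist.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="2">
      <word-ref>xlink:bello_gallico#157</word-ref>
      <word>patria</word>
    </list-entry>
    <list-entry id="3">
      <word-ref>xlink:bello_gallico#189</word-ref>
      <word>bello</word>
    </list-entry>
    <list-entry id="4">
      <word-ref>xlink:bello_gallico#236</word-ref>
      <word>bello</word>
    </list-entry>
    <!-- further entries omitted -->
  </list>
\end{verbatim}

Each entry now carries exactly one reference, so the same word appears
once per occurrence, which corresponds to the word instance list
described in the glossary.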


\section{Use cases}
\label{sec:use-cases}


\subsection{Integration of Archimedes XML texts}
\label{sec:integr-arch-xml}

The XML texts of the Archimedes project could be integrated in two
different ways: either as primary source texts, adding basic word
tagging to the Archimedes markup, or as secondary source texts by
providing mapping files to the unchanged source files.

In the first case basic word tagging would be added to the XML
document by the text cruncher. The resulting documents could then be
further processed and edited, provided that word references are not
broken. The text cruncher would produce a basic word list for use with
other text tools.

In the second case only a secondary source text and a mapping file
would be produced by the text cruncher together with the basic word
list. The original source text would stay unchanged outside the text
repository.

Additional mappings would have to be generated to adapt the visual
markup used in the Archimedes XML to the common markup for the display
environment.


\subsection{Integration of existing webpages}
\label{sec:integr-exist-webp}


\subsection{Integration of raw OCR text}
\label{sec:integration-raw-ocr}

Raw OCR text as it is generated by automatic OCR on digitized document
pages could be considered original source material. The OCR produces
one plain text document per scanned image file. A suitable text
cruncher would produce a secondary source text for use in the
repository with a mapping file referencing the original text files.
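
A mapping file for this case might look like the following sketch. The
document name \texttt{ocr\_text}, the file names and all numbers are
hypothetical, and the sketch assumes that the number in parentheses is
a character offset into the referenced file, which is only one possible
reading of the notation used in section~\ref{sec:prim-source-mapp}.

\begin{verbatim}
  <source-mapping>
    <map id="1">
      <word-ref>xlink:ocr_text#1</word-ref>
      <ref>xlink:page0001.txt(0)</ref>
    </map>
    <map id="2">
      <word-ref>xlink:ocr_text#2</word-ref>
      <ref>xlink:page0001.txt(7)</ref>
    </map>
    <map id="3">
      <word-ref>xlink:ocr_text#250</word-ref>
      <ref>xlink:page0002.txt(0)</ref>
    </map>
  </source-mapping>
\end{verbatim}

Because each page is a separate file, the mapping also records where
one page ends and the next begins, which the display environment could
use to couple words to page images.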


\subsection{Full text search}
\label{sec:full-text-search}

(to be done)


\subsection{Cross linking of texts}
\label{sec:cross-linking-texts}

(to be done)


\section{Proposed formats}
\label{sec:proposed-formats}


\subsection{Basic document}
\label{sec:basic-docum-form}

The basic document format consists of word tags and, optionally,
language information for morphological analysis and basic visual
markup.

An example in pseudo XML markup might look like this:

\begin{verbatim}
  <text lang="lat">
    <word id="1">Gallia</word>
    <word id="2">est</word>
    <word id="3">omnis</word>
    <word id="4">divisa</word>
    <word id="5">in</word>
    <word id="6">partes</word>
    <word id="7">tres</word>.
  </text>
\end{verbatim}


\subsection{Basic wordlist}
\label{sec:wordlist}

The basic wordlist consists of all unique words and references to
their occurrences in the basic text.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <word>patria</word>
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word-ref>xlink:bello_gallico#157</word-ref>
      <word-ref>xlink:bello_gallico#336</word-ref>
    </list-entry>
    <list-entry id="2">
      <word>bello</word>
      <word-ref>xlink:bello_gallico#189</word-ref>
      <word-ref>xlink:bello_gallico#236</word-ref>
      <word-ref>xlink:bello_gallico#557</word-ref>
      <word-ref>xlink:bello_gallico#1396</word-ref>
      <word-ref>xlink:bello_gallico#1450</word-ref>
    </list-entry>
  </list>
\end{verbatim}


\subsection{Term list}
\label{sec:term-list}

A term groups one or more words into a semantical unit. A term list
contains chosen terms and references to their occurrences.

\begin{verbatim}
  <list id="1">
    <list-entry id="1">
      <term>patria nostra</term>
      <term-ref>
        <word-ref>xlink:bello_gallico#36</word-ref>
        <word-ref>xlink:bello_gallico#37</word-ref>
      </term-ref>
      <term-ref>
        <word-ref>xlink:bello_gallico#155</word-ref>
        <word-ref>xlink:bello_gallico#157</word-ref>
      </term-ref>
    </list-entry>
    <list-entry id="2">
      <term>bello gallico</term>
      <term-ref>
        <word-ref>xlink:bello_gallico#12</word-ref>
        <word-ref>xlink:bello_gallico#13</word-ref>
      </term-ref>
    </list-entry>
  </list>
\end{verbatim}


\subsection{Primary source mapping}
\label{sec:prim-source-mapp}

A primary source mapping maps every word of a basic document to its
equivalent in the primary source document.

\begin{verbatim}
  <source-mapping>
    <map id="1">
      <word-ref>xlink:bello_gallico#1</word-ref>
      <ref>xlink:bello.txt(1235)</ref>
    </map>
    <map id="2">
      <word-ref>xlink:bello_gallico#2</word-ref>
      <ref>xlink:bello.txt(1245)</ref>
    </map>
    <map id="3">
      <word-ref>xlink:bello_gallico#3</word-ref>
      <ref>xlink:bello.txt(1257)</ref>
    </map>
  </source-mapping>
\end{verbatim}


\section{Development priorities and time plan}
\label{sec:devel-prior-time}

(to be done)

\section{Glossary}
\label{sec:glossary}

\begin{description}
\item[word] In a basic text a word is any sequence of characters
  delimited by whitespace or other delimiters. A word on this
  level is not a semantical or even a syntactical unit.

\item[term] A term is a container for one or more not necessarily
  adjacent words. Terms can be syntactical or semantical units. Terms
  can be used and referenced like basic words.
  
\item[word reference] A word reference is an xlink or similar
  reference to a word or term in a word list or in a basic text.
  
\item[term reference] A term reference is a reference to a term and
  equivalent to a word reference.

\item[word list] A word list is a list containing elements consisting
  of a word and a list of word references.

\item[term list] A term list is equivalent to a word list. Its
  elements consist of a term and a list of word references.
  
\item[word occurrence list] A word occurrence list is a list where
  every element is treated like a type and a list of all its instances
  -- occurrences -- in the text. The same word (type) can occur only
  once in an occurrence list where it can reference many word instances.
  
\item[word instance list] A word instance list is a word list where
  every element is treated like a singular object (unlike a word
  occurrence list). The same word (type) can occur multiple times in an
  instance list where it can reference only one word or term instance.

\end{description}


\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: