% Annotation of texttool-concept/texttools.tex, revision 1.1.1.1 (1.1, dwinter)
\documentclass[a4paper]{article}

\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
\usepackage{url}
\usepackage{graphicx}

\graphicspath{{graphics/}}

\title{Draft: Proposal for a text tool architecture for ECHO}
\author{Robert Casties}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}
In the context of ECHO, ``text'' represents scholarly metadata as well
as full texts of sources. As such, text forms the glue between the
different objects in the ECHO corpus. To fully exploit the potential
of text for semantic access and interlinking, tools have to support
the automatic or manual generation of links between different objects
within the ECHO corpus.

A viewing environment should present configurable views on all texts
that allow users to exploit relations to other texts and objects.

Tools have to be developed for four different fields:
\begin{itemize}
\item the generation of XML structures
\item the analysis of the corpora
\item the meaningful linking of texts
\item the generation of scholarly metadata.
\end{itemize}


\section{Requirements}
\label{sec:requirements}
The handling of large corpora makes it necessary to define a minimal
standard XML structure for these documents. This implies the
development of tools to convert existing document formats into
this standard format. In addition, tools for editing documents in
this format will have to be made available.

A prerequisite for generating links between documents is the
ability to analyse texts and to add the results of this analysis
to the document. In general, two types of analysis can
be distinguished: automatic analysis following defined
rules, and manual analysis by marking words depending on the
context.

The analysis of corpora is the basis for automatically generated
linking of documents. For example, the word lists generated by
morphological analysis can serve as a starting point for linking to a
dictionary or a grammar. Another example would be the use of a
word list of technical terms as a basis for linking to
an encyclopedia or glossary. Furthermore, such word lists can serve as
starting points for cross-linking within the text corpus, using these
word lists as a common anchor.

Beyond the automatically generated linking of documents, linking
as a result of scholarly work has to be supported by text tools,
e.g.\ showing connections between different texts in the corpus,
combining sources with translations or secondary texts, linking
images to the texts describing them, or connecting full texts with
images.

In particular, an open environment for adding comments and notes to
sources can be a test bed for how collaborative work on sources could
be encouraged by the ECHO project in order to build a virtual European
research area on cultural heritage.


\section{Technical issues}
\label{sec:formats}


\subsection{Granularity of reference}
\label{sec:gran-refer}

The basic layer of informational markup has to define the units of
reference for the higher layers. The granularity of these reference
units determines the amount of complexity needed for referencing in
the higher layers. The markup in the basic layer should also permit
changes in formatting and corrections in the source document to a
certain extent without losing referential integrity.

The proposed unit of reference in the basic layer is a \emph{word},
where \emph{word} means any sequence of characters between whitespace
or other special characters in the source document, excluding
formatting and markup. The word as a unit of reference is not meant to
be a semantical unit or even a morphological unit. It is only meant to
be the smallest easily recognizable unit used in the text.
Morphological, syntactical and semantical units can be assembled and
referenced on higher-level layers as \emph{terms} composed of one or
more not necessarily adjacent \emph{words}.


\subsection{Information layers}
\label{sec:information-layers}

The text tools operate according to the ``standoff
principle'' of XML markup. The basic text is marked up only to provide
the basis of raw data and reference. Additional syntactical and
semantical information -- be it automatically generated or scholarly
edited -- is provided in separate informational layers of
\emph{word lists} referencing other layers or the basic text.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{word-termlists}
\caption{Relation between basic text and term lists.}
\label{fig:word-termlists}
\end{figure}

A \emph{word list} or \emph{term list}\footnote{\emph{word list} and
\emph{term list} will be used interchangeably in the following text
since both forms should be functionally identical.} is a list of
\emph{words} or \emph{terms} that are each linked to a list of
references to \emph{words} or \emph{terms} in other \emph{word lists}
or to \emph{words} in basic texts.

An example of the informational layers in an English or Latin
text\footnote{English and Latin serve as examples of languages where
sufficient morphological analysis can be based on single words.}
would be:

\begin{enumerate}
\item \emph{Basic text} layer, marked up with \emph{words}.\label{item:1}

\item \emph{Basic word list} layer, an automatically generated list of all
unique words and references to their occurrences in the basic text
(\ref{item:1}).\label{item:2}

\item \emph{Morphological term list} layer, an automatically generated list
of the morphologically normalized forms of all words and references
to their occurrences in the basic word list (\ref{item:2}).\label{item:4}

\item Scholarly edited \emph{term list} layer, a manually edited list of
semantical units like technical terms used in the document,
referring to the basic text (\ref{item:1}).\label{item:5}
\end{enumerate}

Additional annotation layers referencing the basic text or any other
layer could be produced and stored in the same text repository or on any
other server. Therefore it has to be possible to reference any layer in
a unique and stable way across the net.

In languages with more complex morphological units, the morphological
analysis layer can be based on an intermediate term layer that joins
basic words into morphological units.



\subsection{Primary and secondary source texts}
\label{sec:backr-orig-source}

The text tool system should be easily adaptable to different
workflows dealing with text in the ECHO domain. There are two
basic types of text sources with different degrees of integration into
the central ECHO text corpus.

%% FIXME!!

The \emph{primary source text} is maintained in the basic word-tagged
form on a text corpus server. Updates and changes have to be worked
into the word-tagged text without breaking the referential integrity.

For a \emph{secondary source text}, the basic word-tagged text is not
the primary source. A mapping file has to be provided
that maps the words in the basic text to other referenceable units in
the primary source documents. Updates and changes in the primary
document may have to be followed by updates to the mapping file or the
basic text to maintain referential integrity.

The distinction between these types of sources mainly concerns the
text cruncher, which produces the basic tagged text and possibly a
mapping file, and the presentation tools, which produce views of or
references to the original source texts.



\subsection{Support of additional markup}
\label{sec:supp-addit-mark}

The basic text tagging format should be transparent to additional
markup in the source text to enable the easy integration of the text
tools into existing formats and tools. The use of XML namespaces can
provide such transparency.

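As a minimal sketch of this idea, the word tags could live in a
namespace of their own while the existing markup stays untouched. The
prefix \texttt{tt}, the namespace URI and the surrounding elements
\texttt{chapter}, \texttt{p} and \texttt{emph} are assumptions made for
illustration only and not part of an agreed format. A tool that only
knows the source markup could then ignore the \texttt{tt:word}
elements, while the text tools would only look at elements in their
own namespace.

\begin{verbatim}
<chapter xmlns:tt="http://example.org/texttools">
  <p>
    <tt:word id="1">Gallia</tt:word>
    <tt:word id="2">est</tt:word>
    <emph><tt:word id="3">omnis</tt:word></emph>
    <tt:word id="4">divisa</tt:word>
  </p>
</chapter>
\end{verbatim}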
The common viewing environment cannot be completely
agnostic to additional markup. It must be able to interpret a common
set of minimal visual markup. Visual elements to be considered are:

\begin{itemize}
\item paragraphs and/or line breaks

\item page breaks

\item page images (coupled to page breaks)

\item inline images
\end{itemize}

When presenting text parts to the user as results of a search request,
it would be useful to have a general mechanism to select larger units
around the referenced word. Sentences would be an additional semantical
unit suitable for this kind of reference. The mechanism could try to
select the surrounding sentence and then fall back to larger units
like a paragraph, a page or the whole text.

A translation scheme to map different existing visual markup tags into
the common set for the viewing environment should be implemented. The
translation could be done directly upon creation of secondary source
texts as these texts are decoupled from the original source text.
The translation would have to be done on the fly for primary source
texts where markup different from the common set is used.


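Such a translation scheme could be expressed as a simple mapping table.
The following pseudo XML is only a sketch: the element
\texttt{markup-mapping}, the attributes \texttt{from} and \texttt{to},
the source tag names and the names chosen for the common set are all
hypothetical.

\begin{verbatim}
<markup-mapping>
  <map from="p" to="paragraph"/>
  <map from="lb" to="line-break"/>
  <map from="pb" to="page-break"/>
  <map from="figure" to="inline-image"/>
</markup-mapping>
\end{verbatim}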
\section{Tools}
\label{sec:tools}


\subsection{Text cruncher}
\label{sec:text-cruncher}

The \emph{text cruncher} tool takes a text file and, optionally,
information about a primary source and produces a \emph{basic
word-tagged text}, a \emph{basic word list}, and, if the text is to be
considered a secondary source text, a \emph{mapping file}.


\subsection{Morphological analyzer}
\label{sec:morph-analys}

The \emph{morphological analyzer} tool for a given language takes a
word list or a term list of morphological units and
produces a \emph{morphological term list} of normalized forms, their
morphological description, and references to their occurrences in the
provided list.

A subfunction of the morphological analyzer should be a normalizer for
single words to be used in conjunction with the dictionary tool.


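As a sketch in the pseudo XML used in
section~\ref{sec:proposed-formats}, the output of the analyzer for the
basic word list shown in section~\ref{sec:wordlist} might look like
this. The \texttt{morph} element, the wording of the morphological
description and the document name \texttt{bello\_gallico\_words} for
the basic word list are assumptions made for illustration only.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <term>patria</term>
    <morph>noun, feminine</morph>
    <word-ref>xlink:bello_gallico_words#1</word-ref>
  </list-entry>
  <list-entry id="2">
    <term>bellum</term>
    <morph>noun, neuter</morph>
    <word-ref>xlink:bello_gallico_words#2</word-ref>
  </list-entry>
</list>
\end{verbatim}

Each reference points to an entry of the basic word list, not to the
basic text, so the occurrences in the text are reached through the
lower layer.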
\subsection{Dictionary}
\label{sec:dictionary}

The \emph{dictionary analyzer} tool takes a morphologically normalized
term list and produces a term list of known terms with
references to their definitions and references to their occurrences in
the provided list.

A subfunction of the dictionary analyzer should be a lookup tool for
single normalized words or terms.


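A possible shape of the resulting term list is sketched below. The
element \texttt{definition-ref}, the target \texttt{dictionary\#bellum}
and the name \texttt{bello\_gallico\_morph} for the morphological term
list are hypothetical and only serve to show the two kinds of
references.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <term>bellum</term>
    <definition-ref>xlink:dictionary#bellum</definition-ref>
    <word-ref>xlink:bello_gallico_morph#2</word-ref>
  </list-entry>
</list>
\end{verbatim}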
\subsection{Cross referencer}
\label{sec:cross-referencer}

The \emph{cross referencer} tool takes a word list from one text
and a set of word lists from other texts and
produces a word list with words from the first list and
references into all of the lists.


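The result could be sketched as follows. The list names
\texttt{bello\_gallico\_words} and \texttt{other\_text\_words} and the
entry number in the second list are invented for illustration.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <word>patria</word>
    <word-ref>xlink:bello_gallico_words#1</word-ref>
    <word-ref>xlink:other_text_words#17</word-ref>
  </list-entry>
</list>
\end{verbatim}

In this sketch the shared word acts as the common anchor: both
references lead to the entry for the same word in the word list of the
respective text.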
\subsection{Display environment}
\label{sec:display-environment}

The \emph{display environment} should be able to display a text with
minimal visual markup and additional links defined by additional
word lists.

The set of necessary visual markup like page breaks, page images,
inline images or text formatting should follow an agreed standard.

The functionality provided by the links could be direct linking into
other texts, morphological analyses, or dictionary entries if the word
is only referenced by one word list. In the case of multiple
references to a word, a mechanism for the selection of one of the
possible sources must be provided.


\subsection{List inverter}
\label{sec:list-inverter}

The \emph{list inverter} is a small auxiliary tool that takes a
normal word list that is ordered by unique words and produces an
\emph{inverted word list} that is ordered by word references.




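Applied to the basic word list shown in section~\ref{sec:wordlist}, the
first four entries of an inverted word list might look like this. The
layout of the entries, with the reference placed before the word, is an
assumption and not an agreed format.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <word-ref>xlink:bello_gallico#36</word-ref>
    <word>patria</word>
  </list-entry>
  <list-entry id="2">
    <word-ref>xlink:bello_gallico#157</word-ref>
    <word>patria</word>
  </list-entry>
  <list-entry id="3">
    <word-ref>xlink:bello_gallico#189</word-ref>
    <word>bello</word>
  </list-entry>
  <list-entry id="4">
    <word-ref>xlink:bello_gallico#236</word-ref>
    <word>bello</word>
  </list-entry>
</list>
\end{verbatim}

In the terms of the glossary, the input is a word occurrence list and
the output is a word instance list: the same word appears once per
referenced occurrence.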
\section{Use cases}
\label{sec:use-cases}


\subsection{Integration of Archimedes XML texts}
\label{sec:integr-arch-xml}

The XML texts of the Archimedes project could be integrated in two
different ways: either as primary source texts, adding basic word
tagging to the Archimedes markup, or as secondary source texts, by
providing mapping files to the unchanged source files.

In the first case, basic word tagging would be added to the XML
document by the text cruncher. The resulting documents could then be
further processed and edited, provided that word references are not
broken. The text cruncher would produce a basic word list for use with
other text tools.

In the second case, only a secondary source text and a mapping file
would be produced by the text cruncher together with the basic word
list. The original source text would stay unchanged outside the text
repository.

Additional mappings would have to be generated to adapt the visual
markup used in the Archimedes XML to the common markup for the display
environment.



\subsection{Integration of existing webpages}
\label{sec:integr-exist-webp}



\subsection{Integration of raw OCR text}
\label{sec:integration-raw-ocr}

Raw OCR text, as generated by automatic OCR on digitized document
pages, could be considered original source material. The OCR produces
one plain text document per scanned image file. A suitable text
cruncher would produce a secondary source text for use in the
repository with a mapping file referencing the original text files.



\subsection{Full text search}
\label{sec:full-text-search}

(to be done)


\subsection{Cross linking of texts}
\label{sec:cross-linking-texts}

(to be done)


\section{Proposed formats}
\label{sec:proposed-formats}


\subsection{Basic document}
\label{sec:basic-docum-form}

The basic document format consists of word tags and, optionally,
language information for morphological analysis and basic visual
markup.

An example in pseudo XML markup might look like this:

\begin{verbatim}
<text lang="lat">
  <word id="1">gallia</word>
  <word id="2">est</word>
  <word id="3">omnis</word>
  <word id="4">divisa</word>
  <word id="5">in</word>
  <word id="6">partes</word>
  <word id="7">tres</word>.
</text>
\end{verbatim}



\subsection{Basic word list}
\label{sec:wordlist}

The basic word list consists of all unique words and references to
their occurrences in the basic text.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <word>patria</word>
    <word-ref>xlink:bello_gallico#36</word-ref>
    <word-ref>xlink:bello_gallico#157</word-ref>
    <word-ref>xlink:bello_gallico#336</word-ref>
  </list-entry>
  <list-entry id="2">
    <word>bello</word>
    <word-ref>xlink:bello_gallico#189</word-ref>
    <word-ref>xlink:bello_gallico#236</word-ref>
    <word-ref>xlink:bello_gallico#557</word-ref>
    <word-ref>xlink:bello_gallico#1396</word-ref>
    <word-ref>xlink:bello_gallico#1450</word-ref>
  </list-entry>
</list>
\end{verbatim}


\subsection{Term list}
\label{sec:term-list}

A term groups one or more words into a semantical unit. A term list
contains chosen terms and references to their occurrences.

\begin{verbatim}
<list id="1">
  <list-entry id="1">
    <term>patria nostra</term>
    <term-ref>
      <word-ref>xlink:bello_gallico#36</word-ref>
      <word-ref>xlink:bello_gallico#37</word-ref>
    </term-ref>
    <word-ref>xlink:bello_gallico#36</word-ref>
    <term-ref>
      <word-ref>xlink:bello_gallico#155</word-ref>
      <word-ref>xlink:bello_gallico#157</word-ref>
    </term-ref>
  </list-entry>
  <list-entry id="2">
    <term>bello gallico</term>
    <term-ref>
      <word-ref>xlink:bello_gallico#12</word-ref>
      <word-ref>xlink:bello_gallico#13</word-ref>
    </term-ref>
  </list-entry>
</list>
\end{verbatim}


\subsection{Primary source mapping}
\label{sec:prim-source-mapp}

A primary source mapping maps every word of a basic document to its
equivalent in the primary source document.

\begin{verbatim}
<source-mapping>
  <map id="1">
    <word-ref>xlink:bello_gallico#1</word-ref>
    <ref>xlink:bello.txt(1235)</ref>
  </map>
  <map id="2">
    <word-ref>xlink:bello_gallico#2</word-ref>
    <ref>xlink:bello.txt(1245)</ref>
  </map>
  <map id="3">
    <word-ref>xlink:bello_gallico#3</word-ref>
    <ref>xlink:bello.txt(1257)</ref>
  </map>
</source-mapping>
\end{verbatim}



\section{Development priorities and time plan}
\label{sec:devel-prior-time}

(to be done)

\section{Glossary}
\label{sec:glossary}

\begin{description}
\item[word] In a basic text, a word is any sequence of characters
between whitespace or other delimiters. A word on this
level is not a semantical or even a syntactical unit.

\item[term] A term is a container for one or more not necessarily
adjacent words. Terms can be syntactical or semantical units. Terms
can be used and referenced like basic words.

\item[word reference] A word reference is an xlink or similar
reference to a word or term in a word list or in a basic text.

\item[term reference] A term reference is a reference to a term and
equivalent to a word reference.

\item[word list] A word list is a list containing elements consisting
of a word and a list of word references.

\item[term list] A term list is equivalent to a word list. Its
elements consist of a term and a list of word references.

\item[word occurrence list] A word occurrence list is a word list where
every element is treated like a type with a list of all its instances
-- occurrences -- in the text. The same word (type) can occur only
once in an occurrence list, where it can reference many word instances.

\item[word instance list] A word instance list is a word list where
every element is treated like a singular object (unlike a word
occurrence list). The same word (type) can occur multiple times in an
instance list, where each element can reference only one word or term
instance.

\end{description}


\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: