
Version 5 (modified by jwillenborg, 14 years ago)


Project goals achieved (Content-based Web Access)

Requirements (extracted from project description)

  • (1) Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
  • (2) The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
  • (3) Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed because the XML between two page break milestones is usually not a well-formed XML fragment without further processing.
  • (4) The system is designed to support multiple XML vocabularies, such as the TEI document type or the ECHO document type, each of which will require only minimal configuration information.
  • (5) Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
    • Rendering. Rendering of the balanced XML fragment will be performed with XSLT on the server side, yielding XHTML for the client. XSLT will be readily pluggable, allowing for multiple output options.
    • Enrichment. The generated XHTML will be enriched with: inline images, links to external resources (e.g. Pollux dictionaries via lemmatization provided by Donatus; geospatial data). At this stage, transliteration of various sorts is also possible (should a Greek text be displayed in a Romanization or in Greek characters? should an Arabic text be displayed fully voweled, in its typical rendition, or in Romanization? should a Sanskrit text be displayed in Devanagari, or Tamil, or Romanization, or IPA? a Chinese text in traditional characters, simplified characters, or pinyin?). This is also the layer at which named entity resolution is most appropriately realized.
    • Generation of a synthetic view. The XHTML view will be synchronized and presented in coordination with the appropriate digital image, provided by Digilib. In addition to the basic display environment, a language-sensitive indexing tool needs to be constructed. Such a tool will allow searching a particular text, a corpus, an arbitrarily selected group of texts/corpora, or all texts for one or more natural language words. The search functionality will be developed using an open-source tool (e.g. Lucene) in combination with the NLP technology hosted by Donatus. Thus, for instance, it will be possible to search for all inflected forms of a Latin verb (or only a subset of those forms).
  • (6) There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop an access approach to a given text.
  • (7) A further component of this project is to extend the Arboreal browser to be able to make use (both read/write) of the MPDL repository. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
  • (8) It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
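
The page extraction described in requirement (3) can be illustrated with a minimal Python sketch. It is not the project's implementation; the function names open_elements and balanced_page, the milestone name pb, and the sample document are assumptions made for illustration. The sketch records which elements are open at the first milestone, re-opens them in front of the raw text, and closes whatever is still open at the second milestone. It assumes a well-formed source without comments, CDATA sections, or ">" inside attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Start, end, and self-closing tags. Comments, CDATA, and ">" inside
# attribute values are not handled in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)([^<>]*?)(/?)>")


def open_elements(text, stack=()):
    """Return the (name, attributes) pairs still open after scanning text."""
    stack = list(stack)
    for closing, name, attrs, self_closing in TAG.findall(text):
        if self_closing:
            continue  # milestones such as <pb/> need no balancing
        if closing:
            stack.pop()  # relies on the source document being well-formed
        else:
            stack.append((name, attrs))
    return stack


def balanced_page(xml_text, page, milestone="pb"):
    """Return the text after the page-th milestone (1-based) as well-formed XML."""
    breaks = list(re.finditer(r"<%s\b[^<>]*/>" % re.escape(milestone), xml_text))
    start = breaks[page - 1].end()
    end = breaks[page].start() if page < len(breaks) else len(xml_text)
    raw = xml_text[start:end]
    before = open_elements(xml_text[:start])   # open when the page begins
    after = open_elements(raw, before)         # still open when the page ends
    prefix = "".join("<%s%s>" % (name, attrs) for name, attrs in before)
    suffix = "".join("</%s>" % name for name, _ in reversed(after))
    return prefix + raw + suffix


# Hypothetical sample: two page breaks, each inside a paragraph.
DOC = ('<text><p>one <pb n="1"/>two <s>three</s> four</p>'
       '<p>five <pb n="2"/>six</p></text>')
page_one = balanced_page(DOC, 1)
ET.fromstring(page_one)  # raises ParseError if the fragment were unbalanced
print(page_one)
```

In the sample, the raw text between the two milestones closes one paragraph and opens another, so it cannot be parsed on its own; the balanced version wraps it in the ancestors that were open at the first page break and closes the paragraph left open at the second.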

Progress achieved

  • (1): 100%
  • (2): 100%
  • (3): 100%
  • (4): document types "archimedes" and "echo" fully realized; document type TEI is already supported in the software design and is in preparation: 85%
  • (5)
    • Rendering: 100%
    • Enrichment: inline images, links to external resources (100%), transliteration of various sorts (70%), named entity resolution (??)
    • Generation of a synthetic view: 90%
  • (7) Arboreal functionality will be further developed in cooperation with the Institute of Computer Science, Artificial Intelligence at Erlangen-Nuremberg (Prof. Günter Görz) and the Software and Internet Technology group at Hochschule Deggendorf (Prof. Dr. Josef Schneeberger). We also propose to use XML editors such as Oxygen, Eclipse or Open Office. Functionality such as saving links to user-defined fulltext queries or annotations is currently being developed. Overall progress: 20% (?)
  • (8): not realized yet: 0%

Additional progress (not listed in requirements)

  • user document interface: create/update/delete documents of a collection
  • static views of documents (PDF, HTML)
  • development of open source software: eXist: new function for fast retrieval of a fragment between two XML elements
  • dynamic generation of a book's table of contents and list of figures (with links to the pages)
  • new type of user queries in documents: XPath/XQuery
  • web interfaces to the main functions of the system (page-fragment, doc-query, etc.)
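
As an illustration of XPath-style queries in documents and of a dynamically generated content and figure list, here is a minimal sketch using Python's standard library, which supports only a limited subset of XPath. It does not show the system's own query interface; the sample document, its element names (chapter, head, figure), and the helper name query are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are invented.
SAMPLE = """<text>
  <chapter n="1"><head>De motu</head><figure id="f1"/><p>Primus.</p></chapter>
  <chapter n="2"><head>De gravitate</head><p>Secundus.</p><figure id="f2"/></chapter>
</text>"""


def query(xml_text, path):
    """Evaluate a path from ElementTree's XPath subset against a document."""
    return ET.fromstring(xml_text).findall(path)


# A content list: all chapter headings in document order.
heads = [h.text for h in query(SAMPLE, ".//chapter/head")]

# A figure list restricted by an attribute predicate.
figures = [f.get("id") for f in query(SAMPLE, ".//chapter[@n='2']/figure")]

print(heads)
print(figures)
```

A full XPath or XQuery engine accepts far richer expressions than this subset; the sketch only shows the shape of a path-based query and how its results could feed a generated list.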