Version 2 (modified by jwillenborg, 13 years ago)

--

Project goals reached (Content-based Web Access)

Requirements (extracted from project description)

  • (1) Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
  • (2) The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
  • (3) Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed because the XML between two page break milestones is usually not a well-formed XML fragment without further processing.
  • (4) The system is designed to support multiple XML vocabularies, such as the TEI document type or the ECHO document type, each of which will require only minimal configuration information.
  • (5) Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
    • Rendering. Rendering of the balanced XML fragment will be performed with XSLT on the server side, yielding XHTML for the client. XSLT will be readily pluggable, allowing for multiple output options.
    • Enrichment. The generated XHTML will be enriched with inline images and links to external resources (e.g. Pollux dictionaries via lemmatization provided by Donatus, or geospatial data). At this stage, transliteration of various sorts is also possible: should a Greek text be displayed in a Romanization or in Greek characters? Should an Arabic text be displayed fully voweled, in its typical rendition, or in Romanization? Should a Sanskrit text be displayed in Devanagari, or Tamil, or Romanization, or IPA? Should a Chinese text be displayed in traditional characters, simplified characters, or pinyin? This is also the layer at which named entity resolution is most appropriately realized.
    • Generation of a synthetic view. The XHTML view will be synchronized and presented in coordination with the appropriate digital image, provided by Digilib. In addition to the basic display environment, a language-sensitive indexing tool needs to be constructed. Such a tool will allow searching a particular text, a corpus, an arbitrarily selected group of texts/corpora, or all texts for one or more natural language words. The search functionality will be developed using an open-source tool (e.g. Lucene) in combination with the NLP technology hosted by Donatus. Thus, for instance, it will be possible to search for all inflected forms of a Latin verb (or only a subset of those forms).
  • (6) There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop their own approach to accessing a given text.
  • (7) A further component of this project is to extend the Arboreal browser so that it can make use of the MPDL repository for both reading and writing. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
  • (8) It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
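
The page extraction described in requirement (3) can be illustrated with a minimal sketch. It is not the project's actual subservice: the function name `balanced_page`, the sample document, and the choice of `pb` as the milestone element name are assumptions made for illustration. The idea is to track which elements are open while scanning the fulltext, reopen them at the start of the requested page, and close whatever is still open at the next page break.

```python
import re

# Matches start, end, and empty-element tags. Comments, CDATA sections and
# processing instructions are deliberately ignored in this sketch.
TAG = re.compile(r"<(/?)([\w:.\-]+)([^<>]*?)(/?)>")

# Hypothetical sample document with three page break milestones.
SAMPLE = (
    '<text><pb n="1"/><p>Alpha <hi>beta <pb n="2"/>gamma</hi> delta</p>'
    '<p>eps<pb n="3"/>ilon</p></text>'
)


def balanced_page(xml, page, pb="pb"):
    """Return the content of the page-th page (1-based) as balanced XML.

    Assumes `xml` is well-formed and that page breaks are empty elements
    named `pb` (an assumption; other vocabularies may use other names).
    """
    stack = []     # (start tag text, element name) for every open element
    pieces = None  # collected output; None until the requested page starts
    seen = 0       # number of page breaks encountered so far
    pos = 0        # end offset of the previous tag
    for m in TAG.finditer(xml):
        closing, name, _attrs, empty = m.groups()
        if pieces is not None:
            pieces.append(xml[pos:m.start()])  # text preceding this tag
        pos = m.end()
        if name == pb and empty:
            seen += 1
            if seen == page:
                # Reopen every ancestor that is open at the page start.
                # The milestone itself is not copied into the output.
                pieces = [tag for tag, _ in stack]
                continue
            if seen == page + 1:
                break
        if pieces is not None:
            pieces.append(m.group(0))
        if closing:
            stack.pop()
        elif not empty:
            stack.append((m.group(0), name))
    if pieces is None:
        raise ValueError(f"page {page} not found")
    # Close whatever is still open at the end of the page, innermost first.
    pieces.extend(f"</{name}>" for _, name in reversed(stack))
    return "".join(pieces)
```

With the sample above, `balanced_page(SAMPLE, 2)` yields a fragment in which `<text>`, `<p>` and `<hi>` are reopened before "gamma" and the second paragraph is closed after "eps", so the result can be handed to an XML parser or an XSLT processor. A production service would more likely work on a parsed event stream than on a regular expression.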

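The language-sensitive search mentioned in requirement (5), where a query for one form of a word finds all of its inflected forms, can be sketched as an index keyed by lemma. This is only an illustration under stated assumptions: the small `FORM_TO_LEMMA` table is a hypothetical stand-in for the morphological analysis that Donatus would provide, and the in-memory dictionary stands in for a real search engine such as Lucene. None of the names below come from either system.

```python
from collections import defaultdict

# Hypothetical stand-in for a morphology service: inflected form -> lemma.
FORM_TO_LEMMA = {
    "amo": "amo", "amas": "amo", "amat": "amo", "amavit": "amo",
    "puella": "puella", "puellam": "puella",
}


def lemma_of(word):
    """Normalise a word and map it to its lemma (unknown words map to themselves)."""
    form = word.lower().strip(".,;:!?")
    return FORM_TO_LEMMA.get(form, form)


def build_index(texts):
    """Build an inverted index from lemma to the ids of texts containing it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for token in text.split():
            index[lemma_of(token)].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of all texts containing any form of `word`."""
    return sorted(index.get(lemma_of(word), set()))


# Hypothetical three-text corpus.
TEXTS = {
    "a": "Puer puellam amat.",
    "b": "Caesar venit.",
    "c": "Amavit et amas.",
}
INDEX = build_index(TEXTS)
```

Here `search(INDEX, "amo")` finds texts "a" and "c", because "amat", "amavit" and "amas" are all indexed under the same lemma. Restricting a search to a subset of forms would require storing the surface form alongside the lemma, which this sketch omits.
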
Progress reached

  • (1): fully reached
  • (2): fully reached
  • (3):