Project goals reached (Content-based Web Access)

Requirements (extracted from project description)

  • (1) system realized as a prototype
  • (2) Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
  • (3) The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
  • (4) Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed, since the XML between two page break milestones usually is not a well-formed XML fragment without further processing.
  • (5) The system is designed to support multiple XML vocabularies which will require minimal configuration information—such as the TEI document type or ECHO document type.
  • (6) Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
    • Rendering. Rendering of the balanced XML fragment will be performed with XSLT on the server side, yielding XHTML for the client. XSLT will be readily pluggable, allowing for multiple output options.
    • Enrichment. The generated XHTML will be enriched with: inline images, links to external resources (e.g. Pollux dictionaries via lemmatization provided by Donatus; geospatial data). At this stage, transliteration of various sorts is also possible (should a Greek text be displayed in a Romanization or in Greek characters? should an Arabic text be displayed fully voweled, in its typical rendition, or in Romanization? should a Sanskrit text be displayed in Devanagari, or Tamil, or Romanization, or IPA? a Chinese text in traditional characters, simplified characters, or pinyin?). This is also the layer at which named entity resolution is most appropriately realized.
    • Generation of a synthetic view. The XHTML view will be synchronized and presented in coordination with the appropriate digital image, provided by Digilib. In addition to the basic display environment, a language-sensitive indexing tool needs to be constructed. Such a tool will allow searching a particular text, a corpus, an arbitrarily selected group of texts/corpora, or all texts for one or more natural language words. The search functionality will be developed using an open-source tool (e.g. Lucene) in combination with the NLP technology hosted by Donatus. Thus, for instance, it will be possible to search for all inflected forms of a Latin verb (or only a subset of those forms).
  • (7) There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop an access approach to a given text.
  • (8) A further component of this project is to extend the Arboreal browser to be able to make use (both read/write) of the MPDL repository. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
  • (9) It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
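
The page extraction described in requirement (4) can be illustrated with a minimal sketch. It assumes that page breaks are empty `pb` milestone elements and uses Python's standard SAX parser; the class and function names are hypothetical and do not reflect the project's actual implementation in eXist. The idea is to re-open every element that is still open at the starting page break and to close every element that is still open at the next one, so that the result is well-formed.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class PageExtractor(xml.sax.ContentHandler):
    """Collects the content of one page and balances it (illustrative only)."""

    def __init__(self, page, pb_tag="pb"):
        super().__init__()
        self.page = page        # 1-based number of the page break that starts the page
        self.pb_tag = pb_tag    # assumed name of the page break milestone element
        self.stack = []         # currently open elements as (name, attributes)
        self.seen = 0           # number of page breaks seen so far
        self.inside = False     # True while we are within the requested page
        self.out = []

    def _open(self, name, attrs):
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.out.append(f"<{name}{attr_text}>")

    def startElement(self, name, attrs):
        if name == self.pb_tag:
            self.seen += 1
            if self.seen == self.page:
                # Page starts: re-open all ancestors that are open right now.
                self.inside = True
                for open_name, open_attrs in self.stack:
                    self._open(open_name, open_attrs)
            elif self.inside:
                # Next page break: close everything that is still open.
                for open_name, _ in reversed(self.stack):
                    self.out.append(f"</{open_name}>")
                self.inside = False
            return
        attr_dict = dict(attrs.items())
        self.stack.append((name, attr_dict))
        if self.inside:
            self._open(name, attr_dict)

    def endElement(self, name):
        if name == self.pb_tag:
            return
        self.stack.pop()
        if self.inside:
            self.out.append(f"</{name}>")

    def characters(self, content):
        if self.inside:
            self.out.append(escape(content))


def extract_page(xml_text, page):
    """Return a balanced XML fragment for the given page number."""
    handler = PageExtractor(page)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.out)


doc = '<text><p>one <pb n="1"/>two</p><p>three <pb n="2"/>four</p></text>'
print(extract_page(doc, 1))
```

In this toy document the first page begins inside one paragraph and ends inside the next, so the raw text between the two milestones is not well-formed on its own; the sketch wraps it in the re-opened `text` and `p` elements. A production service would additionally have to deal with namespaces, very large documents and pages that do not exist.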

Progress achieved

  • (1): 100%
  • (2): 100%
  • (3): 100%
  • (4): 100%
  • (5): document types "archimedes" and "echo" are implemented; document type TEI Lite is already provided for in the software design and is in preparation: 85%
  • (6) Overall progress: ca. 85%
    • Rendering: 100%
    • Enrichment: inline images, links to external resources (100%), transliteration of various sorts (70%), named entity resolution, e.g. explicit naming of persons/organizations or places: linking to the GIS system or to Wikipedia articles (ca. 20%)
    • Generation of a synthetic view: 90%
  • (8) Arboreal functionality will be further developed in cooperation with the Institute of Computer Science, Artificial Intelligence at Erlangen-Nuremberg (Prof. Günter Görz) and the group Software and Internet technology at Hochschule Deggendorf (Prof. Dr. Josef Schneeberger). We also propose to use XML editors such as Oxygen, Eclipse or Open Office. Functionality such as saving links to user-defined fulltext queries or annotations is currently being developed. Overall progress: 20% (?)
  • (9): not realized yet: 0%

Additional progress (not listed in requirements)

  • the system has already reached "product status" through its use by the ECHO system and by various web crawlers (50,000 to 1,500,000 accesses per day); the data is backed up frequently and a new mirror system is currently being set up
  • combined presentation of morphology, dictionary and knowledge information for words in different languages such as Classical Latin, Greek, Chinese, etc.; dynamic linking to external dictionaries and lexicons in different languages
  • user document interface: create/update/delete documents in document collections (with login to user accounts via the eSciDoc REST API)
  • static views of documents (PDF, HTML)
  • development of open source software: a new eXist function for fast retrieval of the fragment between two XML elements
  • dynamic generation of the table of contents and the list of figures of a book (with links to the pages)
  • word indices of books (original word index and morphological word index)
  • new type of user queries in documents: XPath/XQuery
  • query interface: Lucene query syntax and semantics for XML documents, boolean attribute queries, morphological fulltext queries
  • web interfaces to the main functions of the system (page-fragment, doc-query, etc.) which are used in web frontends such as the new echo viewer or the MPIWG fulltext search system
  • integration of eXist and eSciDoc: first prototype finished
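
The difference between the original word index and the morphological word index listed above can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the small `LEMMAS` table is a hypothetical stand-in for the morphological analysis that the project obtains from Donatus, and the function names are invented for the example. The original index maps each surface form to the pages where it occurs, while the morphological index maps each lemma to the pages where any of its inflected forms occurs.

```python
from collections import defaultdict

# Hypothetical form -> lemma table; in the real system this information
# would come from a morphology service, not from a hard-coded dictionary.
LEMMAS = {
    "amo": "amare",
    "amat": "amare",
    "amavit": "amare",
    "puella": "puella",
    "puellam": "puella",
}


def build_indices(pages):
    """Build (word_index, lemma_index) from a dict of page id -> page text."""
    word_index = defaultdict(set)
    lemma_index = defaultdict(set)
    for page_id, text in pages.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:!?")
            if not token:
                continue
            # Original word index: the form exactly as it appears in the text.
            word_index[token].add(page_id)
            # Morphological index: the lemma, or the form itself if unknown.
            lemma_index[LEMMAS.get(token, token)].add(page_id)
    return word_index, lemma_index


def search(index, term):
    """Return the sorted page ids on which the term is indexed."""
    return sorted(index.get(term, ()))


pages = {"p1": "Puella amat.", "p2": "Amavit puellam;"}
word_index, lemma_index = build_indices(pages)
print(search(word_index, "amat"))
print(search(lemma_index, "amare"))
```

Searching the original index for "amat" only finds pages containing exactly that form, whereas searching the morphological index for the lemma "amare" also finds the page containing "amavit". Restricting a query to a subset of inflected forms would require keeping the form alongside the lemma, which this sketch omits.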

Future requirements/developments

  • maintenance of external objects such as annotations and user queries and their dynamic presentation
  • support of TEI Lite as a new document type
  • extension of the language technology
    • filling of last gaps: support more languages and dictionaries
    • import of keywords of further dictionaries and Wikipedia lexicons
    • better web presentation
  • better support of the scientific workbench approach: document versioning, authoring, publication process, what scientists need in their work, etc.
  • better scalability: up to 100,000 documents, test installations of the new eXist 1.4.1 version and other XML systems
  • faster searching: in all documents, in one document
  • ...
Last modified on Feb 3, 2011, 12:17:08 PM