== Reached project goals (Content-based Web Access) ==

Requirements (extracted from [https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/mpdl-project-content/raw-attachment/wiki/WikiStart/MPDL_project_desc.pdf project description]):

 * Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
 * The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
 * Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed, since the XML between two page break milestones usually is not a well-formed XML fragment without further processing.
 * The system is designed to support multiple XML vocabularies, which will require minimal configuration information (such as the TEI document type or ECHO document type).
 * Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
   * Rendering. Rendering of the balanced XML fragment will be performed with XSLT on the server side, yielding XHTML for the client. XSLT will be readily pluggable, allowing for multiple output options.
   * Enrichment. The generated XHTML will be enriched with: inline images, links to external resources (e.g. Pollux dictionaries via lemmatization provided by Donatus; geospatial data). At this stage, transliteration of various sorts is also possible (should a Greek text be displayed in a Romanization or in Greek characters? should an Arabic text be displayed fully voweled, in its typical rendition, or in Romanization? should a Sanskrit text be displayed in Devanagari, or Tamil, or Romanization, or IPA? a Chinese text in traditional characters, simplified characters, or pinyin?). This is also the layer at which named entity resolution is most appropriately realized.
   * Generation of a synthetic view. The XHTML view will be synchronized and presented in coordination with the appropriate digital image, provided by Digilib.
 * In addition to the basic display environment, a language-sensitive indexing tool needs to be constructed. Such a tool will allow searching a particular text, a corpus, an arbitrarily selected group of texts/corpora, or all texts for one or more natural language words. The search functionality will be developed using an open-source tool (e.g. Lucene) in combination with the NLP technology hosted by Donatus. Thus, for instance, it will be possible to search for all inflected forms of a Latin verb (or only a subset of those forms).
 * There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop an access approach to a given text.
 * A further component of this project is to extend the Arboreal browser to be able to make use (both read/write) of the MPDL repository. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
 * It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
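
The page-extraction requirement above (producing a balanced, or “symmetrized”, fragment between two page break milestones) can be illustrated with a minimal sketch. It is not the project's actual subservice. It assumes that page breaks are encoded as empty `<pb n="..."/>` elements, as in TEI, and that the input contains no comments, CDATA sections, or `>` characters inside attribute values. The function name `extract_page` and the regular expression are hypothetical. The idea is to track the stack of open elements, reopen those ancestors at the start of the page, and close whatever is still open at the next milestone.

```python
import re
import xml.etree.ElementTree as ET

# Matches start tags, end tags and empty-element tags (simplified; see assumptions above).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:\-]*)([^<>]*?)(/?)>")


def extract_page(xml_text, page, pb="pb"):
    """Return a well-formed fragment for the page starting at <pb n="page"/>.

    Hypothetical helper: the element name `pb` and the attribute `n`
    are assumptions about the markup, not taken from the project.
    """
    stack = []       # (name, start_tag) of every element currently open
    out = []         # pieces of the resulting fragment
    inside = False   # True while we are between the two milestones
    pos = 0          # end offset of the previous tag
    for m in TAG.finditer(xml_text):
        if inside:
            out.append(xml_text[pos:m.start()])  # text preceding this tag
        pos = m.end()
        closing, name, attrs, empty = m.groups()
        if name == pb and empty:
            if inside:
                break  # next milestone reached: the page ends here
            if re.search(r'\bn="%s"' % re.escape(page), attrs):
                inside = True
                # Reopen all ancestors so the fragment has a single root.
                out.extend(tag for _, tag in stack)
            continue
        if closing:
            stack.pop()
        elif not empty:
            stack.append((name, m.group(0)))
        if inside:
            out.append(m.group(0))
    if not inside:
        raise KeyError("no page break with n=%r" % page)
    # Close everything still open at the end of the page, innermost first.
    out.extend("</%s>" % name for name, _ in reversed(stack))
    return "".join(out)


if __name__ == "__main__":
    sample = ('<text><body><p>one <pb n="1"/>two <hi>three</hi></p>'
              '<p>four <hi>five <pb n="2"/>six</hi> seven</p></body></text>')
    fragment = extract_page(sample, "1")
    ET.fromstring(fragment)  # raises ParseError if the fragment is not well-formed
    print(fragment)
```

Because the reopened start tags are copied verbatim, attributes such as identifiers are duplicated across page fragments; a real subservice would probably need to decide how to handle that before the fragment is passed on to the XSLT rendering step.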