Changes between Version 7 and Version 8 of tmp/projectgoals
- Timestamp: Feb 2, 2011, 5:05:58 PM
tmp/projectgoals
  v7 v8
   3  3  === Requirements (extracted from [https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/mpdl-project-content/raw-attachment/wiki/WikiStart/MPDL_project_desc.pdf project description]) ===
   4  4
-  5     * (1) Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
+     5  * (1) system realized as a prototype
   6  6
-  7     * (2) The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
+     7  * (2) Such an environment must offer content-based access to the texts, which includes sophisticated search capabilities that depend in part on natural language processing (NLP).
   8  8
-  9     * (3) Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed, since the XML between two page break milestones usually is not a well-formed XML fragment without further processing.
+     9  * (3) The current technical infrastructure of ECHO is inadequate for indefinitely maintaining a (growing) collection of this size. Within the MPDL framework we intend to pilot a replacement architecture for ECHO as well as to prepare a migration path for the ECHO content.
  10 10
- 11     * (4) The system is designed to support multiple XML vocabularies which will require minimal configuration information, such as the TEI document type or ECHO document type.
+    11  * (4) Present formatted pages of the XML transcription in parallel with digital page images. Digilib will play the primary role in disseminating the latter. Here a basic subservice will extract a given page from an XML fulltext and provide a balanced (or “symmetrized”) version. Such a subservice is needed, since the XML between two page break milestones usually is not a well-formed XML fragment without further processing.
  12 12
- 13     * (5) Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
+    13  * (5) The system is designed to support multiple XML vocabularies which will require minimal configuration information, such as the TEI document type or ECHO document type.
+    14
+    15  * (6) Subsequent to extraction and production of a balanced XML fragment, the display pipeline involves the following three major steps:
  14 16    * Rendering. Rendering of the balanced XML fragment will be performed with XSLT on the server side, yielding XHTML for the client. XSLT will be readily pluggable, allowing for multiple output options.
  15 17    * Enrichment. The generated XHTML will be enriched with: inline images, links to external resources (e.g. Pollux dictionaries via lemmatization provided by Donatus; geospatial data). At this stage, transliteration of various sorts is also possible (should a Greek text be displayed in a Romanization or in Greek characters? should an Arabic text be displayed fully voweled, in its typical rendition, or in Romanization? should a Sanskrit text be displayed in Devanagari, or Tamil, or Romanization, or IPA? a Chinese text in traditional characters, simplified characters, or pinyin?). This is also the layer at which named entity resolution is most appropriately realized.
  16 18    * Generation of a synthetic view. The XHTML view will be synchronized and presented in coordination with the appropriate digital image, provided by Digilib. In addition to the basic display environment, a language-sensitive indexing tool needs to be constructed. Such a tool will allow searching a particular text, a corpus, an arbitrarily selected group of texts/corpora, or all texts for one or more natural language words. The search functionality will be developed using an open-source tool (e.g. Lucene) in combination with the NLP technology hosted by Donatus. Thus, for instance, it will be possible to search for all inflected forms of a Latin verb (or only a subset of those forms).
  17 19
- 18     * (6) There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop an access approach to a given text.
+    20  * (7) There will also be support for accessing texts through human-constructed indices, which reference the texts through XPointer. In this way, scholars will be able to develop an access approach to a given text.
  19 21
- 20     * (7) A further component of this project is to extend the Arboreal browser to be able to make use (both read/write) of the MPDL repository. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
+    22  * (8) A further component of this project is to extend the Arboreal browser to be able to make use (both read/write) of the MPDL repository. This extension will provide scholars with an alternative/complementary access modality. In addition, Arboreal, which is an inherently network-neutral application, will be able to offer storage within the MPDL repository as an alternative strategy for saving content generated within the program.
  21 23
- 22     * (8) It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
+    24  * (9) It is also our intention to integrate a general statistical toolkit currently under development by the Scholarly Computing Group of the MPIWG into this framework.
  23 25
  24 26  === Reached progress ===
   …  …
  30 32  * (3): 100%
  31 33
- 32     * (4): document types "archimedes" and "echo" fully realized, document type TEI is already supported in software design and is in preparation: 85%
+    34  * (4): 100%
  33 35
- 34     * (5)
+    36  * (5): document types "archimedes" and "echo" fully realized, document type TEI is already supported in software design and is in preparation: 85%
+    37
+    38  * (6)
  35 39    * Rendering: 100%
  36 40    * Enrichment: inline images, links to external resources (100%), transliteration of various sorts (70%), named entity resolution, e.g. explicit naming of persons/organizations or places: linking to the GIS system or to Wikipedia articles (ca. 20%)
  37 41    * Generation of a synthetic view: 90%
  38 42
- 39     * (6): e.g.: the 5th sentence in the example.xml document: http://example.org/example.xml#/echo/text//s[5]: some functionality is realized: XPath-Queries in documents are possible (e.g.: http://mpdl-proto.mpiwg-berlin.mpg.de/mpdl/page-query-result.xql?document=%2Fecho%2Fla%2FBenedetti_1585.xml&pn=1&query-type=xpath&query=%2F%2Fecho%3As). Also direct links to pages and sentences are possible. Full XPointer linking is not realized yet but this is relatively easy. Overall progress: 40%
+    43  * (7): e.g.: the 5th sentence in the example.xml document: http://example.org/example.xml#/echo/text//s[5]: some functionality is realized: XPath-Queries in documents are possible (e.g.: http://mpdl-proto.mpiwg-berlin.mpg.de/mpdl/page-query-result.xql?document=%2Fecho%2Fla%2FBenedetti_1585.xml&pn=1&query-type=xpath&query=%2F%2Fecho%3As). Also direct links to pages and sentences are possible. Full XPointer linking is not realized yet but this is relatively easy. Overall progress: 40%
  40 44
- 41     * (7) Arboreal functionality will be further developed in cooperation with the Institute of Computer Science, Artificial Intelligence at Erlangen-Nuremberg (Prof. Günter Görz) and Group Software and Internet technology at Hochschule Deggendorf (Prof. Dr. Josef Schneeberger). Also we propose to use XML editors such as Oxygen, Eclipse or Open Office. Functionality such as saving links to user defined fulltext queries or annotations is just developed. Overall progress: 20% (?)
+    45  * (8) Arboreal functionality will be further developed in cooperation with the Institute of Computer Science, Artificial Intelligence at Erlangen-Nuremberg (Prof. Günter Görz) and Group Software and Internet technology at Hochschule Deggendorf (Prof. Dr. Josef Schneeberger). Also we propose to use XML editors such as Oxygen, Eclipse or Open Office. Functionality such as saving links to user defined fulltext queries or annotations is just developed. Overall progress: 20% (?)
  42 46
- 43     * (8): not realized yet: 0%
+    47  * (9): not realized yet: 0%
  44 48
  45 49  === Additional progress (not listed in requirements) ===
+    50
+    51  * system has already reached "product status" through the use of the ECHO-system and diverse web crawlers (50.000 - 1.500.000 accesses per day)
  46 52
  47 53  * combined morphology, dictionary and knowledge presentation of words in different languages such as classical Latin, Greek, Chinese, etc., dynamic online linking to external dictionaries and lexicons in different languages
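
The page-extraction subservice named in the requirements (take the XML between two page-break milestones and return a balanced fragment) could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name `balanced_page`, the sample document `DOC`, and the milestone element name `pb` are assumptions, and the regex-based tag scan assumes milestones are written as empty elements and ignores comments, CDATA sections, and `>` inside attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Simplified tag matcher (assumption: no comments, CDATA, or ">" in attributes).
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)([^<>]*?)(/?)>")

# Hypothetical sample: both page breaks fall inside a <p>, so the raw text
# between them is not well-formed on its own.
DOC = ('<text><p>one <pb n="1"/>two <s>three</s></p>'
       '<p>four <pb n="2"/>five</p></text>')


def _wrap(body, opened, still_open):
    """Re-open the elements open at page start, close those open at page end."""
    prefix = "".join(tag for _name, tag in opened)
    suffix = "".join(f"</{name}>" for name, _tag in reversed(still_open))
    return prefix + body + suffix


def balanced_page(xml_text, page, milestone="pb"):
    """Return page `page` (1-based, counted by milestones) as balanced XML."""
    stack = []            # (name, literal start tag) of currently open elements
    seen = 0              # milestones encountered so far
    start = None          # offset just after the milestone that opens the page
    open_at_start = []    # snapshot of the stack at that offset
    for m in TAG.finditer(xml_text):
        closing, name, _attrs, empty = m.groups()
        if name == milestone and not closing:
            # Milestones are assumed to be empty elements such as <pb/>.
            seen += 1
            if seen == page:
                start = m.end()
                open_at_start = list(stack)
            elif seen == page + 1:
                return _wrap(xml_text[start:m.start()], open_at_start, stack)
            continue
        if closing:
            stack.pop()
        elif not empty:
            stack.append((name, m.group(0)))
    if start is None:
        raise ValueError(f"page {page} not found")
    # Last page: runs to the end of the document, where nothing is left open.
    return _wrap(xml_text[start:], open_at_start, stack)


# The balanced fragment can be handed to an ordinary XML parser.
ET.fromstring(balanced_page(DOC, 1))
```

Keeping the literal start tags on the stack means the re-opened ancestors carry their original attributes, which a later XSLT rendering step may depend on.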