This tutorial describes the tools that have been developed by MPIWG for the RDFization of existing content and for the mapping into the Europeana Data Model (EDM). These tools are closely related to WP2 of the DM2E project.

XML Workflow Tools

http://dm2e.eu

The process of publishing data into ECHO is complex. The initial input of this process is a file that has to be transformed several times by different scripts written in different programming languages. In order to facilitate this process, MPIWG decided to implement a tool that supports data publishing into ECHO. This tool is called XML Workflow.

The XML Workflow tool is a Java web application developed by Jorge Urzua for running text conversion and text manipulation scripts. It can be extended by writing new scripts in Python, XSL, Perl or directly in Java. The source code is available, and there are also detailed instructions for building and using the tool.
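
As an illustration only (this is not the tool's actual plug-in interface), a new Python script for the workflow can be as simple as a small program that reads the output of the previous step and writes its result for the next step:

{{{#!python
# Hypothetical example of a standalone workflow script; the command-line
# interface shown here is an assumption, not the tool's actual API.
import sys

def transform(text):
    # Placeholder transformation: normalise line endings.
    return text.replace("\r\n", "\n")

if __name__ == "__main__":
    infile, outfile = sys.argv[1], sys.argv[2]
    with open(infile, encoding="utf-8") as f:
        result = transform(f.read())
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(result)
}}}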

Entry page

XML workflow tools

Input for a typical workflow is a transcription that was created following the Data Entry Specifications. The work is divided into three phases: checking, creating well-formed XML and creating valid XML, all of which are described in detail below.

Phase 1: Checking

02_checkTags.png

The text document is uploaded via the web page, and the scripts are generally started by clicking the "Run" button.

03_checkTagsOutput.png

If there are errors, they are displayed in the "error" tab; other information is shown in the "console" tab.
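
A minimal sketch of such a tag check is shown below. It assumes that the transcription uses simple pseudo-tags such as <it>…</it>; the tag names and the checks actually performed by the script may differ.

{{{#!python
import re

# Report unbalanced pseudo-tags in a transcription (sketch only).
def check_tags(text):
    errors, stack = [], []
    for match in re.finditer(r"<(/?)([a-z]+)[^>]*>", text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            errors.append(f"unexpected </{name}> at offset {match.start()}")
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors
}}}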

04_modificationInEditor.png

Because errors are present in the input file, the document has to be modified locally using a text editor.

05_reCheckTags.png

After re-running the script, the text has passed the test and the next steps can be taken.

06_nextWorkflowStep.png

Below the output of the current script is a window which suggests the next meaningful script in the workflow. The output of the current script is used as input to the next one. Of course, other scripts can be selected, as well.

07_findPagebreaks.png

Checking the pagebreaks is important for synchronizing the digital facsimiles with the transcription.

08_pagebreaksChecker.png

09_pagebreakCheckerMore.png

The web service displays the pages with the first few lines of their content. The user then checks manually whether the text corresponds to what is shown on the digital facsimile. Links are provided for convenient checking.
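
The following sketch illustrates the idea, assuming that pagebreaks are marked with a <pb> pseudo-tag (the actual marker may differ): it lists every page together with its first few lines.

{{{#!python
import re

# Print each page's number and its first few non-empty lines (sketch only).
def list_pagebreaks(text, n_lines=3):
    pages = re.split(r"<pb[^>]*>", text)
    for number, page in enumerate(pages[1:], start=1):
        first_lines = [line for line in page.splitlines() if line.strip()][:n_lines]
        print(f"--- page {number} ---")
        print("\n".join(first_lines))
}}}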

10_outputs.png

In this case, the script is divided into two parts: the first part creates a configuration file which can be altered by the user (shown below); this is then evaluated in the second step.

11_pbConfigurationFile.png

The configuration file for the pagebreak script. The last entry should be removed, as there is no corresponding pagebreak for that image in the transcription.

12_showDiff.png

A useful feature is to show the effects of one script by displaying the current and the previous version side by side: the Diff.

13_DiffTool.png

The Diff tool showing the beginning of the text document after application of the pagebreak script. Green lines have been altered: the filename has been written behind each pagebreak pseudo-tag (pseudo, because this is not really XML yet).
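
A comparable side-by-side comparison can be produced with Python's difflib; the sketch below is only an illustration, not the tool's actual implementation.

{{{#!python
import difflib

# Write an HTML side-by-side diff of the previous and current version.
def show_diff(previous_path, current_path):
    with open(previous_path, encoding="utf-8") as f:
        before = f.readlines()
    with open(current_path, encoding="utf-8") as f:
        after = f.readlines()
    html = difflib.HtmlDiff().make_file(before, after, "previous", "current")
    with open("diff.html", "w", encoding="utf-8") as f:
        f.write(html)
}}}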

14_unknownCharacterWarning.png

Another preparatory step is the treatment of unknown characters. Characters that were not recognized during data entry are assigned a code and collected in a list together with a screenshot of each character.

15_unknownCharacterFile.png

A configuration file maps each of these codes to its corresponding Unicode character. This file is evaluated in the next step and the replacements take place.
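
A minimal sketch of this replacement step, assuming a configuration format with one "code codepoint" pair per line (the real format may differ):

{{{#!python
# Replace unknown-character codes by their Unicode characters (sketch only).
# Assumed configuration line format, e.g.:  <X001>  016F
def replace_unknown_characters(text, config_path):
    with open(config_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            code, codepoint = line.split()
            text = text.replace(code, chr(int(codepoint, 16)))
    return text
}}}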

16_unknownCharacterOutput.png

Phase 2: Creating well-formed XML

These important preparatory steps are followed by the conversion into a well-formed XML document. This involves additional replacements and the resolving of shorthands that were used during data entry.

18_helpText.png

The built-in help describes the functionality of each script.

19_stringInputForXML.png

One of the final steps is the insertion of metadata that already reside in the system. For that reason, the identifier of the document has to be entered at this point.

20_wellformedXML.png

The first lines of the XML, displayed in the browser.

21_testWellformedness.png 22_XMLWellformed.png

A script checks whether the XML is well-formed.
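
Such a check can be done with any XML parser; a minimal sketch using the Python standard library:

{{{#!python
from xml.etree import ElementTree as ET

# Return (True, None) if the file parses, otherwise (False, error message).
def is_well_formed(path):
    try:
        ET.parse(path)
        return True, None
    except ET.ParseError as err:
        return False, str(err)
}}}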

Phase 3: Creating valid XML

The fact that the XML document is well-formed does not mean that it is also valid against an XML schema. There are still a few steps to be taken; this is done in the third phase.

23_moveFloatsDiff.png

Floating elements like notes and images are moved away from their original places and replaced by an anchor. The diff shows the effect of this.
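
A hedged sketch of this step, assuming the floating elements are <note> elements and that they are collected at the end of the document (the real script may use different element names and a different target structure):

{{{#!python
from lxml import etree

# Replace each <note> by an <anchor> and collect the notes in a <floats>
# container at the end of the document (sketch only).
def move_floats(xml_bytes):
    root = etree.fromstring(xml_bytes)
    notes = root.findall(".//note")
    container = etree.SubElement(root, "floats")
    for i, note in enumerate(notes, start=1):
        anchor = etree.Element("anchor", ref=f"note{i}")
        note.getparent().replace(note, anchor)
        note.set("id", f"note{i}")
        container.append(note)
    return etree.tostring(root, pretty_print=True)
}}}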

25_divStructure.png

Later on, a div structure is added, which is also used for creating a table of contents.

27_XMLValid.png

As a final step, the validity is tested against a schema. In this case, the document is valid. In some cases, the document has to be edited locally to pass this test.
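
The validation itself is a standard schema check. A minimal sketch using lxml, with placeholder file names and assuming an XSD schema:

{{{#!python
from lxml import etree

# Validate a document against a schema and report errors (sketch only;
# "schema.xsd" and "document.xml" are placeholder file names).
schema = etree.XMLSchema(etree.parse("schema.xsd"))
document = etree.parse("document.xml")
if schema.validate(document):
    print("document is valid")
else:
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
}}}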

Extras

28_whatshallwedonow.png

Further extras can be applied to the XML document; they are available in the bottom window. For example, mathematical formulas written in LaTeX can be converted to MathML here.

Upload

29_uploadSandbox.png

A valid XML file can then be uploaded to the Sandbox for further checking.

30_indexmeta.png

In addition, the file containing the metadata of the work has to be extended with the path and name of the XML document so that it is also displayed in the ECHO display environment.
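
A sketch of this extension step; the element names ("texttool", "text") and the path are assumptions about the index.meta format, not confirmed here:

{{{#!python
from lxml import etree

# Add the path of the full-text XML to the metadata file (sketch only;
# element names and path are hypothetical).
meta = etree.parse("index.meta")
root = meta.getroot()
texttool = root.find("texttool")
if texttool is None:
    texttool = etree.SubElement(root, "texttool")
text = etree.SubElement(texttool, "text")
text.text = "path/to/document.xml"
meta.write("index.meta", encoding="utf-8", xml_declaration=True)
}}}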

31_operationStatus.png

The operation status shows that the text has been successfully uploaded. During that process, the text is also analysed morphologically and connected to the various dictionaries available in the system.

Online representations

32_resultPollux.png

The text displayed in the sandbox: maroon-coloured words can be clicked to show the corresponding dictionary entries.

33_wordInfo.png

Word information

DM2E Mapping Server

The DM2E project considered the use of MINT for the RDFization of data. MINT is based on XSL transformations. Due to the specific characteristics of our data (index.meta), this tool could not be used. For this reason, Jorge Urzua developed a web server that is able to transform index.meta into general RDF and into EDM.

34_mappingServer

Using the DM2E Mapping Server, the metadata information is converted into various RDF models, for example the DM2E Data Model. With the data present in this format, the various RDF-aware tools available in the DM2E toolchain can be used and the data can finally be ingested into the DM2E triple store.
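
The sketch below illustrates the kind of mapping the server performs, expressing a few metadata fields as EDM triples with rdflib; the base URIs, field names and mapping rules are assumptions, not the server's actual implementation.

{{{#!python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Express a few metadata fields as EDM triples (sketch only; URIs are
# hypothetical placeholders).
def index_meta_to_edm(record_id, title, creator):
    g = Graph()
    g.bind("edm", EDM)
    g.bind("ore", ORE)
    cho = URIRef(f"http://example.org/item/{record_id}")
    aggregation = URIRef(f"http://example.org/aggregation/{record_id}")
    g.add((cho, RDF.type, EDM.ProvidedCHO))
    g.add((cho, DC.title, Literal(title)))
    g.add((cho, DC.creator, Literal(creator)))
    g.add((aggregation, RDF.type, ORE.Aggregation))
    g.add((aggregation, EDM.aggregatedCHO, cho))
    return g

print(index_meta_to_edm("1", "Example title", "Example creator").serialize(format="turtle"))
}}}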

35_resultPubby.png

After ingestion into the DM2E triple store, the results can be easily browsed.

36_resultEuropeana.png

Of course, the source data has also been provided to Europeana.

Text enrichment Tools

Different tools for enriching text have been developed in order to integrate them into DM2E services.

Lemmatisation

This section addresses the integration of the MPIWG Lemmatisation in Pundit.

A lemma is the canonical form of a set of words. Lemmatisation refers to the morphological analysis of words that aims to find the lemma of a word by removing inflectional endings and returning its base or dictionary form.

MPIWG Lemmatisator

The MPIWG Lemmatisator is a web service that is part of the Language technology services (Mpdl) hosted at http://mpdl-service.mpiwg-berlin.mpg.de/mpiwg-mpdl-lt-web/. This web service does not perform the lemmatisation of words itself; instead, it accesses several other web services (such as http://www.perseus.tufts.edu/hopper/) that provide dictionaries of words and their lemmas. The MPIWG Lemmatisator is only responsible for querying the words and for merging the responses from the other services into a single response.

The lemmatisator supports the following languages: Arabic, Chinese, Dutch, English, French, German, Ancient Greek, Italian and Latin.

As an example, the following link illustrates the response of the lemmatisation for the query “multa”: http://mpdl-service.mpiwg-berlin.mpg.de/mpiwg-mpdl-lt-web/lt/GetLemmas?query=multa&language=lat&outputFormat=html
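
The same query can also be sent programmatically; a minimal sketch using the parameters from the example URL above:

{{{#!python
import requests

# Query the GetLemmas endpoint with the parameters shown in the example URL.
BASE = "http://mpdl-service.mpiwg-berlin.mpg.de/mpiwg-mpdl-lt-web/lt/GetLemmas"

response = requests.get(BASE, params={
    "query": "multa",
    "language": "lat",
    "outputFormat": "html",
})
print(response.text)
}}}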

Pundit Integration

Pundit is an annotation tool based on semantic web technologies. In order to allow the use of the Mpdl in Pundit, the lemmatisator must be able to transform its response to RDF. The web service (hosted temporarily at https://openmind-ismi-dev.mpiwg-berlin.mpg.de/lemmatisator) solves this issue by transforming the response from the MPIWG Lemmatisator into RDF triples.

The triples returned by this service implement the GOLD Ontology (see: http://lov.okfn.org/dataset/lov/vocabs/gold). For example, querying the word “multa” in Latin returns: “multus is the lemma of multa”. Using the GOLD Ontology, this statement would be expressed as the following triple:

http://mpiwg.de/ontologies/ont.owl/lemma#multus writtenRealization http://mpiwg.de/ontologies/ont.owl/word#multa
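
Expressed with rdflib, the triple above could be produced as follows (the GOLD namespace URI is an assumption; the property name "writtenRealization" and the resource URIs are taken from the example):

{{{#!python
from rdflib import Graph, Namespace, URIRef

# Assumed GOLD namespace; the property and resource URIs follow the example above.
GOLD = Namespace("http://purl.org/linguistics/gold/")

g = Graph()
lemma = URIRef("http://mpiwg.de/ontologies/ont.owl/lemma#multus")
word = URIRef("http://mpiwg.de/ontologies/ont.owl/word#multa")
g.add((lemma, GOLD.writtenRealization, word))
print(g.serialize(format="nt"))
}}}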
