Version 8 (modified by Klaus Thoden, 9 years ago)

XML Workflow Tool

http://dm2e.eu

The XML Workflow tool is a Java web application for running text conversion and text manipulation scripts. It can be extended with new scripts written in Python, XSL, Perl, or directly in Java. The source code is available, along with detailed instructions for building and using the tool.

Entry page

XML workflow tools

The input for a typical workflow is a transcription that was created following the Data Entry Specifications. The work is divided into three phases: checking, creating well-formed XML, and creating valid XML. Each phase is described in detail below.

Phase 1: Checking

02_checkTags.png

The text document is uploaded via the webpage, and the scripts are generally started by clicking the "Run" button.

03_checkTagsOutput.png

If there are errors, they are displayed in the "error" tab. Other information is shown in the "console" tab.

04_modificationInEditor.png

Because the input file contains errors, the document has to be corrected locally in a text editor.

05_reCheckTags.png

After the script is re-run, the text passes the test and the next steps can be taken.

06_nextWorkflowStep.png

Below the output of the current script is a window that suggests the next meaningful script in the workflow. The output of the current script is used as input to the next one. Other scripts can be selected as well.

07_findPagebreaks.png

Checking the pagebreaks is important for synchronizing the digital facsimiles with the transcription.

08_pagebreaksChecker.png

09_pagebreakCheckerMore.png

The web service displays each page together with the first few lines of its content. The user then checks manually whether the text corresponds to what is seen on the digital facsimile. Links are provided for convenient checking.

10_outputs.png

In this case, the script is divided into two parts. The first part creates a configuration file that can be altered by the user (shown below). This file is then evaluated in the second part.

11_pbConfigurationFile.png

The configuration file for the pagebreak script. The last entry should be removed because there is no corresponding pagebreak for that image in the transcription.

12_showDiff.png

A useful feature is the Diff, which shows the effect of a script by displaying the current and the previous version side by side.
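The idea of comparing the output of a script with its input can be sketched in a few lines of Python. This is only an illustration and not the tool's own implementation: it uses the standard `difflib` module and produces a unified diff rather than the side-by-side view of the web interface. The function name `diff_versions` is hypothetical.

```python
import difflib


def diff_versions(previous: str, current: str) -> list[str]:
    """Return the unified-diff lines between two versions of a text.

    Hypothetical helper: the real Diff tool renders both versions
    side by side in the browser; this only lists the changed lines.
    """
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    before = "first line\n<pb>\nsecond line"
    after = "first line\n<pb> 0001.jpg\nsecond line"
    for line in diff_versions(before, after):
        print(line)
```

Lines starting with `-` come from the previous version, lines starting with `+` from the current one.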

13_DiffTool.png

The Diff tool showing the beginning of the text document after application of the pagebreak script. Green lines have been altered: the filename has been written after each pagebreak pseudo-tag (pseudo, because this is not really XML yet).
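What the pagebreak script does to the text can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual script: the pseudo-tag is assumed to look like `<pb>`, the configuration is assumed to be a plain list of image filenames in page order, and the function name `number_pagebreaks` is hypothetical.

```python
import re

# Assumption: pagebreaks are marked with a bare "<pb>" pseudo-tag.
PB_PATTERN = re.compile(r"<pb>")


def number_pagebreaks(text: str, filenames: list[str]) -> str:
    """Write one image filename after each pagebreak pseudo-tag.

    Raises ValueError if the number of filenames differs from the
    number of pagebreaks, which is the situation in which an entry
    has to be removed from (or added to) the configuration.
    """
    count = len(PB_PATTERN.findall(text))
    if count != len(filenames):
        raise ValueError(
            f"{count} pagebreaks in the text, but {len(filenames)} filenames"
        )
    names = iter(filenames)
    # Each match is replaced by itself followed by the next filename.
    return PB_PATTERN.sub(lambda m: f"{m.group(0)} {next(names)}", text)


if __name__ == "__main__":
    sample = "<pb>\nfirst page\n<pb>\nsecond page"
    print(number_pagebreaks(sample, ["0001.jpg", "0002.jpg"]))
```

Failing loudly on a count mismatch is a design choice of this sketch. It mirrors the manual check described above, where a surplus image entry has to be removed before the script is run.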

14_unknownCharacterWarning.png

Another preparatory step is the treatment of unknown characters. Characters that were not recognized during data entry were assigned a code and collected in a list, each together with a screenshot of the character.

15_unknownCharacterFile.png

A configuration file maps each of these codes to its corresponding Unicode character. The file is evaluated in the next step, where the replacements take place.
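The replacement step can be illustrated with a short Python sketch. The configuration format used here (one code and one Unicode character per line, separated by a tab, with `#` for comments) and the code syntax `@001@` are assumptions made for the example. The actual file format of the tool may differ, and both function names are hypothetical.

```python
def load_replacements(config_text: str) -> dict[str, str]:
    """Parse an assumed 'code<TAB>character' configuration into a dict."""
    table = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, char = line.split("\t", 1)
        table[code] = char
    return table


def replace_unknown(text: str, table: dict[str, str]) -> str:
    """Replace every code in the text by its Unicode character."""
    # Longer codes first, so a code that is a prefix of another
    # code cannot clobber the longer one.
    for code in sorted(table, key=len, reverse=True):
        text = text.replace(code, table[code])
    return text


if __name__ == "__main__":
    config = "# code\tcharacter\n@001@\t\u0292\n@002@\t\u2108\n"
    table = load_replacements(config)
    print(replace_unknown("take @001@ i and @002@ ii", table))
```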

16_unknownCharacterOutput.png

Phase 2: Creating well-formed XML

These important preparatory steps are followed by the conversion to a well-formed XML document. This involves additional replacements and the resolution of shorthands that were used during data entry.

18_helpText.png

The built-in help describes the functionality of each script.

19_stringInputForXML.png

One of the final steps is the insertion of metadata that already reside in the system. For that reason, the identifier of the document has to be entered at this point.

20_wellformedXML.png

The first lines of the XML, displayed in the browser.

21_testWellformedness.png 22_XMLWellformed.png

A script checks whether the XML is well-formed.
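A well-formedness check of this kind can be sketched with Python's standard library. This is not the tool's own script. The function name `check_wellformed` is hypothetical, and the sketch simply reports the position of the first parse error.

```python
import xml.etree.ElementTree as ET


def check_wellformed(xml_text: str) -> tuple[bool, str]:
    """Return (True, message) if xml_text parses, else (False, error)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position of the first error
        return False, f"not well-formed at line {line}, column {column}: {err}"
    return True, "well-formed"


if __name__ == "__main__":
    print(check_wellformed("<text><p>ok</p></text>"))
    print(check_wellformed("<text><p>missing end tag</text>"))
```

Parsing succeeds only if every element is properly closed and nested, which is exactly what well-formedness requires. Validity against a schema is a separate, stricter test.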

Phase 3: Creating valid XML

A well-formed XML document is not necessarily valid against an XML schema. A few more steps are needed, and these are carried out in the third phase.

23_moveFloatsDiff.png

Floating elements such as notes and images are moved away from their original positions and replaced by an anchor. The diff shows the effect of this.
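The principle of moving floats and leaving an anchor behind can be sketched as follows. All element and attribute names in this example (`note`, `figure`, `anchor`, `ref`, `id`, and the collecting `floats` element) are assumptions for illustration only. The schema used by the tool is not specified here, and the function name `move_floats` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these element names mark floating content.
FLOAT_TAGS = {"note", "figure"}


def move_floats(root: ET.Element) -> ET.Element:
    """Replace each float by an anchor and collect the floats at the end."""
    floats = []
    # Snapshot the tree first, because it is modified while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag in FLOAT_TAGS:
                float_id = f"float{len(floats) + 1}"
                anchor = ET.Element("anchor", {"ref": float_id})
                # The text following the float stays in the running text.
                anchor.tail = child.tail
                child.tail = None
                child.set("id", float_id)
                parent[index] = anchor
                floats.append(child)
    container = ET.SubElement(root, "floats")
    container.extend(floats)
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        "<text><p>Some text<note>A note</note> continues.</p></text>"
    )
    print(ET.tostring(move_floats(doc), encoding="unicode"))
```

The anchor keeps the position of the float in the running text, so the float can later be displayed at, or linked from, its original place.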

25_divStructure.png

Later on, a div structure is added, which is also used for creating a table of contents.

27_XMLValid.png

In the final stage, the document is validated against a schema. In this case, the document is valid. In some cases, the document has to be edited locally to pass this test.

Extras

28_whatshallwedonow.png

Further extras, available in the bottom window, can be applied to the XML document. For example, mathematical formulas written in LaTeX can be converted to MathML here.

Upload

29_uploadSandbox.png

A valid XML file can then be uploaded to the sandbox for further checking.

30_indexmeta.png

In addition, the file containing the metadata of the work has to be extended with the path and name of the XML document so that the document is also displayed in the ECHO display environment.

31_operationStatus.png

The operation status shows that the text has been uploaded successfully. During that process, the text is also analysed morphologically and connected to various dictionaries available in the system.

Online representations

32_resultPollux.png

The text as displayed in the sandbox. Maroon-coloured words can be clicked to show the corresponding dictionary entries.

33_wordInfo.png

Word information

35_resultPubby.png

After ingestion into the DM2E triple store, the results can be easily browsed.

36_resultEuropeana.png

The source has, of course, also been delivered to Europeana.
