
The Data Entry Specifications (DESpecs)

The most recent versions can be found on Pythia. The underlying LaTeX files and some post-processing scripts are under version control in a git repository.

The DESpecs and the ECHO schema

The DESpecs simply describe a standard for marking up the text. The goal is to mark up structural features of the text with reasonably simple rules that strike a balance between cost and benefit. In addition to the DESpecs, we have created an XML format ("ECHO") that the transcriptions should ultimately conform to, as well as a workflow for transforming the transcriptions from the DESpecs format into the ECHO format.

We make a clear conceptual distinction between the DESpecs and the ECHO format. In particular, the ECHO format is not simply a well-formed version of the DESpecs format. The DESpecs format just happens to resemble XML. We would not gain much if we asked Formax to send us well-formed XML, since we have to post-process the text anyway. For example, it doesn't make much sense to have the typists enter a character variant such as <獘V> as well-formed XML, e.g. <V>獘</V>, let alone as the full <reg norm="獘" type="simple"><image xlink:href="symbols/chinese/⿱敝大.svg"/></reg> of the ECHO format. Keeping the two formats separate also means that, in this example, we could change "simple" to e.g. "ids-list" without having to change the DESpecs.
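To illustrate why this expansion belongs in post-processing rather than in the typists' hands, here is a minimal Python sketch of the character-variant example above. It is not the project's actual script: the function name variant_to_echo, the lookup table IDS_TABLE, and the assumption that a variant is always written as "<", one character, "V", ">" are all hypothetical. Only the single table entry and the output shape are taken from the example.

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical lookup table: variant character -> ideographic description
# sequence used in the image file name. Only this one entry is known from
# the example in the text; a real table is assumed to exist elsewhere.
IDS_TABLE = {"獘": "⿱敝大"}

# Assumed DESpecs shape of a character variant: "<", one character, "V", ">".
VARIANT = re.compile(r"<([^<>\s])V>")


def variant_to_echo(text, reg_type="simple"):
    """Expand DESpecs-style variants such as <獘V> into ECHO-style <reg> elements."""

    def expand(match):
        char = match.group(1)
        ids = IDS_TABLE.get(char)
        if ids is None:
            # Unknown character: leave the markup untouched for manual review.
            return match.group(0)
        href = f"symbols/chinese/{ids}.svg"
        return (
            f"<reg norm={quoteattr(char)} type={quoteattr(reg_type)}>"
            f"<image xlink:href={quoteattr(href)}/></reg>"
        )

    return VARIANT.sub(expand, text)


print(variant_to_echo("<獘V>"))
print(variant_to_echo("<獘V>", reg_type="ids-list"))
```

Because the value of the type attribute is a parameter of the conversion step, switching from "simple" to "ids-list" only touches the script, not the transcriptions or the DESpecs.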

The workflow consists of a series of scripts and is designed to require as little human intervention as possible (see workflow, in German). Turning the transcription into well-formed XML is only a small and relatively straightforward part of this workflow: resolving some reserved characters, making attributes well-formed, adding "/" to empty elements, renaming some elements, etc. (Some parts are trickier, for instance the example above or conflicting XML hierarchies, e.g. of paragraphs and columns of text.) It is true that some mistakes in the XML markup, especially copy/paste mistakes, could be avoided if the transcriptions were required to be well-formed XML, but mistakes of this kind are relatively rare and easy to spot.
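The straightforward steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual workflow scripts: the element names in EMPTY and RENAME, the unquoted attribute syntax (n=12), and the function name to_well_formed are invented for the example, and only bare "&" characters are treated as reserved characters here. Stray "<" characters and overlapping hierarchies are deliberately left out.

```python
import re

# Hypothetical element names; the real DESpecs and ECHO vocabularies may differ.
EMPTY = {"pb"}          # elements that never have content (checked after renaming)
RENAME = {"it": "hi"}   # DESpecs-style name -> ECHO-style name

# A tag whose name starts with an ASCII letter; markup such as <獘V> is not
# matched and is left for a separate step.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*)>")
# An attribute with either a quoted or an unquoted value.
ATTR = re.compile(r'([\w:.-]+)=(?:"([^"]*)"|(\S+))')
# An ampersand that does not start an entity or character reference.
BARE_AMP = re.compile(r"&(?!#?\w+;)")


def fix_tag(match):
    closing, name, rest = match.groups()
    name = RENAME.get(name, name)          # rename elements
    if closing:
        return f"</{name}>"
    rest = rest.rstrip()
    self_closed = rest.endswith("/")
    rest = rest.rstrip("/")
    # Re-emit every attribute with a double-quoted value.
    attrs = "".join(
        f' {key}="{quoted or unquoted}"'
        for key, quoted, unquoted in ATTR.findall(rest)
    )
    slash = "/" if name in EMPTY or self_closed else ""   # close empty elements
    return f"<{name}{attrs}{slash}>"


def to_well_formed(text):
    text = BARE_AMP.sub("&amp;", text)     # resolve bare ampersands
    return TAG.sub(fix_tag, text)


print(to_well_formed("<it>A & B</it><pb n=12>"))
```

For the sample input, the sketch prints <hi>A &amp; B</hi><pb n="12"/>. Each rule is purely local, which is why this part of the workflow needs little human intervention, whereas conflicting hierarchies cannot be resolved tag by tag.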

Last modified on Sep 27, 2019, 2:18:17 PM