
ECHO-Schema: 1. Overview, 2. Usage Guide, 3. Implementation

2. Usage Guide

A first version of the usage guide for the schema: PDF (as of April 2010; TO DO: transfer from LaTeX to the wiki, update)

General

Design decisions:

  • one schema for all texts
  • modules should be independent
  • tags in the DESpecs should have some counterpart in the Schema, if possible
  • however, do not mimic the DESpecs structure

Standard suffixes as in xhtml:

  • .attrib (defined in echo-attribute)
  • .datatype (defined in echo-datatype)
  • .model (defined in echo-content)
  • .class (defined in several modules)

IDs are expected, but XML texts should also validate without IDs

Sources of information and minimal schema

The core of the schema. "(--)" means: not part of the minimal schema, but it provides something that goes beyond the visual display of the text.

Level                   minimal  China  Module                  Elements
Coarse structure        +        --     echo-start              <echo>
                        +        --     echo-metadata           <metadata>
                        +/--     --     echo-metadata           metadata
                        +        --     echo-div                <text>
Intermediate structure  (--)     +      echo-div                <div index>, <div toc>
                        (--)     --     echo-div                <div chapter> etc.
Fine structure          +        +      echo-block              <head>, <p>, <note>
                        +        --     echo-block              <s>
                        --       +      echo-block-scholarly    <quote>
                        --       --     echo-block-scholarly    <set-off>
Text                    --       +      echo-content            <emph>
                        (--)     --     echo-content            <reg>
                        (--)     --     echo-content-scholarly  <foreign>, <ref>
                        --       --     echo-content-scholarly  <sic>, <set-off>, <q>
Auxiliary modules       +        n/a    echo-attribute          --
                        +        n/a    echo-datatype           --
F: Figures              --       +      echo-figure             <figure>, <caption> etc.
F: Handwritten          --       +      echo-handwritten        <handwritten>
F: Chinese text         --       +      echo-chinese-text       <head ti>, <p @indent>, <p @ics>, <pb @ics>
F: Textflows            --       +      echo-textflows          <head @flow>, <p @flow>, <div multiflow @flows>
F: Tables               --       +      echo-import-xhtml       <xhtml:table>
F: Lists                --       --     echo-import-xhtml       <xhtml:ul> etc.
F: Floats               --       --     echo-float              <div float>
F: Images               --       --     echo-figure             <image>
T: Milestones           (--)     +      echo-milestone          <pb> (also F), <lb>, <cb>
T: Corruptions          --       +      echo-gap                <gap>, <unsure>
                        --       +      echo-de                 <de:unknown>, <de:wrong>
T: Chinese notes        --       +      echo-chinese-text       <lb halfline>
T: Floats               --       +      echo-float              <anchor>
T: numbers etc.         (--)     --     echo-mathematics        <num>, <var>
T: formulas             (--)     --     echo-import-mathml      <mml:math>
T: Verse                --       --     echo-textflows          <lb @label>
T: Images               --       --     echo-figure             <image>
T: Gis                  --       --     echo-gis                <place>, <time>
G: Gis                  --       --     echo-gis                <dcterms:temporal>, <dcterms:spatial>

Overview of the modules

  • echo is the main file of the schema. All modules are loaded here.
  • echo-start, echo-metadata, echo-div, echo-block, echo-block-scholarly define the structure of the XML document from the root element <echo> down to the text level:
    <echo>               echo-start.rnc 
        <metadata>       echo-metadata.rnc 
        <text>           echo-div.rnc 
            <div>           - " - 
                <head>   echo-block.rnc 
                <p>         - " - 
                    <s>     - " - 
                <note>      - " -
    
    • echo-start: The root element <echo> has no counterpart in the DESpecs. It must be inserted into the document.
    • echo-metadata: <metadata> and all metadata it contains have no counterpart in the DESpecs. <metadata> and some of the metadata are required and must be inserted into the document. Other metadata are optional.
    • echo-div: <text> and <div> likewise have no counterpart in the DESpecs. <text> is required and must be inserted into the document. <div> can be omitted.
    • echo-block, echo-block-scholarly: <head> and <p> have direct counterparts in the DESpecs. <note> has the counterparts <mgl>, <mgr> and <fn>. The element <s> has no counterpart in the DESpecs. It must nevertheless be inserted into the document to make the document valid.

echo-start

<text>: The default type is "free".

echo-metadata

echo-metadata: There is no counterpart of the metadata in the DESpecs.

dcterms

  • <dcterms:identifier>

Bibliographic:

  • <dcterms:title> +, <dcterms:alternative> *
    • <dcterms:alternative> refines <dcterms:title>
  • <dcterms:creator> +, <dcterms:contributor> *
    • <dcterms:creator> refines <dcterms:contributor>
  • <dcterms:publisher> *
  • <dcterms:language> +
  • <dcterms:date> ?
  • <dcterms:description> *

License:

  • <dcterms:rights> *, <dcterms:license> *, <dcterms:accessRights>
    • <dcterms:license> may be text or a URI.
    • <dcterms:license> and <dcterms:accessRights> refine <dcterms:rights>
  • <dcterms:rightsHolder> *
  • <dcterms:provenance> *
  • <dcterms:dateCopyrighted> ?
    • <dcterms:dateCopyrighted> refines <dcterms:date>

Here + means "at least once", * "any number of times", ? "at most once". No symbol means "exactly once". The mandatory elements are therefore:

  • <dcterms:identifier>
  • <dcterms:title>
  • <dcterms:creator>
  • <dcterms:language>
  • <dcterms:accessRights>

On "refines": creator refines contributor, i.e. a creator is automatically also a contributor, but a contributor is not necessarily a creator. In other words: A refines B means that A is a subset of B. Another example: oak refines tree.
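
The occurrence symbols above can be checked mechanically. The following is a minimal sketch, not part of the schema itself: it only counts the dcterms children of a <metadata> element and compares the counts with the list above. The sample document and the un-namespaced <metadata> element are assumptions made for illustration; real validation is done against the schema.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# (min, max) occurrences per element, taken from the lists above;
# None means "no upper bound".
CARDINALITY = {
    "identifier": (1, 1),
    "title": (1, None),
    "alternative": (0, None),
    "creator": (1, None),
    "contributor": (0, None),
    "publisher": (0, None),
    "language": (1, None),
    "date": (0, 1),
    "description": (0, None),
    "rights": (0, None),
    "license": (0, None),
    "accessRights": (1, 1),
    "rightsHolder": (0, None),
    "provenance": (0, None),
    "dateCopyrighted": (0, 1),
}


def check_metadata(metadata):
    """Return a list of cardinality violations for a <metadata> element."""
    problems = []
    for name, (lo, hi) in CARDINALITY.items():
        count = len(metadata.findall(f"{{{DCTERMS}}}{name}"))
        if count < lo:
            problems.append(f"dcterms:{name}: expected at least {lo}, found {count}")
        elif hi is not None and count > hi:
            problems.append(f"dcterms:{name}: expected at most {hi}, found {count}")
    return problems


# Hypothetical sample: <dcterms:accessRights> is missing on purpose.
SAMPLE = """<metadata xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:identifier>ECHO:EXAMPLE</dcterms:identifier>
  <dcterms:title>Example title</dcterms:title>
  <dcterms:creator>Example author</dcterms:creator>
  <dcterms:language>la</dcterms:language>
</metadata>"""

problems = check_metadata(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```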

other metadata

  • <font>, <font-family>
    • echo.font-families <-- "song style" in echo-chinese-text
  • <echolink>, <echodir>

In general, there is no counterpart for <text> or <div> in the DESpecs.

echo-div

From DESpecs to schema:

  • <ind> --> <div type="index">
  • <toc> --> <div type="toc">
  • other types:
    • if the type is in the standard list: type="definition" type-free="界"
    • if it is not in the standard list: type="other" type-free="界/definition"
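
The rule for other types can be sketched as a small function. The standard list is not reproduced on this page, so the sets passed in below are hypothetical; the function name and parameters are likewise only illustrative.

```python
def div_attributes(label, english_type, standard_types):
    """Build the type/type-free attributes of a <div>.

    label          -- the type as it appears in the source text, e.g. "界"
    english_type   -- the English name of the type, e.g. "definition"
    standard_types -- the standard list of types (assumed to be given)
    """
    if english_type in standard_types:
        return {"type": english_type, "type-free": label}
    # Not in the standard list: fall back to "other" and keep both names.
    return {"type": "other", "type-free": f"{label}/{english_type}"}


# Hypothetical standard lists, one with and one without "definition".
print(div_attributes("界", "definition", {"chapter", "section", "definition"}))
print(div_attributes("界", "definition", {"chapter", "section"}))
```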

List of all div types that are handled in a special way:

  • among others, float
  • but also multiflow, parallel; chapter, section; ...

echo-block

  • Headings: <head>
    • <h> --> <head>
  • Semantic units: <s>
    • (<s> is not in the raw text)
  • Floating objects in <s> (all <note>, <handwritten>, <table>; most <figure>; some <math>) are replaced by <anchor> and moved to a <div type="float"> directly after the <p>. The new <div type="float"> contains all <note>, <handwritten>, <figure>, <table>, <math> that have been moved into it.
  • Notes: <note>
    • <mgl> --> <anchor type="note"/>, <note position="left">
    • <mgr> --> <anchor type="note"/>, <note position="right">
    • echo.note.content = echo.flexible.model to allow for different kinds of notes
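
As a rough illustration of the note handling described above, the sketch below replaces marginal notes in the raw text of one paragraph by anchors and collects them in a <div type="float"> after the <p>. It is heavily simplified: it assumes that the raw text writes marginal notes as <mgl>...</mgl> and <mgr>...</mgr>, it ignores the split into <s>, and it does not link anchors to notes.

```python
import re

# Assumed raw-text form of marginal notes: <mgl>...</mgl> and <mgr>...</mgr>.
MARGINAL = re.compile(r"<(mgl|mgr)>(.*?)</\1>", re.DOTALL)
POSITION = {"mgl": "left", "mgr": "right"}


def move_marginal_notes(paragraph_text):
    """Replace marginal notes by anchors and collect them in a float div."""
    notes = []

    def to_anchor(match):
        tag, content = match.group(1), match.group(2)
        notes.append(f'<note position="{POSITION[tag]}">{content}</note>')
        return '<anchor type="note"/>'

    body = MARGINAL.sub(to_anchor, paragraph_text)
    paragraph = f"<p>{body}</p>"
    if not notes:
        return paragraph
    # The float div follows the paragraph directly.
    return paragraph + '<div type="float">' + "".join(notes) + "</div>"


print(move_marginal_notes("Text <mgl>Nota</mgl> more"))
```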

echo-block-scholarly

  • <set-off>

echo-content

Most elements in this module have no counterpart in the Specs and will be added in the post-processing stage.

<emph>

<emph> for emphasis (should be used only when something is not tagged otherwise)

The tags _ _ (for italics), <bf>, <sc>, <_>, <^>, <ul>, <ol>, <st>, <red>, <sp> in the Specs are normally represented by <emph style="...">. The tags can be combined, e.g. <emph style="it bf"> for bold italics. To style a whole <s> or <p>, use the style attribute of that element (or of an element even higher in the hierarchy).

<reg>

Only the original text is regularized using echo.reg; typing conventions and additional typos in the transcription are silently resolved.

list of typing conventions in the DESpecs which are silently resolved:

  • $ --> ſ
  • \'q --> q + combining diacritic (U+0300 etc.) and normalization form C, for example q̀
  • ...
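
A minimal sketch of how the two conventions listed above could be resolved silently. The mapping from escape character to combining diacritic is an assumption: only the pairing of \' with U+0300 is taken from the example above, and further entries would have to be added from the DESpecs.

```python
import re
import unicodedata

# Illustrative mapping from escape character to combining diacritic.
# Only this one entry follows the example above; the rest is not listed here.
COMBINING = {"'": "\u0300"}

# A backslash, one escape character, and the letter that carries the diacritic.
ESCAPE = re.compile(r"\\(.)(\w)")


def resolve_typing_conventions(raw):
    """Resolve '$' and backslash escapes, then normalize to NFC."""
    text = raw.replace("$", "\u017f")  # $ --> ſ (long s)

    def combine(match):
        mark = COMBINING.get(match.group(1))
        if mark is None:
            return match.group(0)  # unknown escape: leave it untouched
        return match.group(2) + mark

    text = ESCAPE.sub(combine, text)
    return unicodedata.normalize("NFC", text)


print(resolve_typing_conventions("eiu$dem"))
print(resolve_typing_conventions("\\'q"))
```

Normalization form C matters for letters that have a precomposed form: with this mapping, \'e becomes the single code point U+00E8, whereas q keeps its separate combining character because no precomposed form exists.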

examples:

<reg orig="ijs" type="lig">ijs</reg> 
<reg orig="sphęrae" type="simple">sphaerae</reg> 
<reg orig="sphęrae">sphaerae</reg> 
<reg orig="sphę­ rae" type="simple">sphae­<lb/>rae</reg> 
<reg orig="eiuſdẽ" type="context">eiuſdem</reg> 
<reg orig="eſsẽt" type="context">eſsent</reg> 
<reg orig="lib." type="context">liber</reg> 
in <reg orig="lib." type="context">libro</reg> 
<reg orig="qñ" type="wordlist">quando</reg> 
<reg orig="tm̃" type="wordlist/context">tamen</reg> 
<reg orig="tm̃" type="wordlist/context">tantum</reg> 
<reg orig="Arist." type="unresolved">Arist.</reg> 
<reg orig="inrerrogas" type="typo" resp="paul">interrogas</reg> 
<reg orig="quem" type="conjecture" resp="paul">quam</reg> 
<reg orig="re ferre" type="conjecture" resp="paul">re­<lb/>ferre</reg> 
<reg orig="ꝑꝑ" type="unknown">ꝑꝑ</reg> 
<reg orig="ꝑꝑ" type="conjecture" resp="paul">prope</reg>

note:

  • the default type is "simple", e.g. <reg orig="sphęrae">sphaerae</reg>
    • Example outdated!
    • Note: the type currently cannot be omitted, and that is a good thing in case the <reg> elements have to be post-processed automatically.
  • the first example ijs applies only if ij is not silently resolved
  • missing hyphens are indicated by a soft hyphen "­" rather than <reg>; however, you may use "conjecture" in non-trivial cases
  • the generic "abbr" may be used for any abbreviation
  • abbreviations are not resolved within <ref>, e.g. ex <ref id="N400238">.19. lib. quinti Eu-<lb/>clid.</ref> (really?)

Text models

Avoiding recursions: what does the inline model hierarchy look like?

  • can start inline content: s, head, caption, description, variables, possibly note, handwritten, xhtml
  • in inline, with inline content (risk of recursion): s-set-off, ref, foreign, emph, q
  • in inline, with plaintext content: reg, sic, num, var, place, time
  • in inline, with itself as content: mml.math
  • in plaintext, with plaintext content: none
  • in plaintext, with text content: unsure (change the content to plaintext?)
  • in plaintext, empty: milestones, anchor, gap, unknown, wrong

Schematron rule that detects recursions, i.e.

  • e.g. <ref> in <ref>
  • e.g. <ref> in <foreign> in ... in <ref>

in summary: no element from this group may have itself as an ancestor.
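
The Schematron rule itself is not shown on this page. As a stand-in, the check it describes can be sketched in Python: walk the tree and report every element of the group that occurs inside an element of the same name. The sketch assumes un-namespaced element names and leaves out s-set-off, whose element name is not spelled out above.

```python
import xml.etree.ElementTree as ET

# Inline elements with inline content, as listed above (without s-set-off).
RECURSION_PRONE = {"ref", "foreign", "emph", "q"}


def find_recursions(element, ancestors=()):
    """Yield the names of group elements that have themselves as an ancestor."""
    if element.tag in RECURSION_PRONE and element.tag in ancestors:
        yield element.tag
    for child in element:
        yield from find_recursions(child, ancestors + (element.tag,))


ok = ET.fromstring("<s><ref>a <foreign>b <emph>c</emph></foreign></ref></s>")
bad = ET.fromstring("<s><ref>a <foreign>b <ref>c</ref></foreign></ref></s>")

print(list(find_recursions(ok)))
print(list(find_recursions(bad)))
```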

echo-content-scholarly

  • <ref>
  • <sic> for mistakes in the original text:
    • o<!> --> o<de:wrong/> --> <sic comment="n missing">o</sic> (see the discussion in echo-de)
  • <foreign>: Foreign text is not marked in the transcription, i.e. <foreign> cannot be inserted automatically without additional linguistic knowledge.
    • Exception: <rom>sentence</rom> --> <foreign xml:lang="la">sentence</foreign> with language "la" as a first guess, and similarly in Chinese text.
    • (echo.foreign has echo.core.attrib, but echo.language.attrib is obligatory)
  • Quotations:
    • <q> is for short inline quotes. Note that echo.delimiter.attrib is optional; however, please use it if possible.
    • <quote> (echo.quote) for longer inline quotes (one-sentence quotes are <quote><s>Sentence.</s></quote> and not <s><q>Sentence.</q></s>)
    • <quote> (echo.blockquote) for blockquotes
    • There can be no recursions of quote elements.

echo-gap

  • @@ --> <gap extent="2"/>
  • <gap> --> <gap/>
  • x< ? > --> x<unsure/> or <unsure>x</unsure> (this cannot be fully automated)
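
The first rule can be automated with a single substitution. This sketch assumes that every "@" stands for one missing unit, so that a run of n characters becomes extent="n"; the page itself only gives the case of two.

```python
import re


def mark_gaps(raw):
    """Replace each run of '@' by a <gap/> whose extent is the run length."""
    return re.sub(r"@+", lambda m: f'<gap extent="{len(m.group(0))}"/>', raw)


print(mark_gaps("ab@@cd"))
```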

echo-de

This module contains tags from the DESpecs that will be removed in the course of processing. We use the namespace "de" for the corresponding elements in the xml:

  • <001> --> <de:unknown code="001"/> (or rather, we have a table of what is meant)
  • <!> --> <de:wrong/> --> remove or <sic>
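
The first step of both rules, turning the raw tags into elements in the "de" namespace, could look as follows. The sketch assumes that unknown-character codes always consist of exactly three digits; the later decision between removing <de:wrong/> and inserting <sic> is not automated here.

```python
import re


def mark_de_tags(raw):
    """Convert raw <NNN> and <!> tags into de:unknown and de:wrong elements."""
    # <001> --> <de:unknown code="001"/> (three-digit codes are an assumption)
    text = re.sub(r"<(\d{3})>", r'<de:unknown code="\1"/>', raw)
    # <!> --> <de:wrong/>
    return text.replace("<!>", "<de:wrong/>")


print(mark_de_tags("a<001>b o<!>"))
```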

echo-figure

  • <fig> --> <figure>, possibly with <anchor/>
  • <cap> --> <caption>
  • <desc> --> <description>
  • <var> --> <variables>

echo-handwritten

In its simplest form, <handwritten> is just an empty tag. Nonetheless, within <s> it is replaced by <anchor> and moved to <div type="float"> to cater for scholarly additions, i.e. it is part of echo.float.class and not of echo.inline.class.

  • <hd> --> <handwritten/>, possibly with <anchor/>

New: the position of the handwritten text can be specified. All positions of <note> are allowed except "end"; in addition, "between lines" is available. Note that the position can be extracted automatically from the DESpecs-conformant raw text less reliably than for <note> and therefore requires more manual rework.

echo-float

echo-milestone

line breaks

[This section is certainly outdated!]

<lb/> can be in plaintext (<s>, <head>, some <note>, all members of echo.inline.class) and <p>

in <p>: since a paragraph is split into <s>, most line breaks are actually in <s>. However:

  • <lb/></s><s> and </s><s><lb/> shouldn't occur (--> </s><lb/><s> [and space before </s>?])
  • <lb/></s></p> shouldn't occur at all

in <s> (and similarly for <head> and the members of echo.inline.class):

  • line break --> <lb/>; no space before <lb/>; no line break after <lb/>; space after <lb/> if there is a hyphen before <lb/> (no automated space if the hyphen is missing)

examples:

  • <s>亦<lb/>能使人無疑。</s>
  • <note>Plutar <lb/>chus in <lb/>commẽ <lb/>tario de <lb/>dæmo-<lb/>nio So-<lb/>cratis.</note>

We use the normal hyphen U+002D instead of the soft hyphen U+00AD because the soft hyphen is not displayed in the xhtml. --> ?

column breaks

  • <col 1>...</col><col 2>...</col> --> ...<cb/>...
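
A minimal sketch of this conversion, assuming that columns are numbered from 1 and are not nested: the opening tag of the first column is dropped, every later opening tag becomes <cb/>, and all closing tags are removed.

```python
import re

COL_OPEN = re.compile(r"<col (\d+)>")


def columns_to_breaks(raw):
    """Turn <col n>...</col> pairs into a flat text with <cb/> milestones."""
    text = raw.replace("</col>", "")
    # The first column needs no break; each later column starts with <cb/>.
    return COL_OPEN.sub(lambda m: "" if m.group(1) == "1" else "<cb/>", text)


print(columns_to_breaks("<col 1>A</col><col 2>B</col>"))
```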

page breaks

[This section is certainly outdated!]

<pb/> can occur wherever <lb/> occurs (although it will be rare in <head>), and <div>

  • <pb vii><rh>xyz</rh> --> <pb n="10" o="vii" o-norm="7" rhead="xyz" xlink:href="URI"/>
  • <pb 一六七a> --> <pb n="..." o="一六七a" o-norm="167a" xlink:href="URI"/>

echo-attribute

Standard attributes are defined in echo-attribute.

Text properties:

  • echo.language.attrib (@xml:lang)
  • echo.style.attrib (@style):
    • directly in: <text>; <emph>, <num>, <var>, <w>, <place>, <time>, <person>
    • via echo.core.attrib in: <div>, <p>, <quote>, <note>, <handwritten>, <entry>; <reg>, <foreign>, <ref>, <q>
    • via echo.inline.attrib in: <head>, <s>, <caption>, <description>, <variables>, <form>, <translation>, <pronunciation>
    • in xhtml:* as @class
  • echo.id.attrib (@xml:id)
  • echo.core.attrib combines echo.language.attrib, echo.style.attrib and echo.id.attrib
  • echo.space.attrib (@xml:space="preserve")
  • echo.inline.attrib is echo.core.attrib plus echo.space.attrib

Div attributes:

  • echo.n.attrib (@n)
  • echo.level.attrib (@level)

Notes:

  • echo.symbol.attrib (@symbol)

Links:

  • echo.file.attrib (@file)
  • echo.internal-link.attrib (@xlink:href, @xlink:label, @xlink:type)
  • echo.external-link.attrib (@xlink:href)

Quotations:

  • echo.delimiter.attrib (@open, @close)

echo-datatype

echo-mathematics

  • number <num>:
    • "vii" --> <num value="7">vii</num>
    • "½" --> <num value="0.5">½</num>
  • variable <var>:
    • "AB" --> <var type="line">AB</var> (type is optional)
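
The value attribute of <num> in the two examples above can be computed as sketched below. The sketch covers only plain Roman numerals in standard letters and single Unicode number characters such as "½"; historical spellings (for instance a final "j") are not handled, and the function names are illustrative.

```python
import unicodedata

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'vii' or 'xiv' to an integer."""
    values = [ROMAN[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value before a larger one counts negative.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def num_element(token):
    """Wrap a number token in <num> with a computed value attribute."""
    if len(token) == 1 and not token.isalpha():
        value = unicodedata.numeric(token)  # numeric value of a single character
        value_text = str(int(value)) if value == int(value) else str(value)
    else:
        value_text = str(roman_to_int(token))
    return f'<num value="{value_text}">{token}</num>'


print(num_element("vii"))
print(num_element("½"))
```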

One function of <num> and <var> is to hide their content from the morphological analysis.

Note: The scope of echo.num and echo.var is very limited. More complex mathematics is expressed with MathML --> echo-import-mathml

echo-chinese-text

  • <ti> --> <head type="ti">
  • indentations in Chinese text:
    • <p ii> --> <p indent="2char"> or just "2"?
    • <p xx> --> <p indent="-2char">
    • (indent is deliberately not defined as style="valid css" because it may be semantically meaningful)
  • Lines:
    • <sl> --> <emph style="sl">
    • <dl> --> <emph style="dl">
    • <wl> --> <emph style="wl">
    • <cl> --> <emph style="cl">
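
The indentation rules in the list above can be sketched as follows. The sketch assumes that the number of "i" characters gives a positive indentation and the number of "x" characters a negative one; the page only shows the case of two characters, and the open question of "2char" versus "2" is not decided here.

```python
import re

# <p ii>, <p xx> etc.: a run of "i" (indent) or "x" (negative indent).
P_INDENT = re.compile(r"<p (i+|x+)>")


def convert_indent(raw):
    """Rewrite raw <p ii> / <p xx> tags as <p indent="..."> start tags."""

    def to_attribute(match):
        marks = match.group(1)
        sign = "-" if marks[0] == "x" else ""
        return f'<p indent="{sign}{len(marks)}char">'

    return P_INDENT.sub(to_attribute, raw)


print(convert_indent("<p ii>"))
print(convert_indent("<p xx>"))
```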

Small text:

  • in <p>: <sm> --> <small>
  • everywhere else: <emph style="sm"> (<h>, rhead, <ti>, <toc>, etc.)
  • \\ --> <smlb/>

(plus some manual corrections where this simple distinction doesn't fit, e.g. <sm>chen</sm>)

echo-gis

Note: this module is still experimental.

Note the metadata defined in this module.

echo-textflows

@flow is normally a number, or "footnote"

echo-import-xhtml

The xhtml modules are part of the Jing distribution:

The original rng files can be converted into the Relax NG compact syntax using Trang. Oxygen offers a GUI for this conversion.

We then adopt these modules without further changes. All adaptations are made in echo-import-xhtml.

xhtml-basic-table

We ignore Block.class in xhtml-basic-table. The following lines replace "Block.class |= table" in xhtml-basic-table:

echo.float.class  |= xhtml.table 
echo.anchor.types |= "table" 

xhtml-list

We ignore Block.class in xhtml-list. The following lines replace "List.class = ul | ol | dl" and "Block.class |= List.class":

echo.float.class  |= xhtml.ul | xhtml.ol | xhtml.dl 
echo.anchor.types |= "ul" | "ol" | "dl" 

Note that xhtml:ul, xhtml:ol, xhtml:dl do not correspond to <ul>, <ol>, <dl> in the DESpecs!

xhtml-attribs

We use the following xhtml elements: table, caption, tr, th, td; dl, dt, dd; ol, ul, li. All these elements have Common.attrib (th and td via Cell.attrib). xml:lang and class are already in Common.attrib, and we add xml:id { xsd:NCName } to it (however, Common.attrib already includes the attribute id { xsd:ID })

xhtml-datatypes

xhtml-text

This module is not imported. Instead, Inline.model and Flow.model are replaced by echo.flexible.content:

  • Inline.model = echo.flexible.content
  • Flow.model = echo.flexible.content

original definitions:

  • Inline.model = (text | Inline.class)*
  • Block.mix = Block.class
  • Block.model = Block.mix+
  • Flow.model = (text | Inline.class | Block.class)*

echo-import-mathml

Note: Simple mathematical terms, i.e. numbers and variables, are marked using echo.num and echo.var (defined in echo-mathematics).

mml.math.content allows arbitrary <mml:*> elements within <mml:math>.

This placeholder code above is good enough for the moment. We simply assume that the MathML parts are well-formed. This is plausible since the MathML code is created from a LaTeX formula by a MathML-converter.

In addition, Oxygen seems to have a separate validation engine for MathML.

Last modified on Aug 9, 2011, 5:28:33 PM