2. Usage Guide
- General
- Sources of information and minimal schema
- Overview of the modules
- echo-start
- echo-metadata
- echo-div
- echo-block
- echo-block-scholarly
- echo-content
- echo-content-scholarly
- echo-gap
- echo-de
- echo-figure
- echo-handwritten
- echo-float
- echo-milestone
- echo-attribute
- echo-datatype
- echo-mathematics
- echo-chinese-text
- echo-gis
- echo-textflows
- echo-import-xhtml
- echo-import-mathml
A first version of the usage guide for the schema: PDF (as of April 2010; TO DO: transfer from LaTeX into the wiki and update)
General
Design decisions:
- one schema for all texts
- modules should be independent
- tags in the DESpecs should have some counterpart in the Schema, if possible
- however, do not mimic the DESpecs structure
Standard suffixes as in xhtml:
- .attrib (defined in echo-attribute)
- .datatype (defined in echo-datatype)
- .model (defined in echo-content)
- .class (defined in several modules)
IDs are expected, but XML texts should also validate without IDs
Sources of information and minimal schema
The core of the schema. "(--)" means: not minimal, but you get something in return that goes beyond the visual display of the text.
Level | minimal | China | Module | Element |
Coarse structure | + | -- | echo-start | <echo> |
Coarse structure | + | -- | echo-metadata | <metadata> |
Coarse structure | +/-- | -- | echo-metadata | metadata elements |
Coarse structure | + | -- | echo-div | <text> |
Intermediate structure | (--) | + | echo-div | <div index>, <div toc> |
Intermediate structure | (--) | -- | echo-div | <div chapter> etc. |
Fine structure | + | + | echo-block | <head>, <p>, <note> |
Fine structure | + | -- | echo-block | <s> |
Fine structure | -- | + | echo-block-scholarly | <quote> |
Fine structure | -- | -- | echo-block-scholarly | <set-off> |
Text | -- | + | echo-content | <emph> |
Text | (--) | -- | echo-content | <reg> |
Text | (--) | -- | echo-content-scholarly | <foreign>, <ref> |
Text | -- | -- | echo-content-scholarly | <sic>, <set-off>, <q> |
Auxiliary modules | + | n/a | echo-attribute | -- |
Auxiliary modules | + | n/a | echo-datatype | -- |
F: Figures | -- | + | echo-figure | <figure>, <caption> etc. |
F: Handwritten | -- | + | echo-handwritten | <handwritten> |
F: Chinese text | -- | + | echo-chinese-text | <head ti>, <p @indent>, <p @ics>, <pb @ics> |
F: Textflows | -- | + | echo-textflows | <head @flow>, <p @flow>, <div multiflow @flows> |
F: Tables | -- | + | echo-import-xhtml | <xhtml:table> |
F: Lists | -- | -- | echo-import-xhtml | <xhtml:ul> etc. |
F: Floats | -- | -- | echo-float | <div float> |
F: Images | -- | -- | echo-figure | <image> |
T: Milestones | (--) | + | echo-milestone | <pb> (also F), <lb>, <cb> |
T: Corruptions | -- | + | echo-gap | <gap>, <unsure> |
T: Corruptions | -- | + | echo-de | <de:unknown>, <de:wrong> |
T: Chinese notes | -- | + | echo-chinese-text | <lb halfline> |
T: Floats | -- | + | echo-float | <anchor> |
T: numbers etc. | (--) | -- | echo-mathematics | <num>, <var> |
T: formulas | (--) | -- | echo-import-mathml | <mml:math> |
T: Verse | -- | -- | echo-textflows | <lb @label> |
T: Images | -- | -- | echo-figure | <image> |
T: Gis | -- | -- | echo-gis | <place>, <time> |
G: Gis | -- | -- | echo-gis | <dcterms:temporal>, <dcterms:spatial> |
Overview of the modules
- echo is the main file of the schema. All modules are loaded here.
- echo-start, echo-metadata, echo-div, echo-block, echo-block-scholarly define the structure of the XML document from the root element <echo> down to the text level. Each element is defined in the following file:
- <echo>: echo-start.rnc
- <metadata>: echo-metadata.rnc
- <text>, <div>: echo-div.rnc
- <head>, <p>, <s>, <note>: echo-block.rnc
- echo-start: The root element <echo> has no counterpart in the DESpecs. It must be inserted into the document.
- echo-metadata: <metadata> and all metadata contained in it have no counterpart in the DESpecs. <metadata> and some of the metadata are required and must be inserted into the document. Other metadata are optional.
- echo-div: <text> and <div> likewise have no counterpart in the DESpecs. <text> is required and must be inserted into the document. <div> can be omitted.
- echo-block, echo-block-scholarly: <head> and <p> have direct counterparts in the DESpecs. <note> has the counterparts <mgl>, <mgr> and <fn>. The element <s> has no counterpart in the DESpecs. However, it must be inserted into the document to make the document valid.
- echo-content, echo-content-scholarly define the inline model for text. The text does not need any further markup for the document to be valid.
- echo-attribute, echo-datatype define standard attributes and datatypes. They are used by all modules.
- echo-gap, echo-de
- echo-figure
- echo-handwritten
- echo-milestone
- echo-float
- echo-mathematics
- echo-chinese-text
- echo-gis
- echo-textflows
- echo-import-xhtml
- echo-import-mathml
echo-start
<text>: The default type is "free".
echo-metadata
echo-metadata: There is no counterpart of the metadata in the DESpecs.
dcterms
- <dcterms:identifier>
Bibliographic:
- <dcterms:title> +, <dcterms:alternative> *
- <dcterms:alternative> refines <dcterms:title>
- <dcterms:creator> +, <dcterms:contributor> *
- <dcterms:creator> refines <dcterms:contributor>
- <dcterms:publisher> *
- <dcterms:language>+
- <dcterms:date> ?
- <dcterms:description> *
License:
- <dcterms:rights> *, <dcterms:license> *, <dcterms:accessRights>
- <dcterms:license> may be text or a URI.
- <dcterms:license> and <dcterms:accessRights> refine <dcterms:rights>
- <dcterms:rightsHolder> *
- <dcterms:provenance> *
- <dcterms:dateCopyrighted> ?
- <dcterms:dateCopyrighted> refines <dcterms:date>
Here + means "at least once", * "any number of times", ? "at most once". No symbol means "exactly once". The mandatory elements are therefore:
- <dcterms:identifier>
- <dcterms:title>
- <dcterms:creator>
- <dcterms:language>
- <dcterms:accessRights>
On "refines": creator refines contributor, i.e. a creator is automatically also a contributor, but a contributor is not necessarily a creator. In other words: A refines B means that A is a subset of B. Another example: oak refines tree.
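The occurrence rules above can be checked mechanically. The following is a minimal sketch, not part of the schema: it assumes an un-namespaced <metadata> element, uses the Dublin Core terms namespace http://purl.org/dc/terms/ for the dcterms prefix, and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# (min, max) occurrences per element, read off the list above; None = unbounded.
CARDINALITY = {
    "identifier": (1, 1),
    "title": (1, None),
    "alternative": (0, None),
    "creator": (1, None),
    "contributor": (0, None),
    "publisher": (0, None),
    "language": (1, None),
    "date": (0, 1),
    "description": (0, None),
    "rights": (0, None),
    "license": (0, None),
    "accessRights": (1, 1),
    "rightsHolder": (0, None),
    "provenance": (0, None),
    "dateCopyrighted": (0, 1),
}


def check_metadata(metadata):
    """Return a list of human-readable cardinality violations."""
    problems = []
    for name, (lo, hi) in CARDINALITY.items():
        count = len(metadata.findall(f"{{{DCTERMS}}}{name}"))
        if count < lo:
            problems.append(f"dcterms:{name}: expected at least {lo}, found {count}")
        elif hi is not None and count > hi:
            problems.append(f"dcterms:{name}: expected at most {hi}, found {count}")
    return problems


# Invented sample (assumption): <dcterms:accessRights> is deliberately missing.
SAMPLE = f"""
<metadata xmlns:dcterms="{DCTERMS}">
  <dcterms:identifier>example-id</dcterms:identifier>
  <dcterms:title>Example title</dcterms:title>
  <dcterms:creator>Example author</dcterms:creator>
  <dcterms:language>la</dcterms:language>
</metadata>
"""

problems = check_metadata(ET.fromstring(SAMPLE))
print(problems)
```

The real validation is done by the Relax NG schema itself; a script like this is only useful for a quick report on which mandatory metadata are still missing.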
other metadata
- <font>, <font-family>
- echo.font-families <-- "song style" in echo-chinese-text
- <echolink>, <echodir>
In general, there is no counterpart for <text> or <div> in the DESpecs.
echo-div
From DESpecs to schema:
- <ind> --> <div type="index">
- <toc> --> <div type="toc">
- other types:
- if the type is in the standard list: type="definition" type-free="界"
- if it is not in the standard list: type="other" type-free="界/definition"
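The rule for the "other types" can be sketched as a small function. The set STANDARD_DIV_TYPES below is a hypothetical excerpt for illustration; the actual standard list is defined in the schema, and the function name is not taken from any existing tool.

```python
# Hypothetical excerpt of the standard list (assumption); see the schema for the real one.
STANDARD_DIV_TYPES = {"index", "toc", "chapter", "section", "definition"}


def div_type_attributes(div_type, original_label, standard_types=STANDARD_DIV_TYPES):
    """Return the type / type-free attribute pair for a <div>.

    div_type:       the intended type, e.g. "definition"
    original_label: the label found in the text, e.g. "界"
    """
    if div_type in standard_types:
        return {"type": div_type, "type-free": original_label}
    # Unknown types fall back to "other" and keep the intended type in type-free.
    return {"type": "other", "type-free": f"{original_label}/{div_type}"}


in_list = div_type_attributes("definition", "界")
not_in_list = div_type_attributes("definition", "界", standard_types={"index", "toc"})
print(in_list)
print(not_in_list)
```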
List of all div types that are treated in a special way:
- among others, float
- but also multiflow, parallel; chapter, section; ...
echo-block
- Headings: <head>
- <h> --> <head>
- Semantic units: <s>
- (<s> is not in the raw text)
- Floating objects in <s> (all <note>, <handwritten>, <table>; most <figure>; some <math>) are replaced by <anchor> and moved to a <div type="float"> directly after the <p>. The new <div type="float"> contains all <note>, <handwritten>, <figure>, <table>, <math> that have been moved into this <div>.
- Notes: <note>
- <mgl> --> <anchor type="note"/>, <note position="left">
- <mgr> --> <anchor type="note"/>, <note position="right">
- echo.note.content = echo.flexible.model to allow for different kinds of notes
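The handling of floating objects in <s> described above can be sketched with ElementTree. This is an illustration under several assumptions: elements are un-namespaced, only direct children of <s> are considered, only the element types that are always floats (<note>, <handwritten>, <table>) are moved, and the anchor type is simply the element name (only type="note" is confirmed by the list above).

```python
import xml.etree.ElementTree as ET

# Elements that always float according to the text; <figure> and <math>
# need a case-by-case decision and are left alone in this sketch.
FLOAT_TAGS = {"note", "handwritten", "table"}


def move_floats(parent):
    """Move floats out of the <s> of each <p> child into a following <div type="float">."""
    # Walk backwards so that inserting a <div> does not shift pending indices.
    for index, p in reversed(list(enumerate(parent))):
        if p.tag != "p":
            continue
        moved = []
        for s in p.iter("s"):
            for pos, child in enumerate(list(s)):
                if child.tag in FLOAT_TAGS:
                    anchor = ET.Element("anchor", {"type": child.tag})
                    # The text following the float stays in the sentence.
                    anchor.tail = child.tail
                    child.tail = None
                    s[pos] = anchor
                    moved.append(child)
        if moved:
            float_div = ET.Element("div", {"type": "float"})
            float_div.extend(moved)
            parent.insert(index + 1, float_div)


doc = ET.fromstring("<text><p><s>Some text<note>A note.</note> goes on.</s></p></text>")
move_floats(doc)
print(ET.tostring(doc, encoding="unicode"))
```

After the call, the <s> contains an <anchor type="note"/> where the note used to be, and the note itself sits in a new <div type="float"> directly after the <p>.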
echo-block-scholarly
- <set-off>
echo-content
Most elements in this module have no counterpart in the Specs and will be added in the post-processing stage.
<emph>
<emph> for emphasis (should be used only when something is not tagged otherwise)
The tags _ _ (for italics), <bf>, <sc>, <_>, <^>, <ul>, <ol>, <st>, <red>, <sp> in the Specs are normally represented by <emph style="...">. The tags can be combined, e.g. <emph style="it bf"> for bold italics. If the style applies to a whole <s> or <p>, the style attribute is set on that element (or on an element even higher in the hierarchy).
<reg>
Only the original text is regularized using echo.reg; typing conventions and additional typos in the transcription are silently resolved.
List of typing conventions in the DESpecs which are silently resolved:
- $ --> ſ
- \'q --> q + combining diacritic (U+0300 etc.) and normalization form C, for example q̀
- ...
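The effect of "combining diacritic plus normalization form C" can be tried out with Python's standard unicodedata module. This is only an illustration of the Unicode behaviour, not the conversion script actually used; the function name is made up.

```python
import unicodedata


def resolve_diacritic(base, combining):
    """Append a combining diacritic to a base letter and normalize to NFC."""
    return unicodedata.normalize("NFC", base + combining)


# q has no precomposed form with a grave accent, so two code points remain.
q_grave = resolve_diacritic("q", "\u0300")
# e + combining grave is composed into the single code point U+00E8.
e_grave = resolve_diacritic("e", "\u0300")

print(len(q_grave), len(e_grave))
```

Normalizing to NFC keeps the text comparable: letters that have a precomposed form always end up in that form, and only letters such as q keep the separate combining character.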
examples:
<reg orig="ijs" type="lig">ijs</reg>
<reg orig="sphęrae" type="simple">sphaerae</reg>
<reg orig="sphęrae">sphaerae</reg>
<reg orig="sphę rae" type="simple">sphae<lb/>rae</reg>
<reg orig="eiuſdẽ" type="context">eiuſdem</reg>
<reg orig="eſsẽt" type="context">eſsent</reg>
<reg orig="lib." type="context">liber</reg>
in <reg orig="lib." type="context">libro</reg>
<reg orig="qñ" type="wordlist">quando</reg>
<reg orig="tm̃" type="wordlist/context">tamen</reg>
<reg orig="tm̃" type="wordlist/context">tantum</reg>
<reg orig="Arist." type="unresolved">Arist.</reg>
<reg orig="inrerrogas" type="typo" resp="paul">interrogas</reg>
<reg orig="quem" type="conjecture" resp="paul">quam</reg>
<reg orig="re ferre" type="conjecture" resp="paul">re<lb/>ferre</reg>
<reg orig="ꝑꝑ" type="unknown">ꝑꝑ</reg>
<reg orig="ꝑꝑ" type="conjecture" resp="paul">prope</reg>
note:
- the default type is "simple", e.g. <reg orig="sphęrae">sphaerae</reg>
- Example outdated!
- Note: the type currently cannot be omitted, and that is a good thing in case the <reg> elements have to be post-processed automatically.
- the first example (ijs) applies only if ij is not silently resolved
- missing hyphens are indicated by a soft hyphen (U+00AD) rather than <reg>; however, you may use "conjecture" in non-trivial cases
- the generic "abbr" may be used for any abbreviation
- abbreviations are not resolved within <ref>, e.g. ex <ref id="N400238">.19. lib. quinti Eu-<lb/>clid.</ref> (really?)
Text models
Avoiding recursions: what does the inline model hierarchy look like?
- can start inline content: s, head, caption, description, variables; possibly note, handwritten, xhtml
- in inline, with inline content (risk of recursion): s-set-off, ref, foreign, emph, q
- in inline, with plaintext content: reg, sic, num, var, place, time
- in inline, with itself as content: mml.math
- in plaintext, with plaintext content: does not exist
- in plaintext, with text content: unsure (change the content to plaintext?)
- in plaintext, empty: milestones, anchor, gap, unknown, wrong
A Schematron rule that detects recursions, i.e.
- e.g. <ref> in <ref>
- e.g. <ref> in <foreign> in ... in <ref>
In summary: no element from this group may have itself as an ancestor.
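Until such a Schematron rule exists, the same condition can be checked with a short script. The sketch below is an assumption-laden stand-in, not the Schematron rule itself: it works on un-namespaced elements and reads the group entry "s-set-off" as the element <set-off>.

```python
import xml.etree.ElementTree as ET

# Assumed group: the inline elements with inline content listed above.
RECURSIVE_GROUP = {"set-off", "ref", "foreign", "emph", "q"}


def find_recursions(element, ancestors=()):
    """Return the tags from RECURSIVE_GROUP that occur inside themselves."""
    hits = []
    if element.tag in RECURSIVE_GROUP and element.tag in ancestors:
        hits.append(element.tag)
    for child in element:
        hits.extend(find_recursions(child, ancestors + (element.tag,)))
    return hits


ok = ET.fromstring("<s><ref>a <foreign>b</foreign></ref></s>")
bad = ET.fromstring("<s><ref>a <foreign>b <ref>c</ref></foreign></ref></s>")

print(find_recursions(ok))   # []
print(find_recursions(bad))  # ['ref']
```

The second sample is the indirect case from the list above: <ref> in <foreign> in <ref>.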
echo-content-scholarly
- <ref>
- <sic> for mistakes in the original text:
- o<!> --> o<de:wrong/> --> <sic comment="n missing">o</sic> (see the discussion in echo-de)
- <foreign>: Foreign text is not marked in the transcription, i.e. <foreign> cannot be inserted automatically without additional linguistic knowledge.
- Exception: <rom>sentence</rom> --> <foreign xml:lang="la">sentence</foreign> with language "la" as a first guess, and similarly in Chinese text.
- (echo.foreign has echo.core.attrib, but echo.language.attrib is obligatory)
- Quotations:
- <q> is for short inline quotes. Note that echo.delimiter.attrib is optional; however, please use it if possible
- <quote> (echo.quote) for longer inline quotes (one-sentence quotes are <quote><s>Sentence.</s></quote> and not <s><q>Sentence.</q></s>)
- <quote> (echo.blockquote) for blockquotes
- There can be no recursion of quote elements.
echo-gap
- @@ --> <gap extent="2"/>
- <gap> --> <gap/>
- x< ? > --> x<unsure/> or <unsure>x</unsure> (this cannot be fully automated)
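The first rule can be sketched with a regular expression. Note the assumption: the list above only documents @@ --> <gap extent="2"/>; that a run of n "@" characters always maps to extent="n" is a generalization made for this example.

```python
import re


def convert_gaps(raw):
    """Replace each run of '@' by one <gap/> whose extent is the run length (assumed rule)."""
    return re.sub(r"@+", lambda m: f'<gap extent="{len(m.group(0))}"/>', raw)


converted = convert_gaps("ab@@cd")
print(converted)
```

The <unsure> case is deliberately left out, since the text states that it cannot be fully automated.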
echo-de
This module contains tags from the DESpecs that will be removed in the course of processing. We use the namespace "de" for the corresponding elements in the xml:
- <001> --> <de:unknown code="001"/> (or rather, we have a table stating what is meant)
- <!> --> <de:wrong/> --> remove or <sic>
echo-figure
- <fig> --> <figure>, possibly with <anchor/>
- <cap> --> <caption>
- <desc> --> <description>
- <var> --> <variables>
echo-handwritten
In its simplest form, <handwritten> is just an empty tag. Nonetheless, within <s> it is replaced by <anchor> and moved to <div type="float"> to cater for scholarly additions, i.e. it is part of echo.float.class and not of echo.inline.class
- <hd> --> <handwritten/>, possibly with <anchor/>
echo-float
echo-milestone
line breaks
[This section is certainly outdated!]
<lb/> can be in plaintext (<s>, <head>, some <note>, all members of echo.inline.class) and <p>
in <p>: since a paragraph is split into <s>, most line breaks are actually in <s>. However:
- <lb/></s><s> and </s><s><lb/> shouldn't occur (--> </s><lb/><s> [and space before </s>?])
- <lb/></s></p> shouldn't occur at all
in <s> (and similarly for <head> and the members of echo.inline.class):
- line break --> <lb/>; no space before <lb/>; no line break after <lb/>; space after <lb/> if there is a hyphen before <lb/> (no automated space if the hyphen is missing)
examples:
- <s>亦<lb/>能使人無疑。</s>
- <note>Plutar <lb/>chus in <lb/>commẽ <lb/>tario de <lb/>dæmo-<lb/>nio So-<lb/>cratis.</note>
We use the normal hyphen U+002D instead of the soft hyphen U+00AD because the soft hyphen is not displayed in the xhtml. --> ?
column breaks
- <col 1>...</col><col 2>...</col> --> ...<cb/>...
page breaks
[This section is certainly outdated!]
<pb/> can occur wherever <lb/> occurs (although it will be rare in <head>), and <div>
- <pb vii><rh>xyz</rh> --> <pb n="10" o="vii" o-norm="7" rhead="xyz" xlink:href="URI"/>
- <pb 一六七a> --> <pb n="..." o="一六七a" o-norm="167a" xlink:href="URI"/>
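For the Chinese example, the value of o-norm can be derived from o by replacing the digits one by one. The sketch below covers only this positional case; numerals such as 十 or 百 and Roman page numbers like "vii" are not handled, and the function names are made up. The values for n and xlink:href are not derived here and are passed in by the caller.

```python
# Positional digits only (assumption): 十, 百 etc. are not handled in this sketch.
CHINESE_DIGITS = {ch: str(i) for i, ch in enumerate("〇一二三四五六七八九")}


def normalize_chinese_page_number(original):
    """Replace each Chinese digit by its Arabic counterpart; keep everything else."""
    return "".join(CHINESE_DIGITS.get(ch, ch) for ch in original)


def page_break(n, original, href):
    """Build a <pb/> tag; n and href come from elsewhere and are passed through."""
    o_norm = normalize_chinese_page_number(original)
    return f'<pb n="{n}" o="{original}" o-norm="{o_norm}" xlink:href="{href}"/>'


print(page_break("...", "一六七a", "URI"))
```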
echo-attribute
Standard attributes are defined in echo-attribute.
Text properties:
- echo.language.attrib (@xml:lang)
- echo.style.attrib (@style):
- directly in: <text>; <emph>, <num>, <var>, <w>, <place>, <time>, <person>
- via echo.core.attrib in: <div>, <p>, <quote>, <note>, <handwritten>, <entry>; <reg>, <foreign>, <ref>, <q>
- via echo.inline.attrib in: <head>, <s>, <caption>, <description>, <variables>, <form>, <translation>, <pronunciation>
- in xhtml:* as @class
- echo.id.attrib (@xml:id)
- echo.core.attrib combines echo.language.attrib, echo.style.attrib and echo.id.attrib
- echo.space.attrib (@xml:space="preserve")
- echo.inline.attrib is echo.core.attrib plus echo.space.attrib
Div attributes:
- echo.n.attrib (@n)
- echo.level.attrib (@level)
Notes:
- echo.symbol.attrib (@symbol)
Links:
- echo.file.attrib (@file)
- echo.internal-link.attrib (@xlink:href, @xlink:label, @xlink:type)
- echo.external-link.attrib (@xlink:href)
Quotations:
- echo.delimiter.attrib (@open, @close)
echo-datatype
echo-mathematics
- number <num>:
- "vii" --> <num value="7">vii</num>
- "½" --> <num value="0.5">½</num>
- variable <var>:
- "AB" --> <var type="line">AB</var> (type is optional)
One function of <num> and <var> is to hide their content from the morphological analysis.
Note: The scope of echo.num and echo.var is very limited. More complex mathematics is expressed with MathML --> echo-import-mathml
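The two <num> examples above can be reproduced with a small helper. This is a sketch under assumptions: it handles only lower- or upper-case Roman numerals and single Unicode characters with a numeric value (looked up with unicodedata.numeric from the standard library); the function names are hypothetical.

```python
import unicodedata

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral, honouring subtractive notation such as 'iv'."""
    values = [ROMAN[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value directly before a larger one is subtracted.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def num_value(text):
    """Return the numeric value of a single numeric character or a Roman numeral."""
    if len(text) == 1:
        try:
            return unicodedata.numeric(text)  # e.g. "½" -> 0.5
        except ValueError:
            pass  # not a numeric character, try Roman below
    return roman_to_int(text)


def mark_number(text):
    """Wrap text in <num> with its computed value attribute."""
    return f'<num value="{num_value(text):g}">{text}</num>'


print(mark_number("vii"))
print(mark_number("½"))
```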
echo-chinese-text
- <ti> --> <head type="ti">
- indentations in Chinese text:
- <p ii> --> <p indent="2char"> or just "2"?
- <p xx> --> <p indent="-2char">
- (indent is deliberately not defined as style="valid css" because it may be semantically meaningful)
- Lines:
- <sl> --> <emph style="sl">
- <dl> --> <emph style="dl">
- <wl> --> <emph style="wl">
- <cl> --> <emph style="cl">
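The indentation mapping listed above (<p ii>, <p xx>) can be sketched as follows. The list only documents the two-letter cases; that the number of "i" or "x" letters equals the number of characters of indentation is an assumption made for this example, and the "2char" form is used although the list leaves open whether plain "2" would be preferable.

```python
import re


def convert_indent(raw_tag):
    """Map a raw tag such as '<p ii>' or '<p xx>' to a <p indent="..."> start tag."""
    match = re.fullmatch(r"<p (i+|x+)>", raw_tag)
    if match is None:
        return raw_tag  # not an indentation tag, leave untouched
    marks = match.group(1)
    # Assumed convention: "i" indents to the right, "x" is a negative indent.
    sign = "-" if marks[0] == "x" else ""
    return f'<p indent="{sign}{len(marks)}char">'


print(convert_indent("<p ii>"))
print(convert_indent("<p xx>"))
```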
Small text:
- in <p>: <sm> --> <small>
- everywhere else: <emph style="sm"> (<h>, rhead, <ti>, <toc>, etc.)
- \\ --> <smlb/>
(plus some manual corrections where this simple distinction doesn't fit, e.g. <sm>chen</sm>)
echo-gis
Note: this module is still experimental.
Note the metadata defined in this module.
echo-textflows
@flow is normally a number, or "footnote"
echo-import-xhtml
The xhtml modules are part of the Jing distribution:
- Relax NG Homepage
- Thai Open Source: Relax NG, xhtml modules
- Jing and Trang at Google Code
- xhtml modules in our local copy of the Jing distribution: basic-table.rng, list.rng, attribs.rng, datatypes.rng
The original rng files can be converted into the Relax NG compact syntax using Trang. Oxygen offers a GUI for this conversion.
We then adopt these modules without any further changes. All adaptations are made in echo-import-xhtml.
xhtml-basic-table
We ignore Block.class in xhtml-basic-table. The following lines replace "Block.class |= table" in xhtml-basic-table:
echo.float.class |= xhtml.table
echo.anchor.types |= "table"
xhtml-list
We ignore Block.class in xhtml-list. The following lines replace "List.class = ul | ol | dl" and "Block.class |= List.class":
echo.float.class |= xhtml.ul | xhtml.ol | xhtml.dl
echo.anchor.types |= "ul" | "ol" | "dl"
Note that xhtml:ul, xhtml:ol, xhtml:dl do not correspond to <ul>, <ol>, <dl> in the DESpecs!
xhtml-attribs
We use the following xhtml elements: table, caption, tr, th, td; dl, dt, dd; ol, ul, li. All these elements have Common.attrib (th and td via Cell.attrib). xml:lang and class are already in Common.attrib, and we add xml:id { xsd:NCName } to it (however, Common.attrib already includes the attribute id { xsd:ID })
xhtml-datatypes
xhtml-text
This module is not imported. Instead, Inline.model and Flow.model are replaced by echo.flexible.content:
- Inline.model = echo.flexible.content
- Flow.model = echo.flexible.content
original definitions:
- Inline.model = (text | Inline.class)*
- Block.mix = Block.class
- Block.model = Block.mix+
- Flow.model = (text | Inline.class | Block.class)*
echo-import-mathml
Note: Simple mathematical terms, i.e. numbers and variables, are marked using echo.num and echo.var (defined in echo-mathematics).
mml.math.content allows arbitrary <mml:*> elements inside <mml:math>.
This placeholder code above is good enough for the moment. We simply assume that the MathML parts are well-formed. This is plausible since the MathML code is created from a LaTeX formula by a MathML-converter.
In addition, Oxygen seems to have a separate validation engine for MathML.