Changes between Initial Version and Version 1 of Language specification


Timestamp:
Oct 21, 2009, 10:03:24 AM (15 years ago)
Author:
kthoden
Comment:

--

  • Language specification

    v1 v1  
     1{{{
     2#!html
     3
     4    <h1>Language Specification in Arboreal</h1>
     5
     6<p>Arboreal's language architecture (cf. <a
     7href="textarch.png">schematic</a>) requires that the language of text
     8be somehow specified. Here are the rules Arboreal uses:
     9
     10<ol>
     11<li>Language may be specified in the document metadata. Arboreal
     12determines the language by using the XPath query under the
     13<b>&lt;locator&gt;</b> tag in the <b>&lt;metadata&gt;</b> definition
     14in the docspec file. For Archimedes texts, the language is specified
     15in the <b>&lt;lang&gt;</b> section under <b>&lt;info&gt;.</b> E.g.:
     16
     17
     18<p><pre>
     19<b>&lt;info&gt;</b>
     20...
     21    <b>&lt;lang&gt;</b>it<b>&lt;/lang&gt;</b>
     22...
     23<b>&lt;/info&gt;</b>
     24</pre></p>
     25
     26<li>Any element (tag) may have a <b>lang</b> attribute. The value set
     27here applies to the entire subtree for which the element is the root
     28(unless the setting is overridden by a <b>lang</b> attribute of some
     29descendant node or nodes). This language setting overrides the
     30language (if any) that is specified in the document metadata. Note
     31that the language may be set for the entire document simply by
     32supplying a <b>lang</b> attribute for the root element. E.g.:
     33
     34
     35<p><pre>
     36<b>&lt;root lang=&quot;la&quot;&gt;</b>
     37...
     38</pre></p>
     39
     40<li>The text under certain elements is considered as a single unit,
     41called an <b>amalgamation</b>. Nodes to which this behavior applies
     42are called <b>container</b> nodes. The nodes considered containers are
     43enumerated under <b>&lt;containers&gt;</b> in the docspec file. (Also:
     44any node that is the root of a subtree containing only text nodes is
     45automatically considered a container node.) In the Archimedes <acronym
     46title="Document Type Definition">DTD</acronym>, <b>&lt;s&gt;</b> is a
     47container. The amalgamation belonging to a container may consist of
     48text in only a <i>single</i> language. In the case of multilingual
     49documents, however, it will sometimes be necessary for a container
     50(e.g., a sentence) to contain text in more than one language. To allow
     51for this possibility, elements may be defined as <b>subcontainers</b> in
     52the docspec file. Text that belongs to a subcontainer is treated as
     53the amalgamation of the subcontainer, not of the (parent) container.
     54In the Archimedes doctype, <b>&lt;foreign&gt;</b> is defined as a
     55subcontainer. Thus we can have something like:
     56
     57
     58<p><tt><b>&lt;s id=&quot;Academica2.18.3&quot;
     59lang=&quot;la&quot;&gt;</b>Cum enim ita negaret quidquam esse quod
     60comprehendi posset (id enim volumus esse <b>&lt;foreign
     61lang=&quot;el&quot;&gt;</b>a)kata/lhpton<b>&lt;/foreign&gt;</b>), si
     62illud esset, sicut Zeno definiret, tale visum (iam enim hoc pro
     63<b>&lt;foreign
     64lang=&quot;el&quot;&gt;</b>fantasi/a|<b>&lt;/foreign&gt;</b> verbum
     65satis hesterno sermone trivimus), visum igitur impressum effictumque
     66ex eo unde esset quale esse non posset ex eo unde non
     67esset...<b>&lt;/s&gt;</b></tt></p>
     68
     69<li>If no language is specified anywhere in the document, the document
     70is considered to be in the default language. This default may be set
     71in the <a href="https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/Arboreal/wiki/Configuration">preferences dialog</a>.
     72
     73</ol>
     74
     75<hr>
     76
     77<p>The code used for the language is always the two- or three-letter
     78code specified in <a
     79href="http://lcweb.loc.gov/standards/iso639-2/langcodes.html"><acronym
     80title="International Organization for Standardization">ISO</acronym> 639</a>.
     81These codes are <i>not</i> case-sensitive. The codes for languages
     82we're currently using are:</p>
     83
     84<p><blockquote>
     85<table border="yes">
     86<tr><td><code>ar</code></td><td>Arabic</td></tr>
     87<tr><td><code>de</code></td><td>German</td></tr>
     88<tr><td><code>en</code></td><td>English</td></tr>
     89<tr><td><code>el</code></td><td>Greek</td></tr>
     90<tr><td><code>fr</code></td><td>French</td></tr>
     91
     92<tr><td><code>it</code></td><td>Italian</td></tr>
     93<tr><td><code>la</code></td><td>Latin</td></tr>
     94<tr><td><code>zh</code></td><td>Chinese</td></tr>
     95</table>
     96</blockquote></p>
     97
     98<p>For a sample document that illustrates language embedding, see <a href="https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/Arboreal/attachment/wiki/Scrapbook/testbed.xml">testbed.xml</a>.</p>
     99
     100}}}
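The rules on this page can be sketched in a few lines of Python. This is not Arboreal's implementation, only an illustration under stated assumptions: the function names, the `info/lang` locator path, the `en` default, and the sample document are all hypothetical, and the rule that treats text-only subtrees as automatic containers is left out for brevity. The sketch resolves the language from the metadata, lets `lang` attributes override it down the tree, splits subcontainer text off into its own amalgamation, and lowercases codes because they are not case-sensitive.

```python
import xml.etree.ElementTree as ET

DEFAULT_LANG = "en"  # assumption: stands in for the default set in the preferences


def metadata_lang(root, locator="info/lang"):
    """Rule 1: read the language from the document metadata.

    `locator` is a hypothetical stand-in for the docspec's <locator> query.
    """
    node = root.find(locator)
    if node is not None and node.text and node.text.strip():
        return node.text.strip().lower()
    return None


def amalgamations(root, containers=("s",), subcontainers=("foreign",),
                  default=DEFAULT_LANG, locator="info/lang"):
    """Return (language, text) pairs, one per container or subcontainer."""
    # Rule 1, falling back to rule 4 (the default language).
    base = metadata_lang(root, locator) or default
    results = []

    def collect(elem, lang, parts):
        # Gather the text of one amalgamation into `parts`.
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            if child.tag in subcontainers:
                # Rule 3: subcontainer text forms its own amalgamation.
                sub_lang = child.get("lang", lang).lower()
                sub_parts = []
                collect(child, sub_lang, sub_parts)
                results.append((sub_lang, " ".join("".join(sub_parts).split())))
            else:
                # Assumption: other descendants stay in the parent's
                # amalgamation, which holds a single language.
                collect(child, lang, parts)
            if child.tail:
                parts.append(child.tail)

    def walk(elem, lang):
        # Rule 2: a lang attribute applies to the whole subtree.
        lang = elem.get("lang", lang).lower()
        if elem.tag in containers:
            parts = []
            collect(elem, lang, parts)
            results.append((lang, " ".join("".join(parts).split())))
        else:
            for child in elem:
                walk(child, lang)

    walk(root, base)
    return results


if __name__ == "__main__":
    sample = ET.fromstring(
        '<archimedes><info><lang>it</lang></info><text>'
        '<s lang="LA">id enim volumus esse '
        '<foreign lang="el">a)kata/lhpton</foreign> si illud esset</s>'
        '<s>Questo e un esempio</s>'
        '</text></archimedes>'
    )
    for lang, text in amalgamations(sample):
        print(lang, "|", text)
```

In this sketch a subcontainer's amalgamation is emitted before that of its enclosing container, simply because it is completed first; a real tool might prefer document order.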