Version 25 (modified by Wolfgang Schmidle, 16 years ago)

Euclid 1966

Euclid (1966), Ji he yuan ben = Jihe yuanben, ECHO: viewer 1, viewer 2

This text is the Chinese translation of parts of Euclid/Clavius 1607.

Part of WO3, or rather a special work order on its own. Sent with DESpecs version 1.1.2.

Sent: ok/date. Returned: 2009-01-05, raw text. Came with an unknown characters list (<001> and <002>; see attachment "char code.pdf").

1. First Analysis

Difficulties

The only Chinese text sent to Formax (in 2008).

Special Instructions

see attachment

2. Questions From Formax

Question

Questions concerning figures and tables; see attachments "email.doc" and "berlin.doc".

Reply

The questions were answered in several stages; the intermediate replies are not relevant here. For the final instructions see "Euclid 1966.txt".

Final Instructions

Final version of the additional instructions for figures and tables: See attachment.

3. Analysis of the Result

We had requested a short sample and were content with the result.

Findings

Unknown characters: <001> appears once in the text and was not included in "ques.xls". It is a badly printed 轂 (U+8F42). It makes sense that they marked it as an unknown character, because only the context makes clear which character is meant.

<002> appears three times and has already appeared in "ques.xls". In "ques.pdf" we told them to transcribe it as U+2F88D. Although this Unicode code point is not included in most Unicode fonts (it is included in Sun-ExtB), we should make it clear for future work orders that they should find every character in Unicode 5.1.0. (Email from 2009-01-05: "There are Chinese characters that we cannot key, and we use <001> and <002> instead of them. We cannot key the word with code 2f88d, so we use <002> instead of it. We have installed the Unicode font, but we cannot find the CJK compatibility ideographs supplement (u+2F800-U+2FA1F) word, can you help us?")
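The difference between the two unknown characters can be checked mechanically. The following is a minimal Perl sketch, not part of Euclid_1966.pl; the helper names plane_of and in_compat_supplement are made up for illustration. It reports the Unicode plane of a character and whether it falls into the U+2F800 to U+2FA1F range named in the email.

```perl
use strict;
use warnings;

# Unicode plane of a character (0 = Basic Multilingual Plane).
sub plane_of {
    my ($char) = @_;
    return ord($char) >> 16;
}

# True if the character lies in the CJK Compatibility Ideographs
# Supplement block (U+2F800 to U+2FA1F), the range named in the email.
sub in_compat_supplement {
    my ($char) = @_;
    my $cp = ord($char);
    return $cp >= 0x2F800 && $cp <= 0x2FA1F;
}

# <001> is U+8F42, <002> is U+2F88D (code points taken from the text above).
for my $char ("\x{8F42}", "\x{2F88D}") {
    printf "U+%04X: plane %d, compatibility supplement: %s\n",
        ord($char), plane_of($char),
        in_compat_supplement($char) ? 'yes' : 'no';
}
```

Characters outside plane 0 are the ones that need a font and an input method covering the supplementary planes.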

Recommendation

4. Post-Processing

The first XML version of the text is here.

Emendations of the raw text

normalize the zero in modern page numbers

s!○!〇!g;  # white circle U+25CB --> ideographic number zero U+3007

ignore outdentation in the preface

if ($line < 153) { s!<p x>!<p>!; }
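The snippets in this section depend on a line counter $line. How the attached script maintains it is not shown here, so the following is only a sketch under the assumption that $line is the 1-based line number in the raw text. The names emend_line and emend_all are hypothetical.

```perl
use strict;
use warnings;
use utf8;    # the source contains literal CJK characters

# Apply the emendations to one line of raw text.
# Assumption: $line is the 1-based line number in the raw text.
sub emend_line {
    my ($text, $line) = @_;
    for ($text) {                           # alias $_ to $text
        s!○!〇!g;                           # U+25CB --> U+3007
        if ($line < 153) { s!<p x>!<p>!; }  # outdentation, preface only
    }
    return $text;
}

# Apply emend_line to a list of lines, counting from 1.
sub emend_all {
    my @lines = @_;
    my $line  = 0;
    return map { emend_line($_, ++$line) } @lines;
}
```

When reading from a file handle instead of a list, Perl's built-in $. variable could serve as the line counter. The input has to be decoded as UTF-8 first, otherwise the CJK patterns will not match.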

pre-process notes that continue on the next line

...

fill in the unknown characters (MSi)

s!<001>!轂!g;  # s!<001>!<unknown code="001" unicode="8F42">轂</unknown>!g;
s!<002>!<unknown code="002" unicode="2F88D">庶</unknown>!g; # standard form 庶 used here, because the actual character U+2F88D breaks oXygen
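If more unknown-character codes turn up in future work orders, a lookup table may be easier to maintain than one substitution per code. The following sketch is a variant, not the attached script: it wraps every listed code in an <unknown> element, as in the commented-out form for <001>, and leaves unlisted codes untouched. The names %unknown, unknown_tag and fill_unknown are hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Code from the unknown characters list => code point and replacement.
my %unknown = (
    '001' => { unicode => '8F42',  char => '轂' },
    '002' => { unicode => '2F88D', char => '庶' },  # standard form, see above
);

# Build the replacement for one code; unlisted codes stay as they are.
sub unknown_tag {
    my ($code) = @_;
    my $u = $unknown{$code} or return "<$code>";
    return qq{<unknown code="$code" unicode="$u->{unicode}">$u->{char}</unknown>};
}

# Replace every three-digit <NNN> marker in a decoded line of text.
sub fill_unknown {
    my ($text) = @_;
    $text =~ s!<(\d{3})>!unknown_tag($1)!ge;
    return $text;
}
```

The pattern only matches exactly three digits between angle brackets, so tags such as <pb 六> are not affected.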

clarify <?> (the list is not complete!)

s!<？>!<?>!; # line 811: fullwidth question mark U+FF1F --> ASCII question mark U+003F
s!愈<\?>!愈!g; # MSi: the reading is correct
s!丙、等。<\?>而戊丙丁、與甲乙丙、又等。!丙、等。而戊丙丁、與甲乙丙、又等。!; # line 1041 (the ? must be escaped in the pattern)
# (line 1041: MSi: It is in the middle of a sentence, but a period at this position is quite common nonetheless.)

missing line breaks (the list is not complete!)

s!小於兩直角。則此二橫直線。!小於兩直角。則此二橫<lb/>直線。!; # line 403; may have to do with the neighboring figure
s!俱小於直角。或幷之小於兩直角。!俱小於直角。或幷之小<lb/>於兩直角。!; # line 404

normalize the hash in the table

s!＃!#!g; # fullwidth number sign U+FF03 --> ASCII hash, i.e. number sign U+0023

move the only table in the text (ECHO p.327) out of its surrounding sentence

s!却云十六與十二之比例。若!却云十六與十二之比例。!;  # line 4562
s!八與三、及二與四之比例。!若<lb/>八與三、及二與四之比例。!; # line 4573

misc. emendations

s!N12<114608657010!N12x114608657010!; # line 5: replace "<" in library stamp junk
s!<pb 六><h>幾何原本　卷一之首</h>!<pb 六><rh>幾何原本　卷一之首</rh>!;  # line 245 (obvious mistake)
s!<h>後支前己正論</h>!<p>後支前己正論</p>!;  # line 2175 (Tian Miao: wrong tag)
if ($line == 2992) { s!<h>第三十四題</h>!<h>第十四題</h>!; }  # line 2992 (obvious mistake)

Further processing steps

metadata
unknown characters
figures
the table on ECHO p.327
ad hoc tagging of book covers, preface, chapters, chapter heads, chapter mains, backmatter
page breaks
- The ECHO pages 215 to 220 duplicate pages 209 to 214 and have been typed only once
headings
- notes in headings
- headings at the lowest level
paragraphs
- normalize the periods: there should always be a period before </p> and </sm> (if the period is missing, insert an ASCII period)
- turn small text into notes
- tag sentences
- end each line with <lb/>
- outdented paragraphs
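The period normalization in the list above can be sketched as a single substitution. This is an illustration, not the code in the attached script. It assumes that both the ideographic full stop 。 (U+3002) and the ASCII period count as an existing period; normalize_periods is a hypothetical name.

```perl
use strict;
use warnings;
use utf8;

# Insert an ASCII period before </p> and </sm> unless the closing tag
# is already preceded by "。" (U+3002) or ".".
sub normalize_periods {
    my ($text) = @_;
    $text =~ s{(?<![。.])(</(?:p|sm)>)}{.$1}g;
    return $text;
}
```

The lookbehind inspects only the single character directly before the closing tag. A </p> that directly follows another closing tag, for example </sm></p>, would therefore also receive a period; whether that is wanted has to be decided separately.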

Some remaining issues

The metadata may be incorrect in details.
Replacing <001> by the correct character works fine, but <002> is in a higher plane of Unicode and kills off oXygen, so I have used the simpler standard version of this character.
Some <?> have already been post-processed, and I have removed the respective <?> tags, because a <?> tag has no value in itself once the line has been checked. The list below gives the lines which contain <?> and/or @ and haven't been post-processed yet.
The <desc> and <var> in figures have not been used very consistently. (They did not use <cap> at all, but I only know of one figure where it would make sense to use it.)
Figures and notes have no place attribute.
No <num>, <var> (outside of figures), <ptr> or IDs yet.
The parts of problems with more than one part have not been encoded yet.
There are four books, i.e. four titles: one in the front (attribute n=1) and three in the body (n=1, 2, 3).

List of remaining <?> and @

Lines which contain <?> and/or @ and haven't been post-processed yet.

12899:  <var>九，十八，十二，三，二，十四，十八，四，八，卅六，廾<unsure/>四，六，四，廾<unsure/>八，卅六，八</var>
16981:  <s>得<unsure/>
19053:  <var>壬<unsure/>癸甲乙辛</var>
19606:  <var>@甲午<unsure/>未</var>
21516:  <var>癸辛壬<unsure/></var>
22359:  <var>庚戊己壬<unsure/>癸丁甲乙辛丙</var>

10738:  <var>甲乙丙@</var>
16549:  <s>俱@此論。</s>
16550:  <s>可@全題之旨。</s>
19606:  <var>@甲午<unsure/>未</var>
21401:  <var>辰@卯</var>
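A list like the one above can be regenerated after each round of post-processing. The following is a minimal sketch, assuming the text is available as a list of decoded lines and that open questions are marked by <?>, <unsure/> or @. The name find_open_markers is hypothetical.

```perl
use strict;
use warnings;

# Return "N:  line" for every line that still contains <?>, <unsure/> or @.
# Line numbers are 1-based.
sub find_open_markers {
    my @lines = @_;
    my @hits;
    my $n = 0;
    for my $text (@lines) {
        $n++;
        push @hits, sprintf('%d:  %s', $n, $text)
            if $text =~ m{<\?>|<unsure/>|\@};
    }
    return @hits;
}
```

A line that contains several markers is reported once. To obtain separate lists per marker, as above, the function could be called with one pattern at a time.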

13927, 14004, 14495: In the meeting on 2009-02-26 we assumed that the @ in three "<s>@ <pb ..." lines stands for some circles on pages 279, 281 and 289 that shouldn't be typed. It turns out, however, that

<p>@
<pb ...

in the raw text denotes unreadable lines on pages 278, 280 and 288. In other words, the @ should be <gap>, but otherwise the tags in the transcription are correct.
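Based on this finding, the @ in these places could be replaced automatically. The sketch below works on the raw text as one string and replaces an @ only when it stands alone after <p> and is directly followed by a line starting with <pb. Writing the tag as <gap/> and the name mark_gaps are assumptions made for illustration.

```perl
use strict;
use warnings;

# Replace "<p>@" by "<p><gap/>" when the next line starts with a <pb tag.
# Other occurrences of @ are left for manual checking.
sub mark_gaps {
    my ($raw) = @_;
    $raw =~ s{<p>\@(?=\n<pb\b)}{<p><gap/>}g;
    return $raw;
}
```

The remaining @ signs inside <var> and <s> elements are of a different kind and still have to be resolved by hand.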

Attachments (8)

DESpecs_special_Euclid_1966_2.pdf (902.7 KB) - added by Wolfgang Schmidle 16 years ago.
email.doc (1.1 MB) - added by Wolfgang Schmidle 16 years ago.
berlin.doc (756.0 KB) - added by Wolfgang Schmidle 16 years ago.
Ques.pdf (46.5 KB) - added by Wolfgang Schmidle 16 years ago.
Char code.pdf (16.9 KB) - added by Wolfgang Schmidle 16 years ago. originally as xls file
Euclid 1966.txt (2.4 KB) - added by Wolfgang Schmidle 16 years ago.
Euclid_1966.pl (13.4 KB) - added by Wolfgang Schmidle 16 years ago.
figures_euclid_1966.html (384.1 KB) - added by Klaus Thoden 14 years ago. XQL result for echo:figure as HTML
