Euclid 1966
Euclid (1966), Ji he yuan ben = Jihe yuanben, ECHO: viewer 1, viewer 2
Part of WO3, or rather a special work order on its own. Sent with DESpecs version 1.1.2.
Sent: ok/date. Returned: 2009-01-05, raw text. Came with an unknown characters list (<001> and <002>; see attachment "char code.pdf").
1. First Analysis
Difficulties
- The only Chinese text sent to Formax (in 2008)
Special Instructions
- see attachment
2. Questions From Formax
Question
Questions concerning figures and tables, see attachments "email.doc" and "berlin.doc".
Reply
The questions were answered in a few stages, which are not relevant. For the final instructions see "Euclid 1966.txt".
Final Instructions
Final version of the additional instructions for figures and tables: See attachment.
3. Analysis of the Result
We had requested a short sample and were content with the result.
Findings
Unknown characters: <001> appears once in the text and was not included in "ques.xls". It is a badly printed 轂 (U+8F42). It makes sense that they marked it as an unknown character, because only the context makes clear which character is meant.
<002> appears three times and has already appeared in "ques.xls". In "ques.pdf" we told them to transcribe it as U+2F88D. Although this Unicode code point is not included in most Unicode fonts (it is included in Sun-ExtB), we should make it clear for future work orders that they should find every character in Unicode 5.1.0. (Email from 2009-01-05: "There are Chinese characters that we cannot key, and we use <001> and <002> instead of them. We cannot key the word with code 2f88d, so we use <002> instead of it. We have installed the Unicode font, but we cannot find the CJK compatibility ideographs supplement (u+2F800-U+2FA1F) word, can you help us?")
Recommendation
4. Post-Processing
The first XML version of the text is here.
Emendations of the raw text
normalize the zero in modern page numbers
s!○!〇!g; # white circle U+25CB --> ideographic number zero U+3007
ignore outdentation in the preface
if ($line < 153) { s!<p x>!<p>!; }
pre-process notes that continue on the next line
...
fill in the unknown characters (MSi)
s!<001>!轂!g; # s!<001>!<unknown code="001" unicode="8F42">轂</unknown>!g;
s!<002>!<unknown code="002" unicode="2F88D">庶</unknown>!g; # the actual Unicode character U+2F88D breaks oXygen, so the standard form 庶 is used
clarify <?> (the list is not complete!)
s!<？>!<?>!; # line 811: fullwidth question mark U+FF1F --> ASCII question mark U+003F
s!愈<\?>!愈!g; # MSi: the reading is correct
s!丙、等。<\?>而戊丙丁、與甲乙丙、又等。!丙、等。而戊丙丁、與甲乙丙、又等。!; # line 1041 (MSi: it is in the middle of a sentence, but a period at this position is quite common nonetheless)
missing line breaks (the list is not complete!)
s!小於兩直角。則此二橫直線。!小於兩直角。則此二橫<lb/>直線。!; # line 403; may have to do with the neighboring figure
s!俱小於直角。或幷之小於兩直角。!俱小於直角。或幷之小<lb/>於兩直角。!; # line 404
normalize the hash in the table
s!＃!#!g; # fullwidth number sign U+FF03 --> ASCII hash, i.e. number sign U+0023
move the only table in the text (ECHO p.327) out of its surrounding sentence
s!却云十六與十二之比例。若!却云十六與十二之比例。!; # line 4562
s!八與三、及二與四之比例。!若<lb/>八與三、及二與四之比例。!; # line 4573
misc. emendations
s!N12<114608657010!N12x114608657010!; # line 5: replace "<" in library stamp junk
s!<pb 六><h>幾何原本 卷一之首</h>!<pb 六><rh>幾何原本 卷一之首</rh>!; # line 245 (obvious mistake)
s!<h>後支前己正論</h>!<p>後支前己正論</p>!; # line 2175 (Tian Miao: wrong tag)
if ($line == 2992) { s!<h>第三十四題</h>!<h>第十四題</h>!; } # line 2992 (obvious mistake)
Further processing steps
- metadata
- unknown characters
- figures
- the table on ECHO p.327
- ad hoc tagging of book covers, preface, chapters, chapter heads, chapter mains, backmatter
- page breaks
- The ECHO pages 215 to 220 duplicate pages 209 to 214 and have been typed only once
- headings
- notes in headings
- headings at the lowest level
- paragraphs
- normalize the periods: there should always be a period before </p> and </sm> (if the period is missing, insert an ASCII period)
- turn small text into notes
- tag sentences
- end each line with <lb/>
- outdented paragraphs
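The period normalization step in the list above could be sketched as follows. This is a minimal illustration, not the code from Euclid_1966.pl: the helper name normalize_periods is hypothetical, and it assumes that both the ideographic full stop 。 (U+3002) and the ASCII period count as an existing period.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not taken from Euclid_1966.pl):
# insert an ASCII period before </p> and </sm> unless the closing tag
# is already preceded by an ideographic (。) or ASCII (.) period.
# Assumes $line is a decoded character string, not raw bytes.
sub normalize_periods {
    my ($line) = @_;
    $line =~ s{(?<![。.])(</(?:p|sm)>)}{.$1}g;
    return $line;
}
```

The lookbehind only inspects the single character directly before the closing tag, so a paragraph that ends in another closing tag would also receive a period; whether that is wanted depends on the order of the processing steps.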
Some remaining issues
- The metadata may be incorrect in details.
- Replacing <001> with the correct character works fine, but <002> lies in a supplementary plane of Unicode and crashes oXygen, so I have used the simpler standard form of this character.
- Some <?> have been post-processed already, and I have removed the respective <?> tags because a <?> tag has no value in itself once the line has been checked. See below the list of lines which contain <?> and/or @ and haven't been post-processed yet.
- The <desc> and <var> in figures have not been used very consistently. (They did not use <cap> at all, but I only know of one figure where it would make sense to use it.)
- Figures and notes have no place attribute.
- No <num>, <var> (outside of figures), <ptr>, no IDs yet.
- The parts of problems with more than one part have not been encoded yet.
- Four books, i.e. four titles, one in the front (attribute n=1) and three in the body (n=1, 2, 3).
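Since characters outside the Basic Multilingual Plane (such as U+2F88D for <002>) caused trouble here, it may help to list them before a file is opened in an editor. The following is a minimal sketch; the helper name supplementary_chars is hypothetical, and it assumes the text has already been decoded from UTF-8 into a Perl character string.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Euclid_1966.pl):
# return the code points (as "U+XXXX" strings) of all characters
# above U+FFFF, i.e. in the supplementary planes, in order of appearance.
sub supplementary_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([\x{10000}-\x{10FFFF}])/g) {
        push @found, sprintf('U+%04X', ord $1);
    }
    return @found;
}
```

Applied line by line to the raw text, this would report every position where a supplementary-plane character was keyed, so that each one can be checked or replaced deliberately.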
List of remaining <?> and @
Lines which contain <?> and/or @ and haven't been post-processed yet.
12899: <var>九,十八,十二,三,二,十四,十八,四,八,卅六,廾<unsure/>四,六,四,廾<unsure/>八,卅六,八</var>
16981: <s>得<unsure/>
19053: <var>壬<unsure/>癸甲乙辛</var>
19606: <var>@甲午<unsure/>未</var>
21516: <var>癸辛壬<unsure/></var>
22359: <var>庚戊己壬<unsure/>癸丁甲乙辛丙</var>
10738: <var>甲乙丙@</var>
16549: <s>俱@此論。</s>
16550: <s>可@全題之旨。</s>
19606: <var>@甲午<unsure/>未</var>
21401: <var>辰@卯</var>
13927, 14004, 14495: In the meeting on 2009-02-26 we assumed that the @ in three "<s>@ <pb ..." lines stands for some circles on pages 279, 281 and 289 that shouldn't be typed. It turns out, however, that
<p>@ <pb ...
in the raw text denotes unreadable lines on pages 278, 280 and 288. In other words, the @ should be <gap>, but otherwise the tags in the transcription are correct.
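The replacement of these @ markers could be sketched as below. This is an illustration under assumptions, not the code actually used: the helper name mark_gap is hypothetical, the gap is written as an empty <gap/> element, and only an @ that directly follows an opening <p> or <s> and precedes a <pb is touched, so that the other remaining @ markers stay visible for review.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Euclid_1966.pl):
# turn "@ " into "<gap/> " when it stands between an opening <p> or <s>
# at the start of the line and a following <pb tag.
# Any other @ in the line is left untouched.
sub mark_gap {
    my ($line) = @_;
    $line =~ s{^(\s*<(?:p|s)>)\@ (?=<pb\b)}{$1<gap/> };
    return $line;
}
```

In the style of the other emendations, the call could additionally be restricted to the three known line numbers.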
Attachments (8)
- DESpecs_special_Euclid_1966_2.pdf (902.7 KB) - added by 16 years ago.
- email.doc (1.1 MB) - added by 16 years ago.
- berlin.doc (756.0 KB) - added by 16 years ago.
- Ques.pdf (46.5 KB) - added by 16 years ago.
- Char code.pdf (16.9 KB) - added by 16 years ago. Originally an xls file.
- Euclid 1966.txt (2.4 KB) - added by 16 years ago.
- Euclid_1966.pl (13.4 KB) - added by 16 years ago.
- figures_euclid_1966.html (384.1 KB) - added by 14 years ago. XQL result for echo:figure as HTML.