
Version 12 (modified by Wolfgang Schmidle, 16 years ago)


Euclid 1966

Euclid (1966), Ji he yuan ben = Jihe yuanben ECHO

Part of WO3, or rather a special work order on its own. Sent with DESpecs version 1.1.2.

Sent: ok/date. Returned: 2009-01-05, raw text. Came with a list of unknown characters (<001> and <002>; see attachment "char code.pdf").

1. First Analysis

Difficulties

  • The only Chinese text sent to Formax

Special Instructions

  • see attachment

2. Questions From Formax

Question

Questions concerning figures and tables; see attachments "email.doc" and "berlin.doc".

Reply

The questions were answered in several stages; the intermediate steps are not relevant here. For the final instructions see "Euclid 1966.txt".

Final Instructions

Final version of the additional instructions for figures and tables: See attachment.

3. Analysis of the Result

We had requested a short sample and were content with the result.

Findings

Unknown characters: <001> appears once in the text and was not included in "ques.xls". It is a badly printed 轂 (U+8F42). It makes sense that they marked it as an unknown character, because the intended character becomes clear only from the context.

<002> appears three times and had already appeared in "ques.xls". In "ques.pdf" we told them to transcribe it as U+2F88D. This code point is missing from most Unicode fonts (it is included in Sun-ExtB); nevertheless, for future work orders we should make it clear that they are expected to look up every character in Unicode 5.1.0. (Email from 2009-01-05: "There are Chinese characters that we cannot key, and we use <001> and <002> instead of them. We cannot key the word with code 2f88d, so we use <002> instead of it. We have installed the Unicode font, but we cannot find the CJK compatibility ideographs supplement (u+2F800-U+2FA1F) word, can you help us?")
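
The block quoted in the email (U+2F800 to U+2FA1F) can be checked mechanically. The following is a minimal Python sketch, not part of the actual workflow, of how one might list the characters in a reference transcription that fall into that block or otherwise lie beyond U+FFFF, so that they can be pointed out to the vendor in advance. The function name `hard_to_key` and the two labels are hypothetical.

```python
# Block named in the vendor's email: CJK Compatibility Ideographs Supplement.
COMPAT_SUPPLEMENT = range(0x2F800, 0x2FA1F + 1)


def hard_to_key(text):
    """Return (position, 'U+XXXX', label) for characters that may be hard to key.

    The labels are illustrative: 'compat-supplement' for the block above,
    'supplementary-plane' for any other code point beyond U+FFFF.
    """
    findings = []
    for pos, ch in enumerate(text):
        cp = ord(ch)
        if cp in COMPAT_SUPPLEMENT:
            findings.append((pos, f"U+{cp:04X}", "compat-supplement"))
        elif cp > 0xFFFF:
            findings.append((pos, f"U+{cp:04X}", "supplementary-plane"))
    return findings


# U+8F42 (the reading of <001>) followed by U+2F88D (the code point behind <002>)
sample = chr(0x8F42) + chr(0x2F88D)
print(hard_to_key(sample))
```

Only the second character is reported, since U+8F42 lies in the Basic Multilingual Plane.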

Recommendation

4. Post-Processing

The first XML version of the text is here.

Emendations of the raw text

# normalize the zero in modern page numbers
s!○!〇!g;  # white circle U+25CB --> ideographic number zero U+3007

# ignore outdentation in the preface
if ($line < 153) { s!<p x>!<p>!; } 

# pre-process notes that continue on the next line	
...

# fill in the unknown characters (MSi)
s!<001>!轂!g;  # s!<001>!<unknown code="001" unicode="8F42">轂</unknown>!g;
s!<002>!<unknown code="002" unicode="2F88D">庶</unknown>!g; # the actual Unicode character 庶 breaks oXygen

# clarify <?> (the list is not complete!)
s!<？>!<?>!; # line 811: fullwidth question mark U+FF1F --> ASCII question mark U+003F
s!愈<\?>!愈!g; # MSi: the reading is correct
s!丙、等。<\?>而戊丙丁、與甲乙丙、又等。!丙、等。而戊丙丁、與甲乙丙、又等。!; # line 1041
# (line 1041: MSi: It is in the middle of a sentence, but a period at this position is quite common nonetheless.) 

# missing line breaks (the list is not complete!)
s!小於兩直角。則此二橫直線。!小於兩直角。則此二橫<lb/>直線。!; # line 403; may have to do with the neighboring figure
s!俱小於直角。或幷之小於兩直角。!俱小於直角。或幷之小<lb/>於兩直角。!; # line 404

# normalize the hash in the table
s!＃!#!g; # fullwidth number sign U+FF03 --> ASCII hash, i.e. number sign U+0023

# move the only table in the text (ECHO p.327) out of its surrounding sentence
s!却云十六與十二之比例。若!却云十六與十二之比例。!;  # line 4562
s!八與三、及二與四之比例。!若<lb/>八與三、及二與四之比例。!; #line 4573

# misc. emendations
s!N12<114608657010!N12x114608657010!; # line 5: replace "<" in library stamp junk
s!<pb 六><h>幾何原本 卷一之首</h>!<pb 六><rh>幾何原本 卷一之首</rh>!;  # line 245 (obvious mistake)
s!<h>後支前己正論</h>!<p>後支前己正論</p>!;  # line 2175 (Tian Miao: wrong tag)
if ($line == 2992) { s!<h>第三十四題</h>!<h>第十四題</h>!; }  # line 2992 (obvious mistake)
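
Several of the emendations above depend on the line number of the raw text. As an illustration only, here is a minimal Python sketch of how such line-conditional rules could be kept in one table and applied in order. It is not the script that was actually used: the names `RULES` and `emend` are hypothetical, only a handful of the rules are reproduced, and the patterns are treated as literal strings rather than regular expressions.

```python
# Each rule: (line condition or None, old text, new text, count).
# count = -1 replaces every occurrence (like /g); count = 1 only the first.
RULES = [
    (None, "\u25CB", "\u3007", -1),                 # white circle -> ideographic zero
    (lambda n: n < 153, "<p x>", "<p>", 1),         # ignore outdentation in the preface
    (None, "<\uFF1F>", "<?>", 1),                   # fullwidth question mark -> ASCII
    (None, "\uFF03", "#", -1),                      # fullwidth number sign -> ASCII hash
    (lambda n: n == 2992, "<h>第三十四題</h>", "<h>第十四題</h>", 1),
]


def emend(lines):
    """Apply every applicable rule, in order, to each line (numbered from 1)."""
    result = []
    for number, line in enumerate(lines, start=1):
        for applies, old, new, count in RULES:
            if applies is None or applies(number):
                line = line.replace(old, new, count)
        result.append(line)
    return result


print(emend(["一\u25CB\u25CB", "<p x>序"]))
```

Keeping the line condition next to each rule makes it easier to see which emendations are global and which are tied to a single place in the raw text.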

Further processing steps

# metadata

# unknown characters

# figures

# the table on ECHO p.327

# ad hoc tagging of book covers, preface, chapters, chapter heads, chapter mains, backmatter

# page breaks
	# The ECHO pages 215 to 220 duplicate pages 209 to 214 and were typed only once

# headings
	# notes in headings
	# headings at the lowest level

# paragraphs

	# normalize the periods: there should always be a period before </p> and </sm> 
	  (if the period is missing, insert an ASCII period)
	# turn small text into notes
	# tag sentences
	# end each line with <lb/>
	# outdented paragraphs
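
Two of the paragraph steps listed above (normalizing the periods and ending each line with <lb/>) are simple enough to sketch. The Python below is only an illustration under stated assumptions, not the actual post-processing script: it assumes that both the ideographic full stop 。 and the ASCII period count as "a period", and the function names are hypothetical.

```python
import re

# A closing </p> or </sm> that is not directly preceded by 。 or an ASCII period.
MISSING_PERIOD = re.compile(r"(?<![。.])(</(?:p|sm)>)")


def normalize_periods(line):
    """Insert an ASCII period before </p> and </sm> where no period is present."""
    return MISSING_PERIOD.sub(r".\1", line)


def end_with_lb(line):
    """Append <lb/> unless the line already ends with it."""
    return line if line.endswith("<lb/>") else line + "<lb/>"


print(normalize_periods("<p>甲乙丙</p>"))
print(end_with_lb("甲乙丙。"))
```

Note that this sketch would also put a period into an empty element such as <p></p>; a real script would need to decide how to treat that case.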

Attachments (8)