= Euclid 1966 =

Euclid (1966), Ji he yuan ben = Jihe yuanben [http://libcoll.mpiwg-berlin.mpg.de/elib/all_documents/02NT95YF ECHO]

Part of [wiki:WO3], or rather a special work order of its own. Sent with DESpecs version 1.1.2.

Sent: ok/date. Returned: 2009-01-05, [http://pythia.mpiwg-berlin.mpg.de/department1/mpdl/raw-texts/WO3_Euclid_1966.txt/WO3_Euclid_1966_V1.txt raw text]. Came with an unknown characters list (<001> and <002>; see attachment "char code.pdf").

== 1. First Analysis ==

=== Difficulties ===

 * The only Chinese text sent to Formax

=== Special Instructions ===

 * see attachment

== 2. Questions From Formax ==

=== Question ===

Questions concerning figures and tables, see attachments "email.doc" and "berlin.doc".

=== Reply ===

The questions were answered in several stages, which are not relevant here. For the final instructions see "Euclid 1966.txt".

=== Final Instructions ===

Final version of the additional instructions for figures and tables: See attachment.

== 3. Analysis of the Result ==

We had requested a short sample and were content with the result.

=== Findings ===

Unknown characters:

<001> appears once in the text and hasn't been included in "ques.xls". It is a badly printed 轂 (U+8F42). It makes sense that they marked it as an unknown character, because the intended character only becomes clear from the context.

<002> appears three times and has already appeared in "ques.xls". In "ques.pdf" we told them to transcribe it as U+2F88D. Although this code point is missing from most Unicode fonts (it is included in Sun-ExtB), we should make it clear for future work orders that they are expected to find every character in Unicode 5.1.0.

(Email from 2009-01-05: "There are Chinese characters that we cannot key, and we use <001> and <002> instead of them. We cannot key the word with code 2f88d, so we use <002> instead of it.
We have installed the Unicode font, but we cannot find the CJK compatibility ideographs supplement (u+2F800-U+2FA1F) word, can you help us?")

=== Recommendation ===

== 4. Post-Processing ==

The first XML version of the text is [http://pythia.mpiwg-berlin.mpg.de/department1/mpdl/raw-texts/WO3_Euclid_1966.txt/WO3_Euclid_1966_V2.txt here].

=== Emendations of the raw text ===

normalize the zero in modern page numbers
{{{
s!○!〇!g; # white circle U+25CB --> ideographic number zero U+3007
}}}

ignore outdentation in the preface
{{{
if ($line < 153) { s!

!

!; }
}}}

pre-process notes that continue on the next line
{{{
...
}}}

fill in the unknown characters (MSi)
{{{
s!<001>!轂!g;
# s!<001>!!g;
s!<002>!!g; # the actual Unicode character 庶 breaks oXygen
}}}

clarify (the list is not complete!)
{{{
s!<？>!<?>!; # line 811: fullwidth question mark U+FF1F --> ASCII question mark U+003F
s!愈<\?>!愈!g; # MSi: the reading is correct
s!丙、等。而戊丙丁、與甲乙丙、又等。!丙、等。而戊丙丁、與甲乙丙、又等。!; # line 1041
# (line 1041: MSi: It is in the middle of a sentence, but a period at this position is quite common nonetheless.)
}}}

missing line breaks (the list is not complete!)
{{{
s!小於兩直角。則此二橫直線。!小於兩直角。則此二橫直線。!; # line 403; may have to do with the neighboring figure
s!俱小於直角。或幷之小於兩直角。!俱小於直角。或幷之小於兩直角。!; # line 404
}}}

normalize the hash in the table
{{{
s!＃!#!g; # fullwidth number sign U+FF03 --> ASCII hash, i.e. number sign U+0023
}}}

move the only table in the text (ECHO p.327) out of its surrounding sentence
{{{
s!却云十六與十二之比例。若!却云十六與十二之比例。!; # line 4562
s!八與三、及二與四之比例。!若八與三、及二與四之比例。!; # line 4573
}}}

misc. emendations
{{{
s!N12<114608657010!N12x114608657010!; # line 5: replace "<" in library stamp junk
s!幾何原本 卷一之首!幾何原本 卷一之首!; # line 245 (obvious mistake)
s!後支前己正論!

後支前己正論

!; # line 2175 (Tian Miao: wrong tag)
if ($line == 2992) { s!第三十四題!第十四題!; } # line 2992 (obvious mistake)
}}}

=== Further processing steps ===

 * metadata
 * unknown characters
 * figures
 * the table on ECHO p.327
 * ad hoc tagging of book covers, preface, chapters, chapter heads, chapter mains, backmatter
 * page breaks
 * The ECHO pages 215 to 220 reduplicate the pages 209 to 214 and have been typed only once
 * headings
 * notes in headings
 * headings at the lowest level
 * paragraphs
 * normalize the periods: there should always be a period before

and (if the period is missing, insert an ASCII period)
 * turn small text into notes
 * tag sentences
 * end each line with
 * outdented paragraphs

=== Some remaining issues ===

 * The metadata may be incorrect in details.
 * Will the in figures collide with the normal ?
 * Replacing <001> by the correct character works fine, but <002> is in a higher plane of Unicode and kills off oXygen, so I have used the simpler standard version of this character.
 * Some have been post-processed already, and I have removed the respective tags because a tag has no value in itself once the line has been checked. I have compiled a list of lines which contain and/or @ and haven't been post-processed yet. ("。" means that the period is unclear. And one artifact sentence "。.")
 * The and in figures have not been used very consistently. (They did not use at all, but I only know of one figure where it would make sense to use it.)
 * Figures and note have no place attribute.
 * No , (outside of figures), , no IDs yet.
 * The parts of problems with more than one part have not been encoded yet.
 * Four books, i.e. four titles, one in the front (attribute n=1) and three in the body (n=1, 2, 3).
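
Two of the issues noted on this page, leftover unknown-character placeholders such as <001> and characters beyond the Basic Multilingual Plane such as U+2F88D, could be checked for mechanically before a file is opened in an editor. The following is only a minimal sketch in Python rather than the Perl used for the emendations; the function name `audit_lines`, the three-digit placeholder pattern and the sample lines are assumptions made for illustration and are not part of the actual post-processing scripts.

```python
import re

# Assumed placeholder shape, taken from the <001>/<002> markers on this page:
# "<", exactly three digits, ">".
PLACEHOLDER = re.compile(r"<\d{3}>")


def audit_lines(lines):
    """Return (line_number, kind, value) tuples for every finding.

    kind is "placeholder" for a leftover <NNN> marker and "non-BMP" for
    a character whose code point lies above U+FFFF.
    """
    findings = []
    for number, line in enumerate(lines, start=1):
        # Leftover markers for characters the typists could not key.
        for match in PLACEHOLDER.finditer(line):
            findings.append((number, "placeholder", match.group()))
        # Characters in the supplementary planes (e.g. U+2F88D).
        for char in line:
            if ord(char) > 0xFFFF:
                findings.append((number, "non-BMP", f"U+{ord(char):04X}"))
    return findings


# Made-up sample lines, not taken from the Euclid text.
sample = ["甲<001>乙", "丙\U0002F88D丁", "幾何原本"]
for finding in audit_lines(sample):
    print(finding)
```

Run over the raw text line by line, such a check would give a list of line numbers to revisit, in the same spirit as the "line NNN" comments in the emendations above.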