| 1 | {{{ |
| 2 | #!html |
| 3 | <h2 class="titleHead">DE Specs Working Group Meeting</h2> |
| 4 | <div class="author" ><span |
| 5 | class="pplr8t-x-x-120">Klaus Thoden</span></div> |
| 6 | <br /> |
| 7 | <div class="date" ><span |
| 8 | class="pplr8t-x-x-120">12 September 2008</span></div> |
| 9 | </div> |
| 10 | <h3 class="sectionHead"><span class="titlemark">1 </span> <a |
| 11 | id="x1-10001"></a>Introduction</h3> |
| 12 | <!--l. 30--><p class="noindent" >In this meeting, Wolfgang and Klaus presented their list of things that should be considered when writing |
| 13 | the DE Specs.<span class="footnote-mark"><a |
| 14 | href="#fn1x0" id="fn1x0-bk"><sup class="textsuperscript">1</sup></a></span><a |
| 15 | id="x1-1001f1"></a> |
| 16 | It was pointed out that the specifications should be fairly general to cover a large set of |
| 17 | books. |
| 18 | <!--l. 37--><p class="noindent" > |
| 19 | <h3 class="sectionHead"><span class="titlemark">2 </span> <a |
| 20 | id="x1-20002"></a>Things to be marked up</h3> |
| 21 | <!--l. 39--><p class="noindent" >Based on examples from ECHO, the following points were discussed. Structural markup |
| 22 | describes how text is organized on the page; positional markup describes how the text is |
| 23 | formatted. |
| 24 | <h4 class="subsectionHead"><span class="titlemark">2.1 </span> <a |
| 25 | id="x1-30002.1"></a>Structural markup</h4> |
| 26 | <!--l. 55--><p class="noindent" > |
| 27 | <h5 class="subsubsectionHead"><span class="titlemark">2.1.1 </span> <a |
| 28 | id="x1-40002.1.1"></a>Markup done by the digitizers</h5> |
| 29 | <!--l. 56--><p class="noindent" >The digitizers will mark up only a few things, mainly headings, |
| 30 | paragraphs, columns and marginal notes. All of these will be marked by beginning and end |
| 31 | tags. |
| 32 | <!--l. 60--><p class="indent" > Marginal notes should be written where they occur on the page so that they are already |
| 33 | roughly anchored to a certain place. |
| 34 | |
| 35 | |
| 36 | |
| 37 | <!--l. 63--><p class="indent" > When page numbers are found on the page, they will be put as an argument into the |
| 38 | header of the page break. Page breaks will be coded as milestones. |
| 39 | <!--l. 67--><p class="noindent" > |
| 40 | <h5 class="subsubsectionHead"><span class="titlemark">2.1.2 </span> <a |
| 41 | id="x1-50002.1.2"></a>Things to be ignored</h5> |
| 42 | <!--l. 68--><p class="noindent" >Catchwords and signatures at the bottom of the page will be ignored, because they do not |
| 43 | carry any useful information. |
| 44 | <!--l. 71--><p class="indent" > Sentences or other semantic units will not be marked up by the digitizers, because it is too |
| 45 | difficult. |
| 46 | |
| 47 | |
| 48 | |
| 49 | <!--l. 75--><p class="noindent" > |
| 50 | <h4 class="subsectionHead"><span class="titlemark">2.2 </span> <a |
| 51 | id="x1-60002.2"></a>Positional markup</h4> |
| 52 | <!--l. 77--><p class="noindent" > |
| 53 | <h5 class="subsubsectionHead"><span class="titlemark">2.2.1 </span> <a |
| 54 | id="x1-70002.2.1"></a>Ligatures</h5> |
| 55 | <!--l. 79--><p class="noindent" >The digitizers will be given a list of ligatures that shows how each one should be |
| 56 | resolved. |
| 57 | <h5 class="subsubsectionHead"><span class="titlemark">2.2.2 </span> <a |
| 58 | id="x1-80002.2.2"></a>Markup of special characters</h5> |
| 59 | <!--l. 82--><p class="noindent" >To spare the digitizers from typing too many tags, special characters could serve as a |
| 60 | shorthand for markup. For example, text in italics or small caps could be surrounded by underscores (_). Such |
| 61 | characters have to be used with care, as texts might actually contain them (especially books |
| 62 | from the 20th century). |
| 63 | <h5 class="subsubsectionHead"><span class="titlemark">2.2.3 </span> <a |
| 64 | id="x1-90002.2.3"></a>Punctuation, spatia and hyphens</h5> |
| 65 | <!--l. 89--><p class="noindent" >The spatia in the books are not consistent, be it between words, letters or letters and |
| 66 | punctuation. As a rule, the digitizers are told not to write a spatium before a punctuation mark, |
| 67 | even if there is one in the text. |
| 68 | <!--l. 93--><p class="indent" > As for spatia inside words, nothing can be done to get the digitizers to recognize words. |
| 69 | Such errors will have to be emended by NLP tools. This also applies to missing |
| 70 | hyphens. |
| 71 | <h5 class="subsubsectionHead"><span class="titlemark">2.2.4 </span> <a |
| 72 | id="x1-100002.2.4"></a>Physical damage</h5> |
| 73 | <!--l. 98--><p class="noindent" >Text might be rendered unreadable by folds, creases or even holes. In these cases, the digitizers |
| 74 | are supposed to mark these locations with a special tag. |
| 75 | <!--l. 102--><p class="noindent" > |
| 76 | <h3 class="sectionHead"><span class="titlemark">3 </span> <a |
| 77 | id="x1-110003"></a>Things to keep in mind</h3> |
| 78 | <ul class="itemize1"> |
| 79 | <li class="itemize">The specifications have to be clear and simple |
| 80 | </li> |
| 81 | <li class="itemize">You cannot code everything!</li></ul> |
| 82 | |
| 83 | |
| 84 | |
| 85 | <!--l. 109--><p class="noindent" > |
| 86 | <h3 class="sectionHead"><span class="titlemark">4 </span> <a |
| 87 | id="x1-120004"></a>Next steps</h3> |
| 88 | <!--l. 111--><p class="noindent" >A first draft version will be delivered on Friday, 19 September. Version 1.0 is due on 29 |
| 89 | September. |
| 90 | <!--l. 114--><p class="indent" > The authors themselves, as well as willing students, are going to type some text using the |
| 91 | DE Specs for evaluation purposes. |
| 92 | <div class="footnotes"><!--l. 33--><p class="indent" > <span class="footnote-mark"><a |
| 93 | href="#fn1x0-bk" id="fn1x0"><sup class="textsuperscript">1</sup></a></span><span |
| 94 | class="pplr8t-x-x-90">This wiki page shows the major issues:</span> |
| 95 | <br class="newline" /> <a |
| 96 | href="https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/mpdl-project-content/wiki/SampleTexts" class="url" ><span |
| 97 | class="pcrr8t-x-x-90">https://itgroup.mpiwg-berlin.mpg.de:8080/tracs/mpdl-project-content/wiki/SampleTexts</span></a> </div> |
| 98 | |
| 99 | </body></html> |
| 100 | |
| 101 | }}} |