
High-level requirements for DE Specs

general principles

  • must specify a standard file encoding and unambiguous conventions for entering non-ASCII characters
  • a convention is needed for DE personnel to indicate and record unknown characters
  • character entry conventions must be ergonomic and within capabilities of DE firm
  • DE output must be plain text, but will not be well-formed XML
  • DE markup should be concise and unambiguous
  • DE markup should facilitate conversion to target structured XML document
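
The character-entry principles above can be illustrated with a minimal sketch. The two conventions used here are hypothetical and are not taken from the DE Specs: `{U+XXXX}` stands for a non-ASCII character entered by its Unicode code point, and `{?}` marks a character the DE operator could not identify. The function name `decode_entry` is likewise an assumption made for illustration.

```python
import re

# Hypothetical conventions, NOT taken from the DE Specs:
#   {U+00E9}  a non-ASCII character entered by its Unicode code point
#   {?}       a character the DE operator could not identify
ESCAPE = re.compile(r"\{(?:U\+([0-9A-Fa-f]{4,6})|(\?))\}")
UNKNOWN = "\uFFFD"  # Unicode REPLACEMENT CHARACTER stands in for unknowns


def decode_entry(line):
    """Return (decoded_line, positions_of_unknown_characters)."""
    out = []
    unknown_positions = []
    pos = 0
    for m in ESCAPE.finditer(line):
        # Copy the plain text that precedes this escape unchanged.
        out.append(line[pos:m.start()])
        if m.group(1) is not None:
            # chr raises ValueError for code points beyond U+10FFFF.
            out.append(chr(int(m.group(1), 16)))
        else:
            # Record where the unknown character lands in the decoded text.
            unknown_positions.append(sum(len(s) for s in out))
            out.append(UNKNOWN)
        pos = m.end()
    out.append(line[pos:])
    return "".join(out), unknown_positions
```

Because both escapes use only ASCII, a file entered this way stays plain text in any ASCII-compatible encoding, and the recorded positions let a reviewer locate every character that still needs to be resolved.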

required structural features

  • conventions are needed for standard line, paragraph, and page-level structure
  • markup needs to indicate not only where a feature starts, but also where it ends, unless automatic inference of the end location is trivial
  • must address headers/footers, notes (marginal, foot-, and end-), tables, lists, and figures
  • must support multi-column layouts
  • must indicate relation of text to commentary, where these are presented on the page together
  • must indicate emphasis (e.g. italics)
  • must indicate change of typestyle, where this is semantically significant
  • conventions are needed for abbreviations
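
The requirement that markup indicate both where a feature starts and where it ends can be checked mechanically before conversion to XML. The sketch below assumes a hypothetical marker syntax that is not taken from the DE Specs: `[i]...[/i]` for italics and `[fn]...[/fn]` for a footnote. Paragraphs are assumed to end at a blank line, so their end is trivially inferred and needs no closing marker. The function name `check_markers` is also an assumption.

```python
import re

# Hypothetical markers, NOT taken from the DE Specs:
#   [i]...[/i]    italics
#   [fn]...[/fn]  footnote (may span several lines)
MARKER = re.compile(r"\[(/?)(i|fn)\]")


def check_markers(text):
    """Return a list of (line_number, message) for unbalanced markers."""
    problems = []
    stack = []  # (name, line_number) of markers opened but not yet closed
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in MARKER.finditer(line):
            closing, name = m.group(1) == "/", m.group(2)
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            else:
                problems.append((lineno, f"unexpected [/{name}]"))
    # Anything left on the stack was opened but never closed.
    for name, lineno in stack:
        problems.append((lineno, f"[{name}] is never closed"))
    return problems
```

A transcription that passes this kind of check nests cleanly, which makes the later step of mapping markers to elements in the target XML document straightforward.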

expository aspects

  • conventions should be indicated in numbered sections
  • language needs to be kept simple and readable for Chinese employees
  • complex structural features should be illustrated with an example (or examples) from actual texts and desired transcription

coverage

  • DE is not appropriate where OCR would be more cost-effective
  • material needed by the Institute's scientists in the near future should be accommodated
  • version targets
    • DE Specs 1.0 should cover printed European books up to the nineteenth century
    • DE Specs 1.1 should add support for Chinese books
    • DE Specs 2.0 should also cover transcriptions of annotated matter or manuscripts made by students or other personnel
  • out of scope for DE Specs 1.0-2.0
    • specialized document types such as dictionaries
    • dramatic and verse literature
    • complex formal language content (e.g. mathematics, chemical formulae, musical notation)
    • documents such as notebooks, personal letters, and financial documents
    • twentieth-century material (perhaps with certain exceptions)
Last modified on Sep 19, 2008, 10:26:08 AM