Preparing the documents for keying

Version 3 (modified by Klaus Thoden, 14 years ago)

Workflow

Viewing the images

Once it has been decided which books are to be typed, the scans of each book are examined. The following issues are important:

Quality of the scans

This issue is not strictly part of the workflow, as the library's digigroup normally takes care of it (most of the books are scanned at the MPIWG). Nevertheless, some scans may be faulty. It also happens that some JPG files get corrupted while being uploaded to the FTP server (or while being downloaded by the data entry firm). This affects about ten pages in a work order of 5,000 pages (roughly 0.2 %). It seems to be a problem of the JPG format.
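
A quick way to spot files that were cut off in transfer is to check the JPEG markers: a JPEG file starts with the bytes FF D8 and normally ends with FF D9. The following is only a minimal sketch of such a check. The function names and the `*.jpg` naming pattern are assumptions, not part of the existing workflow, and the test is a heuristic: a file with extra data after the end marker is flagged although it may display fine, and a file damaged in the middle is not detected.

```python
from pathlib import Path

SOI = b"\xff\xd8"  # start-of-image marker, first two bytes of every JPEG
EOI = b"\xff\xd9"  # end-of-image marker, normally the last two bytes


def looks_broken(data: bytes) -> bool:
    """Heuristic: True if the data lacks the JPEG start or end marker."""
    return not (data.startswith(SOI) and data.endswith(EOI))


def find_suspect_jpegs(folder: str) -> list[str]:
    """Return the names of *.jpg files in `folder` that fail the marker check."""
    # Assumption: the scans of one book sit in one folder as *.jpg files.
    return sorted(
        path.name
        for path in Path(folder).glob("*.jpg")
        if looks_broken(path.read_bytes())
    )
```

Running `find_suspect_jpegs` on the downloaded images of a work order gives a short list of pages to inspect by hand before asking for a re-upload.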

Structural organization of the book

The book in question has to be examined against the DE Specs document (the latest version can be found in the versioned file folder). The DE Specs should cover most of the cases that might occur in a book.

The instructions are kept simple and cover even the most basic questions of how a book should be typed. Anyone preparing a book should be familiar with the contents of the DE Specs.

It is not necessary to examine each page of a book closely. After the first few pages of the main matter it should be obvious how the book is structured. The rest can be viewed via the thumbnails (if one views the images using digilib, which is recommended because its links are used for the documentation in the wiki (see below)).

Looking at the thumbnails is usually enough to spot difficult parts of a book. One can easily tell pages with pure text from pages containing other structures, e.g. tables, indexes or images.

In pages with plain text, difficulties can essentially arise only from unknown or unreadable characters.

Printing, special characters

Spotting difficult characters is, of course, a task for which the page has to be examined very closely. The same goes for damage in the book, which may occur here and there. The DE Specs contain a flow chart that tells the data entry firm what to do with an unknown character (check whether it is in Unicode, check whether it is in the unknown characters list).
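
The two checks of the flow chart can be sketched as a small decision function. This is an illustration only: the outcome of each branch, the function name, and the representation of the unknown characters list as a mapping from a description to a placeholder code are assumptions, as the actual rules are defined in the DE Specs.

```python
import unicodedata


def decide(candidate, description, unknown_chars):
    """Return a (hypothetical) action for a character found on a page.

    candidate     -- the Unicode character the typist thinks matches, or None
    description   -- a short description of the glyph
    unknown_chars -- assumed form of the unknown characters list:
                     {description: placeholder code}
    """
    # Check 1: is the character in Unicode (does it have a Unicode name)?
    if candidate is not None and unicodedata.name(candidate, None) is not None:
        return ("type", candidate)
    # Check 2: is the character already in the unknown characters list?
    if description in unknown_chars:
        return ("use_listed_code", unknown_chars[description])
    # Otherwise (assumption): it becomes a new entry in the list.
    return ("add_to_list", description)
```

For example, `decide("\u017f", "long s", {})` takes the first branch, because the long s is a named Unicode character.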

Documenting in the wiki

The wiki is the central place for storing information about the texts. For each work order sent to the data entry firms, there should be an overview page. Each of these overview pages contains links to the images of every book in the work order. These pages are based on a template which documents the whole workflow. It contains the following elements:

  • links to the image scans and to the overview page of the work order
  • information on which version of the DE Specs was used and whether special instructions were given
  • when the images were sent and when the text was returned
  • information about expected difficulties
  • questions by the data entry firm and answers given
  • analysis of the results
  • post-processing

Expected difficulties

During the examination of the images (see above), difficult structures should be noted. This may apply to things that are not covered in the DE Specs and have to be dealt with in the future, or to things that have gone wrong a few times in the past. Links to the difficulties should be provided by using the features of the digilib tool.

The digilib tool is sometimes unresponsive, lacks features (buttons) or is not available at all (possibly due to heavy traffic). A possible workaround is to download the images from the server and view them directly. However, marking and linking things in the wiki is then not possible.

Special Instructions

To be added

Analysis of the results

At some point the data entry will be finished and the corresponding *.txt file will have been returned. That file may be accompanied by an updated version of the unknown characters list. This list should be examined to see which characters were added.
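
Finding the added characters amounts to comparing the old and the new version of the list. The sketch below assumes that the list is a plain text file with one entry per line. That format and the function name are assumptions; the real list may be organised differently.

```python
def added_entries(old_text: str, new_text: str) -> list[str]:
    """Return entries of the new list that are missing from the old one.

    Assumption: one entry per line; blank lines are ignored.
    The order of the new list is kept and duplicates are reported once.
    """
    old = {line.strip() for line in old_text.splitlines() if line.strip()}
    added = []
    for line in new_text.splitlines():
        entry = line.strip()
        if entry and entry not in old and entry not in added:
            added.append(entry)
    return added
```

The contents of the two files can be read with `open(path, encoding="utf-8").read()` and passed in; the returned entries are the ones to look at more closely and to note in the wiki.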

One obvious part of this analysis is, of course, to check how the data entry firm fared with the expected difficulties. This is done by looking up the passages in the received text file and judging how they were typed. In any case, the results should be written down in the wiki. Later on, a decision has to be made about what to do if there is an error in the typing. The possibilities are:

  • correct the mistake silently
  • ask the data entry firm to redo the part