Handling of unknown characters during data entry

The books sent for transcription sometimes contain characters that are unfamiliar to people who deal only with modern texts. Some of them are not even encoded in Unicode. The data entry firm is asked to replace each unknown character with a three-digit numerical code in angle brackets. When a work order is returned, a list of the unknown characters, giving the code and a small image of each character, is enclosed with the typed texts.

The list of unknown characters is global: it applies to every text in every work order. Before a new character is added, check that it is not already on the list.
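Because the list is global, a returned text can be checked for codes that have no entry on it. The following is only a minimal sketch, assuming that the codes appear in plain-text files as three ASCII digits between literal "<" and ">" (for example "<042>") and that the global list is available as a set of code strings. The function name and the data layout are hypothetical and are not taken from the DE Specs.

```python
import re

# Assumption: a code is exactly three ASCII digits in angle brackets, e.g. "<042>".
CODE_PATTERN = re.compile(r"<([0-9]{3})>")


def codes_missing_from_list(text, known_codes):
    """Return, in sorted order, the codes used in `text` that are not on the list.

    `known_codes` is a hypothetical stand-in for the global list of unknown
    characters: an iterable of three-digit code strings such as "001".
    """
    used = set(CODE_PATTERN.findall(text))
    return sorted(used - set(known_codes))
```

A non-empty result would point to a code that was typed but never documented with an image in the list.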

Thus, the workflow for handling unknown characters is as follows:

  1. Check whether the character is covered in section 3.2.1 of the DE Specs ("Characters to be Typed Directly").
  2. Check whether the character is included in section 3.2.2 of the DE Specs, which deals with combining characters (a set of common characters with a diacritic, typed using an escape sequence for the diacritic).
  3. Check whether the character is contained in section 3.4 or 4.3 of the DE Specs. These sections deal with ligatures in Latin and Greek.
  4. Check whether the character is documented in section 7 of the DE Specs, which shows common astronomical and technical symbols.
  5. Check whether the character is available somewhere in Unicode.
  6. Check whether the character is already in the list of unknown characters.
  7. If none of the above applies, the character is to be added to the list.

Handling of unknown characters in the later stages of the workflow

In the later stages of the workflow, the codes for the unknown characters are to be replaced by real characters. When the unknown character is an abbreviation, one way to do this is to use the reg-tag.

Scripts

Christopher Mielack wrote a Python script to compute the frequency of all unknown characters per text.
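The script itself is not reproduced on this page. As a rough illustration of the idea only, and not the actual script, a per-text frequency count might look like the sketch below. It assumes that each text is available as a string and that codes have the form "<NNN>" with three ASCII digits. All names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a code is exactly three ASCII digits in angle brackets, e.g. "<042>".
CODE_PATTERN = re.compile(r"<([0-9]{3})>")


def code_frequencies(text):
    """Count how often each unknown-character code occurs in one text."""
    return Counter(CODE_PATTERN.findall(text))


def frequencies_per_text(texts):
    """Map each text name to its code frequencies.

    `texts` is a hypothetical mapping from a text name to its content.
    """
    return {name: code_frequencies(content) for name, content in texts.items()}
```

Calling `code_frequencies(text).most_common()` on a single text would list its codes from most to least frequent, which may help decide which characters to resolve first.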

Another script still has to be written to replace the codes with real characters, where applicable.
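One possible shape for such a replacement script is sketched below. It is not an existing implementation. It assumes codes of the form "<NNN>" and a hypothetical mapping from code strings to real characters; codes without a mapping are left untouched, which is one reading of "where applicable".

```python
import re

# Assumption: a code is exactly three ASCII digits in angle brackets, e.g. "<042>".
CODE_PATTERN = re.compile(r"<([0-9]{3})>")


def replace_codes(text, replacements):
    """Replace each code that has an entry in `replacements`; keep the others.

    `replacements` is a hypothetical mapping such as {"001": "\u204a"},
    from a three-digit code string to the character that should replace it.
    """
    def substitute(match):
        # Fall back to the original "<NNN>" when no replacement is known.
        return replacements.get(match.group(1), match.group(0))

    return CODE_PATTERN.sub(substitute, text)
```

Leaving unmapped codes in place keeps them visible for later stages, for instance for characters that have no Unicode equivalent.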

Last modified on Jun 9, 2011, 12:26:20 PM