wiki:OCR_evaluation

Version 7 (modified by Klaus Thoden, 13 years ago)


The workflow will be adapted to accept OCRed text as input. The OCR engine will be OCRopus.

Tutorial video and other videos

The documents from the previous workflows were assessed for how well they are likely to perform when OCRed.

Command overview

The following commands (taken from the video above) recognise English text:

  1. ocropus-binarize 035.jpg
  2. ocropus-pseg book/????.png
  3. ocropus-lattices -m OCRopus/ocropy/2m2-reject.cmodel book/0001/??????.png
  4. ocropus-align -l OCRopus/ocropy/data/default.fst book/0001/??????.fst
  5. ocropus-hocr book/
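The five steps above can be chained in a small driver script. The following is only a sketch: it reproduces the commands exactly as listed, expands the `?` wildcards at the moment each step runs (so files produced by an earlier step are picked up), and defaults to a dry run that just prints the commands. The names `STEPS`, `expand` and `run_pipeline` are made up for illustration, and the paths are the ones used on this page.

```python
import glob
import subprocess

# Each step: (fixed arguments, optional glob pattern for the input files).
# Commands and paths are copied from the list above; nothing is added.
STEPS = [
    (["ocropus-binarize", "035.jpg"], None),
    (["ocropus-pseg"], "book/????.png"),
    (["ocropus-lattices", "-m", "OCRopus/ocropy/2m2-reject.cmodel"],
     "book/0001/??????.png"),
    (["ocropus-align", "-l", "OCRopus/ocropy/data/default.fst"],
     "book/0001/??????.fst"),
    (["ocropus-hocr", "book/"], None),
]


def expand(step):
    """Return the full argument list for one step, expanding its glob now."""
    argv, pattern = step
    if pattern is None:
        return list(argv)
    return list(argv) + sorted(glob.glob(pattern))


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for step in STEPS:
        argv = expand(step)
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would actually invoke the OCRopus tools, which must then be on the `PATH`.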

First results

A page of Bacon 1670 (link) was OCRed with the commands above. The output is poor (xhtml file attached):

116
tha| yet it cannot be for other - Kcafo+: - ~r f1r(l, it is found 9tit upon
ccrta;n Trccs; and thofe Trces bear no fuch Fruit, os may allure that |rd to
B|rrics, and fo is of(cn found thcrc| which may have givcn occaf1on tothc
tale. But that which maketh an cnd of th|quc(lion is, that M1(Tcltoe hath
w|nld go into a Bv@gh( ?nd bcf1dcs/it feemcrh to bc morcfat and un@uous,
th1n the ordinary Sap of thc T!5c| both py thc Bcrry which is clammy, and
by that it connnucth grccn winter and Summcr, which Ihc Trcc do(h
not.
tryal would bc m-c,b|riPPing vf thc Bough of aCrab~cc inthe Ba~and
likely, to try it with fomeothcr watering or anointing, . that w7c notfonatu-
things askill not me Bongh.
1t wcrc good to try, what T\&nM would put forth, @ thcy bc fo|biddcn
Io p@tfottb thcir natn|l Boughs: powl thcreforc a Tree; .andcover it,(omc
will Dut forth Roots 9 for fo will a Cions, bcir@ turncd down into clav.
fhat is .not fo f1atnral to thcplant as Clay isg tryit with Lcathcr,or Clotb, or
becn knodn t|gtow o\&t of apol9rd-
.A Man may count the prickcs / Trccs to bcakinde d Excrcfccnce,for
the/will ncvcr be@onghs| nor be Lcaves. @c plants that have p.ricklcs,
are Thorns, Black and whitc. s Bryer, Rofc, Ltmmon"trecsv Crab-trccq
Pricklcs in thc Lc2f .are, H9lly5 Juniper, vhin .bu(h, ~hi(bc i Ncttlcs alfo
have a fiall vencmous Fr~Ici fo hath Borrage, but harmlefs. Thc . @ufc
For the ha@c of the5pirit to put "Th, a@d the w|9t ot nourimmcty toput
forth a Bough, and thc clof~fs of thc Balks caufc Frickles .in Bonghsg 0nd
the Leav@s othcrw.!fcarc Wygh,as Burragc and Nettles are. As (or thc Leavcs
of Holl)|, they arefmooth, but nevcr Plain, bnr asit wcrc with folds for thc2
amc caufc.
56o. Therc be alfo Tlots, thavhough the? havc no ?~kles, yet they havea
kinde of Downey 9r Velvct Rine upop their Leafes; as Refe-C\<mpi0@ |te~
Cdrh2isns,,
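To put a number on the quality of output like the above, one common measure is the character error rate (CER): the edit distance between the OCR line and a reference transcription, divided by the length of the reference. The sketch below is not part of the workflow described here. The function names are made up, and the reference string in the usage example is a hand guess at the printed text (keeping the f-for-long-s convention of the output), not a checked transcription.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ocr: str, reference: str) -> float:
    """Character error rate of an OCR line against a reference line."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(ocr, reference) / len(reference)


if __name__ == "__main__":
    ocr_line = "amc caufc."    # line taken from the output above
    guess = "fame caufe."      # ASSUMPTION: hand-guessed reference text
    print(levenshtein(ocr_line, guess), round(cer(ocr_line, guess), 2))
```

A CER computed this way is only as reliable as the reference, so a proper evaluation would need a verified transcription of the page.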

Another example

This is Wilkins 1684:

Raw OCR:

wilkins/0001/010001.png : 48 mttIf uaa |J"|5rM1 
wilkins/0001/010006.png : D~@ii&9yb us th|q|~yf,PmmfM?Y5-T?!
wilkins/0001/010002.png : HoweVcr, the Ppnc eny wt 9b ft| 
wilkins/0001/010007.png : L1ve at an eaHer 1iHe, Dy !ccu9% U9 Mc |- 
wilkins/0001/010003.png : Lofs iii being depr|d .dt thR p|y|cK, . u?Ps 
wilkins/0001/010008.png : only, |nd receiVi9g if ~e|.y9urifh~Mj 
wilkins/0001/010004.png : at fome times we had the pr1~ggT to ne" 
wilkins/0001/010009.png : and fof . this Very Reai~ uaysM wapm9es
wilkins/0001/010005.png : it| Thenindeed Pbil# the Rw thin|s it w9Wd 
wilkins/0001/01000a.png : Enabled to tarrd9rty Day|an.ooqy Nigptg
wilkins/0001/01000b.png : in the Mount without eating any pMng, we- 
wilkins/0001/010010.png : /aefeb@e, Plat#, ~~ and qpFrs. ? -put De.. 
wilkins/0001/01000c.png : &ufe he there heard the .Mdody bt tne -Fh 
wilkins/0001/010011.png : cadfe it is not no~ Ithink, .Adirm.d ~ anyi 
wilkins/0001/01000d.png : veus.-Rifam teneat|e. -I Rn9w tlMs M|@ck 
wilkins/0001/010012.png : I fhall not therefore beflow either Pains or 
wilkins/0001/010013.png : Time in arguing againR it. 
wilkins/0001/01000e.png : Lath hadgreat P4trons, pom @acFd.ang PP- 
wilkins/0001/010014.png : It ma| (ildic| t@atoaVeonb Named mefe 
wilkins/0001/01000f.png : ma1F - Autb9rsJLo as o|Y~ @rde, PRF,
wilkins/0001/010015.png : Three l0(% a94 foE thT two 1ee eeiF~ 
wilkins/0001/01001a.png : as may make it po|f1ble tobe Inhabited, and 
wilkins/0001/010016.png : have refbrred the Keader to othcrs tor iaus- 
wilkins/0001/01001b.png : what thofeCualities are whereinit moren|ar- 
wilkins/0001/010017.png : fadioti I ihaR in the next p~ procee4 td 
wilkins/0001/01001c.png : ly A|rees ah our Ear& 
wilkins/0001/01001d.png : pROP n& 
wilkins/0001/010018.png : the Natd|e of theMo@ns Body,to knoP bhe- 
wilkins/0001/01001e.png : TA2t tIk |0|n | d wa CwnNde|l, Op~d 
wilkins/0001/01001f.png : Bod|
wilkins/0001/010019.png : ther that be capibm ofany fildh conditions2 
wilkins/0001/010020.png : Ir|%t:2|&ff2Xf22@ 
wilkins/0001/010021.png : agrccd 

With ocropus-align:

* wilkins/0001/010001.fst:  7.66  2.19: 48 mttlf M|aa |J"|5rul
* wilkins/0001/010006.fst:  8.57  2.57: D| f~ii&9yb us th|q|P?yf,Pm~M?Y5-T?!
* wilkins/0001/010002.fst:  6.41  1.79: Howe@r, the P9nc emy 0~ 9b ft|
* wilkins/0001/010007.fst:  5.07  1.36: L1ve at an eal1er 1i@e, Dy ~u9% U9 Mc |-
* wilkins/0001/010003.fst:  3.95  1.03: Lofs iii being depr|d .dt th1/due, . u?Ps
* wilkins/0001/010008.fst:  6.25  1.99: only, |nd receivi9g if ~e|.y9urifh~Mj
  wilkins/0001/010004.fst:  0.91  0.98: at fome times we had the pr1~o to ne"
* wilkins/0001/010009.fst:  5.53  1.53: .and fof . this Very Reai~ uays@ wap m9es
  wilkins/0001/010005.fst:  1.41  0.58: it| Thenindeed Pbil# the Rw thin|s it w9Wd
* wilkins/0001/01000a.fst:  4.08  1.21: Enabled to tarry r9rty Day|an.oor!y Nigp3
  wilkins/0001/01000b.fst:  0.24  0.58: in the Mount without eating any p@ng, we-
* wilkins/0001/010010.fst:  3.70  1.36: /aefeb@e, rat#, ~er~ and qp|rs. ? -put De..
* wilkins/0001/01000c.fst:  4.60  1.31: &ufe he there heard the .Mdody bt tne -Fh
* wilkins/0001/010011.fst:  5.64  1.62: cadfe it is not no~ Ithink, A|rm.d ~ anyi
  wilkins/0001/010012.fst:  0.51  0.37: I fhall not therefore beflow either Pains or
* wilkins/0001/01000d.fst:  2.54  0.85: veus.-Rifam teneat|e. -I Rn9w tl@s M|@ck
  wilkins/0001/010013.fst:  0.06  0.20: Time in arguing again@ it.
* wilkins/0001/01000e.fst:  3.90  1.17: Lath had great P4trons, pom @aqyd.a1q I3o-
* wilkins/0001/010014.fst:  3.59  1.29: It ma| (ildic| t@at~Veonb Named mefe
* wilkins/0001/01000f.fst:  7.04  2.23: ma1F - Autb9rs,fu| as d@|Y#~ @rde, !#R|,
* wilkins/0001/010015.fst:  5.18  1.68: Three l0(% a94 fo| th| two 1we eeiF~
  wilkins/0001/01001a.fst:  1.02  0.44: as may make it po|f1ble robe Inhabited, and
  wilkins/0001/010016.fst:  0.92  0.43: have refbrred the Keader to othcrs tor iaus-
* wilkins/0001/010017.fst:  7.05  1.73: fadioti I ma@ in the next p~ procee4 td
  wilkins/0001/01001b.fst:  0.75  0.48: what thofe Cualities are whereinit mor@n|ar-
  wilkins/0001/01001c.fst:  1.99  1.17: ly A|rees ah our Ear&
  wilkins/0001/010018.fst:  1.04  0.49: the Natd|e of the Mo@m Body,to kno| bhe-
* wilkins/0001/01001d.fst:  4.46  1.62: PROP n&
* wilkins/0001/010019.fst:  4.19  1.14: ther that be Capib| ofany fi1dh conditions2
* wilkins/0001/01001f.fst:  4.45  2.02: B#d|
* wilkins/0001/01001e.fst:  5.82  2.03: TA2t tlk 3f0|n | d 40|, C022||de|l, O~@|
  wilkins/0001/010020.fst:  1.74  0.85: Ir|%t:2|&ff2Xf22@
  wilkins/0001/010021.fst:  0.73  0.51: agrccd
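The listing above has a regular shape: an optional `*`, the `.fst` path, two numbers, then the recognised text. The following sketch parses such lines so they can be sorted or filtered. This page does not say what the two numbers or the star mean, so the parser keeps them as opaque values under the made-up field names `first`, `second` and `flagged`. In the listing above every starred line has a first value of 2.54 or more and every unstarred line 1.99 or less, which suggests, but does not confirm, that the star marks lines above some threshold.

```python
import re
from typing import NamedTuple, Optional

# Optional "*", path ending in .fst, two numbers, then the text.
# The text itself may contain ":" characters, so it is matched last.
LINE_RE = re.compile(
    r"^\s*(\*)?\s*(\S+\.fst):\s*([0-9.]+)\s+([0-9.]+):\s?(.*)$"
)


class AlignedLine(NamedTuple):
    flagged: bool   # True if the line starts with "*"
    path: str
    first: float    # meaning not documented on this page
    second: float   # meaning not documented on this page
    text: str


def parse_align_line(line: str) -> Optional[AlignedLine]:
    """Parse one output line; return None if it does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    star, path, first, second, text = m.groups()
    return AlignedLine(star is not None, path, float(first), float(second), text)


if __name__ == "__main__":
    sample = [
        "* wilkins/0001/01001d.fst:  4.46  1.62: PROP n&",
        "  wilkins/0001/010013.fst:  0.06  0.20: Time in arguing again@ it.",
    ]
    rows = [r for r in map(parse_align_line, sample) if r is not None]
    # List the unstarred lines first, lowest first value on top.
    for row in sorted(rows, key=lambda r: (r.flagged, r.first)):
        print(row.path, row.first, row.text)
```

Sorting by the first value is one way to pick out the lines that are worth inspecting by hand first.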

Attachments (1)
