Version 6 (modified 13 years ago)
The workflow will be adapted to accept OCRed text as input. OCRopus will serve as the OCR engine.
Tutorial video and other videos
The documents from the previous workflows were assessed for how well they are likely to perform when OCRed, and grouped by expected difficulty:
- Easy
  - Bernstein 1897 (Fraktur)
  - Berzelius 1819
  - Ampère 1844
- Medium
  - Vitruvius 1511
  - Cataneo 1600
  - Aristoteles 1547 (italics)
  - Archimedes 1565 (many pictures (problematic?))
  - Cataneo 1572 (bad printing)
  - Viviani 1659
  - Bianconi 1746
  - Zanotti 1752
  - Bion 1765 (blackletter)
  - Vitruvius 1800 (clear but weak printing; print from the reverse side shows through)
  - Gallaccini 1767 (small font, but quite clear)
  - Angeli 1668 (contains italics, otherwise very clear)
  - Trigault 1639
  - Bernoulli 1738
- Hard
  - Vitruvius 1544
  - Vitruvius 1757 (mixed blackletter and antiqua)
  - Zonca 1656 (bad printing)
  - Bacon 1670 (contains italics, bad printing)
  - Clavius 1606
  - Barrow 1674 (bad printing, italics and Greek)
  - Gravesande 1721
  - Vitruvius 1618 (Thesaurus has columns)
  - Mersenne 1635 (microfilm)
  - Aristoteles 1548 (contains Greek)
  - Vitruvius 1556 (extremely small font)
  - Aristoteles 1585 (bad printing)
  - Specklin 1599 (bad printing, blackletter)
  - Biancani 1635 (very small font)
  - Vitruvius 1567 (very small font, mixed italics and upright)
  - Archimedes 1565 (mixed italics and upright)
Command overview
The following commands (taken from the video above) run the recognition of English text:
ocropus-binarize 035.jpg
ocropus-pseg book/????.png
ocropus-lattices -m OCRopus/ocropy/2m2-reject.cmodel book/0001/??????.png
ocropus-align -l OCRopus/ocropy/data/default.fst book/0001/??????.fst
ocropus-hocr book/
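The five steps above could be wrapped in a small script so that they can be repeated for other pages. The following is only a sketch: the function names `ocr_commands` and `run_ocr` and the `dry_run` switch are made up for illustration, the default paths are simply those used in the commands above, and it assumes the OCRopus tools are on the `PATH`. The commands are passed through the shell because the `????` patterns rely on shell globbing.

```python
import shlex
import subprocess


def ocr_commands(image="035.jpg", book="book", page="0001",
                 model="OCRopus/ocropy/2m2-reject.cmodel",
                 fst="OCRopus/ocropy/data/default.fst"):
    """Return the five shell commands from the overview, in order.

    All defaults are copied from the command overview; other values
    are assumptions that would need checking against a real setup.
    """
    return [
        f"ocropus-binarize {shlex.quote(image)}",
        f"ocropus-pseg {book}/????.png",
        f"ocropus-lattices -m {shlex.quote(model)} {book}/{page}/??????.png",
        f"ocropus-align -l {shlex.quote(fst)} {book}/{page}/??????.fst",
        f"ocropus-hocr {book}/",
    ]


def run_ocr(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    shell=True lets the shell expand the '?' glob patterns;
    check=True stops the pipeline at the first failing step.
    """
    for cmd in commands:
        print(cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_ocr(ocr_commands())
```

Calling `run_ocr(ocr_commands(), dry_run=False)` would execute the steps for real; stopping at the first failure seems sensible because each step consumes the output of the previous one.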
First results
A page of Bacon 1670 (link) was OCRed with the commands above. The output is of poor quality (XHTML file attached):
116 tha| yet it cannot be for other - Kcafo+: - ~r f1r(l, it is found 9tit upon ccrta;n Trccs; and thofe Trces bear no fuch Fruit, os may allure that |rd to B|rrics, and fo is of(cn found thcrc| which may have givcn occaf1on tothc tale. But that which maketh an cnd of th|quc(lion is, that M1(Tcltoe hath w|nld go into a Bv@gh( ?nd bcf1dcs/it feemcrh to bc morcfat and un@uous, th1n the ordinary Sap of thc T!5c| both py thc Bcrry which is clammy, and by that it connnucth grccn winter and Summcr, which Ihc Trcc do(h not. tryal would bc m-c,b|riPPing vf thc Bough of aCrab~cc inthe Ba~and likely, to try it with fomeothcr watering or anointing, . that w7c notfonatu- things askill not me Bongh. 1t wcrc good to try, what T\&nM would put forth, @ thcy bc fo|biddcn Io p@tfottb thcir natn|l Boughs: powl thcreforc a Tree; .andcover it,(omc will Dut forth Roots 9 for fo will a Cions, bcir@ turncd down into clav. fhat is .not fo f1atnral to thcplant as Clay isg tryit with Lcathcr,or Clotb, or becn knodn t|gtow o\&t of apol9rd- .A Man may count the prickcs / Trccs to bcakinde d Excrcfccnce,for the/will ncvcr be@onghs| nor be Lcaves. @c plants that have p.ricklcs, are Thorns, Black and whitc. s Bryer, Rofc, Ltmmon"trecsv Crab-trccq Pricklcs in thc Lc2f .are, H9lly5 Juniper, vhin .bu(h, ~hi(bc i Ncttlcs alfo have a fiall vencmous Fr~Ici fo hath Borrage, but harmlefs. Thc . @ufc For the ha@c of the5pirit to put "Th, a@d the w|9t ot nourimmcty toput forth a Bough, and thc clof~fs of thc Balks caufc Frickles .in Bonghsg 0nd the Leav@s othcrw.!fcarc Wygh,as Burragc and Nettles are. As (or thc Leavcs of Holl)|, they arefmooth, but nevcr Plain, bnr asit wcrc with folds for thc2 amc caufc. 56o. Therc be alfo Tlots, thavhough the? havc no ?~kles, yet they havea kinde of Downey 9r Velvct Rine upop their Leafes; as Refe-C\<mpi0@ |te~ Cdrh2isns,,
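To put a number on "poor quality", one option is the character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of the transcription. The sketch below is not part of the workflow described here. The function names are illustrative, and the reference string is only an assumed reading of one short fragment of the output above, not a verified transcription of the page.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, reference):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Fragment taken from the OCR output above.
ocr_fragment = "thc Bcrry which is clammy"
# ASSUMPTION: our own reading of what the printed text says.
assumed_reference = "the Berry which is clammy"

if __name__ == "__main__":
    print(edit_distance(ocr_fragment, assumed_reference))
    print(char_error_rate(ocr_fragment, assumed_reference))
```

A meaningful figure for the whole page would need a manually checked transcription of that page as the reference.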
Attachments (1)
- result.html (4.1 KB), added 13 years ago: First OCR of Bacon, standard settings