#-*- coding:utf-8; mode: org; mode: visual-line; fill-column: 10000 -*-
#+TITLE: Documentation of the XML Workflow Tools webservice
#+AUTHOR: Klaus Thoden
#+EMAIL: kthoden@mpiwg-berlin.mpg.de
#+DATE: 2012-02-09 Thu
#+DESCRIPTION:
#+KEYWORDS:
#+LANGUAGE: en
#+OPTIONS: H:3 num:t toc:t \n:t @:t ::t |:t ^:t -:t f:t *:t <:t
#+OPTIONS: TeX:t LaTeX:t skip:nil d:nil todo:t pri:nil tags:not-in-toc
#+OPTIONS: author:nil creator:nil timestamp:nil
#+INFOJS_OPT: view:nil toc:nil ltoc:t mouse:underline buttons:0 path:http://orgmode.org/org-info.js
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+LINK_UP:
#+LINK_HOME:
#+XSLT:
* How to use this file :noexport:
This file is edited in Emacs org-mode! It should be regarded as the place for the whole documentation of the project. By the way, all the text should be translated into English.
Exporting is made pretty easy by org-mode. For example, in the [[id:17B01441-4FD8-42CE-BF39-453096B19EE1][help texts]], each section has a property set which publishes it into the right HTML file. Thus, updating the documentation of the scripts is very easy: just export it to HTML.
To make the exported text look better, some local variables are altered: lines are wrapped at the right side of the screen, and the auto-filling of paragraphs is triggered at column 10000. This should avoid ugly line breaks, e.g. when exporting to HTML. Also, do not use blank lines between paragraphs (as we all know and love it from LaTeX).
** LaTeX Export
To make the export look nice, the following source code block needs to be evaluated (move point into the block and type C-c C-c (org-babel-execute-src-block)) and the line ~#+LaTeX_CLASS: per-file-class~ has to be set in the preamble of this document. Right now, the export with this class is broken, maybe because of the source code examples in this text. You can read more about the more sophisticated LaTeX export at the [[http://orgmode.org/worg/org-tutorials/org-latex-export.html][worg]] site.
#+name: setup
#+begin_src emacs-lisp :results silent :exports none
(unless (find "per-file-class" org-export-latex-classes :key 'car :test 'equal)
  (add-to-list 'org-export-latex-classes
               '("per-file-class"
                 "\\documentclass{scrartcl}
[NO-DEFAULT-PACKAGES]
[EXTRA]"
                 ("\\section{%s}" . "\\section*{%s}")
                 ("\\subsection{%s}" . "\\subsection*{%s}")
                 ("\\subsubsection{%s}" . "\\subsubsection*{%s}")
                 ("\\paragraph{%s}" . "\\paragraph*{%s}")
                 ("\\subparagraph{%s}" . "\\subparagraph*{%s}"))))
#+end_src
* About the tool
This is a Java web application for running text conversion and text manipulation scripts. It is expandable by writing new scripts in either Python, XSL, Perl or directly in Java.
* Installation
** Software requirements
*** For running
- A binary distribution of Tomcat 6 or 7 (tested with 6.0.33 and 7.0.26)
  - On ocropus.rz-berlin.mpg.de:
    #+begin_src bash
    Creating config file /etc/default/tomcat7 with new version
    Adding system user `tomcat7' (UID 107) ...
    Adding new user `tomcat7' (UID 107) with group `tomcat7' ...
    Not creating home directory `/usr/share/tomcat7'.
    Starting Tomcat servlet engine tomcat7 ...done.
    Setting up authbind (1.2.0build3) ...
    #+end_src
- Perl (tested with 5.8.9 and 5.10.1)
  - Additional modules: libxml2 binding for Perl (Ubuntu package libxml-libxml-perl)
- Python (tested with 2.5.1 and 2.6.5)
  - Additional modules: simplejson
- libxml2-utils (for xmllint)
*** For building
- Subversion (tested with 1.6.5)
- Ant (1.8.2)
- Eclipse (Galileo or Helios) with plugin Eclipse Web Developer Tools
- Java (tested with Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_30-b03-389-9M3425), OpenJDK Runtime Environment (IcedTea6 1.9.10) (6b20-1.9.10-0ubuntu1~10.04.3) and Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-9M3326))
** Getting the source code
You can check out the source code from our Subversion directory:
#+begin_src sh
svn co https://it-dev.mpiwg-berlin.mpg.de/svn/digitizing-tools digitizing-tools
#+end_src
Import the project into Eclipse for comfortable editing.
** Configurations
*** System
You need to set an environment variable, ~ECHO_SCRIPTS_DIR~, which should point to the directory ~digitizing-tools/scripts~. On the production system, it can be set in a file called ~setenv.sh~ (create it if it does not exist), located in ~~ (which might be ~/usr/share/tomcat7/bin~), with the line
~export ECHO_SCRIPTS_DIR="/usr/share/tomcat7-resources/digitizing-tools/resources/scripts"~
Adjust the path if it differs.
*** Tomcat server
NB: on ocropus.rz-berlin.mpg.de, APACHE_DIR is ~/etc/tomcat7~, which is also symlinked to ~/var/lib/tomcat7/conf/~.
1. Edit the file ~/conf/tomcat-users.xml~ and specify a user, for example
   #+begin_src xml
   #+end_src
2. Create a file with the name ~digitizing-tools#resources.xml~ in ~/conf/Catalina/localhost~ to set the location for the output files and make the CSS file work. The file should contain the following line
   #+begin_src xml
   #+end_src
3. Some scripts are executed by jython.
   For this, jython has to be installed, but there also has to be a directory (presently located in ~/usr/share/tomcat7~) called ~.jython-cache~, which is readable and writable by the user ~tomcat7~.
*** Eclipse
1. In ~build.properties~ (in the root of the project) the variable ~catalina.home~ should be set to the path of the binary Tomcat distribution.
2. Edit the file ~conf/digitizing-tools.properties~. Some scenarios are collected in ~conf/digitizing-tools.template.properties~. You need to adjust the paths to your needs. The ~url~ option only needs to be set for uploading files into the sandbox.
*** Starting the server
1. In the root folder of the project, where ~build.xml~ is located, execute the command:
   #+begin_src sh
   ant tomcat-deploy-debug
   #+end_src
* Adding new workflow scripts
1. Create a Java class in the ~src~ directory in the package ~de.mpiwg.itgroup.digitools.web.session.scripts~, giving it a meaningful name. Additionally, the class must extend the class ~de.mpiwg.itgroup.digitools.web.session.AbstractScript~. Some parameters should be given:
   - =public static String id = "xxx";= :: id for the script
   - =this.setSubfolder("adjustHarriot");= :: directory in ~digitizing-tools/scripts/~
   - =this.setNextScriptIdentifier= :: identifier for the next logical step in the workflow
   - =this.setProgrammingLng(PYTHON);= :: optional, sets the script language (default if not set: ~PERL~)
   - =this.setShowStringInput(true);= :: optional, if true, shows an input box for a string (default if not set: false)
   - =this.setOutputTextFileExtension(XML);= :: optional, output file format (default if not set: ~TXT~)
   - =this.setFileForNextScript(INPUT_TEXT);= :: optional, if set, the input text will be submitted to the following script rather than the output of the current script. Example: ~test_validity~ only checks whether the text is valid, and the output of the script is purely informational.
   - =this.setLabel("xxx")= :: text that is displayed
     - in the menu on the home page
     - in the menu in the "Workflow" window
     - in the heading on the script page ("Script: ")
     - in the name of the output text file
2. Insert the class in the ~de.mpiwg.itgroup.digitools.web.session.SessionBean~. An object of the class should be created in the ~SessionBean~ as a private attribute. Additionally, a public method called ~getScript_XX~ must be inserted, which returns the new object. In other words: add another line where the other scripts are put into a ~scriptMap~ and define a ~public String~ in the middle of the file.
3. Create a directory in ~scripts~ and put the script into it.
4. Create a help file in the directory of the script or (the preferred way) create a section in the file ~Dokumentation.org~ in the [[id:188EA146-B205-4A8B-A28D-F819A3100C92][scripts]] section, stating the ~EXPORT_FILE_NAME~ in the properties of that section, and write the help text there.
5. Make an entry in ~WebContent/template/template.jsp~ to make the script appear in the menu bar. The ~value~ attribute may contain any text, but should not be too long.
6. Restart the server and test the script.
* Updating the production system
In order to update the production system, the project is built locally and then deployed on the production system.
** Local build
1. Update the local repository, and run the system on your local machine to see whether any errors occur.
2. If there are none, edit ~conf/digitizing-tools.properties~ so that it reflects the settings on the production system (i. e. paths and URL). Check ~conf/digitizing-tools.template.properties~ for an example.
3. By running
   #+begin_src sh
   ant tomcat-deploy-debug
   #+end_src
   you create the ~war~ file which will be deployed later on. It is situated in ~/webapps/~.
4. Change the settings back in ~digitizing-tools.properties~.
** On the production system
1.
   Delete the ~resources~ directory (as stated in ~digitizing-tools.properties~) and copy your local version in its place (if you are bold, you might also use ~rsync~). Also, copy the ~scripts~ directory into the remote version of ~resources~.
2. Navigate your browser to the remote system's Tomcat installation, suffixing its address with ~/manager/html~.
3. Click the ~Undeploy~ button in the ~/digitizing-tools~ line.
4. Further down, choose ~Select WAR file to upload~ and upload the freshly generated ~war~ file, situated in ~/webapps/~.
5. Click ~Deploy~ and check the new version.
6. If pages are not shown correctly, copy your local version of ~/conf/Catalina/localhost/digitizing-tools#resources.xml~ to the production server's instance of Tomcat (might be situated at ~/etc/tomcat7/Catalina/localhost~ or ~/usr/local/tomcat/apache.../conf/Catalina/localhost~).
7. Check whether the diff2html output is displayed correctly. At least once, the executable flag was not set.
* How to write scripts
The scripts in the toolbox are mainly scripts that take a text file and alter it by searching and replacing things in it. Currently, these scripts can be written in either Perl or Python, although other languages might be employed as well. Right now, some of the scripts in these languages are actually wrappers for executing XSLT processors in Java. However, as the webservice is designed to display the status information which the scripts would otherwise write to ~stdout~ and ~stderr~, libraries have been written that direct these streams to the respective outlets. These libraries exist in both Perl and Python.
** How the scripts are executed - Command line options
The scripts are executed with a number of options that direct the output streams and also specify the input and output files. There are six arguments available:
- ~--console~ :: This specifies the path to the file which will show up in the "Console" tab of the output field.
- ~--outputTextFile~ :: This is where the output of the script is written to.
- ~--contextOutput~ :: Some scripts do not alter the input file but create a new file which, after it has been edited by the user, serves as an additional input to some other script.
- ~--contextInput~ :: If a context file is needed, this option specifies the path to the context file.
- ~stringInput~ :: If ~setShowStringInput~ is set to true in the Java class, an additional string can be given as input. This is the second-to-last argument on the command line.
- ~inputFile~ :: The last argument on the command line is the input text.
** Addressing the options in the script
Due to performance issues, all the output files and the messages are written to files, rather than directly to the output windows on the webpage. Thus, statements like Perl's ~warn~ or ~die~ and Python's ~print~ end up nowhere. Instead, the following functions have to be used to put the files and messages in their places.
Firstly, the library has to be loaded. In Perl:
#+begin_src perl
require $ENV{'ECHO_SCRIPTS_DIR'} . "/share/common_functions.pl";
#+end_src
In Python:
#+begin_src python
import os
import sys
sys.path.append(os.environ["ECHO_SCRIPTS_DIR"]+"/share")
import common_functions
#+end_src
The common functions are:
- ~printInOutputTextFile(line)~ :: This function writes text into the output text file. Normally, this is the contents of the input file with the changes made to it by the current script.
- ~printlnInConsole(message)~ :: This function displays messages in the "Console" part of the output field.
- ~printlnInContextInput(message)~ :: This function is not used right now and may be superfluous.
- ~getInputContext()~ :: Scripts that need a context file as input read it with this function.
- ~printlnInContextOutput(message)~ :: If a context file needs to be written, the contents should be written with this function.
- ~printlnInERROR(message)~ :: This replaces the pipe to ~stderr~. Messages directed there will show up in the "Error" part of the output field.
- ~escapeTags~ :: As the output boxes "Console" and "Error" are capable of displaying HTML, it might be helpful to escape the HTML entities.
- ~consoleDebug~ :: Displays messages for debugging purposes in the output field.
- ~consoleWarn~ :: Displays warning messages in the output field.
- ~consoleStatus~ :: Displays status messages in the output field.
* Help texts
** share/help
*** Figure Generator
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/getFiguresHelp.html
:END:
The figure generator produces a folder filled with figures which will be displayed whilst viewing the accompanying text in the ECHO viewer. The whole system is meant to facilitate the otherwise cumbersome process of cutting out the images and putting them into a folder on the server. Rather, digilib itself will be used to determine the position of the images.
The input for this script is a text file filled with addresses to the figures. The script will download these images and put them into a zipped folder which then has to be put on the server.
For a detailed description of using digilib for these purposes, please refer to [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/Cutting%2520out%2520images][the wiki]].
*** XML Template Generator
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/bareEchoXMLHelp.html
:END:
The template generator returns an XML file that consists of a metadata block and page break tags. This information is taken from the online resources within the institute. To generate a file you need to enter the ID of the online resource and the path name, which can be either ~permanent~ or ~experimental~. The default path is ~permanent/library~.
How to find out the identifier: you can search for the resource via [[http://141.14.236.36/fmi/xsl/katalog/findrecords.xsl?-db%3Dkatalog&-grammar%3Dfmresultset&-view%3D][the library catalogue]]. The link in the results to "Scans (high quality)" contains the path, e. g.
~http://echo.mpiwg-berlin.mpg.de/ECHOdocuView?mode=imagepath&url=/mpiwg/online/permanent/library/DSHR9GHY/pageimg~, where the path name would be ~permanent/library~ and the ECHO ID is ~DSHR9GHY~.
*** Home
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/homeHelp.html
:END:
This is the manual for the online implementation of the workflow for converting text documents that were typed according to the [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/despecs][DESpecs]] to an XML document which validates according to the [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/echo-schema][ECHO Schema]].
This page is the starting point to all the services. Navigate to the desired point by using the menu on top. The services offered are:
- Scripts for converting raw text - or parts of it - into the ECHO XML format.
- An interface to an OCR service (not functional yet)
- An XML template generator from a set of images.
- An upload interface to make the texts viewable in the ECHO display environment.
*** Input File
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/inputFileHelp.html
:END:
Please choose the input text file for this conversion step. If you see a file name next to the "Upload" button, a file has already been chosen, most probably because you are using the output file of another script as input. To change this, press the "Upload" button and choose another file. After choosing a file, press "Execute" to run the script.
*** Main
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/mainHelp.html
:END:
This is the manual for the online implementation of the workflow for converting text documents that were typed according to the [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/despecs][DESpecs]] to an XML document which validates according to the [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/echo-schema][ECHO Schema]].
*** Output
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/outputHelp.html
:END:
The output window shows you some information about the performance of the script and gives you the opportunity to examine the result of the script.
- Execution state :: Shows either "running" or "finished".
- Executed command :: The text shown is what was actually executed on the server. This is mainly for debugging purposes.
- Execution time :: Displays how long the script was running.
- Console/error window :: This window shows messages from the script. If there was an error in the execution, it will be displayed here.
- Output text file :: The converted text file is offered here for viewing and downloading. Please note that, if the output is an XML file, it might not be well-formed, so that the browser cannot display it but rather shows an error. In that case, download the file and open it in a text editor.
- Configuration file (optional) :: Some scripts produce a configuration file which you need to download, edit and upload again elsewhere.
*** Parameter List
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/parameterListHelp.html
:END:
Here, different parameters can be chosen.
*** Workflow
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/share/help/workflowHelp.html
:END:
This window lets you continue with the workflow with three different options.
- Next script in the workflow :: The output file will serve as the input file of the next logical step in the workflow.
- Go to other script :: Take the output file and feed it into a different script.
- Upload to Sandbox :: Go to the upload interface to check the file in.
** scripts
:PROPERTIES:
:ID: 188EA146-B205-4A8B-A28D-F819A3100C92
:END:
*** Harriot
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/adjustHarriot/mainHelp_adjustHarriot.html
:END:
The Harriot online project uses a larger set of XML elements than the stable release of the ECHO Schema supports.
In order to display the text anyway, certain elements have to be removed.
*** Make Wiki pages
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/makeWiki/mainHelp_makeWiki.html
:END:
This script writes wiki pages for Work Orders. As input, it expects a comma-separated (no spaces in between) list of identifiers which will then be turned into syntax for the wiki. Just copy and paste!
*** Convert ECHO to TEI
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/echo2tei/mainHelp_echo2tei.html
:END:
This script performs an XSL transformation from an ECHO XML document to TEI. Only a small subset of TEI tags is supported.
*** Convert ARB manuscripts to ECHO
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/arb2echo/mainHelp_arb2echo.html
:END:
This script performs an XSL transformation from a database dump of the ColorConText to an ECHO XML document.
*** Find pagebreaks
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/find_pb/mainHelp_find_pb.html
:END:
This script finds the pagebreaks in the document and, by creating links to the page images on the server, creates a preview of which pagebreak in the text will be linked to which page. If there are errors, the source text document has to be changed and uploaded again.
You will need to know the path to the images on the institute's servers. A way to find that out is to search for the resource in [[http://141.14.236.36/fmi/xsl/katalog/findrecords.xsl?-db%3Dkatalog&-grammar%3Dfmresultset&-view%3D][the library catalogue]]. The link in the results to "Scans (high quality)" contains the path, e. g. ~http://echo.mpiwg-berlin.mpg.de/ECHOdocuView?mode=imagepath&url=/mpiwg/online/permanent/library/DSHR9GHY/pageimg~, where the string input would be ~permanent/library/DSHR9GHY~.
Please also check the generated configuration file which contains the filenames of the page images. Sometimes, there is a file for a testing page which needs to be removed.
The second step writes the file names into the text document.
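The derivation of the string input from the catalogue link can be sketched in Python. This is only an illustration and not part of the toolbox: the function name ~string_input_from_link~ is hypothetical, and the sketch assumes Python 3 and that the link always carries the image path in its ~url~ parameter, starting with ~/mpiwg/online/~ and ending in ~/pageimg~, as in the example link.
#+begin_src python
from urllib.parse import urlparse, parse_qs


def string_input_from_link(link):
    """Return the image path without the server prefix and the pageimg folder.

    Hypothetical helper, assuming links of the form shown in the example.
    """
    # The image path is the value of the "url" query parameter.
    image_path = parse_qs(urlparse(link).query)["url"][0]
    parts = image_path.strip("/").split("/")
    # Drop the leading "mpiwg/online" and the trailing "pageimg", if present.
    if parts[:2] == ["mpiwg", "online"]:
        parts = parts[2:]
    if parts and parts[-1] == "pageimg":
        parts = parts[:-1]
    return "/".join(parts)


link = ("http://echo.mpiwg-berlin.mpg.de/ECHOdocuView?mode=imagepath"
        "&url=/mpiwg/online/permanent/library/DSHR9GHY/pageimg")
print(string_input_from_link(link))
#+end_src
For the example link, this prints the same value that has to be typed into the string input field.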
*** Synchronize pagebreaks
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_2_02_sync_pb/mainHelp_2_02.html
:END:
After the source text document has been edited so that there are as many pagebreaks in the text as there are page images on the server, the links to these pages are created.
*** Find forbidden characters
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_2_03_find_forbidden_characters/mainHelp_2_03.html
:END:
This script finds characters that should not occur in the raw text, either because they might confuse the conversion scripts or because it is plainly improbable that a character appears in a book of a certain period (an IPA character cannot be expected in a 17th century book). The characters are collected in the "configuration" file. For further replacements, just add them to the file with the following syntax:
#+begin_src sh
=
#+end_src
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a2.03ersetzeverboteneZeichenimText][in the wiki]].
*** Detect unknown characters
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_2_04_check_unknown_characters/mainHelp_2_04.html
:END:
The digitized text might contain characters that are hard to type. Therefore, the data entry firm is asked to encode characters that are impossible to type with a number and to provide, on delivery, a table with a screenshot of each such character and the corresponding code. This table is evaluated in this step to suggest replacements for the unknown characters in the text. However, it has to be checked manually whether the suggestions are right. The second step in this series executes these changes.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a2.04pr%C3%BCfeunknowncharacters][in the wiki]].
*** Check escape sequences
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_2_05_check_escape_sequences/mainHelp_2_05.html
:END:
This script checks for characters like ~{~ and ~\~ and whether they are allowed to occur at this point in the text.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a2.05pr%C3%BCfeescapesequences][in the wiki]].
*** Check tags
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_2_07_check_tags/mainHelp_2_07.html
:END:
This script performs checks concerning the correctness of tags in the text. Errors are shown in the Error output. All of the errors should be taken care of, as the following scripts in the workflow rely on the correctness of the tags.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a2.07pr%C3%BCfetags][in the wiki]].
*** Preparation
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_00_Preparation/mainHelp_3_00.html
:END:
Some parts of the DESpecs have not yet been implemented in scripts. Therefore, they need to be taken out of the text. This is done here. The tags concerned are:
- ~~
- ~~
- ~~
- stray underscores
- ~~
We apologize for any inconvenience.
*** Replace unknown characters
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_01_replace_unknown_characters/mainHelp_3_01.html
:END:
This script resolves the list of unknown characters.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.01ersetzeunknowncharacters][in the wiki]].
*** Replace replacements
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_02_replace_replacements/mainHelp_3_02.html
:END:
This script resolves the list of replacements.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.02ersetzereplacements][in the wiki]].
*** Replace escape sequences
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_03_replace_escape_sequences/mainHelp_3_03.html
:END:
This script replaces escape sequences. For example, the data entry firm types an ~æ~ ligature as ~{ae}~. Here, it is converted to the actual Unicode character.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.03ersetzeescapesequences][in the wiki]].
*** Replace underscores
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_04_replace_underscores/mainHelp_3_04.html
:END:
This script replaces the shorthand for italics, ~_~, with XML tags.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.04ersetzeitalics][in the wiki]].
*** Add basic XML and metadata
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_05_add_basic_xml/mainHelp_3_05.html
:END:
This script adds a basic XML structure to the raw text. For this purpose, it also needs metadata which the user has to enter. If there are already scans available in the ECHO environment, the path can be entered here. The script will then try to get the metadata from there. Please note that you have to specify the full path, as in ~/permanent/library/UR271U6Y~ or ~experimental/klaus/ocr-1~.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.05Metadatenrootelement][in the wiki]].
*** Make tags wellformed
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_06_make_tags_wellformed/mainHelp_3_06.html
:END:
The last script in this part of the workflow creates some more well-formed tags and replaces some characters with the respective XML entities.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a3.06wohlgeformtesxml][in the wiki]].
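What replacing characters with XML entities means can be illustrated with a short Python sketch. Which characters the script itself replaces is not stated here; the sketch assumes the three characters that always need escaping in XML text (~&~, ~<~ and ~>~), uses a made-up sample string, and relies on the standard library function ~xml.sax.saxutils.escape~ (Python 3) rather than on the script's own code.
#+begin_src python
from xml.sax.saxutils import escape

# Made-up sample text containing characters that are special in XML.
raw = "a < b & c > d"

# escape() replaces &, < and > with the corresponding XML entities.
escaped = escape(raw)
print(escaped)  # a &lt; b &amp; c &gt; d
#+end_src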
*** Make wellformed
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_make_wellformed/mainHelp_3_make_wellformed.html
:END:
This is a metascript which performs all the above actions in one go. It is not implemented yet.
*** Test wellformedness
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_3_test_wellformedness/mainHelp_3_test_wellformedness.html
:END:
This script performs a test to see if the XML is wellformed. If so, the next phase of the workflow can be started.
*** Convert page breaks
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_01_pb/mainHelp_4_01.html
:END:
This script inserts proper ECHO-conform pb-tags.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.01pb][in the wiki]].
*** Move floats
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_02_move_floats/mainHelp_4_02.html
:END:
This script takes some elements out of their original place, puts an ~~ element in and groups the elements in question in a ~
~ of the type "float" after the paragraph.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.02floatsherausziehen][in the wiki]].
*** Insert line breaks
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_03_insert_lb/mainHelp_4_03.html
:END:
This script converts linebreaks into XML tags, thereby reducing the number of lines.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.03lb][in the wiki]].
*** Insert semantic units
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_04_insert_s/mainHelp_4_04.html
:END:
This script tags the semantic units of the text, employing a simple heuristic. Before running the script, escape characters can be inserted to exclude selected punctuation characters from being converted.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.04s][in the wiki]].
*** Convert emph
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_05_emph/mainHelp_4_05.html
:END:
This script converts the original tags for marking emphasis into ~~ tags with a "style" attribute.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.05emph][in the wiki]].
*** Convert tables
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_06_tables/mainHelp_4_06.html
:END:
This script is not implemented yet. Sorry.
*** Insert div structure
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_07_insert_div/mainHelp_4_07.html
:END:
This script inserts a basic div structure. Each ~~ element marks the beginning of a new ~
~.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a4.07div][in the wiki]].
*** Make valid
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_make_valid/mainHelp_4_make_valid.html
:END:
This is not implemented yet.
*** Test validity
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_4_test_validity/mainHelp_4_test_validity.html
:END:
This script tests if the XML is valid according to the ECHO XML Schema. If not, error messages are displayed. Please correct the errors and try to validate the text again.
*** Insert regularizations
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_5_01_insert_reg/mainHelp_5_01.html
:END:
This script regularises the text. Parameters can be selected by choosing a text file from the list below.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a5.01reg][in the wiki]].
*** Number divs
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/Filter_5_06_number_divs/mainHelp_5_06.html
:END:
This script numbers the div attributes ~level~ and ~n~.
Read more about this script [[https://it-dev.mpiwg-berlin.mpg.de/tracs/mpdl-project-content/wiki/workflow#a6.01div-Attribute][in the wiki]].
*** Convert shorthand to MathML
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/mathml/mainHelp_mathml.html
:END:
In the Harriot online project, a simple syntax for expressing math was used, following [[http://pythia.mpiwg-berlin.mpg.de/department1/harriotOnline/documents/conventions.txt][these conventions]]. This script converts the formulas to valid MathML. The script is a slightly adapted version of the Perl port of [[http://pillars.che.pitt.edu/LaTeXMathML/][Douglas Woodall's LaTeXMathML]].
*** Replace page images
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/replacePageImg/mainHelp_replacePageImg.html
:END:
This script assists in replacing page images. The input is an XML file and the online path to the new page images.
If there is an ~echodir~-entry in the metadata of the XML file, you can type in "echodir" (without quotes), so the script will use that path.
*** Assign IDs
:PROPERTIES:
:EXPORT_FILE_NAME: ../scripts/assignIDs/mainHelp_assignIDs.html
:END:
This script assigns xml:id attributes to a number of elements.
* To be written :noexport:
Some extra stuff here.
** How to install the OCR bit
Following the slides from Uwe Springmann (https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/lehre/sose-14/ko_5220061)
- sudo apt-get install tesseract-ocr* (installs 3.02.01 on Ubuntu precise)
- installed ocropus from the source (hg pull…)
- installed everything mentioned in PACKAGES apart from firefox
- download models
- install xvfb as fake X server
- xvfb-run ocropus-nlbin benedetti_1585_0007.tif -o bene
- ocropus-gpageseg bene/*.bin.png
- Results do not look good. More training needed? Do that later.