--- texttool-architecture/soft-search.tex 2004/06/01 12:00:21 1.1
+++ texttool-architecture/soft-search.tex 2004/06/01 12:14:48 1.2
@@ -1,229 +1,18 @@
-\subsubsection{rec.cgi (register text)}
-\label{sec:rec.cgi}
+\subsubsection{q1 (corpus-wide search)}
+\label{q1}
 \paragraph
-On the ECHO server, the registration of new texts is implemented by
-means of a cgi script, reg.cgi
-(archimedes/web/cgi-bin/toc/admin/reg.cgi ). reg.cgi retrieves a
-metadata file in MPIWG archive metadata format from the entered uri
-(currently only local paths are supported ) and constructs from this
-file a toc.cgi object file (see below) , which it writes to toc.cgi's
-data section. [corpus???] It should be stressed that this is a
-registration procedure developed for a particular implementation of
-toc.cgi and not a part of the core application.
+This section describes the software associated with the ECHO
+lemmatized text search
+\url{http://echo.mpiwg-berlin.mpg.de/ECHOVIEW/ECHO_view.css}
-\paragraph
-reg.cgi takes two parameters, path and show. Path should give the
-local path to the metadata file for the text that is being
-registered. If ``show'' is set to 1, reg.cgi will return for
-inspection the toc.cgi object file that it has built out of the
-submitted metadata file.
-
-\paragraph{input metadata file}
-
-The input metadata file must have the following form
-
-\paragraph
-\begin{verbatim}
-
- ...
-
-
-
-
-Mainzer Untergerichtsordnung (von 1534)
-anon
-1580
- yes
- pageimgtif
- /mpiwg/online/experimental/echo_DRQEdit_test/anon_Mainz_1580/fulltextDW/mainzugo02_utf8.xml
- pb01-presentation/info.xml
-
-
-
-\end{verbatim}
-\paragraph{archimedes object registration}
-
-\subsubsection{toc.cgi (display text)}
-\label{sec:toc.cgi}
-
-\paragraph{plan of this section }
-
-\begin{enumerate}
-\item An overview of toc.cgi architecture
-\item A walk-through of typical cgi queries for toc.cgi
-\item An index of cgi parameters and values with short descriptions of function
-\end{enumerate}
-
-\paragraph{Overview of toc.cgi architecture}
-
-\subparagraph{}
-toc.cgi is a perl script for displaying collections of xml texts and
-linking them to related resources such as page-images, morphological
-analysis, commentaries, dictionaries, etc. It implements generic methods
-for resource-linking provided by a series of perl modules which are in
-turn based mainly on generic open-source tools for xml manipulation and networking
-written in C.
-
-\subparagraph{toc.cgi collections--Network transparency}
-Each of the collections in toc.cgi is a ``virtual'' collection, that
-is, a collection of links or uri's to resources that reside somewhere on an accessible
-network, local or remote.
-
-\subparagraph{toc.cgi collections--remote resources}
-
-What is at the other end of the link is of no concern to toc.cgi, as
-long as the resource referenced by the link meets minimal toc.cgi
-requirements--how the resource is actually implemented and exposed is
-a matter for the resource provider. The link may, for instance, point
-directly to an xml text or it may point to a container which exposes a
-particular xml view of an underlying resource that is perhaps not in
-xml format at all.
-
-
-\subparagraph{resource registry}
-
-
-
-
-\paragraph{cgi parameters -- standard queries}
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=corpus }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=corpus }
-\newline
-\newline
-get a listing of corpora
-
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpusmanifest }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpusmanifest }
-\newline
-\newline
-get an xml listing of corpora
-
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi }
-\newline
-\newline
-get a listing of works in default corpus
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?corpus=1 }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?corpus=1 }
-\newline
-\newline
-get a listing of works in corpus 1 [default corpus = 0]
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpuslist }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpuslist }
-\newline
-\newline
-get an xml listing of works in default corpus
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpuslist;corpus=1 }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?step=xmlcorpuslist;corpus=1 }
-\newline
-\newline
-get an xml listing of works in corpus 1
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=baifl_renav_006_la_1537;step=thumb }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=baifl_renav_006_la_1537;step=thumb }
-\newline
-\newline
-get a work from default corpus with thumbnail navbar displayed left
-
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=jorda_ponde_050_la_1533;step=thumb;ftype=thumbright }{
http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=jorda_ponde_050_la_1533;step=thumb;ftype=thumbright }
-\newline
-\newline
-get a work from default corpus with thumbnail navbar displayed right
-
-\htmladdnormallink{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=jorda_ponde_050_la_1533;step=textonly;corpus=;page=22 }{ http://archimedes.mpiwg-berlin.mpg.de/cgi-bin/toc/toc.cgi?dir=jorda_ponde_050_la_1533;step=textonly;corpus=;page=22 }
-\newline
-\newline
-get a page of text from a work from default corpus
-
-
-
-
-\subsubsection{Indexing}
-\label{sec:indexing}
-
-\paragraph{Status quo ECHO}
-Currently indexing is not implemented on the ECHO server.
-
-\paragraph{Plan ECHO}
-
-\begin{enumerate}
-\item construct remote (141.14.236.86) index for each file at
- per-change or daily intervals
-\item store indices locally in
-archimedes/data/db/PROJECT_NAME/CORPUS_NAME/WORK
-\item 2 progs on server 1. cgi: indexer 2. backend da_remote
-\item 2 progs on client 1. cgi: sendindex 2. backend getindex
-\item indexing transaction handled by two cgi scripts, one on the
- server the other on the client [this is the 1st implementation bcs
- its easiest and there are no port issues, but probably it'd be
- better to have a separate port].
-\item client cgi: getindex -- sends 1. list of files to index
- 2. uri to which xml notification of completion is to be sent. Upon
- notification, activates backend prog that fetches and installs the
- indices.
-\item server cgi: indexer receives filelist and notification
- addess. Activates backend that fetches files, indexes, places
- completed indexes in a networked location, then sends xml
- notification back to client.
-\item single script provides backend access to indices
-\item leave front-end issues like display, collection and navigation
- to web-design programmers. Do only a sample for now.
-\end{enumerate}
-
-\subsubsection{Morphology}
-\label{sec:morphology}
-
-
-\subsubsection{Dictionary server}
-\label{sec:dictionary-server}
-
-
-\subsubsection{helper programs}
-
-\paragraph{addarch.pl ARCHIMEDES}
-
-Automatically registers new texts as toc.cgi objects when they appear in
-cvs. Automatically updates relevant morphological indices (slow!) each
-time a cvs update occurs. This program is called by a hook in the cvs
-``loginfo'' configuration file.
-
-
-\paragraph{makelemma.pl ARCHIMEDES}
-
-Updates lemmatization indices.
-Parameters:
-No parameter--update all lemmatization indices
-[latin | ital | greek | en | nl | de]-- update this language
-
-\paragraph{makefast.pl ARCHIMEDES}
-
-Updates the toc.cgi morphology indices
-Parameters
-No parameter--update all lemmatization indices
-[latin | ital | greek | en | nl | de]-- update this language
-
-\subsubsection{summary of differences btwn the archimedes toc.cgi
- implementation and the echo toc.cgi impelementation (toc.x.cgi)}
-
-\paragraph{missing in archimedes}
 \begin{enumerate}
-
-\item html templates (coded but phased out of cvs branch)
-\end{enumerate}
-
-\paragraph{missing in echo}
-\begin{enumerate}
-
-\item word-coloring?
-\item remote text method may work differently
-\end{enumerate}
-
-\paragraph{differences}
-\begin{enumerate}
-\item structure of info.xml
-\item resource-discovery algorithm for info.xml
+\item xml-rpc interface to 141.14.236.86 and implementations (archimedes/bin/getindex
+ and archimedes/bin/make_indices)
+\item search module archimedes/code/IncPerl/Archim/Toc/Search.pm and
+ implementation ( archimedes/web/cgi-bin/search/q1 )
 \end{enumerate}
+\paragraph