\documentclass[a4paper]{article}

\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
\usepackage{url}
%\usepackage{hyperref}

%% for latex2rtf :-(
% remember to replace "%" in URL!
%\newcommand{\url}[1]{\verb!#1!}
%\renewenvironment{footnotesize}{}{}

\newcommand{\digilib}{\texttt{digilib}}

\title{Draft: World Wide digilib -- Resource Identifier in ECHO}
\author{Robert Casties\thanks{IT-Group, Max Planck Institute for the history of science}}
\date{Version~0.6 of \today}

\begin{document}
\maketitle
\tableofcontents

\section{Digital Resource Identifier DRI}

The \emph{Digital Resource Identifier} (DRI) is a globally unique identifier for a digital resource. The resource may be an electronic text, single or multiple digital images, an audiovisual media file or another type of electronic resource that is accessible over the Internet.

The identifier provides a stable point of reference for digital resources on the Internet. It is therefore independent of the address, implementation and directory layout of the location of the resource. The identifier is unique and constant, so it can be used in other documents to reference the resource without the risk of a broken reference in the future because the address or filename of the resource has changed.

The identifier supports an infrastructure for the ``sustainability'' of digital resources, which is meant to guarantee not only that the identifier always points to the same resource but also that the resource stays available on the Internet. The infrastructure supports backup copies and load balancing mechanisms. The implementation and enduring support of the actual servers and digital resources is in itself mostly an organisational and social challenge that cannot be solved by technological measures alone.
\subsection{Structure of the DRI}
\label{sec:structure-dri}

The \emph{Digital Resource Identifier} has the following properties:
\begin{itemize}
\item Total address space of 70 bit, partitioned into $2^{20}$ (about a million) subspaces of 50 bit each, for $2^{50} \approx 1.1 \times 10^{15}$ (about 1126 trillion) different resources per subspace.
\item The identifier contains only (uppercase) letters and digits.
\item The identifier is composed of a 4 character \emph{subspace} or \emph{namespace identifier}, a 10 character \emph{resource address} and a 1 character checksum, giving a total of 15 characters for the full DRI.
\end{itemize}

\subsection{Character set}
\label{sec:charset}

The identifier is composed only of letters and digits. Uppercase and lowercase letters are not distinguished. The resulting character set has $26+10=36$ characters. Four characters with ambiguous shapes that might lead to errors are omitted: ``O'' (vs. ``0''), ``I'' (vs. ``1'' or ``l''), ``L'' (vs. ``1'' or ``I''), and ``J'' (vs. ``1'' or ``I''). The remaining set of 32 characters can be used to represent 5 bit of information per character.

\begin{table}[htbp]
  \centering
  \begin{footnotesize}
    \begin{tabular}{cc|cc|cc|cc}
      character & value & character & value & character & value & character & value \\
      \hline
      0 & 0 & A & 10 & N & 20 & Y & 30 \\
      1 & 1 & B & 11 & P & 21 & Z & 31 \\
      2 & 2 & C & 12 & Q & 22 \\
      3 & 3 & D & 13 & R & 23 \\
      4 & 4 & E & 14 & S & 24 \\
      5 & 5 & F & 15 & T & 25 \\
      6 & 6 & G & 16 & U & 26 \\
      7 & 7 & H & 17 & V & 27 \\
      8 & 8 & K & 18 & W & 28 \\
      9 & 9 & M & 19 & X & 29 \\
    \end{tabular}
  \end{footnotesize}
  \caption{Character set for identifier}
  \label{tab:chartable}
\end{table}

The 50 bit of the chosen address for the resource is divided into ten pieces of 5 bit. Each piece is encoded as one character according to the character table in table~\ref{tab:chartable}. The resulting string of 10 characters is called the \emph{resource address}.
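As a minimal sketch of this encoding, the following Python fragment maps a 50 bit integer to its 10 character resource address and back. The function names are hypothetical, and emitting the most significant 5 bit piece first is an assumption; the text above does not fix the order of the pieces.

```python
# Character table of the document: digits, then letters without I, J, L and O.
ALPHABET = "0123456789ABCDEFGHKMNPQRSTUVWXYZ"


def encode_resource_address(address: int) -> str:
    """Encode a 50 bit integer as 10 characters (assumed: most significant piece first)."""
    if not 0 <= address < 2 ** 50:
        raise ValueError("address must fit into 50 bit")
    pieces = []
    for shift in range(45, -1, -5):  # ten pieces of 5 bit each
        pieces.append(ALPHABET[(address >> shift) & 0x1F])
    return "".join(pieces)


def decode_resource_address(text: str) -> int:
    """Inverse of encode_resource_address; lowercase input is accepted."""
    text = text.upper()
    if len(text) != 10:
        raise ValueError("a resource address has exactly 10 characters")
    value = 0
    for char in text:
        # str.index raises ValueError for the omitted letters I, J, L and O.
        value = (value << 5) | ALPHABET.index(char)
    return value
```

Because upper and lowercase letters are not distinguished, the decoder normalises its input to uppercase before the table lookup.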
\subsection{Namespaces}
\label{sec:namespaces}

The total address space of 70 bit is divided into $2^{20}$ (1048576) subspaces of 50 bit. These subspaces, also called namespaces, can be assigned to institutions that wish to implement their own allocation of resource identifiers for reasons of efficiency and maintenance. All resulting resource identifiers are only valid once they are registered with the central \emph{resource registry}.

Each subspace is identified by a four-character \emph{namespace identifier}. The 10 character \emph{resource address} is prefixed with the \emph{namespace identifier}, resulting in a 14 character \emph{unique address} for each resource.

Subspaces and their namespace identifiers are registered by the central resource registry. An institution or project that wishes to implement its own allocation of resource identifiers contacts the resource registry and receives a namespace identifier for a currently unused subspace. The subspace is then marked as being used by this institution or project. New resource identifiers in this subspace can only be assigned by the institution or project that owns the subspace.

The central resource registry allocates and registers resource identifiers for institutions, projects and individuals that do not want to maintain their own subspace. Resource identifiers allocated by the central resource registry are in the \texttt{ECHO} namespace.

The namespaces \texttt{0000}, \texttt{TEMP} and \texttt{ECHO} are reserved for use with the central resource registry.

\subsection{Checksum}
\label{sec:checksum}

A checksum of one character (5 bit) is calculated over the 14 characters (70 bit) of the \emph{unique address}. The checksumming method is similar to the method used for the ISBN (International Standard Book Number). The differences are the number system, which is base-32 for the DRI (ISBN: base-10), and the modulus, which is 31 for the DRI (ISBN: 11).
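The ISBN-like scheme can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the function names are hypothetical, positions are counted from the left starting at 1, and the weights 1 to 14 are taken from the checksum formula of this subsection. The example uses the reserved \texttt{TEMP} namespace because table~\ref{tab:chartable} assigns no value to the letter ``O''.

```python
# Character table of the document: digits, then letters without I, J, L and O.
ALPHABET = "0123456789ABCDEFGHKMNPQRSTUVWXYZ"


def checksum_char(unique_address: str) -> str:
    """Checksum character for a 14 character unique address."""
    unique_address = unique_address.upper()
    if len(unique_address) != 14:
        raise ValueError("a unique address has exactly 14 characters")
    # Weighted sum of the character values; position i counts from 1 (assumed: from the left).
    total = sum(
        i * ALPHABET.index(char)
        for i, char in enumerate(unique_address, start=1)
    )
    return ALPHABET[total % 31]


def make_dri(unique_address: str) -> str:
    """Append the checksum character to form the full 15 character DRI."""
    return unique_address.upper() + checksum_char(unique_address)


def is_valid_dri(dri: str) -> bool:
    """True if the last character matches the checksum of the first 14."""
    dri = dri.upper()
    if len(dri) != 15:
        return False
    try:
        return checksum_char(dri[:14]) == dri[14]
    except ValueError:  # character without a value in the table
        return False


print(make_dri("TEMP00001A2B3C"))  # prints TEMP00001A2B3CD
```

For \texttt{TEMP00001A2B3C} the weighted sum is 664, and 664 modulo 31 is 13, which the character table maps to ``D''.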
The checksum number is calculated with the formula
\begin{displaymath}
  c = \left( \sum_{i=1}^{14} i \, x_i \right) \bmod 31
\end{displaymath}
where $x_i$ is the value of the $i$-th character of the unique address according to table~\ref{tab:chartable}. The resulting checksum number $c$ is converted to a character according to table~\ref{tab:chartable} and appended to the end of the \emph{unique address}, giving the full \emph{Digital Resource Identifier}.

The DRI is only valid if the checksum calculated over the unique address part of the identifier (the first 14 characters) matches the checksum value (the last character).

\section{Central resource registry}
\label{sec:central-registry}

The central resource registry is the keystone in the concept of stable and sustainable digital resource identifiers and references. Resources can be moved and renamed on local servers or duplicated onto other servers, and servers can even be shut down (provided the resource has been duplicated), without resources getting lost and without breaking links or references to them.

The resource registry server acts as a switchboard between the user requests for a resource and the local servers providing the resource. URLs and other so-called ``global'' references to a resource via its DRI access the resource registry server, which dispatches the request to the local server. In this way only the address of the resource registry server has to remain stable.

This places a high burden of availability on the registry server. This challenge can be met on a technical level with standard technology (transparent replication and load balancing) and scaled to higher performance levels when the demand rises. More importantly, a durable solution for running the server has to be established on the organizational and social level.

The resource registry maintains the mapping database between the digital resource identifiers and the locations of the resources on the local servers. In this way it has a list of all known resource identifiers and ensures that all resource identifiers are unique.
The database on the resource registry server can additionally store a minimal set of metadata on the resources and provide searches in this metadata. One item of this minimal metadata should be a URL to further information on the resource.

The resource registry server provides an HTTP redirect function for transparent HTTP access to resources and optionally other web service access (XML-RPC, SOAP). Special client software for accessing resources can harvest and cache DRI mappings from the central registry for short periods to improve performance or to support offline work.

As mentioned in section~\ref{sec:namespaces}, parts of the resource identifier address space can be assigned to institutions or projects to implement their own allocation of resource identifiers. These identifiers are generally valid only after they have been registered with the central resource registry. The central resource registry remains the only authoritative source of digital resource identifiers and their mapping to local resources.

The resource registry provides interfaces to
\begin{itemize}
\item redirect HTTP requests with resource identifiers to local resource servers
\item query the mapping of resource identifiers using a web service interface
\item hand out new resource identifiers and acquire the necessary mapping information
\item change resource mapping information or resource metadata
\item query the database for metadata
\item upload sets of externally allocated resource identifiers
\item download sets of identifiers or the whole database for caching purposes.
\end{itemize}

\subsection{Handling of digital resource identifiers in HTTP requests}
\label{sec:dri-resolution-http}

A global HTTP request usually accesses a digital resource via some kind of display tool (for example \digilib{}) that is able to render a web representation of the resource.
While the resource identifier is embedded in the DRI part of the URL, other aspects of the rendering (for example which tool to use) are embedded in other parts of the URL that may be specific to the display tool. Therefore the registry server has to treat URLs differently depending on the display tool.

The handling of HTTP requests has three steps:
\begin{enumerate}
\item Identification of the DRI in the request string.
\item Lookup of additional information on the handling of the request based on the DRI.
\item Redirection of the client to the local resource server.
\end{enumerate}

The first part of the treatment of the URL is the identification of the DRI in the HTTP request string. Three basic ways of handling the DRI are envisaged:
\begin{itemize}
\item The DRI can be embedded as part of the URI path\footnote{The first part of the URI path, separated by slashes, that is a valid DRI string.} (\url{http://driserver.echo.eu/dri/ECHO00001A2B3CX}),
\item it can be provided as a special HTTP GET or POST parameter for a defined environment like \digilib{}\footnote{The environment itself should be identified by the first parts of the URI path.} (\url{http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5}) or
\item it can be extracted from the request by a generic pattern matching scheme (this option is computationally the most expensive).
\end{itemize}

Once the DRI is identified, more information about the resource can be looked up in the central resource database. From this point on the redirection of the request can be handled differently depending on the record type information in the database.

An extensible set of URL rewrite rules will be implemented by the server. The type of rule to be used is part of the resource record of the DRI in the central resource registry.
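The three ways of identifying the DRI listed above might be tried in order, as in the following Python sketch. The function name \texttt{find\_dri} and the deliberately loose pattern of 15 letters or digits are assumptions made for illustration; the checksum is not verified here, and a real server would first select the method from the first parts of the URI path.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Loose shape check only (assumption): 15 letters or digits, no checksum test.
DRI_SHAPE = re.compile(r"[0-9A-Za-z]{15}")
# Same shape, but not embedded in a longer run of letters or digits.
DRI_ANYWHERE = re.compile(r"(?<![0-9A-Za-z])[0-9A-Za-z]{15}(?![0-9A-Za-z])")


def find_dri(request_url: str):
    """Return the DRI found in the request URL in uppercase, or None."""
    parts = urlsplit(request_url)

    # 1. DRI embedded in the URI path: first path segment shaped like a DRI.
    for segment in parts.path.split("/"):
        if DRI_SHAPE.fullmatch(segment):
            return segment.upper()

    # 2. DRI given as the request parameter "dri".
    for value in parse_qs(parts.query).get("dri", []):
        if DRI_SHAPE.fullmatch(value):
            return value.upper()

    # 3. Generic pattern matching over the whole request string.
    match = DRI_ANYWHERE.search(request_url)
    return match.group(0).upper() if match else None


print(find_dri("http://driserver.echo.eu/dri/ECHO00001A2B3CX"))
print(find_dri("http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5"))
```

Both calls print \texttt{ECHO00001A2B3CX}. The generic search comes last because it has to scan the whole request string.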
The following rules should be part of the first implementation of the registry server:
\begin{description}
\item[redirect] only the host part of the URL is replaced by the local host name from the resource record.
\item[replace] the full URL is replaced by the local URL from the resource record.
\item[\digilib{}] the host part of the URL is replaced by the local host name from the resource record and the remaining part is replaced according to \digilib{} rules.
\item[rewrite] the host part of the URL is replaced by the local host name from the resource record and the remaining part is replaced according to generic substitution rules with wildcard patterns.
\end{description}

The introduction of other specialized types of rewrite rules can be implemented as extension modules to the resource server.

\subsubsection{Redirect and replace type DRI resolution}
\label{sec:redirect-type-dri}

When a DRI resource record has a resolution type of ``redirect'', then only the host part of the URL is replaced in the redirected request by the local host given in the resource record. See table~\ref{tab:redirect-resolv}.

\begin{table}[htbp]
  \centering
  \begin{tabular}{lp{0.7\textwidth}}
    incoming request & \url{http://driserver.echo.eu/dri/ECHO00001A2B3CX} \\
    \texttt{local\_host} record & \texttt{penelope.unibe.ch} \\
    redirect request & \url{http://penelope.unibe.ch/dri/ECHO00001A2B3CX}
  \end{tabular}
  \caption{redirect type DRI resolution}
  \label{tab:redirect-resolv}
\end{table}

When a DRI resource record has a resolution type of ``replace'', then the whole URL is replaced in the redirected request by the local URL given in the resource record. See table~\ref{tab:replace-resolv}.
\begin{table}[htbp]
  \centering
  \begin{tabular}{lp{0.7\textwidth}}
    incoming request & \url{http://driserver.echo.eu/dri/ECHO00001A2B3CX} \\
    \texttt{local\_url} record & \url{http://penelope.unibe.ch/docuserver/compago/compare.pl?32} \\
    redirect request & \url{http://penelope.unibe.ch/docuserver/compago/compare.pl?32}
  \end{tabular}
  \caption{replace type DRI resolution}
  \label{tab:replace-resolv}
\end{table}

\subsubsection{\digilib{} type DRI resolution}
\label{sec:digilib-type-dri}

When a DRI resource record has a resolution type of ``\digilib{}'', then the host part of the URL is replaced by the local host in the resource record and the remaining part is replaced according to the \digilib{} parameter format.

In the preferred parameter-style format the DRI is given as the parameter ``dri''. The local URL for the redirect is constructed by replacing the URI path up to the ``?'' with the digilib path from the resource record and adding a local filename as parameter ``fn''. See table~\ref{tab:digilib-resolv}.

\begin{table}[htbp]
  \centering
  \begin{tabular}{lp{0.7\textwidth}}
    incoming request & \url{http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5} \\
    \texttt{local\_host} record & \texttt{penelope.unibe.ch} \\
    \texttt{digilib\_path} record & \texttt{/docuserver/digitallibrary/digilib.jsp} \\
    \texttt{digilib\_file} record & \texttt{public/Beispiele} \\
    redirect request & \url{http://penelope.unibe.ch/docuserver/digitallibrary/digilib.jsp?dri=ECHO00001A2B3CX&fn=public/Beispiele&pn=5}
  \end{tabular}
  \caption{digilib type DRI resolution}
  \label{tab:digilib-resolv}
\end{table}

In the deprecated plus-style format the DRI could be placed as the first part of the parameter path, prefixed with ``dri:''. In the local URL the local pathname is appended to the DRI part.
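The redirect, replace and parameter-style \digilib{} resolutions can be sketched in Python as follows. The function name \texttt{resolve} and the representation of a resource record as a dictionary are assumptions for illustration; the deprecated plus-style format is not covered.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def resolve(request_url: str, record: dict) -> str:
    """Build the redirect URL for a request from its resource record (sketch)."""
    parts = urlsplit(request_url)
    kind = record["record_type"]

    if kind == "redirect":
        # Only the host part changes; path and query are kept.
        return urlunsplit(
            (parts.scheme, record["local_host"], parts.path, parts.query, "")
        )

    if kind == "replace":
        # The whole URL is replaced by the stored local URL.
        return record["local_url"]

    if kind == "digilib":
        # Host and path come from the record; "fn" is added right after "dri".
        # A request without a "dri" parameter gets no "fn" in this sketch.
        params = []
        for name, value in parse_qsl(parts.query):
            params.append((name, value))
            if name == "dri":
                params.append(("fn", record["digilib_file"]))
        query = urlencode(params, safe="/")  # keep "/" readable in the file path
        return urlunsplit(
            (parts.scheme, record["local_host"], record["digilib_path"], query, "")
        )

    raise ValueError("unsupported record type: %r" % kind)


record = {
    "record_type": "digilib",
    "dri": "ECHO00001A2B3CX",
    "local_host": "penelope.unibe.ch",
    "digilib_path": "/docuserver/digitallibrary/digilib.jsp",
    "digilib_file": "public/Beispiele",
}
print(resolve(
    "http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5",
    record,
))
```

With the record values from table~\ref{tab:digilib-resolv}, the call prints the redirect request shown in the last row of that table.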
\subsubsection{Rewrite type DRI resolution}
\label{sec:rewrite-type-dri}

When a DRI resource record has a resolution type of ``rewrite'', then the host part of the URL is replaced by the local host name from the resource record and the remaining part is replaced according to generic substitution rules with wildcard patterns.

\subsection{Handling of digital resource identifiers as a web service}
\label{sec:handl-dri-web}

The basic function of resolving a DRI, as well as other maintenance functions like the registration of new DRIs or the download of some or all registered DRI mappings, should also be accessible through a web service interface. Specifications for the web service interface have yet to be established.

\section{Resource metadata}
\label{sec:resource-metadata}

The set of metadata about a resource that is stored on the resource server is called a \emph{resource record}.

Since the requirements of access, structure and amount of metadata for different projects can hardly be generalized, the resource server stores only a minimal set of fields that is sufficient for the basic functions of access to the resource, sustainability of access, and interoperability.

More extensive and project-specific metadata sets should be stored and maintained on external servers. The optional resource information field can be used to point to external metadata representations.

\subsection{Basic metadata}
\label{sec:basic-metadata}

The amount of metadata depends on the type of resource record. Common to all records is the \texttt{dri} field for the resource identifier. Redirect-type records require an additional \texttt{local\_host} field for the host name of the local host. Replace-type records require a \texttt{local\_url} field for a full URL. Digilib-type records require at least the three fields \texttt{local\_host}, \texttt{digilib\_path}, and \texttt{digilib\_file}, and accept an optional field \texttt{digilib\_pageno}.
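The per-type field requirements can be expressed as a small lookup table, sketched here in Python. The names \texttt{REQUIRED\_FIELDS} and \texttt{missing\_fields} are hypothetical, and records are assumed to be plain dictionaries; rewrite-type records are left out because their fields are not specified in this document.

```python
# Fields every record carries, as described in this subsection.
COMMON_FIELDS = {"record_type", "dri"}

# Additional required fields per record type.
REQUIRED_FIELDS = {
    "redirect": {"local_host"},
    "replace": {"local_url"},
    "digilib": {"local_host", "digilib_path", "digilib_file"},
}


def missing_fields(record: dict) -> list:
    """Return the sorted names of required fields that are absent or empty."""
    kind = record.get("record_type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError("unknown record type: %r" % kind)
    required = COMMON_FIELDS | REQUIRED_FIELDS[kind]
    return sorted(name for name in required if not record.get(name))


incomplete = {
    "record_type": "digilib",
    "dri": "ECHO00001A2B3CX",
    "local_host": "penelope.unibe.ch",
}
print(missing_fields(incomplete))  # prints ['digilib_file', 'digilib_path']
```

The optional \texttt{digilib\_pageno} field is simply not checked, so records with or without it pass.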
The basic fields can be found in table~\ref{tab:basic-meta}.

\begin{table}[htbp]
  \centering
  \begin{tabular}{lr|l}
    type & field & description \\
    \hline
    \textbf{redirect} & & \\
    & \texttt{record\_type} & type of record (``redirect'') \\
    & \texttt{dri} & DRI \\
    & \texttt{local\_host} & local host name \\
    \hline
    \textbf{replace} & & \\
    & \texttt{record\_type} & type of record (``replace'') \\
    & \texttt{dri} & DRI \\
    & \texttt{local\_url} & full local URL \\
    \hline
    \textbf{digilib} & & \\
    & \texttt{record\_type} & type of record (``digilib'') \\
    & \texttt{dri} & DRI \\
    & \texttt{local\_host} & local digilib server \\
    & \texttt{digilib\_path} & URI path of the digilib installation \\
    & \texttt{digilib\_file} & digilib path name (parameter fn) \\
    & \texttt{digilib\_pageno} & optional page number (parameter pn)
  \end{tabular}
  \caption{Basic metadata fields}
  \label{tab:basic-meta}
\end{table}

The resource server may implement additional fields like owner and group fields for internal management and user access functions.

\subsection{Alternate server and backup server}
\label{sec:redund-serv-back}

The resource server architecture is designed to fulfill high demands on the performance and sustainability of access to the resources. These demands can be met by a loosely coupled network of local servers that duplicate content for backup and transparently share concurrent access to resources for enhanced performance.

Backup server fields give the names and paths of servers that provide copies of the resource. Requests for the resource are diverted to a backup server when the original server becomes unavailable.

Alternate server fields give the names and paths of servers that provide copies of the resource. Requests for a resource are spread among all alternate servers for the same resource according to a load-balancing pattern.
The pattern can be a simple round-robin scheme or a more sophisticated scheme based on server performance or the geographical location of client and server.

A resource record can have any number of backup server and alternate server fields. Whether a resource is required to have at least one backup server is a policy decision of the hosting project that is not enforced by the resource server.

\subsection{Additional resource information}
\label{sec:addt-reso-inform}

The resource server itself carries only minimal metadata on a resource, but it provides a basic mechanism to store and access more extensive information on external servers.

Every resource record can have a resource info URL that is stored in the \texttt{info-url} field.

\begin{table}[htbp]
  \centering
  \begin{tabular}{l|l}
    field & description \\
    \hline
    \texttt{info-url} & URL to external information
  \end{tabular}
  \caption{External resource information}
  \label{tab:extern-reso-inform}
\end{table}

The external resource information can be accessed in a standardized way on the resource server through a URL in which the DRI of the resource is part of the URI path: \url{http://driserver.echo.eu/resinfo/ECHO00001A2B3CX/}

Requests to this URL will be redirected to the URL in the \texttt{info-url} field of the resource record.

\end{document}