Annotation of worldwide-digilib/worldwide-digilib.tex, revision 1.1.1.1

1.1       casties     1: \documentclass[a4paper]{article}
                      2: 
                      3: \usepackage[latin1]{inputenc}
                      4: \usepackage[T1]{fontenc}
                      5: \usepackage{ae}
                      6: 
                      7: \usepackage{url}
                      8: %\usepackage{hyperref}
                      9: 
                     10: %% for latex2rtf :-(
                     11: % remember to replace "%" in URL!
                     12: %\newcommand{\url}[1]{\verb!#1!}
                     13: %\renewenvironment{footnotesize}{}{}
                     14: 
                     15: 
                     16: \newcommand{\digilib}{\texttt{digilib}}
                     17: 
                     18: \title{Draft: World Wide digilib -- Resource Identifier in ECHO}
                     19: 
                     20: \author{Robert Casties\thanks{IT-Group, Max Planck Institute for the
                     21:     History of Science}} 
                     22: 
                     23: \date{Version~0.6 of \today}
                     24: 
                     25: \begin{document}
                     26: 
                     27: \maketitle
                     28: 
                     29: \tableofcontents
                     30: 
                     31: \section{Digital Resource Identifier DRI}
                     32: 
                     33: The \emph{Digital Resource Identifier} is a worldwide unique
                     34: identifier for a digital resource. The resource may be an electronic
                     35: text, single or multiple digital images, an audiovisual media file or
                     36: other type of electronic resource that is accessible over the
                     37: Internet.
                     38: 
                     39: The identifier provides a stable point of reference for digital
                     40: resources on the Internet. It is therefore independent of the
                     41: address, implementation and directory layout of the location of the
                     42: resource. The identifier is unique and constant, so it can be used in
                     43: other documents to reference the resource without the risk that the
                     44: reference breaks in the future because the address or filename of the
                     45: resource has changed.
                     46: 
                     47: The identifier supports an infrastructure for the ``sustainability'' of
                     48: digital resources that is meant to guarantee not only that the identifier
                     49: always points to the same resource but also that the resource stays
                     50: available on the Internet. The infrastructure supports backup copies and load
                     51: balancing mechanisms. The implementation and enduring support of the
                     52: actual servers and digital resources is in itself mostly an
                     53: organisational and social challenge that cannot be solved by
                     54: technological measures alone.
                     55: 
                     56: 
                     57: 
                     58: \subsection{Structure of the DRI}
                     59: \label{sec:structure-dri}
                     60: 
                     61: The \emph{Digital Resource Identifier} has the following properties:
                     62: 
                     63: \begin{itemize}
                     64: \item Total address space of 70 bits, partitioned into about a million
                     65:   ($2^{20}$) subspaces of 50 bits each, i.e.\ $2^{50}$ or about
                     66:   $1.1 \times 10^{15}$ different resources per subspace.
                     67: 
                     68: \item The identifier contains only (uppercase) letters and digits.
                     69:   
                     70: \item The identifier is composed of a 4 character \emph{subspace} or
                     71:   \emph{namespace identifier}, a 10 character \emph{resource
                     72:     address} and a 1 character checksum, giving a total of 15
                     73:   characters for the full DRI.
                     74: \end{itemize}
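
As a quick check of these figures, and assuming only the 5 bits per
character that are derived in section~\ref{sec:charset}, the sizes of the
unique address (in bits), of the set of namespaces and of one namespace
work out as follows:
\begin{displaymath}
  \underbrace{4 \cdot 5}_{\mathrm{namespace}} +
  \underbrace{10 \cdot 5}_{\mathrm{address}} = 70, \qquad
  2^{20} = 1\,048\,576, \qquad
  2^{50} \approx 1.126 \times 10^{15}
\end{displaymath}
The one-character checksum adds a further 5 bits that carry no address
information.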
                     75: 
                     76: 
                     77: 
                     78: 
                     79: \subsection{Character set}
                     80: \label{sec:charset}
                     81: 
                     82: The identifier is composed only of letters and digits. Uppercase and
                     83: lowercase letters are not distinguished. The resulting character set
                     84: has $26+10=36$ characters. Four characters with ambiguous shapes that
                     85: might lead to errors are omitted: ``O'' (vs.\ ``0''), ``I'' (vs.\ ``1''
                     86: or ``l''), ``L'' (vs.\ ``1'' or ``I''), and ``J'' (vs.\ ``1'' or ``I'').
                     87: The remaining set of 32 characters can represent 5 bits of information
                     88: per character.
                     89: 
                     90: \begin{table}[htbp]
                     91:   \centering
                     92:   \begin{footnotesize}
                     93:   \begin{tabular}{cc|cc|cc|cc}
                     94:     character & value & character & value & character & value &
                     95:     character & value \\ \hline
                     96:     0 & 0  & A & 10 & N & 20 & Y & 30 \\
                     97:     1 & 1  & B & 11 & P & 21 & Z & 31 \\
                     98:     2 & 2  & C & 12 & Q & 22 \\
                     99:     3 & 3  & D & 13 & R & 23 \\
                    100:     4 & 4  & E & 14 & S & 24 \\
                    101:     5 & 5  & F & 15 & T & 25 \\
                    102:     6 & 6  & G & 16 & U & 26 \\
                    103:     7 & 7  & H & 17 & V & 27 \\
                    104:     8 & 8  & K & 18 & W & 28 \\
                    105:     9 & 9  & M & 19 & X & 29 \\
                    106:   \end{tabular}
                    107:   \end{footnotesize}
                    108:   \caption{Character set for identifier}
                    109:   \label{tab:chartable}
                    110: \end{table}
                    111: 
                    112: The 50 bits of the chosen address for the resource are divided into ten
                    113: pieces of 5 bits. Each piece is encoded as one character
                    114: according to the character table in table~\ref{tab:chartable}. The
                    115: resulting string of 10 characters is called the \emph{resource
                    116:   address}.
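
The encoding step described above can be sketched in a few lines of Python.
This is only an illustration, not part of the specification: the function
names are hypothetical, and the sketch assumes that the most significant
5-bit piece becomes the first character, which the text does not state.

```python
# Character table from the text: digits, then letters without I, J, L and O.
ALPHABET = "0123456789ABCDEFGHKMNPQRSTUVWXYZ"


def encode_address(address: int) -> str:
    """Encode a 50-bit integer as a 10-character resource address."""
    if not 0 <= address < 2**50:
        raise ValueError("address must fit into 50 bits")
    # Assumption: the most significant 5-bit piece becomes the first character.
    return "".join(
        ALPHABET[(address >> shift) & 0b11111] for shift in range(45, -1, -5)
    )


def decode_address(text: str) -> int:
    """Inverse of encode_address; upper and lower case are not distinguished."""
    if len(text) != 10:
        raise ValueError("a resource address has exactly 10 characters")
    value = 0
    for ch in text.upper():
        # ALPHABET.index raises ValueError for the omitted letters I, J, L, O.
        value = (value << 5) | ALPHABET.index(ch)
    return value


print(encode_address(1000))  # 1000 = 31 * 32 + 8, so the last two characters are Z8
```

Decoding accepts lowercase input because the text does not distinguish
upper and lower case; the four omitted letters are rejected rather than
silently mapped to digits.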
                    117: 
                    118: 
                    119: 
                    120: 
                    121: \subsection{Namespaces}
                    122: \label{sec:namespaces}
                    123: 
                    124: The total address space of 70 bits is divided into $2^{20}$ (1048576)
                    125: subspaces of 50 bits each. These subspaces, also called namespaces, can be
                    126: assigned to institutions that wish to implement their own allocation
                    127: of resource identifiers for reasons of efficiency and maintenance. All
                    128: resulting resource identifiers are only valid once they are registered
                    129: with the central \emph{resource registry}.
                    130: 
                    131: Each subspace is identified by a four-character \emph{namespace
                    132:   identifier}. The 10 character \emph{resource address} is
                    133: prefixed with the \emph{namespace identifier}, resulting in a 14
                    134: character \emph{unique address} for each resource.
                    135: 
                    136: Subspaces and their namespace identifiers are registered by the
                    137: central resource registry. An institution or project that wishes to
                    138: implement its own allocation of resource identifiers contacts the
                    139: resource registry and receives a namespace identifier for a currently
                    140: unused subspace. The subspace is then marked as being used by this
                    141: institution or project. New resource identifiers in this subspace can
                    142: only be assigned by the institution or project that owns the subspace.
                    143: 
                    144: The central resource registry allocates and registers resource
                    145: identifiers for institutions, projects and individuals that do not
                    146: want to maintain their own subspace. Resource identifiers allocated by
                    147: the central resource registry are in the \texttt{ECHO} namespace.
                    148: 
                    149: The namespaces \texttt{0000}, \texttt{TEMP} and \texttt{ECHO} are
                    150: reserved for use with the central resource registry.
                    151: 
                    152: 
                    153: \subsection{Checksum}
                    154: \label{sec:checksum}
                    155: 
                    156: A checksum of one character (5 bits) is calculated over the 14
                    157: characters (70 bits) of the \emph{unique address}. The checksumming method is
                    158: similar to the method used for ISBN (International Standard Book
                    159: Number). The differences are the number system, which is base-32 for
                    160: the DRI (ISBN: base-10), and the modulus, which is 31 for the DRI
                    161: (ISBN: 11).
                    162: 
                    163: The checksum number $c$ is calculated from the values $x_i$ of the 14 characters (table~\ref{tab:chartable}) with the formula
                    164: \begin{displaymath}
                    165:   c = \Bigl( \sum_{i=1}^{14} i \, x_i \Bigr) \bmod 31
                    166: \end{displaymath}
                    167: 
                    168: The resulting checksum number $c$ is converted to a character
                    169: according to table~\ref{tab:chartable} and appended to the end of the
                    170: \emph{unique address} giving the full \emph{Digital Resource
                    171:   Identifier}.
                    172: 
                    173: The DRI is only valid if the checksum calculated over the unique
                    174: address part of the identifier (the first 14 characters) matches the
                    175: checksum value (the last character).
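
A minimal Python sketch of the checksum calculation and the validity test
follows. It is an illustration under stated assumptions, not a reference
implementation: the function names are hypothetical, and the weight $i$ is
assumed to count the characters of the unique address from the left. The
example uses the reserved \texttt{TEMP} namespace because all of its
letters appear in table~\ref{tab:chartable}.

```python
# Character table from the text: digits, then letters without I, J, L and O.
ALPHABET = "0123456789ABCDEFGHKMNPQRSTUVWXYZ"


def checksum_char(unique_address: str) -> str:
    """Return the checksum character for a 14-character unique address."""
    if len(unique_address) != 14:
        raise ValueError("a unique address has exactly 14 characters")
    # Assumption: position 1 is the leftmost character of the unique address.
    total = sum(
        position * ALPHABET.index(ch)
        for position, ch in enumerate(unique_address.upper(), start=1)
    )
    return ALPHABET[total % 31]


def is_valid_dri(dri: str) -> bool:
    """Check length, character set and checksum of a full 15-character DRI."""
    dri = dri.upper()
    if len(dri) != 15 or any(ch not in ALPHABET for ch in dri):
        return False
    return checksum_char(dri[:14]) == dri[14]


# Weighted sum for TEMP00001A2B3C is 664, and 664 mod 31 = 13, which maps to "D".
print(checksum_char("TEMP00001A2B3C"))
print(is_valid_dri("TEMP00001A2B3CD"))
```

Because the modulus is 31, the checksum value lies between 0 and 30, so the
character ``Z'' never occurs in the checksum position. As with the ISBN
scheme, the position weights let the check catch a swap of two adjacent
characters, for example \texttt{3C} becoming \texttt{C3} at the end of the
address.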
                    176: 
                    177: 
                    178: 
                    179: 
                    180: \section{Central resource registry}
                    181: \label{sec:central-registry}
                    182: 
                    183: The central resource registry is the keystone in the concept of stable
                    184: and sustainable digital resource identifiers and references. Resources
                    185: can be moved and renamed on local servers or duplicated onto other
                    186: servers, and servers can even be shut down (provided the resource has
                    187: been duplicated) without resources getting lost and without breaking
                    188: links or references to the resource.
                    189: 
                    190: The resource registry server acts as a switchboard between the user
                    191: requests for a resource and local servers providing the resource. URLs
                    192: and other so-called ``global'' references to a resource via its DRI
                    193: access the resource registry server that dispatches the request to the
                    194: local server. In this way only the resource registry server's address
                    195: has to remain stable.
                    196: 
                    197: This places a high burden of availability on the registry server. This
                    198: challenge can be met on a technical level with standard technology
                    199: (transparent replication and load balancing) and scaled to higher
                    200: performance levels when the demand rises. More importantly, a durable
                    201: solution has to be established on the organizational and social level
                    202: for running the server.
                    203: 
                    204: The resource registry maintains the mapping database between the
                    205: digital resource identifiers and the location of the resources on the
                    206: local servers. In this way it has a list of all known resource
                    207: identifiers and ensures that all resource identifiers are unique.
                    208: 
                    209: The database on the resource registry server can additionally store a
                    210: minimal set of metadata on the resources and provide
                    211: searches in this metadata. One item of this minimal metadata
                    212: should be a URL to further information on the resource.
                    213: 
                    214: The resource registry server provides an HTTP redirect function for
                    215: transparent HTTP access to resources and optionally other web service
                    216: access (XML-RPC, SOAP).
                    217: 
                    218: Special client software for accessing resources can harvest and cache
                    219: DRI mappings from the central registry for short periods to improve
                    220: performance or to support offline work. 
                    221: 
                    222: As mentioned in section~\ref{sec:namespaces}, parts of the resource
                    223: identifier address space can be assigned to institutions or projects
                    224: to implement their own allocation of resource identifiers. These
                    225: identifiers are generally valid only after they have been registered
                    226: with the central resource registry.
                    227: 
                    228: The central resource registry remains the only authoritative source of
                    229: digital resource identifiers and their mapping to local resources.
                    230: 
                    231: The resource registry provides interfaces to
                    232: 
                    233: \begin{itemize}
                    234: \item redirect HTTP requests with resource identifiers to local
                    235:   resource servers
                    236: 
                    237: \item query the mapping of resource identifiers using a web service
                    238:   interface
                    239: 
                    240: \item hand out new resource identifiers and acquire the necessary
                    241:   mapping information
                    242: 
                    243: \item change resource mapping information or resource meta information
                    244: 
                    245: \item query the database for meta information
                    246: 
                    247: \item upload sets of externally allocated resource identifiers
                    248: 
                    249: \item download sets of identifiers or the whole database for caching
                    250:   purposes.
                    251: \end{itemize}
                    252: 
                    253: 
                    254: 
                    255: \subsection{Handling of digital resource identifiers in HTTP
                    256:   requests}
                    257: \label{sec:dri-resolution-http}
                    258: 
                    259: A global HTTP request usually accesses a digital resource via some
                    260: kind of display tool (for example \digilib{}) that is able to render a
                    261: web representation of the resource. While the resource identifier is
                    262: embedded in the DRI part of the URL, other aspects of the rendering
                    263: (for example which tool to use) are embedded in other parts of the URL
                    264: that may be specific to the display tool. Therefore the registry
                    265: server has to treat URLs differently depending on the display tool.
                    266: 
                    267: The handling of HTTP requests has three steps:
                    268: \begin{enumerate}
                    269: \item Identification of the DRI in the request string.
                    270: 
                    271: \item Lookup of additional information on the handling of the request
                    272:   based on the DRI.
                    273: 
                    274: \item Redirect of the client to the local resource server.
                    275: \end{enumerate}
                    276: 
                    277: The first part of the treatment of the URL is the identification of
                    278: the DRI in the HTTP request string. Three basic ways of handling the
                    279: DRI are envisaged:
                    280: 
                    281: \begin{itemize}
                    282: \item The DRI can be embedded as part of the URI path\footnote{The
                    283:     first part of the URI path, separated by slashes, that is a valid
                    284:     DRI string.} (\url{http://driserver.echo.eu/dri/ECHO00001A2B3CX}),
                    285: 
                    286: \item it can be provided as a special HTTP GET or POST parameter for a
                    287:   defined environment like \digilib{}\footnote{The environment itself
                    288:     should be identified by the first parts of the URI path.}
                    289:   (\url{http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5})
                    290:   or
                    291:   
                    292: \item it can be extracted from the request by a generic pattern
                    293:   matching scheme (this option is the most computationally expensive).
                    294: \end{itemize}
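
The three ways of locating the DRI listed above could be tried in order, as
in the following Python sketch. It is only an illustration: the function
name is hypothetical, only GET parameters are handled, and a candidate is
accepted on its shape alone (15 letters or digits), whereas a real registry
server would also check the character table and the checksum.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Shape check only: exactly 15 letters or digits.
DRI_SHAPE = re.compile(r"[0-9A-Za-z]{15}")
# The same shape, found anywhere, but not as part of a longer alphanumeric run.
DRI_ANYWHERE = re.compile(r"(?<![0-9A-Za-z])[0-9A-Za-z]{15}(?![0-9A-Za-z])")


def find_dri(url: str) -> Optional[str]:
    """Return the DRI contained in a request URL, or None if there is none."""
    parts = urlsplit(url)
    # 1. The first part of the URI path that has the shape of a DRI.
    for segment in parts.path.split("/"):
        if DRI_SHAPE.fullmatch(segment):
            return segment.upper()
    # 2. The explicit "dri" request parameter.
    for value in parse_qs(parts.query).get("dri", []):
        if DRI_SHAPE.fullmatch(value):
            return value.upper()
    # 3. Generic pattern matching over path and query (most expensive option).
    match = DRI_ANYWHERE.search(parts.path + "?" + parts.query)
    return match.group(0).upper() if match else None


print(find_dri("http://driserver.echo.eu/dri/ECHO00001A2B3CX"))
print(find_dri("http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5"))
```

Trying the cheap, well-defined locations first keeps the generic pattern
match as a fallback for request formats the server does not know.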
                    295: 
                    296: Once the DRI is identified, more information about the resource can be
                    297: looked up in the central resource database. From this point on the
                    298: redirection of the request can be handled differently depending on the
                    299: record type information in the database.
                    300: 
                    301: An extensible set of URL rewrite rules will be implemented by the
                    302: server. The type of rule to be used is part of the resource record of
                    303: the DRI in the central resource registry. The following rules should
                    304: be part of the first implementation of the registry server:
                    305: 
                    306: \begin{description}
                    307: 
                    308: \item[redirect] only the host part of the URL is replaced by the local
                    309:   host name from the resource record.
                    310: 
                    311: \item[replace] the full URL is replaced by the local URL from the
                    312:   resource record.
                    313: 
                    314: \item[\digilib{}] the host part of the URL is replaced by the local host
                    315:   name from the resource record and the remaining part is replaced according
                    316:   to \digilib{} rules.
                    317: 
                    318: \item[rewrite] the host part of the URL is replaced by the local host
                    319:   name from the resource record and the remaining part is replaced according to
                    320:   generic substitution rules with wildcard patterns.
                    321: \end{description}
                    322: 
                    323: The introduction of other specialized types of rewrite rules can be
                    324: implemented as extension modules to the resource server.
                    325: 
                    326: 
                    327: 
                    328: \subsubsection{Redirect and replace type DRI resolution}
                    329: \label{sec:redirect-type-dri}
                    330: 
                    331: When a DRI resource record has a resolution type of ``redirect'', then
                    332: only the host part of the URL is replaced in the redirected request by
                    333: the local host given in the resource record. See
                    334: table~\ref{tab:redirect-resolv}.
                    335: 
                    336: \begin{table}[htbp]
                    337:   \centering
                    338:   \begin{tabular}{lp{0.7\textwidth}}
                    339:     incoming request & \url{http://driserver.echo.eu/dri/ECHO00001A2B3CX} \\
                    340:     \texttt{local\_host} record & \texttt{penelope.unibe.ch} \\
                    341:     redirect request & \url{http://penelope.unibe.ch/dri/ECHO00001A2B3CX}
                    342:   \end{tabular}
                    343:   \caption{redirect type DRI resolution}
                    344:   \label{tab:redirect-resolv}
                    345: \end{table}
                    346: 
                    347: When a DRI resource record has a resolution type of ``replace'', then
                    348: the whole URL is replaced in the redirected request by the local URL
                    349: given in the resource record. See table~\ref{tab:replace-resolv}.
                    350: 
                    351: \begin{table}[htbp]
                    352:   \centering
                    353:   \begin{tabular}{lp{0.7\textwidth}}
                    354:     incoming request & \url{http://driserver.echo.eu/dri/ECHO00001A2B3CX} \\
                    355:     \texttt{local\_url} record & \url{http://penelope.unibe.ch/docuserver/compago/compare.pl?32} \\
                    356:     redirect request & \url{http://penelope.unibe.ch/docuserver/compago/compare.pl?32}
                    357:   \end{tabular}
                    358:   \caption{replace type DRI resolution}
                    359:   \label{tab:replace-resolv}
                    360: \end{table}
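
The two resolution types above amount to very little code. The following
Python sketch reproduces the examples from tables~\ref{tab:redirect-resolv}
and~\ref{tab:replace-resolv}; the function name and the representation of a
resource record as a dictionary are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve(request_url: str, record: dict) -> str:
    """Return the URL the client is redirected to for a given resource record."""
    record_type = record["record_type"]
    if record_type == "redirect":
        # Only the host part changes; path and query are kept as they are.
        parts = urlsplit(request_url)
        return urlunsplit(parts._replace(netloc=record["local_host"]))
    if record_type == "replace":
        # The whole URL is taken from the record.
        return record["local_url"]
    raise ValueError("unsupported record type: " + record_type)


incoming = "http://driserver.echo.eu/dri/ECHO00001A2B3CX"
print(resolve(incoming, {
    "record_type": "redirect",
    "dri": "ECHO00001A2B3CX",
    "local_host": "penelope.unibe.ch",
}))
```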
                    361: 
                    362: 
                    363: 
                    364: \subsubsection{\digilib{} type DRI resolution}
                    365: \label{sec:digilib-type-dri}
                    366: 
                    367: When a DRI resource record has a resolution type of ``\digilib{}'', then
                    368: the host part of the URL is replaced by the local host in the resource
                    369: record and the remaining part is replaced according to \digilib{}
                    370: parameter format.
                    371: 
                    372: In the preferred parameter-style format the DRI is given as the
                    373: parameter ``dri''. The local URL for the redirect is constructed by
                    374: replacing the URI path up to the ``?'' with the digilib path from the
                    375: resource record and adding a local filename as parameter ``fn''. See
                    376: table~\ref{tab:digilib-resolv}.
                    377: 
                    378: \begin{table}[htbp]
                    379:   \centering
                    380:   \begin{tabular}{lp{0.7\textwidth}}
                    381:     incoming request &
                    382:     \url{http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5} \\
                    383:     \texttt{local\_host} record & \texttt{penelope.unibe.ch} \\
                    384:     \texttt{digilib\_path} record & \texttt{/docuserver/digitallibrary/digilib.jsp} \\
                    385:     \texttt{digilib\_file} record & \texttt{public/Beispiele} \\
                    386:     redirect request &
                    387:     \url{http://penelope.unibe.ch/docuserver/digitallibrary/digilib.jsp?dri=ECHO00001A2B3CX&fn=public/Beispiele&pn=5} 
                    388:   \end{tabular}
                    389:   \caption{digilib type DRI resolution}
                    390:   \label{tab:digilib-resolv}
                    391: \end{table}
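
The parameter-style rewrite shown in table~\ref{tab:digilib-resolv} can be
sketched in Python as follows. The function name and the dictionary
representation of the resource record are assumptions, and the optional
\texttt{digilib\_pageno} field is left out because the text does not say how
it interacts with a page number given in the request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def resolve_digilib(request_url: str, record: dict) -> str:
    """Rewrite a parameter-style digilib request for the local server."""
    parts = urlsplit(request_url)
    query = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        query.append((name, value))
        if name == "dri":
            # Add the local file name as parameter "fn" right after the DRI.
            query.append(("fn", record["digilib_file"]))
    return urlunsplit((
        parts.scheme,
        record["local_host"],       # replaces the registry host
        record["digilib_path"],     # replaces the URI path up to the "?"
        urlencode(query, safe="/"),  # keep "/" readable in the file name
        "",
    ))


print(resolve_digilib(
    "http://driserver.echo.eu/digilib/digilib.jsp?dri=ECHO00001A2B3CX&pn=5",
    {
        "record_type": "digilib",
        "dri": "ECHO00001A2B3CX",
        "local_host": "penelope.unibe.ch",
        "digilib_path": "/docuserver/digitallibrary/digilib.jsp",
        "digilib_file": "public/Beispiele",
    },
))
```

All other request parameters, such as \texttt{pn}, are passed through
unchanged and in their original order.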
                    392: 
                    393: In the deprecated plus-style format the DRI could be placed as the first
                    394: part of the parameter path, prefixed with ``dri:''. In the local URL
                    395: the local pathname is appended to the DRI part.
                    396: 
                    397: 
                    398: \subsubsection{Rewrite type DRI resolution}
                    399: \label{sec:rewrite-type-dri}
                    400: 
                    401: When a DRI resource record has a resolution type of ``rewrite'', then
                    402: the host part of the URL is replaced by the local host name from the
                    403: resource record and the remaining part is replaced according to
                    404: generic substitution rules with wildcard patterns.
                    405: 
                    406: 
                    407: 
                    408: \subsection{Handling of digital resource identifiers as a web service}
                    409: \label{sec:handl-dri-web}
                    410: 
                    411: The basic function of resolution of a DRI as well as other maintenance
                    412: functions like the registration of new DRIs or the download of parts
                    413: or all registered DRI mappings should also be accessible with a web
                    414: service interface.
                    415: 
                    416: Specifications for the web service interface have to be established.
                    417: 
                    418: 
                    419: \section{Resource metadata}
                    420: \label{sec:resource-metadata}
                    421: 
                    422: The set of metadata about a resource that is stored on the resource
                    423: server is called a \emph{resource record}. Since the requirements of
                    424: access, structure and amount of metadata for different projects can
                    425: hardly be generalized, the resource server stores only a minimal set of
                    426: fields that is sufficient for the basic functions of access to the
                    427: resource, sustainability of access, and interoperability. More
                    428: extensive and project specific metadata sets should be stored and
                    429: maintained on external servers. The optional resource information
                    430: field can be used to point to external metadata representations.
                    431: 
                    432: 
                    433: \subsection{Basic metadata}
                    434: \label{sec:basic-metadata}
                    435: 
                    436: The amount of metadata depends on the type of resource record.
                    437: Common to all records are the \texttt{record\_type} field and the \texttt{dri} field for the resource
                    438: identifier.  Redirect-type records require an additional
                    439: \texttt{local\_host} field for the host name of the local host.
                    440: Replace-type records require a \texttt{local\_url} field for a full
                    441: URL. Digilib-type records require at least the three fields
                    442: \texttt{local\_host}, \texttt{digilib\_path}, and
                    443: \texttt{digilib\_file} and accept an optional field
                    444: \texttt{digilib\_pageno}. The basic fields can be found in
                    445: table~\ref{tab:basic-meta}.
                    446: 
                    447: \begin{table}[htbp]
                    448:   \centering
                    449:   \begin{tabular}{lr|l}
                    450:     type & field & description \\ \hline
                    451:     \textbf{redirect} & & \\
                    452:     & \texttt{record\_type} & type of record (``redirect'') \\
                    453:     & \texttt{dri} & DRI \\
                    454:     & \texttt{local\_host} & local host name \\ \hline
                    455:     \textbf{replace} & & \\
                    456:     & \texttt{record\_type} & type of record (``replace'') \\
                    457:     & \texttt{dri} & DRI \\
                    458:     & \texttt{local\_url} & full local URL \\ \hline
                    459:     \textbf{digilib} & & \\
                    460:     & \texttt{record\_type} & type of record (``digilib'') \\
                    461:     & \texttt{dri} & DRI \\
                    462:     & \texttt{local\_host} & local digilib server \\
                    463:     & \texttt{digilib\_path} & URI path of the digilib installation \\
                    464:     & \texttt{digilib\_file} & digilib path name (parameter fn) \\
                    465:     & \texttt{digilib\_pageno} & optional page number
                    466:     (parameter pn)
                    467:   \end{tabular}
                    468:   \caption{Basic metadata fields}
                    469:   \label{tab:basic-meta}
                    470: \end{table}
                    471: 
                    472: The resource server may implement additional fields like owner and
                    473: group fields for internal management and user access functions.
                    474: 
                    475: 
                    476: \subsection{Alternate server and backup server}
                    477: \label{sec:redund-serv-back}
                    478: 
                    479: The resource server architecture is designed to fulfill high demands
                    480: on the performance and sustainability of access to the
                    481: resources. These demands can be met by a loosely coupled network of
                    482: local servers duplicating content for backup and the transparent
                    483: sharing of concurrent access to resources for enhanced
                    484: performance.
                    485: 
                    486: Backup server fields give the names and paths of servers that provide
                    487: copies of the resource. Requests for the resource are diverted to a
                    488: backup server when the original server becomes unavailable.
                    489: 
                    490: Alternate server fields give the names and paths of servers that provide
                    491: copies of the resource. Requests for a resource are spread among all
                    492: alternate servers for the same resource according to a load-balancing
                    493: pattern. The pattern can be a simple round-robin scheme or a more
                    494: sophisticated scheme based on server performance or the geographical
                    495: location of client and server.
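
One possible reading of the backup and alternate server behaviour is
sketched below in Python, using the simple round-robin scheme. Everything
in it is an assumption made for illustration: the class name, the idea that
the original server takes part in the rotation together with the alternate
servers, and the availability check passed in by the caller.

```python
from typing import Callable, Optional, Sequence


class ServerChooser:
    """Round-robin over original and alternate servers, then backup servers."""

    def __init__(self, original: str, alternates: Sequence[str] = (),
                 backups: Sequence[str] = ()) -> None:
        self.pool = [original, *alternates]
        self.backups = list(backups)
        self._next = 0  # index of the server that gets the next request

    def choose(self, is_available: Callable[[str], bool]) -> Optional[str]:
        size = len(self.pool)
        # Try the servers in rotation order, starting with the next in turn.
        for offset in range(size):
            host = self.pool[(self._next + offset) % size]
            if is_available(host):
                self._next = (self._next + offset + 1) % size
                return host
        # No server in the rotation is reachable: fall back to a backup server.
        for host in self.backups:
            if is_available(host):
                return host
        return None


chooser = ServerChooser("a.example.org", ["b.example.org"], ["c.example.org"])
print([chooser.choose(lambda host: True) for _ in range(3)])
```

A scheme based on server performance or on the geographical location of
client and server would replace only the rotation step; the fallback to
backup servers stays the same.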
                    496: 
                    497: A resource record can have any number of backup server and alternate
                    498: server fields. Whether a resource is required to have at least one backup
                    499: server is a policy decision of the hosting project that is not
                    500: enforced by the resource server.
                    501: 
                    502: 
                    503: 
                    504: \subsection{Additional resource information}
                    505: \label{sec:addt-reso-inform}
                    506: 
                    507: The resource server itself carries only minimal metadata on a resource
                    508: but it provides a basic mechanism to store and access more extensive
                    509: information on external servers.
                    510: 
                    511: Every resource record can have a resource info URL that is stored in
                    512: the \texttt{info-url} field.
                    513: 
                    514: \begin{table}[htbp]
                    515:   \centering
                    516:   \begin{tabular}{l|l}
                    517:     field & description \\ \hline
                    518:     \texttt{info-url} & URL to external information
                    519:   \end{tabular}
                    520:   \caption{External resource information}
                    521:   \label{tab:extern-reso-inform}
                    522: \end{table}
                    523: 
                    524: The external resource information can be accessed in a standardized
                    525: way on the resource server where the DRI of the resource is part of
                    526: the URI path: \url{http://driserver.echo.eu/resinfo/ECHO00001A2B3CX/}.
                    527: Requests to this URL will be redirected to the URL in the
                    528: \texttt{info-url} field in the resource record.
                    529: 
                    530: 
                    531: \end{document}
