\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{ae}
%\usepackage{times}
%\usepackage{courier}
% create in-text links black (with PDF)
\usepackage[colorlinks=true,linkcolor=black]{hyperref}
% Format URLs nicely (without PDF)
%\usepackage{url}

\title{A simple metadata format for resource bundles}
\author{Robert Casties, Dirk Wintergrün, Hans-Christoph Liess}
\date{V1.1.0 of 5.12.2003}

\begin{document}
\maketitle
\tableofcontents

\section{File and directory names}
\label{sec:file-directory-names}

File and directory names should not contain spaces. Allowed characters
in filenames are only the alphanumeric set a-z, A-Z, 0-9, hyphen
``-'', underscore ``\_'' and dot ``.''.

Files and directories with names that contain illegal characters must
be transformed to allowed names. A proposed simple transformation
rule is:
\begin{itemize}
\item whitespace characters (e.g. blank, tab, cr, lf) are replaced by
  hyphens ``-''
\item other illegal characters are replaced by underscores ``\_''.
\end{itemize}
This rule does not provide a reversible mapping to the original
illegal file name and it does not provide a collision-free mapping,
i.e. two different illegal file names might be mapped to the same
allowed file name. Additional precautions for these cases must be
taken.

\section{Metadata files}
\label{sec:metadata-files}

The metadata information is stored in the XML format documented below
in special files in the resource directory. Two forms of metadata
files are possible:
\begin{itemize}
\item a file named \texttt{index.meta} in a directory.
\item a file named like the data file it describes with an additional
  extension \texttt{.meta}. For example metadata for the file
  \texttt{0001.tif} would be in a file \texttt{0001.tif.meta}.
\end{itemize}
The resource directory must contain an \texttt{index.meta} file with
information about the resource as a whole. Other directories can
contain \texttt{index.meta} files.
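
As a rough illustration of the two forms of metadata files, the
following Python sketch lists the metadata files that may apply to a
data file, most specific first. The function name
\texttt{candidate\_meta\_files} and the example paths are hypothetical,
and walking up through every parent directory to the resource
directory is an assumption made for this sketch, not a rule stated in
this document.

\begin{small}
\begin{verbatim}
from pathlib import PurePosixPath


def candidate_meta_files(data_file, resource_root):
    """List metadata files that may describe data_file.

    Paths use "/" as separator. The order (most specific first) is
    an assumption of this sketch.
    """
    root = PurePosixPath(resource_root)
    path = PurePosixPath(data_file)
    # relative_to raises ValueError if data_file is outside the root.
    if not path.relative_to(root).parts:
        raise ValueError("data_file must be inside the resource")
    # Second form: the data file name plus the extension ".meta".
    candidates = [path.with_name(path.name + ".meta")]
    # First form: index.meta in the directory of the file and
    # (assumption) in each parent up to the resource directory.
    directory = path.parent
    while True:
        candidates.append(directory / "index.meta")
        if directory == root:
            break
        directory = directory.parent
    return candidates


# Hypothetical resource layout, for illustration only.
for meta in candidate_meta_files("example/img/0001.tif", "example"):
    print(meta)
\end{verbatim}
\end{small}

A tool reading the metadata could try these files in order and merge
what it finds; how conflicting entries are resolved is left open here.
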
Additional information about single data files that are part of the
resource can either be put in \texttt{file} tags in the
\texttt{index.meta} file or in separate
\emph{filename}\texttt{.meta} files for each data file. Information
from the directory level file is inherited at the file level.

\section{Resource format}
\label{sec:mpiwg-doc}

In this description elements marked ``optional'' need not be supplied
by the provider of the resource and may be absent in all versions of
the metadata file. Elements marked ``required'' must be supplied by
the provider of the resource. Elements marked ``deduced'' can be
supplied by the provider of the resource but can also be provided by
automatic scripts later in the process; in either case these elements
must be present in the final file.

File and directory paths in the metadata file use the conventional
Unix file separator slash ``/''.

The outer container element is \texttt{resource}. It has the following
\textbf{attributes}:
\begin{description}
\item[type] sub-type of resource (e.g. ``ECHO'', ``MPIWG'') -- optional.
\item[version] version number of metadata format (currently 1.1) --
  required.
\end{description}

\noindent The allowed \textbf{elements} inside \texttt{resource} are:
\begin{description}
\item[description] An informal textual description of the resource --
  optional.
\item[name] The filename of the resource (name of the directory this
  file is contained in) -- required.
\item[creator] The name of the project or person that created the
  resource -- optional.
\item[archive-creation-date] The time and date the archive collection
  was created -- deduced.
\item[archive-storage-date] The time and date the archive was written
  to permanent storage -- deduced (must not be set by the user).
\item[archive-path] The full path to the resource directory inside the
  whole archive collection, including the resource directory --
  deduced.
\item[archive-id] The ID for this document in the archive -- required.
\item[derived-from] Container for the description of the original
  resource if this resource is a modified version of another resource
  -- optional.
  \begin{description}
  \item[archive-id] The ID of the original resource -- required.
  \item[archive-path] The full path to the original resource --
    deduced.
  \item[description] An informal textual description of the relation
    of this resource to the original resource -- optional.
  \end{description}
\item[linked-with] Container for the description of another resource
  when this resource is a linked copy of another resource -- optional.
  \begin{description}
  \item[archive-id] The ID of the linked resource -- required.
  \item[archive-path] The full path to the linked resource -- deduced.
  \item[description] An informal textual description of the relation
    of this resource to the linked resource -- optional.
  \end{description}
\item[media-type] \label{tag-media-type} The main media type of this
  resource -- required.\\
  The main media type can be overridden by \texttt{media-type}s in
  subdirectories. Possible types are
  \begin{itemize}
  \item \texttt{image}
  \item \texttt{text}
  \item \texttt{audio}
  \item \texttt{video}
  \item \texttt{data} for other types of data
  \end{itemize}
\item[meta] Additional metadata information about the resource --
  optional.\\
  For a description of additional metadata see below.
\item[dir] Container for the description of a subdirectory -- required
  (when there are subdirectories).\\
  \texttt{dir} tags should not be nested. Directories at lower levels
  are identified by their \texttt{path}.
  \begin{description}
  \item[description] An informal textual description of the
    subdirectory -- optional.
  \item[name] The name of the subdirectory -- required.
  \item[original-name] A text string associated with the directory as
    original name -- optional. (E.g.
if the data in this directory came from an external source and had a name that had to be changed according to section~\ref{sec:file-directory-names} but it should be possible to reference the original name.) \item[path] The directory path of this subdirectory relative to the resource's root directory (excluding the directory itself) -- required (may be empty or omitted if the directory is a direct child of the resource's root directory). \item[meta] Additional metadata information about the directory -- optional.\\ For a description of additional metadata see below. \end{description} \item[file] Container for the description of a file -- deduced.\\ \texttt{file} tags should not be nested in \texttt{dir} tags. Files at lower directory levels are identified by their \texttt{path}. \begin{description} \item[description] An informal textual description of the file -- optional. \item[name] The name of the file -- required. \item[original-name] A text string associated with the file as original name -- optional. (E.g. if this file came from an external source and had a name that had to be changed according to section~\ref{sec:file-directory-names} but it should be possible to reference the original name.) \item[path] The directory path of this file relative to the resource's root directory (excluding the file itself) -- required (may be empty or omitted if the file is in the resource's root directory). \item[date] The file's modification or creation date\footnote{The preferred time and date format is ``YYYY/MM/DD HH:MM:SS''}, whichever is more recent -- optional. \item[modification-date] The file's modification date -- optional. \item[creation-date] The file's creation date -- optional. \item[size] The file size -- deduced. \item[mime-type] The file's mime-type -- optional. \item[md5cs] MD5 checksum of the file content -- optional. \item[meta] Additional metadata information about the file -- optional. For a description of additional metadata see below. 
  \end{description}
\end{description}

\section{Additional metadata}
\label{sec:additional-metadata}

All elements with \texttt{meta} tags can contain an arbitrary number
of the following additional metadata elements.

\subsection{Workflow state}
\label{sec:workflow-state}

All additional metadata elements can have a \texttt{workflow-state}
\textbf{attribute}. This attribute reflects the state of the
corresponding metadata element. The possible values for the
\texttt{workflow-state} attribute are
\begin{itemize}
\item \texttt{preliminary}: this information is preliminary and must
  be checked in further workflow steps.
\item \texttt{inwork}
\item \texttt{final}
\end{itemize}
Workflow states other than \texttt{preliminary} are part of the
workflow handling of the respective projects.

Metadata elements can appear multiple times with different
\texttt{workflow-state} attributes. This enables metadata versioning.

\subsection{Content type}
\label{sec:content-type}

\begin{description}
\item[content-type] \label{tag-content-type} The content type of this
  resource -- required.\\
  The content type enables the choice of tools to manipulate and
  display the resource. There should be a common list of content
  types. For digital documents (books, manuscripts) this would be
  ``scanned document'', for other image data ``scanned
  images''.\footnote{The criterion for documents is an ordered
    succession of image files (pages) and equal image size and
    resolution throughout the images of a resource.}
\end{description}

\subsection{Language}
\label{sec:lang}

The language of a resource (e.g. a text) can be specified with a
\texttt{lang} tag. Languages have to be described using the
international codes for the representation of names of languages
either in two-letter form (ISO 639-1) or in three-letter form (ISO
639-2).
The entire catalogue of languages is documented on the page
\url{http://www.loc.gov/standards/iso639-2/englangn.html}.

\subsection{DRI}
\label{sec:dri}

The \emph{digital resource identifier} for the resource is specified
in a \texttt{dri} element. Digital resource identifiers are documented
on the page
\url{http://pythia.mpiwg-berlin.mpg.de/projects/standards/dri}.

\subsection{Collection context}
\label{sec:collection-context}

The context of a resource as part of a collection or part of a project
can be specified in the \texttt{context} element. All elements in the
container can appear multiple times.
\begin{description}
\item[context] information on collection or project context.
  \begin{description}
  \item[link] URL to additional context information.
  \item[name] Textual description of project or collection.
  \end{description}
\end{description}

\subsection{Bibliographic information}
\label{sec:bibliographic-data}

Bibliographic information is presented in a \texttt{bib} container
with a \texttt{type} parameter, giving the type of bibliographic
resource. The \texttt{type} field can be repeated as a tag in the
container.

The format is based on the ECHO scheme for bibliographic data
(cf. content workflow), the MPIWG ``Projektbibliografie'' and the
format of the commonly used program ``EndNote''.

\subsubsection{Book}

\begin{description}
\item [bib type="book"] a published book.
  \begin{description}
  \item [author] The author of the book.
  \item [year] The year of publication.
  \item [title] Title of the book.
  \item [series-editor] Name of the series editor, if the book appears
    in a series.
  \item [series-title] Title of the series, if the book appears in a
    series.
  \item [series-volume] Volume number, if the book appears in a
    series.
  \item [number-of-pages] Number of pages of the entire book.
  \item [city] City where the book was published.
  \item [publisher] Name of the publishing company.
  \item [edition] Edition of the book (e.g.
    third edition)
  \item [number-of-volumes] Number of volumes, if the book is
    published in multiple volumes.
  \item [translator] Name of the translator.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{In Book}

\begin{description}
\item [bib type="inbook"] an article as part of a book.
  \begin{description}
  \item [author] The author of the book.
  \item [year] The year of publication.
  \item [title] Title of the article.
  \item [editor] Name of the book's editor.
  \item [book-title] Title of the book.
  \item [series-volume] Volume number, if the book appears in a
    series.
  \item [pages] Number of pages of the article.
  \item [city] City where the book was published.
  \item [publisher] Name of the publishing company.
  \item [edition] Edition of the book (e.g. third edition)
  \item [series-author] Name of the series editor, if the book appears
    in a series.
  \item [series-title] Title of the series, if the book appears in a
    series.
  \item [number-of-volumes] Number of volumes, if the book is
    published in multiple volumes.
  \item [translator] Name of the translator.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{Proceedings}

\begin{description}
\item [bib type="proceedings"] a conference proceedings publication.
  \begin{description}
  \item [author] The author of the article.
  \item [year] The year of publication.
  \item [title] Title of the article.
  \item [editor] Name of the book's editor.
  \item [conference-name] Name of the conference the proceedings are
    related to.
  \item [volume] Volume number.
  \item [pages] Number of pages of the article.
  \item [date] Date of the conference the proceedings are related to.
  \item [conference-location] City where the conference was held.
  \item [publisher] Name of the publishing company.
  \item [edition] Edition of the book (e.g. third edition)
  \item [series-editor] Name of the series editor, if the book appears
    in a series.
  \item [series-title] Title of the series, if the book appears in a
    series.
  \item [number-of-volumes] Number of volumes, if the book is
    published as multiple volumes.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{Edited Book}

\begin{description}
\item[bib type="edited-book"] a book that is the edition of another
  work.
  \begin{description}
  \item [editor] Name of the editor of the book.
  \item [year] The year of publication.
  \item [title] Title of the book.
  \item [series-editor] Name of the editor of the series the book is
    part of.
  \item [series-title] Title of the series, if the book is part of a
    series.
  \item [series-volume] Volume number, if the book appears in a
    series.
  \item [number-of-pages] Number of pages of the article.
  \item [city] City where the book was published.
  \item [publisher] Name of the publishing company.
  \item [edition] Information about the edition (e.g. ``Repr. of the
    London ed. 1652'')
  \item [number-of-volumes] Number of volumes, if the book is
    published as multiple volumes.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{Journal Article}

\begin{description}
\item [bib type="journal-article"] an article in a scientific journal.
  \begin{description}
  \item [author] The author of the article.
  \item [year] The year of publication.
  \item [title] Title of the article.
  \item [journal] Name of the journal.
  \item [volume] Volume number, if the journal appears in a series.
  \item [issue] Number of the issue the article is part of.
  \item [pages] Number of pages of the article.
  \item [alternate-journal] Alternate journal.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{Magazine Article}

\begin{description}
\item [bib type="magazine-article"] an article in a popular magazine.
  \begin{description}
  \item [author] The author of the book.
  \item [year] The year of publication.
  \item [title] Title of the article.
  \item [magazine] Name of the magazine.
  \item [volume] Volume number, if the book appears in a series.
  \item [issue-number] Number of the issue the article is part of.
  \item [pages] Number of pages of the article.
  \item [date] Date when the article appeared.
  \end{description}
\end{description}

\subsubsection{Newspaper Article}

\begin{description}
\item [bib type="newspaper-article"] an article in a newspaper.
  \begin{description}
  \item [author] The author of the article.
  \item [year] The year of publication.
  \item [title] Title of the article.
  \item [Newspaper] Name of the newspaper the article appeared in.
  \item [pages] Number of pages of the article.
  \item [issue-date] Date of the issue the article is part of.
  \item [city] City of the newspaper.
  \end{description}
\end{description}

\subsubsection{Thesis}

\begin{description}
\item [bib type="thesis"] a master/doctorate/etc. thesis.
  \begin{description}
  \item [author] The author of the thesis.
  \item [year] The year of publication.
  \item [title] Title of the thesis.
  \item [academic-department] Name of the academic department where
    the thesis was handed in.
  \item [number-of-pages] Number of pages of the thesis.
  \item [city] City where the thesis was published.
  \item [University] Name of the university where the thesis was
    handed in.
  \item [isbn-issn]
  \end{description}
\end{description}

\subsubsection{Report}

\begin{description}
\item [bib type="report"] a scientific report.
  \begin{description}
  \item [author] The author of the report.
  \item [year] The year of publication.
  \item [title] Title of the report.
  \item [pages] Number of pages of the report.
  \item [date] Date when the report appeared.
  \item [city] City where the book was published.
  \item [institution] Institution where the report was produced.
  \item [type] Type of report.
  \item [report-number] Report number.
  \end{description}
\end{description}

\subsubsection{Manuscript}

\begin{description}
\item [bib type="manuscript"] a handwritten/typewritten manuscript.
  \begin{description}
  \item [title] Title of the manuscript.
  \item [author] The author of the text.
  \item [location] Name of the library where the manuscript is
    currently located.
  \item [year] The year or century of publication.
  \item [pages] Number of pages of the manuscript.
  \item [signature] Signature of the manuscript.
  \item [editorial-remarks] Remarks related to the online publication
    of the manuscript. This could be notes about annotations etc.
  \item [description] This can be any kind of description.
  \item [keywords] Keywords related to the manuscript.
  \end{description}
\end{description}

\subsubsection{Generic}

\begin{description}
\item [bib type="generic"] a generic bibliographic type. This type
  should only be used in rare cases.
  \begin{description}
  \item [author]
  \item [year]
  \item [title]
  \item [secondary-author]
  \item [secondary-title]
  \item [volume]
  \item [number]
  \item [pages]
  \item [date]
  \item [place-published]
  \item [publisher]
  \item [edition]
  \item [tertiary-author]
  \item [tertiary-title]
  \item [number-of-volumes]
  \item [type-of-work]
  \item [subsidiary-author]
  \item [alternate-title]
  \item [isbn-issn]
  \item [call-number]
  \item [label]
  \item [keywords]
  \item [abstract]
  \item [notes]
  \item [url]
  \end{description}
\end{description}

\subsection{Architectural drawings}
\label{sec:doc}

Specific information for architectural drawings is presented in a
\texttt{doc} container with an additional \texttt{type} attribute
giving the type of drawing. All elements inside the container can
appear multiple times.
\begin{description}
\item[doc type="Architectural Drawing"] architectural drawing.
  \begin{description}
  \item [person] last name and first name of a person, separated by a
    comma. A further common name for the person can be put in front,
    separated by a semicolon.
  \item [location] Name of a place in its common notation. This can be
    a city or an institution.
  \item [date] This can be a year (or several years, separated by
    commas) or a period (1706-1714). Years are noted with four digits.
  \item [object] Short description of an object or signatures.
  \item [keywords] Keywords related to the object.
  \end{description}
\end{description}

\subsection{Document structure (table of contents)}
\label{sec:toc}

Information on the structure of a document like the division into
parts and chapters in the way of a table of contents is presented in a
\texttt{toc} container.

The scheme allows multiple logical pages on a single page image, as is
often the case with scanned books or manuscripts. The scheme also
allows for ``loose'' numbering schemes with roman, arabic or other
page numbers consecutively or mixed and changes in the numbering
within the document.

The flexibility comes from the fact that no additional assumptions
about the mapping between logical pages and page images are made in
the format. All mapping information is specified by the user. The
logical page numbering or naming that can be presented to the user is
specified in the \texttt{name} tags while the physical numbering of
the page images is specified in the \texttt{index} or \texttt{url}
tags.
\begin{description}
\item[toc] container for document structure
  \begin{description}
  \item[page] describes a single logical page
    \begin{description}
    \item[name] the ``name'' of the logical page. This can be any
      string like a page number (arabic, roman, etc.) or a special
      designation like ``Table 5''.
    \item[index] the \texttt{digilib} index number\footnote{The index
        number for digilib is the index in the alphabetical order of
        the scan file names.} of the scan image of the page.
    \item[url] alternatively to the \texttt{digilib} index number the
      full URL of the scan image of the page can be used.
    \end{description}
  \item[chapter] describes a section or chapter of the text.
    \texttt{chapter} elements can be nested.
    \begin{description}
    \item[name] the title of the chapter or section.
    \item[start] the beginning of a page range (usually the first page
      of the chapter).
      The \texttt{start} element has an optional \texttt{increment}
      attribute to indicate the number of logical pages on a scan
      image.\footnote{This information is only needed by additional
        tools that try to generate lists of all page and image
        numbers.}
      \begin{description}
      \item[name] the ``name'' of the first page (see \texttt{page}).
      \item[index] the index of the first page (see \texttt{page}).
      \item[url] the URL of the first page (see \texttt{page}).
      \end{description}
    \item[end] the end of a page range (usually the last page of the
      chapter).
      \begin{description}
      \item[name] the ``name'' of the last page (see \texttt{page}).
      \item[index] the index of the last page (see \texttt{page}).
      \item[url] the URL of the last page (see \texttt{page}).
      \end{description}
    \item[page] as an alternative (or in addition) to
      \texttt{start}/\texttt{end} page ranges, single \texttt{page}
      elements can be used inside \texttt{chapter}.
    \end{description}
  \end{description}
\end{description}

%%\url{http://pythia.mpiwg-berlin.mpg.de/toolserver/TS_lise}

\subsection{Digital images}
\label{sec:inform-scann-imag}

Image files representing scanned images can have an \texttt{img}
container tag with information about the scan resolution and the size
of the original image. This information is used by the
\texttt{digilib} image viewing tool.

One of the following three sets of tags is required:
\begin{description}
\item[img] digital image information.
  \begin{description}
  \item[original-size-x] The width of the original image --
    required. \\
    The unit of measure can be given as parameter \texttt{unit}; the
    default is meter ``m''. The width to be considered is the total
    width of the scanned area.
  \item[original-size-y] The height of the original image -- required.
  \item[original-pixel-x] The width of the hi-res scan in pixels --
    deduced.
  \item[original-pixel-y] The height of the hi-res scan in pixels --
    deduced.
  \end{description}
\end{description}
or
\begin{description}
\item[img] digital image information.
  \begin{description}
  \item[original-dpi-x] The resolution of the hi-res scan in its width
    in pixels per inch -- required.
  \item[original-dpi-y] The resolution of the hi-res scan in its
    height in pixels per inch -- required.
  \item[original-pixel-x] The width of the hi-res scan in pixels --
    deduced.
  \item[original-pixel-y] The height of the hi-res scan in pixels --
    deduced.
  \end{description}
\end{description}
or
\begin{description}
\item[img] digital image information.
  \begin{description}
  \item[original-dpi] The resolution of the hi-res scan in pixels per
    inch if the resolutions in width and height are the same --
    required.
  \item[original-pixel-x] The width of the hi-res scan in pixels --
    deduced.
  \item[original-pixel-y] The height of the hi-res scan in pixels --
    deduced.
  \end{description}
\end{description}

\subsection{Digital image acquisition}
\label{sec:inform-about-image}

A description of the technology used in the process of producing a
digital image.
\begin{description}
\item[image-acquisition] description of the image production process
  \begin{description}
  \item[device] acquisition device (e.g. ``flatbed scanner'')
  \item[image-type] type and color-depth of the image -- required
    (e.g. ``RGB 24 bit'')
  \item[production-comment] additional textual information about the
    production process
  \end{description}
\end{description}

\subsection{Full text with images}
\label{sec:full-text-with}

Full text in an XML format should be specified with a
\texttt{content-type}\footnote{see section~\ref{tag-content-type} on
  page~\pageref{tag-content-type}} ``fulltext''. The relation between
the full text and optional images of whole pages or parts of pages
must be specified in a \texttt{text-tool} container.
\begin{description}
\item[text-tool] representation of full text with images
  \begin{description}
  \item[text-file] the file name of the full text file (with path
    inside document directory)
  \item[page-images] the directory name of the directory containing
    the page image files (with path inside document directory)
  \item[xslt-file] the file name of an additional XSL transformation
    file
  \item[text-config] container for configuration options
    \begin{description}
    \item[container-tag] the name of the text root element (default
      ``text'')
    \item[ref-element-tag] the name of the element that is used as
      unit of reference when results are presented
    \item[pagebreak-tag] the name of the element that indicates page
      breaks (default ``pb'')
    \end{description}
  \end{description}
\end{description}

\subsection{Copyright and access conditions}
\label{sec:access-conditions}

If the access to a resource is bound to conditions for technical or
legal reasons then the conditions can be put in an
\texttt{access-conditions} container. Other access rights conditions
like copyright can also be documented in this container.
\begin{description}
\item[access-conditions] legal and technical conditions for access to
  this resource
  \begin{description}
  \item[attribution] The name or institution this resource should be
    attributed to when it is publicly presented
    \begin{description}
    \item[name] a name (free text)
    \item[url] a URL (with an optional \texttt{label} attribute to
      show as text)
    \end{description}
  \item[copyright] the copyright owner and its conditions
    \begin{description}
    \item[owner] the name of the copyright owner
      \begin{description}
      \item[name] a name (free text)
      \item[url] a URL (with an optional \texttt{label} attribute to
        show as text)
      \end{description}
    \item[date] the date when the copyright was issued
    \item[duration] the duration of the copyright (if known)
    \item[description] free-text field for special or additional
      conditions
    \end{description}
  \item[access] conditions of access to this resource
    \begin{description}
    \item[internal] access should be restricted to a group of
      users. The type of group is defined by one of the following
      \begin{description}
      \item[institution] the members of this institution. The method
        to identify a user to belong to the institution is not
        specified in this document.
      \item[subnet] all computers with an IP-address in this
        subnet. The subnet is defined in ``truncated-quad''
        (e.g. ``141.14'') or ``address/netmask''
        (e.g. ``141.14.0.0/255.255.0.0'') notation.
      \item[group] the members of this named group. The method to
        identify a user to belong to a named group is not specified in
        this document.
      \end{description}
    \item[scientific] access to this resource should be restricted to
      scientific work
    \item[free] access to this resource is not restricted
    \item[special] if none of the above conditions seems appropriate,
      a free-form text can be specified here.
    \end{description}
  \end{description}
\end{description}

\noindent It should be noted that control over the access to the
resource has to be provided by additional technical measures.
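
As a sketch of one such measure, the following Python functions check
whether a client address lies in a subnet given in either of the two
notations allowed for the \texttt{subnet} element, using the standard
library module \texttt{ipaddress}. The function names
\texttt{parse\_subnet} and \texttt{in\_subnet} are hypothetical, and
reading a truncated quad as a prefix of whole octets is an assumption
of this sketch.

\begin{small}
\begin{verbatim}
import ipaddress


def parse_subnet(spec):
    """Parse a subnet in truncated-quad or address/netmask form."""
    if "/" in spec:
        # address/netmask, e.g. "141.14.0.0/255.255.0.0"
        return ipaddress.ip_network(spec)
    # truncated-quad, e.g. "141.14": the octets that are given are
    # taken as the network prefix (assumption of this sketch).
    octets = spec.split(".")
    if not 1 <= len(octets) <= 4:
        raise ValueError("not a truncated quad: %r" % spec)
    padded = octets + ["0"] * (4 - len(octets))
    prefix_length = 8 * len(octets)
    return ipaddress.ip_network(
        "%s/%d" % (".".join(padded), prefix_length))


def in_subnet(address, spec):
    """Return True if address lies inside the subnet spec."""
    return ipaddress.ip_address(address) in parse_subnet(spec)


# Hypothetical client address, for illustration only.
print(in_subnet("141.14.236.86", "141.14"))
\end{verbatim}
\end{small}
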
Access conditions in the metadata file only state that conditions \emph{should} be observed, not that they \emph{are} necessarily observed, as the enforcement of conditions depends on additional technical measures. \subsection{Acquisition of raw-data} \label{sec:acqu-inform} Information about the acquisition source for raw data resources can be provided in an \texttt{acquisition} container. \begin{description} \item[acquisition] the acquisition source of this resource -- required for raw data. \begin{description} \item[provider] where this resource came from -- required \begin{description} \item[name] free-text name of the provider (institution or individual) \item[address] address of the provider \item[contact] contact person at the provider (i.e. name and email) \item[url] URL related to the provider \end{description} \item[date] date of acquisition -- required \item[description] free-text description of the acquisition source or additional information \end{description} \end{description} \subsection{Documentary Films} \label{sec:documentary-films} Documentary films can be described using a \texttt{film-acquisition} container. \begin{description} \item[film-acquisition] description of a (documentary) film -- required for documentary film \begin{description} \item[recording] specification of the recording process \begin{description} \item[author] the person or persons doing the recording \item[date] the date or time span when the film was recorded \item[location] the place where the film was recorded \item[device] recording device used (e.g. ``Sony CP-DV8 Camcorder'') \item[format] format of the recorded film -- required (e.g. ``DV 720x524 25fps interlaced'') \end{description} \item[description] free-form description of the recording and the content of the film \end{description} \end{description} (More information about the digitization step could be added in a \texttt{digitization} tag similar to the \texttt{recording} tag.) 
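
To illustrate how the containers described in the preceding sections
nest, the following Python sketch builds a minimal metadata tree for a
documentary film with the standard library module
\texttt{xml.etree.ElementTree}. The function name
\texttt{build\_film\_resource} and all values passed to it are
hypothetical; the content type string in particular is an assumption,
since no common list of content types is given here. Deduced elements
are left out.

\begin{small}
\begin{verbatim}
import xml.etree.ElementTree as ET


def build_film_resource(name, archive_id, content_type, film_format):
    """Build a minimal resource element for a documentary film.

    Illustrative sketch only: deduced elements such as
    archive-path are omitted.
    """
    resource = ET.Element("resource", {"version": "1.1"})
    ET.SubElement(resource, "name").text = name
    ET.SubElement(resource, "archive-id").text = archive_id
    ET.SubElement(resource, "media-type").text = "video"
    # Additional metadata elements go inside a meta container.
    meta = ET.SubElement(resource, "meta")
    ET.SubElement(meta, "content-type").text = content_type
    film = ET.SubElement(meta, "film-acquisition")
    recording = ET.SubElement(film, "recording")
    ET.SubElement(recording, "format").text = film_format
    return resource


# Hypothetical values, for illustration only.
root = build_film_resource(
    "example-film", "example-id-0001", "documentary film",
    "DV 720x524 25fps interlaced")
print(ET.tostring(root, encoding="unicode"))
\end{verbatim}
\end{small}
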
\section{Sample metadata files for ECHO resources}

The following is a sample metadata index file for a directory
containing a scanned document.

\begin{small}
\begin{verbatim}
 Fleck, 1980 fleck.1980 University of Bern ubern/wiss-theorie scanned images echo23a45e2329x ger Fleck, Ludwik 1980 Entstehung und Entwicklung einer wissenschaftlichen Tatsache Frankfurt am Main Suhrkamp Wissenschaftstheorie, Fleck, Tatsache Scanned images (300dpi) img
\end{verbatim}
\end{small}

The following is a sample metadata file for a single image of an
architectural drawing.

\begin{small}
\begin{verbatim}
 Bibliotheca Hertziana scanned images 00000271-asl-160-r-full.tif 315 echo45a67bc4367d ita Ciolli, Giacomo Urban VIII; Barberini, Maffeo Accademia di San Luca Roma 1706 Concorso Clementino Fontana Pubblica Brunnen ASL 160 http://colosseum.biblhertz.it:8080/Lineamenta/ 1033478408.39/1035196181.35/1035196204.09/1035394121.83
\end{verbatim}
\end{small}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: