Dienst                                            James R. Davis, Xerox
INTERNET-DRAFT                                    Carl Lagoze, Cornell
                                                  July 1994
                                                  Expires December 1994



   Dienst, A Protocol for a Distributed Digital Document Library

STATUS OF THIS DOCUMENT

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are working documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ds.internic.net (US East Coast), nic.nordu.net (EUROPE), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

This document is a DRAFT specification of a protocol in use on the Internet. Distribution of this memo is unlimited.

Last Revised: 3:00 PM 8 August 1994

This document is also available in ASCII.

Abstract

This document describes Dienst, a protocol for communication with distributed digital library servers. This protocol provides an object-oriented interface to a document model, which allows a user to access complete documents or named sub-parts. It also supports multiple formats for documents. Dienst protocol messages are embedded within HTTP, the protocol used over the World Wide Web. Thus, anyone using a Web browser (e.g. Mosaic, Cello) has access to the services provided by Dienst.

Dienst: A protocol for a Distributed Digital Document Library

1: Introduction

Dienst is a protocol for the search and retrieval of documents from a digital library. Dienst models the digital library as a set of documents, each with a unique identifier, or DocID for short. This DocID is an immutable, persistent object that identifies each document in a location-independent manner. The digital library may be distributed among geographically dispersed servers, since Dienst supports full interoperability among sites. The combination of full interoperability and location-independent document identifiers allows clients that use the Dienst protocol to ignore the details of the server distribution and, instead, address a single virtual document collection.

The document model implicit in Dienst is that each document can be in many formats (e.g., TIFF, GIF, PostScript) and consists of a set of named parts. There are two orthogonal part domains: 1) a physical domain where the parts are numbered pages and 2) a logical domain where parts are objects like chapters, tables, lists of references, and so on. The document model is extensible, in that we may define additional logical parts in the future.

Dienst supports an object-oriented interface to this library and document model - clients encode messages within Dienst requests that address the entire document collection, a particular document in the collection, or a particular part of a document in the collection. Each Dienst protocol request contains the name of the message, the particular document (DocID) to which it applies (if any), and the arguments (e.g. page number, format) for the method (if any).

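Dienst messages address four types of digital library services (miscellaneous, index, repository, and user interface, corresponding to the misc, index, rep, and ui request classes in the BNF grammar), not all of which are necessarily supported by a particular digital library server.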

This taxonomy allows maximal flexibility in the way that particular server implementations interoperate. For example, one server may exist solely as a user interface gateway, providing transparent access for users to a particular domain of indexes and repositories. We see this flexible interoperability as key to the development of a digital library infrastructure where the "collection" will span multiple sites and continents.

The Dienst protocol is built on the framework of the Hypertext Transfer Protocol (HTTP) [HTTP] used on the World-Wide Web [WWW]. The advantages of piggybacking Dienst on HTTP are two-fold. First, Web browsers, such as Mosaic, are reasonably ubiquitous and, at this point, free, making digital library services available to a broad constituency. Second, there is substantial momentum for further development of the functionality of the Web and its constituent technologies. Many components of this developing technology are of direct interest to the digital library community, especially those dealing with authorization and authentication. Advancements in these areas will directly benefit Dienst clients and servers.

A Dienst request is encoded within a Uniform Resource Locator (URL) [URL]. Specifically, the Dienst message is placed in the "path" portion of the URL, which is opaque to the HTTP client and, as defined in the URL specification, "may define details of how the client should communicate with the server, including information to be passed transparently to the server without any processing by the client."
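
As a rough illustration of this encoding, the Python sketch below places a Dienst message in a URL and recovers it again. It is not part of the protocol: the host name dl.example.edu and both function names are hypothetical, and the sketch assumes the Dienst message starts immediately after the host. Because a modern URL parser separates the query from the path, the sketch rejoins the two to recover the full message.

```python
from urllib.parse import urlsplit


def dienst_url(host: str, message: str, scheme: str = "http") -> str:
    """Embed a Dienst message (e.g. 'dienst/1.0/misc/version') in a URL."""
    return f"{scheme}://{host}/{message}"


def dienst_message(url: str) -> str:
    """Recover the Dienst message from a URL built by dienst_url()."""
    parts = urlsplit(url)
    message = parts.path.lstrip("/")
    # urlsplit() reports the text after "?" separately, so rejoin it.
    if parts.query:
        message += "?" + parts.query
    return message


url = dienst_url(
    "dl.example.edu",  # hypothetical server
    "dienst/1.0/rep/CORNELLCS:TR91-1254/page?page=1&type=image/tiff",
)
print(url)
print(dienst_message(url))
```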

Each Dienst request that addresses a specific document includes a unique identifier, or DocID, for the document. This document identifier, as specified in the Dienst BNF grammar that follows, consists of the following components:

Some examples of DocIDs are:
ISBN_NA:0-395-32943-4

CORNELLCS:TR94-1418

IANA_NA:foobar:94-5-2
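
A client could split such identifiers along the lines of the sketch below, which follows the DocID production in the BNF grammar. The type and function names are hypothetical, and the sketch assumes that no individual field contains a colon. Note that the two-field form cannot be classified syntactically: its first field may be either a Naming_Authority or an RFC_1357_Publisher.

```python
from typing import NamedTuple, Optional


class DocID(NamedTuple):
    authority: str             # Naming_Authority, or RFC_1357_Publisher
    publisher: Optional[str]   # Publisher_ID; only in the three-field form
    local_id: str              # ID


def parse_docid(text: str) -> DocID:
    """Split a DocID into its colon-separated fields."""
    fields = text.split(":")
    if any(not field for field in fields):
        raise ValueError(f"empty field in DocID: {text!r}")
    if len(fields) == 3:
        return DocID(fields[0], fields[1], fields[2])
    if len(fields) == 2:
        return DocID(fields[0], None, fields[1])
    raise ValueError(f"expected two or three fields: {text!r}")


for example in ("ISBN_NA:0-395-32943-4", "CORNELLCS:TR94-1418",
                "IANA_NA:foobar:94-5-2"):
    print(parse_docid(example))
```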

The authors recognize that the DocID, as defined here, fits the requirements for Uniform Resource Names as defined in [URN]. When the syntax of a URN is standardized, it will be incorporated into the Dienst protocol either as a replacement or supplement to the existing DocIDs.

Responses to Dienst requests are formatted as HTTP responses. Thus, the standard components of an HTTP response such as status-code, content-type, and so on are returned. The actual Dienst response is encoded in the response data of the HTTP response. Refer to the section that describes server responses to Dienst methods and the one that describes error responses for complete information on status-codes, content-types and response data for each request.

An initial version of Dienst and a prototype implementation were developed as part of the Computer Science Technical Report (CSTR) project, an ARPA-sponsored, CNRI-directed effort to create an online digital library of technical reports from the nation's top computer science universities. A description of this initial version is in [DIENST].

2: Dienst BNF Grammar

As noted above, a Dienst request is encoded in the path portion of a URL. An informal BNF syntax of the protocol is as follows (see the Methods section for a description of each method in the protocol). The protocol is case-sensitive - this is consistent with the rest of the path portion of the URL. Terminals in the BNF grammar are distinguished by names that are all lower case (e.g., index); non-terminals are mixed-case (e.g., Request). The "special" characters ";" (semicolon), "/" (forward slash), "&" (ampersand), "=" (equals), and "?" (question mark) are literals in protocol requests. Finally, any non-terminals that are optional are enclosed within brackets (e.g., [PageNoArgument]). When an optional item is not included in a protocol request, the "special" character that precedes the optional item (if any) is omitted.
    Request 	    	    =   ProtocolVersion/RequestClass
    
    RequestClass    	    =   MiscRequest |
    	    	    	    	IndexRequest | 
    	    	    	    	RepositoryRequest |
    	    	    	    	UserInterfaceRequest

    MiscRequest	    	    =	misc/MiscReqMethod

    MiscReqMethod   	    =	MISCServicesMethod |
    	    	    	    	MISCTimeMethod |
    	    	    	    	MISCVersionMethod

    MISCServicesMethod	    =	services

    MISCTimeMethod  	    =	time

    MISCVersionMethod	    =	version

    IndexRequest    	    =   index/IndexReqMethod

    IndexReqMethod	    =	INDContentsMethod |
    	    	    	    	INDSearchMethod

    INDContentsMethod  	    =	contents

    INDSearchMethod 	    =	search/SearchType?SearchCriteria

    SearchCriteria  	    =	<see description of INDSearchMethod>

    RepositoryRequest	    =	rep/RepReqMethod

    RepReqMethod    	    =	REPDocFormatsMethod |
    	    	    	    	REPDocPageMethod |
    	    	    	    	REPDocPartMethod |
    	    	    	    	REPDocPrintMethod

    REPDocFormatsMethod	    =	DocID/formats

    REPDocPageMethod	    =	DocID/page?PageArguments

    REPDocPartMethod	    =	DocID/DocumentPart

    REPDocPrintMethod	    =	DocID/print?PrintArguments

    PrintArguments  	    =	PrintPagesArguments&PrintDestArguments
    	    	    	    	
    PrintPagesArguments     =	pages=all |
    	    	    	    	PagesSomeArguments

    PagesSomeArguments 	    =	pages=some&PageRangeArguments

    PageRangeArguments	    =	from=PageNumber&to=PageNumber

    PrintDestArguments	    =	destination=download |
    	    	    	    	destination=printer&printer=PrinterName

    PrinterName	    	    =	<see description of REPDocPrintMethod>

    UserInterfaceRequest    =   ui/UIReqMethod

    UIReqMethod	    	    =	UIDocOverviewMethod |
    	    	    	    	UIDocPageMethod |
    	    	    	    	UIDocPrintMethod |
    	    	    	    	UISearchIntMethod |
    	    	    	    	UIDocSummaryMethod

    UIDocOverviewMethod	    =	DocID/overview?[PageNoArgument]

    PageNoArgument   	    =	PageNumberArgument

    UIDocPageMethod 	    =	DocID/page?PageArguments

    UIDocPrintMethod	    =	DocID/print

    UISearchIntMethod	    =	search

    UIDocSummaryMethod	    =	DocID/summary

    ProtocolVersion 	    =   dienst/1.0

    DocID   	    	    = 	Naming_Authority:Publisher_ID:ID |
    	    	    	    	Naming_Authority:ID |
    	    	    	    	RFC_1357_Publisher:ID

    Naming_Authority	    =	<refer to DocID description above>

    Publisher_ID	    =	<refer to DocID description above>

    RFC_1357_Publisher 	    =	<refer to DocID description above>

    ID	    	    	    =	<refer to DocID description above>

    PageArguments   	    =	PageNumberArgument&[FormatArgument]

    PageNumberArgument	    =	page=PageNumber

    PageNumber	    	    =	<page number as a positive integer>

    FormatArgument  	    =	type=MimeTypeValue

    MimeTypeValue	    =	MIMEType;[MIMEParameters]

    MIMEType	    	    =	<see [MIME]>

    MIMEParameters  	    =	<see description of REPDocPageMethod>

    DocumentPart    	    =	body <see REPDocPartMethod below>

    SearchType	    	    =	rfc-1357
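
To show how the top of this grammar might be applied, the following Python sketch splits a request into its RequestClass, its remaining path segments, and its argument string. The function name is hypothetical and the sketch checks only the Request and RequestClass productions, not the individual methods. It separates the arguments at "?" before splitting on "/", since a MIME type argument such as type=image/tiff itself contains a slash.

```python
REQUEST_CLASSES = {"misc", "index", "rep", "ui"}


def split_request(request: str):
    """Return (request_class, path_segments, argument_string)."""
    # Split off the arguments first: they may contain "/" (MIME types).
    path, _, arguments = request.partition("?")
    segments = path.split("/")
    # ProtocolVersion is the two segments "dienst" and "1.0" (case-sensitive).
    if len(segments) < 4 or segments[:2] != ["dienst", "1.0"]:
        raise ValueError(f"not a Dienst 1.0 request: {request!r}")
    request_class = segments[2]
    if request_class not in REQUEST_CLASSES:
        raise ValueError(f"unknown request class: {request_class!r}")
    return request_class, segments[3:], arguments


print(split_request("dienst/1.0/misc/time"))
print(split_request(
    "dienst/1.0/rep/CORNELLCS:TR91-1254/page?page=1&type=image/tiff"))
```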

3: Discussion of Dienst Methods and Server Responses

This section gives more details on the Dienst methods listed in the protocol BNF grammar. The description of each method contains more information on arguments (if any), and specifies the normal (non-error) server response to the method. (Error responses are described in the next section.) As previously noted, all Dienst responses are encoded within HTTP responses. The aspects of an HTTP response that are relevant to Dienst are:

The description of each method is followed by an example of a REQUEST and RESPONSE. Long lines in the examples (greater than 72 characters) are broken up with the continuation lines distinguished by having a leading space.

MISCServicesMethod

The server returns a text/plain document that contains the services that it provides, one per line. The possible services are those listed in the BNF grammar (misc, index, rep, and ui).

REQUEST:

dienst/1.0/misc/services
RESPONSE:
misc
ui

MISCTimeMethod

The server returns a text/plain document that contains a single line which is the local time as defined in RFC 822 [CROCKER], Section 5.1 and modified in RFC 1123 [BRADEN], Section 5.2.14.

REQUEST:

dienst/1.0/misc/time
RESPONSE:
20 Jun 94 12:36:47 -0500
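
A client might convert such a response line into a time-zone-aware value as sketched below. This is a Python illustration rather than part of the protocol, and the function name is hypothetical. The standard-library parser used here reads two-digit years 69 through 99 as 1969 through 1999.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def parse_dienst_time(line: str):
    """Parse the single-line RFC 822 style response of MISCTimeMethod."""
    return parsedate_to_datetime(line.strip())


stamp = parse_dienst_time("20 Jun 94 12:36:47 -0500")
print(stamp.isoformat())
```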

MISCVersionMethod

The server returns a text/plain document that contains a single line which is the Dienst protocol version that it supports (e.g. 1.0).

REQUEST:

dienst/1.0/misc/version
RESPONSE:
1.0

INDContentsMethod

The server returns a text/x-dienst-response document consisting of records containing meta-information on all the documents that it indexes. The format of this meta-information follows the encoding proposed for Uniform Resource Characteristics (URC) [URC]. Each record will consist of a set of pairs in the format
[attribute_name]:[value]
The attribute_names that may be returned are listed below. NOTES: Attribute_names that are not listed in the URC draft (so-called "experimental attribute_names") are prefixed by "X-". Attribute_names that are required in a returned record are followed by an "*". Multiple authors should be listed in multiple author fields. A record must include at least one URL or URN. Each URL may be followed by a Content-Type and Content-Length. If a value spans several lines, the first character of subsequent lines must be a space. A blank line separates records.

REQUEST:

dienst/1.0/index/contents
RESPONSE:
X-publisher:CORNELLCS
X-DocID:93-1334
title:Approaches to Passage Retrieval in Full Text Information 
 Systems
author:Salton, Gerard
author:Allan, J.
author:Buckley, C.
X-date: March 1993
URL:https://cs-tr.cs.cornell.edu/dienst/1.0/rep/CORNELLCS:TR93-1334/
 body?type=application/postscript
X-pages:25

X-publisher:CORNELLCS
X-DocID:94-1420
title:Lower Bounds for Dynamic Connectivity Problems in Graphs
author:Fredman, Michael L
author:Rauch, Monika H.
X-date:April 94
URL:https://cs-tr.cs.cornell.edu/dienst/1.0/rep/CORNELLCS:TR94-1420/
 body?type=text/plain
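
The record layout described above (attribute:value pairs, continuation lines that begin with a space, and a blank line between records) could be read by a client roughly as follows. This Python sketch is illustrative only: the function name is hypothetical, and joining a continuation line by dropping its single leading space without adding a separator is an assumption, since the encoding rules given here do not say how the pieces are rejoined. Each record is kept as a list of pairs so that repeated attributes such as author survive.

```python
def parse_records(text: str):
    """Parse a response body into a list of records.

    Each record is a list of (attribute_name, value) pairs in document order.
    """
    records, current = [], []

    def flush():
        if current:
            records.append([(name, value.strip()) for name, value in current])
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()                       # blank line ends the record
        elif line[0] == " " and current:
            name, value = current[-1]     # continuation of previous value
            current[-1] = (name, value + line[1:])
        else:
            # Split at the first colon only; values such as URLs contain more.
            name, _, value = line.partition(":")
            current.append((name, value))
    flush()
    return records


sample = (
    "X-publisher:CORNELLCS\n"
    "title:Approaches to Passage Retrieval in Full Text Information \n"
    " Systems\n"
    "author:Salton, Gerard\n"
    "author:Allan, J.\n"
    "\n"
    "X-publisher:CORNELLCS\n"
    "title:Lower Bounds for Dynamic Connectivity Problems in Graphs\n"
)
for record in parse_records(sample):
    print(record)
```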

INDSearchMethod

The server returns a text/x-dienst-response document formatted in the same fashion as for the INDContentsMethod. The records returned are documents that meet the SearchCriteria for the SearchType included in the request. The only currently supported SearchType is rfc-1357. For this SearchType, SearchCriteria has the form term[&terms], where each term has the form name=value. "name" is an rfc-1357 field tag (e.g. TITLE, AUTHOR, ABSTRACT) and "value" is the text to search for in the respective field in the rfc-1357 bibliographic entry of a document. For a document to meet the rfc-1357 SearchCriteria, all terms must be true; in other words, terms are connected by "and". Note that HTTP requires that any special characters in a value (e.g. space, question mark, etc.) be represented by escape sequences.

REQUEST:

dienst/1.0/index/search/rfc-1357?author=rus&abstract=mobile+robot
RESPONSE:
X-DocID:91-1254
title:Task-Level Planning and Task-Directed Sensing for Robots in
 Uncertain Environments
author:Donald, Bruce Randall
author:Jennings, James
author:Brown, Russel
X-date: 12 Jun 91
URL:https://cs-tr.cs.cornell.edu/dienst/1.0/rep/CORNELLCS:TR91-1254/
 body?type=application/postscript

X-publisher:CORNELLCS
X-DocID:94-1429
title:Analyzing Teams of Cooperating Mobile Robots
author:Donald, Bruce Randall
author:Jennings, James
author:Rus, Daniela
X-date:10 Apr 94
URL:https://cs-tr.cs.cornell.edu/dienst/1.0/rep/CORNELLCS:TR94-1429/
 body?type=application/postscript
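
The sketch below shows one way a client might build an rfc-1357 search request with the required escaping, and how an index server might apply the "and" rule. It is a Python illustration with hypothetical function names. The request path follows the IndexRequest production of the BNF grammar. Treating a term as a case-insensitive substring match is an assumption, as the matching rule is not defined here, and the entry used in the usage lines is invented for the example.

```python
from urllib.parse import urlencode


def search_request(terms: dict) -> str:
    """Build an INDSearchMethod request; urlencode() escapes the values."""
    return "dienst/1.0/index/search/rfc-1357?" + urlencode(terms)


def matches(entry: dict, terms: dict) -> bool:
    """True only if every term is found in its field (terms are ANDed)."""
    return all(
        text.lower() in entry.get(name, "").lower()
        for name, text in terms.items()
    )


terms = {"author": "rus", "abstract": "mobile robot"}
print(search_request(terms))

# Hypothetical bibliographic entry, for illustration only.
entry = {"author": "Rus, Daniela", "abstract": "Teams of mobile robots"}
print(matches(entry, terms))
```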

REPDocFormatsMethod

The server returns a text/x-dienst-response document that consists of a list of tuples: a URL, a Content-Type, an optional Content-Length, and a number of pages (which may be required as specified below). The list indicates the formats in which the server is prepared to deliver the specified document. This list is encoded in the manner proposed for URCs. Note the Content-Length is optional and can only be determined if the data for the format is stored in a single file. The number of pages is required if the URL specifies a format that is available in discrete pages.

REQUEST:

dienst/1.0/rep/CORNELLCS:TR91-1254/formats
RESPONSE:
Content-Type:text/plain
Content-Length:181249
URL:https://foo.edu/dienst/1.0/rep/CORNELLCS:TR91-1254/body?
 type=image/gif
Content-Type:image/gif
X-pages:15

REPDocPageMethod

If the request includes an argument specifying the format to be returned, the server returns the page of the document specified in the request in that format. If no format argument is included, the server uses the "Accept" field in the HTTP request header to determine the format of the document "preferred" by the client and returns the page of the document specified in the request in that format. A format argument is encoded as a MIME type with optional parameters; for example, the format image/gif;dpi=72 indicates a 72 dpi GIF image of the page. NOTE: The standard means of specifying the desired MIME type of a response in an HTTP request is to use the ACCEPT field in the request header. However, REPDocPageMethod is intended as the method used by a user interface server to compose HTML documents that consist of embedded page images and links to previous and next pages of a document (see UIDocPageMethod). HTML has no means of explicitly setting the ACCEPT field in the IMG tag and, thus, the MIME format must be included in the protocol request.
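
The precedence described above (an explicit format argument first, the Accept field otherwise) might be implemented along these lines. This Python sketch is deliberately simplified and its names are hypothetical: it ignores MIME parameters such as dpi, quality values, and wildcards in the Accept field, and it assumes the server simply takes the first listed type it can deliver.

```python
from typing import Iterable, Optional


def choose_format(type_arg: Optional[str], accept_header: str,
                  available: Iterable[str]) -> Optional[str]:
    """Pick the MIME type for a page, or None if nothing suitable exists."""
    if type_arg:
        # The explicit "type=" argument overrides the Accept field.
        wanted = [type_arg.split(";")[0].strip()]
    else:
        wanted = [item.split(";")[0].strip()
                  for item in accept_header.split(",") if item.strip()]
    offered = set(available)
    for mime in wanted:
        if mime in offered:
            return mime
    return None


formats = ["image/tiff", "image/gif"]   # hypothetical formats on the server
print(choose_format("image/gif;dpi=72", "image/tiff", formats))
print(choose_format(None, "image/tiff, image/gif", formats))
```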

REQUEST:

dienst/1.0/rep/CORNELLCS:TR91-1254/page?page=1&type=image/tiff
RESPONSE:

<byte stream for tiff representation of page 1>

REPDocPartMethod

The server returns the part of the document specified in the request. The desired MIME format of the document part is encoded in the "Accept" field of the HTTP request header (in contrast to the REPDocPageMethod). Note that the only "part" supported at this time is "body", which indicates the full contents of the document. Future versions of the protocol may support additional logical document parts such as chapters, sections, tables, and so on.

REQUEST:

dienst/1.0/rep/CORNELLCS:TR91-1254/body
RESPONSE:
<byte stream of representation of document body in format specified by ACCEPT>

REPDocPrintMethod

The server takes one of two actions depending on the destination argument that is included in the request. The examples below show both types of requests.

REQUEST:

dienst/1.0/rep/CORNELLCS:TR91-1254/print?
 destination=download&pages=some&from=5&to=8
RESPONSE:

<application/postscript representation of pages 5-8 of the document>

REQUEST:

dienst/1.0/rep/CORNELLCS:TR91-1254/print?
 destination=printer&printer=pr1&pages=all
RESPONSE:

All pages of CORNELLCS:TR91-1254 have been submitted to printer pr1.
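
For reference, the argument string of a print request can be assembled from the PrintArguments production as in the Python sketch below. The function name is hypothetical. The sketch emits the page arguments before the destination arguments, as the grammar writes them, and it assumes that a page range must satisfy 1 <= from <= to, which is not stated explicitly here.

```python
from typing import Optional


def print_arguments(destination: str, printer: Optional[str] = None,
                    first: Optional[int] = None,
                    last: Optional[int] = None) -> str:
    """Build PrintPagesArguments&PrintDestArguments for REPDocPrintMethod."""
    if first is None and last is None:
        pages = "pages=all"
    elif first is not None and last is not None and 1 <= first <= last:
        pages = f"pages=some&from={first}&to={last}"
    else:
        raise ValueError("give both page bounds, with 1 <= first <= last")

    if destination == "download":
        dest = "destination=download"
    elif destination == "printer" and printer:
        dest = f"destination=printer&printer={printer}"
    else:
        raise ValueError("destination must be 'download' or a named printer")
    return f"{pages}&{dest}"


base = "dienst/1.0/rep/CORNELLCS:TR91-1254/print?"
print(base + print_arguments("download", first=5, last=8))
print(base + print_arguments("printer", printer="pr1"))
```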

UIDocOverviewMethod

The server returns a text/html document that contains in-line, reduced-size, page images (when available) of the specified document. The purpose of this is to facilitate browsing of large documents. The html document should be composed so that a user can select one of the reduced size images (e.g., using the ISMAP facility) and view it in a larger, readable format. Long documents may be divided into several of these "overview" documents. In this case the request should include a "page" argument and the text/html document returned by the server will include hypertext links to get to the next or previous "overview" page.

REQUEST:

dienst/1.0/ui/CORNELLCS:TR91-1254/overview?page=1
RESPONSE:

For a sample response for this method see the prototype implementation.

UIDocPageMethod

The server returns a text/html document that contains an inline image (when available) of the page of the document that is specified in the request. The text/html document will include hypertext links to get to the next or previous page of the document.

REQUEST:

dienst/1.0/ui/CORNELLCS:TR91-1254/page?page=1
RESPONSE:

For a sample response for this method see the prototype implementation.

UIDocPrintMethod

The server returns a text/html document that contains a forms-based interface which allows printing or downloading the document. This should only be permitted if the document is available in a format that is suitable for printing - usually this means it is available in PostScript or in a form convertible to PostScript. The server should use information in the HTTP request header to determine the domain origin of the client request, to determine if printing is possible or downloading is the only option for this client. The HTML form may permit the user to select specific pages of the document to print or download, if this is possible (i.e., the PostScript representation follows PostScript document structuring conventions, which make it possible to find code for specific pages).

REQUEST:

dienst/1.0/ui/CORNELLCS:TR91-1254/print
RESPONSE:

For a sample response for this method see the prototype implementation.

UISearchIntMethod

The server returns a text/html document that contains a forms-based interface for submitting a document search. When the search is submitted, the server will handle actual querying of one or more index servers and return a text/html document that contains links to documents which are the result of the search. At this time the only currently supported SearchType is rfc-1357. For this type of search, the suggested search form consists of a set of text fields that are labeled as rfc-1357 field types (e.g., Title, Author, Abstract). The user can then enter data into these fields. This data can then be used by the user interface server to submit an rfc-1357 search request to an index server using the INDSearchMethod. As more SearchTypes are incorporated into the protocol, for example full text or complex boolean queries, other user interfaces for searching can be developed.

REQUEST:

dienst/1.0/ui/search
RESPONSE:

For a sample response for this method see the prototype implementation.

UIDocSummaryMethod

The server returns a text/html document that contains information about the specified document. This information should include the title, author, date, and abstract and links to, or information about, various formats in which the TR is available.

REQUEST:

dienst/1.0/ui/CORNELLCS:TR91-1254/summary
RESPONSE:

For a sample response for this method see the prototype implementation.

4: Error Responses to Dienst Methods

Error responses to Dienst requests are encoded as standard HTTP responses; that is, a non-200 (not OK) status code is returned with a printable reason string that gives an explanation of the error for the human reader. There are three possible error status codes returned by a Dienst server:
  1. Bad Request 400 - This status code is returned when the syntax of the Dienst request does not fit the BNF syntax defined by the protocol.
  2. Not Found 404 - This status code is returned when the arguments for the method specify a nonexistent object (e.g. a nonexistent Document ID or page number of a document).
  3. Not Implemented 501 - The RequestClass specified is not supported by the Dienst server.
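
One possible way for a server to choose among these codes is sketched below in Python. The function name, the order of the checks, and the sample page count are assumptions made for illustration, and only a small part of the grammar is validated: the protocol version, the RequestClass, and the page number of a page request.

```python
from urllib.parse import parse_qs

KNOWN_CLASSES = {"misc", "index", "rep", "ui"}


def status_for(request: str, supported: set, page_counts: dict):
    """Return (status_code, reason) for a Dienst request string."""
    path, _, query = request.partition("?")
    segments = path.split("/")
    if len(segments) < 4 or segments[:2] != ["dienst", "1.0"]:
        return 400, "Bad Request"          # does not fit the BNF syntax
    request_class = segments[2]
    if request_class not in KNOWN_CLASSES:
        return 400, "Bad Request"
    if request_class not in supported:
        return 501, "Not Implemented"      # class not offered by this server
    if request_class in ("rep", "ui") and len(segments) == 5:
        doc_id, method = segments[3], segments[4]
        if doc_id not in page_counts:
            return 404, "Not Found"        # nonexistent document
        page = parse_qs(query).get("page", [None])[0]
        if method == "page" and page is not None:
            if not page.isdigit() or int(page) < 1:
                return 400, "Bad Request"  # PageNumber must be positive
            if int(page) > page_counts[doc_id]:
                return 404, "Not Found"    # nonexistent page
    return 200, "OK"


docs = {"CORNELLCS:TR91-1254": 15}         # hypothetical DocID -> page count
print(status_for("dienst/1.0/rep/CORNELLCS:TR91-1254/page?page=99",
                 {"misc", "rep"}, docs))
```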

5: Related Issues

This section describes a number of issues related to digital library servers that are not addressed by this proposal. The authors view this list as items for future work and invite suggestions or participation from other interested parties.

Authentication and Authorization

Dienst was originally designed as a protocol for communication with technical report servers. These servers have limited, if any, copyright restrictions that would require limiting access to the documents. This is not true for general digital library services where document licenses often specify that the "patrons" be limited to a specific audience (e.g. registered students at a university). Current work to develop a secure HTTP [S-HTTP] may provide some of this desired functionality.

Payment

The issues here are similar to authentication and authorization. Again there is substantial work in the internet community to resolve this issue.

Server Registration

The authors recognize that a prerequisite for a distributed digital library is a means for providing mutual awareness among the servers. That is, there should be a means for a site to "bring up" a server that understands Dienst and make other servers aware of the existence of the new server (so that they may search a new index, for example). The prototype implementation uses a central server that has a (manually maintained) list of existing servers. Clearly, this method does not scale and work needs to be done in this area.

Server Meta-Information

As digital library servers proliferate, the need for extracting meta-information about a server will develop. At the simplest level this meta-information might be an "abstract" of what type of documents the server provides. More structured meta-information might provide the means of implementing more sophisticated search schemes such as automatically determining the index which is most relevant for a specific query [GLOSS] or searches involving information agents [RUS].

6: References

[BIB] Danny Cohen. A Format for E-mailing Bibliographic Records. RFC-1357.

[BRADEN] R. Braden. Requirements for Internet Hosts -- Application and Support. RFC-1123.

[CROCKER] David H. Crocker. Standard for the format of ARPA Internet Messages. RFC-822.

[DIENST] James R. Davis, Carl Lagoze. A protocol and server for a distributed digital technical report library. Cornell University Computer Science Department Technical Report 94-1418, June 1994.

[GLOSS] Luis Gravano, Hector Garcia-Molina, Anthony Tomasic. The Efficiency of GLOSS for the Text Database Discovery Problem. Stanford University Technical Report CS-TN-93-2.

[HTTP] Tim Berners-Lee. Hypertext Transfer Protocol (HTTP). Internet Draft.

[MIME] Nathaniel S. Borenstein, Ned Freed. MIME (Multipurpose Internet Mail Extensions). RFC-1341.

[RUS] Daniela Rus, Devika Subramanian. Information Retrieval, Information Structure, and Information Agents. Submitted to ACM Transactions on Information Systems.

[S-HTTP] Eric Rescorla, Allan M. Schiffman. The Secure HyperText Transfer Protocol. To appear as an RFC.

[URC] Michael Mealling. Encoding and Use of Uniform Resource Characteristics. Internet Draft.

[URL] Tim Berners-Lee. Uniform Resource Locators (URL). Internet Draft.

[URN] K. Sollins, L. Masinter. Requirements of Uniform Resource Names, March 26, 1994. Internet Draft.

[WWW] Tim Berners-Lee, Robert Cailliau, Jean-Francois Groff, and Bernd Pollermann. World-wide web: The information universe. Electronic Networking: Research, Applications and Policy 2(1):52-58, 1992.

7: Authors' Addresses:

James R. Davis
Xerox Corporation
Design Research Institute
Cornell University
Ithaca, NY 14853
davis@dri.cornell.edu

Carl Lagoze
Computer Science Department
Cornell University
Ithaca, NY 14853
cjl2@cornell.edu