User:JohnMie

The well-behaved document - has Dublin Core inside

By John W. Miescher, Bizgraphic - Geneva/Switzerland, miescher@bizgraphic.ch

Abstract

A well-behaved document is an electronic document which is both user friendly and library friendly.

This document aims to show why authors and publishers should produce only well-behaved electronic documents: such documents are more likely to be found by search engines, consulted and referenced on a regular basis, and they save both time and money during production.

User friendliness is best achieved by appropriately instructing authors and publishers, as well as the parties ordering authoring work. The briefing should define topics, keywords, formatting and style usage, and should insist on disciplined implementation.

Library friendliness means that a document includes embedded cataloguing information in a known format that allows library records to be processed automatically with little or no manual intervention. The Dublin Core set of standard terms seems to be the best choice for embedding such information: it is widespread, PDF files automatically include some of its terms, and software is available for both reading and writing metadata in this format.

Professional libraries prefer to keep the metadata of their documents in separate repositories for reasons of integrity. Since one does not exclude the other, however, embedding (a selection of) the same metadata directly in a digital resource automatically makes this data available to third parties who download such resources to preserve them locally in their own knowledge base and/or to consult them offline.

Attention: The plea for the well-behaved document is not an evaluation or criticism of the Dublin Core or of any other metadata generation, description or harvesting system. It is a suggestion for embedding simple metadata in electronic documents, metadata that is meaningful for the user of a document and that he/she can extract with suitable software for inclusion in his/her private collection.

Keywords

Well-behaved document; Dublin Core; embedded metadata; Knowledge Management; Digital Asset Management; PDF, EPUB, HTML; catalogue data; automatic library recording

Plea for the well-behaved document

A well-behaved document is an electronic document that is both user friendly and library friendly.

User friendly means that a document (we are talking about any document with more than just a few pages) is easy to read and easy to navigate on any device, and that reading software for it is readily available. It is in an open format and does not depend on proprietary software (software that must be purchased) for display, styles and multimedia content. It must be searchable and have bookmarks (in applications that allow for them, such as PDF files in Acrobat or Adobe Reader), an interactive table of contents, i.e. one with ‘clickable’ links to the correct target page, and possibly an interactive index, cross references and links to external resources. Except for copyrighted material, it should not be password protected or encrypted but must allow the user to print it out, to copy and paste portions of the text and possibly to add bookmarks and comments of his own.

This applies not only to scientific papers and manuals but to all documents that, unlike novels and literary works, are not necessarily read in a continuous stream from cover to cover.

Library friendly means that a document has useful embedded metadata which librarians can exploit to classify the document automatically, with little or no manual intervention. It is searchable and can easily be indexed for full-text searching across a collection of documents.

Most professional (public, national, academic) libraries prefer to keep the metadata of their documents in separate repositories (aka catalogues) for reasons of integrity and to be able to output such metadata on demand in a variety of formats (Dublin Core http://dublincore.org/documents/dcmi-terms/, MARC/MARC 21 http://www.loc.gov/marc/, MODS http://www.loc.gov/standards/mods/, Dewey Decimal Classification http://www.oclc.org/dewey/ etc.). Since one does not exclude the other, however, embedding (a selection of) the same metadata directly in a digital resource automatically makes this data available to third parties who download such resources to preserve them locally in their own knowledge base and/or to consult them offline.

The Dublin Core set of standard terms, in combination with appropriate software, is probably the best way to ensure that the embedded metadata is actually useful.

Both user friendliness and library friendliness are, in principle, easy to accomplish, but we found that the majority of documents available on the web, as well as those published internally by large organizations, are neither user friendly nor library friendly.

DCMI Today – Success and issues

DCMI has attracted a large following and is by now known to professional librarians around the world. They have software to manage and extract the information contained therein and to deploy it within their own systems, complete with their own search mechanisms. With 15 elements and 55 terms to describe resources, plus possible qualifiers and an unlimited number of namespaces to refine things even further, this standard has become the reserve of the most initiated librarians. This is why a large number of the almost 2 million documents we found on the web under “Dublin Core” stem from universities and full-time librarians. Yet we found very few documents on the internet with embedded metadata beyond the basics, and even fewer that included a meaningful set of DC terms that a librarian could directly exploit to complete his records.

The following assertion is apparently still far from being commonly accepted: “Catalogue data that travels with the document facilitates automatic library recording” (J.W. Miescher)

DC terms have not yet attracted the attention of authors

Books, manuals, papers and all other documents that are being published today, whether in print or just electronically, are authored in the majority of cases by non-librarians who have never heard of the semantic web and are neither familiar with nor interested in DC terms. Most authoring tools and page layout software offer no way of specifying these beyond the basics, i.e. title, author, subject and keywords, and often even these basics are not filled in with any discipline.

DC terms must be added manually

Librarians and content managers of organizations large enough to have a publication department and a library of their own are thus obliged to profile each document that passes through their hands and manually complete the information that describes the resource adequately.

This involves a lot of guesswork, which impairs the quality of the metadata and the reliability of library records and makes knowledge organization difficult.

DC terms would enhance the quality of library listings

These same content managers regularly publish overviews of their entire libraries or just updates of the new additions. The usefulness of such overviews depends directly on the quality and precision of the descriptions of the resources. The informative value of the associated DC terms could enhance this quality.

Content managers may also issue updates including actual documents for the members of their community or employees of their organization on their internal networks, on the web or on CD. In the case of electronic distribution such updates typically contain an interactive table of contents (TofC hereafter) that allows users to click on a title to bring up the underlying document. Ideally such TofCs would also list titles by topics and/or author and, in the case of international organizations, in multiple languages.

DC terms would improve user friendliness of (electronic) documents

A search mechanism that depends on embedded metadata is often also included in the displaying application (e.g. Adobe Reader®). In most cases the search result displays the title of each document found as the only meaningful and unique pointer to its contents; the title should therefore be the first and foremost DC term to be embedded. Surprisingly, over 65% of the PDF files we find on the web and in dedicated collections of international organizations have no meaningful title embedded as a meta tag, and an even higher share of electronic documents are not user friendly: they have no bookmarks and no interactive TofCs for easy navigation because the publication was originally intended for printing only.

Dublin Core has limitations

The developers of the DC standard obviously had certain scenarios in mind (such as classifying web content) and wanted to introduce more specific metadata beyond the basics (title, author, subject, keywords and date, which almost any word processor today can embed automatically). They did so by introducing an extended set of 55 additional terms and refinements or qualifiers, but in so doing they made the matter so complicated for the average consumer of an electronic document that only well-versed librarians manage to map the properties of a document correctly to DC terms with the right refinements. Average users may find it difficult to understand why the 15 classical or legacy elements also appear as part of the 55 newer terms and how to deal with the ambiguities this presents (see also [Core’s dirty little secret]).

Most users are only interested in the actual words and don’t care about namespaces and other refinements the Dublin Core system offers. A simple notation in attribute/literal pairs is probably adequate for most private or local repositories.
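As an illustration of what such attribute/literal pairs might look like, the following Python sketch writes a record as plain `dc.<element>=<literal>` lines and reads it back. The line notation and the function names `to_pairs` and `from_pairs` are assumptions made for this example, not part of any Dublin Core specification; only the list of the 15 element names is taken from the standard.

```python
# The 15 elements of the Dublin Core Metadata Element Set, version 1.1.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_pairs(record):
    """Serialize a record as 'dc.<element>=<literal>' lines (hypothetical notation)."""
    lines = []
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        for value in values:  # elements are repeatable, so each maps to a list
            lines.append(f"dc.{element}={value}")
    return "\n".join(lines)


def from_pairs(text):
    """Parse 'dc.<element>=<literal>' lines back into a record."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("=")
        element = name.strip().split(".", 1)[-1]  # drop the 'dc.' prefix
        record.setdefault(element, []).append(value.strip())
    return record


record = {
    "title": ["The well-behaved document"],
    "creator": ["John W. Miescher"],
    "subject": ["embedded metadata", "Dublin Core"],
}
print(to_pairs(record))
```

A flat notation of this kind carries no namespaces, schemes or refinements, which is exactly the simplification suggested here for private or local repositories.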

Another limitation is Dublin Core's restricted compatibility with the other systems in use (see below).

Dublin Core has competition

Dublin Core as a standard has many competitors: standards that are issued by major libraries, universities, school authorities and other interest groups (e.g. [21], [[1]], [Decimal Classification] and many more). None of these groups is willing to abandon its proprietary standard, which complicates the life of authors, publishers and librarians alike.

One- or two-way bridges (so-called crosswalks like [to DC]) could be made available to translate between the other standards and Dublin Core, so that librarians around the world could easily and automatically classify any document in their own system regardless of source or subject matter. However, this is an almost impossible task, since some formats are richer than others and have many meta tags for which there is no 1:1 equivalent in any of the other standards. Moreover, many of the implementers (e.g. the Library of Congress, other national libraries, academic libraries etc.) apply different value types to tags, create their own tags and include a lot of information that has no meaning for the recipient of a document.
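The difficulty can be made concrete with a small sketch of a one-way bridge. The mapping table below is a tiny illustrative subset chosen for this example (an assumption, not a complete or authoritative crosswalk), and the sample field values are invented. The point is only that every tag without an equivalent falls through and its information is lost.

```python
# Illustrative subset of a MARC 21 -> Dublin Core crosswalk. The tag choices
# are an assumption for this sketch, not a complete or authoritative table.
MARC_TO_DC = {
    "245": "title",        # title statement
    "100": "creator",      # main entry, personal name
    "650": "subject",      # subject added entry, topical term
    "520": "description",  # summary note
    "020": "identifier",   # ISBN
}


def crosswalk(marc_fields):
    """Map (tag, value) pairs to DC; return the record and the unmapped tags."""
    record, unmapped = {}, []
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)  # no equivalent in this table: information is lost
        else:
            record.setdefault(element, []).append(value)
    return record, unmapped


# Invented sample input, for illustration only.
marc_fields = [
    ("245", "The well-behaved document"),
    ("100", "Miescher, John W."),
    ("650", "Metadata"),
    ("852", "Main stacks"),  # holdings location, meaningless to a recipient
    ("040", "XYZ"),          # cataloguing source
]
record, unmapped = crosswalk(marc_fields)
print(record)
print("unmapped:", unmapped)
```

Returning the unmapped tags alongside the record lets a librarian see what a given bridge silently drops before relying on it.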

Dublin Core Inside

The term “Dublin Core inside” should be viewed as a mark of quality for truly well-behaved electronic documents: to be allowed to carry and publicly display this mark, a document must be both user friendly and library friendly as defined above.

Both conditions of the well-behaved document are, in principle, easy to satisfy and the benefits are potentially quite important, particularly if implemented already at the authoring stage. This may imply some education and perhaps the use of some simple and affordable software tool that can be used by authors, desktop publishers and designers on the one hand and by (local-) librarians and content managers on the other.

Education is about raising awareness among all players in the publication process. It starts with the author or the party ordering the work, and with the publisher, who can plan their work for user friendliness and library acceptability from the outset. The software tools should help manage library collections of electronic documents in PDF, EPUB and HTML format while respecting the DCMI interoperable online metadata standards. Optionally they should also handle physical objects (e.g. printed documents, artefacts, images, movies, music etc.), as a typical user probably has both types in his collection.

Education

Education should include training of all parties involved in the production of documentation, to enhance understanding of the workflow and to recognize the areas where time and money can be saved by avoiding duplication and unnecessary handling. This can take the form of on-site seminars, public lectures, subscription e-mail, written help files and links to relevant papers on Dublin Core sites.

The importance and indeed the necessity of including DC terms in all documents of consequence should be stressed as an integral part of such education. Parties commissioning authoring work should be encouraged to include certain DC elements (e.g. schemes and namespaces to respect) already in their briefs. The incentive for this lies in the savings that can be achieved in the subsequent steps, see workflow chart (fig. 1.) below.

Some of the steps to take are really simple: they involve practically no extra work for one person but can save a substantial amount of work for someone else further down the line. Good planning from concept to output is one such step. This includes the early assignment of relevant topics according to a corporate-wide list (namespace). Tweaking the authoring tool or page layout program to generate useful cataloguing data is another useful step. So is the consistent use of styles to generate a more structured text, which allows (the electronic version of) the document to become interactive, complete with bookmarks and a live TofC that points to the correct page when clicked.

This last point is particularly important in a world where more and more documents never see print at all but are intended for on-screen consumption and must therefore be searchable and easily navigable. All parties involved in the origination of a document must bear this in mind and be instructed accordingly.

Organizations must establish firm rules and insist on their application when ordering authoring work. Authors must be taught to structure their work from the very beginning and how to include meta tags and navigation elements.


 

Document Creation and Integration Process

Columns of the chart: Time line; Decisions to be taken; Comment

Idea / Concept
Decisions: what needs to be communicated; who should read it; who could learn from it; how will readers use it
Comment: checklist: the resulting document will be part of a library or collection on the web; will be part of an indexed collection on CD/DVD; must be interactive (bookmarks, clickable TofC); must be searchable (indexed by title, topic, author)

Contents and structure
Decisions: title; foreword by...; synopsis
Comment: category/subject; topics (namespace); consider aspects of multipurposing from the very beginning

Appoint Project Manager
Decisions: who can do it; who pays for it; budget; timeframe
Comment: producer, editor, printer; owner, sponsor(s); credits; copyright; ISBN number, price, EAN code

Select and brief Author

Research
Decisions: resources/references; photography/illustration; graphs & charts
Comment: cover picture

Copy writing *
Comment: use styles in the word processor or layout program to structure the text and help make the final documents interactive: Main Title, Chapter Title; Themes; Subtitles, Subsubtitles; Paragraph text; Legend, Quote, Citation; Highlight; Keywords

Review + correction **
Comment: for scientific papers: peer review; submission to acceptance authority; obtain faculty recommendation

Approval of text

Translations

Page Layout ***
Decisions: select Designer, Printer, Content Packager, Webmaster, CD-Producer
Comment: generate bookmarks, tables of contents, hyperlinks, indices and cross-references in the layout program before creating a PDF file

Output to PDF for multiple deployment
Comment: submission to journals for publication
Print: PDF/X-3, hi-res
Internet: PDF low-res, Flash, eBook or HTML
CD/DVD-ROM: PDF low-res, interactive

* typically with MS Word or OpenOffice Writer
** typically using Adobe Acrobat flow control
*** typically in Adobe InDesign

prepared 2011 © by Bizgraphic - Geneva

FIG. 1. Genesis of a typical corporate document

The greatest potential savings come from the items that avoid a second handling:

• Assignment of topics (namespaces) and other dc.elements even in the first draft

• Considering aspects of multi-purposing from the very beginning

• Consistent use of styles to automatically generate navigation items
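The last of these points can be illustrated with a minimal Python sketch: if every paragraph carries a style name, the navigation items fall out of the text mechanically. The style names echo those in the workflow chart, but the depth assigned to each style, the function names and the sample paragraphs with their page numbers are assumptions made for this example.

```python
# Hypothetical mapping from style names to outline depth. The style names
# echo those in the workflow chart; the depths are assumptions.
HEADING_LEVELS = {"Main Title": 0, "Chapter Title": 1, "Subtitle": 2}


def build_outline(paragraphs):
    """Return (level, text, page) entries for paragraphs set in a heading style."""
    return [
        (HEADING_LEVELS[style], text, page)
        for style, text, page in paragraphs
        if style in HEADING_LEVELS
    ]


def render_toc(outline):
    """Render the outline as an indented, plain-text table of contents."""
    return "\n".join(
        f"{'  ' * level}{text} ... {page}" for level, text, page in outline
    )


# Invented sample input: (style name, text, page number).
paragraphs = [
    ("Main Title", "The well-behaved document", 1),
    ("Chapter Title", "Education", 4),
    ("Paragraph text", "Education should include training ...", 4),
    ("Subtitle", "Coaching", 6),
]
print(render_toc(build_outline(paragraphs)))
```

Text formatted by hand (bold, larger font) instead of with a named style would simply be invisible to such a procedure, which is why style discipline at the authoring stage avoids a second handling.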

Coaching

As any content or production professional knows, developing a workflow that actually works can be a major challenge. Keeping track of important files and assets at each stage is critical. Effective file management is an important and necessary part of the creative process.

Coaching, seminars and workshops are probably best suited for imparting the required knowledge. Several software tools are available on the market for document or content management, for maintaining collections of electronic documents and for embedding metadata in electronic files.

The tools

Software tools to be used in this context should ideally cover two types of tasks: one for library and content managers, and one for persons who need to read, modify and embed Dublin Core terms in documents, such as authors and publishers.

The library management part is essentially a tool to build databases of physical and electronic library objects and to output these as searchable lists and interactive tables of contents. Individual records contain some logistics information plus fields representing all Dublin Core elements and terms.

Physical (e.g. printed), electronic and web-based documents should live side by side in one list. Physical objects may have to be entered manually whereas electronic documents can be parsed (extracted) automatically regardless of location, i.e. whether stored locally, on a network or on the web.

The document handling part takes care, on the one hand, of metadata extraction and manipulation, making files library friendly. On the other hand, it allows bookmarks and interactive tables of contents to be added to existing (PDF) files, making files user friendly.

See Table 1 for a listing of desirable features of a suitable software tool and Table 2 for a list of formats that could or should be supported.


PDF, EPUB and HTML formats supported

Mini-browser for searching, viewing and downloading web content

Manage library with multiple collections

Library friendlier documents: Modify, add or delete metadata in individual files

Read embedded metadata from files, including Dublin Core Terms

Re-map custom and unrecognized meta tags to official Dublin Core Elements and Terms

Customizable bridge for importing metadata records from external sources in a variety of formats (e.g. MARC 21, MODS, EndNote etc.)

Allow definition of user-defined attributes not included in the DC standard (particularly for collections of physical objects)

Build up collections (aka linear containers or catalogues) from files on disk or LAN, from web based content or from manually added physical or unsubstantial objects (print, images, music, services or any other collectable items)

Add or modify metadata of objects or globally find and replace across entire collection

Keep individual records (library cards) and print out records as text or customized PDF template

Directory listing with file preview, sortable and printable

Include thumbnail and/or full size picture of objects - import from file or use included screen capture feature

Generate lists and reports in a variety of formats

Export lists to Excel, tab-delimited text files or PDF

Build printable summaries and tables of contents (TofC) based on embedded catalogue data using standard Dublin Core Terms

Maintain Titles, Topics and Topics Association lists for multilingual TofC sorted alphabetically in each language by title, topic, author and/or any other criteria

Build interactive TofC in PDF format for distribution on CD etc.

Export TofC to InDesign tagged text for professional presentation

User friendlier PDF documents: Allow for adding bookmarks and interactive links to tables of contents of existing PDF documents

Word processing tool including style definitions to create and output well-behaved documents in PDF format

Build interactive TofC in HTML format for distribution on CD or Web

Support for other electronic file formats such as XML, JPEG, TIFF etc.

Allow re-mapping of other XMP meta tags found in PDF created with Acrobat 9 or newer

Versions in other languages

Versions for Mac and Linux operating systems

Table 1. Desirable features of software for managing collections of well-behaved documents
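One of the features listed in Table 1, re-mapping custom and unrecognized meta tags to Dublin Core names, can be sketched with Python's standard `html.parser` module. The re-mapping table `REMAP`, the class and function names and the sample page are assumptions made for this illustration; a real tool would let the user maintain the table.

```python
from html.parser import HTMLParser

# Hypothetical re-mapping table: non-standard meta names on the left,
# Dublin Core names on the right.
REMAP = {
    "author": "dc.creator",
    "keywords": "dc.subject",
    "description": "dc.description",
    "dc.author": "dc.creator",
}


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content:
            self.pairs.append((name.strip().lower(), content.strip()))


def extract_dc(html):
    """Return DC name -> list of values, plus the names that could not be mapped."""
    collector = MetaCollector()
    collector.feed(html)
    record, unmapped = {}, []
    for name, content in collector.pairs:
        name = REMAP.get(name, name)
        if name.startswith(("dc.", "dcterms.")):
            record.setdefault(name, []).append(content)
        else:
            unmapped.append(name)
    return record, unmapped


# Invented sample page, for illustration only.
sample = """<html><head>
<meta name="DC.title" content="The well-behaved document">
<meta name="Author" content="John W. Miescher">
<meta name="generator" content="SomeTool 1.0">
</head><body></body></html>"""
record, unmapped = extract_dc(sample)
print(record)
print(unmapped)
```

Names are compared in lower case because originators vary the capitalization freely; anything left in `unmapped` is a candidate for a new entry in the re-mapping table.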

Library management

The data can be arranged into collections or lists, which can be formatted as tables of contents sorted alphabetically, by topic or by any of the included DC items. Titles and topics can also be associated via external scripts, which makes it easier to produce multiple TofCs from the same collection, e.g. in multiple languages or sorted differently.

These TofCs can be printed and exported to a tab-delimited text file, to Excel®, or to InDesign®. In a second step they can be converted to PDF or HTML and thus become interactive tables of contents to be published electronically, on CD or on the web.
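A minimal sketch of the first step, sorting a collection and exporting it as tab-delimited text, might look as follows in Python. The record fields (`topic`, `title`, `creator`), the function names and the sample records are invented for this illustration.

```python
import csv
import io


def toc_rows(records, sort_keys=("topic", "title")):
    """Sort records by the given keys, case-insensitively."""
    return sorted(
        records, key=lambda r: tuple(r.get(k, "").lower() for k in sort_keys)
    )


def to_tab_delimited(rows, columns=("topic", "title", "creator")):
    """Write rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(c, "") for c in columns])
    return buffer.getvalue()


# Invented sample records, for illustration only.
records = [
    {"title": "Metadata basics", "creator": "A. Author", "topic": "Cataloguing"},
    {"title": "EPUB in practice", "creator": "B. Writer", "topic": "Formats"},
    {"title": "Bookmarks in PDF", "creator": "C. Editor", "topic": "Formats"},
]
print(to_tab_delimited(toc_rows(records)))
```

Calling `toc_rows(records, sort_keys=("creator",))` on the same collection yields a second TofC sorted by author, which is the sense in which one collection can feed several tables of contents.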

Embedding metadata

Library-relevant information becomes an integral part of the electronic document itself by embedding standard meta-tags beyond the basics (title, author, subject, keywords). Users can easily view and modify the embedded metadata and, in the case of electronic documents in PDF, EPUB or HTML format, these modifications can be re-embedded into the files themselves (except if web-based).
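In PDF files, embedded Dublin Core values are typically carried in an XMP packet, an RDF/XML fragment stored inside the file. The Python sketch below reads the `dc:` properties from such a packet using only the standard library. The packet is hand-written for this example, the function name is an assumption, and locating the packet inside an actual PDF file (which requires a PDF library) is deliberately left out.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# A minimal, hand-written XMP packet of the kind embedded in PDF files.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt>
    <rdf:li xml:lang="x-default">The well-behaved document</rdf:li>
   </rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>John W. Miescher</rdf:li></rdf:Seq></dc:creator>
   <dc:subject><rdf:Bag>
    <rdf:li>Dublin Core</rdf:li>
    <rdf:li>embedded metadata</rdf:li>
   </rdf:Bag></dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_dc_from_xmp(packet):
    """Return Dublin Core element -> list of values found in an XMP packet."""
    root = ET.fromstring(packet)
    record = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for child in description:
            if not child.tag.startswith(f"{{{DC_NS}}}"):
                continue  # skip properties from other XMP namespaces
            element = child.tag.split("}", 1)[1]
            values = [
                li.text.strip()
                for li in child.iter(f"{{{RDF_NS}}}li")
                if li.text
            ]
            record.setdefault(element, []).extend(values)
    return record


print(read_dc_from_xmp(XMP))
```

The sketch only handles values wrapped in `rdf:li` containers, as in the sample; a fuller reader would also have to handle properties written as plain text or as attributes.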


Physical objects have to be entered manually or imported into the input mask (Fig. 2.) via a suitable bridge tool; electronic files are parsed to reveal embedded metadata, which can then be edited in the input mask with spaces for all 15 classical DC elements and the 55 newer DC terms. Schemes and other refinements can be added where appropriate. This helps ensure conformity and uniformity.

Supported document standards

Supported standards for electronic documents are at least PDF, EPUB and HTML (could eventually be extended to XHTML, XML and others).

• All three supported formats are very popular and open

• Free reader software is available for PCs, eBook readers and mobile devices

PDF

This is the most popular file format for multipage electronic documents. There are several applications that can produce files in this format, above all Acrobat® and other Adobe® programmes. Acrobat and its PDF writer allow you to enter the basic metadata (title, author, subject and keywords), which Acrobat also lists in its advanced metadata viewer as the appropriate DC elements. DC terms are not supported, although they are recognized or tolerated if entered with certain third-party tools.
Acrobat also includes its own XMP set of metadata, but no tool to easily manipulate these within Acrobat itself. All you can do is export/import all advanced meta tags to/from an *.XML file.
Another issue with PDF is encryption and password protection. While some tools can read the basic four tags in most cases, they often cannot update such files after modification.
Suitable software can extract both DC elements and DC terms from open PDF files, modify them and write them back.

EPUB

This format seems to be emerging as the new standard for ebooks.
It is not proprietary, some layout programs (e.g. InDesign) can generate it directly, and most major ebook readers can handle it. Some schools in California have even made it mandatory and distributed free eBook readers.
An EPUB file is essentially a zip package containing all elements of the book in HTML or plain text format. The standard includes a pointer to a separate file (usually called content.opf) holding all metadata.
Most examples we have seen use dc:terms correctly.
Suitable software can interpret and modify these without problem.

HTML

This is the most uncontrolled format of all, and there are literally hundreds of applications that can create it, each adding its own flavor of syntax and scripts.
While the <meta name=...> tagging in the <head> block is usually respected by all, we have seen the wildest excesses in what follows the equals sign.
Far from all originators use <dc... or <dc.terms…, and those that do often add fancy designations of their own.
Suitable software should contain a tool that allows unusual meta names to be re-mapped to valid dc.elements and dc.terms even before they are added to a collection.
Another issue with HTML files is that many are virtual or created on the fly, e.g. in response to a query, and many are the result of multiple redirections and are not necessarily the file the user thought he had clicked on. This typically happens on sites built with frame sets or master pages with many links.
Suitable software should allow collections and tables of contents to include dc.elements and dc.terms of virtual files as well.
As downloading such files can yield zero-length files, it is recommended that all downloads be verified before they are added to a collection.

Table 2. Supported file formats
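For the EPUB entry in Table 2, the following Python sketch shows how the metadata can be reached with the standard library alone: the file `META-INF/container.xml` inside the zip package names the package document (often `content.opf`), and the `dc:` elements are read from there. The function names are assumptions, and the sample package built in memory is deliberately incomplete (it has no manifest or spine), so it serves only to exercise the reader.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DC_NS = "http://purl.org/dc/elements/1.1/"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def read_epub_dc(epub_file):
    """Return Dublin Core element -> list of values from an EPUB package."""
    with zipfile.ZipFile(epub_file) as package:
        # META-INF/container.xml names the package document (often content.opf).
        container = ET.fromstring(package.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf = ET.fromstring(package.read(rootfile.get("full-path")))
    record = {}
    for node in opf.iter():
        if node.tag.startswith(f"{{{DC_NS}}}") and node.text:
            element = node.tag.split("}", 1)[1]
            record.setdefault(element, []).append(node.text.strip())
    return record


def make_sample_epub():
    """Build a tiny in-memory package so the sketch runs without a real EPUB."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>The well-behaved document</dc:title>"
        "<dc:creator>John W. Miescher</dc:creator>"
        "<dc:language>en</dc:language>"
        "</metadata></package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/epub+zip")
        package.writestr("META-INF/container.xml", container)
        package.writestr("OEBPS/content.opf", opf)
    buffer.seek(0)
    return buffer


print(read_epub_dc(make_sample_epub()))
```

`read_epub_dc` also accepts a path to an EPUB file on disk; writing modified values back would mean rewriting the package document and re-zipping the package, which this sketch does not attempt.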

Formats not supported

Most other document formats are (for the time being) not supported because they are either not popular enough or they require proprietary software for reading (except plain ASCII, ANSI and Unicode text files). Some document formats include hidden metadata which the author may not even be aware of and that he may or may not want to have published. Typical examples of this are MS-Word documents and the output from the other MS-Office packages Excel and PowerPoint. But these can easily be converted to PDF which resolves both the hidden metadata and the public accessibility problems.

Conclusion

The term “Dublin Core Inside” should become the mark of any well-behaved document. Only documents that are truly user friendly and fully compliant with DCMI standards should be allowed to carry and publicly display this mark. Perhaps this term could even be registered as a trademark by the Dublin Core Organization. With appropriate training, instructions and incentives, authors and publishers must be shown how easy it is to produce well-behaved documents. They should be encouraged to structure their work and to include useful metadata in DCMI standard notation from the very beginning. This will ensure that electronic documents intended for a large audience are well-behaved, i.e. both user friendly and library friendly.

References

Dublin Core Metadata Element [Core Metadata Element Set]

DCMI Metadata Terms issued 2008-01-14 by [Usage Board]

For more information please see the Metadata Training Resources page [Open Archives Initiative Protocol for Metadata Harvesting]

EPUB Specifications: Open Publication Structure (OPS) 2.0 v1.0 2007-09-11, downloaded from [Digital Publishing Forum]

Christian Ducharme (2006-05-25). Cours sur le Dublin Core et l’OAI

Dublin Core Metadata Elements - Best Practices, Version 1.1, December, 2005, by Kelley Bachli, Judy Moser and Pat Vince: CCDL Metadata Sub-Task Force

Harrison Ainsworth, HXA (2007-12-28), downloaded from [Format Construction Guide]

XMP-Toolkit-SDK-4.4.2 (2008-06-10) by Adobe ©, downloaded from [XMP Development Center]

HTML5 - A vocabulary and associated APIs for HTML and XHTML, see [Working Draft (2010-03-04)]

Example of a suitable software tool [[2]] (digi-libris manager)