User:Joe carmel/Good Data Practices Related to Government Data

This page is under construction.

Government Data Management Practices represent the ideas, techniques, and tools that promote the quality, accessibility, sharing, and integrity of government-published data and documents on the Internet. For governments, good data practices also support data management and data governance initiatives by providing specific practices to consider when developing government measures. The benefits include improved data quality through information technology controls, along with broader potential reuse of information and web-based data services.

Several government jurisdictions have published guidelines and recommendations related to these practices.


Rather than being innate to specific file formats, data accessibility, reusability, and integrity are produced by following specific practices associated with each format. When good data practices are not used, accessibility, reusability, and integrity are diminished or non-existent. For example:

  • a text file surrounded by a single root element is only technically an XML file;
  • an HTML file without a <title> element is less reusable by search engines than a file with one;
  • an HTML file that is not well-formed cannot be parsed with Document Object Model (DOM) tools (e.g., Perl's XML-DOM module, XQuery).

The practices described in this article provide the primary ideas, techniques, and tools for both creators and users of government data.

Background

1980: DAMA International (the Data Management Association), a non-profit organization of over 120 professionals from various fields, formed its first chapter in Los Angeles.

1995: US Federal government enacts Paperwork Reduction Act of 1995 with the purpose of providing for "the dissemination of public information on a timely basis, on equitable terms, and in a manner that promotes the utility of the information to the public and makes effective use of information technology."

1998: Jeffrey Zeldman co-founded the Web Standards Project with George Olsen and Glenn Davis. The Web Standards Project is a grassroots coalition fighting for standards which ensure simple, affordable access to web technologies for all.

1998: The term 'open content' is first used as an attempt to adapt the logic of 'open source' software to the non-software world of cultural and scientific artifacts like music, literature, and images.[1]

December 2000: US Federal Data Quality Act enacted.

October 2001: The US Federal government's Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies directs agencies to develop information resources management procedures for reviewing and substantiating (by documentation or other means selected by the agency) the quality (including the objectivity, utility, and integrity) of information before it is disseminated.

December 2002: US Federal Government: E-Government Act of 2002 becomes Public Law 107-347.

June 2004: US Federal Government: Recommended Policies and Guidelines for Federal Public Websites, the final report of the Interagency Committee on Government Information, is submitted to the Office of Management and Budget.

February 2005: US National Information Exchange Model (NIEM) launched.[2]

February 2007: European W3C Symposium on eGovernment.

June 2007: W3C's Toward More Transparent Government workshop on eGovernment and the Web.

December 2007: Eight Principles of Open Government Data published by Tim O'Reilly of O'Reilly Media and Carl Malamud of Public.Resource.Org with sponsorship from Sunlight Foundation, Yahoo, and Google.

W3C eGovernment Interest Group formed.

January 2008: SPARQL becomes a W3C Recommendation.[3]

April 2009: DAMA publishes the DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK). This work, the first of its kind, is a compilation of data management principles and best practices for producers of electronic data (government and non-government). It provides data management and IT professionals, executives, knowledge workers, educators, and researchers with a framework to manage their data and mature their information infrastructure.

May 2009: US Federal government launches Data.gov to increase public access to high-value, machine-readable datasets generated by the Executive Branch.

Open Data is Civic Capital: Best Practices for "Open Government Data" published.

June 2009: New York City Council considers Open Data Standards.

September 2009: The W3C eGovernment Interest Group published the working draft of Guidelines for Publishing Open Government Data, which created interest in practices that make data both human- and machine-readable.[4]

October 2009: Australian Government launches Data.Australia.gov.au.

December 2009: US Federal Open Government Directive mandates that each agency create an Open Government Webpage located at www.[agency].gov/open.

HTML

Many state, local, and federal governments rely on the Internet as a primary means of passing data both to other government entities and to members of the public. One of the most common ways of doing this is via the HyperText Markup Language, or HTML. However, there are many often overlooked or unappreciated aspects to using HTML. This section addresses some of the pros and cons, along with recommended practices, for using HTML in a governmental setting. The rest of this section assumes a basic familiarity with HTML and the surrounding concepts.

Advantages and Disadvantages

HTML shines as a method of delivering information from a computer to a person. With the various methods of reading, updating, and publishing HTML, not to mention whole programming languages devoted to its use (e.g., PHP), it is simple to pick up any of a number of tools to produce HTML from most information that government entities already store in electronic format.

In addition to being easy to use, HTML is also considered fairly easy to learn. Given the simple structure of the data and the limited number of tags commonly used in HTML, time to train people to work with it tends to be fairly low, with the basic concepts often learned in just a day. Combine that with the large number of resources available on the Internet, in books, and elsewhere, and you are left with a technology that is widely used, well supported, and easy to find information about.

However, HTML is not the only, nor at times even the best, method of distributing your data. In particular, in cases where information needs to be printed out, HTML is at a severe disadvantage compared to other data formats. For example, when dealing with forms that require a handwritten signature, HTML is not going to be a great tool to use.

In addition, when dealing with computer-to-computer interactions, such as providing bulk data access to information resellers, HTML may work, but it lacks some of the strengths of other formats such as XML. Specifically, HTML does not offer a good method of giving meaning to the data it presents; it simply describes how that data should be displayed.

For example, an HTML document may present a first and last name like so: <strong>John</strong> and <strong>Public</strong>. To an HTML parser, there is no difference between John and Public; they are both just strings in bold text. While a human being could read that information and know that they are being given a first and last name, a computer cannot make that determination from the HTML alone. All a computer could determine is that there are two strings, both supposed to be bold; it would need some other frame of reference to figure out what the information actually means.

While there are accepted methods of adding that frame of reference (see the id attribute section below), one is, in effect, treating HTML as XML at that point. It is often of more benefit to move to XML to gain the flexibility it offers, though not always.
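
As a brief illustration (the element names here are invented for the example), XML lets the markup carry the meaning directly:

  <person>
    <firstName>John</firstName>
    <lastName>Public</lastName>
  </person>

A program reading this file knows which string is the first name and which is the last name, without any outside frame of reference.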

In short, for any data that does not need to be printed out or processed by a computer, HTML is a great way to present your information to both other governmental entities and the public.

Well-formedness

Recommended Practice: HTML files, especially those with reusable data, should be checked and corrected for well-formedness.

A large number of government web pages do not use well-formed HTML/XHTML. There are a number of reasons for this, including web developer errors when writing HTML, authoring tools that do not adhere to well-formedness standards, and the evolution of web browsers. Early web browsers supported only a very simple version of HTML, and the rapid development of proprietary web browsers led to non-standard dialects of HTML and problems with interoperability. Modern web browsers support a combination of standards-based and de facto HTML and XHTML, which should be rendered in the same way by all browsers.[5] Because browsers are so forgiving, many web developers have produced poorly formed HTML. A page may display as expected in one browser but is often interpreted differently by other browsers and authoring tools.

The challenge this creates is that XPath-based tools used for machine readability will likely not be able to parse these files successfully.[6] XPath-based tools provide developers with direct access to sub-document locations regardless of subsequent changes made to the file by the publisher.
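
As a rough sketch of such a check (Python with the lxml library assumed; any strict XML parser behaves the same way):

  from lxml import etree

  def is_well_formed(markup: str) -> bool:
      # Attempt a strict XML parse; any structural error
      # (unclosed tag, bad nesting) raises XMLSyntaxError.
      try:
          etree.fromstring(markup)
          return True
      except etree.XMLSyntaxError:
          return False

  print(is_well_formed("<p>an unclosed paragraph"))    # False
  print(is_well_formed("<p>a closed paragraph</p>"))   # True

A file that fails this check will also fail in XPath-based toolchains, which is why checking before publication matters.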

Technical Resources

  • XML Well Formedness Checker works with HTML files as well.
  • HTML Tidy is a standalone tool for checking and pretty-printing HTML and fixing markup errors. It also offers a means to convert existing HTML content into well-formed XML for delivery as XHTML. There is also limited support for ASP, JSTE, and PHP.
  • W3C Validator Tool is a web form that tests validity given a URL or pasted code. Note that validity is a separate question from well-formedness.
  • 13 Ways to Browser Test and Validate Your Work


Link Management

A dead link (also called a broken link or dangling link) is a link in an HTML page that points to a web page or server that is permanently unavailable. The most common result of a dead link is a 404 error, which indicates that the web server responded, but the specific page could not be found. The browser may also return a DNS error indicating that a web server could not be found at that domain name. A link might also be dead because of some form of blocking such as content filters or firewalls.
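
A rough sketch of an automated check (Python assumed; a real link checker would also follow redirects and rate-limit its requests):

  import urllib.request
  import urllib.error

  def is_dead(url: str) -> bool:
      # An HTTPError covers server responses such as 404; a URLError
      # covers DNS failures and unreachable servers.
      try:
          urllib.request.urlopen(url, timeout=10)
          return False
      except (urllib.error.HTTPError, urllib.error.URLError):
          return True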

Several tools exist to help check for dead links. See Wikipedia article: dead link.

The id attribute

Recommended Practice: HTML files should contain id attributes to provide direct machine-readable access to data values. Update HTML files to meet the data needs of the public.

The id attribute can be used to identify unique elements within HTML files for re-use. An excellent example is provided by the US Census Bureau in relation to their publication of population statistics at http://www.census.gov/main/www/popclock.html. By using the following span elements with ids within the HTML file, the Census Bureau provides software developers with granular access to specific data values:

  • <span id="usclocknum">308,030,370</span>
  • <span id="wclocknum">6,799,715,290</span>
  • <span id="clocktime">14:01 UTC (EST+5) Nov 27, 2009 </span>

Note that combining HTML and CSS, as outlined in the CSS section below, can lead to some confusion, especially for people new to CSS: because styles can be defined for an id as well as a class, be sure to use unique names for your ids and your classes.
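
As a rough sketch, a program can pull any of these values directly by id (Python with the lxml library assumed; the markup is the Census example above, embedded as a string for the illustration):

  from lxml import html

  page = html.fromstring(
      '<html><body>'
      '<span id="usclocknum">308,030,370</span>'
      '<span id="wclocknum">6,799,715,290</span>'
      '</body></html>'
  )
  # XPath addresses the value by its id, no matter where the
  # span sits in the rest of the page.
  us_population = page.xpath('//span[@id="usclocknum"]/text()')[0]
  print(us_population)   # 308,030,370

In practice the page would be fetched from the Census URL above rather than embedded as a string.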

Anchors and Citations

Needs info.

Access Keys

Needs info.

HTML CSS

Cascading Style Sheets, or CSS, provide a method of defining how information is displayed on a web page. More importantly, they let you define those display rules once and apply them across as many pages as you'd like. So, given a several-thousand-page web site, you can change the look and feel (banners, images, etc.) in just one place and have the change immediately propagate across all of your existing pages.

As a general rule of thumb, CSS should be used whenever possible to define basic layout features for your web site. These features include the location, size, and other general properties of your site navigation, banner images, and basic elements that describe page content. Not only does this save you hours of work updating pages after a formatting change, but it also helps keep your data presentation consistent across multiple pages. This, in turn, helps users find and interpret information on your web site.
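
As a small sketch (the file name and rule are invented for the example), a single rule in a shared stylesheet controls every page that links to it:

  /* site.css: one definition shared by the whole site */
  #banner { width: 100%; background: #003366; color: #ffffff; }

Each HTML page then simply references the shared file:

  <link rel="stylesheet" type="text/css" href="site.css">

Changing the banner rule in site.css restyles every page at once.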

One important thing to note when using CSS, however, is that older web browsers, such as Internet Explorer version 6 and earlier, do not always render CSS-based formatting correctly. If special steps are not taken to ensure that style sheet information renders correctly in these older web browsers, some users of your site may not be able to find or access certain sections.

There are a number of web sites with good information on how to alter or update your CSS to work with Internet Explorer 6; see the technical resources below for details.

Technical Resources

RDFa

Recommended Practice: Add metadata using RDFa to HTML files.

RDFa is a collection of attributes and processing rules for extending XHTML to support enhanced metadata. Adding RDFa also improves the information provided to search engines such as Google and can affect search engine results.[7]
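
As a brief sketch (using the Dublin Core vocabulary; the document values are invented), RDFa expresses metadata through attributes on ordinary XHTML elements:

  <div xmlns:dc="http://purl.org/dc/elements/1.1/">
    <h1 property="dc:title">Annual Budget Report</h1>
    <span property="dc:creator">Department of Finance</span>
    <span property="dc:date" content="2009-11-27">November 27, 2009</span>
  </div>

A search engine or RDFa processor reading this page can extract the title, author, and date as structured data rather than plain text.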


Technical Resources

Human Accessibility Issues

One often overlooked area of HTML pages is accessibility, in particular Section 508 compliance. Ensuring that your web pages are accessible to screen readers and other assistive technologies is a federal and/or state requirement for most government-run web sites.[8] However, few sites actually conform to these standards.

There are several small, simple things that can be done to help ensure compliance with the Section 508 rules. Most notable among these is the use of "alt" attributes when posting images. By adding the "alt" attribute, screen readers and other assistive technologies will pick up and display or speak that information in place of the image, providing reference information to the user.
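
For example (the image and its description are invented):

  <img src="budget-chart.png" alt="Bar chart of the 2009 budget: education 40%, transportation 25%, health 20%, other 15%" />

A screen reader speaks the alt text aloud, so the description should carry the same information the image does.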

As well, whenever possible, attempt to present data in a text-only manner in addition to, or instead of, a graphical one. While charts and graphs are certainly nice, the blind and visually impaired will likely not be able to get any meaningful data out of them. Try to ensure that you post text-only versions of chart and graph data, listing, for example, categories and percentages in a table.

There are many more things that you can do to ensure Section 508 compliance in your web sites, far too many to mention in this article. To find out more about accessibility, you can start with the following sites:

eXtensible Markup Language (XML)

XML has become a widely used standard for a variety of machine-to-machine applications. With the flexibility to define both what information is present and how it is structured, it provides an easy-to-use, standard mechanism for transferring data.

Advantages and Disadvantages

When trying to transfer information from one computer to another, XML stands out as a strong, well-supported technology. With hundreds of free and commercial software programs available to create, edit, and distribute XML files through virtually any programming language, it has become a de facto standard for data interchange in many institutions.

XML and HTML are often discussed together, as XHTML is defined as an application of XML. This means that it is often easy to train people familiar with HTML to work with XML data. As well, XML has several supporting technologies, such as XSLT (XSL Transformations), that are well defined and allow one to translate XML files into other formats for quick and easy use.

With tools such as these, one can begin with an XML document and transform that data into a web page, a PDF file, an email, a Word document, or any number of other formats that allow people to access the data as they see fit. However, it is this very flexibility that causes some of the problems you may encounter with XML.

As with any technology that allows for a great deal of freedom, that freedom gives one the leeway to create issues as well as solve them. An improperly developed XML schema or DTD can cause a number of headaches, especially when trying to make changes to the schema down the road.

When changes are made to the structure of an XML document, you may invalidate documents you created earlier. This can mean that days, months, or even years of data are rendered unreadable or unusable by your existing applications if someone is not careful. In turn, this can lead to costly conversion processes to update older files to keep them in line with the latest valid DTD or schema. Fortunately, proper up-front planning can eliminate much of this problem.

Even given the best-designed document, however, using that document can become an issue. While some XML transforms are well documented and well supported, such as XML to PDF and XML to HTML, others are less well defined and may be problematic to implement. For example, while XML can be converted to a format such as RTF or SVG, there are few existing commercial or free software solutions, and the ones that exist tend to be expensive and/or designed for highly specific purposes that may not fit your needs.

Character Encoding

The current standard for most text produced by software today is UTF-8 encoding, as opposed to the older ISO-8859-1 standard. XML requires special handling for characters such as the section sign (commonly used to reference statutes), the double hyphen (disallowed inside comments), and the ampersand (which must be escaped), so text must be run through a character-encoding routine before it can be used. When older files are converted to XML, this becomes twice as critical, because UTF-8 does not encode these characters with the same byte values that ISO-8859-1 did.

It is highly recommended that you translate all of your text-based data to UTF-8, which allows a far wider range of character data to be stored. Where output must remain ASCII-safe, a non-ASCII character can instead be written as a numeric character reference: the prefix &#, followed by the character's numeric code, followed by a semicolon. For example, the section sign becomes &#167;.
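
A rough sketch of both conversions (Python assumed; the legacy bytes are invented for the example):

  # Convert ISO-8859-1 (Latin-1) bytes to UTF-8; 0xA7 is the
  # section sign in ISO-8859-1.
  legacy_bytes = b"See \xa7 101 of the Act"

  text = legacy_bytes.decode("iso-8859-1")   # "See § 101 of the Act"
  utf8_bytes = text.encode("utf-8")          # the section sign becomes the bytes 0xC2 0xA7

  # Alternatively, emit &#NNN; references so the output stays pure ASCII:
  ascii_safe = text.encode("ascii", "xmlcharrefreplace")
  print(ascii_safe.decode("ascii"))          # See &#167; 101 of the Act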

XML CSS

Recommended Practice: To achieve human readability of XML data, governments should consider adding CSS to their published XML files.

Just as with HTML, XML files can be styled with Cascading Style Sheets (CSS) to provide human readability on the web. The underlying XML document is still available by saving the file to a local hard drive or using the View Source option in most browsers. XSLT is considered the more robust and mature technology for transforming XML on the web; its primary advantage over CSS is that it allows direct access to every element and attribute of the XML document at all times through XPath, and it is itself written in XML. CSS, on the other hand, uses a single-pass approach and is not written in an XML format.
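
As a small sketch (file and element names invented), a stylesheet is attached to an XML document with an xml-stylesheet processing instruction:

  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/css" href="bills.css"?>
  <bill>
    <title>An Act Relating to Public Records</title>
  </bill>

  /* bills.css */
  title { display: block; font-size: 150%; font-weight: bold; }

A browser opening the XML file applies the CSS rules, while the raw XML remains available via View Source.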

Technical Resources

XSLT

XSLT, or XSL Transformations, is a method of translating XML data into another format, most commonly HTML or PDF. The basic principle is similar to using CSS as described in the HTML section of this document: write a style sheet that outlines how to format the data in the XML file. Style sheets can then be re-used, allowing you to keep a consistent look and feel across all of your documents, regardless of how you are displaying the information.

Aside from providing a common look and feel, XSLT can allow you to take a single XML document and publish part or all of it in a number of different ways, depending on your needs. For example, an XML document may contain a list of all staff members, including their home and work phone numbers. Suppose we wish to publish a staff roster to the web that excludes home phone numbers, while also producing an internal print copy that includes them. By simply creating a "web" style sheet and a "print" style sheet, you can keep all of your data in a single XML file but publish it in multiple formats.
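
A minimal sketch of the "web" style sheet for that scenario (element names invented):

  <?xml version="1.0"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/staff">
      <html><body><ul>
        <xsl:for-each select="member">
          <li>
            <xsl:value-of select="name"/>:
            <xsl:value-of select="workPhone"/>
            <!-- homePhone is deliberately never selected -->
          </li>
        </xsl:for-each>
      </ul></body></html>
    </xsl:template>
  </xsl:stylesheet>

The "print" style sheet would differ only in selecting homePhone as well.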

It should be noted that these transforms require software that loads the XML and the style sheet and performs the conversion. Some of this software, such as the Apache Project's FOP, is free but limited in scope; other tools are more general-purpose but generally must be purchased, such as Antenna House or RenderX.

Technical Resources

Standard Vocabularies and Schemas

Governments that publish documents and data using XML often reuse schemas and vocabularies from other sources. This practice increases the reuse of common tools for authoring and consuming government data.

Publish XML Schemas

XML schemas (DTDs, XSDs, or RELAX NG) should be published for all .gov XML documents and data. Using the <xsd:documentation> element to provide definitions for each element specified in a schema will not only facilitate understanding of the data but also enable the automated aggregation of data dictionaries/registries, which can be used to facilitate discovery and access.
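
A brief sketch of an annotated schema element (the element and its definition are invented):

  <xsd:element name="billNumber" type="xsd:string">
    <xsd:annotation>
      <xsd:documentation>
        The chamber prefix and number assigned to a bill, e.g. "HB 1001".
      </xsd:documentation>
    </xsd:annotation>
  </xsd:element>

A harvesting tool can read every <xsd:documentation> element in a published schema and assemble a data dictionary automatically.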

Web Services

One popular and useful method for transmitting XML data is the web service. In its most basic (and often most useful) form, a web service is an application running on a web server that, rather than returning a formatted HTML page to the user, simply outputs XML data. While there are a number of different methods of creating web services (SOAP and REST being two of the more popular), the end result is a process that allows computers, rather than people, to collect data from your site.

Provided that you publish information about how your XML is formatted (see Publish XML Schemas above), individuals and institutions that regularly request data can then access it quickly and easily without human intervention. For example, the Washington State Legislature has web services providing listings of bills, resolutions, and amendments, among other things. With a system like this in place, an individual or entity can create a program that pulls the data from one of these web services and loads it into their own information store, such as a database or Excel spreadsheet. From there, the user can sort and search the data however they please.
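
A rough sketch of such a client (Python assumed; the URL and element names are invented, not a real endpoint):

  import urllib.request
  import xml.etree.ElementTree as ET

  # Fetch an XML listing from a hypothetical legislative web service.
  url = "https://example.gov/webservices/bills?biennium=2009-10"
  with urllib.request.urlopen(url) as response:
      tree = ET.parse(response)

  # Walk the returned document and print each bill's number and title.
  for bill in tree.getroot().findall("bill"):
      print(bill.findtext("number"), bill.findtext("title"))

From here the records could just as easily be inserted into a database or written to a spreadsheet.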

By automating distribution of data in a manner such as this, both the agency or entity distributing the data and the person or entity who wants access to the data can save themselves a great deal of time and hassle.

Portable Document Format (PDF)

The Portable Document Format, or PDF, is a staple of many agencies' data publications on the web. As most scanning software will quickly and easily create a PDF, PDFs are often used as a quick and easy method of publishing older information that is not available in electronic format. Combine that with the large number of tools, both free and commercial, that allow people to create and edit PDF files, and it becomes easy to see why their use is so popular.

Advantages and Disadvantages

PDF files can be a great tool for a wide variety of your publishing needs, both on the web and in print. Given their origins in PostScript, a language designed specifically for printing [9], PDF files are a great standard to use for any material you might have that needs to appear on a printed page.

To make things even simpler, virtually all word processing software packages support the ability to "print" a file directly to a PDF rather than to a printer. Free software packages such as PDFCreator extend this to other software such as spreadsheets, presentations, and more. Some office suites, such as OpenOffice, provide built-in PDF creation with no need for additional downloads. Thanks to software such as this, people can use programs they are already familiar with to quickly and easily generate a document that can be distributed and printed in no time.

In addition to being easy to create from standard office software, PDF files are a standard for most scanning software as well. By scanning older documents that may not appear in electronic form, you can standardize your document storage and archival formats using a single technology that is widely supported and easy to access.

The PDF format also supports a number of extensions, such as XMP, PDF/A, and the FTK Toolkit, that make it possible to enrich PDF files with a great deal of useful information. You can create forms for people to fill out, printer-friendly versions of web pages, and many other things, all within the PDF framework.

Like most technologies, however, PDF is not necessarily going to be the perfect solution for all of your needs. While it does provide a number of great benefits, it can be easily misused in ways that can cause issues fairly quickly.

Perhaps the most common pitfall of the PDF stems from one of the advantages listed earlier: scanned documents. When a document is scanned, an image of the document is created. This makes it very difficult for software to pick out the individual parts of that document, such as information filled out by a person on a form. While there are tools to assist with issues like this, even the best of them has a very limited success rate with handwritten information.

In cases where someone may be filling out information, such as a license renewal or permit application form, PDF files do not usually work well for collecting the information. Especially where handwriting is involved, it is generally a good idea to steer clear of scanned PDF files when you are trying to collect information off a page.

In addition, PDF files tend to be difficult to use for machine-to-machine interaction. Similar to HTML, PDF is designed as a presentation standard and, as such, generally does not establish a relationship between the data it contains and what that data means. While you may be able to extract the string "John Public" from a PDF file, unless some very strict conditions are met, it is difficult at best to know that the string refers to a person's name.

In short, PDF files make wonderful presentation and archival solutions, especially when one has the need to print. However, in cases where you need to collect and/or process information, it is generally far simpler to use other technologies.

PDF Titles

Recommended Practice: Review PDF titles before publishing PDF files. Update title as needed. See Metadata recommendations.

Search engines use the PDF title as the heading for hit-list items.[10][11] The PDF title is determined by the user, the authoring application, or the distilling software (usually PostScript to PDF). Users can set the PDF title by editing the Document Information section in the authoring application or in Acrobat itself. When the author doesn't explicitly create the title, the authoring application may assign its own; if the Document Information is not set by the application, Adobe Distiller will automatically assign the filename as the PDF title.

Other Technical Solutions

  • Use tools described in the Metadata section below to update PDF metadata in general.

Metadata

Recommended Practice: Review XMP stored in PDF files before publishing PDF files. Update metadata to meet data quality requirements.

Metadata is stored in PDF files using the Extensible Metadata Platform (XMP) specification as a wrapper for standards-based and locally defined metadata.[12] Only a subset of the XMP in each PDF file can be viewed by users through the Document Information feature in Acrobat Reader or the Acrobat plug-in. Some authoring applications, such as Microsoft Word, automatically populate the metadata in PDF files. This metadata may contain information governments might not want to publish. As described by David Fishel, editor, PDFforLawyers.com:

"There has been much written about the dangers of metadata in law firm documents, and in particular of the potential hazards of the metadata in Microsoft Word documents. The primary reason that the metadata recorded by MS Word can become a problem is because Word records most of its metadata invisibly. We need to be aware that it exists, and to make the invisible visible.
"Word was designed to generate all this metadata in order to make it possible for users to create a document, revert to earlier versions, collaborate, and merge the work of several people into a single file. These are all useful things, and users don't want to be bothered with doing and tracking all of these functions. After all, that's what computers are for. In the context of a business document, the metadata functions of Word make perfect sense."[13]

In addition, one of the requirements of the PDF archival format (PDF/A) is the completion of XMP metadata elements including document title, author, subject, keywords, etc.[14]

Technical Resources
For PostScript files

  • pdfmark [15] can be added to PostScript files to set the metadata before files are distilled.

For PDF files

  • Manually edit the Document Information using Acrobat or other PDF products.
  • pdftk: The PDF Toolkit provides command-line read/write access (using the dump_data option) but does not use the XMP format.
  • Perl's CPAN module PDF::API2 provides read/write access using the XMP format to add Dublin Core and RDF metadata.
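
As an illustrative sketch of scripted metadata edits (Python with the pypdf library assumed; file names and values are invented; note this writes the Document Information dictionary rather than XMP):

  from pypdf import PdfReader, PdfWriter

  writer = PdfWriter()
  writer.append(PdfReader("report.pdf"))       # copy all pages
  writer.add_metadata({
      "/Title":   "2009 Annual Budget Report",
      "/Author":  "Department of Finance",
      "/Subject": "Budget summary",
  })
  with open("report-titled.pdf", "wb") as f:
      writer.write(f)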

When using these tools, governments should consider adding other metadata to the PDF file.

Attached Files

Recommended Practice: When publishing using PDF, governments should consider adding source files, documentation, ADA compliant versions, and other relevant material to provide for longer term data recovery, integrity, and archiving requirements.

Unlike other data formats, PDF files can contain file attachments. This enables governments to provide users with alternative formats of the file's contents, its metadata, other documentation, and ancillary files. This practice could also be used to provide ADA-compliant versions of the document's contents. When used with digital signatures, files that normally do not support authentication can go along for the ride, receiving the same authentication provided to the PDF file itself.
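
A rough sketch of embedding a source file (Python with the pypdf library assumed; file names are invented):

  from pypdf import PdfReader, PdfWriter

  writer = PdfWriter()
  writer.append(PdfReader("report.pdf"))
  # Attach the XML source so it travels, and is signed, with the PDF.
  with open("report-source.xml", "rb") as src:
      writer.add_attachment("report-source.xml", src.read())
  with open("report-with-source.pdf", "wb") as f:
      writer.write(f)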

PDF/A is the PDF archival format used by governments.[16]

Fragment Identifier Access

Recommended Practice: See Named Destinations.

The Adobe plug-in provides the ability to open any PDF file on the web to specified pages and locations using the fragment identifier portion of URIs.[17] In addition, the fragment identifier mechanism for PDF provides the ability to search PDF files from the URI. This feature is automatically available for PDF files on the Internet without preparation of the documents prior to publication.
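
For example (the document URL is hypothetical), open parameters in the fragment identifier select a page, a named destination, or a search when the file opens:

  http://example.gov/docs/budget.pdf#page=12
  http://example.gov/docs/budget.pdf#nameddest=Section5
  http://example.gov/docs/budget.pdf#search="capital projects"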

While XPointer provides similar capabilities for HTML and XML files, XPointer has only been implemented for the Firefox browser via a user-installable plug-in.

Governments that distill PostScript to PDF should consider adding named destinations to their PDF files to identify likely sub-document locations in order to provide human-readable sub-document links.

Named Destinations

Needs more info.

PDF as XML (MARS)

The Mars Project defines an XML-friendly representation for PDF documents called PDFXML.

Human Accessibility Issues

Another risk in the PDF world lies in the realm of Section 508 compliance standards. While many government web sites simply overlook Section 508 compliance, it is a federal requirement and may be a state one as well.[8] Unless specific steps are taken, a typical PDF does not contain the markup necessary to meet Section 508 compliance standards,[18] which in turn may leave you open to lawsuits or other legal liabilities.

A non-accessible PDF will cause issues for screen readers used by the visually impaired, as well as potential navigation issues for individuals with motor skill impairments, to name just a few. While these users are not a majority of the average web site's user base, they are among the individuals protected under the Americans with Disabilities Act, which itself takes a specific stance on IT issues.

Fortunately, this limitation can be worked around. PDF files have a number of accessibility options built in, which the PDF/UA project has been working to expand. However, one should note that accessibility options are not normally enabled by default in most software, so special care needs to be taken when creating a PDF if you want to meet these standards.

In general, accessibility issues are addressed inside a PDF by "tagging". Tagging a PDF simply means adding context to the data inside, in a form that programs such as screen readers can use. Generally, this is done through Adobe's Acrobat suite of products, which includes a PDF editing tool. Few other products, commercial or otherwise, support tagging functionality within PDF files.

Having said that, Microsoft Word, OpenOffice, and InDesign all share the ability to create a tagged PDF that meets accessibility requirements. As such, you may already have the tools at hand to produce accessible PDF files. Check the options of your PDF-generating tools for "accessibility" or "tagging" features to be sure.

For more information on how to check your PDF files to see if they are accessible, you may wish to visit the Web Accessibility Center for more details.

In September 2009, the Australian government launched a project to review the accessibility support of the Portable Document Format (PDF) for use on government websites. Policy advice on the use of PDFs for government websites is due in early 2010.[19]

References

External links