Data quality

From Wikipedia, the free encyclopedia

Data are of high quality if "they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any particular external purpose; e.g., a person's age and birth date may conflict within different parts of the same database. The first two views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept of data quality as it relates to business data processing, although other fields have their own data quality issues as well.

Definitions

This list is taken from the online book "Data Quality: High-impact Strategies".[1] See also the glossary of data quality terms.[2]

  • Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
  • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.[3]
  • The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.[4]
  • The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.[5]
  • Complete, standards-based, consistent, accurate and time-stamped.[6]

History

Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address data so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars compared with manually correcting customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations as low-cost and powerful server technology became available.

Companies with an emphasis on marketing often focus their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found in the enterprise. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization.

While name and address data has a clear standard as defined by local postal authorities, other types of data have few recognized standards. There is a movement in the industry today to standardize certain non-address data. The non-profit group GS1 is among the groups spearheading this movement.

For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.
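
For illustration, the following minimal Python sketch shows the kind of bounds checking and simple outlier detection mentioned above; the column of ages, the allowed range and the z-score threshold are hypothetical examples rather than part of any particular methodology.

    # Minimal sketch of bounds checking and simple outlier detection
    # on a research data column. Values, limits and the 1.5 threshold
    # are illustrative only.
    from statistics import mean, stdev

    def bounds_check(values, low, high):
        """Return the indexes of values that fall outside the allowed range."""
        return [i for i, v in enumerate(values) if not (low <= v <= high)]

    def zscore_outliers(values, threshold=1.5):
        """Flag values more than `threshold` standard deviations from the mean."""
        m, s = mean(values), stdev(values)
        return [i for i, v in enumerate(values) if s and abs(v - m) / s > threshold]

    ages = [34, 29, 41, 250, 38]             # 250 is outside any plausible age range
    print(bounds_check(ages, 0, 120))        # -> [3]
    print(zscore_outliers(ages))             # -> [3]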

Overview

There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework, dubbed "Zero Defect Data" (Hansen, 1991) adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand and Wang, 1996).

A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness and relevance. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognise this as a similar problem to "ilities".

MIT has a Total Data Quality Management program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ). This program grew out of the work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991).

In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the US economy of data quality problems at over US$600 billion per annum (Eckerson, 2002). Incorrect data, which includes invalid and outdated information, can originate from different data sources through data entry, or through data migration and conversion projects.[7]

In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed.[8]

One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year.[9]

In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some[who?] organizations, this data governance function has been established as part of a larger regulatory compliance function, a recognition of the importance of data/information quality to organizations.

Problems with data quality do not arise only from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.

Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.[10]

The market is going some way toward providing data quality assurance. A number of vendors make tools for analysing and repairing poor-quality data in situ, service providers can clean the data on a contract basis, and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of functions for improving data, which may include some or all of the following:

  1. Data profiling - initially assessing the data to understand its quality challenges
  2. Data standardization - a business rules engine that ensures that data conforms to quality rules
  3. Geocoding - for name and address data. Corrects data to US and Worldwide postal standards
  4. Matching or Linking - a way to compare data so that similar but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. It might be able to manage 'householding', or finding links between spouses at the same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record (a minimal matching sketch follows this list).
  5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
  6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean.
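
For illustration, a minimal Python sketch of the matching idea in item 4, using only the standard library; real matching engines use far richer rules (nickname tables, phonetic codes, householding), and the records and the 0.75 threshold here are only illustrative.

    # Minimal sketch of fuzzy matching for duplicate detection.
    # Records and the 0.75 threshold are hypothetical examples.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    records = [
        ("Robert Smith", "12 Oak St"),
        ("Bob Smith", "12 Oak Street"),
        ("Alice Jones", "99 Elm Ave"),
    ]

    # Compare every pair of records on combined name and address similarity.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            name_sim = similarity(records[i][0], records[j][0])
            addr_sim = similarity(records[i][1], records[j][1])
            if (name_sim + addr_sim) / 2 > 0.75:
                print("possible duplicate:", records[i], records[j])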

There are several well-known authors and self-styled experts, with Larry English perhaps the most popular guru. In addition, the International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this field.

ISO 8000 is the international standard for data quality.

Data quality control

Data quality control is the process of controlling the usage of data for an application or a process, based on known quality measurements. This process is usually performed after a data quality assurance (QA) process, which consists of the discovery and correction of data inconsistencies.

The data QA process provides the following information to data quality control (QC):

  • Severity of inconsistency
  • Incompleteness
  • Accuracy
  • Precision
  • Missing / Unknown

The data QC process uses the information from the QA process to decide whether to use the data for analysis or in an application or business process. For example, if a data QC process finds that the data contains too many errors or inconsistencies, it prevents that data from being used for its intended process. The usage of incorrect data can critically impact output; for example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a data QC process provides control over the usage of data and establishes safe information usage.
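
For illustration, the following minimal Python sketch shows such a QC gate, assuming the QA process has already produced the metrics listed above; the metric names and threshold values are hypothetical.

    # Minimal sketch of a data QC gate that consumes QA metrics.
    # Metric names and thresholds are hypothetical examples.
    QC_THRESHOLDS = {
        "severity_of_inconsistency": 0.05,  # at most 5% inconsistent records
        "incompleteness": 0.02,             # at most 2% missing values
        "accuracy": 0.98,                   # at least 98% accurate (higher is better)
    }

    def qc_gate(qa_metrics):
        """Return True if the data may be released to the downstream process."""
        if qa_metrics["severity_of_inconsistency"] > QC_THRESHOLDS["severity_of_inconsistency"]:
            return False
        if qa_metrics["incompleteness"] > QC_THRESHOLDS["incompleteness"]:
            return False
        if qa_metrics["accuracy"] < QC_THRESHOLDS["accuracy"]:
            return False
        return True

    print(qc_gate({"severity_of_inconsistency": 0.01,
                   "incompleteness": 0.0,
                   "accuracy": 0.995}))   # -> True, data can be used
    print(qc_gate({"severity_of_inconsistency": 0.20,
                   "incompleteness": 0.0,
                   "accuracy": 0.90}))    # -> False, data is rejected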

Optimum use of data quality

Data quality (DQ) is a niche area required for the integrity of data management, covering gaps in data issues that other controls miss. It is one of the key functions that aid data governance by monitoring data to find exceptions undiscovered by current data management operations. Data quality checks may be defined at the attribute level to allow full control over remediation steps.

DQ checks and business rules may easily overlap if an organization is not attentive to its DQ scope. Business teams should understand the DQ scope thoroughly in order to avoid overlap. Data quality checks are redundant if business logic covers the same functionality and fulfills the same purpose as DQ. The DQ scope of an organization should be defined in its DQ strategy and well implemented. Some data quality checks may be translated into business rules after repeated instances of exceptions in the past.

Below are a few areas of data flows that may need perennial DQ checks:

Completeness and precision DQ checks on all data may be performed at the point of entry for each mandatory attribute from each source system. Some attribute values are created well after the initial creation of the transaction; in such cases, administering these checks becomes tricky and should be done as soon as the defined event at that attribute's source occurs and the transaction's other core attribute conditions are met.
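
For illustration, a minimal Python sketch of a point-of-entry completeness and precision check; the mandatory attributes, the record layout and the two-decimal precision rule are hypothetical.

    # Minimal sketch of completeness and precision checks at the point of entry.
    # The mandatory attributes and the precision rule are hypothetical.
    MANDATORY = ["customer_id", "order_date", "amount"]

    def check_completeness(record):
        """Return the mandatory attributes that are missing or empty."""
        return [a for a in MANDATORY if record.get(a) in (None, "")]

    def check_precision(record, attr="amount", decimals=2):
        """Return False if the monetary amount has more than `decimals` decimal places."""
        return round(record[attr], decimals) == record[attr]

    rec = {"customer_id": "C-100", "order_date": "2014-08-03", "amount": 19.999}
    print(check_completeness(rec))   # -> []  (all mandatory attributes present)
    print(check_precision(rec))      # -> False (19.999 exceeds two decimal places)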

All data having attributes referring to Reference Data in the organization may be validated against the set of well-defined valid values of Reference Data to discover new or discrepant values through the validity DQ check. Results may be used to update Reference Data administered under Master Data Management (MDM).
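
For illustration, a minimal Python sketch of a validity check against reference data; the reference country codes and the incoming values are hypothetical.

    # Minimal sketch of a validity DQ check against reference data.
    # The reference set of country codes and the incoming values are hypothetical.
    REFERENCE_COUNTRY_CODES = {"US", "CA", "GB", "DE", "FR"}

    incoming = ["US", "DE", "XX", "CA", "U.S."]

    # Values absent from the reference data are candidates either for
    # correction or for addition to the reference data under MDM.
    discrepant = [v for v in incoming if v not in REFERENCE_COUNTRY_CODES]
    print(discrepant)   # -> ['XX', 'U.S.']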

All data sourced from a third party to an organization's internal teams may undergo an accuracy DQ check against the third-party data. These DQ check results are valuable when administered on data that has made multiple hops after its point of entry but before it is authorized or stored for enterprise intelligence.
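
For illustration, a minimal Python sketch of an accuracy check against third-party source data; the record keys and field values are hypothetical.

    # Minimal sketch of an accuracy DQ check against third-party source data.
    # Keys and field values are hypothetical examples.
    third_party = {"SEC-001": {"price": 101.5}, "SEC-002": {"price": 55.0}}
    internal    = {"SEC-001": {"price": 101.5}, "SEC-002": {"price": 54.0}}

    mismatches = [
        key for key, row in internal.items()
        if key in third_party and row["price"] != third_party[key]["price"]
    ]
    print(mismatches)   # -> ['SEC-002'] (value drifted after multiple hops)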

All data columns that refer to master data may be validated with a consistency check. A DQ check administered at the point of entry discovers new data for the MDM process, but a DQ check administered after the point of entry discovers failures (not exceptions) of consistency.
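
For illustration, a minimal Python sketch of a consistency check against master data, distinguishing the point-of-entry case (candidate new master records) from the post-entry case (consistency failures); the master keys and rows are hypothetical.

    # Minimal sketch of a consistency DQ check against master data.
    # Master keys and transaction rows are hypothetical examples.
    MASTER_CUSTOMER_IDS = {"C-100", "C-101", "C-102"}

    def consistency_check(rows, at_point_of_entry):
        unknown = [r for r in rows if r["customer_id"] not in MASTER_CUSTOMER_IDS]
        if at_point_of_entry:
            # New values feed the MDM process as candidate master records.
            return {"new_master_candidates": unknown}
        # After the point of entry the same condition is a consistency failure.
        return {"consistency_failures": unknown}

    rows = [{"customer_id": "C-100"}, {"customer_id": "C-999"}]
    print(consistency_check(rows, at_point_of_entry=True))
    print(consistency_check(rows, at_point_of_entry=False))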

As data is transformed, multiple timestamps and their positions are captured and may be compared against each other, and against their allowed leeway, to validate the data's value, decay and operational significance against a defined SLA (service level agreement). This timeliness DQ check can be used to decrease the data value decay rate and to optimize the policies governing the data movement timeline.
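
For illustration, a minimal Python sketch of a timeliness check against an SLA; the hop names, the timestamps and the four-hour SLA are hypothetical.

    # Minimal sketch of a timeliness DQ check against an SLA.
    # Hop names, timestamps and the 4-hour SLA are hypothetical.
    from datetime import datetime, timedelta

    SLA = timedelta(hours=4)   # maximum allowed delay between source and warehouse

    hops = {
        "source_created": datetime(2014, 8, 3, 2, 0),
        "warehouse_loaded": datetime(2014, 8, 3, 7, 30),
    }

    delay = hops["warehouse_loaded"] - hops["source_created"]
    if delay > SLA:
        print("timeliness breach:", delay - SLA, "beyond SLA")   # -> 1:30:00 beyond SLA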

In an organization, complex logic is usually segregated into simpler logic across multiple processes. Reasonableness DQ checks on such complex logic, which yields a logical result within a specific range of values or static interrelationships (aggregated business rules), may be used to discover complicated but crucial business processes, outliers in the data and its drift from BAU (business as usual) expectations, and may surface possible exceptions that eventually result in data issues. This check may be a simple generic aggregation rule applied to a large chunk of data, or it may be complicated logic on a group of attributes of a transaction pertaining to the organization's core business. This DQ check requires a high degree of business knowledge and acumen. Discovery of reasonableness issues may inform policy and strategy changes by business, data governance or both.
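
For illustration, a minimal Python sketch of a reasonableness check on an aggregated figure against a business-as-usual band; the order amounts and the expected range are hypothetical.

    # Minimal sketch of a reasonableness DQ check on an aggregated figure.
    # The daily order amounts and the expected BAU band are hypothetical.
    EXPECTED_DAILY_TOTAL = (50_000, 150_000)   # business-as-usual range

    daily_orders = [120.0, 480.5, 26_000.0]    # individual order amounts for one day
    total = sum(daily_orders)

    low, high = EXPECTED_DAILY_TOTAL
    if not (low <= total <= high):
        # Total falls below the BAU band, so an exception is raised for analysis.
        print("reasonableness exception: daily total", total, "outside BAU range")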

Conformity checks and integrity checks need not be covered in all business needs; they are strictly at the discretion of the database architecture.

There are many places in the data movement where DQ checks may not be required. For instance, a DQ check for completeness and precision on not-null columns is redundant for data sourced from a database. Similarly, data should be validated for its accuracy with respect to time when it is stitched together across disparate sources. However, that is a business rule and should not be in the DQ scope.

Data Quality Assurance

Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies in the data, as well as performing data cleansing activities (e.g. removing outliers, missing-data interpolation) to improve the data quality.
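
For illustration, a minimal Python sketch of two of the cleansing activities mentioned above (treating out-of-range outliers as missing and linearly interpolating the gaps), using only the standard library; the sensor readings and the cut-off value are hypothetical.

    # Minimal sketch of two cleansing activities: dropping out-of-range outliers
    # and linearly interpolating missing values. Readings and the 100 cut-off
    # are hypothetical; the sketch assumes the first and last readings are present.
    def interpolate_missing(series):
        """Fill None gaps by linear interpolation between neighbouring values."""
        filled = list(series)
        for i, v in enumerate(filled):
            if v is None:
                prev_i = max(j for j in range(i) if filled[j] is not None)
                next_i = min(j for j in range(i + 1, len(filled)) if filled[j] is not None)
                frac = (i - prev_i) / (next_i - prev_i)
                filled[i] = filled[prev_i] + frac * (filled[next_i] - filled[prev_i])
        return filled

    readings = [10.0, None, 14.0, 400.0, 16.0]          # 400.0 is an implausible outlier
    readings = [v if v is None or v <= 100 else None    # treat outliers as missing
                for v in readings]
    print(interpolate_missing(readings))                # -> [10.0, 12.0, 14.0, 15.0, 16.0]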

These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of application software.

Criticism of existing tools and processes

The main criticisms cited are:

  • Project costs: costs are typically in the hundreds of thousands of dollars
  • Time: lack of enough time to deal with large-scale data-cleansing software
  • Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems

Professional associations

International Association for Information and Data Quality (IAIDQ)

See also

References

  1. ^ "Data Quality: High-impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors". Retrieved 5 February 2013.
  2. ^ Glossary of data quality terms published by IAIDQ
  3. ^ Government of British Columbia
  4. ^ REFERENCE-QUALITY WATER SAMPLE DATA: NOTES ON ACQUISITION, RECORD KEEPING, AND EVALUATION
  5. ^ istabg.org Data QualYtI – Do You Trust Your Data?
  6. ^ GS1.ORG dqf
  7. ^ http://www.information-management.com/issues/20060801/1060128-1.html
  8. ^ http://www.directionsmag.com/article.php?article_id=509
  9. ^ http://ribbs.usps.gov/move_update/documents/tech_guides/PUB363.pdf
  10. ^ E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.

  11. Tamraparni Dasu and Theodore Johnson. 2003. Exploratory Data Mining and Data Cleaning (1 ed.). John Wiley & Sons, Inc., New York, NY, USA.

Further reading

  • Baamann, Katharina, "Data Quality Aspects of Revenue Assurance", Article
  • Eckerson, W. (2002) "Data Warehousing Special Report: Data quality and the bottom line", Article
  • Ivanov, K. (1972) "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems". The University of Stockholm and The Royal Institute of Technology. Doctoral dissertation.
  • Hansen, M. (1991) Zero Defect Data, MIT. Masters thesis [1]
  • Kahn, B., Strong, D., Wang, R. (2002) "Information Quality Benchmarks: Product and Service Performance," Communications of the ACM, April 2002. pp. 184–192. Article
  • Price, R. and Shanks, G. (2004) A Semiotic Information Quality Framework, Proc. IFIP International Conference on Decision Support Systems (DSS2004): Decision Support in an Uncertain and Complex World, Prato. Article
  • Redman, T. C. (2004) Data: An Unfolding Quality Disaster Article
  • Wand, Y. and Wang, R. (1996) “Anchoring Data Quality Dimensions in Ontological Foundations,” Communications of the ACM, November 1996. pp. 86–95. Article
  • Wang, R., Kon, H. & Madnick, S. (1993), Data Quality Requirements Analysis and Modelling, Ninth International Conference of Data Engineering, Vienna, Austria. Article
  • Fournel Michel, Accroitre la qualité et la valeur des données de vos clients, éditions Publibook, 2007. ISBN 978-2-7483-3847-8.
  • Daniel F., Casati F., Palpanas T., Chayka O., Cappiello C. (2008) "Enabling Better Decisions through Quality-aware Reports", International Conference on Information Quality (ICIQ), MIT. Article
  • Jack E. Olson (2003), “Data Quality: The Accuracy dimension”, Morgan Kaufmann Publishers