Data classification (data management)

In data management, data classification, as part of the Information Lifecycle Management (ILM) process, is a tool for categorizing data so that an organization can effectively answer the following questions:

What data types are available?
Where are certain data located?
What access levels are implemented?
What protection level is implemented and does it adhere to compliance regulations?

When implemented, it provides a bridge between IT professionals and process or application owners. IT staff are informed about the value of the data, while management (usually application owners) gains a better understanding of which segments of the data centre need investment to keep operations running effectively. This can be of particular importance in risk management, legal discovery, and compliance with government regulations. Data classification is typically a manual process; however, there are many tools from different vendors that can help gather information about the data.

How to start the process of data classification

Note that this classification structure is written from a data management perspective and therefore focuses on text and text-convertible binary data sources. Images, videos, and audio files are highly structured formats built for industry-standard APIs and do not readily fit within the classification scheme outlined below.

The first step is to evaluate the various applications and data and divide them into their respective categories, as follows:

Relational or tabular data (around 15% of non-audio/video data)
- Generally describes proprietary data that is accessible only through an application or through application programming interfaces (APIs).
- Applications that produce structured data are usually database applications.
- This type of data usually involves complex procedures for data evaluation and for migration between storage tiers.
- To ensure adequate quality standards, the classification process has to be monitored by subject matter experts.
Semi-structured or poly-structured data (all other non-audio/video data that does not conform to a system-defined or platform-defined relational or tabular form)
- Generally describes data files that have a dynamic or non-relational semantic structure (e.g. documents, XML, JSON, device or system log output, sensor output).
- Data classification is a relatively simple process of criteria assignment.
- Data migration between assigned segments of predefined storage tiers is also a simple process.
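
As a rough illustration of this first step, the sketch below sorts file names into the two categories. The extension sets and the function name `categorize` are hypothetical choices made for this example; a real deployment would rely on application knowledge rather than on file extensions alone.

```python
from pathlib import PurePath

# Hypothetical extension sets, chosen only for illustration.
RELATIONAL_EXTENSIONS = {".db", ".sqlite", ".mdf", ".dbf"}
MEDIA_EXTENSIONS = {".jpg", ".png", ".mp3", ".wav", ".mp4"}


def categorize(path: str) -> str:
    """Return the category of a file based on its extension."""
    suffix = PurePath(path).suffix.lower()
    if suffix in MEDIA_EXTENSIONS:
        # Images, audio, and video fall outside this classification scheme.
        return "out of scope"
    if suffix in RELATIONAL_EXTENSIONS:
        return "relational or tabular"
    # Everything else is treated as semi-structured or poly-structured.
    return "semi-structured or poly-structured"


for name in ["orders.sqlite", "events.json", "server.log", "intro.mp4"]:
    print(name, "->", categorize(name))
```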

Types of data classification

Note that this designation is entirely orthogonal to the application-centric designation outlined above. Regardless of the structure inherited from the application, data may be of the types below:

1. Geographical: according to area (for example, the rice production of a state or country).
2. Chronological: according to time (for example, sales over the last three months).
3. Qualitative: according to distinct categories (for example, a population divided into poor and rich).
4. Quantitative: according to magnitude, which may be (a) discrete or (b) continuous.
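
The four types can be illustrated by grouping the same records along different keys. The following is a minimal sketch; the sample records, field names, and the class-interval width of 100 are invented for this example and are not taken from any real data set.

```python
from collections import defaultdict
from datetime import date

# Invented sample records: rice production per state, harvest date, and grade.
records = [
    {"state": "State A", "harvested": date(2024, 1, 15), "grade": "premium", "tonnes": 120.0},
    {"state": "State B", "harvested": date(2024, 1, 28), "grade": "standard", "tonnes": 80.0},
    {"state": "State A", "harvested": date(2024, 2, 3), "grade": "standard", "tonnes": 240.0},
]


def group_by(items, key_fn):
    """Group items into a dict keyed by key_fn(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


# Geographical: according to area.
geographical = group_by(records, lambda r: r["state"])
# Chronological: according to time (year and month of harvest).
chronological = group_by(records, lambda r: (r["harvested"].year, r["harvested"].month))
# Qualitative: according to distinct categories.
qualitative = group_by(records, lambda r: r["grade"])
# Quantitative: a continuous magnitude binned into class intervals of width 100.
quantitative = group_by(records, lambda r: int(r["tonnes"] // 100) * 100)

print(sorted(geographical))
print(sorted(chronological))
print(sorted(qualitative))
print(sorted(quantitative))
```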

Basic criteria for semi-structured or poly-structured data classification

Time criteria are the simplest and most commonly used: different types of data are evaluated by time of creation, time of access, time of update, and so on.
Metadata criteria, such as type, name, owner, and location, can be used to create a more advanced classification policy.
Content criteria, which involve advanced content classification algorithms, are the most advanced form of unstructured data classification.

Note that any of these criteria may also apply to tabular or relational data as "Basic Criteria". These criteria are application-specific rather than inherent aspects of the form in which the data is presented.
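
The time, metadata, and content criteria can be combined into a single policy. The sketch below is one possible arrangement, not a standard algorithm: the `FileRecord` type, the tier labels, the age thresholds, and the simple 16-digit pattern standing in for a content classifier are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical content rule: a 16-digit number in groups of four.
CARD_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


@dataclass
class FileRecord:
    name: str
    owner: str
    last_access: datetime
    text: str = ""


def classify(record: FileRecord, now: datetime) -> str:
    """Return a hypothetical storage tier label for one file."""
    # Content criterion: sensitive-looking content overrides everything else.
    if CARD_PATTERN.search(record.text):
        return "restricted"
    # Metadata criterion: log files go straight to the archive tier.
    if record.name.endswith(".log"):
        return "archive"
    # Time criterion: recently accessed data stays on faster storage.
    age = now - record.last_access
    if age <= timedelta(days=30):
        return "tier1"
    if age <= timedelta(days=365):
        return "tier2"
    return "archive"


now = datetime(2024, 1, 1)
report = FileRecord("report.docx", "alice", datetime(2023, 12, 20))
print(classify(report, now))
```

Checking the content criterion first reflects an assumption that sensitivity outranks age and file type; a different policy could order the checks differently.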

Basic criteria for relational or tabular data classification

These criteria are usually initiated by application requirements such as:

Disaster recovery and business continuity rules
Data centre resource optimization and consolidation
Hardware performance limitations and possible improvements by reorganization

Note that any of these criteria may also apply to semi-structured or poly-structured data as "Basic Criteria". These criteria are application-specific rather than inherent aspects of the form in which the data is presented.
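
Application requirements such as these can be turned into a tier assignment rule. The following sketch assumes two hypothetical inputs per application, a recovery time objective in hours and a required I/O rate; the thresholds and tier names are placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    rto_hours: float    # recovery time objective from the disaster recovery rules
    iops_required: int  # performance requirement of the database workload


def assign_tier(app: Application) -> str:
    """Pick a storage tier from hypothetical recovery and performance thresholds."""
    # Strict recovery objectives or heavy I/O demand the fastest tier.
    if app.rto_hours <= 1 or app.iops_required >= 10_000:
        return "tier1"
    # Applications that must be restored within a day go to the middle tier.
    if app.rto_hours <= 24:
        return "tier2"
    # Everything else can be consolidated on the cheapest tier.
    return "tier3"


for app in [
    Application("billing", rto_hours=0.5, iops_required=2_000),
    Application("reporting", rto_hours=12, iops_required=500),
    Application("history", rto_hours=72, iops_required=100),
]:
    print(app.name, "->", assign_tier(app))
```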

Benefits of data classification

Effective implementation of appropriate data classification can significantly improve the ILM process and save data centre storage resources. If implemented systematically, it can generate improvements in data centre performance and utilization. Data classification can also reduce costs and administration overhead. "Good enough" data classification can produce these results:

Data compliance and easier risk management. Data are located where expected, on a predefined storage tier and at a given "point in time".
Simplification of data encryption, because not all data need to be encrypted. This saves valuable processor cycles and the related overhead.
Data indexing to improve user access times.
Data protection is redefined so that the RTO (Recovery Time Objective) is improved.

