Tidy data


Tidy data is an alternative name for the common statistical form called a model matrix or data matrix. A data matrix is defined in [1] as follows:

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
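The indexing convention in this definition can be illustrated with a minimal Python sketch. The measurements, the variable names, and the helper functions `entry` and `column` are hypothetical and chosen only for illustration; indices are 0-based, as is usual in Python.

```python
# Hypothetical data: three individuals (rows) measured on two variates (columns).
variables = ["height_cm", "weight_kg"]
data_matrix = [
    [170.0, 65.0],  # individual 0
    [182.0, 80.5],  # individual 1
    [158.5, 52.0],  # individual 2
]


def entry(matrix, i, j):
    """Return the value of the j-th variate observed on the i-th individual."""
    return matrix[i][j]


def column(matrix, j):
    """Return every observed value of the j-th variate, one per individual."""
    return [row[j] for row in matrix]


print(entry(data_matrix, 1, 0))  # height of individual 1
print(column(data_matrix, 1))    # all weights
```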

Hadley Wickham later defined "tidy data" as data sets arranged so that each variable is a column and each observation (or case) is a row.[2] The original definition added a per-table condition (each type of observational unit forms a table), which Wickham described as Codd's third normal form framed in statistical language.
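As a rough illustration of this arrangement, the following Python sketch rearranges a small table whose column headers are values of a variable rather than variables. The data and the helper `tidy` are hypothetical; this is one possible way to perform the rearrangement, not a method taken from the cited paper.

```python
# Hypothetical untidy table: the headers "treatment_a" and "treatment_b"
# are values of a variable (treatment), not variables themselves.
untidy = [
    {"person": "P1", "treatment_a": 4, "treatment_b": 2},
    {"person": "P2", "treatment_a": 16, "treatment_b": 11},
]


def tidy(rows, id_key, var_name, value_name):
    """Emit one row per observation: each non-id column becomes its own row."""
    out = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            out.append({id_key: row[id_key], var_name: key, value_name: value})
    return out


tidy_rows = tidy(untidy, "person", "treatment", "result")
for r in tidy_rows:
    print(r)
```

After the rearrangement every row holds one observation (a person, a treatment and its result) and every column holds one variable.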

Data arrangement is an important consideration in data processing, but it should not be confused with data cleansing, which is a separate and equally important task.

Other relevant formulations include denormalization prior to machine learning modeling (informally, moving data to a "wide" form in which all available measurements for an instance sit in a single row) and the use of semantic triples as an intermediate representation (informally, a "tall" or "long" form in which the measurements for a single instance are spread across many rows).
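The relationship between the two forms can be sketched in Python as follows. The triples, the `subject` column name, and the helpers `to_wide` and `to_long` are hypothetical, and filling missing measurements with `None` is an assumption of this sketch rather than part of either formulation.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object): one measurement per row.
triples = [
    ("patient1", "age", 34),
    ("patient1", "pulse", 72),
    ("patient2", "age", 51),
    ("patient2", "pulse", 80),
    ("patient2", "weight_kg", 77.0),
]


def to_wide(rows):
    """Collect every measurement about a subject into a single row."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in rows:
        by_subject[subject][predicate] = obj
    columns = sorted({predicate for _, predicate, _ in rows})
    # Missing measurements become None so that all rows share the same columns.
    return [
        {"subject": subject, **{c: values.get(c) for c in columns}}
        for subject, values in by_subject.items()
    ]


def to_long(wide_rows):
    """Spread each wide row back into one triple per non-missing measurement."""
    return [
        (row["subject"], key, value)
        for row in wide_rows
        for key, value in row.items()
        if key != "subject" and value is not None
    ]


wide = to_wide(triples)
for row in wide:
    print(row)
```

In this sketch the wide form has one row per subject and the long form has one row per measurement; converting to wide and back recovers the original set of triples.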

References

  1. Krzanowski, W. J.; Marriott, F. H. C. (1994). Multivariate Analysis, Part 1. Edward Arnold.
  2. Wickham, Hadley (20 February 2013). "Tidy Data". Journal of Statistical Software.