Tidy data

From Wikipedia, the free encyclopedia

Tidy data is data that results from a process called data tidying. Data tidying is one of the important cleaning steps in big data processing and a recognized step in the practice of data science. Tidy data sets have a consistent structure, which makes them easy to manipulate, model and visualize. The main idea is to arrange the data so that each variable is a column and each observation (or case) is a row.[1][2]

Tidy data provides a standard set of concepts for data cleaning, so there is no need to start from scratch and invent new cleaning methods each time.
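
The arrangement described above, with one column per variable and one row per observation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the column names and the helper `tidy` are hypothetical and are not taken from the cited sources.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
# The treatment is itself a variable, so it is stored in column names here.
wide = [
    {"patient": "A", "treatment_a": 12, "treatment_b": 7},
    {"patient": "B", "treatment_a": 9, "treatment_b": 15},
]


def tidy(rows, id_column, variable_name, value_name):
    """Return one row per (id, variable) pair.

    Every column other than id_column is treated as a level of a single
    variable, so its name moves into variable_name and its cell into value_name.
    """
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy_rows = tidy(wide, "patient", "treatment", "result")
for row in tidy_rows:
    print(row)
```

In the result, "patient", "treatment" and "result" are each a column, and each of the four measurements occupies its own row.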

Characteristics

In his book The Elements of Data Analytic Style, Jeff Leek summarizes the characteristics of tidy data in the following points:[3]

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.
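
Points 3 and 4 can be illustrated with a minimal Python sketch: two tables, one per kind of variable, that share a linking column. The tables, the column name `patient_id` and the helper `link` are hypothetical, chosen only for this example, and the join assumes the key is unique in the first table.

```python
# Hypothetical table of per-patient variables (one row per patient).
patients = [
    {"patient_id": 1, "name": "Ada", "age": 36},
    {"patient_id": 2, "name": "Ben", "age": 41},
]

# Hypothetical table of per-visit variables (one row per visit).
visits = [
    {"patient_id": 1, "visit_date": "2015-01-10", "weight_kg": 61.0},
    {"patient_id": 1, "visit_date": "2015-02-14", "weight_kg": 60.2},
    {"patient_id": 2, "visit_date": "2015-01-12", "weight_kg": 82.5},
]


def link(left, right, key):
    """Join two tables on a shared column (inner join).

    Assumes key is unique in left; rows of right without a match are dropped.
    """
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]


linked = link(patients, visits, "patient_id")
for row in linked:
    print(row)
```

Keeping the patient attributes in their own table avoids repeating them for every visit, while the shared column still allows the two tables to be combined when needed.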

References

  1. Wickham, Hadley (20 February 2013). "Tidy Data" (PDF). Journal of Statistical Software.
  2. Wickham, Hadley. "Tidy Data" (PDF). Journal of Statistical Software.
  3. Leek, Jeff (2 March 2015). The Elements of Data Analytic Style. Leanpub.