Tidy data


Tidy data is data that has been put through a process called data tidying. Tidying is an important cleaning step in big data processing and a recognized step in the practice of data science. Tidy data sets have a consistent structure that makes them easy to manipulate, model and visualize: each variable is a column and each observation (or case) is a row.[1][2]

Tidy data provides a standard way to organize data and a shared set of concepts for data cleaning, so analysts do not need to start from scratch and invent new cleaning methods for each data set.
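
The arrangement of one column per variable and one row per observation can be illustrated with a minimal sketch in plain Python. The table contents and the helper name `tidy` are hypothetical, chosen only for illustration; they do not come from the cited sources.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
# The column names themselves carry values of a variable ("treatment").
wide = [
    {"patient": "A", "treatment_a": 5, "treatment_b": 7},
    {"patient": "B", "treatment_a": 3, "treatment_b": 9},
]


def tidy(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own row, one observation per row."""
    out = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on each output row
            out.append({id_key: row[id_key], var_name: column, value_name: value})
    return out


tidy_rows = tidy(wide, "patient", "treatment", "result")
for r in tidy_rows:
    print(r)
```

In the tidy result every row is a single observation (one patient under one treatment), and `patient`, `treatment` and `result` are each a column of their own.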


Jeff Leek, in his book The Elements of Data Analytic Style, summarizes the characteristics of tidy data in the following points:[3]

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.
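
Points 3 and 4 above can be sketched as two small tables that share a linking column. This is an illustrative example only: the tables, the column name `patient_id` and all values are assumptions, not taken from Leek's book.

```python
# Hypothetical tables, one per "kind" of variable.
# patients: one row per patient (attributes that do not change per visit).
patients = [
    {"patient_id": 1, "name": "A", "age": 34},
    {"patient_id": 2, "name": "B", "age": 51},
]
# measurements: one row per visit, linked to patients through patient_id.
measurements = [
    {"patient_id": 1, "visit": 1, "weight_kg": 70.5},
    {"patient_id": 1, "visit": 2, "weight_kg": 69.8},
    {"patient_id": 2, "visit": 1, "weight_kg": 82.0},
]

# Index the patient table by the linking column, then attach the patient
# attributes to each measurement row.
by_id = {p["patient_id"]: p for p in patients}
joined = [{**by_id[m["patient_id"]], **m} for m in measurements]

for row in joined:
    print(row)
```

Keeping patient attributes in their own table avoids repeating them on every measurement row, while the shared `patient_id` column still allows the two tables to be combined when needed.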


  1. Wickham, Hadley (20 February 2013). "Tidy Data" (PDF). Journal of Statistical Software.
  2. Wickham, Hadley. "Tidy Data" (PDF). Journal of Statistical Software.
  3. Leek, Jeff (2 March 2015). The Elements of Data Analytic Style. Leanpub.