
Exploratory data analysis

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Farcaster (talk | contribs) at 14:59, 16 September 2014 (EDA development). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA differs from initial data analysis (IDA),[1] which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. EDA encompasses IDA.

Overview

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data (the two extremes, that is, the maximum and minimum; the median; and the quartiles) because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation. Moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than the traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
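As a rough illustration of the summaries and resampling ideas mentioned above, the following Python sketch computes a five-number summary and a jackknife standard error. The function names and the sample values are hypothetical and chosen only for illustration. The quartile rule used here (the standard library's "inclusive" interpolation) is one convention among several and is not necessarily the one Tukey used.

```python
import math
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # Assumption: "inclusive" interpolation between order statistics.
    # Other quartile conventions can give slightly different values.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def jackknife_standard_error(values, estimator):
    """Leave-one-out (jackknife) standard error of estimator(values)."""
    data = list(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the estimate n times, each time dropping one observation.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


# Hypothetical sample, and the same sample with one wild value.
clean = [2, 3, 3, 4, 5, 5, 6, 7, 9]
contaminated = [2, 3, 3, 4, 5, 5, 6, 7, 90]

print(five_number_summary(clean))         # (2, 3.0, 5.0, 6.0, 9)
print(five_number_summary(contaminated))  # (2, 3.0, 5.0, 6.0, 90)
print(statistics.mean(clean), statistics.mean(contaminated))
print(jackknife_standard_error(clean, statistics.mean))
```

In this sample the single extreme value changes only the maximum. The median and quartiles are unchanged, while the mean shifts noticeably. For the sample mean, the jackknife standard error coincides with the familiar s/√n, which makes a convenient sanity check.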

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.[2]

EDA development

Data science process flowchart

John W. Tukey wrote the book "Exploratory Data Analysis" in 1977.[3] Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
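The bias from testing a hypothesis on the same data that suggested it can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from Tukey's book. It assumes a response and a set of candidate predictors that are all independent noise, and the names `simulate`, `noise` and `pearson_r` are hypothetical. In each trial the predictor most strongly correlated with the response is selected, and its apparent strength on the selecting data is compared with the strength of a noise predictor on fresh data.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def noise(rng, n):
    """Draw n independent standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def simulate(trials=100, n_obs=30, n_predictors=50, seed=0):
    """Average |r| of the 'best' noise predictor, on the same vs. fresh data."""
    rng = random.Random(seed)
    same_data, fresh_data = [], []
    for _ in range(trials):
        y = noise(rng, n_obs)
        # Exploratory step: keep the strongest of many unrelated predictors.
        strongest = max(
            abs(pearson_r(noise(rng, n_obs), y)) for _ in range(n_predictors)
        )
        same_data.append(strongest)
        # Confirmatory step: every predictor is pure noise, so a fresh
        # independent draw stands in for new data on the selected one.
        fresh_data.append(abs(pearson_r(noise(rng, n_obs), noise(rng, n_obs))))
    return sum(same_data) / trials, sum(fresh_data) / trials


mean_same, mean_fresh = simulate()
print(f"mean |r| on the data that suggested the hypothesis: {mean_same:.2f}")
print(f"mean |r| on fresh data: {mean_fresh:.2f}")
```

Although no real relationship exists, the selected correlation looks substantial on the data used to select it and shrinks on fresh data. This is the kind of systematic bias described above.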

The objectives of EDA are to:

Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.[4]

Techniques

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.[5]

Typical graphical techniques used in EDA are:

Typical quantitative techniques are:

History

Many EDA ideas can be traced back to earlier authors, for example:

The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

Example

Findings from EDA are often orthogonal to the primary analysis task. The following example is described in more detail by Cook and Swayne.[6] The analysis task is to find the variables that best predict the tip that a dining party will give to the waiter. The variables available are tip, total bill, gender, smoking status, time of day, day of the week and size of the party. The analysis task requires fitting a regression model with either tip or tip rate as the response variable. The fitted model is

tip rate = 0.18 - 0.01×size

which says that as the size of the dining party increases by one person, the tip rate decreases by 0.01, that is, by one percentage point. Making plots of the data reveals other interesting features not described by this model.
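The arithmetic of the fitted model can be checked with a few lines of Python. This is a minimal sketch: the two coefficients come from the equation above, while the function names, the range of party sizes and the example bill are assumptions made for illustration.

```python
INTERCEPT = 0.18  # intercept of the fitted model quoted in the text
SLOPE = -0.01     # change in tip rate per additional diner, from the text


def predicted_tip_rate(party_size):
    """Tip rate predicted by the fitted linear model."""
    return INTERCEPT + SLOPE * party_size


def predicted_tip(total_bill, party_size):
    """Predicted tip amount for a given bill (hypothetical helper)."""
    return total_bill * predicted_tip_rate(party_size)


# Illustrative party sizes; the range 1 to 6 is an assumption.
for size in range(1, 7):
    print(f"party of {size}: predicted tip rate {predicted_tip_rate(size):.2f}")

# Hypothetical bill of 50.00 for a party of four.
print(f"predicted tip: {predicted_tip(50.0, 4):.2f}")
```

Each additional diner lowers the predicted rate by the same fixed amount. This is all the model can express, and it is why the plots can reveal features that it misses.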

What is learned from the graphics is different from what could be learned from the modeling. The pictures help the data tell a story: they reveal features of tipping that were perhaps not anticipated in advance.

Software

  • R, an open source programming language and software environment for statistical computing and graphics.
  • GGobi, free software for interactive data visualization.
  • OpenSHAPA (modern open source successor to MacSHAPA), which permits analysis of various media files (e.g. video, sound).
  • CMU-DAP (Carnegie-Mellon University Data Analysis Package), FORTRAN source for EDA tools with English-style command syntax, 1977.
  • Data Applied, a comprehensive web-based data visualization and data mining environment.
  • Fathom (for high-school and intro college courses).
  • High-D for multivariate analysis using parallel coordinates.
  • JMP, an EDA package from SAS Institute.
  • QUADRIGRAM, a toolkit for exploring, analyzing and visualizing data based on visual programming.
  • KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
  • Orange, an open-source data mining software suite.
  • PanXpan, a platform on online data analysis modules.
  • SAS Visual Analytics, also from the SAS Institute, includes a web-based EDA application called SAS Visual Analytics Explorer (VAE).
  • SOCR provides a large number of free Internet-accessible.
  • TinkerPlots (for upper elementary and middle school students).
  • Tanagra, open source data mining software for academic and research purposes that includes exploratory data analysis.
  • VisuMap for interactive exploration of high-dimensional multivariate data.
  • Weka, an open source data mining package that includes visualisation and EDA tools such as targeted projection pursuit.
  • curios.IT for interactive 3D exploration of high-dimensional business data.
  • dotplot designer, data analysis software with data visualization features, for both academic and business purposes.

See also

References

  1. ^ Chatfield, C. (1995). Problem Solving: A Statistician's Guide (2nd ed.). Chapman and Hall. ISBN 0412606305.
  2. ^ "Conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler". Statistical Science. 15 (1): 79–94. 2000. doi:10.1214/ss/1009212675.
  3. ^ Tukey, John W. (1977). Exploratory Data Analysis. Pearson. ISBN 978-0201076165.
  4. ^ Konold, C. (1999). "Statistics goes to school". Contemporary Psychology. 44 (1): 81–82. doi:10.1037/001949.
  5. ^ Tukey, John W. (1980). "We need both exploratory and confirmatory". The American Statistician. 34 (1): 23–25. doi:10.1080/00031305.1980.10482706.
  6. ^ Cook, D.; Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 978-0387717616.

Bibliography

  • Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer. ISBN 3-540-25994-5
  • Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer. ISBN 9780387717616.
  • Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 0-471-09776-4.
  • Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 0-471-09777-2.
  • Leinhardt, G.; Leinhardt, S. (1980). Exploratory Data Analysis: New Tools for the Analysis of Empirical Data. Review of Research in Education, Vol. 8, pp. 85–157.
  • Martinez, W. L., Martinez, A. R., and Solka, J. (2010). Exploratory Data Analysis with MATLAB, second edition. Chapman & Hall/CRC. ISBN 9781439812204.
  • Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, Boca Raton, FL, ISBN 978-1-58488-594-8
  • Tucker, L; MacCallum, R. (1993). Exploratory Factor Analysis.
  • Tukey, John Wilder (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 0-201-07616-0.
  • Young, F. W.; Valero-Mora, P.; Friendly, M. (2006). Visual Statistics: Seeing your data with Dynamic Interactive Graphics. Wiley. ISBN 978-0-471-68160-1.