Data science

From Wikipedia, the free encyclopedia
Not to be confused with information science.

Data science, also known as data-driven science, is an interdisciplinary field concerned with scientific processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured.[1][2] It continues several data analysis fields, such as statistics, machine learning, data mining, and predictive analytics,[3] and is similar to Knowledge Discovery in Databases (KDD).

Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.[4][5]


Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, operations research,[6] information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to big data, and big data technologies often focus on organizing and preprocessing the data rather than on analysis. The development of machine learning has enhanced the growth and importance of data science.

Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines, and the digital economy, as well as the biological sciences, medical informatics, health care, the social sciences, and the humanities. It heavily influences economics, business, and finance. From a business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses activities such as data mining and data analysis.[7]

Data scientist

Data scientists use their data and analytical skills to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure the consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate their insights and findings. They are often expected to produce answers in days rather than months, to work by exploratory analysis and rapid iteration, and to present results with dashboards (displays of current values) rather than the papers and reports that statisticians normally produce.[8]
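
As a rough illustration of three of the tasks listed above (merging data sources, checking consistency, and building a simple mathematical model), the following minimal Python sketch uses only the standard library. The two small datasets, the field meanings, and the helper names merge_sources, unmatched_keys, and fit_line are hypothetical and chosen purely for illustration; real work would typically involve far larger data and dedicated tooling.

```python
from statistics import mean

# Hypothetical records from two separate sources, keyed by customer id.
visits = {1: 4, 2: 10, 3: 6, 4: 8}            # id -> number of site visits
spend = {1: 20.0, 2: 50.0, 3: 30.0, 5: 99.0}  # id -> amount spent


def merge_sources(left, right):
    """Merge two sources by keeping only the ids present in both."""
    shared = sorted(left.keys() & right.keys())
    return [(key, left[key], right[key]) for key in shared]


def unmatched_keys(left, right):
    """Consistency check: ids that appear in only one of the sources."""
    return sorted(left.keys() ^ right.keys())


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rows = merge_sources(visits, spend)
slope, intercept = fit_line([r[1] for r in rows], [r[2] for r in rows])

print("merged rows:", rows)
print("ids missing from one source:", unmatched_keys(visits, spend))
print("fitted slope and intercept:", slope, intercept)
```

The merge here is a plain inner join on the shared ids, so records found in only one source are reported separately instead of being silently dropped or filled in.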

"Data scientist" has become a popular occupation, with Harvard Business Review dubbing it "The Sexiest Job of the 21st Century"[9] and McKinsey & Company projecting a global excess demand of 1.5 million new data scientists.[10] Universities are offering master's courses in data science.[11] Shorter private bootcamps also offer data science certificates, ranging from student-paid programs like General Assembly to employer-paid programs like The Data Incubator.[12]


Figure: Data science process flowchart

The term "data science" (originally used interchangeably with "datalogy") has existed for decades: Peter Naur first used it in 1960 as a substitute for computer science. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods used in a wide range of applications. In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe for their biennial conference. There, for the first time, the term data science was included in the title of the conference ("Data Science, classification, and related methods").[13]

In November 1997, C.F. Jeff Wu gave an inaugural lecture entitled "Statistics = Data Science?"[14] for his appointment to the H. C. Carver Professorship at the University of Michigan.[15] In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer-science usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists.[14] He later presented the same lecture as the first of his 1998 P.C. Mahalanobis Memorial Lectures.[16] These lectures honor Prasanta Chandra Mahalanobis, an Indian scientist and statistician and the founder of the Indian Statistical Institute.

In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data", in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics", published in Volume 69, No. 1 (April 2001) of the International Statistical Review / Revue Internationale de Statistique.[17] In the article, Cleveland established six technical areas that he believed encompassed the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.

In April 2002, the Committee on Data for Science and Technology (CODATA) of the International Council for Science[18] started the Data Science Journal,[19] a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues.[20] Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science,[21] which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research.

In 2005, the National Science Board published "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century", defining data scientists as "the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection" whose primary activity is to "conduct creative inquiry and analysis."[22] In 2013, the IEEE Task Force on Data Science and Advanced Analytics[23] was launched, and the first IEEE International Conference on Data Science and Advanced Analytics followed in 2014.[24] In 2015, Springer launched the International Journal on Data Science and Analytics[25] to publish original work on data science and big data analytics.

In 2008, DJ Patil and Jeff Hammerbacher used the term "data scientist" to describe their jobs at LinkedIn and Facebook, respectively.[26]


Although use of the term "data science" has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs.[27] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”[28]


Around 2010–2011, data science software reached an inflection point at which open source software started supplanting proprietary software.[29] Open source software can be modified and extended, and the resulting algorithms can be shared.[30][31][32]


  1. ^ Dhar, V. (2013). "Data science and prediction". Communications of the ACM. 56 (12): 64. doi:10.1145/2500499. 
  2. ^ Leek, Jeff (2013-12-12). "The key word in 'Data Science' is not Data, it is Science". Simply Statistics. 
  3. ^ "Predictive Analytics Degree: Northwestern SPS". Northwestern University. Retrieved 28 May 2016. The Master of Science in Predictive Analytics (MSPA) program, established in 2011, is a fully online part-time graduate program, one of the first to offer dedicated training in data science 
  4. ^ Stewart Tansley; Kristin Michele Tolle (2009). The Fourth Paradigm: Data-intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4. 
  5. ^ Bell, G.; Hey, T.; Szalay, A. (2009). "COMPUTER SCIENCE: Beyond the Data Deluge". Science. 323 (5919): 1297–1298. doi:10.1126/science.1170411. ISSN 0036-8075. 
  6. ^ Foreman, John (2013). Data Smart: Using Data Science to Transform Information into Insight. John Wiley & Sons. p. xiv. ISBN 9781118839867. 
  7. ^ LaPonsie, Maryalene. "Data scientists: The Hottest Job You Haven't Heard Of". Retrieved 7 October 2012. 
  8. ^ Nguyen, Thomson. "Data scientists vs data analysts: Why the distinction matters". Retrieved 2 October 2015. 
  9. ^ "Data Scientist: The Sexiest Job of the 21st Century". 
  10. ^ "Big data: The next frontier for innovation, competition, and productivity". 
  11. ^ "Big Data Analytics Masters". Information Week. Retrieved 2016-02-22. 
  12. ^ "NY gets new bootcamp for data scientists: It's free, but harder to get into than Harvard". Venture Beat. Retrieved 2016-02-22. 
  13. ^ Press, Gil. "A Very Short History Of Data Science". 
  14. ^ a b Wu, C. F. J. (1997). "Statistics = Data Science?" (PDF). Retrieved 9 October 2014. 
  15. ^ "Identity of statistics in science examined". The University Records, 9 November 1997, The University of Michigan. Retrieved 12 August 2013. 
  16. ^ "P.C. Mahalanobis Memorial Lectures, 7th series". P.C. Mahalanobis Memorial Lectures, Indian Statistical Institute. Retrieved 18 August 2013. 
  17. ^ Cleveland, W. S. (2001). "Data science: an action plan for expanding the technical areas of the field of statistics". International Statistical Review / Revue Internationale de Statistique. 69 (1): 21–26. 
  18. ^ International Council for Science : Committee on Data for Science and Technology. (2012, April). CODATA, The Committee on Data for Science and Technology. Retrieved from International Council for Science : Committee on Data for Science and Technology:
  19. ^ Data Science Journal. (2012, April). Available Volumes. Retrieved from Japan Science and Technology Information Aggregator, Electronic:
  20. ^ Data Science Journal. (2002, April). Contents of Volume 1, Issue 1, April 2002. Retrieved from Japan Science and Technology Information Aggregator, Electronic:
  21. ^ The Journal of Data Science. (2003, January). Contents of Volume 1, Issue 1, January 2003. Retrieved from
  22. ^ National Science Board. "Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century". National Science Foundation. Retrieved 30 June 2013. 
  23. ^ "IEEE Task Force on Data Science and Advanced Analytics". 
  24. ^ "2014 IEEE International Conference on Data Science and Advanced Analytics". 
  25. ^ "Journal on Data Science and Analytics". 
  26. ^ "Tim O'Reilly: The World's 7 Most Powerful Data Scientists". 
  27. ^ "Data Science: What's The Half-Life Of A Buzzword?". Forbes. 2013-08-19. 
  28. ^ "Nate Silver: What I need from statisticians". Statistics Views. 23 Aug 2013. 
  29. ^ Chalef, Daniel (2016-03-20). "Data Science Tools – Are Proprietary Vendors Still Relevant?". Retrieved 2016-11-07. 
  30. ^ Asay, Matt. "For data scientists, the big money is in open source". TechRepublic. Retrieved 6 November 2016. 
  31. ^ Jones, M. Tim. "Data science and open source". IBM DeveloperWorks. IBM. Retrieved 6 November 2016. 
  32. ^ Talbert, Neera. "Open Source Software Fuels a Revolution in Data Science". insideBIGDATA. Retrieved 6 November 2016. 

Further reading