Data science: Difference between revisions
Grammar and clarity fixups for first paragraph.
The first of the two popularization tags seems to correspond more closely to history than popularization.
Revision as of 15:10, 19 October 2012
Data science is a discipline that builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. Although it is becoming more common, "Data Science" is still a novel term that is often used interchangeably with "Competitive Intelligence" or "Business Analytics." Data Science seeks to use all available and relevant data to tell a story that non-practitioners can easily understand.
A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data problems by applying deep expertise in some scientific discipline. They are generally expected to be able to work with various elements of mathematics, statistics and computer science, although expertise in all of these subjects is not required. A data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all of them, and a person proficient in all of them would be extremely rare. Data science must therefore be practiced as a team whose members together provide expertise and proficiency across all the disciplines.
Good Data Scientists are able to apply their skills to achieve a broad spectrum of end results, although the skill sets and competencies they employ vary widely. These results include finding and interpreting rich data sources, managing large amounts of data despite hardware, software and bandwidth constraints, merging data sources, ensuring the consistency of data sets, creating visualizations that aid in understanding data, and building rich tools that enable others to work with data effectively. According to some experts, the best data scientists tend to be "hard scientists," particularly physicists, rather than those with backgrounds in computer science. Data scientists are an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and analysis, that can help businesses gain a competitive edge.[1]
A major goal of Data Science is to make it easier for others to find and coalesce data. Data science technologies affect how we access data and conduct research across various domains, including the biological sciences, medical informatics, the social sciences and the humanities. “From intelligence search that integrates better understanding of the text and the user’s intentions, to integrating multiple modalities when accessing information, to the ability to actually aggregate information from multiple sources and answer users queries, the possibilities are endless.”
Origins
Data Science has existed for over a decade. An early claimant to the term Data Science is William S. Cleveland,[2] who wrote “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique.[3] About a year later, the International Council for Science: Committee on Data for Science and Technology[4] started publishing the CODATA Data Science Journal,[5] beginning in April 2002.[6] Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.[7]
History
The term “Data Science” (originally used interchangeably with “Datalogy”) has existed for over thirty years; Peter Naur introduced it in 1960 as a substitute for “computer science.” In 1974, Naur published Concise Survey of Computer Methods (Citation Link), which freely used the term “data science” in its survey of the contemporary data processing methods used in a wide range of applications. In 1996, members of the International Federation of Classification Societies (IFCS) met in Tokyo for their biennial conference. For the first time, the term “data science” was included in the title of the conference (“Data science, classification, and related methods”). The following year (1997), the journal “Knowledge Discovery and Data Mining
In 2001, William S. Cleveland introduced the notion of “Data Science” as an independent discipline, extending the field of statistics to incorporate “advances in computing with data,” in his article “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique.[2] In the article, Cleveland established six technical areas that he believed encompassed the field of data science: Multidisciplinary Investigations, Models and Methods for Data, Computing with Data, Pedagogy, Tool Evaluation, and Theory.
In April 2002, the International Council for Science: Committee on Data for Science and Technology (CODATA)[3] started the Data Science Journal,[4] a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues.[5] Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science,[6] which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. In 2005, the National Science Board published “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century,” defining data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.”
Popularization
Troy Sadkowsky, a scientific researcher developing applications in support of large-scale “Big Data” scientific research, was the first to establish an online community group, the “Data Scientist Group” on LinkedIn, in 2009. Mike Loukides, Vice President of Content Strategy for O'Reilly Media, helped to bring Data Science into the mainstream vernacular in 2010 with his article “What is data science?” In the last few years, data science has increasingly been associated with the analysis of Big Data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value from the extremely large volumes of data generated by their websites. There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences and Greenplum's Data Science Summits.
The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased 4,000 percent between 2010 and 2012. In 2012, Kaggle, an internet-based predictive modeling and analytics company, began ranking the data scientists who participated in its pattern-finding challenges. As of October 2012, the leader is Alexander D’Yakonov, a professor of computational mathematics and cybernetics at Moscow State University. Sergey Yurgenson, a physics Ph.D. who designs photon microscopes in the neurobiology department at Harvard Medical School, ranked second. Vivek Sharma, a software consultant in financial services based in New Delhi, India, who holds a master’s degree in computer science, ranked third.
Domain Specific Interests
Data Science is the practice of deriving valuable insights from data. It is emerging to meet the challenges of processing very large data sets, i.e. "Big Data," consisting of the structured, unstructured or semi-structured data that large enterprises produce. At center stage of Data Science is the explosion of new data generated by smart devices, the web, mobile and social media. Data Science requires a versatile skill set. Many practicing data scientists specialize in specific domains such as marketing, medicine, security, fraud and finance. All of them, however, rely heavily upon elements of statistics, machine learning, text retrieval and natural language processing to analyze data and interpret results.
Research Areas
As an interdisciplinary subject, Data Science draws scientific inquiry from a broad range of academic subject areas, mostly related to the hard sciences. Some areas of research are:
- Cloud Computing
- Databases and Information Integration
- Learning, Natural Language Processing and Information Extraction
- Computer Vision
- Information Retrieval and Web Information Access
- Knowledge Discovery in Social and Information Networks
Security Data Science
Data Science has a long and rich history in security and fraud monitoring. Paul Braxton, founder of securitydatascience.org, coined the term Security Data Science and defined it as the application of advanced analytics to activity and access data to uncover unknown risks. Security Data Science is focused on advancing information security through practical applications of exploratory data analysis, statistics, machine learning and data visualization. Although the tools and techniques are no different from those used in data science in any other data domain, this group has a micro-focus on reducing risk and identifying fraud or malicious insiders. The information security and fraud prevention industries have been evolving Security Data Science in order to tackle the challenges of managing and gaining insights from huge streams of log data, discovering insider threats and preventing fraud. Security Data Science is "data driven," meaning that new insights and value come directly from data. [8]
Academic Programs
At present, very few academic degree programs are offered with the “Data Science” designation. This may be attributed to the inherently interdisciplinary nature of Data Science. However, a good number of academic programs offer curricula that blend various subjects and research areas to prepare students for Data Science careers. The most common designation for such programs is “Analytics.” Some examples are the Institute for Advanced Analytics at North Carolina State University, the McCormick School of Engineering at Northwestern University, the six-week summer program at the University of Illinois and a new degree program in “Social Data Analytics” at Penn State University. Other academic programs include:
Master’s Degree Programs
- The University of Dundee will offer a Master of Science in Data Science starting January 2013.
- New York University’s Stern School of Business is offering a Master of Arts in Business Analytics
- The University of Texas McCombs School of Business is offering a Master of Science in Business Analytics
- University of Michigan’s Dearborn College is offering a Master of Arts in Business Analytics
- DePaul University is offering a Master of Science in Predictive Analysis and courses in Data Curation
- Stevens Institute of Technology’s Wesley J. Howe School of Technology Management is offering a Master’s program in Business Intelligence and Analytics.
- The University of Auckland, New Zealand, is introducing a Master Programme of Professional Studies in Data Science in 2013.
- North Carolina State University Institute for Advanced Analytics is offering a Master of Science in Analytics
- The University of San Francisco is offering a Master of Science in Analytics
Doctorate Degree Programs
- George Mason University is offering a PhD in Computational Science and Informatics
- Purdue University’s Indianapolis School of Informatics is offering a PhD in Informatics
Specializations
- The University of Illinois at Urbana-Champaign’s Graduate School of Library and Information Science has been offering a specialization in Data Curation (DCEP) since 2007 and plans to offer a specialization in Socio-Technical Data Analytics (SoDA) in 2013. The Library & Information Science program is ranked number 1 in the nation.
Certificate Programs
- Syracuse University's School of Information Studies has been offering a Certificate in Advanced Studies in Data Science since early 2012.
Courses
- UC Berkeley is offering an “Introduction to Data Science” course
- Columbia University is offering an “Introduction to Data Science” course
- Bentley University is offering an “Introduction to Data Science” course
- Data Science Institute, a French and Canadian private university, provides courses and master classes on Data Science in Paris and Montreal to help existing BI professionals upgrade their competencies to Data Science.
Domain Specific Organizations
Data scientists work in many industries across many data domains; however, specializations in some domains have emerged. One such domain is security. The Association of Security Data Scientists has formed and is promoting Security Data Science as a sub-discipline of Information Security.
Internet Forums
As the "Data Science" buzzword spreads, more online communities are being established to draw practitioners together and generate public interest. Some online groups and communities are:
- Twitter: Data Science Central
Professional Organizations
A few professional organizations have sprung up recently. Data Science Central, Kaggle and ScraperWiki are examples. Greenplum, a division of EMC, is driving research and innovation in the field of "Big Data Analytics" and the Data Science industry. A "Chief Data Scientists Summit" is scheduled for November 2012 in Chicago, USA.
Companies and Organizations Seeking Data Scientists
Further Reading
- Jeffrey M. Stanton (20 May 2012). "Introduction to Data Science". Syracuse University School of Information Studies. Retrieved 8 August 2012.
- Calvin Andrus (2012). "Data Science: An Introduction". Wikibooks.org. Retrieved 8 August 2012.[9][10]
- Drew Conway, John Miles White. “Machine Learning for Hackers”. O’Reilly Media, Inc.
References
- ^ LaPonsie, Maryalene. "Data Scientists: The Hottest Job You Haven't Heard Of". Retrieved 7 October 2012.
- ^ http://www.stat.purdue.edu/~wsc/
- ^ Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review / Revue Internationale de Statistique, 21-26
- ^ International Council for Science : Committee on Data for Science and Technology. (2012, April). CODATA, The Committee on Data for Science and Technology. Retrieved from International Council for Science : Committee on Data for Science and Technology: http://www.codata.org/
- ^ Data Science Journal. (2012, April). Available Volumes. Retrieved from Japan Science and Technology Information Aggregator, Electronic: http://www.jstage.jst.go.jp/browse/dsj/_vols
- ^ Data Science Journal. (2002, April). Contents of Volume 1, Issue 1, April 2002. Retrieved from Japan Science and Technology Information Aggregator, Electronic: http://www.jstage.jst.go.jp/browse/dsj/1/0/_contents
- ^ The Journal of Data Science. (2003, January). Contents of Volume 1, Issue 1, January 2003. Retrieved from http://www.jds-online.com/v1-1
- ^ http://www.securitydatascience.org
- ^ Anderson, Janna. "The Future of The Internet" (PDF). Pew Research Center. Retrieved 7 October 2012.
- ^ West, Darrell. "Big Data For Education: Data Mining, Data Analytics, and Web Dashboards" (PDF). The Brookings Institution. Retrieved 7 October 2012.
- ^ Davenport, Thomas. "The Human Side of Big Data and High-Performance Analytics" (PDF). International Institute for Analytics. Retrieved 7 October 2012.
- ^ Hellerstein, Joseph. "The MADlib Analytics Library or MAD Skills, the SQL" (PDF). University of California at Berkeley. Retrieved 7 October 2012.
- ^ Stodder, David. "Customer Analytics In the Age of Social Media" (PDF). TDWI Research. Retrieved 7 October 2012.
--Variable12 (talk) 07:43, 7 October 2012 (UTC)Elias