Big data
In information technology, big data[1] consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage,[2] search, sharing, analytics,[3] and visualization. The trend continues because working with ever larger datasets allows analysts to "spot business trends, prevent diseases, combat crime."[4] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.[5] Scientists regularly encounter this problem in meteorology, genomics,[6] connectomics, complex physics simulations,[7] and biological and environmental research;[8] it also arises in Internet search, finance and business informatics. Data sets also grow in size because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.[9][10] Every day, 2.5 quintillion bytes of data are created, and 90% of the data in the world today was created within the past two years.[11]
One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead "massively parallel software running on tens, hundreds, or even thousands of servers".[12] The size of "big data" varies depending on the capabilities of the organization managing the set. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[13]
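The quantities quoted above span many orders of magnitude, so it can help to put them on a common scale. The following is a minimal sketch, assuming decimal (SI) units in which each step is a factor of 1,000; the helper name `humanize` is hypothetical and not taken from any cited source.

```python
# Decimal (SI) byte units; binary units (KiB, MiB, ...) would give
# slightly different values. This choice is an assumption.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def humanize(num_bytes: float) -> str:
    """Express a byte count in the largest SI unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit in the table.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

daily_bytes = 2.5e18  # "2.5 quintillion bytes" created per day, from the text
print(humanize(daily_bytes))  # 2.5 EB
print(humanize(300e9))        # 300 GB, i.e. "hundreds of gigabytes"
```

On this scale, the daily figure of 2.5 quintillion bytes corresponds to 2.5 exabytes.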
Definition
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set.
In a 2001 research report[14] and related conference presentations, Doug Laney, then an analyst at META Group (now Gartner), defined data growth challenges (and opportunities) as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types and sources). Gartner continues to use this model for describing big data.[15]
Examples
Examples include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.
Technologies
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, data-mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.[citation needed]
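The common idea behind massively parallel processing is to partition the data, let each worker process only its own partition, and then merge the partial results. The following is a minimal single-machine sketch of that pattern, assuming a word-count task over hypothetical log lines. Threads stand in for separate servers here, and the function names are illustrative rather than the API of any particular product.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Split records into n_parts roughly equal chunks (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def count_words(chunk):
    """'Map' step: a worker counts words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(records, n_workers=4):
    """Run the map step on every chunk concurrently, then merge ('reduce')."""
    chunks = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input: four web-log lines.
logs = ["GET /index", "GET /about", "POST /login", "GET /index"]
result = parallel_word_count(logs, n_workers=2)
print(result["get"], result["/index"])  # 3 2
```

Because no worker needs data held by another until the final merge, the same structure can in principle be spread across many machines. Real systems add data distribution, scheduling and fault tolerance, none of which this sketch attempts.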
Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[16][17]
Practitioners of big data analytics are generally hostile to shared storage. They prefer direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity SATA disk buried inside parallel processing nodes. Shared storage architectures (SAN and NAS) are perceived as relatively slow, complex, and above all expensive. These qualities are not consistent with big data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.
Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good. Data on spinning disk at the other end of an FC SAN connection is not. Perhaps worst of all, the cost of a SAN at the scale needed for analytics applications is thought to be prohibitive.
There is a case to be made for shared storage in big data analytics, but storage vendors and the storage community in general have yet to make that case to big data analytics practitioners.[18]
Impact
When the Sloan Digital Sky Survey (SDSS) began collecting data in 2000, it amassed more data in its first few weeks than had been collected in the entire history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, it is anticipated to acquire that amount of data every five days.[19] In total, the four main detectors at the Large Hadron Collider (LHC) produced 13 petabytes (13,000 terabytes) of data in 2010.[20]
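The rates implied by these figures can be worked out directly. The sketch below is back-of-the-envelope arithmetic only, assuming decimal units (1 TB = 1,000 GB), a constant SDSS rate of 200 GB on every observing night, and LHC output spread evenly over 365 days. These simplifications are assumptions, not claims from the cited sources.

```python
GB_PER_TB = 1000  # decimal units (assumption)

sdss_total_tb = 140      # total SDSS archive, from the text
sdss_gb_per_night = 200  # SDSS nightly rate, from the text

# Observing nights needed to reach 140 TB at 200 GB per night.
nights_to_collect = sdss_total_tb * GB_PER_TB / sdss_gb_per_night

# LSST is expected to gather the same 140 TB every five days.
lsst_tb_per_day = sdss_total_tb / 5

# Ratio of the LSST daily rate to the SDSS nightly rate.
speedup = lsst_tb_per_day * GB_PER_TB / sdss_gb_per_night

# LHC: 13 PB (13,000 TB) in 2010, averaged over the year.
lhc_tb_per_day = 13_000 / 365

print(nights_to_collect)         # 700.0
print(lsst_tb_per_day)           # 28.0
print(speedup)                   # 140.0
print(round(lhc_tb_per_day, 1))  # 35.6
```

Under these assumptions, the SDSS archive represents about 700 nights of observation, and LSST would collect data roughly 140 times faster than SDSS.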
Other examples of the impact of big data:
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
- Facebook handles 40 billion photos from its user base.
- Decoding the human genome originally took 10 years; now it can be achieved in one week.[19]
Big data has increased the demand for information management specialists: Oracle, IBM, Microsoft, and SAP have spent more than $15 billion on software firms specializing only in data management and analytics. This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.[19]
Big data has emerged because society makes increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the Internet. In short, more people interact with data or information than ever before.[19] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class; as people grow wealthier they tend to become more literate, which in turn fuels information growth. Cisco predicts that the amount of traffic flowing over the Internet will reach 667 exabytes annually by 2013.[19]
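To give a sense of scale, the annual traffic forecast can be converted into an average sustained rate. This is a rough sketch, assuming decimal units (1 EB = 10^18 bytes), a 365-day year, and perfectly even traffic, which real networks do not have.

```python
BYTES_PER_EB = 10**18  # decimal exabyte (assumption)
BYTES_PER_TB = 10**12  # decimal terabyte (assumption)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

annual_traffic_eb = 667  # Cisco forecast for 2013, from the text

# Average bytes per second if traffic were spread evenly over the year.
bytes_per_second = annual_traffic_eb * BYTES_PER_EB / SECONDS_PER_YEAR
tb_per_second = bytes_per_second / BYTES_PER_TB

print(round(tb_per_second, 2))  # 21.15
```

On these assumptions, 667 exabytes a year corresponds to an average of roughly 21 terabytes every second.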
Critique
danah boyd has raised concerns that the use of big data in science can neglect principles such as choosing a representative sample, because researchers become preoccupied with handling the huge amounts of data.[21] This approach may lead to biased results. Integration across heterogeneous data resources, some of which might be considered “big data” and others not, presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.[22]
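The sampling concern can be illustrated with a toy simulation. The sketch below is not drawn from the cited talk. It assumes a hypothetical population of 100,000 evenly spread values and compares a very large but unrepresentative subset with a much smaller random sample when both are used to estimate the population mean.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values spread evenly from 0 to 99,999.
population = list(range(100_000))
true_mean = mean(population)

# "Big" but unrepresentative: only the upper half is ever observed.
big_biased = [x for x in population if x >= 50_000]

# Small but representative: 1,000 values drawn uniformly at random.
small_random = random.sample(population, 1_000)

# Absolute error of each estimate of the population mean.
biased_error = abs(mean(big_biased) - true_mean)
random_error = abs(mean(small_random) - true_mean)

print(len(big_biased), biased_error)
print(len(small_random), random_error)
```

In this contrived setting the 50,000-value subset misses the true mean by 25,000 regardless of its size, while the 1,000-value random sample typically lands far closer. More data does not correct for a sample that was unrepresentative to begin with.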
Architecture comparison
- Survey Distributed Databases
- Marin Dimitrov's Comparison on PNUTS, Dynamo, Voldemort, BigTable, HBase, Cassandra and CouchDB May 2010
- Why Use HBase-1: from Million Mark to Billion Mark
- Why Use HBase-2: Demystifying HBase Data integrity, Availability and Performance
- Beyond Hadoop: Next-Generation Big Data Architectures, by Bill McColl, 23 October 2010, about "Not Only Hadoop".
- MPI and BSP: see the wiki pages on Bulk Synchronous Parallel and on Apache HAMA running on a Hadoop cluster.
Performance evaluation
Existing work done by the community
References
- ^ White, Tom. Hadoop: The Definitive Guide. 2009. 1st Edition. O'Reilly Media. Pg 3.
- ^ Kusnetzky, Dan. What is "Big Data?". ZDNet. http://blogs.zdnet.com/virtualization/?p=1708
- ^ Vance, Ashley. Start-Up Goes After Big Data With Hadoop Helper. New York Times Blog. 22 April 2010. http://bits.blogs.nytimes.com/2010/04/22/start-up-goes-after-big-data-with-hadoop-helper/?dbk
- ^ Cukier, K. (25 February 2010). Data, data everywhere. The Economist. http://www.economist.com/specialreports/displaystory.cfm?story_id=15557443
- ^ Horowitz, Mark. Visualizing Big Data: Bar Charts for Words. Wired Magazine. Vol 16 (7). 23 June 2008. http://www.wired.com/science/discoveries/magazine/16-07/pb_visualizing##ixzz0llT2DN5j
- ^ Community cleverness required. Nature, 455(7209), 1. 2008. http://www.nature.com/nature/journal/v455/n7209/full/455001a.html
- ^ Sandia sees data management challenges spiral. HPC Projects. 4 August 2009. http://www.hpcprojects.com/news/news_story.php?news_id=922
- ^ Reichman, O.J., Jones, M.B., and Schildhauer, M.P. 2011. Challenges and Opportunities of Open Data in Ecology. Science 331(6018): 703-705. DOI:10.1126/science.1197962
- ^ Hellerstein, Joe. Parallel Programming in the Age of Big Data. Gigaom Blog. 9 November 2008. http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/
- ^ Segaran, Toby and Hammerbacher, Jeff. Beautiful Data. 1st Edition. O'Reilly Media. Pg 257.
- ^ http://www-01.ibm.com/software/data/bigdata/
- ^ Jacobs, A. (6 July 2009). The Pathologies of Big Data. ACMQueue. http://queue.acm.org/detail.cfm?id=1563874
- ^ Magoulas, Roger., Lorica, Ben. (Feb 2009) Introduction to Big Data. Release 2.0. Issue 11. Sebastopol, CA: O’Reilly Media. http://radar.oreilly.com/r2/release2-0-11.html
- ^ Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety". Gartner. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Retrieved 6 February 2001.
- ^ Beyer, Mark. "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data". Gartner. http://www.gartner.com/it/page.jsp?id=1731916. Retrieved 13 July 2011.
- ^ Monash, Curt eBay’s two enormous data warehouses, 30 April 2009 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
- ^ Monash, Curt eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more, 6 October 2010 http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/
- ^ How New Analytic Systems will Impact Storage Sept, 2011 http://www.evaluatorgroup.com/document/big-data-how-new-analytic-systems-will-impact-storage-2/
- ^ a b c d e http://www.economist.com/node/15557443
- ^ Geoff Brumfiel (19 January 2011). "High-energy physics: Down the petabyte highway". Nature 469: pp. 282–283. doi:10.1038/469282a. http://www.nature.com/news/2011/110119/full/469282a.html. Retrieved 2 October 2011.
- ^ Danah Boyd (2010-04-29). "Privacy and Publicity in the Context of Big Data". WWW 2010 conference. http://www.danah.org/papers/talks/2010/WWW2010.html. Retrieved 2011-04-18.
- ^ Jones MB, Schildhauer MP, Reichman OJ, and Bowers S. 2006. The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere. Annual Review of Ecology, Evolution, and Systematics 37(1):519-544
Additional reading
- Manyika, James; Michael Chui, Jacques Bughin, Brad Brown, Richard Dobbs, Charles Roxburgh, Angela Hung Byers (May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
- Hilbert, Martin; Priscila Lopez (2011). "The World’s Technological Capacity to Store, Communicate, and Compute Information". Science 332 (6025): 60-65. doi:10.1126/science.1200970.