Big data

A data visualization created by IBM shows that big data such as Wikipedia edits by bot Pearle are more meaningful when enhanced with colors and position.

In information technology, big data[1] consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage,[2] search, sharing, analytics,[3] and visualization. The trend toward larger datasets continues because working with them allows analysts to "spot business trends, prevent diseases, combat crime."[4] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.[5] Scientists regularly encounter this problem in meteorology, genomics,[6] connectomics, complex physics simulations,[7] biological and environmental research,[8] Internet search, finance and business informatics. Datasets also grow in size because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.[9][10] Every day, 2.5 quintillion bytes of data are created, and 90% of the data in the world today was created within the past two years.[11]
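
To put the daily figure above on the same scale as the terabyte-to-zettabyte range, the byte count can be converted to a named unit. The following is a minimal sketch, assuming decimal (SI) units and a short-scale quintillion (10^18); the helper name to_si is hypothetical and not taken from any cited source.

```python
# Decimal (SI) byte units; binary units (KiB, MiB, ...) would give slightly different figures.
SI_UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_si(num_bytes):
    """Scale a byte count to the largest SI unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in SI_UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < 1000 or unit == SI_UNITS[-1]:
            return value, unit
        value /= 1000

# "2.5 quintillion bytes" per day, from the text (assumed short scale: 10**18).
daily_bytes = 2.5e18
value, unit = to_si(daily_bytes)
print(f"{value:g} {unit} per day")
```

Under these assumptions the daily volume comes out at 2.5 exabytes.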

One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead "massively parallel software running on tens, hundreds, or even thousands of servers".[12] The size of "big data" varies depending on the capabilities of the organization managing the set. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[13]

Definition

Big data is a term applied to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single dataset.

In a 2001 research report[14] and related conference presentations, Doug Laney, then an analyst at META Group (now Gartner), defined data growth challenges (and opportunities) as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types and sources). Gartner continues to use this model for describing big data.[15]

Examples

Examples include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemistry, biology, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.

Technologies

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, data-mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.[citation needed]
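
A common idea behind many of the massively parallel systems listed above is to partition a dataset, process each partition independently, and then merge the partial results. The following is a minimal single-machine sketch of that pattern, not a description of any particular product. It uses a thread pool from the Python standard library as a stand-in for a cluster of servers, and the function names and sample log lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Split records into n_parts roughly equal chunks (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def count_status(chunk):
    """'Map' step: count the status field (last token) within one partition."""
    return Counter(line.split()[-1] for line in chunk)

def parallel_count(records, n_parts=4):
    """Run the map step on every partition, then merge ('reduce') the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for partial in pool.map(count_status, partition(records, n_parts)):
            total.update(partial)
    return total

# Hypothetical web-log lines in the form "<path> <status>".
logs = ["/index 200", "/about 200", "/missing 404", "/index 200", "/admin 403"]
print(parallel_count(logs))
```

In a real deployment each partition would live on a different node and the merge would happen over the network; threads are used here only so the sketch stays self-contained.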

Some, but not all, MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[16][17]

Practitioners of big data analytics are generally hostile to shared storage. They prefer direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity SATA disk buried inside parallel processing nodes. Shared storage architectures (SAN and NAS) are perceived as relatively slow, complex, and above all, expensive. These qualities are not consistent with big data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.

Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of a Fibre Channel (FC) SAN connection is not. Perhaps worst of all, the cost of a SAN at the scale needed for analytics applications is thought to be prohibitive.

There is a case to be made for shared storage in big data analytics, but storage vendors and the storage community in general have yet to make that case to big data analytics practitioners.[18]

Impact

When the Sloan Digital Sky Survey (SDSS) began collecting data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, it is anticipated to acquire that amount of data every five days.[19] In total, the four main detectors at the Large Hadron Collider (LHC) produced 13 petabytes (13,000 terabytes) of data in 2010.[20]
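
The survey figures above imply a few further numbers that can be worked out directly. The sketch below uses only the values quoted in the paragraph and assumes decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
GB_PER_TB = 1000  # decimal units assumed

sdss_rate_gb_per_night = 200   # "about 200 GB per night", from the text
sdss_total_tb = 140            # "more than 140 terabytes", from the text
lsst_days_to_match = 5         # LSST is expected to collect that amount every five days

# Number of nights needed to reach the SDSS total at the quoted nightly rate.
nights_at_sdss_rate = sdss_total_tb * GB_PER_TB / sdss_rate_gb_per_night

# Implied LSST collection rate and its ratio to the SDSS nightly rate.
lsst_rate_tb_per_day = sdss_total_tb / lsst_days_to_match
rate_ratio = lsst_rate_tb_per_day * GB_PER_TB / sdss_rate_gb_per_night

print(f"SDSS nights to reach {sdss_total_tb} TB: {nights_at_sdss_rate:g}")
print(f"Implied LSST rate: {lsst_rate_tb_per_day:g} TB per day")
print(f"LSST rate relative to SDSS: {rate_ratio:g}x")
```

On these figures the quoted SDSS total corresponds to 700 nights of observation, and the anticipated LSST rate is 28 TB per day, or 140 times the SDSS nightly rate.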

Further examples of the scale of big data:

  • Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
  • Facebook handles 40 billion photos from its user base.
  • Decoding the human genome originally took 10 years to process; now it can be achieved in one week.[19]
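
The Walmart comparison in the list above can be unpacked with simple arithmetic. The sketch below derives the Library of Congress size that the 167-fold comparison implies, along with the average transaction rate. It assumes decimal units (1 PB = 1000 TB) and treats "more than" figures as lower bounds; the implied Library of Congress figure is a consequence of the quoted ratio, not an independently sourced number.

```python
TB_PER_PB = 1000  # decimal units assumed

walmart_db_pb = 2.5        # "more than 2.5 petabytes", from the text
loc_multiple = 167         # "167 times" the Library of Congress books, from the text
tx_per_hour = 1_000_000    # "more than 1 million" transactions per hour, used as a lower bound

# Size of the Library of Congress book collection implied by the comparison.
implied_loc_tb = walmart_db_pb * TB_PER_PB / loc_multiple

# Average transaction rate implied by the hourly figure.
tx_per_second = tx_per_hour / 3600

print(f"Implied Library of Congress books: about {implied_loc_tb:.1f} TB")
print(f"Average transaction rate: about {tx_per_second:.0f} per second")
```

The comparison therefore assumes roughly 15 TB for the books in the Library of Congress, and the hourly figure works out to a little under 280 transactions per second on average.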

Big data has increased the demand for information management specialists: Oracle, IBM, Microsoft, and SAP have spent more than $15 billion on software firms specializing solely in data management and analytics. This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.[19]
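
One way to read the growth rates quoted above is in terms of doubling time under constant compound growth. The sketch below is illustrative only: the 10% rate comes from the text, while the 5% rate for the software business as a whole is an assumption derived from "roughly twice as fast", and neither rate is guaranteed to stay constant.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years for a quantity to double under constant compound annual growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)

industry_rate = 0.10   # "almost 10% a year", from the text
software_rate = 0.05   # assumption: half the industry rate ("roughly twice as fast")

print(f"Data management and analytics: {doubling_time_years(industry_rate):.1f} years to double")
print(f"Software overall (assumed 5%): {doubling_time_years(software_rate):.1f} years to double")
```

Under these assumptions the data management and analytics market would double in a little over seven years, against roughly fourteen years for the software business as a whole.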

Big data has emerged because society makes increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the Internet. More people are interacting with data or information than ever before.[19] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class; as people become wealthier they tend to become more literate, which in turn drives information growth. Cisco predicts that the amount of traffic flowing over the Internet will reach 667 exabytes annually by 2013.[19]

Critique

danah boyd has raised concerns that the use of big data in science neglects principles such as choosing a representative sample, because researchers are too concerned with actually handling the huge amounts of data.[21] This approach may lead to biased results. Integration across heterogeneous data resources, some of which might be considered "big data" and others not, presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.[22]

Architecture comparison

Performance evaluation

Existing work done by community

References

  1. ^ White, Tom. Hadoop: The Definitive Guide. 2009. 1st Edition. O'Reilly Media. Pg 3.
  2. ^ Kusnetzky, Dan. What is "Big Data?". ZDNet. http://blogs.zdnet.com/virtualization/?p=1708
  3. ^ Vance, Ashley. Start-Up Goes After Big Data With Hadoop Helper. New York Times Blog. 22 April 2010. http://bits.blogs.nytimes.com/2010/04/22/start-up-goes-after-big-data-with-hadoop-helper/?dbk
  4. ^ Cukier, K. (25 February 2010). Data, data everywhere. The Economist. http://www.economist.com/specialreports/displaystory.cfm?story_id=15557443
  5. ^ Horowitz, Mark. Visualizing Big Data: Bar Charts for Words. Wired Magazine. Vol 16 (7). 23 June 2008. http://www.wired.com/science/discoveries/magazine/16-07/pb_visualizing##ixzz0llT2DN5j
  6. ^ Community cleverness required. Nature, 455(7209), 1. 2008. http://www.nature.com/nature/journal/v455/n7209/full/455001a.html
  7. ^ Sandia sees data management challenges spiral. HPC Projects. 4 August 2009. http://www.hpcprojects.com/news/news_story.php?news_id=922
  8. ^ Reichman, O.J., Jones, M.B., and Schildhauer, M.P. 2011. Challenges and Opportunities of Open Data in Ecology. Science 331(6018): 703-705. DOI:10.1126/science.1197962
  9. ^ Hellerstein, Joe. Parallel Programming in the Age of Big Data. Gigaom Blog. 9 November 2008. http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/
  10. ^ Segaran, Toby and Hammerbacher, Jeff. Beautiful Data. 1st Edition. O'Reilly Media. Pg 257.
  11. ^ http://www-01.ibm.com/software/data/bigdata/
  12. ^ Jacobs, A. (6 July 2009). The Pathologies of Big Data. ACMQueue. http://queue.acm.org/detail.cfm?id=1563874
  13. ^ Magoulas, Roger., Lorica, Ben. (Feb 2009) Introduction to Big Data. Release 2.0. Issue 11. Sebastopol, CA: O’Reilly Media. http://radar.oreilly.com/r2/release2-0-11.html
  14. ^ Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety". Gartner. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Retrieved 6 February 2001.
  15. ^ Beyer, Mark. "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data". Gartner. http://www.gartner.com/it/page.jsp?id=1731916. Retrieved 13 July 2011. 
  16. ^ Monash, Curt. eBay’s two enormous data warehouses. 30 April 2009. http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
  17. ^ Monash, Curt. eBay followup: Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more. 6 October 2010. http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/
  18. ^ How New Analytic Systems will Impact Storage Sept, 2011 http://www.evaluatorgroup.com/document/big-data-how-new-analytic-systems-will-impact-storage-2/
  19. ^ http://www.economist.com/node/15557443
  20. ^ Geoff Brumfiel (19 January 2011). "High-energy physics: Down the petabyte highway". Nature 469: pp. 282–283. doi:10.1038/469282a. http://www.nature.com/news/2011/110119/full/469282a.html. Retrieved 2 October 2011. 
  21. ^ Danah Boyd (2010-04-29). "Privacy and Publicity in the Context of Big Data". WWW 2010 conference. http://www.danah.org/papers/talks/2010/WWW2010.html. Retrieved 2011-04-18. 
  22. ^ Jones MB, Schildhauer MP, Reichman OJ, and Bowers S. 2006. The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere. Annual Review of Ecology, Evolution, and Systematics 37(1):519-544

Additional reading

  • Hilbert, Martin; Priscila Lopez (2011). "The World’s Technological Capacity to Store, Communicate, and Compute Information". Science 332 (6025): 60-65. doi:10.1126/science.1200970. 