Greenplum

Greenplum
Type: Division of EMC Corporation
Industry: Big Data technologies
Founded: 2003
Headquarters: San Mateo, California, United States
Key people: Bill Cook, President; Scott Yara, SVP of Products & Co-Founder; Luke Lonergan, Co-Founder; David Wittenkamp, Chief Financial Officer; Charles McDevitt, Chief Architect
Products: Unified Analytics Platform (UAP), Database Software, Chorus Software, Enterprise-Ready Hadoop, Data Computing Appliance (DCA), Analytics Labs
Website: www.greenplum.com

Greenplum is a Big Data analytics company in San Mateo, California. Greenplum's products include Greenplum Unified Analytics Platform, Greenplum Data Computing Appliance, Greenplum Analytics Lab, Greenplum Database, Greenplum HD and Greenplum Chorus. Greenplum was acquired by EMC in July 2010,[1] becoming the foundation of EMC's Big Data Division.

History

Timeline information taken from Greenplum company history unless otherwise cited.[2]

  • 2003: Founded by Scott Yara and Luke Lonergan[3]
  • 2005: Greenplum Database Released
  • 2006: Sun Partnership
  • 2008: Greenplum MapReduce
  • 2010: EMC Acquisition,[4] Greenplum Data Computing Appliance Introduced
  • 2011: Greenplum Community Edition Released, Greenplum | SAS Strategic Partnership, Greenplum HD (Enterprise-Ready Hadoop) Introduced, First Annual Data Science Summit, MADlib, Modular Greenplum DCA Product Line, Greenplum Unified Analytics Platform (UAP) Launch
  • 2012: Greenplum Chorus Released, Second Annual Data Science Summit, Greenplum Analytics Workbench Goes Live

Technology and Products

Greenplum Unified Analytics Platform (UAP)

The Greenplum UAP solution includes Greenplum Database, Greenplum HD, Greenplum Chorus, Greenplum Command Center, Greenplum DCA and Greenplum Analytics Lab (Data Scientist team).

Greenplum Database

The Greenplum Database is built on the open-source database PostgreSQL.[5] It functions primarily as a data warehouse and uses a shared-nothing, massively parallel processing (MPP) architecture. In this architecture, data is partitioned across multiple segment servers, and each segment owns and manages a distinct portion of the overall data; there is neither disk-level sharing nor data contention among segments.
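
The partitioning idea can be illustrated with a minimal Python sketch. It assumes hash-based placement on a single key column; the segment count, the `segment_for` and `distribute` helpers, and the use of CRC32 are hypothetical stand-ins for illustration, not Greenplum's actual hashing or API.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a key value to a segment using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments: int = NUM_SEGMENTS):
    """Assign every row to exactly one segment based on its key column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return dict(segments)


rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
placement = distribute(rows, "customer_id")
for seg_id in sorted(placement):
    print(seg_id, [r["customer_id"] for r in placement[seg_id]])
```

Because each row lands on exactly one segment, every segment can scan its own portion without coordinating with the others.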

Parallel query optimizer

Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan.[6] Greenplum's optimizer uses a cost-based optimization algorithm[6] to evaluate potential execution plans, takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The resulting query plans contain traditional physical operations - scans, joins, sorts, aggregations, etc. - as well as parallel "motion" operations that describe when and how data should be transferred between nodes during query execution. Greenplum Database has three kinds of "motion" operations that may be found in a query plan: [7]

  • Broadcast Motion (N:N) - Every segment sends the target data to all other segments
  • Redistribute Motion (N:N) - Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
  • Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master)
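
The three motion types listed above can be sketched in Python by modelling each segment as a list of rows. This is a toy model under stated assumptions: the function names and the CRC32-based `target_segment` helper are hypothetical, and real motions stream data over the interconnect rather than copying lists.

```python
import zlib


def target_segment(value, num_segments):
    """Pick the owning segment for a value (CRC32 is an illustrative stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_segments


def broadcast_motion(segments):
    """N:N - every segment ends up holding a full copy of all rows."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in segments]


def redistribute_motion(segments, key):
    """N:N - every row is rehashed on `key` and sent to the segment that owns it."""
    out = [[] for _ in segments]
    for seg in segments:
        for row in seg:
            out[target_segment(row[key], len(segments))].append(row)
    return out


def gather_motion(segments):
    """N:1 - every segment sends its rows to a single node."""
    return [row for seg in segments for row in seg]


segments = [
    [{"id": 1, "dept": "a"}, {"id": 2, "dept": "b"}],
    [{"id": 3, "dept": "a"}],
    [{"id": 4, "dept": "c"}, {"id": 5, "dept": "b"}],
]
```

After `redistribute_motion(segments, "dept")`, all rows sharing a `dept` value sit on the same segment, which is what allows a join or aggregation on that column to run locally.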

Within the execution of each node in the query plan, multiple relational operations are processed by pipelining. Pipelining is the ability to begin a task before its predecessor task has completed, and this ability is key to increasing basic query parallelism. For example, while a table scan is taking place, rows selected can be pipelined into a join process.
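
Pipelining between a scan and a join can be approximated with Python generators, where a downstream operator pulls rows as soon as the upstream operator produces them. The `scan` and `hash_join` functions and the sample tables below are hypothetical illustrations, not Greenplum's executor.

```python
def scan(table, predicate):
    """Yield matching rows one at a time instead of materializing the result."""
    for row in table:
        if predicate(row):
            yield row


def hash_join(probe_rows, build_table, key):
    """Join streamed probe rows against a hash table built from build_table."""
    index = {}
    for row in build_table:
        index.setdefault(row[key], []).append(row)
    for probe in probe_rows:  # pulls from the scan lazily, row by row
        for match in index.get(probe[key], []):
            yield {**match, **probe}


orders = [{"cust": 1, "total": 50}, {"cust": 2, "total": 5}, {"cust": 1, "total": 70}]
customers = [{"cust": 1, "name": "Ada"}, {"cust": 2, "name": "Bob"}]

big_orders = scan(orders, lambda r: r["total"] > 20)
joined = hash_join(big_orders, customers, "cust")
first = next(joined)  # produced before the scan has reached the last order
```

The first joined row is available after the scan has read only one order; the rest of the scan runs only as the consumer asks for more rows.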

gNet Software Interconnect

In shared-nothing MPP database systems, data often needs to be moved whenever a join or an aggregation requires the data to be repartitioned across the segments. Greenplum’s gNet interconnect optimizes the flow of data to allow continuous pipelining of processing without blocking on all nodes of the system. The gNet interconnect is tuned to scale to tens of thousands of processors and uses commodity Gigabit Ethernet and 10GigE switch technology.

The gNet software interconnect is a supercomputing-based ‘soft-switch’ that is responsible for efficiently pumping streams of data between motion nodes during query-plan execution. It delivers messages, moves data, collects results, and coordinates work among the segments in the system. [8]

Parallel Dataflow Engine

The Parallel Dataflow Engine is an optimized parallel processing infrastructure that is designed to process data as it flows from disk, from external files or applications, or from other segments over the gNet interconnect. The engine is inherently parallel – it spans all segments of a Greenplum Database cluster and can scale effectively to thousands of commodity processing cores. The engine was designed based on supercomputing principles, with the theory that large volumes of data have ‘weight’ (i.e. aren’t easily moved around) and so processing should be pushed as close as possible to the data.[9]
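
The principle of pushing processing close to the data can be illustrated with a two-stage aggregation sketch in Python: each segment first reduces its own rows, and only the small partial results are combined. The function names and sample data are hypothetical; this is a simplified model, not the engine's implementation.

```python
from collections import Counter


def local_aggregate(segment_rows, group_key, value_key):
    """Stage 1: each segment sums its own rows; no raw rows leave the segment."""
    partial = Counter()
    for row in segment_rows:
        partial[row[group_key]] += row[value_key]
    return partial


def final_aggregate(partials):
    """Stage 2: only the small per-segment partial sums are combined."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


segments = [
    [{"region": "east", "sales": 10}, {"region": "west", "sales": 4}],
    [{"region": "east", "sales": 7}],
    [{"region": "west", "sales": 1}, {"region": "west", "sales": 2}],
]
partials = [local_aggregate(seg, "region", "sales") for seg in segments]
totals = final_aggregate(partials)
```

This works for sums because partial sums can be added together; the amount of data moved is proportional to the number of groups rather than the number of rows.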

Multi-level fault tolerance

Internally, the Greenplum system utilizes log shipping and segment-level replication to achieve redundancy, and provides automated failover. The system also provides multiple levels of redundancy and integrity checking. At the lowest level, Greenplum Database utilizes RAID-0+1 or RAID-5 storage to detect and mask disk failures. At the system level, Greenplum continuously replicates all segment and master data to other nodes within the system to ensure that the loss of a machine will not impact the overall database availability. Greenplum also utilizes redundant network interfaces on all systems, and specifies redundant switches in all reference configurations. [10]
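
Segment-level replication with automated failover can be sketched as a primary/mirror pair. The Python model below is a deliberate simplification: it copies each write synchronously rather than shipping logs, and the `Segment` and `MirroredSegment` classes and host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    host: str
    rows: list = field(default_factory=list)
    alive: bool = True


@dataclass
class MirroredSegment:
    primary: Segment
    mirror: Segment

    def write(self, row):
        """Apply the write to the primary and replicate it to the mirror."""
        self.primary.rows.append(row)
        self.mirror.rows.append(row)

    def read(self):
        """Serve from the primary, failing over to the mirror if it is down."""
        active = self.primary if self.primary.alive else self.mirror
        if not active.alive:
            raise RuntimeError("both copies of the segment are unavailable")
        return list(active.rows)


pair = MirroredSegment(Segment("host-a"), Segment("host-b"))
pair.write({"id": 1})
pair.primary.alive = False  # simulate losing the machine that hosts the primary
survivors = pair.read()
```

Placing the mirror on a different host from the primary is what lets the loss of one machine leave the data reachable.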

MPP Scatter/Gather Streaming technology

Greenplum’s Scatter/Gather (SG) Streaming technology achieves parallelism by 'scattering' data from all source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with little impact on concurrent database operations. Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines. Final 'gathering' and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed to the DBA via a programmable "external table" interface and a traditional command-line loading interface.[11]

Greenplum MapReduce

Greenplum executes both MapReduce and SQL directly within its parallel dataflow engine, enabling programmers to run analytics against petabyte-scale datasets stored inside and outside of the Greenplum Database. The engine can directly execute all necessary SQL building blocks, including performance-critical operations such as hash joins, multi-stage hash aggregation, and SQL:2003 windowing (part of the SQL:2003 OLAP extensions implemented by Greenplum), as well as arbitrary MapReduce programs.[12]
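
For readers unfamiliar with the programming model, the following Python sketch shows the generic map, shuffle, and reduce phases on a word-count example. It illustrates the general MapReduce model only; the function names are hypothetical and this is not Greenplum's own job-definition interface.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key so each key is reduced in one place."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(_key, values):
    return sum(values)


documents = ["big data big analytics", "data warehouse"]
counts = reduce_phase(shuffle(map_phase(documents, word_mapper)), count_reducer)
```

In a parallel system the shuffle step corresponds to moving pairs between nodes so that all values for a key arrive at the same place before reduction.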

Polymorphic data storage

For each table (or partition of a table), database administrators can select the storage, execution and compression settings that suit the way that table will be accessed. Greenplum Database transparently abstracts the details of any table or partition, allowing a wide variety of underlying models:[13]

  • Read/Write Optimized - Traditional 'slotted page' row-oriented table (based on PostgreSQL's native table type), optimized for fine-grained CRUD operations.
  • Row-oriented / Read-Mostly Optimized - Optimized for read-mostly scans and bulk append loads. DDL allows optional compression ranging from fast/light to deep/archival.
  • Column-oriented / Read-Mostly Optimized - Added in the 3.3.4 release, this provides a true column store when 'WITH (orientation=column)' is specified on a table. Data is vertically partitioned, and each column is stored in a series of large, densely packed blocks that can be compressed at levels ranging from fast/light to deep/archival (and tend to see notably higher compression ratios than row-oriented tables). For workloads suited to a column store, Greenplum's implementation scans only the columns required by the query, avoids the overhead of per-tuple IDs, and performs efficient early materialization using an optimized 'columnar append' operator.

When combined with Greenplum's multi-level table partitioning, database administrators can tune the storage types and compression settings of different partitions within the same table. For example, a single partitioned table could have older data stored as 'column-oriented with deep/archival compression', more recent data as 'column-oriented with fast/light compression', and the most recent data as 'read/write optimized' to support fast updates and deletes.[14]
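
The difference between row-oriented and column-oriented layouts can be made concrete with a small Python sketch that counts how many stored values a single-column query has to touch. The table, helper names, and the "values read" measure are hypothetical simplifications; real storage works in compressed disk blocks rather than Python lists.

```python
rows = [
    {"id": 1, "region": "east", "sales": 10},
    {"id": 2, "region": "west", "sales": 4},
    {"id": 3, "region": "east", "sales": 7},
]


def to_columns(row_table):
    """Vertically partition a row-oriented table into one list per column."""
    return {name: [row[name] for row in row_table] for name in row_table[0]}


def sum_row_store(row_table, column):
    """Row layout: every whole row is touched, even for a one-column query."""
    values_read = 0
    total = 0
    for row in row_table:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_column_store(column_table, column):
    """Column layout: only the requested column is touched."""
    values = column_table[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "sales")
col_total, col_reads = sum_column_store(columns, "sales")
```

Both layouts return the same total, but the column layout reads one value per row instead of every field, which is why it suits scan-heavy analytical queries over a few columns.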

Greenplum Hadoop (HD)

Greenplum HD is a supported version of the Apache Hadoop stack. It includes the Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, HBase, and ZooKeeper. Greenplum HD’s packaged Hadoop distribution removes the need to build out a Hadoop cluster from scratch, which is required with other distributions. Isilon scale-out file system technologies can also be deployed in the Greenplum HD Enterprise edition to improve the reliability of Greenplum HD systems.[15]

Greenplum Chorus

Greenplum Chorus is a social-network-style portal that lets data science teams search, explore, visualize, share, and import data from both within the organization and external sources. Chorus also uses technologies from VMware to let data science teams create virtual "sandboxes" for data exploration without the risk of corrupting the data source.[16]

Greenplum Data Computing Appliance (DCA)

The EMC Greenplum Data Computing Appliance (DCA) is a physical appliance built to integrate structured data, unstructured data, and Greenplum partner applications such as business intelligence (BI) and extract, transform and load (ETL) tools.[17] A special version of the DCA integrated with SAS software was released in 2011.[18]

Greenplum Command Center

Greenplum Command Center is a single console with a set of interactive dashboards, enabling administrators to collect performance metrics and manage system health for Greenplum products. Monitored data is also stored for historical reporting. Greenplum Command Center also has data collection agents that run on the Master Server and each Segment Server. The agents collect performance data on query execution and system utilization and send it to the Greenplum Command Center at regular intervals. [19]

Greenplum Analytics Lab

Greenplum Analytics Lab is a data science consultation service provided by Greenplum’s team of data scientists. Various packages are available depending on the client’s budget and requirements.[20]

  • Analytics Lab Primer: a one-day workshop that introduces Big Data analytics.
  • Analytics Lab 100: a two-week consultation for quick insights, with training on analytics methods and tools by a Greenplum data scientist.
  • Analytics Lab 600: a six-week consultation that extends Lab 100 and focuses on creating a ready-to-deploy analytical model, along with the training and documentation to support future use.
  • Analytics Lab 1200: a twelve-week consultation that extends Lab 600 to deploy more complex models.

Platform support

  • Greenplum Database is supported for production use on SUSE Linux Enterprise Server 10.2 (64-bit), Red Hat Enterprise Linux 5.x (64-bit), CentOS Linux 5.x (64-bit) and Sun Solaris 10U5+ (64-bit). Greenplum Database 3.3 is supported on server hardware from a range of vendors including HP, Dell, Sun and IBM.[7]
  • Greenplum Database is supported for non-production (development and evaluation) use on Mac OS X 10.5, Red Hat Enterprise Linux 5.2 or higher (32-bit) and CentOS Linux 5.2 or higher (32-bit).[21]

Customers

Greenplum has over 500 customers in industries including financial services, telecommunications, Internet, retail, transportation and pharmaceuticals.[22] Greenplum customers include Silver Spring Networks, Zions Bancorporation, Reliance Communications, NYSE Euronext, Bakrie Telecom, Orbitz, Havas Digital, China Unicom, ClickFox, Franklin Templeton Investments and Tagged.[23]

Community Edition

Like many other enterprise software vendors, Greenplum provides a community edition of its "cutting edge" database,[24] as well as community forums.[25] However, the community forums are poorly supported in comparison with the MySQL communities.[26]

Criticism

Greenplum Database has a limitation on indexing: a unique index and a primary key index cannot be used at the same time on a table.[27]

Partnerships

Greenplum's partners include XBRLWorks,[28][29] Cisco, Brocade Communications Systems, SAS, Factual, Alpine Data Labs, MicroStrategy, and Informatica.[30]

Competition

Greenplum's main competitors are vendors such as IBM, Oracle Exadata, Teradata, Netezza, SAP, and Vertica. [31]

References

  1. ^ "EMC to Acquire Greenplum". July 6, 2010. Retrieved July 5, 2012.
  2. ^ "History". Greenplum company website. Retrieved August 7, 2012.
  3. ^ "Management Team". Retrieved August 7, 2012. 
  4. ^ [Citation text missing in the source.]
  5. ^ Gonsalves, Antone (February 22, 2008). "Greenplum Updates Open-Source Based Database". 
  6. ^ a b "Understanding Greenplum". Retrieved August 7, 2012. 
  7. ^ a b "Greenplum Database Release 4.2.1.0 Documentation". Retrieved August 7, 2012. 
  8. ^ "gNet Software Interconnect". Retrieved August 7, 2012. 
  9. ^ "Parallel Dataflow Engine". Retrieved August 7, 2012. 
  10. ^ "Multi-Level Fault Tolerance". Retrieved August 7, 2012. 
  11. ^ "Greenplum aims to eliminate massive data load 'choke points' with Scatter/Gather technology". Retrieved August 7, 2012. 
  12. ^ "Greenplum Brings MapReduce to the Enterprise". Retrieved August 7, 2012. 
  13. ^ "Greenplum is going hybrid columnar as well". Retrieved August 7, 2012. 
  14. ^ "Greenplum Adds Column-Oriented Table Feature to Greenplum Database". Retrieved August 7, 2012. 
  15. ^ "EMC Marries Isilon with Greenplum Hadoop Distribution". Retrieved August 7, 2012. 
  16. ^ "EMC Marries Social Networking And Big Data". Retrieved August 7, 2012. 
  17. ^ "Greenplum appliances swing both ways". Retrieved August 7, 2012. 
  18. ^ "EMC Greenplum inks SAS partnership, launches new appliances". Retrieved August 7, 2012. 
  19. ^ "EMC cranks Greenplum database to 4.2". Retrieved August 7, 2012. 
  20. ^ "Greenplum Analytics Lab". Retrieved August 7, 2012. 
  21. ^ "Greenplum Database: Community Edition". Retrieved August 7, 2012. 
  22. ^ "ebay's two enormous data warehouses". Retrieved August 7, 2012. 
  23. ^ "Our Customers". Retrieved August 7, 2012. 
  24. ^ "Greenplum DB Community Edition". 
  25. ^ "Greenplum Communities". 
  26. ^ "MySQL Community Forum". 
  27. ^ "Greenplum db - may not be as ready as you think". 
  28. ^ "XBRLWorks - Data Science Partners". 
  29. ^ "XBRLWorks". 
  30. ^ "Greenplum Partners". Retrieved October 29, 2010. 
  31. ^ "Magic Quadrant for Data Warehouse Database Management Systems". Retrieved August 7, 2012. 
