- Developer(s): HPCC Systems, LexisNexis Risk Solutions
- Written in: C++, ECL
- License: Apache License 2.0
HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.
Many organizations have large amounts of data, collected and stored in massive datasets, that must be processed and analyzed to provide business intelligence, improve products and services for customers, or meet other internal data processing requirements. For example, Internet companies need to process data collected by Web crawlers as well as logs, click data, and other information generated by Web services. Parallel relational database technology has not proven cost-effective or capable of the high performance needed to analyze massive amounts of data in a timely manner. As a result, several organizations developed technologies that utilize large clusters of commodity servers to provide high-performance computing capabilities for processing and analyzing massive datasets. Clusters can consist of hundreds or even thousands of commodity machines connected using high-bandwidth networks. Examples of this type of cluster technology include Google's MapReduce, Apache Hadoop, Aster Data Systems, Sector/Sphere, and the LexisNexis HPCC platform.
High-performance computing (HPC) describes computing environments which utilize supercomputers and computer clusters to address complex computational requirements, support applications with significant processing time requirements, or process significant amounts of data. Supercomputers have generally been associated with scientific research and compute-intensive problems, but increasingly supercomputer technology is appropriate for both compute-intensive and data-intensive applications. A trend in supercomputer design for high-performance computing is the use of clusters of independent processors connected in parallel. Many computing problems are suitable for parallelization: often a problem can be divided so that each independent processing node works on a portion of the data in parallel, with the partial results then combined into a final result. This type of parallelism is often referred to as data-parallelism, and data-parallel applications are a potential solution to petabyte-scale data processing requirements. Data-parallelism can be defined as a computation applied independently to each item of a data set, which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance in high-performance computing, which may result in several orders of magnitude of performance improvement.
Commodity computing clusters
The resulting economies of scale of using multiple independent processing nodes for supercomputer design to address high-performance computing requirements led directly to the implementation of commodity computing clusters. A computer cluster is a group of individual computers linked by high-speed communications in a local area network topology, using technology such as gigabit network switches or InfiniBand, and incorporating system software which provides an integrated parallel processing environment for applications with the capability to divide processing among the nodes in the cluster. Cluster configurations can not only improve the performance of applications which previously used a single computer, but also provide higher availability and reliability, and are typically much more cost-effective than single supercomputer systems with equivalent performance. The key to the capability, performance, and throughput of a computing cluster is the system software and tools used to provide the parallel job execution environment. Programming languages with implicit parallel processing features and a high degree of optimization are also needed to ensure high-performance results as well as high programmer productivity. Clusters allow the data used by an application to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data.
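One common way to partition data among cluster nodes is to hash a record key, so that every record with the same key lands on the same node and each node can process its partition independently. A minimal Python sketch (the node count and records are hypothetical, and a real cluster would ship each partition to a different machine):

```python
from collections import defaultdict
from zlib import crc32

N_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Assign a record to a node by hashing its key (CRC32 is deterministic
    across processes, unlike Python's built-in hash)."""
    return crc32(key.encode()) % N_NODES

records = [("smith", 1), ("jones", 2), ("lee", 3), ("smith", 4)]

# split the data into per-node partitions
partitions = defaultdict(list)
for key, value in records:
    partitions[node_for(key)].append((key, value))
```

Since the assignment depends only on the key, operations that group by key (joins, aggregations, deduplication) never need data from another node's partition.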
Commodity computing clusters are configured using commercial off-the-shelf server computers. Rack-mounted servers or blade servers each with local memory and disk storage are often used as processing nodes to allow high-density configurations which facilitate the use of very high-speed communications equipment to connect the nodes (Figure 1). Linux is widely used as the operating system for computer clusters.
The HPCC system architecture includes two distinct cluster processing environments, each of which can be optimized independently for its parallel data processing purpose. The first of these platforms is called a data refinery. Its overall purpose is the general processing of massive volumes of raw data of any type for any purpose, but it is typically used for data cleansing and hygiene; extract, transform, load (ETL) processing of raw data; record linking and entity resolution; large-scale ad hoc complex analytics; and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. The data refinery is also referred to as Thor, a reference to the mythical Norse god of thunder with the large hammer, symbolic of crushing large amounts of raw data into useful information. A Thor cluster is similar in its function, execution environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms.
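The cleansing and record-linking steps a data refinery performs can be illustrated conceptually in Python (this approximates the idea rather than HPCC's actual ECL implementation; the record fields and linking key are invented):

```python
def cleanse(record):
    """Cleansing/hygiene step: trim whitespace and standardize case."""
    return {k: v.strip().lower() for k, v in record.items()}

def link_records(records):
    """Record linking / entity resolution: group cleansed records
    that share a normalized (name, zip) key into one entity."""
    entities = {}
    for rec in records:
        key = (rec["name"], rec["zip"])
        entities.setdefault(key, []).append(rec)
    return entities

raw = [
    {"name": "  John Smith ", "zip": "30005"},
    {"name": "JOHN SMITH",    "zip": "30005"},
    {"name": "Jane Doe",      "zip": "10001"},
]

# after cleansing, the two "John Smith" variants resolve to one entity
entities = link_records([cleanse(r) for r in raw])
```

At HPCC scale the same logic runs data-parallel across a Thor cluster, with the hash-partitioning of keys ensuring candidate matches meet on the same node.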
Figure 2 shows a representation of a physical Thor processing cluster which functions as a batch job execution engine for scalable data-intensive computing applications. In addition to the Thor master and slave nodes, additional auxiliary and common components are needed to implement a complete HPCC processing environment.
The second of the parallel data processing platforms is called Roxie and functions as a rapid data delivery engine. This platform is designed as an online high-performance structured query and analysis platform or data warehouse, meeting the parallel data access requirements of online applications through Web services interfaces, and supporting thousands of simultaneous queries and users with sub-second response times. Roxie utilizes a distributed indexed filesystem to provide parallel processing of queries, using an execution environment and filesystem optimized for high-performance online processing. A Roxie cluster is similar in its function and capabilities to Hadoop with HBase and Hive capabilities added, and provides near-real-time, predictable query latencies. Both Thor and Roxie clusters utilize the ECL programming language for implementing applications, increasing continuity and programmer productivity.
Figure 3 shows a representation of a physical Roxie processing cluster which functions as an online query execution engine for high-performance query and data warehousing applications. A Roxie cluster includes multiple nodes with server and worker processes for processing queries; an additional auxiliary component called an ESP server which provides interfaces for external client access to the cluster; and additional common components which are shared with a Thor cluster in an HPCC environment. Although a Thor processing cluster can be implemented and used without a Roxie cluster, an HPCC environment which includes a Roxie cluster should also include a Thor cluster. The Thor cluster is used to build the distributed index files used by the Roxie cluster and to develop online queries which will be deployed with the index files to the Roxie cluster.
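The division of labor between the two clusters can be sketched as a batch phase that builds a sorted, keyed index and an online phase that answers keyed queries against it. A conceptual Python sketch (the records and key are invented; real Thor index builds and Roxie lookups are distributed across nodes):

```python
import bisect

# batch phase (Thor-like): build a sorted index of (key, payload) pairs
records = [("cobb", "GA"), ("ada", "ID"), ("kent", "MI"), ("bexar", "TX")]
index = sorted(records)          # Thor performs this sort in parallel
keys = [k for k, _ in index]

# online phase (Roxie-like): keyed lookup via binary search,
# giving fast, predictable per-query latency
def query(key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]
    return None

print(query("kent"))   # prints MI
print(query("wayne"))  # not indexed; prints None
```

The key point is that the expensive work (sorting and index construction) happens once in the batch engine, so the online engine only ever performs cheap indexed reads.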
The HPCC software architecture incorporates the Thor and Roxie clusters as well as common middleware components, an external communications layer, client interfaces which provide both end-user services and system management tools, and auxiliary components to support monitoring and to facilitate loading and storing of filesystem data from external sources. An HPCC environment can include only Thor clusters, or both Thor and Roxie clusters. The overall HPCC software architecture is shown in Figure 4.
See also

- High-performance computing
- Computer cluster
- List of important publications in concurrent, parallel, and distributed computing
- Parallel computing
- Distributed computing
- Parallel programming model
- Data parallelism
- Big data
- Implicit parallelism
- Declarative programming
- Data-intensive computing