Clusterpoint

Developer(s): Clusterpoint Ltd.
Initial release: 2006
Stable release: 2.2 / February 15, 2013
Development status: Active
Written in: C++
Operating system: Cross-platform
Available in: English
Type: NoSQL document-oriented database
License: Multi-license (free and commercial)
Website: www.clusterpoint.com

Clusterpoint is commercial enterprise NoSQL database server software for the design, search and secure operational management of distributed document-oriented (XML/JSON) data stores.[1][2][3] The software makes it possible to build a high-performance computing cluster for database operations, in which each cluster node stores part of the database content while the database as a whole uses the combined local computing and storage resources of all cluster nodes. A cluster database can be replicated into multiple local copies for search performance scalability, or across multiple datacenters for high availability. The software runs on commercial off-the-shelf (COTS) hardware.[4]

Clusterpoint DBMS delivers scalable, high-performance Internet-search-like and SQL-like querying across all database content through a single API.[5] Sub-second response latency for search queries is achieved using a presorted indexing method.[6]

The technology is based on a customizable ranking index that can be tuned to match the natural language terms in queries to the most relevant data content in a customer database. When a distributed cluster database is queried with free-format natural language keywords or phrases, the ranking index sorts the most relevant data up front, reducing the amount of data that must be read for each query in larger deployments.

As a result, fast and relevant full-text search[7] is the preferred information access method in Clusterpoint databases, while the capability to query the database structure flexibly is retained. Both search methods can be combined in a single query for unified data access.

Use cases

The technology addresses information overflow and latency problems in interactive web and mobile GUI-based database applications, where limited screen size and bandwidth restrictions prevent users from requesting and processing large query responses.

A scalable ranking index sorts relevant data and returns information page by page in order of decreasing relevance. This computing model delivers predictable sub-second search latency in very large database systems, including those with billions of data objects, without flooding search results with irrelevant data. Customers can search distributed cluster databases without the performance degradation characteristic of SQL databases as data volume grows.

Clusterpoint works as a NoSQL data hub platform, integrating data from different sources and providing web and mobile interfaces with fast, relevance-sorted database search and navigation functionality. Customers can unify hybrid data types from other databases into an open, vendor-neutral, cross-platform XML data format, combining in a single database structured data (date, numeric, character), unstructured data (textual) and semi-structured data such as metadata extracted from blobs, images, voice and video files and stored alongside the original data files. All database content is indexed with a single ranking index for unified access, search and analysis.[8]

Distinctive technology

Programmable ranking index rules are defined in the Policy file,[9] an XML configuration file accompanying each Clusterpoint database. By adjusting ranking rules, customers can configure various grouping, ordering and positioning algorithms for their search results through the ranking index, tuning it toward the desired end-user search experience. Once established for a particular database, a set of ranking configuration rules is applied and maintained automatically by Clusterpoint Server when customer data is loaded or updated through Clusterpoint database CRUD API commands[10] or when the database is reindexed.[11]

In the Clusterpoint architecture, only the database ranking index must be modified to change database search behavior. No changes are required to the original customer data objects, which are stored as plaintext XML in a Clusterpoint database. Customer application software design and code can be simplified by moving indexing and search sorting details into the Document policy. Policy configuration determines the final organization of the ranking index at the physical storage level by presorting the actual index data for custom relevance algorithms. Customers can avoid SQL programming for data sorting and grouping in their code; instead, the database ranking index delivers this functionality.

Presorted indexing is the preferred computing model in the Clusterpoint architecture whenever raw search performance and predictable sub-second latency for database querying services are top customer priorities. Its drawback is that algorithmic data sorting options in customer application code are less flexible.

Commercial deployments

Clusterpoint DBMS is used to build and operate scalable backend application solutions for web GUIs and mobile devices that need to manage increasing volumes of variable data objects in real time using a distributed database architecture. The software is used in Internet search services,[12] government databases, SIEM solutions for machine-generated data management, lawful intercept data acquisition solutions,[13] content aggregator systems, business directory services and news agency systems.

Clusterpoint DBMS has been deployed in commercial customer accounts to operate public 24/7 web and mobile Internet services since 2008.[14]

Platform Components

Technically, Clusterpoint DBMS software comprises a scalable XML/JSON database server, distributed cluster data storage and an enterprise search engine engineered into a single platform: Clusterpoint Server. Clusterpoint Server software is installed on each cluster node and is managed across all clusters with the Clusterpoint Manager application, which provides centralized administration and control of all databases through a single web GUI. The Clusterpoint Server software is developed in the C++ programming language and supports multi-threading and multi-core CPUs. The primary method of access to the platform is an XML/JSON-based web API.[15]

Architecture

A Clusterpoint database has a multi-master, shared-nothing document store architecture and supports fault-tolerant infrastructure with no single point of failure, including multi-datacenter replication of a distributed database. Internally, Clusterpoint data is stored as schema-free XML in a structure defined by the customer.

In the Clusterpoint architecture, a single index per database provides all search features and customizable relevance sorting for structured (dates, numbers, characters) and unstructured (full-text) data. New content is indexed in real time, and index data can be read and searched immediately after each document has been inserted, updated or deleted.

To query a database, customers use a single unified Internet-search-like and SQL-like query API.[16][17]

Database model

The database model is a schema-free XML/JSON document database with an arbitrary customer-defined data structure, containing machine-readable and searchable data.

Documents are identified by an identification string that is unique across the entire cluster database: the document ID.[18] A document ID works similarly to an Internet URL and can be any free-format XML tag value or customer-defined string value assigned as the unique document identifier (examples: customer email address, Internet domain name, product code, social security number, car registration number, geographic address, checksum of a blob object etc.).

When an SQL database with multiple tables is migrated to the Clusterpoint database model, it must be denormalized. All externally linked tables have to be embedded into the XML parent-child tag hierarchy used in the Clusterpoint database model. The technical encoding introduced by relational database normalization can then be safely removed. Most primary key indexes, foreign key indexes and software codes for categorized textual values can be replaced with their natural language equivalents. The Clusterpoint database model encourages natural language text values in data items, so that data can be ranked for meaningful relevance within its surrounding context and users can search it using Internet-search-like free-format keywords or phrases.
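The denormalization step described above can be sketched in Python. This is only an illustration under stated assumptions: the table contents, the tag names (customer, orders, order) and the city code lookup are hypothetical and are not taken from Clusterpoint documentation. It shows a linked table being embedded as child tags and a technical code being replaced with its natural language value.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational rows: one customer with two rows in a linked table.
customers = [{"id": 1, "name": "John Smith", "city_code": "LON"}]
orders = [
    {"customer_id": 1, "product": "Laptop", "total": 1200},
    {"customer_id": 1, "product": "Mouse", "total": 25},
]
# Hypothetical lookup that maps a technical code to a natural language value.
city_names = {"LON": "London"}


def denormalize(customer, orders, city_names):
    """Embed the linked order rows under the customer as child tags."""
    doc = ET.Element("customer")
    ET.SubElement(doc, "name").text = customer["name"]
    # Store the readable city name instead of the code or a foreign key.
    ET.SubElement(doc, "city").text = city_names[customer["city_code"]]
    orders_el = ET.SubElement(doc, "orders")
    for row in orders:
        if row["customer_id"] != customer["id"]:
            continue  # row belongs to another customer
        order_el = ET.SubElement(orders_el, "order")
        ET.SubElement(order_el, "product").text = row["product"]
        ET.SubElement(order_el, "total").text = str(row["total"])
    return doc


document = denormalize(customers[0], orders, city_names)
print(ET.tostring(document, encoding="unicode"))
```

The resulting document no longer carries the customer_id foreign key or the city code; the parent-child tag hierarchy expresses the relationship instead.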

Query syntax

Clusterpoint API query syntax supports natural language keyword and phrase queries, wildcards in search terms, per-character template queries and structured SQL-like field queries. The following examples illustrate the key principles of the XML-centric query syntax of the Clusterpoint database.

Search across all database content for documents matching all keywords:

Example 1: <query> php developer london </query>

Search across all database content for the exact phrase (all terms in sequence):

Example 2: <query> "john smith" </query>

Free keyword search using wildcards matching "john", "johny", "smith", "smitley" etc.:

Example 3: <query> joh* smit* </query>

Phrase search query with wildcards:

Example 4: <query> "john smit*" </query>

Search for terms by pattern matching using character positioning templates:

Example 5: <query> jo?n sm[iy]th </query>
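The term patterns in Examples 3 and 5 resemble shell-style globs. As a rough illustration, and assuming that `*`, `?` and `[...]` behave as they do in globs (the source does not specify the exact matching rules), Python's standard fnmatch module can show which words each pattern would accept. The word list and helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def term_matches(pattern, word):
    """Case-insensitive glob match of a single query term against one word."""
    return fnmatchcase(word.lower(), pattern.lower())


def matching_words(pattern, words):
    """Return the words from the list that the pattern accepts, in order."""
    return [word for word in words if term_matches(pattern, word)]


# Hypothetical vocabulary taken from the names used in the examples.
words = ["john", "johny", "joan", "smith", "smyth", "smitley"]

print(matching_words("joh*", words))      # ['john', 'johny']
print(matching_words("smit*", words))     # ['smith', 'smitley']
print(matching_words("jo?n", words))      # ['john', 'joan']
print(matching_words("sm[iy]th", words))  # ['smith', 'smyth']
```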

Search with a combined Internet-search-like and SQL-like query. The following example combines full-text search, numeric range search and a data structure field search that selects only a subset of the data, similarly to SQL SELECT ... WHERE ... statements:

Example 6: <query> php developer <salary>3500..4500</salary> <city>London</city> </query>

Multiple combined query rules can be constructed in a single database query, using nesting brackets ((())) as Boolean AND, {} as OR and ~ as NOT logic operators. Full-text keyword and phrase terms can be used as illustrated in the following example, which queries for PHP or Java developers who are not in an expert position and are located in one of three particular cities:

Example 7: <query> {php java} developer ~expert <city>{london "new york" beijing}</city> </query>
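The Boolean logic of the full-text part of Example 7 can be illustrated with a small Python evaluator. This sketch does not parse Clusterpoint query syntax; the tree is built by hand, the node names ("term", "and", "or", "not") are hypothetical, and the sample documents are invented for illustration.

```python
def evaluate(node, words):
    """Evaluate a tiny Boolean query tree against a set of document words."""
    kind = node[0]
    if kind == "term":
        return node[1] in words
    if kind == "and":
        return all(evaluate(child, words) for child in node[1:])
    if kind == "or":
        return any(evaluate(child, words) for child in node[1:])
    if kind == "not":
        return not evaluate(node[1], words)
    raise ValueError(f"unknown node type: {kind}")


# Hand-built tree for: {php java} developer ~expert
query = (
    "and",
    ("or", ("term", "php"), ("term", "java")),
    ("term", "developer"),
    ("not", ("term", "expert")),
)

# Hypothetical documents, each reduced to its set of words.
docs = {
    "cv-1": {"php", "developer", "london"},
    "cv-2": {"java", "expert", "developer"},
    "cv-3": {"python", "developer"},
}

matches = [doc_id for doc_id, words in docs.items() if evaluate(query, words)]
print(matches)  # ['cv-1']
```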

To search specific data structure fields in a multi-level XML document hierarchy, XPath-style nested XML syntax is used. The following example searches only in the <city> tag that is a child of the parent <address> tag:

Example 8: <query> <address><city>London</city></address> </query>

History

Development of Clusterpoint DBMS was begun in August 2006 by Clusterpoint Ltd., a privately held European technology company run by a co-founding team of software engineers led by Gints Ernestsons.[19][20][21][22][23]

The first public Clusterpoint DBMS software release was in January 2008.

The current Clusterpoint Server production version is 2.2.

The next Clusterpoint Server version, 2.3, was under development by the vendor as of June 2013.

Features

General features

  • Data is managed in open, cross-platform, industry standard XML format, used internally at data store level and in XML API[24]
  • Data structure agnostic and type-rich database, handles variable data structure XML documents in a single database. Supports unstructured textual data, dates, numbers, meta-data (all XML types)
  • Cross-platform support: binaries are available for Linux, FreeBSD and Mac OS X. Clusterpoint Server software can be compiled on other operating systems.
  • Multi-master cluster software architecture: no single point of failure, any cluster node can serve as a master and run the management application
  • Horizontal database scalability: scales out from a single server to hundreds of servers networked into a cluster infrastructure

Access features

  • REST API is used for JSON document format compatibility[25]
  • Consistent UTF-8 encoding. Non-UTF-8 data can be saved, queried, and retrieved with a special binary data type.
  • XML objects for API queries and responses: enable direct integration in other programming languages supporting XML parsing, no specific client software required

Search/query features

  • Built-in enterprise search functionality: full-text phrase and keywords search, result snippeting, highlighting, term proximity search
  • Internet-search-like free-format ad hoc queries across all database structure, using natural language keywords and phrases
  • Querying with term stemming, term wildcards and character position patterns delivering self merge-joins[26] for inflected words and plural word forms
  • SQL-like XML-structure (fielded) queries like in SQL SELECT ... WHERE ... statements
  • Cluster-wide analytics aggregation with MIN(), MAX(), COUNT(), AVG() like in SQL SELECT ... GROUP BY ..., ORDER BY ... statements
  • Sorting of results in alphabetic, numeric, date order or according to result relevance
  • Autocomplete (instant search as you type) using the actual index data
  • Spell-check of query terms with alternative spelling suggestions for "Did you mean that?" functionality
  • Boosting of search query terms at query time, in order to increase, decrease or overwrite through the API relevancy weights or sorting rules built into the ranking index
  • Dynamic data classification per query by multi-level customer defined facets with exact hit counting (examples: categories, themes, product catalogs, geographic locations etc.)
  • Text-analytics driven similar content search across the entire database
  • XML data structure relevance ranking by tag weighting and document relevance ranking by document rating
  • Textual relevance ranking for matching search query terms to context, taking into account frequency and density of natural language terms
  • Predictive calculation of expected number of results based on the actual index statistics in large size databases to optimize performance

Administration/production use features

  • Granular security partitioning: API users and their access rights are based on groups and permissions assigned per specific databases and API commands
  • Transaction journaling, access logs, error logs and audit logs enabled by default
  • Document versioning enabled by default (preserving previous document versions for a certain time period)
  • Reindexing in background with automatic switchover provides availability during reindexation
  • Online, offline and incremental database backup
  • Automatic or manual synchronization of database replicas
  • Multiple administrator accounts for secure multi-tenancy of different customer databases on the same hardware
  • Centralized web GUI based database administration, including one-click configuration of clustered and replicated databases across all nodes[27]

Automatic full database content indexing

Clusterpoint software automatically builds and maintains a content index for a document-type XML database when data is loaded, updated or deleted. A single ranking index is maintained to support these types of querying:

  • natural language based full text search, including language-specific stemming and collation rules
  • XML data structure queries (with full-text, exact match and binary match options)
  • virtual data structure search created from aliasing multiple real tags values to speed up Boolean OR queries
  • ad hoc search across all database content irrespective of the database structure
  • numeric and date range search
  • geospatial search by range, distance or polygon coordinates and ordering by distance from a certain point
  • multi-level faceted search with automatic results classification by XML tags assigned as containing facets.[28]
  • combination of any of the above database search criteria into complex nested multi-part query expressions using Boolean AND, OR, NOT logic
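
The faceted search item in the list above, with exact hit counting per facet value, can be sketched in a few lines of Python. This is a toy illustration only: the documents, the field name "city" and the function name are hypothetical, and the sketch counts over an in-memory result list rather than over an index.

```python
from collections import Counter


def facet_counts(documents, facet_field):
    """Count how many hits fall under each value of the chosen facet field."""
    return Counter(
        doc[facet_field] for doc in documents if facet_field in doc
    )


# Hypothetical search hits; the last one has no value for the facet field.
hits = [
    {"title": "PHP developer", "city": "London"},
    {"title": "Java developer", "city": "London"},
    {"title": "PHP developer", "city": "Beijing"},
    {"title": "Python developer"},
]

counts = facet_counts(hits, "city")
print(counts.most_common())  # [('London', 2), ('Beijing', 1)]
```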

Ranking index

A scalable ranking index presorts the content access references of a Clusterpoint database for fast database search, including full-text search. It sorts data access pointers by customizable relevance weighting attributes that the customer can set at the database configuration level. The ranking index differs from traditional SQL-type B-tree or R-tree indexes. It has an inverted index design, engineered to scale out linearly in a rack-and-stack COTS hardware cluster architecture so that it can support millisecond-latency textual search across many billions of data objects per distributed database.[29]

The ranking index eliminates the repetitive data sorting characteristic of SQL database servers. SQL databases often consume excessive computing resources for data sorting in large databases, in particular when sorting and ordering information from multiple tables with SQL SELECT WHERE ... JOIN ... GROUP BY ... ORDER BY statements.

Data grouping, sorting and positioning for relevance

The Clusterpoint database supports grouping and ordering functionality similar to SQL's GROUP BY and ORDER BY statements. However, the data sorting features are implemented differently.

The sorting rules are "hard-wired" into the physical data files of the ranking index. The ranking index organizes database access rules at the physical disk level using sequential I/O access methods. This results in high-performance disk reads during database search and navigation, so that query results can be delivered to customer applications with minimal latency. Clusterpoint Server does not need to sort data at query time: it follows the ranking index rules and delivers data to users in portions sorted from most relevant to least relevant.
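
The idea of sorting at write time and only slicing at query time can be illustrated with a toy in-memory inverted index in Python. This is a minimal sketch under stated assumptions, not Clusterpoint's on-disk format: the class name, the single numeric weight per document and the paging method are all hypothetical.

```python
import bisect


class RankingIndex:
    """Toy inverted index whose posting lists are kept presorted by weight."""

    def __init__(self):
        # term -> list of (-weight, doc_id); ascending order on -weight means
        # the most relevant document always sits at the head of the list.
        self.postings = {}

    def add(self, doc_id, terms, weight):
        """Index a document at write time, keeping each posting list sorted."""
        for term in terms:
            bisect.insort(self.postings.setdefault(term, []), (-weight, doc_id))

    def page(self, term, page_number, page_size):
        """Return one page of doc IDs by slicing; nothing is sorted here."""
        start = page_number * page_size
        hits = self.postings.get(term, [])[start:start + page_size]
        return [doc_id for _, doc_id in hits]


index = RankingIndex()
index.add("doc-1", ["php", "developer"], weight=0.4)
index.add("doc-2", ["php"], weight=0.9)
index.add("doc-3", ["php", "london"], weight=0.7)

print(index.page("php", page_number=0, page_size=2))  # ['doc-2', 'doc-3']
print(index.page("php", page_number=1, page_size=2))  # ['doc-1']
```

Because each posting list is already in relevance order, the first page only touches the head of the list, however many documents match the term.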

Database ranking rules need to be established by the database architect at the database configuration level using the Policy file. The Policy is an XML configuration file containing all database indexing, search grouping and sorting rules, reflecting customer business logic or the actual search needs of the application.[30]

Customers can override the default ranking index configuration rules from their application code through the Clusterpoint API, boosting or decreasing the relevance of individual query terms.

Database administration

Clusterpoint Server can be controlled centrally through the Clusterpoint Manager application. Using a web GUI dashboard, administrators control all their database services enterprise-wide, including cluster database administration, configuration of indexing and ranking policy, secure user account management, audit and log file viewing, database backup and restore, and database sharding and replication.

Each database is started and stopped as a separate database server process on each cluster node, allowing controlled management of CPU resources, RAM and disk storage. All cluster databases share a single networked computing and storage infrastructure and must be managed accordingly.

Clusterpoint Manager is used to manage the underlying hardware resources so that different cluster databases can operate in parallel. For maximum performance, cluster nodes should have free RAM and disk storage capacity and a dedicated network switching fabric among them.

Process and storage architecture

Technically, each named Clusterpoint database is a safely isolated process that runs in its own memory address space. It can access only its own local file system storage folder, which has the same name and contains the XML documents, index, configuration and log files of that database stored on the local cluster node (shard). This architecture delivers elastic horizontal scale-out and cluster-wide control over resource consumption for a particular customer database. It makes it possible to customize storage allocation for each database cluster-wide through sharding and replication, so that the performance and data growth patterns of one database do not negatively affect other databases running on the same hardware.

A cluster-wide database is created "virtually" by Clusterpoint Server software from all local databases of the same name. The ranking index stored locally on each cluster node is engineered with Lego-like modularity and appears to the Clusterpoint Server software as one large "virtual index". Administrators can start or stop cluster databases with one-click controls, safely enabling or preventing online access to database storage.
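
One way to picture how several presorted local indexes can behave as a single "virtual index" is a k-way merge of per-shard result streams. The Python sketch below is an assumption about the general technique, not a description of Clusterpoint internals; the shard contents, scores and function name are hypothetical.

```python
import heapq
from itertools import islice


def top_k(shard_results, k):
    """Merge per-shard hit lists, each sorted by descending score, lazily."""
    # heapq.merge expects every input ordered by the key in ascending order,
    # so negating the score keeps the overall stream in descending score order.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


# Hypothetical (score, doc_id) hits returned by three cluster nodes.
shards = [
    [(0.9, "doc-a"), (0.4, "doc-b")],
    [(0.8, "doc-c"), (0.1, "doc-d")],
    [(0.7, "doc-e")],
]

print(top_k(shards, 3))  # [(0.9, 'doc-a'), (0.8, 'doc-c'), (0.7, 'doc-e')]
```

Because the merge is lazy, only the head of each shard's list is consumed to produce the first page of results.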

Multi-tenancy and virtualization

Multi-tenant database services using Clusterpoint DBMS can securely partition their runtime computing environment among named processes and named file folder storage resources on local nodes, while running multiple databases in parallel on the same equipment. This method makes good use of modern multi-core CPU hardware and, for high-performance database computing with Clusterpoint software, is preferred over operating-system-level virtualization as a means of multi-tenancy. OS-level virtualization may significantly decrease the network bandwidth available among a large number of cluster nodes running a large Clusterpoint database, which can increase application latency. Virtualization can still be used for smaller-scale installations, prototyping and development, where operational performance guarantees and low latency are not the first priority.

Multi-copy database replication

Scalability of search and data access performance, as well as fault tolerance, is delivered through multi-copy replication of a cluster database. Clusterpoint Server software can be configured to work with multiple working database copies, each additional replica running on its own hardware cluster and being synchronized using the MVCC method. Database replicas can be located in multiple datacenters and managed through the same management interface. All replicas are equal in the Clusterpoint architecture and are used for automatic load balancing of database search queries through the Clusterpoint API.
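
The source does not say which load-balancing policy is used across equal replicas. As one simple possibility, a round-robin selector can be sketched in Python; the class name and replica names below are hypothetical.

```python
from itertools import cycle


class ReplicaBalancer:
    """Hand out equal replicas in turn so search queries are spread evenly."""

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one replica is required")
        self._rotation = cycle(replicas)

    def pick(self):
        """Return the replica that should serve the next search query."""
        return next(self._rotation)


# Hypothetical replica names for two datacenters.
balancer = ReplicaBalancer(["dc1-replica", "dc2-replica"])
print([balancer.pick() for _ in range(4)])
# ['dc1-replica', 'dc2-replica', 'dc1-replica', 'dc2-replica']
```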

In multi-datacenter use, network bandwidth among locations may become a critical issue for the Clusterpoint architecture because of increased latency for database updates and synchronization delays among replicas, in particular if encrypted VPN networking over Internet links is used. Dedicated bandwidth is the preferred option for high-performance database replication.

Extendable server-side scripting with Lua

The Lua scripting language extends Clusterpoint Server functionality with custom server-side scripts. Lua scripts can implement customer-specific functions such as data aggregation, ETL tasks, metadata markup, callbacks to external programming languages through web services for extra functionality, real-time alerting or asynchronous triggers. Scripts can be executed before, during or after the Clusterpoint API transactions of interest. Built-in configurable server-side hooks activate Lua scripts at different stages of each Clusterpoint transaction.

Custom Lua scripts can be stored in Clusterpoint Server to work as "stored procedures".[31]

Programming language support

Clusterpoint DBMS uses REST principles and HTTP/HTTPS messaging for client-server communication between customer applications and Clusterpoint Server. Any client programming language or development environment that supports HTTP POST/GET messaging can connect to Clusterpoint Server directly and read, write, update, delete and search XML documents.[32]
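
Since any HTTP-capable client can talk to the server, a request can be prepared with nothing but the Python standard library. The sketch below only builds the request object and does not contact a server. The endpoint URL is a placeholder and the bare `<query>` body is a simplification; the real API may require a different address and a fuller XML envelope, so both are assumptions.

```python
import urllib.request


def build_search_request(base_url, query_xml):
    """Prepare an HTTP POST carrying an XML query as its UTF-8 encoded body."""
    return urllib.request.Request(
        base_url,
        data=query_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


# Placeholder endpoint; replace with the address of an actual installation.
request = build_search_request(
    "http://localhost:8080/example-db",
    "<query>php developer london</query>",
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen(request)` would send it; that call is left out here so the example runs without a server.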

An optional REST API interface for the JSON data format transforms customer data between JSON and XML, while Clusterpoint Server uses only XML for internal server-side data storage and processing.[33]
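
A JSON-to-XML transformation of this kind can be sketched in Python. The mapping rules Clusterpoint actually applies are not given in the source, so the conventions here (object keys become tag names, list entries become `<item>` tags, the root is called `<document>`) are assumptions made purely for illustration.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively convert a parsed JSON value into an XML element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            # Assumed convention: wrap each list entry in an <item> tag.
            element.append(to_xml("item", item))
    else:
        element.text = str(value)
    return element


doc = json.loads(
    '{"name": "John Smith", "address": {"city": "London"}, "salary": 4000}'
)
xml_text = ET.tostring(to_xml("document", doc), encoding="unicode")
print(xml_text)
```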

Clusterpoint Server also provides native client API libraries, which use a faster TCP/IP transport protocol, for the following popular programming environments:

  • PHP[34]
  • .NET[35]
  • Python[36]
  • Java[37]

Licensing and support

Clusterpoint is commercial software available under perpetual, OEM and subscription licenses. A free trial license is available upon request.[38]

The vendor provides standard software maintenance and technical support on a subscription basis, delivered by email, Skype or phone. Premium technical support for customers using the software in 24/7 production environments includes remote problem diagnostics and resolution under a service-level agreement.

The vendor optionally provides installation support and customer training in administering and configuring Clusterpoint databases and in using the database ranking technology.[39]

3rd party tools and applications

  • Clusterpark Log Data Server - organizes enterprise logs in a single instantly searchable database[40]
  • Crosslink Enterprise - integrates multiple applications together[41]
  • Crosslink Clusterpoint Adapter - downloadable Clusterpoint connector sample code[42]
  • DigiBrowser - transforms SQL to Clusterpoint NoSQL database without programming[43]
  • Network Traffic Surveillance System - works as a "video-recording system" for lawful intercept and analysis of computer communications[44]

See also

References

  1. ^ "Nosql-database.org". Nosql-database.org. Retrieved June 14, 2013. 
  2. ^ "Big data startups / document stores". Bigdata-startups.com. Retrieved June 14, 2013. 
  3. ^ "The NoSQL movement: document databases". Dataversity. Retrieved June 14, 2013. 
  4. ^ "Clusterpoint Product / Architecture". Clusterpoint.com. Retrieved June 14, 2013. 
  5. ^ "Clusterpoint Product / Searching". Clusterpoint.com. Retrieved June 14, 2013. 
  6. ^ "Clusterpoint Product / Indexing". Clusterpoint.com. Retrieved June 14, 2013. 
  7. ^ "Fulltext search engines". Mediawiki.org. Retrieved June 14, 2013. 
  8. ^ "Clusterpoint Solutions / Data-unification". Clusterpoint.com. Retrieved June 14, 2013. 
  9. ^ "Document Policy / Relevance Ranking". Clusterpoint.com. Retrieved June 14, 2013. 
  10. ^ "Clusterpoint Developer's Guide". Clusterpoint.com. Retrieved June 14, 2013. 
  11. ^ "Reindexing Clusterpoint database". Clusterpoint.com. Retrieved June 14, 2013. 
  12. ^ "Clusterpoint Product / Data Collectors". Clusterpoint.com. Retrieved June 14, 2013. 
  13. ^ "Clusterpoint Network Traffic Security System". Clusterpoint.com. Retrieved June 14, 2013. 
  14. ^ "Clusterpoint Solutions". Clusterpoint.com. Retrieved June 14, 2013. 
  15. ^ "Clusterpoint website". Clusterpoint.com. Retrieved June 14, 2013. 
  16. ^ "Clusterpoint Search Query Syntax". Clusterpoint.com. Retrieved June 14, 2013. 
  17. ^ "Clusterpoint Architecture". Clusterpoint.com. Retrieved June 14, 2013. 
  18. ^ "Clusterpoint Document Policy". Clusterpoint.com. Retrieved June 14, 2013. 
  19. ^ "Clusterpoint Team". Clusterpoint.com. Retrieved June 14, 2013. 
  20. ^ "Crunchbase Profile". Crunchbase.com. Retrieved June 14, 2013. 
  21. ^ "BusinessWeek Company Profile". Businessweek. Retrieved June 14, 2013. 
  22. ^ "Clusterpoint Raises EUR1 Million From BaltCap". Privateequitywire. Retrieved June 14, 2013. 
  23. ^ "Clusterpoint Receives €1 Million From BaltCap". Arcticstartup.com. Retrieved June 14, 2013. 
  24. ^ "Documentation - XML API Overview". Clusterpoint.com. Retrieved June 14, 2013. 
  25. ^ "Documentation - REST / JSON API Overview". Clusterpoint.com. Retrieved June 14, 2013. 
  26. ^ "Making you app searchable using self merge-joins". Google. Retrieved June 14, 2013. 
  27. ^ "Product Features". Clusterpoint.com. Retrieved June 14, 2013. 
  28. ^ "Clusterpoint Data Loading". Clusterpoint.com. Retrieved June 14, 2013. 
  29. ^ "Clusterpoint Ranking Index". Clusterpoint.com. Retrieved June 14, 2013. 
  30. ^ "Result ordering and grouping". Clusterpoint.com. Retrieved June 14, 2013. 
  31. ^ "User Scripting". Clusterpoint.com. Retrieved June 14, 2013. 
  32. ^ "Clusterpoint XML API". Clusterpoint.com. Retrieved June 14, 2013. 
  33. ^ "Clusterpoint REST API". Clusterpoint.com. Retrieved June 14, 2013. 
  34. ^ "PHP API Library". Clusterpoint.com. Retrieved June 14, 2013. 
  35. ^ "NET API Library". Clusterpoint.com. Retrieved June 14, 2013. 
  36. ^ "Python API Library". Clusterpoint.com. Retrieved June 14, 2013. 
  37. ^ "Java API Library". Clusterpoint.com. Retrieved June 14, 2013. 
  38. ^ "Clusterpoint Free Trial License". Clusterpoint.com. Retrieved June 14, 2013. 
  39. ^ "Clusterpoint Customer Support Contacts". Clusterpoint.com. Retrieved June 14, 2013. 
  40. ^ "Clusterpark Log Data Server". Clusterpark Ltd. Retrieved June 17, 2013. 
  41. ^ "DigiBrowser". Datorikas Instituts DIVI. Retrieved June 14, 2013. 
  42. ^ "US.LV Network Traffic Surveillance System". US.LV. Retrieved June 14, 2013. 

External links