Data hub
A data hub is a collection of data from multiple sources, organized for distribution and sharing, and often for subsetting. This distribution generally takes the form of a hub-and-spoke architecture.
Features
A data hub differs from a data warehouse in that its data is generally unintegrated and often held at different grains. It differs from an operational data store in that a data hub does not need to be limited to operational data.
A data hub differs from a data lake in that it homogenizes data and may serve it in multiple desired formats, rather than simply storing it in one place. It also adds other value to the data, such as de-duplication, quality, security, and a standardized set of query services. A data lake tends to store data in one place for availability, and allows or requires the consumer to process or add value to the data.
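The homogenizing and de-duplication described above can be illustrated with a minimal sketch. The two source layouts (a CRM export and a billing export), the `Customer` record, and all function names are hypothetical and chosen only for illustration; a real hub product would implement this very differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical common record shape served by the hub."""
    email: str
    name: str


def from_crm(row: dict) -> Customer:
    # Assumed CRM layout: {"Email": ..., "FullName": ...}
    return Customer(email=row["Email"].strip().lower(),
                    name=row["FullName"].strip())


def from_billing(row: dict) -> Customer:
    # Assumed billing layout: {"contact_email": ..., "first": ..., "last": ...}
    return Customer(email=row["contact_email"].strip().lower(),
                    name=f'{row["first"]} {row["last"]}'.strip())


def build_hub(crm_rows: list[dict], billing_rows: list[dict]) -> list[Customer]:
    """Homogenize both sources, then de-duplicate on the normalized email."""
    merged: dict[str, Customer] = {}
    for record in [*map(from_crm, crm_rows), *map(from_billing, billing_rows)]:
        # Keep the first record seen for each email; later duplicates are dropped.
        merged.setdefault(record.email, record)
    return list(merged.values())


crm = [{"Email": "Ada@Example.com ", "FullName": "Ada Lovelace"}]
billing = [
    {"contact_email": "ada@example.com", "first": "Ada", "last": "Lovelace"},
    {"contact_email": "alan@example.com", "first": "Alan", "last": "Turing"},
]
hub = build_hub(crm, billing)
```

In a data lake, by contrast, both raw exports would simply be stored as they arrived, and each consumer would have to perform this normalization itself.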
Ideally, a data hub is the "go-to" place for data within an enterprise. This removes the need for many point-to-point connections between callers and data suppliers. It also lets the organization that runs the hub negotiate deliverables and schedules with the various data enclave teams, rather than leaving different teams to seek new services and features from many other teams in an organizational free-for-all.
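A rough count shows why the hub reduces the number of connections. The sketch below assumes the worst case, in which every caller needs data from every supplier; the function names and the example figures are illustrative only.

```python
def point_to_point_links(callers: int, suppliers: int) -> int:
    """Connections needed when every caller talks to every supplier directly."""
    return callers * suppliers


def hub_links(callers: int, suppliers: int) -> int:
    """Connections needed when each party connects once to a central hub."""
    return callers + suppliers


# Example figures (assumed): 10 calling teams and 8 data suppliers.
direct = point_to_point_links(10, 8)
via_hub = hub_links(10, 8)
```

Under this assumption the direct approach grows with the product of the two counts, while the hub-and-spoke approach grows only with their sum.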
List of products that promote themselves as data hubs
- Avoiding Mass Extinctions Engine
- BuzzData
- CKAN[1]
- DataMarket
- Dataverse
- Factual
- GeoIQ
- Hadoop
- InfoChimps
- Kasabi
- MarkLogic
- PANDA project[1]
- ScraperWiki[1]
- Socrata
- Quandl
- Windows Azure MarketPlace
A data hub may also include integration of project management systems and account management systems.
Approaches and Considerations
- Timeliness. How often will data be copied into the hub?
- Quality. How will data be curated, and by whom?
- Security. Consolidating data in one place may weaken a security posture. How will it be secured, both against malicious intruders and against inappropriate internal access?
- Scalability. A hub will offload real-time traffic from production systems, but must itself be scalable.
- Data Modeling. Some hubs require extensive data modeling to allow even basic processing. Others allow more flexible handling of many data sources. There are tradeoffs to both approaches.
In a blog post, Kurt Cagle contrasts data lakes (simple storage in Hadoop) with data hubs (more complex transformation, security, and semantics using a hub product, in this case MarkLogic).[2]
References
- "From CMS to DMS: C is for Content, D is for Data". ScraperWiki. 2012-03-09. Retrieved 2012-03-12.
- Cagle, Kurt. https://www.linkedin.com/pulse/data-hubs-marklogic-vs-hadoop-kurt-cagle