Star schema

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In computing, the star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.[1]

Contents

[edit] Model

A star schema classifies the attributes of an event into facts (measured numeric/time data), and descriptive dimension attributes (product id, customer name, sale date) that give the facts a context. A fact record is the nexus between the specific dimension values and the recorded facts. The facts are stored at a uniform level of detail (the grain) in the fact table. Dimension attributes are organized into affinity groups and stored in a minimal number of dimension tables.

A weather star schema that records weather data may have facts of temperature, barometric pressure, wind speed, precipitation, cloud cover, etc and dimensions of location, date/time, reporter, etc.

Star schemas are designed to optimize user ease-of-use and retrieval performance by minimizing the number of tables to join to materialize a transaction.

A star schema is called such as it resembles a constellation of stars, generally several bright stars (facts) surrounded by dimmer ones (dimensions).

  • The fact table holds the metric values recorded for a specific event. Because of the desire to hold atomic level data, there generally are a very large number of records (billions). Special care is taken to minimize the number and size of attributes in order to constrain the overall table size and maintain performance. Fact tables generally come in 3 flavors - transaction (facts about a specific event eg Sale), snapshot (facts recorded at a point in time eg Account details at month end), and accumulating snapshot tables (eg month-to-date sales for a product).
  • Dimension tables usually have few records compared to fact tables, but may have a very large number of attributes that describe the fact data.

Often there can be dozens to hundreds of dimension attributes describing the various facets of a fact. Dimension attributes are organized into tables of loosely related attributes that share a known or unknown affinity. Attributes of color, style, size, texture can describe a product and would be included in a product dimension table. Dimension tables include attributes that typically would be normalized into separate tables (Snowflake schema). For example, in the US a location can be identified by a zipcode that exists within a neighborhood, city, state, region. All of these attributes would be included in a location dimension table.

On an Entity-Relationship (ER) diagram, fact tables have few distinct columns, while dimension tables have a large number of columns. However, most of the storage is used by the fact table.

Dimension tables are assigned a surrogate primary key of a simple integer that is assigned to the combination of low level attributes that form the natural key. Fact tables should also have a single surrogate primary key to allow for situations where there may be two or more facts having the exact same set of dimension keys.

A star schema that has many dimensions is sometimes called a centipede schema[2]. Having dimensions of only a few attributes, while simpler to maintain, results in queries with many table joins and makes the star schema less easy to use.

[edit] Benefits

The primary benefit of a star schema is its simplicity for users to write, and databases to process: queries are written with simple inner joins between the facts and a small number of dimensions. Star joins are simpler than possible in snowflake schema. Where conditions need only to filter on the attributes desired, and aggregations are fast.

The star schema is a way to implement multidimensional database (MDDB) functionality using a mainstream relational database: given most organizations' commitment to relational databases, a specialized multidimensional DBMS is likely to be both expensive and inconvenient.

[edit] Example

Star schema used by example query.

Consider a database of sales, perhaps from a store chain, classified by date, store and product. The image of the schema to the right is a star schema version of the sample schema provided in the snowflake schema article.

Fact_Sales is the fact table and there are three dimension tables Dim_Date, Dim_Store and Dim_Product.

Each dimension table has a primary key on its Id column, relating to one of the columns (viewed as rows in the example schema) of the Fact_Sales table's three-column (compound) primary key (Date_Id, Store_Id, Product_Id). The non-primary key Units_Sold column of the fact table in this example represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim_Date dimension).

For example, the following query answers how many TV sets have been sold, for each brand and country, in 1997:

SELECT
        P.Brand,
        S.Country,
        SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D    ON F.Date_Id = D.Id
INNER JOIN Dim_Store S   ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE
        D.YEAR = 1997
AND P.Product_Category = 'tv'
GROUP BY
        P.Brand,
        S.Country

[edit] References

[edit] External links

Personal tools
Namespaces

Variants
Actions
Navigation
Interaction
Toolbox
Print/export
Languages