What is a data cube?

A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. Facts are numerical measures. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

Example:

In a 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact, or measure, displayed is dollars sold.

Now, suppose that we would like to view the sales data with a third dimension. For instance, suppose we would like to view the data according to time, item, and location. The above tables show the data at different degrees of summarization. In the data warehousing research literature, a data cube such as each of the above is referred to as a cuboid. Given a set of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization, or group by (i.e., summarized by a different subset of the dimensions). The lattice of cuboids is then referred to as a data cube. The following figure shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier.

The cuboid which holds the lowest level of summarization is called the base cuboid. The 0-D cuboid which holds the highest level of summarization is called the apex cuboid. The apex cuboid is typically denoted by all.
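Since each cuboid corresponds to one subset of the dimensions, a lattice over n dimensions contains 2^n cuboids. A minimal sketch in Python (illustrative, not from the text) enumerates them, with the empty subset as the apex cuboid and the full set as the base cuboid:

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (group-by subset) of the given dimensions.

    The empty subset is the apex cuboid ("all"); the full set of
    dimensions is the base cuboid.
    """
    lattice = []
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            lattice.append(subset)
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)
print(len(lattice))   # 2^4 = 16 cuboids
print(lattice[0])     # () -> apex cuboid, i.e. "all"
print(lattice[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid
```

Each tuple names the dimensions a cuboid is grouped by; the 16 cuboids here form the lattice described above for time, item, location, and supplier.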

STARS, SNOWFLAKES, AND FACT CONSTELLATIONS: SCHEMAS FOR MULTIDIMENSIONAL DATABASES

The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects, and the relationships between them. Such a data model is appropriate for on-line transaction processing. Data warehouses, however, require a concise, subject-oriented schema which facilitates on-line data analysis. The most popular data model for data warehouses is a multidimensional model. This model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema:

The star schema is a modeling paradigm in which the data warehouse contains (1) a large central table (fact table), and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

In a star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location_key, city, state, country}. This constraint may introduce some redundancy.

Example: Chennai and Madurai are both cities in the state of Tamil Nadu, India, so their rows in the location dimension table repeat the same state and country values.
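The redundancy can be seen directly in a small sketch of such a dimension table (hypothetical rows, illustrating the idea rather than any real warehouse):

```python
# A denormalized star-schema location dimension (hypothetical rows).
# Because each dimension is a single table, the state and country
# values repeat for every city in the same state.
location_dim = [
    {"location_key": 1, "city": "Chennai", "state": "TamilNadu", "country": "India"},
    {"location_key": 2, "city": "Madurai", "state": "TamilNadu", "country": "India"},
]

# The (state, country) pair is stored once per city -- the redundancy
# the text describes.
states = [(row["state"], row["country"]) for row in location_dim]
print(states)  # [('TamilNadu', 'India'), ('TamilNadu', 'India')]
```

Every additional Tamil Nadu city would repeat the same pair again; the snowflake schema below removes exactly this kind of repetition.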

Snowflake schema:

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.

Snowflake schema of a data warehouse for sales

The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain and also save storage space.

Drawback:

The snowflake schema requires more joins to execute a query, so it is not as popular as the star schema in data warehouse design. A compromise between the star schema and the snowflake schema is to adopt a mixed schema where only the very large dimension tables are normalized.
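The normalization a snowflake schema applies, and the extra join it costs, can be sketched on a location dimension (hypothetical data, for illustration only):

```python
# Snowflake-style normalization of a location dimension (hypothetical
# data): the repeating (state, country) pair moves into its own table,
# and the city-level table references it by key.
state_dim = [
    {"state_key": 10, "state": "TamilNadu", "country": "India"},
]
city_dim = [
    {"location_key": 1, "city": "Chennai", "state_key": 10},
    {"location_key": 2, "city": "Madurai", "state_key": 10},
]

# Reconstructing the flat star-schema view now needs a join -- the
# extra cost the drawback paragraph describes.
state_by_key = {s["state_key"]: s for s in state_dim}
denormalized = [
    {"city": c["city"],
     "state": state_by_key[c["state_key"]]["state"],
     "country": state_by_key[c["state_key"]]["country"]}
    for c in city_dim
]
print(denormalized[0])  # {'city': 'Chennai', 'state': 'TamilNadu', 'country': 'India'}
```

The (state, country) pair is now stored once regardless of how many cities share it, at the price of one extra lookup per query.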

Fact constellation:

Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.

Fact constellation schema of a data warehouse for sales and shipping

This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. A fact constellation schema allows dimension tables to be shared between fact tables. In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.

For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a departmental subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schema is popular, since each is geared towards modeling a single subject. Star, snowflake, and fact constellation schemas can be defined in DMQL (Data Mining Query Language).
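As an illustration, a star schema for sales might be declared in DMQL roughly as follows (a sketch of the syntax used in the data-mining literature; the cube name and attribute lists here are illustrative, not normative):

```
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, state, country)
```

A snowflake variant would normalize a dimension by nesting a sub-table inside its definition, and a fact constellation would add further `define cube` statements that reuse the dimensions already defined.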

MEASURES: THEIR CATEGORIZATION AND COMPUTATION

A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point. Based on the kind of aggregate function used, measures can be organized into three categories:

1. Distributive Measure

2. Algebraic Measure

3. Holistic Measure

1. Distributive Measure

An aggregate function is distributive if it can be computed in a distributed manner as follows: Suppose the data is partitioned into n sets. The computation of the function on each partition derives one aggregate value. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function on all the data without partitioning, the function can be computed in a distributed manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing count() for each subcube, and then summing up the counts obtained for each subcube. Hence count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate functions. A measure is distributive if it is obtained by applying a distributive aggregate function.
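The partition test for distributivity can be checked directly. The sketch below (illustrative values) verifies that combining per-partition results of count(), sum(), min(), and max() matches computing them over the whole data set:

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]   # illustrative values

# Partition the data into n = 2 sets.
partitions = [data[:4], data[4:]]

# count() is distributive: summing the per-partition counts gives the
# same result as counting the unpartitioned data.
assert sum(len(p) for p in partitions) == len(data)

# sum(), min(), and max() are distributive for the same reason.
assert sum(sum(p) for p in partitions) == sum(data)
assert min(min(p) for p in partitions) == min(data)
assert max(max(p) for p in partitions) == max(data)

# avg() is not distributive, but it can still be computed from the
# distributive functions sum() and count() -- which is what makes it
# an algebraic measure.
true_avg = sum(data) / len(data)
assert true_avg == sum(sum(p) for p in partitions) / sum(len(p) for p in partitions)
```

The same check extends to any number of partitions, which is why distributive measures are cheap to compute over a partitioned data cube.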
