2. Bring it all together with data warehousing

Let’s dive a bit deeper into OLAP by looking at a related concept: data warehousing. Large businesses typically have multiple OLTP databases that handle different parts of their operations. Imagine, for instance, an online store like Amazon. It would need separate systems to process orders, capture customer behavior on the website, manage deliveries, update inventory, etc. Each of these would require its own OLTP database.

On the other side of this equation we have a data warehouse, which needs to bring all of these operational databases together so that OLAP queries can guide business decisions such as recommending products to customers, forecasting inventory requirements, and analyzing product trends.

The set of operations that performs all the transformations needed to stitch these two kinds of systems together is called ETL, which stands for Extract, Transform, and Load.
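
Conceptually, an ETL job can be as simple as the sketch below, written here in Python with pandas. The file names, column names, and the SQLite file standing in for the warehouse are all made up for illustration; a real pipeline would read from the operational databases and load into a proper warehouse.

```python
import sqlite3

import pandas as pd

# Extract: pull raw data out of the operational (OLTP) sources.
# File names and columns are hypothetical.
orders = pd.read_csv("orders.csv")        # orderId, customerId, itemId, quantity, price, orderDate
customers = pd.read_csv("customers.csv")  # customerId, city, state

# Transform: clean and reshape the data for analytical queries.
orders["orderDate"] = pd.to_datetime(orders["orderDate"])
orders["revenue"] = orders["quantity"] * orders["price"]
sales = orders.merge(customers, on="customerId", how="left")

# Load: write the combined result into the warehouse.
conn = sqlite3.connect("warehouse.db")
sales.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
```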

The primary purpose of ETL is to organize operational data and combine it with other, secondary sources of data so that it is suitable for analytical queries. In the past, this often meant transforming the data so that it was in either Boyce-Codd normal form (BCNF) or the stricter fourth normal form (4NF). Normalization splits up data to avoid redundancy (duplication) by moving repeating groups of data into new tables, which makes the data easier to manage and update.
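
As a tiny illustration of what normalization does (with made-up columns), here is a pandas sketch that moves a repeating customer attribute into its own table:

```python
import pandas as pd

# A denormalized table: the customer's city is repeated on every order.
orders = pd.DataFrame({
    "orderId":    [1, 2, 3],
    "customerId": [10, 10, 11],
    "city":       ["Seattle", "Seattle", "Austin"],
    "amount":     [20.0, 35.0, 15.0],
})

# Normalizing moves the repeating customer attributes into their own table...
customers = orders[["customerId", "city"]].drop_duplicates()

# ...and keeps only the key in the original table.
orders = orders.drop(columns=["city"])
```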

However, very large datasets do not take well to being normalized. This is because OLAP queries often require data from many different tables, and joining several large tables can often be a bottleneck when performing queries.

Therefore, these days most data warehouses use either a star or a snowflake schema to organize their data. While these schemas involve multiple tables that are related to each other, the rules for data organization are not as strict as those for normalized data.

A star schema is the simplest and most widely used schema for organizing data in a warehouse. It has a central fact table that is surrounded by dimension tables.

The fact table contains data on key business processes like sales, bank transactions, inventory, etc. For instance, in the illustration below, the fact being captured is the price and quantity of an item that was sold from a particular store to a particular customer (this would typically include a time component as well).

Stores, items, and customers constitute the dimensions of this fact. The dimension tables capture additional information about these dimensions. For example, the item table might record the type, size, and brand of each item, while the store table might record the city and state in which each store is located.

The different tables are linked to each other using keys. Keys are used to uniquely identify rows in a particular table. For instance, the storeId is the primary key for the store dimension and uniquely identifies the row that contains information about a particular store. The rows in a fact table are usually identified using a composite key that includes all the dimensions. In the illustration below, each row in the fact table is unique for the combination of store, item, and customer, i.e. there is only one row that shows the price and quantity of an item that was sold from a particular store to a particular customer.
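
To make the keys concrete, here is a minimal star schema sketched in pandas. The tables and columns loosely follow the description above and are otherwise invented; the fact table is identified by the composite key (storeId, itemId, customerId), and an analytical query joins it to a dimension on that dimension’s key.

```python
import pandas as pd

# Dimension tables: one row per store, item, and customer.
store_dim = pd.DataFrame({
    "storeId": [1, 2],
    "city":    ["Seattle", "Austin"],
    "state":   ["WA", "TX"],
})
item_dim = pd.DataFrame({
    "itemId": [10, 11],
    "type":   ["shoe", "shirt"],
    "brand":  ["Acme", "Zenith"],
})
customer_dim = pd.DataFrame({
    "customerId": [100, 101],
    "name":       ["Ana", "Bo"],
})

# Fact table: each row is identified by the composite key
# (storeId, itemId, customerId) and records the measures.
sales_fact = pd.DataFrame({
    "storeId":    [1, 1, 2],
    "itemId":     [10, 11, 10],
    "customerId": [100, 101, 101],
    "quantity":   [2, 1, 3],
    "price":      [50.0, 20.0, 50.0],
})

# An analytical query: total revenue per state requires joining the
# fact table to the store dimension on its key.
revenue_by_state = (
    sales_fact.merge(store_dim, on="storeId")
              .assign(revenue=lambda df: df["quantity"] * df["price"])
              .groupby("state")["revenue"]
              .sum()
)
print(revenue_by_state)
```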

The schema gets its name from its shape: a central fact table surrounded by dimension tables looks roughly like a star.

The fact table tends to be very large and is updated frequently, while the dimension tables are smaller and change less often.

The snowflake schema is a more normalized version of the star schema. In the star schema, dimensions cannot be more than one level deep; the snowflake schema has no such restriction.

For instance, if we wanted to add information about the city that a store is located in, the star schema would require it to be added to the store dimension, since dimensions can only be one level deep. The data about a city would then be duplicated if more than one store were located in the same city.

In the case of a snowflake schema, the information could be captured separately in a city dimension table as shown below.
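
A rough sketch of the difference, again with invented tables: under the star schema the city details live inside the store dimension, while under the snowflake schema they move into their own dimension that the store dimension references by key.

```python
import pandas as pd

# Star schema: city details are repeated inside the store dimension.
store_dim_star = pd.DataFrame({
    "storeId": [1, 2, 3],
    "city":    ["Seattle", "Seattle", "Austin"],
    "state":   ["WA", "WA", "TX"],
})

# Snowflake schema: the city becomes its own dimension, and the
# store dimension only keeps a foreign key pointing at it.
city_dim = pd.DataFrame({
    "cityId": [100, 200],
    "city":   ["Seattle", "Austin"],
    "state":  ["WA", "TX"],
})
store_dim = pd.DataFrame({
    "storeId": [1, 2, 3],
    "cityId":  [100, 100, 200],
})

# The price of less redundancy: one extra join whenever a query
# needs city-level attributes.
stores_with_city = store_dim.merge(city_dim, on="cityId")
```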

While the snowflake schema reduces data redundancy, it suffers from the same issues as stricter forms of normalization, since it requires joining more tables to perform analytical queries.