Databases and Data Modelling — A Quick Crash Course | by Col Jung | May, 2023


Warehousing 101: A Practical Guide for Beginners

Col Jung
Towards Data Science
Image by author

Over my five years working in enterprise analytics, I’ve noticed that a surprising number of data scientists enter the job with limited knowledge of data warehousing and data modelling.

This shouldn’t come as a surprise.

Data scientists hail from diverse backgrounds such as math, statistics, psychology, and programming. Many likely would not have delved into the intricacies of database systems during their university days.

That includes myself, a mathematician-turned-data-scientist.

I taught myself data science with online courses — a necessary prerequisite to get a data science job, I suppose — but only picked up database fundamentals on the job.

Plus, with data lakes being all the rage now, who even needs warehouses, right? (That’s a joke!)

I wrote this article as a quick and dirty crash course for anyone who’s jumping into an analytics job without much prior knowledge of data warehouses and data modelling.

We’ll cover three topics:

  • What an enterprise data warehouse workflow looks like;
  • What database normalisation achieves;
  • A taste-test of NoSQL databases.

Let’s dive in!


Specific details may vary between solutions, but a common architecture for enterprise-scale analytics looks like this:

  1. Data sits in a data lake.
  2. Data is loaded into a data warehouse.
  3. A data model is created.
  4. Analysts consume the data.

Let’s see what that looks like in more detail.

Analytical data processing. Image by author

Analytical data processing

Data from operational or transactional data stores (typically OLTP databases), files, real-time streams, and other sources is loaded into a centralised data lake.

Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analysed.

This load operation from the operational data plane to the analytical data plane usually involves an extract, transform, and load (ETL), or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis. The resulting data structure is optimised for analytical queries.

A 30,000-foot view of the enterprise data landscape. Source: Z. Dehghani at MartinFowler.com with amendments by author

As data lakes typically belong in the domain of big data, which entails distributed compute and storage by frameworks like Apache Hadoop, the ETL processing is undertaken by data engineers who set up Hive or Spark jobs that process high volumes of data in parallel using multi-node clusters. These pipelines include both batch processing of static data and real-time processing of streaming data.
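
As a rough illustration, here’s a minimal sketch of what one such batch step might look like in Spark SQL. The database, table and column names (raw.orders, curated.orders and so on) are hypothetical, not taken from any particular platform:

-- A hedged sketch of a batch ELT step in Spark SQL (hypothetical names):
-- clean and filter raw lake data, then write it out as a Parquet table
-- that downstream analytics can query efficiently.
CREATE TABLE curated.orders
USING PARQUET
AS
SELECT
    CAST(order_id AS BIGINT)        AS order_id,
    TRIM(customer_id)               AS customer_id,
    CAST(order_ts AS TIMESTAMP)     AS order_ts,
    CAST(revenue AS DECIMAL(18, 2)) AS revenue
FROM raw.orders
WHERE order_id IS NOT NULL;   -- drop malformed records

In practice this would be just one step in a scheduled pipeline, with streaming sources handled by separate real-time jobs.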

ETL pipelines connect operational and analytical data stores. Image by author

Data in the data lake sits ready for further analytics use. This includes exploring, wrangling and modelling by data scientists, or processing by data analysts for the purposes of creating reports and visuals.

Check out my explainer article on the entire enterprise data landscape, from warehouses to data lakes to data mesh.

Data Warehouses, Tables, Schemas & Normalisation

Alright, moving on. Time to enrich our data sitting in the lake.

First, the data is copied into a data warehouse that’s optimised for read activity.

A data warehouse is a relational database in which the data is stored in a schema that’s optimised for data analytics rather than transactional workloads. Warehouses are by design optimised for read operations, primarily the queries that support business intelligence (BI): reports, dashboards and visualisations that, in the hands of good data storytellers, communicate value and influence decision-making.

Relational databases. Image by author

Relational databases are commonly used to store and query structured data. The data is stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it and these keys are used to reference the entity instance in other tables.

This is what enables the enrichment of data through joining tables together.

For example, a customer’s primary key can be referenced in a sales order record to indicate which customer placed that order:

-- Joining two tables using SQL
SELECT *
FROM Customer C
JOIN Orders O
  ON C.ID = O.CustomerID;  -- each order row stores the key of the customer who placed it

The use of keys to reference data entities enables a relational database to be normalised — which in part means the elimination of duplicate data values so that, for example, the details of an individual customer are stored only once, not for each sales order the customer places.

More on normalisation later.

Tables are managed and queried using Structured Query Language (SQL), which is based on an ANSI standard, so it’s similar across multiple database systems. We saw a simple query joining two tables above.

Let’s talk more about database schemas.

Whereas data lakes employ a Schema-on-Read approach that requires no upfront schema to be defined, warehouses employ a more computationally expensive but more organised Schema-on-Write paradigm, where table schemas must be defined upfront.

These schemas are designed by data modellers and solutions architects in consultation with the downstream analytics and business users.

In common practice, modellers transform data from a transactional store into a schema whereby numeric values are stored in central fact tables.

These are then related to one or more dimension tables that represent the entities by which you want to aggregate those numeric measures — for example product or customer.

Each entity is represented by a row with a unique key value.

The remaining columns represent attributes of an entity — for example, products have names and categories, and customers have addresses and cities. It’s common in most analytical models to include a Time dimension so that you can aggregate numeric measures associated with events over time.

So here’s what all of this looks like:

A common star schema. Image by author

We can see that numeric measures (e.g. Revenue) aggregated by various dimensions in the model (Customer, Product, Time) sit in a central fact table, Sales.

To be more specific:

Each row in a fact table represents a recorded event with numeric measures attached. Here, the Sales table in our star schema represents sales transactions for individual items, and includes numeric values for quantity sold and revenue.

Sales can be aggregated by customer, product, store, and time dimensions, enabling you to easily find monthly total sales revenue by product for each store, for example.
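
To make this concrete, here’s a hedged sketch of how the tables just described might be defined. The exact column names and types are illustrative assumptions, not a prescription:

-- Illustrative star schema DDL (names and types are assumptions)
CREATE TABLE Customer (
    CustomerKey INT PRIMARY KEY,
    Name        VARCHAR(100),
    Address     VARCHAR(200),
    City        VARCHAR(100)
);

CREATE TABLE Product (
    ProductKey INT PRIMARY KEY,
    Name       VARCHAR(100),
    Category   VARCHAR(100)
);

CREATE TABLE Time (
    TimeKey INT PRIMARY KEY,
    Date    DATE,
    Month   INT,
    Year    INT
);

-- The fact table: one row per recorded sales event, with numeric
-- measures and foreign keys pointing at each dimension
CREATE TABLE Sales (
    SalesKey    INT PRIMARY KEY,
    CustomerKey INT REFERENCES Customer (CustomerKey),
    ProductKey  INT REFERENCES Product (ProductKey),
    StoreKey    INT,   -- references a Store dimension (not shown)
    TimeKey     INT REFERENCES Time (TimeKey),
    Quantity    INT,
    Revenue     DECIMAL(18, 2)
);

With tables shaped like this, enriching the data is simply a matter of joining the fact table to whichever dimensions you need, as the query further below shows.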

This is where insights can then be unveiled.

In summary:

Joining tables through keys enriches the data, and aggregations then deliver the insights.

-- Find total sales for each customer and product in 2022
SELECT c.Name, p.Name, SUM(s.Revenue) AS TotalRevenue
FROM Sales s
JOIN Customer c
  ON c.CustomerKey = s.CustomerKey
JOIN Product p
  ON p.ProductKey = s.ProductKey
JOIN Time t
  ON t.TimeKey = s.TimeKey
WHERE t.Year = 2022
GROUP BY c.Name, p.Name;

The star schema is the most common type of schema. It can be extended into a snowflake schema by adding further tables related to the existing dimension tables, often to represent dimensional hierarchies; for example, products might be grouped into subcategories, which in turn roll up into categories.

A snowflake schema. Source: Wikipedia
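
As a hedged sketch (table and column names are again assumptions), snowflaking the Product dimension might look like this:

-- Snowflaking the Product dimension: the product's category information
-- moves out into related tables that model the hierarchy
CREATE TABLE Category (
    CategoryKey  INT PRIMARY KEY,
    CategoryName VARCHAR(100)
);

CREATE TABLE Subcategory (
    SubcategoryKey  INT PRIMARY KEY,
    SubcategoryName VARCHAR(100),
    CategoryKey     INT REFERENCES Category (CategoryKey)
);

-- In the Product table, the Category column is replaced by a
-- SubcategoryKey foreign key pointing at the Subcategory table.

Queries then walk one extra join (Product to Subcategory to Category) to roll products up into their categories.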

Overall, a data warehouse is a great choice when you have transactional data that can be organised into a structured schema of tables, and you want to use SQL to query them in a performant environment.

See my PowerBI tutorial on a real-world example of data modelling.

Analytical Data Model

While data analysts and data scientists can work with the data directly in the warehouse, it’s common to create one or more analytical data models that pre-aggregate the data to make it easier to produce reports, dashboards, and interactive visualisations.

These are called Online Analytical Processing (OLAP) models or cubes.

The data is aggregated across dimensions at different levels of a hierarchy, enabling you to drill up or down and view the aggregations at multiple hierarchical levels: for example, total sales by region, by city, or for an individual address.

Hierarchies enable drilling up and down. Image by author
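
To make the drill-up and drill-down idea concrete: many relational engines can compute the subtotals for an entire hierarchy in one pass with GROUP BY ROLLUP. The sketch below assumes a hypothetical Store dimension with Region, City and Address columns, which isn’t defined anywhere in this article:

-- Subtotals at every level of a Region > City > Address hierarchy,
-- plus a grand total (hypothetical Store dimension columns)
SELECT st.Region, st.City, st.Address, SUM(s.Revenue) AS TotalRevenue
FROM Sales s
JOIN Store st
  ON st.StoreKey = s.StoreKey
GROUP BY ROLLUP (st.Region, st.City, st.Address);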

Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.

Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. Conceptually, this means the model forms a multi-dimensional structure in which any point where the dimensions intersect represents an aggregated measure for those dimensions.

For example, as suggested earlier, sales revenue might be totaled by date, customer, and product, giving rise to a query output that looks like a Cartesian coordinate within a cube.

Pretty cool right?!
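
One hedged way to picture the pre-aggregation (not how any particular OLAP engine implements it) is a summary table materialised with GROUP BY CUBE, which computes the aggregate at every intersection of the chosen dimensions:

-- Materialise revenue at every intersection of year, customer and
-- product (an illustrative stand-in for an OLAP cube)
CREATE TABLE SalesCube AS
SELECT t.Year,
       c.Name AS CustomerName,
       p.Name AS ProductName,
       SUM(s.Revenue) AS TotalRevenue
FROM Sales s
JOIN Customer c ON c.CustomerKey = s.CustomerKey
JOIN Product  p ON p.ProductKey  = s.ProductKey
JOIN Time     t ON t.TimeKey     = s.TimeKey
GROUP BY CUBE (t.Year, c.Name, p.Name);

Dashboards reading from a table like this only scan small summary rows, which is why OLAP queries come back so quickly.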

The “cube”. Measures (e.g. sales) are aggregated by dimensions time, customer & product. Image by author

Important! Although we commonly refer to an analytical model as a cube, there can be more (or fewer) than three dimensions — it’s just not easy for us to visualise more than three!

Ready for consumption!

Data analysts consume data from these analytical models (Step 3) — or directly from data warehouses (Step 2) — or even from the ‘raw’ datasets sitting in the data lake (Step 1) to explore data and create dashboards, reports & visualisations to generate insights.

A PowerBI dashboard. See my tutorial for a how-to guide!

Non-technology professionals in the organisation can perform self-service data analysis and reporting on the data, particularly if it has already been visualised in a BI tool such as PowerBI or Tableau.

Built on good analytical data models, these visualisations show comparisons, trends, and key performance indicators (KPIs). They take the form of charts, graphs and reports, disseminated in documents and PowerPoint presentations, in web-based dashboards, and in interactive environments (e.g. PowerBI and Tableau) where further downstream users, even the C-suite, can easily explore the data visually and make data-informed decisions.

Let’s now dive into some details of database normalisation.

This is the process of organising the data in our relational database to reduce redundancy and improve data integrity. It involves breaking larger tables down into smaller ones and defining relationships between them to eliminate duplicate data and inconsistencies.

The goal of normalisation is to create a database schema that is more efficient, flexible, and less prone to data anomalies. We start with a set of tables and often end up with a larger set of ‘cleaner’ tables.

There are several levels of normalisation, each with a set of rules to follow. The most common are:

First normal form (1NF)

Each column of a table must contain atomic (indivisible) values. That is, no column should contain a list or a set of values.

For example, imagine you have a table of customer orders. In each row, there’s a column for the customer’s name and a column for a list of items they ordered. In 1NF, you would break that list of items into individual rows, so each row only contains information about one item. This helps to prevent duplication of information.
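
A hedged sketch of that split, with hypothetical table and column names:

-- Before (violates 1NF): one column holds a list of values
CREATE TABLE CustomerOrders_Unnormalised (
    OrderID      INT,
    CustomerName VARCHAR(100),
    Items        VARCHAR(400)   -- e.g. 'Keyboard, Mouse, Monitor'
);

-- After (1NF): one row per ordered item, every column atomic
CREATE TABLE CustomerOrderItems (
    OrderID      INT,
    CustomerName VARCHAR(100),
    Item         VARCHAR(100)
);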

Second normal form (2NF)

A table is in 2NF if it is in 1NF and all non-key columns are fully dependent on the primary key. In other words, there should be no partial dependencies where a non-key column is dependent on only part of the primary key.

Continuing with the customer order example, suppose we also record the date each order was placed. After the 1NF step, the table has one row per order and item, so its key is the combination of order and item. The order date (and the customer who placed the order) depends only on the order, not on the individual item, so it gets repeated for every item in the order: a partial dependency. In 2NF, you would break the table into two, one for order-level information and one for the individual items within each order. This ensures that each piece of information is stored only once and removes the partial dependency.
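
Sketching that split (again with hypothetical names):

-- Violates 2NF: with composite key (OrderID, Item), the customer and
-- order date depend only on OrderID, so they repeat for every item.

-- 2NF: order-level details get their own table...
CREATE TABLE Orders (
    OrderID      INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    OrderDate    DATE
);

-- ...and the individual items reference it
CREATE TABLE OrderItems (
    OrderID INT REFERENCES Orders (OrderID),
    Item    VARCHAR(100),
    PRIMARY KEY (OrderID, Item)
);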

Third normal form (3NF)

A table is in 3NF if it is in 2NF and all non-key columns are independent of each other. In other words, there should be no transitive dependencies where a non-key column is dependent on another non-key column.

Now let’s say we also track each item’s manufacturer and price. If the price is determined directly by the item itself, every non-key column depends only on the key and you’re already in 3NF. But if the price is actually set by another non-key column, such as the item’s manufacturer, you have a transitive dependency (the item determines the manufacturer, and the manufacturer determines the price), and you would split that information out into its own table. The goal is to ensure that each column depends only on the primary key of its table and not on any other non-key column.
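
A deliberately contrived sketch of removing that transitive dependency (hypothetical names, assuming price really is set per manufacturer):

-- Violates 3NF: in an item table, Price is determined by Manufacturer,
-- a non-key column (ItemName -> Manufacturer -> Price).

-- 3NF: the manufacturer-determined attribute moves to its own table
CREATE TABLE Manufacturers (
    ManufacturerName VARCHAR(100) PRIMARY KEY,
    Price            DECIMAL(10, 2)
);

CREATE TABLE Items (
    ItemName         VARCHAR(100) PRIMARY KEY,
    ManufacturerName VARCHAR(100) REFERENCES Manufacturers (ManufacturerName)
);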

Glorious details on normalisation on Wikipedia.

There are additional levels of normalisation beyond 3NF, but they are less commonly used in practice.

As an aside, let’s quickly cover non-relational databases.

These are data management systems that don’t apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.

There are four types of non-relational database in common use:

  • Key-value databases in which each record consists of a unique key and an associated value, which can be in any format.
  • Document databases, which are a specific form of key-value database in which the value is a JSON document, which the system is optimised to parse and query.
  • Column family databases, which store tabular data comprising rows and columns, but allow you to divide the columns into groups known as column families. Each column family holds a set of columns that are logically related.
  • Graph databases, which store entities as nodes with links to define relationships between them.
NoSQL databases. Image by author

Analytical models enable you to structure data to support analysis.

Data for analytics is migrated from operational systems to data lakes and then to data warehouses. Here, the resulting relational database is modelled, which includes normalising it for efficiency and baking in a schema suitable for your business use case.

These models are based on having related tables of data, and define the numeric values that you want to analyse or report (measures) and the entities by which you want to aggregate them (dimensions).

There are three common levels of normalisation, which progressively break tables down into smaller, more specific tables connected to one another in a more logical and efficient way.

To further improve efficiency, the data may be pre-aggregated into an OLAP model or cube. Intuitively, these models form multidimensional structures that resemble a cube when you aggregate over exactly three dimensions, for instance sales revenue across customer, product and time.

Insights are then derived by data analysts and downstream users by enriching the data of their choice (by joining tables) and performing aggregations of interest.

Through persuasive data-storytelling, analysts and business professionals then present their findings to decision-makers who can act on the data.

If you found this article helpful, please give it a few claps! Your support really helps the algorithm promote the article. I appreciate it.

Find me on Linkedin, Twitter & YouTube.

  • AI Revolution: Fast-paced Intro to Machine Learning — here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War — here
  • Generative AI Art: Midjourney & Stable Diffusion Explained — here
  • Power of Data Storytelling — Sell Stories, Not Data — here
  • Data Warehouses, Data Lakes & Data Mesh Explained — here
  • Power BI — From Data Modelling to Stunning Reports — here
  • Data Warehouses & Data Modelling — a Quick Crash Course — here
  • Machine Learning versus Mechanistic Modelling — here
  • Popular Machine Learning Performance Metrics Explained — here
  • Future of Work: Is Your Career Safe in the AI Age — here
  • Beyond ChatGPT: Search for a Truly Intelligence Machine — here
  • Regression: Predict House Prices using Python — here
  • Classification: Predict Employee Churn using Python — here
  • Python Jupyter Notebooks versus Dataiku DSS — here
  • How to Leverage Cloud Computing for Your Business — here

Join Medium here and enjoy unlimited access to the best articles on the internet.

You will be directly supporting myself and other top writers. Cheers!



