Better Structure Integrity with Data Normalization

What is meant by data normalization?

Google’s BigQuery is a data warehousing and data analytics solution for Big Data stored in columns and rows. For those that are comfortable using SQL queries to retrieve information, BigQuery is easy to learn because it uses standard SQL language to query the data. It also includes some extensions to do a few things SQL normally can’t, but overall, if you know SQL, you’re set.

Very often there will be fields that are repeated within a table. The data in those fields should be the same throughout, but since it’s being re-entered, there’s a risk of incurring an error. As an example, imagine someone calling a customer support line to pay a hospital bill or mortgage. The customer rep records the initial information (name, account, security verification, etc.), and then punts the customer to a different representative who asks for and records the same information.

With every transfer to a different representative, there is a risk that the information being gathered doesn’t quite match up, causing problems.

Each customer rep sees a customer name field on their screen, which they fill in. The first rep types in Charles Cuthbert Billingsworth, Jr. The other, types Chucky Billingsworth.

A name mismatch could cause an issue for software that pulls data from the table. It may appear to the application that Charles and Chucky are different people, causing an error within the application. Despite both being the same person and despite the April mortgage being paid on time, the error might trigger a late notice to the customer, incorrectly stating that the mortgage payment was overdue.

To address this, we can normalize the data. The idea behind data normalization is to reduce the risk of mismatch errors caused by repeated fields (like name) within the table.

We do this by splitting the data into smaller, more simplified tables and then creating a relationship between those tables with a JOIN. For example, we could split the data into a customer table containing the names and a transaction table containing the payments.

The result is that the data only appears once. In this example, there is just one name associated with each ID. If the name field were to be modified, it only needs to be updated in one place, not re-entered for every transaction. This reduces the confusion and eliminates the potential error.

If you want to improve the integrity of your datasets, normalization is one strategy to consider when structuring your data.