The Ultimate Guide to Database Normalization (1NF, 2NF, 3NF, 4NF, 5NF)
Everything You Must Know About Normalization for Interviews
So, have you ever tried to organize a messed-up wardrobe?
You know, the one where you never find anything in particular, and once you need that one specific item, everything tumbles down!
So now, imagine that being your database.
Kind of horrifying, right?
Here is the databases’ very own Marie Kondo: Database Normalization!
I've been a developer, so I know the nightmares with data.
I mean, there's not anything that quite matches that feeling at the bottom of your gut when you realize your database has turned into some kind of twisted web of redundant, inconsistent information.
But then again, rest easy, for I am here to guide you through everything about normalizing databases from the First Normal Form through the Fifth Normal Form.
What is database normalization in general?
Normalization is the process of organizing the database so that all forms of redundancy and dependency are eliminated.
It's almost like planning a place for everything and putting everything in its place.
Some benefits crop up from doing it right:
Making data more comprehensible
Supports enhancement and extension of the database
Defends against multiple types of data anomalies
But, the main question is: why do we need normalization?
Let's see one simple example to demonstrate all the problems that the unnormalized data bank is capable of causing:.
Suppose we have an online bookstore. We can start with something that appears as follows:
| order_id | customer_name | customer_email | book_title | author | quantity | price |
------------------------------------------------------------------------
| 1 | John Doe | john@example.com | The Great Gatsby | F. Scott Fitzgerald | 2 | 15.99 |
------------------------------------------------------------------------
| 2 | Jane Smith | jane@example.com | To Kill a Mockingbird | Harper Lee | 1 | 12.99 |
------------------------------------------------------------------------
| 3 | John Doe | john@example.com | 1984 | George Orwell | 1 | 10.99 |
This seems pretty good on the surface, right? But later, as we expand our shop, we'll begin to run into trouble:
Redundancy of Data: The data of John Doe is redundant in two rows. If the email changed we would have to change at multiple places.
Updating anomalies: When the email for John is updated in one row but not in the other, it causes data inconsistency.
Insertion Anomalies: A new customer cannot be inserted before they have ordered because the OrderID is part of the primary key.
Deletion Anomalies: Due to the in-deletion of the order made by Jane, customer data referencing Jane will be deleted.
These are the kinds of problems that normalization helps solve for us.
Now, let's go on to the levels of normalization-very endearingly, “normal forms.”
First Normal Form (1NF)
The first Normal Form is the fundamental of a well-drawn database. For a table to be in 1NF, it should:
Not use row order to indicate meaning
No Mixed Data Types Within the Same Column
Each table requires a primary key.
No repeating groups
Now, let's take a closer look at each of these rules:
1. Do Not use row order to indicate meaning
Let's just say we have a list of Beatles band members by height.
| Name |
---------
| Paul |
---------
| John |
---------
| George|
---------
| Ringo |
Although we might know the list is sorted by height, the database has no way of knowing that.
The rows of a relational database table are in no particular order and should have no meaningful order assumed.
To correct this, a more particular column can be added for height:
| Name | Height_cm|
---------------------
| Paul | 180 |
---------------------
| John | 179 |
---------------------
| George | 177 |
---------------------
| Ringo | 172 |
2. No Mixed Data Types Within the Same Column
Suppose we have a table with some product prices:
| Price | Product |
-------------------
| apple | 0.50 |
-------------------
| Banana| 0.75 |
-------------------
| Orange| N/A |
Here, we have mixed numeric values with a string ("N/A") in the Price column.
This breaks 1NF. We could use the following instead as our datatype:
| Product | Price |
-------------------
| Apple | 0.50 |
-------------------
| Banana | 0.75 |
-------------------
| Orange | NULL |
3. Each table requires a primary key
A PRIMARY KEY can be defined as one or more columns in the table for which the values are used to uniquely identify rows.
For our example, the OrderID can serve as the table requirement of the primary key for Orders.
4. No repeating groups
Imagine a table where we try to store multiple phone numbers of a customer:
| CustomerID | Name | PhoneNumber1 | PhoneNumber2 | PhoneNumber3 |
------------------------------------------------------------------
| 1 | John | 555-1234 | 555-5678 | 555-9012 |
------------------------------------------------------------------
| 2 | Jane | 555-3456 | | |
This isn't in 1NF as we have a repeating group (phone numbers). Instead, we should be breaking out phone numbers into a separate table:
Customers:
| customerID | name |
---------------------
| 1 | John |
---------------------
| 2 | Jane |
PhoneNumbers:
| CustomerID | PhoneNumber |
----------------------------
| 1 | 555-1234 |
----------------------------
| 1 | 555-5678 |
----------------------------
| 1 | 555-9012 |
----------------------------
| 2 | 555-3456 |
So we've applied those rules and taken the first step toward a database that's well organized.
But we're just getting started!
Second Normal Form 2NF
Moving to 2NF, we will more focus on how non-key columns are related to the primary key.
The rule for this form is very simple: every non-key attribute must depend totally on the primary key.
Let's revisit the bookstore example, but this time with a composite primary key:
| order_id | order_book_id | customer_id | book_title | book_author | customer_name | qty |
------------------------------------------------------------------------
| 1 | 101 | 201 | The Great Gatsby | F. Scott Fitzgerald | John Doe | 2 |
------------------------------------------------------------------------
| 1 | 102 | 201 | 1984 | George Orwell | John Doe | 1 |
------------------------------------------------------------------------
| 2 | 101 | 202 | Great Gatsby |F. Scott Fitzgerald | Smith, Jane |1 |
In the table below, the primary key is (order_id, order_book_id), while the other attributes depend on only part of the key:
book_title and book_author depend only on order_book_id
customer_name is functionally dependent on customer_id
This is a violation of 2NF. It can be solved by decomposing the table:
Orders:
| ORDERID | CUSTOMERID |
------------------------
| 1 | 201 |
------------------------
| 2 | 202 |
OrderDetails:
| orderID | bookID | quantity |
-------------------------------
| 1 | 101 | 2 |
-------------------------------
| 1 | 102 | 1 |
-------------------------------
| 2 | 101 | 1 |
Books:
| BookID | Title | Author |
----------------------------------------------------
| 101 | The Great Gatsby | F. Scott Fitzgerald |
----------------------------------------------------
| 102 | 1984 | George Orwell |
Customers:
| CustomerID | CustomerName |
-----------------------------
| 201 | Mr. Smith |
-----------------------------
| 202 | Jane Smith |
Now, the whole primary key of its table is functionally dependent upon each non-key attribute.
Third Normal Form (3NF)
The Third Normal Form takes us another step forward.
There, the keyword is “The key, the whole key, and nothing but the key.”
In the Third Normal Form, each attribute appearing within your table must rely solely on the primary key.
Let's assume the above Customers table already, and then we added in a City for the Customer:
| CustomerID | CustomerName | CustomerCity | StateName |
---------------------------------------------------------
| 201 | John Doe | New York | New York |
---------------------------------------------------------
| 202 | Jane Smith | Los Angeles | California |
This is a breach of 3NF because StateName depends on CustomerCity and not directly on the primary key, i.e. CustomerID.
To correct, put the state information in a separate table:
| CustomerID | CustomerName | CustomerCity |
----------------------------------------------
| 201 | John Doe | New York |
----------------------------------------------
| 202 | Jane Smith | Los Angeles |
| CityName | StateName |
-----------------------------
| New York | New York |
-----------------------------
| Los Angeles | California |
All attributes of each table are dependent only on the primary key of that table.
Fourth Normal Form (4NF)
Fourth Normal Form deals with multi-valued dependencies in the sense that it states multi-valued dependencies are on the key.
4NF states that a table should not have any multi-valued dependencies other than a candidate key.
Let's run through an example.
Suppose we have a table for a website for selling custom birdhouse designs:
| Model | Color | Style |
----------------------------------
| Cottage | Red | Victorian |
----------------------------------
| Cottage | Blue | Victorian |
----------------------------------
| Cottage | Red | Modern |
----------------------------------
| Cottage | Blue | Modern |
----------------------------------
| Lodge | Brown | Rustic |
----------------------------------
| Lodge | Green | Rustic |
----------------------------------
| Lodge | Brown | Alpine |
----------------------------------
| Lodge | Green | Alpine |
This table is not in 4NF because it has more than one independent multivalued fact about the key.
The colours available for a model are independent of the styles available.
To make this 4NF, we bring it down to two tables:
ModelColors:
| Model | Colour |
----------------------
| Cottage | Red |
----------------------
| Cottage | Blue |
----------------------
| Lodge | Brown |
----------------------
| Lodge | Green |
ModelStyles:
| Model | Style |
------------------------
| Cottage | Victorian |
------------------------
| Cottage | Modern |
------------------------
| Lodge | Rustic |
------------------------
| Lodge | Alpine |
Now each table depicts a single multivalued dependency.
Fifth Normal Form (5NF)
At last, we hit the Fifth Normal Form.
A relation is in 5NF when and only when it cannot be decomposed further without losing information, whether or not constraints are used to make the decomposition regain functionality.
5NF deals with join dependencies and ensures that a table cannot be losslessly decomposed any further.
That's a bit abstract; let's work through an example.
Suppose we have the following table, indicating which flavor of ice cream different suppliers provide to different stores:
| Supplier | Flavor | Channel |
----------------------------------
| Frosty's | Vanilla | Store A |
----------------------------------
| Frosty's | Vanilla | Store B |
----------------------------------
| Alpine | Vanilla | Store A |
----------------------------------
| Alpine | Chocolate| Store B |
That's OK but it will lead to some inferences we don't want to make.
Take, for example, the case where we add that Alpine supplies Chocolate to Store A without saying a thing about Frosty's:
This has to be decomposed into three tables to be in 5NF:
SupplierFlavors:
| Vendor | Flavour |
--------------------------
| Frosty's | Vanilla |
--------------------------
| Alpine | Vanilla |
--------------------------
| Alpine | Chocolates |
SupplierStores:
| Vendor | Shop |
----------------------
| Frosty's | Store A |
----------------------
| Frosty's | Store B |
----------------------
| Alpine | Store A |
----------------------
| Alpine | Store B |
StoresFlavors:
| Store | Sample |
--------------------------------
| Store A | Vanilla |
--------------------------------
| Store B | Vanilla |
--------------------------------
| Store B | Frosty's |
Then we could say "Alpine supplies Chocolate to Store A" just by adding one more row to the StoresFlavors table, without saying anything at all about any other suppliers.
The Practical Side of Normalization
However, in the real world, one should never forget that the goal is not always to achieve perfect normalization.
Sometimes, denormalization might work wonders for query simplification or performance.
For instance, when taking a price for an item from a high-traffic e-commerce site, you would put that in the OrderDetails table;
this breaks 3NF because the price in this case is not a property of the product but is, rather, an attribute of that order.
This duplication allows for faster queries to be answered and maintains a history of the price at the time of purchase.
It is important to know the principles of normalization and apply them wisely, depending on your specific needs.
Final Thoughts
Database normalization may seem like hell at first, but it is a really useful skill for developers or data enthusiasts. It draws that very line between a chaotic data dump and an efficient, lean database: something to love working with.
Let's look at the significant features:
First Normal Form (1NF): Eliminate repeating groups, and ensure atomic values.
Second Normal Form: Remove partial dependencies upon a composite key.
Third Normal Form (3NF): Remove the transitive dependencies.
Fourth Normal Form (4NF): This looks at multi-valued dependencies.
Fifth Normal Form (5NF): Handle join dependencies. Remember that normalization is an ongoing task.
Just like organizing your closet: every so often, you need to continue refining the way things are ordered.
These rules should have prepared you to keep your data in order and impress even Marie Kondo herself!
Try creating a database for a project you're working on, and see how you can apply these normalization principles to make it more efficient and maintainable.
Please consider ❤️ liking this article if you enjoyed reading it. Also, you can support me here.
Connect: LinkedIn | Gumroad Shop | Medium | GitHub
Subscribe: Substack Newsletter | Appreciation Tip: Support