Clean Your Data Using SQL Efficiently in 8 Simple Steps
I know you spend 60–80% of your time cleaning data.
Data professionals clean so much data in their work that they should add “data janitor” to their business cards!
Hehe, ain’t I right?
Alright, aspiring data readers, let’s dive into the fascinating world of data science. Now, imagine data as a raw diamond. It’s valuable, sure, but to truly shine, it needs a good polish. That’s where data cleaning comes in. It’s like the manicure for your data, ensuring it’s accurate, complete, and consistent.
For this article, our magic wand is SQL, a mighty language that can tame the wildest of data sets.
It’s packed with functions and commands that can transform your data from a rough stone into a sparkling gem. So buckle up, because we’re about to embark on a thrilling journey through the step-by-step process of data cleaning. Let’s make that data shine!
Step 1: Understanding the Data
Before diving into the cleaning process, it’s crucial to understand the data you’re working with. Use the DESCRIBE statement to get an overview of the table structure:
DESCRIBE sales_data;
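Note that DESCRIBE is MySQL syntax. If your database doesn’t support it, the standard information_schema catalog gives the same overview; a minimal equivalent, assuming sales_data lives in the default schema:
-- Lists every column with its type and nullability
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'sales_data';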
Step 2: Identifying Missing Values
Missing values can skew your analysis and lead to incorrect conclusions. To identify missing values in SQL, use the IS NULL condition:
SELECT *
FROM sales_data
WHERE column_name IS NULL;
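To gauge how widespread the gaps are, you can also count missing values per column in a single scan. A quick sketch, where amount and region are assumed column names (COUNT(column) skips NULLs, so the difference from COUNT(*) is the missing count):
-- amount and region are placeholder column names
SELECT
    COUNT(*) - COUNT(amount) AS missing_amount,
    COUNT(*) - COUNT(region) AS missing_region
FROM sales_data;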
Step 3: Handling Missing Values
Once you’ve identified missing values, you have several options:
Ignore: Sometimes, if the dataset is large enough, the simplest approach is to ignore rows with missing values.
SELECT *
FROM sales_data
WHERE column_name IS NOT NULL;
Fill: You can fill missing values with a default value using the UPDATE statement:
UPDATE sales_data
SET column_name = 'default_value'
WHERE column_name IS NULL;
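If you’d rather not overwrite the stored data, COALESCE applies the default at query time instead; a read-only alternative to the UPDATE above:
-- Substitute the default only in the result set, leaving the table untouched
SELECT COALESCE(column_name, 'default_value') AS column_name
FROM sales_data;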
Interpolate: For numerical data, you might want to interpolate missing values based on surrounding data points.
WITH RankedData AS (
    SELECT
        id,
        column_name,
        LAG(column_name) OVER (ORDER BY id) AS prev_value,
        LEAD(column_name) OVER (ORDER BY id) AS next_value
    FROM sales_data
),
InterpolatedData AS (
    SELECT
        id,
        column_name,
        CASE
            -- Average the neighbors; dividing by 2.0 avoids integer truncation
            WHEN column_name IS NULL THEN (prev_value + next_value) / 2.0
            ELSE column_name
        END AS interpolated_column
    FROM RankedData
)
SELECT *
FROM InterpolatedData;
In this example, id is assumed to be a column that orders the data. The LAG and LEAD functions are used to get the previous and next values, and the CASE statement then interpolates each missing value as the average of the two. This is a basic form of interpolation and may need to be adjusted based on the context of your data.
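One caveat: the first and last rows have no neighbor on one side, so LAG or LEAD returns NULL and the average stays NULL. A compact fallback, reusing the RankedData CTE from the query above, lets COALESCE try the average first and then whichever neighbor exists:
-- Drop-in replacement for the InterpolatedData step above
SELECT
    id,
    COALESCE(
        column_name,                      -- keep the original value if present
        (prev_value + next_value) / 2.0,  -- otherwise average the neighbors
        prev_value,                       -- fall back to whichever neighbor exists
        next_value
    ) AS interpolated_column
FROM RankedData;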
Step 4: Removing Duplicates
Duplicate records can occur due to data entry errors or during data collection. Use the DISTINCT keyword to find unique records. To delete duplicates, you can use a temporary table (as in the table swap below) or the ROW_NUMBER() function (shown right after it):
-- Create a new table with distinct records
CREATE TABLE temp_sales_data AS
SELECT DISTINCT *
FROM sales_data;
-- Delete the old table
DROP TABLE sales_data;
-- Rename the new table to the original table name
ALTER TABLE temp_sales_data
RENAME TO sales_data;
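The ROW_NUMBER() route mentioned above deletes duplicates in place instead of rebuilding the table. A sketch under two assumptions: id uniquely identifies each row, and order_id plus order_date together define what counts as a duplicate (CTE-based DELETE syntax varies; this form works in PostgreSQL, for example):
WITH Numbered AS (
    SELECT
        id,
        -- Number the copies within each duplicate group, keeping the lowest id
        ROW_NUMBER() OVER (
            PARTITION BY order_id, order_date
            ORDER BY id
        ) AS rn
    FROM sales_data
)
DELETE FROM sales_data
WHERE id IN (SELECT id FROM Numbered WHERE rn > 1);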
Step 5: Standardizing Data Formats
Consistent data formats are essential for analysis. Use the CAST or CONVERT function to standardize data types:
UPDATE sales_data
SET date_column = CAST(date_column AS DATE)
WHERE date_column IS NOT NULL;
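Casting in place only helps if the column’s declared type can actually hold a DATE; a more reliable pattern is to add a properly typed column and populate it from the old one. A sketch, assuming date_column currently stores text such as '2024-01-31' and clean_date is a new column name:
-- Add a column with the correct type, then convert into it
ALTER TABLE sales_data ADD COLUMN clean_date DATE;

UPDATE sales_data
SET clean_date = CAST(date_column AS DATE)
WHERE date_column IS NOT NULL;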
Step 6: Validating Data Ranges
Ensure that numerical values fall within expected ranges using the BETWEEN keyword:
SELECT *
FROM sales_data
WHERE numeric_column NOT BETWEEN expected_minimum AND expected_maximum;
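Once flagged, out-of-range values can be nulled out for later imputation rather than silently kept. A short sketch with assumed names and bounds, treating negative or implausibly large quantities as entry errors:
-- quantity and the 0-10000 range are placeholder assumptions
UPDATE sales_data
SET quantity = NULL
WHERE quantity NOT BETWEEN 0 AND 10000;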
Step 7: Cleaning Strings
Strings often contain leading or trailing spaces or inconsistent capitalization. Use the TRIM, LOWER, and UPPER functions to clean them:
UPDATE sales_data
SET string_column = TRIM(string_column);
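These functions compose, so a single UPDATE can strip whitespace and normalize case at once; for example, with an assumed category column:
-- Trim first, then lowercase the result
UPDATE sales_data
SET category = LOWER(TRIM(category));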
Step 8: Dealing with Outliers
Outliers can be legitimate values or errors. A common statistical detector is the Interquartile Range (IQR): compute the first and third quartiles, set bounds at 1.5 × IQR beyond them, and keep only the rows inside those bounds. PERCENTILE_CONT, used below, is available in PostgreSQL, SQL Server, and Oracle, among others:
WITH Quartiles AS (
    -- PERCENTILE_CONT is an aggregate, so no other bare columns
    -- can appear in this SELECT without a GROUP BY
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_name) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_name) AS Q3
    FROM sales_data
),
Bounds AS (
    SELECT
        Q1,
        Q3,
        Q1 - 1.5 * (Q3 - Q1) AS Lower_Bound,
        Q3 + 1.5 * (Q3 - Q1) AS Upper_Bound
    FROM Quartiles
)
-- Keep only the rows inside the bounds, i.e. filter the outliers out
SELECT sales_data.*
FROM sales_data
JOIN Bounds
    ON sales_data.column_name BETWEEN Bounds.Lower_Bound AND Bounds.Upper_Bound;
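If you’ve confirmed the outliers are errors rather than genuine extremes, the same bounds can drive a DELETE. A sketch reusing the quartile logic above (CTE-based DELETE syntax varies by database; this form works in PostgreSQL):
WITH Quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_name) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_name) AS Q3
    FROM sales_data
)
-- Remove rows beyond 1.5 * IQR on either side
DELETE FROM sales_data
WHERE column_name < (SELECT Q1 - 1.5 * (Q3 - Q1) FROM Quartiles)
   OR column_name > (SELECT Q3 + 1.5 * (Q3 - Q1) FROM Quartiles);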
Conclusion
Data cleaning is an iterative and crucial part of the data analysis process. By following these steps and utilizing SQL’s powerful functions, you can ensure that your data is clean and ready for analysis.
Remember, clean data leads to more accurate and reliable insights, which is the foundation of effective data-driven decision-making.
If you really enjoyed this, consider learning data science with me.