Clean Your Data Using SQL Efficiently in 8 Simple Steps
I know you spend 60–80% of your time cleaning data.
Data professionals clean so much data in their work that they should add “data janitor” to their business cards!
Hehe, ain’t I right?
Alright, aspiring data readers, let’s dive into the fascinating world of data science. Now, imagine data as a raw diamond. It’s valuable, sure, but to truly shine, it needs a good polish. That’s where data cleaning comes in. It’s like the manicure for your data, ensuring it’s accurate, complete, and consistent.
For this article, our magic wand is SQL, a mighty language that can tame the wildest of data sets.
It’s packed with functions and commands that can transform your data from a rough stone into a sparkling gem. So buckle up, because we’re about to embark on a thrilling journey through the step-by-step process of data cleaning. Let’s make that data shine!
Step 1: Understanding the Data
Before diving into the cleaning process, it’s crucial to understand the data you’re working with. Use the DESCRIBE statement to get an overview of the table structure:
DESCRIBE sales_data;
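Note that DESCRIBE is MySQL syntax. If your database doesn’t support it, the standard information_schema catalog gives the same overview; a minimal equivalent, assuming sales_data lives in the default schema:
-- Lists every column with its type and nullability
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'sales_data';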
Step 2: Identifying Missing Values
Missing values can skew your analysis and lead to incorrect conclusions. To identify missing values in SQL, use the IS NULL condition:
SELECT *
FROM sales_data
WHERE column_name IS NULL;
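To gauge how widespread the gaps are, you can also count missing values per column in a single scan. A quick sketch, where amount and region are assumed column names (COUNT(column) skips NULLs, so the difference from COUNT(*) is the missing count):
-- amount and region are placeholder column names
SELECT
    COUNT(*) - COUNT(amount) AS missing_amount,
    COUNT(*) - COUNT(region) AS missing_region
FROM sales_data;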
Step 3: Handling Missing Values
Once you’ve identified missing values, you have several options:
Ignore: Sometimes, if the dataset is large enough, the simplest approach is to ignore rows with missing values.
SELECT *
FROM sales_data
WHERE column_name IS NOT NULL;
Fill: You can fill missing values with a default value using the UPDATE statement:
UPDATE sales_data
SET column_name = 'default_value'
WHERE column_name IS NULL;
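If you’d rather not overwrite the stored data, COALESCE applies the default at query time instead; a read-only alternative to the UPDATE above:
-- Substitute the default only in the result set, leaving the table untouched
SELECT COALESCE(column_name, 'default_value') AS column_name
FROM sales_data;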
Interpolate: For numerical data, you might want to interpolate missing values based on surrounding data points.
WITH RankedData AS (
    SELECT
        id,
        column_name,
        LAG(column_name) OVER (ORDER BY id) AS prev_value,
        LEAD(column_name) OVER (ORDER BY id) AS next_value
    FROM sales_data
),
InterpolatedData AS (
    SELECT
        id,
        column_name,
        CASE
            -- Average the neighbors; dividing by 2.0 avoids integer truncation
            WHEN column_name IS NULL THEN (prev_value + next_value) / 2.0
            ELSE column_name
        END AS interpolated_column
    FROM RankedData
)
SELECT *
FROM InterpolatedData;
In this example, id is assumed to be a column that orders the data. The LAG and LEAD functions are used to get the previous and next values, and the CASE statement then interpolates each missing value as the average of the two. This is a basic form of interpolation and may need to be adjusted based on the context of your data.
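One caveat: the first and last rows have no neighbor on one side, so LAG or LEAD returns NULL and the average stays NULL. A compact fallback, reusing the RankedData CTE from the query above, lets COALESCE try the average first and then whichever neighbor exists:
-- Drop-in replacement for the InterpolatedData step above
SELECT
    id,
    COALESCE(
        column_name,                      -- keep the original value if present
        (prev_value + next_value) / 2.0,  -- otherwise average the neighbors
        prev_value,                       -- fall back to whichever neighbor exists
        next_value
    ) AS interpolated_column
FROM RankedData;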
Step 4: Removing Duplicates
Duplicate records can occur due to data entry errors or during data collection. Use the DISTINCT keyword to find unique records. To delete duplicates, you can use a temporary table (as in the table swap below) or the ROW_NUMBER() function (shown right after it):
-- Create a new table with distinct records
CREATE TABLE temp_sales_data AS
SELECT DISTINCT *
FROM sales_data;
-- Delete the old table
DROP TABLE sales_data;
-- Rename the new table to the original table name
ALTER TABLE temp_sales_data
RENAME TO sales_data;
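The ROW_NUMBER() route mentioned above deletes duplicates in place instead of rebuilding the table. A sketch under two assumptions: id uniquely identifies each row, and order_id plus order_date together define what counts as a duplicate (CTE-based DELETE syntax varies; this form works in PostgreSQL, for example):
WITH Numbered AS (
    SELECT
        id,
        -- Number the copies within each duplicate group, keeping the lowest id
        ROW_NUMBER() OVER (
            PARTITION BY order_id, order_date
            ORDER BY id
        ) AS rn
    FROM sales_data
)
DELETE FROM sales_data
WHERE id IN (SELECT id FROM Numbered WHERE rn > 1);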
Step 5: Standardizing Data Formats
Consistent data formats are essential for analysis. Use the CAST or CONVERT function to standardize data types:
UPDATE sales_data
SET date_column = CAST(date_column AS DATE)
WHERE date_column IS NOT NULL;
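Casting in place only helps if the column’s declared type can actually hold a DATE; a more reliable pattern is to add a properly typed column and populate it from the old one. A sketch, assuming date_column currently stores text such as '2024-01-31' and clean_date is a new column name:
-- Add a column with the correct type, then convert into it
ALTER TABLE sales_data ADD COLUMN clean_date DATE;

UPDATE sales_data
SET clean_date = CAST(date_column AS DATE)
WHERE date_column IS NOT NULL;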
Step 6: Validating Data Ranges
Ensure that numerical values fall within expected ranges using the BETWEEN keyword:
SELECT *
FROM sales_data
WHERE numeric_column NOT BETWEEN expected_minimum AND expected_maximum;
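Once flagged, out-of-range values can be nulled out for later imputation rather than silently kept. A short sketch with assumed names and bounds, treating negative or implausibly large quantities as entry errors:
-- quantity and the 0-10000 range are placeholder assumptions
UPDATE sales_data
SET quantity = NULL
WHERE quantity NOT BETWEEN 0 AND 10000;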
Step 7: Cleaning Strings
Strings often contain leading or trailing spaces or inconsistent capitalization. Use the TRIM, LOWER, and UPPER functions to clean them:
UPDATE sales_data
SET string_column = TRIM(string_column);
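These functions compose, so a single UPDATE can strip whitespace and normalize case at once; for example, with an assumed category column:
-- Trim first, then lowercase the result
UPDATE sales_data
SET category = LOWER(TRIM(category));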
Step 8: Dealing with Outliers
Outliers can be legitimate values or errors. A common statistical detector is the Interquartile Range (IQR): compute the first and third quartiles, set bounds at 1.5 × IQR beyond them, and keep only the rows inside those bounds. PERCENTILE_CONT, used below, is available in PostgreSQL, SQL Server, and Oracle, among others:
WITH Quartiles AS (
    -- PERCENTILE_CONT is an aggregate, so no other bare columns
    -- can appear in this SELECT without a GROUP BY
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_name) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_name) AS Q3
    FROM sales_data
),
Bounds AS (
    SELECT
        Q1,
        Q3,
        Q1 - 1.5 * (Q3 - Q1) AS Lower_Bound,
        Q3 + 1.5 * (Q3 - Q1) AS Upper_Bound
    FROM Quartiles
)
-- Keep only the rows inside the bounds, i.e. filter the outliers out
SELECT sales_data.*
FROM sales_data
JOIN Bounds
    ON sales_data.column_name BETWEEN Bounds.Lower_Bound AND Bounds.Upper_Bound;
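If you’ve confirmed the outliers are errors rather than genuine extremes, the same bounds can drive a DELETE. A sketch reusing the quartile logic above (CTE-based DELETE syntax varies by database; this form works in PostgreSQL):
WITH Quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_name) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_name) AS Q3
    FROM sales_data
)
-- Remove rows beyond 1.5 * IQR on either side
DELETE FROM sales_data
WHERE column_name < (SELECT Q1 - 1.5 * (Q3 - Q1) FROM Quartiles)
   OR column_name > (SELECT Q3 + 1.5 * (Q3 - Q1) FROM Quartiles);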
Conclusion
Data cleaning is an iterative and crucial part of the data analysis process. By following these steps and utilizing SQL’s powerful functions, you can ensure that your data is clean and ready for analysis.
Remember, clean data leads to more accurate and reliable insights, which is the foundation of effective data-driven decision-making.
If you really enjoyed this, consider learning data science with me.