How To Automate Data Science Tasks With Python (Part 1)
In Part 1, we cover loading the data and understanding the data.
Let's skip right to the main topic. We all know that some of the tasks on our data science worklist are repetitive and can be automated.
The reason we want to automate these tasks is simple: to save time by not writing the same code over and over.
But how can we automate them? Using AI tools? No! We will automate the tasks by defining a function and calling it whenever we need it.
Some of the tasks we can automate include:
Data loading and understanding the data
Handling missing values and data transformation
Outlier detection and handling them
Exploratory data visualization
Feature selection and importance
In this article, we will solely talk about loading the data and understanding the data. This is the first part of a series of articles titled “How To Automate Data Science Tasks Using Python.”
One Piece of Advice:
If any task seems repetitive or redundant in your project, always define a function and automate your work.
Before getting into the main topic, I would love it if you checked out my eBooks and supported my work:
Also, get free data science & AI eBooks: https://codewarepam.gumroad.com/
Loading the data:
Data loading is always the initial step in any data science project. This is part of data collection.
Data comes from many kinds of sources and in many formats, but the most commonly used formats are CSV files, Excel files, and SQL databases.
To automate data loading across this variety of formats, we should make sure our function is flexible enough to handle every situation.
Normally, this is how we go about loading data, correct?
1. Loading data from a CSV file:
import pandas as pd
# Load a CSV file
df_csv = pd.read_csv('path/to/your/file.csv')
2. Loading data from Excel file:
import pandas as pd
# Load an Excel file
df_excel = pd.read_excel('path/to/your/file.xlsx')
3. Loading data from a SQL database:
import pandas as pd
from sqlalchemy import create_engine
# Connection details
conn_details = {
'user': 'your_username',
'password': 'your_password',
'host': 'your_host',
'db_name': 'your_database'
}
# Create a connection string
conn_string = f"postgresql://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}/{conn_details['db_name']}"
# Create a SQLAlchemy engine
engine = create_engine(conn_string)
# Load data from the SQL database
df_sql = pd.read_sql('SELECT * FROM your_table', con=engine)
The question now is how we can automate this data-loading process while taking all of these cases into account.
We can use an if-elif-else statement to check the file type and then apply the appropriate loading logic for each case.
Hence, the function goes like this:
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
def load_data(file_path, file_type='csv', sql_query=None, db_type=None, conn_details=None):
    if file_type == 'csv':
        df = pd.read_csv(file_path)
    elif file_type == 'excel':
        df = pd.read_excel(file_path)
    elif file_type == 'sql':
        # Connect via sqlite3 for SQLite, or via SQLAlchemy for other databases
        if db_type == 'sqlite':
            conn = sqlite3.connect(conn_details['db_name'])
        else:
            engine = create_engine(f"{db_type}://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}/{conn_details['db_name']}")
            conn = engine.connect()
        df = pd.read_sql_query(sql_query, conn)
        conn.close()
    else:
        raise ValueError("Unsupported file type. Choose from 'csv', 'excel', 'sql'.")
    return df
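With that single function in place, loading any of the three formats becomes a one-liner. Here is a quick usage sketch; the file paths, table name, and connection details are placeholders:
# Load a CSV file
df_csv = load_data('path/to/your/file.csv', file_type='csv')
# Load an Excel file
df_excel = load_data('path/to/your/file.xlsx', file_type='excel')
# Load from a PostgreSQL table (file_path is unused for SQL sources)
conn_details = {'user': 'your_username', 'password': 'your_password',
                'host': 'your_host', 'db_name': 'your_database'}
df_sql = load_data(None, file_type='sql',
                   sql_query='SELECT * FROM your_table',
                   db_type='postgresql', conn_details=conn_details)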
Understanding the data:
Once the data has been loaded, understanding it is the next essential step.
What do we typically do in a Jupyter Notebook to understand the data?
We try to grasp the structure and contents of the data: the number of rows and columns, the data types, missing values, and a basic descriptive statistics summary.
Do these seem familiar?
1. Shape of the data:
df.shape
2. Listing out the column names:
df.columns
3. Exploring through the columns’ data type
df.dtypes
4. Checking the missing values:
df.isnull().sum()
5. Basic Descriptive Statistics summary of the data:
df.describe(include='all').T
We already know what to do here. To automate this entire process, we combine all of these separate methods into a single function, like this:
def data_summary(df):
    summary = {
        'shape': df.shape,
        'columns': df.columns.tolist(),
        'dtypes': df.dtypes.to_dict(),
        'missing_values': df.isnull().sum().to_dict(),
        'description': df.describe(include='all').T
    }
    return summary
This way, whenever we need to understand a dataset, we can simply call the function “data_summary” and it will immediately give us all of the information we need.
Something like this:
# Load your data
df = pd.read_csv('path/to/your/file.csv')
# Generate summary
summary = data_summary(df)
# Print the summary
print(f"Dataset shape: {summary['shape']}")
print(f"\nColumns: {summary['columns']}")
print("\nData types:")
for col, dtype in summary['dtypes'].items():
    print(f"  {col}: {dtype}")
print("\nMissing values:")
for col, count in summary['missing_values'].items():
    print(f"  {col}: {count}")
print("\nDescriptive statistics:")
print(summary['description'])
Visualizing the data:
By now we have a basic understanding of the data, but it is not enough. You and I both know it.
It is important to understand the data in detail by visualizing its distributions and correlations.
The steps that follow, such as handling missing values, treating outliers, and deeper EDA, all require a thorough grasp of the data.
So, how do we visualize the distributions and correlations? Like this, right?
Pairplot for numerical columns
import matplotlib.pyplot as plt
import seaborn as sns
# Load your data
df = pd.read_csv('path/to/your/file.csv')
sns.pairplot(df.select_dtypes(include=['number']))
plt.show()
Distribution plots for individual numerical columns
import matplotlib.pyplot as plt
import seaborn as sns
# Load your data
df = pd.read_csv('path/to/your/file.csv')
for col in df.select_dtypes(include=['number']).columns:
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()
Now, as before, let’s combine these two and define a function called “visualize_data_distribution” to investigate the numerical data distributions.
def visualize_data_distribution(df):
    # Pairplot for numerical columns
    sns.pairplot(df.select_dtypes(include=['number']))
    plt.show()
    # Distribution plots for individual numerical columns
    for col in df.select_dtypes(include=['number']).columns:
        sns.histplot(df[col], kde=True)
        plt.title(f'Distribution of {col}')
        plt.show()
Later, we can call this function from anywhere, with whatever data we need.
# Load your data
df = pd.read_csv('path/to/your/file.csv')
# Visualize the data distribution
visualize_data_distribution(df)
To visualize the correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Load your data
df = pd.read_csv('path/to/your/file.csv')
plt.figure(figsize=(10, 8))
corr_matrix = df.corr(numeric_only=True)  # restrict to numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
Instead of writing those lines of code everywhere, we can wrap the same code and logic in a function and then call that function wherever necessary. Like this:
def visualize_correlation_matrix(df):
    plt.figure(figsize=(10, 8))
    corr_matrix = df.corr(numeric_only=True)  # restrict to numeric columns
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Matrix')
    plt.show()
# Load your data
df = pd.read_csv('path/to/your/file.csv')
# Visualize the correlation matrix
visualize_correlation_matrix(df)
Wrapping it up:
We have now learned how to automate each step of loading and understanding data. Let’s put everything together in a single function called “load_understand_data.”
Something like this:
def load_understand_data(file_path, file_type='csv', sql_query=None, db_type=None,
                         conn_details=None):
    # Load data
    df = load_data(file_path, file_type, sql_query, db_type, conn_details)
    # Generate summary statistics
    summary = data_summary(df)
    print("Data Summary:")
    print(summary)
    # Visualize data distribution
    print("\nVisualizing Data Distribution...")
    visualize_data_distribution(df)
    # Visualize correlation matrix
    print("\nVisualizing Correlation Matrix...")
    visualize_correlation_matrix(df)
Finally, call this function for each dataset you’re working on:
load_understand_data('your_dataset.csv')
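And because load_data handles multiple sources, the same one-liner works for a SQL table too. Here is a sketch, assuming a PostgreSQL database; the query and connection details are placeholders:
# Placeholder connection details for a hypothetical PostgreSQL database
conn_details = {'user': 'your_username', 'password': 'your_password',
                'host': 'your_host', 'db_name': 'your_database'}
load_understand_data(None, file_type='sql',
                     sql_query='SELECT * FROM your_table',
                     db_type='postgresql', conn_details=conn_details)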
Was this helpful to you? If it was, please consider:
❤ Liking this article.
Subscribe to Your Data Guide for the remaining parts.
If any task seems repetitive or redundant in your project, always define a function and automate your work.
Connect: LinkedIn | Gumroad Shop | Medium | GitHub
Subscribe: Substack Newsletter | Appreciation Tip: Support