How To Perform Statistical Analysis Using Python: The Ultimate Guide
5 Proven Methods that Every Data Science Professional Uses
I don’t believe in definitions. Everyone has their own way of defining “statistical analysis.”
For me, statistical analysis has one and only one goal: to understand the data.
When do we perform this analysis? Right after we gather the data, as we want to familiarize ourselves with it, correct?
In this article, I won’t talk about the different types of statistical analysis, like EDA, inferential, prescriptive, and so on. Let’s keep that for another article.
But if you want to learn about EDA or why statistics are important, read this:
Getting back to this article, here I will outline 5 tried-and-true methods that every data science professional employs to understand their data right after collecting it. So, let’s dive in!
Those 5 methods are:
1. Descriptive Statistics
2. Hypothesis Testing
3. Correlation
4. Regression
5. Visualization
1. Descriptive Statistics
Why do we use descriptive statistics?
Let’s say you just met a stranger at a gathering. You find her interesting, and you want to start a conversation with her. Naturally, you start by asking her name, her interests, and so forth, correct?
What you’ve just done is gain a high-level understanding of her.
Similarly, when we want a high-level understanding of our data, to explore it, analyze it, and communicate it effectively, we use descriptive statistics.
This helps us understand the distribution, variability, and central tendencies of the data.
In other words, descriptive statistics are used to summarize and show the basic features of data, like the mean, median, range, standard deviation, quartiles, etc.
To calculate descriptive statistics, we can use Python libraries like pandas, numpy, and scipy. For example:
# Import the libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
# Load the dataset from a csv file
df = pd.read_csv("data.csv")
# Get the summary statistics using pandas
print(df.describe())
# Keep only the numeric columns for the calculations below
numeric = df.select_dtypes(include=np.number)
# Get the mean of each column using numpy
print(np.mean(numeric, axis=0))
# Get the median of each column using numpy
print(np.median(numeric, axis=0))
# Get the standard deviation of each column using numpy
print(np.std(numeric, axis=0))
# Get the variance of each column using numpy
print(np.var(numeric, axis=0))
# scipy.stats has no mean or median functions, but it provides
# statistics that numpy lacks: mode, skewness, and kurtosis
print(stats.mode(numeric, axis=0))
print(stats.skew(numeric, axis=0))
print(stats.kurtosis(numeric, axis=0))
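The range and the quartiles were also mentioned above; describe() already reports the quartiles, but here is a quick sketch of computing both directly on the same numeric columns:
# Compute the range (max - min) and the quartiles
# of each numeric column directly
print(numeric.max() - numeric.min())        # range
print(numeric.quantile([0.25, 0.5, 0.75]))  # quartiles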
2. Hypothesis Testing
Before embarking on any data science project, it’s crucial to formulate some initial hypotheses about the population, isn’t it?
Having gained a better understanding of your data through descriptive statistics, let’s check whether those preset hypotheses hold for the population, based on this sample data.
Hence, this process of checking is what we call hypothesis testing.
The default assumption (for example, “there is no difference”) is called the null hypothesis (H0), and the claim we accept if the evidence lets us reject H0 is called the alternate hypothesis (H1).
To conduct hypothesis testing, we need to:
Set up a null hypothesis (H0) and an alternate hypothesis (H1)
Then, choose a significance level (alpha)
Next, calculate a test statistic and a p-value
Finally, make the decision based on the p-value, as shown in the sketch below
To understand the whole process, read this: How to Perform Hypothesis Testing Using Python
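To make that final decision step concrete, here is a minimal sketch of the p-value comparison, assuming the conventional alpha of 0.05 and a hypothetical p-value:
# A minimal sketch of the decision step
alpha = 0.05    # significance level
p_value = 0.03  # hypothetical value; in practice, take it from a test below
if p_value < alpha:
    print("Reject the null hypothesis (H0)")
else:
    print("Fail to reject the null hypothesis (H0)")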
For hypothesis testing, we can use Python libraries like scipy, statsmodels, and pingouin. For example:
To perform a one-sample t-test, which tests whether the mean of a sample differs from a given population mean, we can use the ttest_1samp function from scipy.stats or the ttest function from pingouin:
# Import the libraries
from scipy import stats
import pingouin as pg
# Define the sample data and the population mean
data = [1, 2, 3, 4, 5]
popmean = 3.5
# Perform the one-sample t-test using scipy
t, p = stats.ttest_1samp(data, popmean)
print('t = {:.4f}, p = {:.4f}'.format(t, p))
# Perform the one-sample t-test using pingouin
df = pg.ttest(data, popmean)
print(df)
To perform a two-sample t-test, which tests whether the means of two independent groups are equal, we can use the ttest_ind function from scipy.stats, the ttest_ind function from statsmodels.stats.weightstats, or the ttest function from pingouin:
# Import the libraries
from scipy import stats
from statsmodels.stats import weightstats
import pingouin as pg
# Define the sample data for two groups
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
# Perform the two-sample t-test using scipy
t, p = stats.ttest_ind(group1, group2)
print('t = {:.4f}, p = {:.4f}'.format(t, p))
# Perform the two-sample t-test using statsmodels
t, p, df = weightstats.ttest_ind(group1, group2)
print('t = {:.4f}, p = {:.4f}, df = {:.4f}'.format(t, p, df))
# Perform the two-sample t-test using pingouin
df = pg.ttest(group1, group2)
print(df)
3. Correlation
After applying the two methods mentioned above, you have probably discovered a lot about your data. And now comes correlation.
Let’s recall the example above about meeting a stranger at a gathering.
After getting to know her, if you are really interested, you would ask her out for dinner to learn more about her and find out how compatible you two are, right?
Well, that is correlation. In data terms, correlation measures the strength and direction of the linear relationship between two variables.
This measure ranges from -1 to 1: -1 means a perfect negative correlation, 0 means no correlation, and 1 means a perfect positive correlation.
Correlation helps you understand how variables are related to each other and whether one can be used to predict another, though correlation alone does not establish causation.
Learn more about it:
Decoding “CORRELATION” in Data Mining: The Ultimate Guide
To calculate and visualize correlation, we can use Python libraries like numpy, pandas, scipy, seaborn, and matplotlib. But pandas and seaborn are the easiest ways to calculate and visualize it, respectively. For example:
# Get the correlation matrix with pandas (numeric columns only)
df.corr(numeric_only=True)
# Plot a heatmap of the correlation matrix with seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()  # show the plot
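If you also need the p-value for the correlation of one specific pair of variables, scipy provides pearsonr. A minimal sketch, assuming the dataset has numeric columns named “x” and “y” (hypothetical column names):
# Pearson correlation coefficient and its p-value
from scipy.stats import pearsonr
r, p = pearsonr(df["x"], df["y"])  # "x" and "y" are assumed column names
print(f"r = {r:.4f}, p = {p:.4f}")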
4. Regression
By now, you’ve grasped the relationship between the variables in your data, haven’t you?
And if one variable depends on another, it is important to know how much the dependent variable changes as the independent variables change, and whether the independent variables can be used to predict the dependent variable.
Hence, the method to model and understand this dependency relationship between one dependent variable and one or more independent variables is called “regression.”
There are different types of regression, like linear regression, logistic regression, polynomial regression, etc.
To learn the most known regression algorithm, read:
Mathematical Understanding of ML Algorithms: Linear Regression (Part 1/10)
For regression, to calculate and evaluate the relationship, we can use Python libraries like sklearn, scipy, and statsmodels, plus seaborn to visualize it. Generally, sklearn and seaborn are the most used options. Here’s an example:
# Import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Load the dataset from a csv file
df = pd.read_csv("data.csv")
# Define the dependent and independent variables
y = df["y"] # dependent variable
X = df["x"] # independent variable
# Reshape the variables to fit the model
y = y.values.reshape(-1, 1)
X = X.values.reshape(-1, 1)
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Get the model parameters
slope = model.coef_[0][0]
intercept = model.intercept_[0]
r_squared = model.score(X, y)
# Print the model parameters
print(f"slope = {slope:.4f}")
print(f"intercept = {intercept:.4f}")
print(f"r_squared = {r_squared:.4f}")
# Plot the data points and the regression line
sns.regplot(x="x", y="y", data=df, ci=None)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear regression of y on x")
plt.show()
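Since statsmodels is mentioned above but not shown, here is a minimal sketch of the same fit using its OLS class, which also prints a full statistical summary (coefficients, p-values, confidence intervals, R-squared):
# Fit the same linear model with statsmodels
import statsmodels.api as sm
X_const = sm.add_constant(df["x"])  # add an intercept term
ols_model = sm.OLS(df["y"], X_const).fit()
print(ols_model.summary())  # coefficients, p-values, R-squared, etc.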
5. Visualization
The last technique we’re going to discuss is visualization.
You’re likely already familiar with this concept.
The data we’ve examined and analyzed throughout our statistical exploration can be easily and effectively communicated to anyone through graphical representation.
This process of crafting such graphical representations using charts, graphs, and the like is known as visualization.
How to Visualize Data in the Most Effective Way
There are many Python libraries that can be used, like seaborn, matplotlib, plotly, etc. But the ones I use the most are seaborn and matplotlib. For example:
# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns
To create a simple line plot using matplotlib, you can use the plt.plot function:
x = [1, 2, 3, 4, 5] # define the x values
y = [2, 4, 6, 8, 10] # define the y values
plt.plot(x, y) # plot a line plot of x and y
plt.xlabel("x") # label the x-axis
plt.ylabel("y") # label the y-axis
plt.title("Line plot of x and y") # add a title
plt.show() # show the plot
To create a simple line plot using seaborn, you can use the sns.lineplot function:
x = [1, 2, 3, 4, 5] # define the x values
y = [2, 4, 6, 8, 10] # define the y values
sns.lineplot(x=x, y=y) # plot a line plot of x and y (newer seaborn requires keyword arguments)
plt.xlabel("x") # label the x-axis
plt.ylabel("y") # label the y-axis
plt.title("Line plot of x and y") # add a title
plt.show() # show the plot
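plotly, mentioned above, produces an interactive version of the same chart. A minimal sketch using plotly.express:
# An interactive line plot of the same data
import plotly.express as px
fig = px.line(x=[1, 2, 3, 4, 5], y=[2, 4, 6, 8, 10],
              title="Line plot of x and y")
fig.show()  # renders in the browser or notebook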
Conclusion
In this article, you have learned how to use Python to perform statistical analysis effectively. I have covered some of the common statistical tasks, such as descriptive statistics, hypothesis testing, correlation, regression, and visualization, and how to use various libraries and tools to perform and interpret them.