How to Conduct Descriptive Statistics Analysis Effectively
Breaking Down Descriptive Statistics: How to Squeeze More Insights from Your Data
First things first: I'm not going to teach you coding here. Instead, I'll advise you on the questions you should ask during a descriptive statistics analysis.
What are descriptive statistics? They are methods for summarizing and describing a dataset's main characteristics.
The more important question, though, is "What is their ultimate goal?" The purpose of descriptive statistics is to understand the data before any further analysis or model-building activity.
But why exactly do we use descriptive statistics? Because they help us make sense of the data by organizing, summarizing, and presenting it in an understandable format. And before we run any advanced statistical tests, such as inferential statistics, we must first understand what our data is telling us.
As I said, in this article I will walk you through the entire process I generally use to conduct a descriptive analysis. I am confident it will greatly benefit your descriptive statistics workflow.
Step 1: Where’s the Right Data?
In an age where data is everything and everywhere, obtaining the right data for your work is essential.
Why, though? Instead of elaborating, let me ask you a question: do you cook with expired or spoiled ingredients? I'm guessing your answer is "No," and if it's "Yes," I'm speechless! (Please take good care of yourself.) We don't, because it's harmful to our health and makes the food taste worse.
The same principle applies to data: garbage in, garbage out!
The next question, then, is how do we know we have good-quality data?
Simple! Ask yourself these questions:
Does this data actually represent what I want to study?
Is it recent enough to be relevant?
Does it come from reliable sources?
If the answer to all of these questions is "Yes!", you have passed the data quality check.
Remember this: Good analysis starts with good data.
Step 2: Hey, Clean Up The Data!
Good data does not imply clean data. Did you know that?
After you've found solid data, you still need to clean and organize it so that its quality is as good as possible. In return, clean, well-organized data will reward you with outstanding results.
I have already written a few articles about "Data Cleaning/Data Wrangling." I recommend reading the ones listed below to learn the entire coding process. Many readers have found them helpful, and I hope you do too.
To clean up the data, here’s what you should do:
First, look for missing values (those blank spaces really affect your results, trust me!). Ask: how many missing values are there in the data, and how should I handle them?
Once the missing values are handled, it's time to hunt down duplicates and remove them. (Duplicates bias our results too.)
Now, look out for obvious errors in the data (say, someone claiming to be 999 years old!) and fix those.
Lastly, ask "Is all the data in the right format?" and make sure every field is stored in the format it should be.
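If you happen to work in Python with pandas, here is a minimal sketch of these cleaning checks; the toy DataFrame and the name and age columns are purely illustrative stand-ins for your own dataset, and the handling choices shown are just one reasonable option.

```python
import pandas as pd

# Toy data standing in for your real dataset
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cara", None],
    "age":  [29, 41, 41, 999, 35],
})

# 1. Missing values: how many, and in which columns?
print(df.isna().sum())
df = df.dropna(subset=["name"])   # one option: drop rows missing a key field

# 2. Duplicates: count them, then remove them
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# 3. Obvious errors: e.g. an impossible age of 999
df = df[(df["age"] >= 0) & (df["age"] <= 120)]
print(df)
```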
Also, organize the data once it has been cleaned.
Here are the actions you need to take:
Sort your data in a way that makes sense given the problem description.
Grouping similar things is also an effective technique to organize.
You also need to ensure that everything is correctly labeled.
Finally, convert data types as needed (changing text dates into an actual date format, like we normally do!).
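And here is a similarly minimal sketch for the organizing steps, again assuming pandas; the city and signup_date columns are invented for illustration.

```python
import pandas as pd

# Toy data with a text date and a category column
df = pd.DataFrame({
    "city": ["Oslo", "Bergen", "Oslo"],
    "signup_date": ["2024-03-01", "2024-01-15", "2024-02-10"],
})

# Convert text dates into a real datetime type
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Sort in a way that makes sense for the problem (here, chronologically)
df = df.sort_values("signup_date")

# Group similar things and give the result a clear label
signups_per_city = df.groupby("city").size().rename("signups")
print(signups_per_city)
```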
Step 3: The Main Thing - Descriptive Statistics Analysis
Although this is the primary topic of this article, keep in mind that the two steps outlined above have a real impact on the final result.
At this stage of the descriptive statistics analysis, we have finished organizing the data. The next stage is to summarize it and understand it better.
How do we do this? I mostly rely on three approaches/metrics (which I believe are the most useful) to do the work for me. They are:
Count
Middle Ground
Spread
I'll tell you why. These three metrics provide a wide range of information about the data. To make this concrete, I've compiled a set of questions that you should be able to answer before proceeding with any analysis or model building.
Read and analyze the following questions:
How many observations exist in the dataset?
Is the sample size adequate for your analysis?
Is there any missing data or outliers that affect the overall count?
Based on the questions above, you can probably guess how we'll answer them: yes, examining the COUNT tells you how many observations you have and how frequently each value appears in your data. COUNT is especially handy for categorical data.
A piece of personal advice: always build frequency tables and charts to keep track of the COUNT in your data.
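As a small, hypothetical illustration of counting and frequency tables in pandas (the city values below are made-up toy data):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Bergen", "Oslo", "Oslo", "Bergen", "Tromsø"]})

# Total number of observations in the dataset
print("Observations:", len(df))

# Frequency table for a categorical column
print(df["city"].value_counts())

# Relative frequencies (proportions) are often just as useful
print(df["city"].value_counts(normalize=True))
```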
Next set of questions:
What is the dataset’s typical value?
How do the central values vary between groups or categories?
Are the central values influenced by extreme values or outliers?
The Middle Ground is the answer to these questions. What do I mean by "middle ground"? Measures of central tendency.
You can look at your data's middle ground in three ways; each shows your data from a different perspective.
Mean - The well-known, familiar average of all values.
Median - The middle value when everything is lined up in order.
Mode - The most frequently occurring value.
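Here is a tiny pandas sketch of the three measures, using a made-up series of ages with one extreme value so you can see how differently the mean and median react:

```python
import pandas as pd

ages = pd.Series([23, 25, 25, 31, 34, 35, 90])   # toy data with one extreme value

print("Mean:  ", ages.mean())            # ~37.6, pulled upward by the outlier (90)
print("Median:", ages.median())          # 31.0, the middle value, robust to the outlier
print("Mode:  ", ages.mode().tolist())   # [25], the most frequent value(s)
```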
Now for the final group of questions:
How much variation exists in the data?
Are the data points tightly concentrated around the core value or widely dispersed?
How do several datasets compare in terms of variability?
To answer these questions, you must first measure the dispersion of your data. These three metrics will serve you well:
Range - The difference between the highest and lowest values.
Variance - The average squared deviation of the values from the mean.
Standard deviation - The square root of the variance; I find it more user-friendly than variance because it is in the same units as the data.
Aside from these three primary measures, you can also look at quartiles and percentiles to examine specific parts of the distribution.
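And a matching sketch of the spread measures on the same made-up series (note that pandas computes the sample variance and standard deviation, with ddof=1, by default):

```python
import pandas as pd

ages = pd.Series([23, 25, 25, 31, 34, 35, 90])

print("Range:   ", ages.max() - ages.min())
print("Variance:", ages.var())     # sample variance (ddof=1)
print("Std dev: ", ages.std())     # square root of the variance, in the data's units
print(ages.quantile([0.25, 0.5, 0.75]))   # quartiles
print("90th percentile:", ages.quantile(0.9))
```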
Step 4: It’s Time To Present the Data!
As I said earlier, the most effective descriptive analysis involves organizing, summarizing, and presenting the data. So far, by organizing and summarizing our data, we have gained a thorough understanding of its details. Now it's time to show those details visually.
Let me highlight the types of graphs and charts I recommend, based on what you need:
Histograms are used to visualize how values are distributed.
Box plots are useful for identifying outliers and understanding the dispersion.
Scatter plots are used to observe relationships between two variables.
Bar charts are used to compare values across different categories.
“A Good Visual is worth a Thousand numbers”
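If you want a starting point in Python, here is a minimal sketch of all four chart types using pandas and matplotlib; the age, income, and city columns are invented toy data, not a prescription for your own project.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age":    [23, 25, 25, 31, 34, 35, 90, 41, 29, 38],
    "income": [30, 32, 31, 45, 50, 52, 48, 60, 40, 55],
    "city":   ["Oslo", "Bergen", "Oslo", "Oslo", "Bergen",
               "Tromsø", "Oslo", "Bergen", "Tromsø", "Oslo"],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df["age"].plot(kind="hist", ax=axes[0, 0], title="Histogram: age distribution")
df["age"].plot(kind="box", ax=axes[0, 1], title="Box plot: spread and outliers")
df.plot(kind="scatter", x="age", y="income", ax=axes[1, 0], title="Scatter: age vs. income")
df["city"].value_counts().plot(kind="bar", ax=axes[1, 1], title="Bar chart: count per city")
plt.tight_layout()
plt.show()
```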
Wrapping this Up:
How do the steps above make your analysis effective, as the title promises?
The short answer: if you have completed all of the steps outlined in this post, you already know a great deal about your data. Finally, ask these questions based on your project goals:
If your purpose is to identify patterns:
Do some events occur together?
To identify trends:
How do things change over time?
To spot odd things:
What stands out?
Finally, to investigate relationships:
Do some things seem to be connected?
Have you found this article useful? Please let me know in the comments.
Please consider ❤️ liking this article. Also, you can support me here.
I would love it if you check out my eBooks to support me:
Also, get free data science & AI eBooks: https://codewarepam.gumroad.com/
Connect: LinkedIn | Gumroad Shop | Medium | GitHub
Subscribe: Substack Newsletter | Appreciation Tip: Support