4 Effective Techniques for Feature Engineering
The Ultimate Guide to "How to Do Feature Engineering"
You are an engineer, right?
You solve problems in your domain. That is what engineering is.
Then, what is feature engineering?
Imagine we have built a model with the features already in the data, but it does not perform well.
Why? Maybe those features are not informative enough to build an accurate model.
This is a feature problem, right?
So, feature engineering is how we fix this problem: creating new and better features for machine learning models from the existing data.
The only goal of feature engineering is to improve the model’s performance.
However, feature engineering can also be challenging and time-consuming, and it often requires domain knowledge.
Note: Do you want to improve your domain knowledge?
Read my eBook: “Domain Knowledge Handbook for Data Aspirants.”
In this article, I will share four interesting techniques and best practices for feature engineering that can help you in your data science projects.
I won’t bore you much with “what exactly counts as feature engineering?” or “why is it important?”
But if you want to know, comment down below, and I will teach you there. So, let’s get started with:
How do you do feature engineering?
See, there is no definitive answer to “how to do feature engineering.”
It all depends on the type, quality, and characteristics of the data, as well as the problem we are trying to solve and the model we are using.
Yet, there are some techniques that I personally use in my feature engineering process. Here are four of them:
1. Indicator Variables
The first type of feature engineering is done by using indicator variables to isolate key information from our data.
Indicator variables are binary features that indicate the presence or absence of a certain condition.
They help our model focus on what is important by highlighting it beforehand.
Let me explain it to you with some examples of creating indicator variables:
Indicator variables from thresholds: We can create an indicator variable based on a set threshold value of a continuous feature.
For example, if we are predicting real estate prices and we want to capture the effect of large properties, we can create an indicator variable from the “size of the property” feature in the data by setting a condition of “size > 1000 sq ft.”
import pandas as pd

real_estate_df = pd.DataFrame(real_estate_data)

threshold_size = 1000
real_estate_df['large_property'] = (real_estate_df['size'] > threshold_size).astype(int)
Indicator variables from multiple features: We can create an indicator variable based on a combination of two or more features.
For example, if we are predicting customer churn, we can create an indicator variable for “high engagement” by setting a threshold on each of several features, such as the “number of calls,” “number of emails,” and “number of visits,” in the data.
customer_churn_df = pd.DataFrame(customer_churn_data)

threshold_calls = 10
threshold_emails = 5
threshold_visits = 3

customer_churn_df['high_engagement'] = (
    (customer_churn_df['number_of_calls'] > threshold_calls) &
    (customer_churn_df['number_of_emails'] > threshold_emails) &
    (customer_churn_df['number_of_visits'] > threshold_visits)
).astype(int)
Indicator variables from categories: We can create an indicator variable based on a specific category of a categorical feature.
For example, if we are predicting movie ratings and we have a feature for “genre,” we can create an indicator variable for “comedy” to capture the effect of this genre.
movie_ratings_df = pd.DataFrame(movie_ratings_data)

selected_genre = 'comedy'
movie_ratings_df['is_comedy'] = (movie_ratings_df['genre'] == selected_genre).astype(int)
2. Interaction Features
The second type of feature engineering is done by creating interaction features that capture the combined effect of two or more features.
They help our model learn complex non-linear relationships in our data that may not be captured by the individual features.
Now, let me explain it to you with some examples of creating interaction features:
Interaction feature from arithmetic operations: We can create an interaction feature by applying an arithmetic operation (such as addition, subtraction, multiplication, and division) to two or more features.
For example, if we are predicting car prices and there are features like “horsepower” and “torque” in the data, we can create an interaction feature called “power” by multiplying both features.
car_prices_df = pd.DataFrame(car_prices_data)

car_prices_df['power'] = car_prices_df['horsepower'] * car_prices_df['torque']
Interaction feature from polynomial terms: We can create an interaction feature “by raising a feature to a power” or “by creating a polynomial term with two or more features.”
For example, if we are predicting student grades and we have a feature for the “hours of study”, we can create an interaction feature for the “study effect” by squaring the feature or by multiplying it with another feature, such as the “difficulty of the course”.
student_grades_df = pd.DataFrame(student_grades_data)

# Square the study hours, or multiply them with another feature such as course difficulty
student_grades_df['study_squared'] = student_grades_df['hours_of_study'] ** 2
student_grades_df['study_effect'] = student_grades_df['hours_of_study'] * student_grades_df['difficulty_of_course']
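If we want to generate these squared and interaction terms systematically rather than by hand, here is a minimal sketch using scikit-learn's PolynomialFeatures, assuming the same student_grades_df and column names as in the example above:

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Assumed columns from the example above
X = student_grades_df[['hours_of_study', 'difficulty_of_course']]

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),
    index=X.index,
)
# X_poly now holds the original columns, their squares, and the pairwise product,
# e.g. 'hours_of_study^2' and 'hours_of_study difficulty_of_course'.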
Interaction feature from logical operations: We can create an interaction feature by applying a logical operation (such as AND, OR, or XOR) to two or more binary features.
For example, if we are predicting credit risk and we have features for the “income” and the “credit score”, we can create an interaction feature for the “risk level” by using the AND operation on the two features.
credit_risk_df = pd.DataFrame(credit_risk_data)

credit_risk_df['risk_level'] = (
    (credit_risk_df['income'] > 50000) & (credit_risk_df['credit_score'] > 650)
).astype(int)
3. Transformation Features
The third type of feature engineering is done by transforming our features to make them more suitable for our model.
It helps our model handle skewed distributions, outliers, nonlinear relationships, and other issues that may affect its performance.
Now, let me explain it to you with some examples of transformation features:
Transformation feature from log: We can transform a feature by taking the log to reduce its skewness and range, especially if the feature has a “long-tailed distribution” or a “multiplicative relationship” with the target variable.
For example, if we are predicting the number of views of a YouTube video and we have a feature for the “number of subscribers”, we can transform it by taking the log to capture the “exponential growth” of the views.
import numpy as np

youtube_df = pd.DataFrame(youtube_data)

# Use np.log1p instead if the subscriber count can be zero
youtube_df['log_subscribers'] = np.log(youtube_df['number_of_subscribers'])
Transformation feature from square root: We can transform a feature by taking the square root to reduce its skewness and range, especially if the feature has a “right-skewed distribution” or a “quadratic relationship” with the target variable.
For example, if we are predicting the speed of a car and we have a feature for the “distance traveled”, we can transform it by taking the square root to capture the “diminishing returns of the speed”.
car_speed_df = pd.DataFrame(car_speed_data)

car_speed_df['sqrt_distance'] = np.sqrt(car_speed_df['distance_traveled'])
Transformation feature from binning: We can transform a feature by binning it into discrete intervals or categories, especially if the feature has a lot of noise, outliers, or nonlinearity.
For example, if we are predicting the popularity of a song and we have a feature for the “duration”, we can transform it by binning it into short, medium, or long categories to capture the “optimal length of a song.”
song_popularity_df = pd.DataFrame(song_popularity_data)

bins = [0, 180, 240, 9999]  # Adjust the bin edges as needed
labels = ['short', 'medium', 'long']
song_popularity_df['duration_category'] = pd.cut(
    song_popularity_df['duration'], bins=bins, labels=labels, right=False
)
4. Encoding Features
The fourth and final type of feature engineering is done by encoding our features to make them more compatible with our model.
It helps our model handle categorical features, ordinal features, text features, and other types of features that may not be directly usable by our model.
Lastly, let me explain it to you with some examples of encoding features:
Encoding a feature from one-hot encoding: We can encode a feature by using one-hot encoding to create binary features for each category of a nominal feature, especially if the feature has a low cardinality (i.e., a small number of unique values) and no inherent order.
For example, if we are predicting the type of animal and we have a feature for the “colour”, we can encode it by using one-hot encoding to create binary features for each colour such as “black”, “white”, “brown”, etc.
animal_df = pd.DataFrame(animal_data)

one_hot_encoded = pd.get_dummies(animal_df['color'], prefix='color')
animal_df = pd.concat([animal_df, one_hot_encoded], axis=1)
Encoding a feature from ordinal encoding: We can encode a feature by using ordinal encoding to assign a numerical value to each category of an ordinal feature, especially if the feature’s categories have an inherent order.
For example, if we are predicting the quality of a product and we have a feature for the “rating”, we can encode it by using ordinal encoding to assign a numerical value to each rating such as “1 for poor,” “2 for fair,” “3 for good,” “4 for very good,” and “5 for excellent.”
product_rating_df = pd.DataFrame(product_rating_data)

ordinal_mapping = {'poor': 1, 'fair': 2, 'good': 3, 'very good': 4, 'excellent': 5}
product_rating_df['encoded_rating'] = product_rating_df['rating'].map(ordinal_mapping)
Encoding a feature from frequency encoding: We can encode a feature by using frequency encoding to replace each category of a categorical feature with its frequency or proportion in the data, especially if the feature has a high cardinality and no inherent order.
For example, if we are predicting the genre of a book and we have a feature for the “author”, we can encode it by using frequency encoding to replace each author with their “frequency or proportion of books” in the data.
book_genre_df = pd.DataFrame(book_genre_data)

frequency_encoding = book_genre_df['author'].value_counts(normalize=True)
book_genre_df['encoded_author'] = book_genre_df['author'].map(frequency_encoding)
Wrapping it up:
Feature engineering is a crucial step in the data science process that can greatly improve your model performance and insights.
In this article, I have shared four techniques and best practices for feature engineering: creating indicator variables, interaction features, transformation features, and encoding features.
However, feature engineering is not a one-size-fits-all solution, and you should always experiment with different features and evaluate their impact on your model.
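As a minimal sketch of that kind of check, assuming a feature table X, a target y, and a hypothetical engineered column named 'new_feature', we could compare cross-validated scores with and without it:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 'X', 'y', and 'new_feature' are hypothetical names for your feature table,
# target, and the engineered column being evaluated.
model = LogisticRegression(max_iter=1000)

baseline_score = cross_val_score(model, X.drop(columns=['new_feature']), y, cv=5).mean()
feature_score = cross_val_score(model, X, y, cv=5).mean()

print(f"Without the feature: {baseline_score:.3f}")
print(f"With the feature:    {feature_score:.3f}")

If the score barely moves (or drops), the engineered feature is probably not worth keeping.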
I hope this article has given you some ideas for your own feature engineering projects.