Applying EDA to Feature Selection on a Real-World Insurance Dataset


Today, I applied the complete EDA → Data Cleaning → Data Preprocessing → Feature Engineering → Feature Selection workflow on a real-world insurance dataset.

Project Source Code

This hands-on practice helped me clearly understand how raw, messy data is transformed step by step into a format that can be fed into a machine learning model.

More importantly, it showed me why each step exists — not just how to do it.


What I Worked On

I took a dataset of insurance customers and applied the full data preparation pipeline:

  • Exploratory Data Analysis (EDA)
  • Data Cleaning
  • Data Preprocessing
  • Feature Engineering
  • Feature Selection

The goal was not to build a model quickly, but to understand how real-world data behaves and how decisions are made before modeling.


Key Learnings from Exploratory Data Analysis (EDA)

EDA helped me understand:

  • What each feature represents
  • The distribution of numerical variables
  • The behavior of categorical features
  • Missing values and their impact
  • Relationships between features
  • How the target variable behaves
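As a rough sketch, these checks map to a handful of pandas calls. The tiny DataFrame below is a hypothetical stand-in for the dataset; column names like `region` and `charges` are illustrative, not the actual schema.

```python
import pandas as pd

# Toy stand-in for the insurance dataset (hypothetical columns)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 32, None],
    "region": ["north", "south", "south", "east", "south", "north"],
    "charges": [1200.0, 2400.0, 5600.0, 6100.0, 2400.0, 900.0],
})

# What each feature represents: dtypes and non-null counts
df.info()

# Distribution of numerical variables
print(df.describe())

# Behavior of categorical features
print(df["region"].value_counts())

# Missing values and their impact
print(df.isnull().sum())

# Relationships between numerical features
print(df[["age", "charges"]].corr())
```

On a real dataset, each of these outputs feeds directly into the cleaning decisions that come next.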

EDA made one thing very clear:

You cannot clean, preprocess, or engineer features properly if you don’t understand the data first.


Data Cleaning: Fixing What’s Wrong

Based on insights from EDA, I performed data cleaning steps such as:

  • Handling missing values using appropriate strategies
  • Removing duplicate records
  • Fixing incorrect data types
  • Handling inconsistent categorical values
  • Detecting and handling outliers
  • Identifying logical and domain-related errors
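A minimal sketch of those steps, assuming pandas and a hypothetical raw table that exhibits each problem (the column names and the median-imputation choice are illustrative, not the actual dataset or strategy):

```python
import pandas as pd

# Hypothetical raw data with the kinds of issues EDA surfaces
df = pd.DataFrame({
    "age": ["25", "32", "32", "47", "200"],        # wrong dtype + impossible value
    "smoker": ["yes", "Yes", "Yes", "no", "no"],   # inconsistent categories
    "charges": [1200.0, 2400.0, 2400.0, None, 5600.0],
})

# Removing duplicate records
df = df.drop_duplicates()

# Fixing incorrect data types
df["age"] = df["age"].astype(int)

# Handling inconsistent categorical values
df["smoker"] = df["smoker"].str.lower()

# Handling missing values (median imputation as one simple strategy)
df["charges"] = df["charges"].fillna(df["charges"].median())

# Handling domain-related errors (an age of 200 is impossible)
df = df[df["age"].between(0, 110)]
```

The order matters: deduplicate before imputing, so duplicated rows do not bias the median.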

This phase reinforced an important idea:

EDA tells you what’s wrong. Data cleaning fixes it.


Data Preprocessing: Making Data Model-Ready

Once the data was clean, I focused on preprocessing to make it usable for machine learning models.

This included:

  • Encoding categorical variables
  • Feature transformation where required
  • Feature scaling
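These three steps can be sketched with pandas and numpy alone (no scikit-learn needed for the idea). The columns, the log transform, and the manual standardization are illustrative assumptions, not the actual pipeline.

```python
import numpy as np
import pandas as pd

# Clean but not yet model-ready data (hypothetical columns)
df = pd.DataFrame({
    "region": ["north", "south", "east", "south"],
    "age": [25, 32, 47, 51],
    "charges": [1200.0, 2400.0, 5600.0, 6100.0],
})

# Encoding categorical variables (one-hot encoding)
df = pd.get_dummies(df, columns=["region"])

# Feature transformation: log-transform a right-skewed variable
df["log_charges"] = np.log1p(df["charges"])

# Feature scaling: standardize to zero mean, unit variance
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()
```

In practice a library transformer (e.g. scikit-learn's `StandardScaler`) is preferable, because it can be fitted on training data and reused on unseen data.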

I learned that preprocessing is not about fixing mistakes, but about transforming valid data into a usable format.


Feature Engineering & Feature Selection

I also applied basic feature engineering and feature selection concepts:

  • Creating meaningful features from existing ones
  • Removing redundant or low-importance features
  • Using feature selection techniques to reduce noise
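One way to sketch both ideas is to derive a new feature and then drop near-duplicate columns via pairwise correlation. The columns (including the deliberately redundant `age_copy`) and the 0.95 threshold are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned features
df = pd.DataFrame({
    "height_m": [1.5, 1.9, 1.6, 1.85],
    "weight_kg": [60, 90, 75, 70],
    "age": [25, 32, 47, 51],
    "age_copy": [25, 32, 47, 51],   # redundant duplicate of "age"
})

# Creating a meaningful feature from existing ones (BMI)
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Removing redundant features: drop any column that is almost
# perfectly correlated with an earlier column
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
```

Here `age_copy` is removed because it correlates perfectly with `age`, while the engineered `bmi` survives because it carries information neither parent column has on its own.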

This step highlighted how much model performance depends on data representation, not just algorithms.


The Biggest Realization

Working on a real-world dataset gave me a clear picture of how messy and imperfect real data is.

I realized that:

  • Real-world datasets are rarely clean
  • Decisions during EDA affect everything that comes after
  • Data preparation takes more time than modeling
  • Understanding the data deeply is more valuable than rushing to build models

I also ran into confusion around a few concepts, which is expected at this stage. Rather than ignoring those gaps, I plan to close them through further practice and study.


Conclusion

This exercise helped me connect theory with practice and understand the complete data preparation pipeline in a realistic way.

It reinforced the idea that:

  • Good machine learning starts with good data understanding
  • Strong fundamentals matter more than speed
  • Confusion is part of the learning process — clarity comes with practice


What’s Next?

I’ll continue practicing on more real-world datasets and gradually move toward modeling once my data preparation foundation is strong.

For now, the focus remains on learning correctly, not quickly.

Learn More On EDA

