An end-to-end Machine Learning project to predict residential real estate prices using a regression-based pipeline. This project covers the full lifecycle from data cleaning and outlier removal to comparative analysis between Linear and Tree-based models.
## Project Overview
The goal of this project is to build a reliable predictive model for house prices using the Washington State dataset. We addressed common real-world data issues such as skewed distributions, outliers, and categorical encoding to find the most accurate estimator.
## Tech Stack

- Language: Python 3.x
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
- Models Tested: Linear (Ridge) Regression, Decision Tree, Random Forest
## The Pipeline

### 1. Data Cleaning & Preprocessing
- Outlier Removal: Applied the IQR (Interquartile Range) Method to filter out extreme price outliers that skew model learning.
- Target Analysis: Identified that house prices were heavily right-skewed.
- Log Transformation: Applied `np.log1p` to the target variable (`price`) to correct the skewness and bring it closer to a normal distribution for better regression performance.
- Feature Engineering:
  - Created `is_renovated` as a binary flag.
  - Dropped redundant columns (`sqft_above`, `sqft_basement`) to prevent multicollinearity.
  - Removed low-variance or high-cardinality noise (`street`, `country`, `date`).
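The cleaning and feature-engineering steps above can be sketched roughly as below. Column names such as `yr_renovated` (used to derive `is_renovated`) are assumptions based on the typical Washington housing dataset, not confirmed by this project:

```python
import numpy as np
import pandas as pd

def clean_housing_data(df: pd.DataFrame) -> pd.DataFrame:
    """IQR outlier removal, log1p target transform, and feature engineering."""
    # Filter extreme price outliers with the 1.5 * IQR rule
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

    # Correct the right-skewed target
    df["price"] = np.log1p(df["price"])

    # Binary renovation flag (assumes a yr_renovated column with 0 = never renovated)
    df["is_renovated"] = (df["yr_renovated"] > 0).astype(int)

    # Drop redundant and noisy columns
    return df.drop(columns=["sqft_above", "sqft_basement", "street", "country", "date"])
```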
### 2. Feature Engineering
We implemented a `ColumnTransformer` to automate the preprocessing:

- Numerical Features: Scaled using `StandardScaler` (mean 0, standard deviation 1).
- Categorical Features: Transformed `city` and `statezip` using `OneHotEncoder` to handle 100+ unique locations.
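A minimal sketch of that transformer; the numeric column subset shown here is illustrative, not the project's full list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Illustrative column lists (the numeric subset is an assumption)
numeric_cols = ["sqft_living", "bedrooms", "bathrooms"]
categorical_cols = ["city", "statezip"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" keeps unseen cities at inference from raising
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# Tiny demo frame showing the transformer's output width:
demo = pd.DataFrame({
    "sqft_living": [1000, 2000, 1500],
    "bedrooms": [2, 3, 3],
    "bathrooms": [1.0, 2.0, 1.5],
    "city": ["Seattle", "Bellevue", "Seattle"],
    "statezip": ["WA 98101", "WA 98004", "WA 98101"],
})
X = preprocessor.fit_transform(demo)  # 3 scaled numerics + 2 + 2 one-hot columns
```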
### 3. Model Comparison & Metrics
We split the data into 80% training and 20% testing sets.
| Model | R² Score | Mean Absolute Error (MAE) | Performance |
|---|---|---|---|
| Linear (Ridge) | 0.78 | ~$73,000 | Best Baseline |
| Random Forest | 0.72 | ~$78,000 | Good (Low Variance) |
| Decision Tree | 0.46 | ~$107,000 | Poor (High Overfit) |
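The comparison loop might look like the sketch below, using synthetic data in place of the preprocessed features. One detail it makes explicit, as an assumption about the project's methodology: since the target was `log1p`-transformed, predictions are inverted with `np.expm1` so the MAE can be reported in dollars.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for the preprocessed features and log1p-scaled price target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
price = 500_000 + 50_000 * X[:, 0] + 10_000 * rng.normal(size=500)
y_log = np.log1p(price)

# 80% train / 20% test split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)

models = {
    "Linear (Ridge)": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred_log = model.predict(X_test)
    r2 = r2_score(y_test, pred_log)
    # Invert log1p so the error is in dollars
    mae = mean_absolute_error(np.expm1(y_test), np.expm1(pred_log))
    print(f"{name}: R2={r2:.2f}, MAE=${mae:,.0f}")
```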
## Critical Thinking: Why These Results?
- Random Forest vs. Decision Tree: A single Decision Tree overfit the training data by memorizing specific house details. The Random Forest used an ensemble approach (bagging), averaging 100 trees to cancel out that noise and substantially improve the R² score.
- Linear Model Success: Because one-hot encoding the zip codes produced a large number of sparse, high-dimensional features, the linear Ridge model handled them more efficiently than the tree-based models.
## Future Improvements
- Implement XGBoost or LightGBM to test gradient boosting against the current models.
- Hyperparameter tuning using `GridSearchCV`.
- Feature importance analysis to identify the top price drivers.
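A possible starting point for the tuning step, shown on synthetic data; the parameter grid is deliberately tiny and illustrative, and would be widened on the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; feature 0 carries the signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Illustrative grid; widen the ranges on the real data
param_grid = {
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # negated so "higher is better"
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

The fitted `search.best_estimator_` also exposes `feature_importances_`, which covers the planned feature-importance analysis in the same pass.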