ML1: Introduction to Machine Learning — House Pricing

The Data Science Bootcamp is complete. As a reminder, here is my path:

  • Data Science Bootcamp (completed)
  • Core track (12 projects on key DS and ML topics)
  • Practice (8 projects)

This project marks my official entry into the field of Machine Learning. While previous modules focused on general-purpose tools and data manipulation, this is where the real journey begins: transitioning from observing static charts to building algorithms capable of predicting future outcomes.

Topics

  • Foundational ML: Understanding the conceptual shift from Rule-Based to Learning-Based systems.
  • Supervised Learning: Practical implementation of Regression and Classification tasks.
  • Exploratory Data Analysis (EDA): Identifying patterns, correlations, and anomalies in real-world data.
  • Model Benchmarking: Establishing Baselines using naive models to measure true algorithmic value.
  • Performance Metrics: Using MAE and RMSE to quantify and interpret prediction error.

Roadmap

1. Foundations & Hypotheses

The work began with a theoretical deep dive. I explored the fundamental classification of tasks, determining why predicting rental prices is a Regression problem and defining the boundaries between multiclass and multilabel classification.
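The distinction between these task types shows up in the shape of the target. The sketch below is illustrative only: the prices, the `low`/`medium`/`high` labels, and the amenity tags are hypothetical values, not taken from the project notebook. It uses scikit-learn's `type_of_target` helper to report how each target would be interpreted.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# Regression: the target is a continuous number (hypothetical rental prices).
prices = np.array([2500.5, 3100.25, 4200.75])

# Multiclass: exactly one label per sample, chosen from 3+ classes (hypothetical labels).
interest = ["low", "high", "medium"]

# Multilabel: each sample may carry zero, one, or several labels (hypothetical tags).
tags = [["doorman", "elevator"], ["laundry"], []]
mlb = MultiLabelBinarizer()
tag_matrix = mlb.fit_transform(tags)  # one binary column per tag

kinds = {
    "prices": type_of_target(prices),
    "interest": type_of_target(interest),
    "tags": type_of_target(tag_matrix),
}
print(kinds)
print(list(mlb.classes_), tag_matrix.tolist())
```

A continuous target such as price calls for a regressor, while the other two targets call for classifiers that differ in how many labels a single sample may receive.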

2. Statistical Data Auditing

One does not simply "feed" raw data into a model. I conducted a rigorous audit of the Kaggle RentHop dataset:

  • Target Analysis: Visualized price distributions and identified extreme outliers that could compromise model integrity.
  • Data Cleaning: Applied statistical filtering (1st–99th percentile) to strip away noise and focus on the representative data range.
  • Correlation Study: Utilized Heatmaps and Scatterplots to quantify how features like bedrooms or bathrooms actually impact the final cost.
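The cleaning and correlation steps above can be sketched roughly as follows. This is a minimal example on synthetic data, not the notebook's code: the column names `bedrooms` and `price`, the generated values, and the injected outliers are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the listings table (assumed column names).
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=1000)
price = 2000 + 800 * bedrooms + rng.normal(0, 300, size=1000)
df = pd.DataFrame({"bedrooms": bedrooms, "price": price})

# Inject a few extreme values, similar to the outliers the audit is meant to catch.
df.loc[:4, "price"] = 1_000_000.0

# Keep only rows whose price lies between the 1st and 99th percentiles.
low, high = df["price"].quantile([0.01, 0.99])
filtered = df[df["price"].between(low, high)]

# Pearson correlation between a feature and the target, before and after cleaning.
corr_raw = df["bedrooms"].corr(df["price"])
corr_clean = filtered["bedrooms"].corr(filtered["price"])
print(len(df), len(filtered), round(corr_raw, 3), round(corr_clean, 3))
```

Comparing the correlation before and after filtering shows how a handful of extreme prices can mask a relationship that is clear in the representative range.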

3. Feature Generation

I experimented with data complexity by creating Polynomial Features up to the 10th degree. This stage was essential for understanding the trade-off between model flexibility and computational cost, observing how feature transformations can capture non-linearities.
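To make the cost side of that trade-off concrete, the sketch below counts how many columns scikit-learn's `PolynomialFeatures` produces at different degrees. It assumes two numeric input features (the actual number used in the notebook is not stated here), and the sample values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two arbitrary rows with two numeric features each (assumed for illustration).
X = np.array([[1.0, 2.0], [3.0, 1.0]])

feature_counts = {}
for degree in (1, 2, 10):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    feature_counts[degree] = X_poly.shape[1]

# With 2 inputs and no bias column, degree d yields C(d + 2, 2) - 1 features.
print(feature_counts)
```

Even with only two inputs, degree 10 expands the table to 65 columns, and the count grows much faster as more input features are added.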

4. The Model Showdown

To establish a performance baseline, I trained and compared three distinct approaches:

  • Linear Regression: The primary baseline for predicting continuous values.
  • Decision Tree Regressor: An introduction to non-linear, tree-based data splitting.
  • Naive Models (Mean/Median): A critical "sanity check" to ensure that the developed models provide genuine predictive power over simple averages.
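A comparison of this kind might look like the sketch below. It is not the notebook's code: the data is synthetic with a deliberately linear signal, the feature meanings are assumed, and `DummyRegressor` stands in for the naive mean and median models.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two count-like features (think bedrooms, bathrooms) and a noisy price.
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(500, 2)).astype(float)
y = 2000 + 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "naive_mean": DummyRegressor(strategy="mean"),
    "naive_median": DummyRegressor(strategy="median"),
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = (mae, rmse)
    print(f"{name:18s} MAE={mae:8.1f} RMSE={rmse:8.1f}")
```

A trained model is only worth keeping if its errors on the held-out split are clearly lower than those of the naive rows.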

Results

The project concluded with a comprehensive evaluation using MAE and RMSE. By comparing the errors of the trained models against each other and against the naive baselines, I identified the most effective model for this dataset. This module taught me that Machine Learning is not just about writing code; it’s about asking the right questions and methodically searching for answers within the data.
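For reference, both metrics can be computed by hand. The three prices below are made-up numbers chosen only to show how the two metrics react to one large miss.

```python
import numpy as np

# Hypothetical true and predicted monthly rents.
y_true = np.array([3000.0, 2500.0, 4000.0])
y_pred = np.array([3200.0, 2400.0, 3000.0])

errors = y_pred - y_true  # 200, -100, -1000

# MAE: the average absolute error, in the same units as the target.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; large misses weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

print(round(mae, 2), round(rmse, 2))  # 433.33 591.61
```

The single 1000-unit miss pulls RMSE well above MAE, which is why the gap between the two metrics is a useful hint about outlier-sized errors.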

How to Run the Project

  1. Clone the repository:

    git clone https://github.com/knight99rus/ML1_Introduction.git
    cd ML1_Introduction
  2. Create and activate a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # For Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install jupyter pandas numpy scikit-learn matplotlib seaborn scipy statsmodels lightgbm
  4. Download data: obtain the Kaggle RentHop dataset described in the Statistical Data Auditing step.

  5. Launch Jupyter Notebook:

    jupyter notebook

    Open and execute the cells in the project01.ipynb file.

About

Introduction to machine learning with a focus on primary data analysis and its practical application
