The Data Science Bootcamp is complete. As a reminder, here is my learning path:
- Data Science Bootcamp (completed)
- Core track (12 projects on key DS and ML topics)
- Practice (8 projects)
This project marks my official entry into the field of Machine Learning. While previous modules focused on general-purpose tools and data manipulation, this is where the real journey begins: transitioning from observing static charts to building algorithms capable of predicting future outcomes.
- Foundational ML: Understanding the conceptual shift from Rule-Based to Learning-Based systems.
- Supervised Learning: Practical implementation of Regression and Classification tasks.
- Exploratory Data Analysis (EDA): Identifying patterns, correlations, and anomalies in real-world data.
- Model Benchmarking: Establishing baselines with naive models to measure how much a trained model actually improves on them.
- Performance Metrics: Using MAE and RMSE to quantify and interpret prediction error.
The work began with a theoretical deep dive. I explored the fundamental classification of tasks, determining why predicting rental prices is a Regression problem and defining the boundaries between multiclass and multilabel classification.
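The difference between the three task types is easiest to see in the shape of the target. The snippet below is a minimal sketch with made-up values; the class names and tags are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Regression: one continuous number per listing (hypothetical monthly rents).
y_regression = np.array([3000.0, 2450.0, 5200.0])

# Multiclass: exactly one label per listing, from mutually exclusive classes.
classes = ["low", "medium", "high"]  # hypothetical class names
y_multiclass = np.array([0, 2, 1])   # indices into `classes`

# Multilabel: any number of labels per listing, stored as a 0/1 indicator matrix.
tags = ["doorman", "elevator", "pets_allowed"]  # hypothetical tags
y_multilabel = np.array([
    [1, 1, 0],  # two tags apply
    [0, 0, 0],  # no tags apply
    [1, 1, 1],  # all tags apply
])

labels_per_listing = y_multilabel.sum(axis=1)
print(labels_per_listing)  # -> [2 0 3]
```

A rental price has a natural order and distance (an error of 50 is better than an error of 500), which is why it is treated as a regression target and not as a set of classes.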
One does not simply "feed" raw data into a model. I conducted a rigorous audit of the Kaggle RentHop dataset:
- Target Analysis: Visualized price distributions and identified extreme outliers that could compromise model integrity.
- Data Cleaning: Applied statistical filtering (1st–99th percentile) to strip away noise and focus on the representative data range.
- Correlation Study: Used heatmaps and scatterplots to examine how strongly features like bedrooms or bathrooms correlate with the final cost.
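The cleaning and correlation steps above can be sketched roughly as follows. This is an illustration on synthetic data, not the notebook's actual code: the column names `price` and `bedrooms`, the injected outliers, and the coefficients are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the listings table (all values are made up).
rng = np.random.default_rng(0)
n = 1000
bedrooms = rng.integers(0, 5, size=n)
price = 2000 + 800 * bedrooms + rng.normal(0, 300, size=n)
price[:3] = [4_490_000, 1_150_000, 43]  # inject a few extreme values

df = pd.DataFrame({"bedrooms": bedrooms, "price": price})

# Keep only rows whose price lies between the 1st and 99th percentiles.
low, high = df["price"].quantile([0.01, 0.99])
clean = df[df["price"].between(low, high)]

raw_corr = df["price"].corr(df["bedrooms"])
clean_corr = clean["price"].corr(clean["bedrooms"])

print(f"rows: {len(df)} -> {len(clean)}")
print(f"price/bedrooms correlation: raw={raw_corr:.2f}, cleaned={clean_corr:.2f}")
```

In this toy setup a handful of extreme prices is enough to hide the relationship between bedrooms and price, which is the motivation for filtering before looking at correlations.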
I experimented with data complexity by creating Polynomial Features up to the 10th degree. This stage was essential for understanding the trade-off between model flexibility and computational cost, observing how feature transformations can capture non-linearities.
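To get a feel for the computational cost, it helps to count how many columns a full polynomial expansion produces. For n input features and maximum degree d, the number of terms (including the constant term) is C(n + d, d), which is also what scikit-learn's `PolynomialFeatures` generates with its default settings. The sketch below uses only the standard library; the feature counts are illustrative assumptions, not the ones used in the project.

```python
from math import comb

def n_poly_terms(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `n_features` variables,
    counting the constant (bias) term."""
    return comb(n_features + degree, degree)

# How the column count grows with the degree for a few input sizes.
for n_features in (2, 3, 10):
    counts = [n_poly_terms(n_features, d) for d in (1, 2, 5, 10)]
    print(f"{n_features:2d} input features -> columns at degree 1, 2, 5, 10: {counts}")
```

With two inputs the expansion stays small even at degree 10 (66 columns), while ten inputs at degree 10 already give 184,756 columns, so the flexibility gained from higher degrees comes at a steep price in memory and training time.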
To establish a performance baseline, I trained and compared three distinct approaches:
- Linear Regression: The primary baseline for predicting continuous values.
- Decision Tree Regressor: An introduction to non-linear, tree-based data splitting.
- Naive Models (Mean/Median): A critical "sanity check" to ensure that the developed models provide genuine predictive power over simple averages.
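A comparison of this kind might look like the sketch below. It runs on synthetic data with two assumed features (bedrooms and bathrooms); the coefficients, the tree depth, and the train/test split are illustrative choices, not the settings used in the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic rents: a linear signal plus noise (all numbers are made up).
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([rng.integers(0, 5, n), rng.integers(1, 4, n)])  # bedrooms, bathrooms
y = 1500 + 700 * X[:, 0] + 400 * X[:, 1] + rng.normal(0, 250, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "naive mean": DummyRegressor(strategy="mean"),
    "naive median": DummyRegressor(strategy="median"),
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = (mae, rmse)
    print(f"{name:18s} MAE={mae:8.1f}  RMSE={rmse:8.1f}")
```

`DummyRegressor` ignores the features entirely and always predicts the training mean or median, so any model that cannot beat it on held-out data has learned nothing useful.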
The project concluded with an evaluation using MAE and RMSE. By comparing the errors of the different models on the same data, I identified the most effective one for this dataset. This module taught me that Machine Learning is not just about writing code; it's about asking the right questions and methodically searching for answers within the data.
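For reference, both metrics are simple to write out by hand. The sketch below uses four hypothetical rents to show how they differ: MAE is the average size of a miss, while RMSE squares the misses first and so reacts more strongly to a single large one.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error; large misses are penalized more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical monthly rents: three small misses and one large miss.
actual = [3000, 2500, 4000, 3500]
predicted = [3100, 2400, 4000, 2900]

print(f"MAE  = {mae(actual, predicted):.1f}")   # -> MAE  = 200.0
print(f"RMSE = {rmse(actual, predicted):.1f}")  # -> RMSE = 308.2
```

A large gap between RMSE and MAE is a hint that a few predictions are far off, even if the typical prediction is reasonable.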
- Clone the repository:
      git clone https://github.com/knight99rus/ML1_Introduction.git
      cd ML1_Introduction
- Create and activate a virtual environment (recommended):
      python -m venv venv
      source venv/bin/activate  # For Windows: venv\Scripts\activate
- Install dependencies:
      pip install jupyter pandas numpy scikit-learn matplotlib seaborn scipy statsmodels lightgbm
- Download data:
  - Read the task on the Kaggle competition page.
  - Download the test.json file.
- Launch Jupyter Notebook:
      jupyter notebook
  Open and execute the cells in the project01.ipynb file.