recipes_analysis
Introduction
This project examines a recipe dataset containing nutritional and rating information. The central question is: What is the relationship between calories and average ratings?
- Number of rows: 83782
- Relevant columns:
nutrition
: List of all recipes nutritional info (including calories)minutes
: Time required to prepare the recipe.average_rating
: The average user rating for the recipe.
- All columns:
- ‘name’, ‘id’, ‘minutes’, ‘contributor_id’, ‘submitted’, ‘tags’, ‘nutrition’, ‘n_steps’, ‘steps’, ‘description’, ‘ingredients’, ‘n_ingredients’, ‘average_rating’
Data Cleaning
- Extracted calorie information from the nutrition column.
- Dropped rows where the calories, minutes were missing.
- Filtered calorie values to those <= 5000.
- Drop rows where average rating was missing.
Here’s a snapshot of the cleaned DataFrame:
name | id | minutes | contributor_id | submitted | … | |
---|---|---|---|---|---|---|
brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | … | |
in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | … | |
412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | … |
The relevant cleaned data:
calories | minutes | average_rating |
---|---|---|
138.4 | 40 | 4 |
595.1 | 45 | 5 |
194.8 | 40 | 5 |
878.3 | 120 | 5 |
267 | 90 | 5 |
Pivot Table:
calorie_bin | mean | count |
---|---|---|
(0, 500] | 4.62621 | 61076 |
(500, 1000] | 4.62457 | 15458 |
(1000, 1500] | 4.61285 | 2416 |
(1500, 2000] | 4.61402 | 826 |
(2000, 2500] | 4.64426 | 434 |
Framing a Prediction Problem
- Prediction Problem: Can we predict the average rating of a recipe based on its calorie content and the minutes needed to cook the recipe? Type of Prediction: Regression
Baseline Model
- Features:
- Quantitative:
calories
,minutes
- Quantitative:
-
Encoding: StandardScaler for quantitative variables.
- Mean Squared Error: 0.41306454697189354
- R-squared: -3.280752206569204e-05
Final Model
Feature Engineering
- Created two new features to enhance the predictive power of the model:
- Calories per Minute (
calories_per_minute
): Captures the intensity of calorie usage relative to time. - Logarithm of Calories (
log_calories
): Accounts for the non-linear scaling of calories, potentially reducing skewness in the data.
- Calories per Minute (
These features were chosen to better represent the relationships within the dataset, as both calories_per_minute
and log_calories
align with how calorie usage scales with time and energy density.
Model Selection
- Algorithm: Ridge Regression
- Hyperparameters:
alpha
: 0.1 (controls regularization strength)- Polynomial features: Degree = 2 (to capture non-linear relationships)
GridSearchCV was used to tune hyperparameters by performing an exhaustive search over a parameter grid.
Performance Comparison
- Baseline Model:
- Features Used:
calories
andminutes
- MSE: 0.41306454697189354
- Features Used:
- Final Model:
- Features Used:
calories_per_minute
,log_calories
, and their polynomial expansions (degree=2) - MSE: 0.4051743738813599
- R-squared: 0.00141071065875098
- Features Used:
Improvement
- The final model showed a slight improvement in Mean Squared Error (MSE), reducing from 0.413 to 0.405.
- This improvement highlights the value of engineered features like
calories_per_minute
andlog_calories
, which provided a more nuanced understanding of the relationships in the data. - The polynomial expansions allowed the model to account for non-linear interactions, leading to better predictions.