1: Introduction
While exploring regression, we briefly mentioned overfitting and the problems it can cause. In this mission, we'll explore how to identify overfitting and what you can do to avoid it. To explore overfitting, we'll use a dataset on cars that contains 7 numerical features that could have an effect on a car's fuel efficiency:
- cylinders -- the number of cylinders in the engine.
- displacement -- the displacement of the engine.
- horsepower -- the horsepower of the engine.
- weight -- the weight of the car.
- acceleration -- the acceleration of the car.
- model year -- the year that car model was released (e.g. 70 corresponds to 1970).
- origin -- where the car was manufactured (0 if North America, 1 if Europe, 2 if Asia).
The mpg column is our target column and is the one we want to predict using the other features.
The dataset is hosted by the University of California, Irvine on its machine learning repository. You'll notice that the Data Folder contains a few different files. We'll be working with auto-mpg.data, which omits the 8 rows containing missing values for fuel efficiency (mpg column).
The code below imports pandas, reads the data into a DataFrame, and cleans up some messy values. Explore the dataset to become more familiar with it.
Instructions
This step is a demo. Play around with code or advance to the next step.
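import pandas as pd

columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
# The file is whitespace-delimited and has no header row.
cars = pd.read_table("auto-mpg.data", sep=r"\s+", names=columns)
# Missing horsepower values are encoded as '?'; drop those rows.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
filtered_cars = cars[cars['horsepower'] != '?'].copy()
filtered_cars['horsepower'] = filtered_cars['horsepower'].astype('float')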
2: Bias And Variance
At the heart of understanding overfitting is understanding bias and variance. Bias and variance make up the 2 observable sources of error in a model that we can indirectly control.
Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.
Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance.
In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.
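To make the two error sources concrete, here is a minimal sketch that uses synthetic data rather than the cars dataset. The sine-shaped signal, the noise level, and the polynomial degrees are all assumptions chosen for illustration. It refits a simple and a more flexible model on many noisy copies of the same data and measures how far the average prediction sits from the signal (squared bias) and how much the predictions move between copies (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed inputs; only the noise changes from one simulated dataset to the next.
x_train = np.linspace(0, 1, 30)
x_eval = np.linspace(0.05, 0.95, 50)

def true_signal(x):
    # Hypothetical "real" relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def bias_and_variance(degree, n_datasets=200, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Draw a fresh noisy dataset and refit the model.
        y_train = true_signal(x_train) + rng.normal(0, noise, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_eval)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the true signal.
    bias_sq = np.mean((mean_pred - true_signal(x_eval)) ** 2)
    # Variance: spread of the predictions across the simulated datasets.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

simple_bias, simple_var = bias_and_variance(degree=1)
complex_bias, complex_var = bias_and_variance(degree=7)
print("degree 1: bias^2 =", simple_bias, "variance =", simple_var)
print("degree 7: bias^2 =", complex_bias, "variance =", complex_var)
```

In this setup the straight-line model should show the larger squared bias and the degree-7 model the larger variance, which is the tradeoff described above.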
3: Bias-Variance Tradeoff
We've discussed before how overfitting generally happens when a model performs well on a training set but doesn't generalize well to new data. A key nuance here is that you should think of overfitting as a relative term. Between any 2 models, one will overfit more than the other one.
Understanding the bias-variance tradeoff is critical to understanding overfitting. Every process has some amount of inherent noise that's unobservable. Overfit models tend to capture the noise as well as the signal in a dataset.
Scott Fortmann-Roe's essay "Understanding the Bias-Variance Tradeoff" has a wonderful image that describes this tradeoff.
We can approximate the bias of a model by training a few different models from the same class (linear regression in this case) using different features on the same dataset and calculating their error scores. For regression, we can use mean absolute error, mean squared error, or R-squared.
We can calculate the variance of the predicted values for each model we train and we'll observe an increase in variance as we build more complex, multivariate models.
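As a rough illustration of these quantities, the sketch below computes mean absolute error, mean squared error, R-squared, and the variance of the predictions with NumPy. The actual and predicted values are made up for the example and do not come from the cars dataset.

```python
import numpy as np

# Made-up actual and predicted mpg values, purely illustrative.
actual = np.array([18.0, 15.0, 36.0, 26.0, 31.0])
predicted = np.array([20.0, 14.0, 33.0, 27.0, 29.0])

errors = actual - predicted

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(errors))

# Mean squared error: average squared error, penalizes large misses more.
mse = np.mean(errors ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Variance of the predicted values (the quantity tracked in this mission).
prediction_variance = np.var(predicted)

print(mae, mse, r_squared, prediction_variance)
```

Lower MAE and MSE indicate a better fit, while R-squared moves in the opposite direction, with values closer to 1 indicating a better fit.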
While an extremely simple, univariate linear regression model will underfit, an extremely complicated, multivariate linear regression model will overfit. Depending on the problem you're working on, there's a happy middle ground that will help you construct reliable and useful predictive models.
Let's first create a function that we can use for training the model and computing the bias and variance values and use it to train some simple, univariate models.
Instructions
- Create a function named train_and_test that:
  - Takes in a list of column names as the sole parameter (cols),
  - Trains a linear regression model using:
    - The columns in cols as the features,
    - The mpg column as the target variable.
  - Uses the trained model to make predictions using the same input it was trained on,
  - Computes the variance of the predicted values and the mean squared error between the predicted values and the actual label (mpg column).
  - Returns the mean squared error value followed by the variance (e.g. return(mse, variance)).
- Use the train_and_test function to train a model using only the cylinders column. Assign the resulting mean squared error value and variance to cyl_mse and cyl_var.
- Use the train_and_test function to train a model using only the weight column. Assign the resulting mean squared error value and variance to weight_mse and weight_var.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic; omit when running as a plain script.
%matplotlib inline

def train_and_test(cols):
    # Split into features & target.
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    # Fit model.
    lr = LinearRegression()
    lr.fit(features, target)
    # Make predictions on training set.
    predictions = lr.predict(features)
    # Compute MSE and Variance.
    mse = mean_squared_error(filtered_cars["mpg"], predictions)
    variance = np.var(predictions)
    return(mse, variance)

cyl_mse, cyl_var = train_and_test(["cylinders"])
weight_mse, weight_var = train_and_test(["weight"])
4: Multivariate Models
Now that we have a function for training a regression model and calculating the mean squared error and variance, let's use it to train and understand more complex models.
Instructions
Use the train_and_test function to train linear regression models using the following columns as the features:
- columns: cylinders, displacement.
  - MSE: two_mse, variance: two_var.
- columns: cylinders, displacement, horsepower.
  - MSE: three_mse, variance: three_var.
- columns: cylinders, displacement, horsepower, weight.
  - MSE: four_mse, variance: four_var.
- columns: cylinders, displacement, horsepower, weight, acceleration.
  - MSE: five_mse, variance: five_var.
- columns: cylinders, displacement, horsepower, weight, acceleration, model year.
  - MSE: six_mse, variance: six_var.
- columns: cylinders, displacement, horsepower, weight, acceleration, model year, origin.
  - MSE: seven_mse, variance: seven_var.
Use print statements or the variable inspector to display each value.
# Our implementation for train_and_test, takes in a list of strings.
def train_and_test(cols):
    # Split into features & target.
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    # Fit model.
    lr = LinearRegression()
    lr.fit(features, target)
    # Make predictions on training set.
    predictions = lr.predict(features)
    # Compute MSE and Variance.
    mse = mean_squared_error(filtered_cars["mpg"], predictions)
    variance = np.var(predictions)
    return(mse, variance)

one_mse, one_var = train_and_test(["cylinders"])
two_mse, two_var = train_and_test(["cylinders", "displacement"])
three_mse, three_var = train_and_test(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_test(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
5: Cross Validation
The multivariate regression models you trained got progressively better at reducing the amount of error.
A good way to detect if your model is overfitting is to compare the in-sample error and the out-of-sample error, or the training error with the test error. So far, we calculated the in-sample error by testing the model over the same data it was trained on. To calculate the out-of-sample error, we need to test the model on data it wasn't trained on. We unfortunately don't have a separate test dataset, so we'll use cross validation instead.
If a model's cross validation error (out-of-sample error) is much higher than the in-sample error, then your data science senses should start to tingle. This is the first line of defense against overfitting and is a clear indicator that the trained model doesn't generalize well outside of the training set.
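The comparison between in-sample and cross validation error can be sketched with plain NumPy. This is only an illustration under stated assumptions: the data is synthetic (not the cars dataset), the helper names fit_linear, predict_linear, and mse are hypothetical, and the folds are built by hand instead of with a library class.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in data (not the cars dataset): 100 rows, 3 features.
n_rows, n_features = 100, 3
X = rng.normal(size=(n_rows, n_features))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n_rows)

def fit_linear(X, y):
    # Ordinary least squares with an intercept column.
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def predict_linear(coefs, X):
    return np.column_stack([np.ones(len(X)), X]) @ coefs

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# In-sample error: train and evaluate on the same rows.
in_sample_mse = mse(y, predict_linear(fit_linear(X, y), X))

# Out-of-sample error: 10-fold cross validation on shuffled row indices.
indices = rng.permutation(n_rows)
folds = np.array_split(indices, 10)
fold_mses = []
for test_idx in folds:
    # Train on every row that is not in the current test fold.
    train_idx = np.setdiff1d(indices, test_idx)
    fold_coefs = fit_linear(X[train_idx], y[train_idx])
    fold_mses.append(mse(y[test_idx], predict_linear(fold_coefs, X[test_idx])))
cv_mse = np.mean(fold_mses)

print("in-sample MSE:", in_sample_mse)
print("cross validation MSE:", cv_mse)
```

A small gap between the two values suggests the model generalizes reasonably well; a large gap is the warning sign described above.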
Let's create a new function to handle performing the cross validation and computing the cross validation error.
Instructions
Create a function named train_and_cross_val that:
- takes in a single parameter (list of column names),
- trains a linear regression model using the features specified in the parameter,
- uses the KFold class to perform 10-fold validation using a random seed of 3 (we use this seed to check your answer),
- calculates the overall mean squared error across all folds and the overall mean variance across all folds,
- returns the overall mean squared error value followed by the overall variance (e.g. return(avg_mse, avg_var)).
Use the train_and_cross_val function to train linear regression models using the following columns as the features:
- the cylinders and displacement columns. Assign the resulting mean squared error value to two_mse and the resulting variance value to two_var.
- the cylinders, displacement, and horsepower columns. Assign the resulting mean squared error value to three_mse and the resulting variance value to three_var.
- the cylinders, displacement, horsepower, and weight columns. Assign the resulting mean squared error value to four_mse and the resulting variance value to four_var.
- the cylinders, displacement, horsepower, weight, and acceleration columns. Assign the resulting mean squared error value to five_mse and the resulting variance value to five_var.
- the cylinders, displacement, horsepower, weight, acceleration, and model year columns. Assign the resulting mean squared error value to six_mse and the resulting variance value to six_var.
- the cylinders, displacement, horsepower, weight, acceleration, model year, and origin columns. Assign the resulting mean squared error value to seven_mse and the resulting variance value to seven_var.
Use the variable display to inspect each value.
# KFold lives in sklearn.model_selection; the older sklearn.cross_validation module was removed.
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

def train_and_cross_val(cols):
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    variance_values = []
    mse_values = []
    # KFold instance.
    kf = KFold(n_splits=10, shuffle=True, random_state=3)
    # Iterate over each fold.
    for train_index, test_index in kf.split(features):
        # Training and test sets.
        X_train, X_test = features.iloc[train_index], features.iloc[test_index]
        y_train, y_test = target.iloc[train_index], target.iloc[test_index]
        # Fit the model and make predictions.
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        # Calculate mse and variance values for this fold.
        mse = mean_squared_error(y_test, predictions)
        var = np.var(predictions)
        # Append to the lists used to calculate overall average mse and variance values.
        variance_values.append(var)
        mse_values.append(mse)
    # Compute average mse and variance values.
    avg_mse = np.mean(mse_values)
    avg_var = np.mean(variance_values)
    return(avg_mse, avg_var)

two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
6: Plotting Cross-Validation Error Vs. Cross-Validation Variance
During cross validation, the more features we added to the model, the lower the mean squared error got. This is a good sign and indicates that the model generalizes well to new data it wasn't trained on. As the mean squared error value went down, however, the variance of the predictions went up. This is to be expected, since the models with lower squared error values had higher model complexity, and more complex models tend to be more sensitive to small variations in input values (that is, they have high variance).
For each model, let's plot the error and variance to get a better idea of the tradeoff as the number of features increased.
Instructions
- On the same Axes instance:
  - Generate a scatter plot with the model's number of features on the x-axis and the model's overall, cross-validation mean squared error on the y-axis. Use red for the scatter dot color.
  - Generate a scatter plot with the model's number of features on the x-axis and the model's overall, cross-validation variance on the y-axis. Use blue for the scatter dot color.
- Use plt.show() to display the scatter plot.
# We've hidden the `train_and_cross_val` function to save space but you can still call the function!
import matplotlib.pyplot as plt
# IPython/Jupyter magic; omit when running as a plain script.
%matplotlib inline

two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])

# Red dots: cross-validation MSE; blue dots: cross-validation variance.
plt.scatter([2,3,4,5,6,7], [two_mse, three_mse, four_mse, five_mse, six_mse, seven_mse], c="red")
plt.scatter([2,3,4,5,6,7], [two_var, three_var, four_var, five_var, six_var, seven_var], c="blue")
plt.show()
7: Conclusion
While the higher order multivariate models overfit in relation to the lower order multivariate models, the in-sample error and out-of-sample error didn't deviate by much. The best model was around 50% more accurate than the simplest model. On the other hand, the overall variance increased around 25% as we increased the model complexity. This is a really good starting point, but your work is not done! The increased variance with the increased model complexity means that your model will have more unpredictable performance on truly new, unseen data.
If you were working on this problem on a data science team, you'd need to confirm the predictive accuracy of the model using completely new, unobserved data (e.g. maybe from cars from later years). Since often you can't wait until a model is deployed in the wild to know how well it works, the exploration we did in this mission helps you approximate a model's real world performance.
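One way to approximate a check against "later" data is to hold out the most recent model years and train only on the earlier ones. The sketch below assumes synthetic stand-in columns (not the real cars data), a hypothetical cutoff_year, and a single weight feature; it only illustrates the shape of such a check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in columns, not the real cars data.
n_rows = 200
model_year = rng.integers(70, 83, size=n_rows)  # years 70 through 82
weight = rng.uniform(1500, 5000, size=n_rows)
mpg = 45 - 0.006 * weight + rng.normal(scale=2.0, size=n_rows)

# Hypothetical cutoff: train on earlier years, hold out later years.
cutoff_year = 79
train_mask = model_year < cutoff_year
test_mask = ~train_mask

# Fit ordinary least squares (intercept + weight) on the earlier years only.
design = np.column_stack([np.ones(n_rows), weight])
coefs, *_ = np.linalg.lstsq(design[train_mask], mpg[train_mask], rcond=None)

# Evaluate on the held-out later years.
holdout_pred = design[test_mask] @ coefs
holdout_mse = np.mean((mpg[test_mask] - holdout_pred) ** 2)
print("holdout MSE on later model years:", holdout_mse)
```

Unlike shuffled cross validation, this split never lets the model see rows from the held-out years, which is closer to how a deployed model would meet new cars.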
8: Next Steps
In this mission, we explored overfitting at a deeper level and introduced related terminology that you'll see in other literature as well. So far, we've mostly dealt with supervised machine learning models to solve regression and classification problems. In the next mission, we'll explore an unsupervised machine learning technique called k-means clustering.