Linear Regression

This page covers simple linear regression with height and weight, train/test splitting, prediction, coefficients, and multivariate regression on car data.

What you should be able to do

Separate input features X from target values y.
Train LinearRegression and predict on test data.
Interpret intercept and coefficient values.
Normalize features before multivariate modeling when scales differ.

Reusable patterns

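For one input feature, scikit-learn expects X as a 2D array of shape (n_samples, 1), so a 1D array must be reshaped with reshape(-1, 1).
A regression score (the R² returned by score) close to 1 means the model explains most of the variation in the data it was scored on.
Scaling reduces differences in magnitude between features. Note that sklearn.preprocessing.normalize, used later on this page, rescales each sample (row) to unit length rather than each feature (column).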
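
As a small illustration of the first pattern, the sketch below uses made-up heights (not taken from the dataset) to show how reshape(-1, 1) turns a 1D array into the single-column 2D array that fit and predict expect.

```python
import numpy as np

# made-up heights in cm, used only for illustration
heights = np.array([170.0, 180.0, 165.0, 190.0])
print(heights.shape)        # (4,)   1D: four values, no column axis

# -1 lets NumPy infer the number of rows; 1 asks for exactly one column
heights_2d = heights.reshape(-1, 1)
print(heights_2d.shape)     # (4, 1) 2D: four samples, one feature
```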

Linear regression

X = input data, features, or independent variables.
y = target value, label, or dependent variable.

In regression, y is a continuous numeric value.
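
A minimal sketch of separating X from y, assuming a small hand-made DataFrame (the column names height_cm and weight_kg are hypothetical and do not come from the dataset used on this page). Selecting with double brackets keeps the features two-dimensional, while single brackets return a one-dimensional target.

```python
import pandas as pd

# made-up measurements, used only for illustration
df = pd.DataFrame({
    "height_cm": [170.0, 180.0, 165.0],
    "weight_kg": [68.0, 82.0, 59.0],
})

X = df[["height_cm"]]   # double brackets keep a 2D DataFrame of features
y = df["weight_kg"]     # single brackets give a 1D Series of continuous targets

print(X.shape)  # (3, 1)
print(y.shape)  # (3,)
```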

Listing 1. Import libraries

# import libraries
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

Listing 2. Fetch the dataset

# fetch the dataset
dataset = datasets.fetch_openml('bodyfat', version = 1)

Listing 3. Display dataset keys

# display dataset keys
dataset.keys()

Expected text output or note

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])

Listing 4. Display feature names, which are columns in the dataset

# display feature names, which are columns in the dataset
# this example uses Height and Weight
dataset.feature_names

Expected text output or note

['Density',
 'Age',
 'Weight',
 'Height',
 'Neck',
 'Chest',
 'Abdomen',
 'Hip',
 'Thigh',
 'Knee',
 'Ankle',
 'Biceps',
 'Forearm',
 'Wrist']

Listing 5. Define data for linear regression

# define data for linear regression
data = dataset.data

Listing 6. First few rows of data

# first few rows of data
data.head()

Expected text output or note

   Density  Age  Weight  Height  Neck  Chest  Abdomen    Hip  Thigh  Knee  \
0   1.0708   23  154.25   67.75  36.2   93.1     85.2   94.5   59.0  37.3   
1   1.0853   22  173.25   72.25  38.5   93.6     83.0   98.7   58.7  37.3   
2   1.0414   22  154.00   66.25  34.0   95.8     87.9   99.2   59.6  38.9   
3   1.0751   26  184.75   72.25  37.4  101.8     86.4  101.2   60.1  37.3   
4   1.0340   24  184.25   71.25  34.4   97.3    100.0  101.9   63.2  42.2   

   Ankle  Biceps  Forearm  Wrist  
0   21.9    32.0     27.4   17.1  
1   23.4    30.5     28.9   18.2  
2   24.0    28.8     25.2   16.6  
3   22.8    32.4     29.4   18.2  
4   24.0    32.2     27.7   17.7

Listing 7. Correlation matrix

# correlation matrix
corr_matrix = data.corr()
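
Besides plotting, a correlation matrix can be read one column at a time. The sketch below is only an illustration on made-up numbers: the column names mirror three bodyfat columns, but the values are invented. It sorts every column by its correlation with Weight.

```python
import pandas as pd

# made-up values; only the column names mirror the bodyfat dataset
toy = pd.DataFrame({
    "Height": [66.0, 68.0, 70.0, 72.0, 74.0],
    "Weight": [150.0, 160.0, 172.0, 181.0, 195.0],
    "Age": [50, 23, 41, 35, 29],
})

corr = toy.corr()  # Pearson correlation by default

# correlation of every column with Weight, strongest first
print(corr["Weight"].sort_values(ascending=False))
```

A column always correlates perfectly with itself, so Weight appears first with a value of 1.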

Listing 8. Display the correlation matrix with a heatmap

# display the correlation matrix with a heatmap
import seaborn as sns
plt.figure(figsize = (10, 8))
sns.heatmap(corr_matrix, annot = True, cmap = "coolwarm", fmt = ".2f")
plt.title("Correlation matrix")
plt.show()

Expected text output or note

<Figure size 1000x800 with 2 Axes>

[visual output omitted; run the code to display the image or chart]

Listing 9. Clean the data

# clean the data
data = data.dropna(subset = ["Height", "Weight"]) # removes rows that have missing values in the columns we use
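
To make the effect of subset concrete, here is a small sketch on a made-up frame. Only missing values in the listed columns cause a row to be dropped, and the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# made-up frame with one missing value in each column
toy = pd.DataFrame({
    "Height": [67.75, np.nan, 66.25, 72.25],
    "Weight": [154.25, 173.25, np.nan, 184.75],
    "Neck":   [36.2, 38.5, 34.0, np.nan],
})

# rows 1 and 2 are dropped; row 3 stays because Neck is not in subset
cleaned = toy.dropna(subset=["Height", "Weight"])
print(cleaned.index.tolist())  # [0, 3]
```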

Listing 10. Convert height and weight to the metric system

# convert height and weight to the metric system
X = data["Height"] * 2.54 # X, or Height, is the input feature; inches to centimeters
y = data["Weight"] * 0.46 # y, or Weight, is the target value; pounds to kilograms, rounded (1 lb is about 0.4536 kg)

Listing 11. Initial visualization before model training

# initial visualization before model training
plt.scatter(X, y)
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
plt.show()

Expected text output or note

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Splitting into training and test sets

Listing 12. Convert X to 2D shape

X = X.values.reshape(-1, 1) # convert X to 2D shape

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
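
The sketch below, on ten made-up samples, shows what test_size = 0.2 and random_state do: 20% of the rows go to the test set, and repeating the call with the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)   # ten made-up samples, one feature
y = np.arange(10) * 2.0            # made-up targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)

# the same random_state gives the same split again
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test2))  # True
```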

Listing 13. Visualize the training set

# visualize the training set
print("Zoomed view")

plt.scatter(X_train, y_train)

plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")

# if outliers exist, the plot may be hard to read, so zooming helps show the main part of the data
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.show()

Expected text output or note

Zoomed view

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Listing 14. Create and train the linear regression model

# create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

Expected text output or note

LinearRegression()

Listing 15. Predict on the test set

# predict on the test set
y_pred = model.predict(X_test) # here the model predicts weights for heights from the test set; y_pred contains predicted weights

y_test contains the actual weights from the dataset.
y_pred contains the weights predicted by the model.

Model evaluation is based on comparing y_test and y_pred.
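
One way to compare the two arrays is with error metrics from sklearn.metrics. This is a sketch on made-up values, not on the results from this page: mean absolute error is the average size of the errors, and root mean squared error weights large errors more heavily.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# made-up actual and predicted weights in kg
y_true = np.array([70.0, 80.0, 90.0, 100.0])
y_hat = np.array([72.0, 78.0, 93.0, 97.0])

mae = mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # square root of the mean squared error

print(mae)             # 2.5
print(round(rmse, 4))  # 2.5495
```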

Listing 16. Show actual values and the regression line for comparison

# show actual values and the regression line for comparison
plt.scatter(X_test, y_test, label = "Actual values")
plt.plot(X_test, y_pred, color = "r", linewidth = 3, label = "Linear regression")
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.legend()
plt.show()

Expected text output or note

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Model evaluation

Listing 17. Print the regression coefficients

# print the regression coefficients
print(f"Slope coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")

Expected text output or note

Slope coefficient: 0.38299146150326047
Intercept: 13.778566972313058

For each additional centimeter of height, the model predicts about 0.383 kg more weight.
The regression equation is approximately: weight = 13.7786 + 0.3830 * height, with weight in kg and height in cm.

Listing 18. Predict for an unknown value

# predict for an unknown value
model.predict([[196]]) # predict the weight for a person who is 196 cm tall

Expected text output or note

array([88.84489343])
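
The value returned by model.predict can be reproduced by hand from the printed slope and intercept. In the sketch below the two numbers are copied from the output of Listing 17, and predict_weight is a hypothetical helper written for this illustration.

```python
# values copied from the printed output of the fitted model
slope = 0.38299146150326047
intercept = 13.778566972313058

def predict_weight(height_cm):
    """Apply weight = intercept + slope * height by hand."""
    return intercept + slope * height_cm

print(round(predict_weight(196), 4))  # 88.8449
```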

Multivariate linear regression: regression with more than one input feature

Listing 19. Fetch the cars dataset

# fetch the cars dataset
dataset = datasets.fetch_openml("cars")

Listing 20. Inspect the cars dataset fields

dataset.keys()

Expected text output or note

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])

Listing 21. Display the data

# display the data
dataset.data

Expected text output or note

      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0   
1    15.0          8         350.0       165.0    3693          11.5   
2    18.0          8         318.0       150.0    3436          11.0   
3    16.0          8         304.0       150.0    3433          12.0   
4    17.0          8         302.0       140.0    3449          10.5   
..    ...        ...           ...         ...     ...           ...   
393  27.0          4         140.0        86.0    2790          15.6   
394  44.0          4          97.0        52.0    2130          24.6   
395  32.0          4         135.0        84.0    2295          11.6   
396  28.0          4         120.0        79.0    2625          18.6   
397  31.0          4         119.0        82.0    2720          19.4   

     model.year  
0            70  
1            70  
2            70  
3            70  
4            70  
..          ...  
393          82  
394          82  
395          82  
396          82  
397          82  

[396 rows x 7 columns]

Listing 22. Store the data in a variable

# store the data in a variable
data = dataset.data

Listing 23. Remove incomplete data

# remove incomplete data
data = data.dropna(subset = ["mpg", "horsepower", "weight", "acceleration"]) # these are the variables used in the model

Listing 24. Select features and the target variable

# select features and the target variable
# this does not use the entire dataset
X = data.iloc[50:100, [3, 4, 5]] # selects the rows at positions 50 to 99 and the columns at positions 3, 4, and 5: horsepower, weight, acceleration
y = data.iloc[50:100, [0]] # selects the rows at positions 50 to 99 and the column at position 0: mpg
# to use the full dataset, write:
# X = data[['horsepower', 'weight', 'acceleration']]
# y = data[['mpg']]
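
Because iloc selects by position and dropna does not renumber the index, the index labels of the selected rows can differ from their positions. The sketch below shows this on a made-up frame that reuses two of the cars column names.

```python
import numpy as np
import pandas as pd

# made-up frame with one missing mpg value
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, np.nan, 16.0, 17.0],
    "horsepower": [130.0, 165.0, 150.0, 150.0, 140.0],
})

toy = toy.dropna(subset=["mpg"])   # drops label 2; remaining labels are not renumbered
print(toy.index.tolist())          # [0, 1, 3, 4]

subset = toy.iloc[2:4, [1]]        # rows at positions 2 and 3, column at position 1
print(subset.index.tolist())       # [3, 4]
print(subset.columns.tolist())     # ['horsepower']
```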

Normalize the data because features often have different scales:

horsepower can be around 50 to 200.
weight can be several thousand.
acceleration can be around 10 to 25.

Listing 25. Display the first few rows

# display the first few rows
X.head()

Expected text output or note

    horsepower  weight  acceleration
51        70.0    2074          19.5
52        76.0    2065          14.5
53        65.0    1773          19.0
54        69.0    1613          18.0
55        60.0    1834          19.0

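Listing 26. Normalize X

# normalize X
# note: by default normalize rescales each row (sample) to unit L2 norm; it does not rescale each column
from sklearn.preprocessing import normalize
X_norm = normalize(X)
X_norm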

Expected text output or note

array([[0.03373051, 0.99938679, 0.00939636],
       [0.03677807, 0.99929882, 0.00701687],
       [0.03663431, 0.99927136, 0.01070849],
       [0.04273569, 0.99902421, 0.01114844],
       [0.03269613, 0.99941171, 0.01035377],
       [0.03578073, 0.99930473, 0.01047864],
       [0.04166607, 0.99910846, 0.00679815],
       [0.03760154, 0.99926087, 0.00799033],
       [0.02394924, 0.99965885, 0.01042235],
       [0.03734811, 0.99926955, 0.00809209],
       [0.03860446, 0.99922712, 0.00740667],
       [0.03857663, 0.99925171, 0.00280557],
       [0.03987689, 0.99920086, 0.00273442],
       [0.03625166, 0.99933737, 0.00326265],
       [0.03702938, 0.99930922, 0.00314629],
       [0.04081543, 0.9991618 , 0.00312918],
       [0.04485001, 0.99899092, 0.00237188],
       [0.0344086 , 0.99940336, 0.00299688],
       [0.03588335, 0.9993514 , 0.00302766],
       [0.0429272 , 0.99907421, 0.00282416],
       [0.04159418, 0.99911782, 0.00578888],
       [0.03851181, 0.99925299, 0.00320932],
       [0.03170666, 0.99949138, 0.00341456],
       [0.03258609, 0.99946199, 0.00372412],
       [0.03676667, 0.99931799, 0.00343156],
       [0.03815788, 0.99925951, 0.00494008],
       [0.0302522 , 0.99951662, 0.00716499],
       [0.02919136, 0.99955243, 0.00654289],
       [0.03150453, 0.99946982, 0.00821857],
       [0.03588421, 0.99933365, 0.00667613],
       [0.04017622, 0.99916503, 0.00742387],
       [0.03867749, 0.99923502, 0.00578169],
       [0.03694245, 0.99929339, 0.00692671],
       [0.04186673, 0.99909237, 0.00785001],
       [0.04264389, 0.99908531, 0.00316783],
       [0.04081543, 0.9991618 , 0.00312918],
       [0.03633488, 0.99933436, 0.00325761],
       [0.03387444, 0.99941967, 0.00358525],
       [0.03968256, 0.99920686, 0.00330688],
       [0.03995181, 0.99919891, 0.00232043],
       [0.03358308, 0.99943232, 0.00268665],
       [0.03618973, 0.9993405 , 0.00297764],
       [0.03538004, 0.99936808, 0.00342007],
       [0.04535969, 0.99896802, 0.00232073],
       [0.0453984 , 0.9989665 , 0.00221948],
       [0.04575138, 0.99894872, 0.0028758 ],
       [0.03362357, 0.9994206 , 0.0052837 ],
       [0.03049176, 0.99951995, 0.00548852],
       [0.0339358 , 0.99940927, 0.00542973],
       [0.02911664, 0.99956111, 0.00545937]])
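
In the output above the weight column is close to 1 in every row because normalize divides each row by that row's own L2 norm. The sketch below reproduces this on the first three rows shown in the X.head() output and, for contrast, applies StandardScaler, which rescales each column. StandardScaler is shown only as a commonly used alternative; it is not the method used on this page, and switching to it would change the results that follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# first three feature rows from the X.head() output: horsepower, weight, acceleration
X = np.array([
    [70.0, 2074.0, 19.5],
    [76.0, 2065.0, 14.5],
    [65.0, 1773.0, 19.0],
])

X_rows = normalize(X)                        # each row divided by its own L2 norm
row_norms = np.linalg.norm(X_rows, axis=1)   # length of every row after normalize
print(row_norms)

X_cols = StandardScaler().fit_transform(X)   # each column rescaled to mean 0, std 1
col_means = X_cols.mean(axis=0)
col_stds = X_cols.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```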

Listing 27. Normalize y

# normalize y by dividing by its largest value, so the targets lie between 0 and 1
y_norm = y/np.amax(y)
y_norm

Expected text output or note

          mpg
51   0.857143
52   0.857143
53   0.885714
54   1.000000
55   0.771429
56   0.742857
57   0.685714
58   0.714286
59   0.657143
60   0.571429
61   0.600000
62   0.371429
63   0.400000
64   0.428571
65   0.400000
66   0.485714
67   0.314286
68   0.371429
69   0.342857
70   0.371429
71   0.542857
72   0.428571
73   0.371429
74   0.371429
75   0.400000
76   0.514286
77   0.628571
78   0.600000
79   0.742857
80   0.628571
81   0.800000
82   0.657143
83   0.800000
84   0.771429
85   0.371429
86   0.400000
87   0.371429
88   0.400000
89   0.428571
90   0.342857
91   0.371429
92   0.371429
93   0.400000
94   0.371429
95   0.342857
96   0.371429
97   0.514286
98   0.457143
99   0.514286
100  0.514286

Listing 28. Split the normalized data

# split the normalized data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_norm, test_size = 0.2, random_state = 42)

Listing 29. Train the multivariate model

# train the multivariate model
car_regression_model = LinearRegression().fit(X_train, y_train)
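
A multivariate model has one coefficient per input feature. The sketch below uses made-up data generated exactly from y = 1 + 2*a - 3*b (the feature names a and b are hypothetical) to show how coef_ and intercept_ line up with the features. With a one-column DataFrame as the target, as in Listing 24, coef_ is a 2D array with one row, so the coefficients are in coef_[0].

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data generated exactly from y = 1 + 2*a - 3*b
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]

model = LinearRegression().fit(X, y)

# one coefficient per feature, in the same order as the columns of X
for name, coef in zip(["a", "b"], model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```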

Listing 30. Score of the car model

# score of the car model: the coefficient of determination (R²) on the test set
car_regression_model.score(X_test, y_test)

Expected text output or note

0.8707399834767127
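
The number returned by score is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up data and checks that it agrees with score; it does not reproduce the value above, which depends on the cars data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data: y is roughly 2 * x + 1 with small deviations
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)      # variation the model does not explain
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X, y)))  # True
```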

Listing 31. Predict on the test set

# predict on the test set
y_pred = car_regression_model.predict(X_test)

Listing 32. Prediction for a new car

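# prediction for a new car: horsepower 100, weight 3000, acceleration 20
# the input is normalized the same way as X, and the result is on the normalized mpg scale
car_regression_model.predict(normalize([[100, 3000, 20]]))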

Expected text output or note

array([[0.60435537]])
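
Because y was divided by its largest value before training, the prediction above is a fraction of that maximum rather than a value in mpg. The sketch below converts it back under one stated assumption: that the largest mpg among the selected rows is 35, which is what the ratios in the Listing 27 output suggest (for example 0.857143 is 30 / 35). In a live session the maximum would be read from y instead of being typed in.

```python
# normalized prediction copied from the output above
pred_norm = 0.60435537

# assumption: 35 mpg is the largest target among the selected rows,
# inferred from the ratios in the normalized y output
max_mpg = 35.0

# undo y_norm = y / max by multiplying with the same maximum
pred_mpg = pred_norm * max_mpg
print(round(pred_mpg, 2))  # 21.15
```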
