Linear Regression
This page covers simple linear regression with height and weight, train/test splitting, prediction, coefficients, and multivariate regression on car data.
What you should be able to do
- Separate input features X from target values y.
- Train LinearRegression and predict on test data.
- Interpret intercept and coefficient values.
- Normalize features before multivariate modeling when scales differ.
Reusable patterns
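- For one input feature, scikit-learn expects X as a 2D array of shape (n_samples, 1), so a 1D array or Series must be reshaped with reshape(-1, 1).
- The regression score is the coefficient of determination R²; a value near 1 means a better fit on the tested data.
- Normalization reduces scale differences between features; check whether the chosen function works per feature (column) or per sample (row).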
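The reshape pattern from the list above can be sketched as follows. The height values are hypothetical and serve only to show how the shape changes.

```python
import numpy as np

# hypothetical heights in cm; a 1D array with shape (4,)
heights = np.array([170.0, 182.5, 165.0, 190.0])

# -1 lets NumPy infer the number of rows; 1 fixes a single column
X = heights.reshape(-1, 1)

print(heights.shape)  # (4,)
print(X.shape)        # (4, 1)
```

The values are unchanged; only the layout moves from one flat sequence to one row per sample.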
Linear regression
- X = input data, features, or independent variables.
- y = target value, label, or dependent variable.
In regression, y is a continuous numeric value.
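As a minimal sketch of separating X from y, assume a small hypothetical DataFrame whose column names mirror the ones used on this page. Selecting with double brackets is an alternative to reshaping, because it keeps X two-dimensional.

```python
import pandas as pd

# hypothetical toy data; the column names mirror the ones used on this page
df = pd.DataFrame({
    "Height": [170.0, 182.5, 165.0, 190.0],
    "Weight": [68.0, 85.5, 59.0, 92.0],
})

X = df[["Height"]]  # double brackets keep X as a 2D DataFrame
y = df["Weight"]    # single brackets give a 1D Series of continuous values

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)
```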
Listing 1. Import libraries
# import libraries
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
Listing 2. Fetch the dataset
# fetch the dataset
dataset = datasets.fetch_openml('bodyfat', version = 1)
Listing 3. Display dataset keys
# display dataset keys
dataset.keys()
Expected text output or note
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])
Listing 4. Display feature names, which are columns in the dataset
# display feature names, which are columns in the dataset
# this example uses Height and Weight
dataset.feature_names
Expected text output or note
['Density',
'Age',
'Weight',
'Height',
'Neck',
'Chest',
'Abdomen',
'Hip',
'Thigh',
'Knee',
'Ankle',
'Biceps',
'Forearm',
'Wrist']
Listing 5. Define data for linear regression
# define data for linear regression
data = dataset.data
Listing 6. First few rows of data
# first few rows of data
data.head()
Expected text output or note
   Density  Age  Weight  Height  Neck  Chest  Abdomen    Hip  Thigh  Knee  \
0   1.0708   23  154.25   67.75  36.2   93.1     85.2   94.5   59.0  37.3
1   1.0853   22  173.25   72.25  38.5   93.6     83.0   98.7   58.7  37.3
2   1.0414   22  154.00   66.25  34.0   95.8     87.9   99.2   59.6  38.9
3   1.0751   26  184.75   72.25  37.4  101.8     86.4  101.2   60.1  37.3
4   1.0340   24  184.25   71.25  34.4   97.3    100.0  101.9   63.2  42.2

   Ankle  Biceps  Forearm  Wrist
0   21.9    32.0     27.4   17.1
1   23.4    30.5     28.9   18.2
2   24.0    28.8     25.2   16.6
3   22.8    32.4     29.4   18.2
4   24.0    32.2     27.7   17.7
Listing 7. Correlation matrix
# correlation matrix
corr_matrix = data.corr()
Listing 8. Display the correlation matrix with a heatmap
# display the correlation matrix with a heatmap
import seaborn as sns
plt.figure(figsize = (10, 8))
sns.heatmap(corr_matrix, annot = True, cmap = "coolwarm", fmt = ".2f")
plt.title("Correlation matrix")
plt.show()
Expected text output or note
<Figure size 1000x800 with 2 Axes>
[visual output omitted; run the code to display the image or chart]
Listing 9. Clean the data
# clean the data
data = data.dropna(subset = ["Height", "Weight"]) # removes rows that have missing values in the columns we use
Listing 10. Convert height and weight to the metric system
# convert height and weight to the metric system
X = data["Height"] * 2.54 # X, or Height, is the input feature; inches to centimeters
y = data["Weight"] * 0.46 # y, or Weight, is the target value; pounds to kilograms, with 0.46 as a rounded factor (the exact value is about 0.4536)
Listing 11. Initial visualization before model training
# initial visualization before model training
plt.scatter(X, y)
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>
[visual output omitted; run the code to display the image or chart]
Splitting into training and test sets
Listing 12. Convert X to 2D shape and split the data
X = X.values.reshape(-1, 1) # convert X to 2D shape
from sklearn.model_selection import train_test_split
# hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Listing 13. Visualize the training set
# visualize the training set
print("Zoomed in")
plt.scatter(X_train, y_train)
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
# if outliers exist, the plot may be hard to read, so zooming helps show the main part of the data
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.show()
Expected text output or note
Zoomed in
<Figure size 640x480 with 1 Axes>
[visual output omitted; run the code to display the image or chart]
Listing 14. Create and train the linear regression model
# create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
Expected text output or note
LinearRegression()
Listing 15. Predict on the test set
# predict on the test set
y_pred = model.predict(X_test) # the model predicts weights for the heights in the test set
- y_test contains the actual weights from the dataset.
- y_pred contains the weights predicted by the model.
Model evaluation is based on comparing y_test and y_pred.
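One way to make that comparison concrete is to compute error metrics from the two arrays. The sketch below uses hypothetical values rather than the page's data, and the choice of mean absolute error alongside R² is an assumption for illustration; the page itself only calls model.score().

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# hypothetical actual and predicted weights in kg
y_test = np.array([70.0, 82.0, 65.0, 90.0])
y_pred = np.array([72.0, 80.0, 66.0, 87.0])

mae = mean_absolute_error(y_test, y_pred)  # average absolute error in kg
r2 = r2_score(y_test, y_pred)              # the same metric that LinearRegression.score() reports

print(mae)
print(r2)
```

The mean absolute error stays in the units of the target, which makes it easy to read, while R² is unitless and comparable across targets.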
Listing 16. Show actual values and the regression line for comparison
# show actual values and the regression line for comparison
plt.scatter(X_test, y_test, label = "Actual values")
plt.plot(X_test, y_pred, color = "r", linewidth = 3, label = "Linear regression")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.legend()
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>
[visual output omitted; run the code to display the image or chart]
Model evaluation
Listing 17. Print the regression coefficients
# print the regression coefficients
print(f"Slope coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")
Expected text output or note
Slope coefficient: 0.38299146150326047
Intercept: 13.778566972313058
- For each additional centimeter of height, the model predicts about 0.383 kg more weight.
- The intercept, about 13.78 kg, is the predicted weight at a height of 0 cm; it positions the line and has no physical meaning on its own.
- The regression equation is approximately: weight = 13.7786 + 0.3830 * height, with weight in kg and height in cm.
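The equation can be checked by hand. This short sketch copies the slope and intercept from the printed output above and evaluates the line at 196 cm, the same height that the next listing passes to model.predict(). The variable names are illustrative.

```python
# slope and intercept copied from the printed output above
slope = 0.38299146150326047      # kg per cm
intercept = 13.778566972313058   # kg

height_cm = 196
weight_kg = intercept + slope * height_cm

print(round(weight_kg, 2))  # 88.84
```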
Listing 18. Predict for an unknown value
# predict for an unknown value
model.predict([[196]]) # predict the weight for a person who is 196 cm tall
Expected text output or note
array([88.84489343])
Multivariate linear regression: regression with more than one input feature
Listing 19. Fetch the cars dataset
# fetch the cars dataset
dataset = datasets.fetch_openml("cars")
Listing 20. Display dataset keys
# display dataset keys
dataset.keys()
Expected text output or note
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])
Listing 21. Display the data
# display the data
dataset.data
Expected text output or note
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0
1    15.0          8         350.0       165.0    3693          11.5
2    18.0          8         318.0       150.0    3436          11.0
3    16.0          8         304.0       150.0    3433          12.0
4    17.0          8         302.0       140.0    3449          10.5
..    ...        ...           ...         ...     ...           ...
393  27.0          4         140.0        86.0    2790          15.6
394  44.0          4          97.0        52.0    2130          24.6
395  32.0          4         135.0        84.0    2295          11.6
396  28.0          4         120.0        79.0    2625          18.6
397  31.0          4         119.0        82.0    2720          19.4

     model.year
0            70
1            70
2            70
3            70
4            70
..          ...
393          82
394          82
395          82
396          82
397          82

[396 rows x 7 columns]
Listing 22. Store the data in a variable
# store the data in a variable
data = dataset.data
Listing 23. Remove incomplete data
# remove incomplete data
data = data.dropna(subset = ["mpg", "horsepower", "weight", "acceleration"]) # these are the variables used in the model
Listing 24. Select features and the target variable
# select features and the target variable
# this does not use the entire dataset
X = data.iloc[50:100, [3, 4, 5]] # selects the rows at positions 50 to 99 and the columns at positions 3, 4, and 5: horsepower, weight, acceleration
y = data.iloc[50:100, [0]] # selects the rows at positions 50 to 99 and column 0: mpg
# iloc selects by position, so the index labels in later output (51 to 100) need not match the positions 50 to 99
# to use the full dataset, write:
# X = data[['horsepower', 'weight', 'acceleration']]
# y = data[['mpg']]
Normalize the data because features often have different scales:
- horsepower can be around 50 to 200.
- weight can be several thousand.
- acceleration can be around 10 to 25.
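It helps to see what a given scaling function does to ranges like these. The sketch below contrasts normalize(), which the page uses, with StandardScaler. StandardScaler is not used on this page; it appears here only as an assumed point of comparison, and the three rows are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# hypothetical rows: horsepower, weight, acceleration
X = np.array([
    [70.0, 2074.0, 19.5],
    [150.0, 3436.0, 11.0],
    [100.0, 3000.0, 20.0],
])

# normalize() scales each ROW to unit L2 norm, so the large weight value dominates every row
X_rows = normalize(X)
print(np.linalg.norm(X_rows, axis=1))  # each row has length 1

# StandardScaler scales each COLUMN to mean 0 and standard deviation 1
X_cols = StandardScaler().fit_transform(X)
print(X_cols.mean(axis=0))  # approximately 0 for every column
print(X_cols.std(axis=0))   # approximately 1 for every column
```

After row-wise normalization the weight column stays close to 1 in every row, which matches the pattern in the X_norm output on this page. Column-wise scaling is the usual choice when the goal is to put features on comparable scales.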
Listing 25. Display the first few rows
# display the first few rows
X.head()
Expected text output or note
    horsepower  weight  acceleration
51        70.0    2074          19.5
52        76.0    2065          14.5
53        65.0    1773          19.0
54        69.0    1613          18.0
55        60.0    1834          19.0
Listing 26. Normalize X
# normalize X
# note: normalize() rescales each row (sample) to unit L2 norm, not each column (feature)
from sklearn.preprocessing import normalize
X_norm = normalize(X)
X_norm
Expected text output or note
array([[0.03373051, 0.99938679, 0.00939636],
[0.03677807, 0.99929882, 0.00701687],
[0.03663431, 0.99927136, 0.01070849],
[0.04273569, 0.99902421, 0.01114844],
[0.03269613, 0.99941171, 0.01035377],
[0.03578073, 0.99930473, 0.01047864],
[0.04166607, 0.99910846, 0.00679815],
[0.03760154, 0.99926087, 0.00799033],
[0.02394924, 0.99965885, 0.01042235],
[0.03734811, 0.99926955, 0.00809209],
[0.03860446, 0.99922712, 0.00740667],
[0.03857663, 0.99925171, 0.00280557],
[0.03987689, 0.99920086, 0.00273442],
[0.03625166, 0.99933737, 0.00326265],
[0.03702938, 0.99930922, 0.00314629],
[0.04081543, 0.9991618 , 0.00312918],
[0.04485001, 0.99899092, 0.00237188],
[0.0344086 , 0.99940336, 0.00299688],
[0.03588335, 0.9993514 , 0.00302766],
[0.0429272 , 0.99907421, 0.00282416],
[0.04159418, 0.99911782, 0.00578888],
[0.03851181, 0.99925299, 0.00320932],
[0.03170666, 0.99949138, 0.00341456],
[0.03258609, 0.99946199, 0.00372412],
[0.03676667, 0.99931799, 0.00343156],
[0.03815788, 0.99925951, 0.00494008],
[0.0302522 , 0.99951662, 0.00716499],
[0.02919136, 0.99955243, 0.00654289],
[0.03150453, 0.99946982, 0.00821857],
[0.03588421, 0.99933365, 0.00667613],
[0.04017622, 0.99916503, 0.00742387],
[0.03867749, 0.99923502, 0.00578169],
[0.03694245, 0.99929339, 0.00692671],
[0.04186673, 0.99909237, 0.00785001],
[0.04264389, 0.99908531, 0.00316783],
[0.04081543, 0.9991618 , 0.00312918],
[0.03633488, 0.99933436, 0.00325761],
[0.03387444, 0.99941967, 0.00358525],
[0.03968256, 0.99920686, 0.00330688],
[0.03995181, 0.99919891, 0.00232043],
[0.03358308, 0.99943232, 0.00268665],
[0.03618973, 0.9993405 , 0.00297764],
[0.03538004, 0.99936808, 0.00342007],
[0.04535969, 0.99896802, 0.00232073],
[0.0453984 , 0.9989665 , 0.00221948],
[0.04575138, 0.99894872, 0.0028758 ],
[0.03362357, 0.9994206 , 0.0052837 ],
[0.03049176, 0.99951995, 0.00548852],
[0.0339358 , 0.99940927, 0.00542973],
[0.02911664, 0.99956111, 0.00545937]])
Listing 27. Normalize y
# normalize y by dividing by its maximum, so the largest value becomes 1
y_norm = y/np.amax(y)
y_norm
Expected text output or note
mpg
51 0.857143
52 0.857143
53 0.885714
54 1.000000
55 0.771429
56 0.742857
57 0.685714
58 0.714286
59 0.657143
60 0.571429
61 0.600000
62 0.371429
63 0.400000
64 0.428571
65 0.400000
66 0.485714
67 0.314286
68 0.371429
69 0.342857
70 0.371429
71 0.542857
72 0.428571
73 0.371429
74 0.371429
75 0.400000
76 0.514286
77 0.628571
78 0.600000
79 0.742857
80 0.628571
81 0.800000
82 0.657143
83 0.800000
84 0.771429
85 0.371429
86 0.400000
87 0.371429
88 0.400000
89 0.428571
90 0.342857
91 0.371429
92 0.371429
93 0.400000
94 0.371429
95 0.342857
96 0.371429
97 0.514286
98 0.457143
99 0.514286
100 0.514286
Listing 28. Split the normalized data
# split the normalized data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_norm, test_size = 0.2, random_state = 42)
Listing 29. Train the multivariate model
# train the multivariate model
car_regression_model = LinearRegression().fit(X_train, y_train)
Listing 30. Score of the car model
# score of the car model (R² on the test set)
car_regression_model.score(X_test, y_test)
Expected text output or note
0.8707399834767127
Listing 31. Predict on the test set
# predict on the test set
y_pred = car_regression_model.predict(X_test)
Listing 32. Prediction for a new car
# prediction for a new car with horsepower 100, weight 3000, and acceleration 20
# the input is normalized the same way as the training data; the result is normalized mpg, so multiply by np.amax(y) to get mpg
car_regression_model.predict(normalize([[100, 3000, 20]]))
Expected text output or note
array([[0.60435537]])
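Because the target was divided by its maximum before training, a prediction like the one above is a fraction of that maximum rather than a value in mpg. A minimal sketch of undoing the scaling follows; the mpg values are hypothetical stand-ins, and on this page the divisor would be np.amax(y) for the selected rows.

```python
import numpy as np

# hypothetical mpg targets; on this page the divisor is np.amax(y) of the selected rows
y = np.array([30.0, 35.0, 27.0, 13.0])
y_max = np.amax(y)
y_norm = y / y_max

# a normalized prediction such as the 0.60435537 shown above
pred_norm = 0.60435537

# undo the scaling by multiplying with the same maximum
pred_mpg = pred_norm * y_max
print(pred_mpg)
```

Keeping y_max in a variable at training time avoids recomputing it from a different subset later, which would silently change the scale of the result.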