Python Data Foundations Documentation

A plain documentation-style guide for Python, data handling, visualization, and machine learning basics.

Linear Regression

This page covers simple linear regression with height and weight, train/test splitting, prediction, coefficients, and multivariate regression on car data.

What you should be able to do
  • Separate input features X from target values y.
  • Train LinearRegression and predict on test data.
  • Interpret intercept and coefficient values.
  • Normalize features before multivariate modeling when scales differ.
Reusable patterns
  • For one input feature, scikit-learn expects X in 2D shape, so reshape(-1, 1) is required.
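  • The value returned by score() is R², the coefficient of determination; a value near 1 means the model explains most of the variance in the tested data.
  • Feature scaling reduces scale differences between features. Note that sklearn.preprocessing.normalize rescales each row (sample) to unit norm rather than each column (feature).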
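
The reshape(-1, 1) pattern from the list above can be sketched as follows. The heights array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# hypothetical heights in cm, stored as a 1D array of shape (4,)
heights = np.array([170.0, 182.5, 165.0, 190.0])

# -1 lets NumPy infer the number of rows; 1 asks for a single column
X = heights.reshape(-1, 1)

print(heights.shape)  # (4,)
print(X.shape)        # (4, 1)
```

scikit-learn estimators read the first axis as samples and the second as features, so a single feature still needs its own column.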

Linear regression

In regression, y is a continuous numeric value.

Listing 1. Import libraries

# import libraries
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

Listing 2. Fetch the dataset

# fetch the dataset
dataset = datasets.fetch_openml('bodyfat', version = 1)

Listing 3. Display dataset keys

# display dataset keys
dataset.keys()
Expected text output or note
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])

Listing 4. Display feature names, which are columns in the dataset

# display feature names, which are columns in the dataset
# this example uses Height and Weight
dataset.feature_names
Expected text output or note
['Density',
 'Age',
 'Weight',
 'Height',
 'Neck',
 'Chest',
 'Abdomen',
 'Hip',
 'Thigh',
 'Knee',
 'Ankle',
 'Biceps',
 'Forearm',
 'Wrist']

Listing 5. Define data for linear regression

# define data for linear regression
data = dataset.data

Listing 6. First few rows of data

# first few rows of data
data.head()
Expected text output or note
   Density  Age  Weight  Height  Neck  Chest  Abdomen    Hip  Thigh  Knee  \
0   1.0708   23  154.25   67.75  36.2   93.1     85.2   94.5   59.0  37.3   
1   1.0853   22  173.25   72.25  38.5   93.6     83.0   98.7   58.7  37.3   
2   1.0414   22  154.00   66.25  34.0   95.8     87.9   99.2   59.6  38.9   
3   1.0751   26  184.75   72.25  37.4  101.8     86.4  101.2   60.1  37.3   
4   1.0340   24  184.25   71.25  34.4   97.3    100.0  101.9   63.2  42.2   

   Ankle  Biceps  Forearm  Wrist  
0   21.9    32.0     27.4   17.1  
1   23.4    30.5     28.9   18.2  
2   24.0    28.8     25.2   16.6  
3   22.8    32.4     29.4   18.2  
4   24.0    32.2     27.7   17.7

Listing 7. Correlation matrix

# correlation matrix
corr_matrix = data.corr()
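
To see what the entries of a correlation matrix mean, here is a minimal sketch on a tiny hand-made DataFrame. The columns a, b, and c are hypothetical and unrelated to the bodyfat data.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * a, so perfectly positively correlated
    "c": [4.0, 3.0, 2.0, 1.0],  # falls as a rises, so perfectly negatively correlated
})

# DataFrame.corr() computes pairwise Pearson correlation by default
toy_corr = toy.corr()

print(toy_corr.loc["a", "b"])  # close to 1
print(toy_corr.loc["a", "c"])  # close to -1
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.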

Listing 8. Display the correlation matrix with a heatmap

# display the correlation matrix with a heatmap
import seaborn as sns
plt.figure(figsize = (10, 8))
sns.heatmap(corr_matrix, annot = True, cmap = "coolwarm", fmt = ".2f")
plt.title("Correlation matrix")
plt.show()
Expected text output or note
<Figure size 1000x800 with 2 Axes>

[visual output omitted; run the code to display the image or chart]

Listing 9. Clean the data

# clean the data
data = data.dropna(subset = ["Height", "Weight"]) # removes rows that have missing values in the columns we use
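
The effect of the subset argument can be checked on a small example. The frame below is a made-up sketch that reuses three column names from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Height": [67.75, np.nan, 66.25],
    "Weight": [154.25, 173.25, 154.00],
    "Neck": [36.2, 38.5, np.nan],
})

# only missing values in Height or Weight cause a row to be dropped
cleaned = toy.dropna(subset=["Height", "Weight"])

print(cleaned.index.tolist())  # [0, 2]
```

Row 1 is removed because its Height is missing. Row 2 is kept even though Neck is missing, because Neck is not listed in subset.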

Listing 10. Convert height and weight to the metric system

# convert height and weight to the metric system
X = data["Height"] * 2.54 # X, or Height, is the input feature; 1 inch = 2.54 cm
y = data["Weight"] * 0.46 # y, or Weight, is the target value; 0.46 is a rounded factor (1 lb is about 0.4536 kg)

Listing 11. Initial visualization before model training

# initial visualization before model training
plt.scatter(X, y)
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Splitting into training and test sets

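Listing 12. Convert X to 2D shape and split into training and test sets

X = X.values.reshape(-1, 1) # convert X to 2D shape: one row per sample, one column per feature

# hold out 20% of the rows for testing; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)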
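
A minimal sketch of what test_size and random_state do, assuming ten synthetic samples instead of the bodyfat data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# ten synthetic samples with one feature each
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)

# repeating the call with the same random_state gives the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test2))  # True
```

With test_size = 0.2, two of the ten rows are held out for testing and the remaining eight are used for training.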

Listing 13. Visualize the training set

# visualize the training set
print("Zoomed in")

plt.scatter(X_train, y_train)

plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")

# if outliers exist, the plot may be hard to read, so zooming helps show the main part of the data
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.show()
Expected text output or note
Zoomed in

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Listing 14. Create and train the linear regression model

# create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
Expected text output or note
LinearRegression()

Listing 15. Predict on the test set

# predict on the test set
y_pred = model.predict(X_test) # here the model predicts weights for heights from the test set; y_pred contains predicted weights

Model evaluation is based on comparing y_test and y_pred.
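
One way to make that comparison concrete is with standard metrics. The sketch below uses small hand-made arrays; y_true and y_hat are illustrative stand-ins, not values from the bodyfat data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# hypothetical actual and predicted weights in kg
y_true = np.array([70.0, 80.0, 90.0, 100.0])
y_hat = np.array([72.0, 78.0, 91.0, 99.0])

# mean absolute error: average size of the prediction errors, in kg
mae = mean_absolute_error(y_true, y_hat)

# R^2: 1 - (sum of squared errors) / (total squared deviation from the mean)
r2 = r2_score(y_true, y_hat)

print(mae)           # 1.5
print(round(r2, 2))  # 0.98
```

In the notebook, the same two calls would take y_test and y_pred as arguments.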

Listing 16. Show actual values and the regression line for comparison

# show actual values and the regression line for comparison
plt.scatter(X_test, y_test, label = "Actual values")
plt.plot(X_test, y_pred, color = "r", linewidth = 3, label = "Linear regression")
plt.xlabel("Height in cm")
plt.ylabel("Weight in kg")
plt.xlim(150, 205)
plt.ylim(40, 130)
plt.legend()
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Model evaluation

Listing 17. Print the regression coefficients

# print the regression coefficients
print(f"Slope coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")
Expected text output or note
Slope coefficient: 0.38299146150326047
Intercept: 13.778566972313058

Listing 18. Predict for an unknown value

# predict for an unknown value
model.predict([[196]]) # predict the weight for a person who is 196 cm tall
Expected text output or note
array([88.84489343])
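
The prediction can be reproduced by hand from the slope and intercept printed in Listing 17, since a simple linear model computes intercept + slope * x. A short check using those printed values:

```python
# values copied from the output of Listing 17
slope = 0.38299146150326047
intercept = 13.778566972313058

height_cm = 196
predicted_kg = intercept + slope * height_cm

print(round(predicted_kg, 4))  # 88.8449
```

Read this way, the slope says that each additional centimeter of height adds about 0.38 kg to the predicted weight. The intercept is the predicted weight at a height of 0 cm, which has no physical meaning on its own.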

Multivariate linear regression: regression with more than one input feature

Listing 19. Fetch the cars dataset

# fetch the cars dataset
dataset = datasets.fetch_openml("cars")

Listing 20. Display dataset keys

# display dataset keys
dataset.keys()
Expected text output or note
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR'])

Listing 21. Display the data

# display the data
dataset.data
Expected text output or note
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0   
1    15.0          8         350.0       165.0    3693          11.5   
2    18.0          8         318.0       150.0    3436          11.0   
3    16.0          8         304.0       150.0    3433          12.0   
4    17.0          8         302.0       140.0    3449          10.5   
..    ...        ...           ...         ...     ...           ...   
393  27.0          4         140.0        86.0    2790          15.6   
394  44.0          4          97.0        52.0    2130          24.6   
395  32.0          4         135.0        84.0    2295          11.6   
396  28.0          4         120.0        79.0    2625          18.6   
397  31.0          4         119.0        82.0    2720          19.4   

     model.year  
0            70  
1            70  
2            70  
3            70  
4            70  
..          ...  
393          82  
394          82  
395          82  
396          82  
397          82  

[396 rows x 7 columns]

Listing 22. Store the data in a variable

# store the data in a variable
data = dataset.data

Listing 23. Remove incomplete data

# remove incomplete data
data = data.dropna(subset = ["mpg", "horsepower", "weight", "acceleration"]) # these are the variables used in the model

Listing 24. Select features and the target variable

# select features and the target variable
# this does not use the entire dataset
X = data.iloc[50:100, [3, 4, 5]] # selects the rows at positions 50 to 99 and the columns at positions 3, 4, and 5: horsepower, weight, acceleration
y = data.iloc[50:100, [0]] # selects the rows at positions 50 to 99 and column 0: mpg
# to use the full dataset, write:
# X = data[['horsepower', 'weight', 'acceleration']]
# y = data[['mpg']]
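
Because iloc selects by position rather than by index label, the labels in the result can differ from the positions requested. A likely reason the output of Listing 25 starts at label 51 is that an earlier row is absent from the frame. A minimal sketch with a made-up frame whose label 2 is missing:

```python
import pandas as pd

# hypothetical frame: the row with label 2 has been removed
toy = pd.DataFrame({"mpg": [18.0, 15.0, 16.0, 17.0]}, index=[0, 1, 3, 4])

# positions 2 and 3 hold the rows labeled 3 and 4
by_position = toy.iloc[2:4]

print(by_position.index.tolist())  # [3, 4]
```

Calling reset_index(drop=True) after dropna would make labels and positions agree again, if that is preferred.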

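Features often have different scales (here weight is in the thousands while acceleration is in the tens), so the data is normalized before training. First inspect the selected rows: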

Listing 25. Display the first few rows

# display the first few rows
X.head()
Expected text output or note
    horsepower  weight  acceleration
51        70.0    2074          19.5
52        76.0    2065          14.5
53        65.0    1773          19.0
54        69.0    1613          18.0
55        60.0    1834          19.0

Listing 26. Normalize x

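# normalize X
# note: normalize rescales each row (sample) to unit L2 norm; it does not rescale each column (feature)
from sklearn.preprocessing import normalize
X_norm = normalize(X)
X_norm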
Expected text output or note
array([[0.03373051, 0.99938679, 0.00939636],
       [0.03677807, 0.99929882, 0.00701687],
       [0.03663431, 0.99927136, 0.01070849],
       [0.04273569, 0.99902421, 0.01114844],
       [0.03269613, 0.99941171, 0.01035377],
       [0.03578073, 0.99930473, 0.01047864],
       [0.04166607, 0.99910846, 0.00679815],
       [0.03760154, 0.99926087, 0.00799033],
       [0.02394924, 0.99965885, 0.01042235],
       [0.03734811, 0.99926955, 0.00809209],
       [0.03860446, 0.99922712, 0.00740667],
       [0.03857663, 0.99925171, 0.00280557],
       [0.03987689, 0.99920086, 0.00273442],
       [0.03625166, 0.99933737, 0.00326265],
       [0.03702938, 0.99930922, 0.00314629],
       [0.04081543, 0.9991618 , 0.00312918],
       [0.04485001, 0.99899092, 0.00237188],
       [0.0344086 , 0.99940336, 0.00299688],
       [0.03588335, 0.9993514 , 0.00302766],
       [0.0429272 , 0.99907421, 0.00282416],
       [0.04159418, 0.99911782, 0.00578888],
       [0.03851181, 0.99925299, 0.00320932],
       [0.03170666, 0.99949138, 0.00341456],
       [0.03258609, 0.99946199, 0.00372412],
       [0.03676667, 0.99931799, 0.00343156],
       [0.03815788, 0.99925951, 0.00494008],
       [0.0302522 , 0.99951662, 0.00716499],
       [0.02919136, 0.99955243, 0.00654289],
       [0.03150453, 0.99946982, 0.00821857],
       [0.03588421, 0.99933365, 0.00667613],
       [0.04017622, 0.99916503, 0.00742387],
       [0.03867749, 0.99923502, 0.00578169],
       [0.03694245, 0.99929339, 0.00692671],
       [0.04186673, 0.99909237, 0.00785001],
       [0.04264389, 0.99908531, 0.00316783],
       [0.04081543, 0.9991618 , 0.00312918],
       [0.03633488, 0.99933436, 0.00325761],
       [0.03387444, 0.99941967, 0.00358525],
       [0.03968256, 0.99920686, 0.00330688],
       [0.03995181, 0.99919891, 0.00232043],
       [0.03358308, 0.99943232, 0.00268665],
       [0.03618973, 0.9993405 , 0.00297764],
       [0.03538004, 0.99936808, 0.00342007],
       [0.04535969, 0.99896802, 0.00232073],
       [0.0453984 , 0.9989665 , 0.00221948],
       [0.04575138, 0.99894872, 0.0028758 ],
       [0.03362357, 0.9994206 , 0.0052837 ],
       [0.03049176, 0.99951995, 0.00548852],
       [0.0339358 , 0.99940927, 0.00542973],
       [0.02911664, 0.99956111, 0.00545937]])
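
In the output above, the weight column stays close to 1 in every row because normalize divides each row by its own L2 norm, and weight dominates that norm. The sketch below contrasts this with per-column scaling. StandardScaler is an alternative that the original listings do not use; it is shown only for comparison, on the first three rows from Listing 25.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# horsepower, weight, acceleration for the first three rows of Listing 25
X = np.array([
    [70.0, 2074.0, 19.5],
    [76.0, 2065.0, 14.5],
    [65.0, 1773.0, 19.0],
])

# row-wise: every sample is divided by its own L2 norm
row_scaled = normalize(X)
print(np.linalg.norm(row_scaled, axis=1))  # each value close to 1

# column-wise alternative: every feature gets mean 0 and unit variance
col_scaled = StandardScaler().fit_transform(X)
print(col_scaled.mean(axis=0))  # each value close to 0
print(col_scaled.std(axis=0))   # each value close to 1
```

When the goal is to put features on comparable scales, a column-wise scaler fitted on the training set only is the more common choice.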

Listing 27. Normalize y

# normalize y by dividing by its maximum, so the largest value becomes 1
y_norm = y/np.amax(y)
y_norm
Expected text output or note
          mpg
51   0.857143
52   0.857143
53   0.885714
54   1.000000
55   0.771429
56   0.742857
57   0.685714
58   0.714286
59   0.657143
60   0.571429
61   0.600000
62   0.371429
63   0.400000
64   0.428571
65   0.400000
66   0.485714
67   0.314286
68   0.371429
69   0.342857
70   0.371429
71   0.542857
72   0.428571
73   0.371429
74   0.371429
75   0.400000
76   0.514286
77   0.628571
78   0.600000
79   0.742857
80   0.628571
81   0.800000
82   0.657143
83   0.800000
84   0.771429
85   0.371429
86   0.400000
87   0.371429
88   0.400000
89   0.428571
90   0.342857
91   0.371429
92   0.371429
93   0.400000
94   0.371429
95   0.342857
96   0.371429
97   0.514286
98   0.457143
99   0.514286
100  0.514286

Listing 28. Split the normalized data

# split the normalized data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_norm, test_size = 0.2, random_state = 42)

Listing 29. Train the multivariate model

# train the multivariate model
car_regression_model = LinearRegression().fit(X_train, y_train)

Listing 30. Score of the car model

# score of the car model
car_regression_model.score(X_test, y_test)
Expected text output or note
0.8707399834767127

Listing 31. Predict on the test set

# predict on the test set
y_pred = car_regression_model.predict(X_test)

Listing 32. Prediction for a new car

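# prediction for a new car: horsepower 100, weight 3000, acceleration 20
# the result is on the normalized mpg scale, not in mpg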
car_regression_model.predict(normalize([[100, 3000, 20]]))
Expected text output or note
array([[0.60435537]])
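
Because y was divided by its maximum before training, the prediction above is a fraction of that maximum. Multiplying by the same maximum converts it back to mpg. In the sketch below, y_max = 35.0 is an assumed value for illustration; in the notebook it would be read from the data, for example with y.to_numpy().max().

```python
import numpy as np

# assumed maximum mpg of the selected rows (hypothetical, for illustration only)
y_max = 35.0

# normalized prediction copied from the output above
pred_norm = np.array([[0.60435537]])

# undo the division by the maximum to get back to mpg
pred_mpg = pred_norm * y_max

print(round(float(pred_mpg[0, 0]), 2))  # 21.15
```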
