Python Data Foundations Documentation

A plain documentation-style guide for Python, data handling, visualization, and machine learning basics.

NumPy Arrays, Statistics, Splitting, and Iris Plots

This page covers array indexing, slicing, basic statistics, manual train/test splits, train_test_split, class counts, and simple Iris visualizations.

What you should be able to do
  • Use NumPy slicing with start:stop:step.
  • Split X and y consistently.
  • Use train_test_split and inspect class counts.
  • Create pie charts and line plots.
Reusable patterns
  • Rows represent examples; columns represent features.
  • A split must keep X rows aligned with the correct y labels.
  • np.unique(..., return_counts=True) is useful for class distribution checks.
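The three patterns above can be tried on a tiny made-up array before moving to the Iris data. The following is a minimal sketch; the names toy_X and toy_y and their values are invented for illustration and do not come from the Iris dataset.

```python
import numpy as np

# Hypothetical toy data: 6 examples (rows) with 2 features (columns)
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0],
                  [6.0, 60.0]])
toy_y = np.array([0, 0, 1, 1, 1, 2])  # one label per row of toy_X

print(toy_X.shape)  # (6, 2): 6 examples, 2 features

# Slice X and y with the same slice so rows and labels stay aligned
half_X = toy_X[::2, :]
half_y = toy_y[::2]
print(half_X[:, 0])  # [1. 3. 5.]
print(half_y)        # [0 1 1]

# Class distribution check
classes, counts = np.unique(toy_y, return_counts=True)
print(classes)  # [0 1 2]
print(counts)   # [2 3 1]
```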

Working with NumPy arrays and basic statistics
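The listings on this page use the names data, flower_species, class_labels, and sklearn_dataset without defining them. One plausible setup, assuming the Iris data comes from scikit-learn's load_iris (the page does not show the actual loading step), is sketched below.

```python
from sklearn.datasets import load_iris

# Assumed setup; the original loading code is not shown on this page
sklearn_dataset = load_iris()
data = sklearn_dataset.data                  # shape (150, 4): measurements in cm
flower_species = sklearn_dataset.target      # shape (150,): class ids 0, 1, 2
class_labels = sklearn_dataset.target_names  # species names for chart labels

print(data.shape)
print(flower_species.shape)
print(class_labels)
```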

Listing 1. Mean value of the 1st column; ":" means all rows and "0" means the first column

import statistics
print(statistics.mean(data[:, 0])) # mean value of the 1st column; ":" means all rows and "0" means the first column
print(max(data[:, 1])) # largest value in the 2nd column
print(min(data[:, 3])) # smallest value in the 4th column
Expected text output or note
5.843333333333334
4.4
0.1
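Listing 1 uses the statistics module and the built-in max and min. NumPy arrays also provide their own mean, max, and min methods, which are a common choice for array data. A small sketch on an invented array (sample is a made-up stand-in for data):

```python
import numpy as np

# Hypothetical stand-in for `data`: 3 rows, 2 columns
sample = np.array([[5.0, 3.0],
                   [4.0, 3.5],
                   [6.0, 2.5]])

print(sample[:, 0].mean())  # mean of the first column: 5.0
print(sample[:, 1].max())   # largest value in the second column: 3.5
print(sample[:, 1].min())   # smallest value in the second column: 2.5
print(sample.mean(axis=0))  # per-column means: [5. 3.]
```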

Listing 2. Data for the first flower; "0" selects the first row and all columns

print(data[0, :]) # data for the first flower; "0" selects the first row and all columns
print(flower_species[0]) # species of the first flower
Expected text output or note
[5.1 3.5 1.4 0.2]
0

Listing 3. All data for the 2nd and 5th flower

print(data[[1, 4], :]) # all data for the 2nd and 5th flower
Expected text output or note
[[4.9 3.  1.4 0.2]
 [5.  3.6 1.4 0.2]]

Listing 4. Slicing with a[start:stop:step]

# a[start:stop:step]
print(data[0:5, :]) # all data for the first 5 flowers; selects the first 5 rows and all columns
print(data[::2, :]) # selects every second row and all columns
print(data[1::4, :]) # selects every fourth row starting from index 1 and all columns
Expected text output or note
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[5.1 3.5 1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [5.  3.6 1.4 0.2]
 [4.6 3.4 1.4 0.3]
 [4.4 2.9 1.4 0.2]
 [5.4 3.7 1.5 0.2]
 [4.8 3.  1.4 0.1]
 [5.8 4.  1.2 0.2]
 [5.4 3.9 1.3 0.4]
 [5.7 3.8 1.7 0.3]
 [5.4 3.4 1.7 0.2]
 [4.6 3.6 1.  0.2]
 [4.8 3.4 1.9 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.4 1.4 0.2]
 [4.8 3.1 1.6 0.2]
 [5.2 4.1 1.5 0.1]
 [4.9 3.1 1.5 0.2]
 [5.5 3.5 1.3 0.2]
 [4.4 3.  1.3 0.2]
 [5.  3.5 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.1 3.8 1.9 0.4]
 [5.1 3.8 1.6 0.2]
 [5.3 3.7 1.5 0.2]
 [7.  3.2 4.7 1.4]
 [6.9 3.1 4.9 1.5]
 [6.5 2.8 4.6 1.5]
 [6.3 3.3 4.7 1.6]
 [6.6 2.9 4.6 1.3]
 [5.  2.  3.5 1. ]
 [6.  2.2 4.  1. ]
 [5.6 2.9 3.6 1.3]
 [5.6 3.  4.5 1.5]
 [6.2 2.2 4.5 1.5]
 [5.9 3.2 4.8 1.8]
 [6.3 2.5 4.9 1.5]
 [6.4 2.9 4.3 1.3]
 [6.8 2.8 4.8 1.4]
 [6.  2.9 4.5 1.5]
 [5.5 2.4 3.8 1.1]
 [5.8 2.7 3.9 1.2]
 [5.4 3.  4.5 1.5]
 [6.7 3.1 4.7 1.5]
 [5.6 3.  4.1 1.3]
 [5.5 2.6 4.4 1.2]
 [5.8 2.6 4.  1.2]
 [5.6 2.7 4.2 1.3]
 [5.7 2.9 4.2 1.3]
 [5.1 2.5 3.  1.1]
 [6.3 3.3 6.  2.5]
 [7.1 3.  5.9 2.1]
 [6.5 3.  5.8 2.2]
 [4.9 2.5 4.5 1.7]
 [6.7 2.5 5.8 1.8]
 [6.5 3.2 5.1 2. ]
 [6.8 3.  5.5 2.1]
 [5.8 2.8 5.1 2.4]
 [6.5 3.  5.5 1.8]
 [7.7 2.6 6.9 2.3]
 [6.9 3.2 5.7 2.3]
 [7.7 2.8 6.7 2. ]
 [6.7 3.3 5.7 2.1]
 [6.2 2.8 4.8 1.8]
 [6.4 2.8 5.6 2.1]
 [7.4 2.8 6.1 1.9]
 [6.4 2.8 5.6 2.2]
 [6.1 2.6 5.6 1.4]
 [6.3 3.4 5.6 2.4]
 [6.  3.  4.8 1.8]
 [6.7 3.1 5.6 2.4]
 [5.8 2.7 5.1 1.9]
 [6.7 3.3 5.7 2.5]
 [6.3 2.5 5.  1.9]
 [6.2 3.4 5.4 2.3]]
[[4.9 3.  1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.9 3.1 1.5 0.1]
 [4.3 3.  1.1 0.1]
 [5.1 3.5 1.4 0.3]
 [5.1 3.7 1.5 0.4]
 [5.  3.  1.6 0.2]
 [4.7 3.2 1.6 0.2]
 [5.5 4.2 1.4 0.2]
 [4.9 3.6 1.4 0.1]
 [4.5 2.3 1.3 0.3]
 [4.8 3.  1.4 0.3]
 [5.  3.3 1.4 0.2]
 [5.5 2.3 4.  1.3]
 [4.9 2.4 3.3 1. ]
 [5.9 3.  4.2 1.5]
 [6.7 3.1 4.4 1.4]
 [5.6 2.5 3.9 1.1]
 [6.1 2.8 4.7 1.2]
 [6.7 3.  5.  1.7]
 [5.5 2.4 3.7 1. ]
 [6.  3.4 4.5 1.6]
 [5.5 2.5 4.  1.3]
 [5.  2.3 3.3 1. ]
 [6.2 2.9 4.3 1.3]
 [5.8 2.7 5.1 1.9]
 [7.6 3.  6.6 2.1]
 [7.2 3.6 6.1 2.5]
 [5.7 2.5 5.  2. ]
 [7.7 3.8 6.7 2.2]
 [5.6 2.8 4.9 2. ]
 [7.2 3.2 6.  1.8]
 [7.2 3.  5.8 1.6]
 [6.3 2.8 5.1 1.5]
 [6.4 3.1 5.5 1.8]
 [6.9 3.1 5.1 2.3]
 [6.7 3.  5.2 2.3]
 [5.9 3.  5.1 1.8]]
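The printed slices in Listing 4 are long, so it can help to check only the selected row numbers and how many rows each slice returns. The sketch below applies the same three slices to np.arange(150), which stands in for the 150 row numbers of the Iris data (an assumption based on the output above).

```python
import numpy as np

row_index = np.arange(150)  # stand-in for the row numbers 0..149

first_five = row_index[0:5]     # stop is exclusive, so rows 0..4
every_second = row_index[::2]   # rows 0, 2, 4, ...
every_fourth = row_index[1::4]  # rows 1, 5, 9, ...

print(first_five.tolist())        # [0, 1, 2, 3, 4]
print(len(every_second))          # 75
print(every_fourth[:4].tolist())  # [1, 5, 9, 13]
print(len(every_fourth))          # 38
```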

Splitting into training and test sets

Listing 5. Selects every second index starting from 0, meaning 0, 2, 4, 6, 8, ..., and all columns

train_x1=data[::2, :] # selects every second index starting from 0, meaning 0, 2, 4, 6, 8, ..., and all columns
test_x1=data[1::2, :] # selects every second index starting from 1, meaning 1, 3, 5, 7, 9, ..., and all columns
train_y1=flower_species[::2] # selects every second label starting from 0
test_y1=flower_species[1::2] # selects every second label starting from 1
# X and y must be sliced the same way so that every row keeps its own label
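Two properties of the split in Listing 5 can be checked directly: no row lands in both sets, and each label still belongs to its row. A minimal sketch using row numbers in place of the real data; idx and labels are illustrative stand-ins, and the label layout (three blocks of 50) is an assumption matching the ordered Iris data.

```python
import numpy as np

idx = np.arange(150)               # stand-in for row numbers
labels = np.repeat([0, 1, 2], 50)  # assumed label layout: 50 per class, in order

train_idx, test_idx = idx[::2], idx[1::2]
train_y, test_y = labels[::2], labels[1::2]

# No index appears in both sets, and together they cover every row
print(np.intersect1d(train_idx, test_idx).size)  # 0
print(train_idx.size + test_idx.size)            # 150

# Labels picked by slicing match labels looked up by index
print(np.array_equal(train_y, labels[train_idx]))  # True
print(np.array_equal(test_y, labels[test_idx]))    # True
```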
Practice task. Split the data so that every 3rd item (starting from the first) goes to training and every 4th item, starting from the 10th, goes to testing.

Listing 6. Solution

# solution
train_x1 = data[::3, :]
test_x1 = data[9::4, :]
train_y1 = flower_species[::3]
test_y1 = flower_species[9::4]
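Note that the two slices in Listing 6 are not disjoint: some rows satisfy both rules and end up in both training and testing. That is fine for slicing practice but would leak test data in a real evaluation. The sketch below counts the shared rows, using np.arange(150) as a stand-in for the row numbers (an assumption about the dataset size).

```python
import numpy as np

idx = np.arange(150)
train_idx = idx[::3]  # rows 0, 3, 6, ...
test_idx = idx[9::4]  # rows 9, 13, 17, ...

shared = np.intersect1d(train_idx, test_idx)
print(train_idx.size)   # 50
print(test_idx.size)    # 36
print(shared.tolist())  # [9, 21, 33, 45, 57, 69, 81, 93, 105, 117, 129, 141]
```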

train_test_split and random data splitting

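Listing 7. Random 70/30 split with train_test_split

from sklearn.model_selection import train_test_split
# test_size = 0.3 puts 30% of the rows in the test set; random_state = 42 makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(data, flower_species, test_size = 0.3, random_state = 42)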
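Conceptually, a random split shuffles the row indices once and then cuts them into two groups, using the same indices for X and y. The sketch below shows that idea with NumPy only. It is not scikit-learn's implementation and will not reproduce the exact rows that train_test_split selects; shuffled_split is a hypothetical helper name.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.3, seed=42):
    """Shuffle row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 rows, 2 features; row k is [2k, 2k+1] and its label is k
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = shuffled_split(X, y)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
# Each row [2k, 2k+1] is still paired with label k
print(np.array_equal(X_te[:, 0] // 2, y_te))  # True
```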

Listing 8. Shape of the test variables

# shape of the test variables
print(X_test.shape) # 45 examples and 4 features; input data X is two-dimensional: (number_of_examples, number_of_features)
print(y_test.shape) # 45 labels; y is one-dimensional: (number_of_examples,)
Expected text output or note
(45, 4)
(45,)

Listing 9. Count how many examples of each species are in the test data

# count how many examples of each species are in the test data
import numpy as np
print(np.unique(y_test, return_counts = True)) # class 0 appears 19 times, class 1 appears 13 times, and class 2 appears 13 times; print() is needed outside a notebook
Expected text output or note
(array([0, 1, 2]), array([19, 13, 13]))
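The counts above are uneven (19, 13, 13) because the split is random. If equal class proportions in the test set matter, train_test_split accepts a stratify argument. A hedged sketch, assuming the data is loaded with scikit-learn's load_iris (the loading step is not shown on this page):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # assumed data source
X, y = iris.data, iris.target

# stratify=y keeps the class proportions of y in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
classes, counts = np.unique(y_te, return_counts=True)
print(classes)  # [0 1 2]
print(counts)   # [15 15 15]
```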

Listing 10. Store values in two variables

# store values in two variables
values, flower_counts = np.unique(y_test, return_counts = True) # values contains the unique classes and flower_counts contains the number of appearances of each class

Listing 11. Print the classes and counts that are later used for the pie chart

print(values)
print(flower_counts)
# these values are later used for the pie chart
Expected text output or note
[0 1 2]
[19 13 13]
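The pie chart in the next section shows each class as a share of the test set. Those shares can be computed from the counts directly; the sketch below hard-codes the counts printed above.

```python
import numpy as np

flower_counts = np.array([19, 13, 13])  # counts printed above

shares = flower_counts / flower_counts.sum() * 100  # percent of the test set
print(flower_counts.sum())  # 45
print(np.round(shares, 2))  # [42.22 28.89 28.89]
```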

Visual display of Iris data

Listing 12. Pie chart

# pie chart
from matplotlib import pyplot as plt

plt.title("Example: Pie chart")
plt.pie(
    flower_counts,
    labels = class_labels, # use names instead of numbers; class_labels is not defined on this page (assumed: sklearn_dataset.target_names)
    colors = ["c", "g", "y"],
    autopct = "%1.2f%%" # shows percentages with two decimals; %1.2f means a decimal number with two decimals and %% means the literal percent sign
)
plt.show()
Expected text output or note
<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]
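The autopct string is an ordinary Python percent-format pattern, and Matplotlib fills in each wedge's percentage. Its effect can be previewed without drawing anything (the numbers used here are arbitrary examples):

```python
pattern = "%1.2f%%"

two_decimals = pattern % 42.2222
whole_number = pattern % 5
no_decimals = "%1.0f%%" % 42.2222

print(two_decimals)  # 42.22%
print(whole_number)  # 5.00%
print(no_decimals)   # 42%
```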

Listing 13. Store column names in feature_columns so they can be used in the chart legend

# store column names in feature_columns so they can be used in the chart legend
feature_columns = sklearn_dataset.feature_names
print(feature_columns)
Expected text output or note
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Listing 14. Draw a line chart for all features of the first 20 flowers

# draw a line chart for all features of the first 20 flowers
plt.plot(data[:20, 0], label = feature_columns[0]) # values of the first column, sepal length, for the first 20 flowers
plt.plot(data[:20, 1], label = feature_columns[1]) # second
plt.plot(data[:20, 2], label = feature_columns[2]) # third
plt.plot(data[:20, 3], label = feature_columns[3]) # fourth

plt.legend(loc="upper right") # legend position
plt.show()

# for a scatter plot, use plt.scatter(x, y); unlike plt.plot(), it needs explicit x values, e.g. plt.scatter(range(20), data[:20, 0])
Expected text output or note
<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]
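The four plt.plot calls in Listing 14 differ only in the column index, so they can be written as a loop. A sketch using made-up values and names (values, names, and plot_features are illustrative and not part of the original page):

```python
import numpy as np
from matplotlib import pyplot as plt

def plot_features(values, names, n_rows=20):
    """Draw one line per column for the first n_rows rows."""
    fig, ax = plt.subplots()
    for col, name in enumerate(names):
        ax.plot(values[:n_rows, col], label=name)
    ax.legend(loc="upper right")
    return ax

# Made-up stand-in for data and feature_columns
values = np.arange(100, dtype=float).reshape(25, 4)
names = ["feature a", "feature b", "feature c", "feature d"]

ax = plot_features(values, names)
print(len(ax.lines))  # 4
```

Call plt.show() afterwards to display the figure. With the variables used on this page, the call would be plot_features(data, feature_columns).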
