Python Data Foundations Documentation

A plain documentation-style guide for Python, data handling, visualization, and machine learning basics.

Image Datasets and OpenML

This page covers image datasets, the Olivetti faces dataset, OpenML fetching, vector-to-image reshaping, and subplot grids.

What you should be able to do
  • Understand flat image vectors versus 2D image arrays.
  • Fetch datasets by name or ID.
  • Convert face data to NumPy and reshape a 4096-value vector into 64 x 64 pixels.
  • Display many images with subplot grids.

Reusable patterns
  • A 64 x 64 image becomes a vector of length 4096 when flattened.
  • plt.subplot(rows, columns, position) uses positions starting from 1.
  • enumerate is useful when a loop needs both a counter and each data row.
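
The first pattern can be checked with a small NumPy sketch. The array below is synthetic stand-in data rather than real face pixels, and the variable names are illustrative only.

```python
import numpy as np

# Synthetic 64 x 64 "image": the values 0..4095 laid out row by row (stand-in data).
image_2d = np.arange(64 * 64).reshape(64, 64)

# Flattening turns the 2D grid into one long vector of 4096 values.
flat_vector = image_2d.flatten()
print(flat_vector.shape)  # (4096,)

# Reshaping the vector restores the original 2D layout.
restored = flat_vector.reshape(64, 64)
print(np.array_equal(image_2d, restored))  # True
```

Because the rows are stored one after another, element 64 of the flat vector is the first pixel of the second image row.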

Datasets that contain images: Olivetti faces

Listing 1. Fetch the olivetti_faces dataset from scikit-learn

# fetch the olivetti_faces dataset from scikit-learn
from sklearn import datasets
olivetti_dataset = datasets.fetch_olivetti_faces()
Expected text output or note
downloading Olivetti faces from https://ndownloader.figshare.com/files/5976027 to /root/scikit_learn_data

The difference between load and fetch: load_iris() loads a small dataset that ships with the library, while fetch_olivetti_faces() downloads the dataset from the internet on first use and saves it in a local cache (here /root/scikit_learn_data), so later calls read it from disk.
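
As a rough illustration of the two families, the sketch below loads the bundled iris data and asks scikit-learn where fetched datasets are cached. It assumes scikit-learn is installed; no download happens in this sketch.

```python
from sklearn import datasets

# load_* functions read data files bundled with scikit-learn (no download needed).
iris = datasets.load_iris()
print(iris.data.shape)  # (150, 4)

# fetch_* functions keep their downloads in the scikit-learn data home folder.
# get_data_home() returns that path and creates the folder if it is missing.
cache_dir = datasets.get_data_home()
print(cache_dir)
```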

Listing 2. List the keys of the dataset object

olivetti_dataset.keys()
Expected text output or note
dict_keys(['data', 'images', 'target', 'DESCR'])

Listing 3. Print the flattened pixel data

print(olivetti_dataset.data)
Expected text output or note
[[0.30991736 0.3677686  0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
 [0.45454547 0.47107437 0.5123967  ... 0.15289256 0.15289256 0.15289256]
 [0.3181818  0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]
 ...
 [0.5        0.53305787 0.607438   ... 0.17768595 0.14876033 0.19008264]
 [0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
 [0.5165289  0.46280992 0.28099173 ... 0.35950413 0.3553719  0.38429752]]

Listing 4. Print the dataset description

print(olivetti_dataset.DESCR)
Expected text output or note
.. _olivetti_faces_dataset:

The Olivetti faces dataset
--------------------------

`This dataset contains a set of face images`_ taken between April 1992 and
April 1994 at AT&T Laboratories Cambridge. The
:func:`sklearn.datasets.fetch_olivetti_faces` function is the data
fetching / caching function that downloads the data
archive from AT&T.

.. _This dataset contains a set of face images: https://cam-orl.co.uk/facedatabase.html

As described on the original website:

    There are ten different images of each of 40 distinct subjects. For some
    subjects, the images were taken at different times, varying the lighting,
    facial expressions (open / closed eyes, smiling / not smiling) and facial
    details (glasses / no glasses). All the images were taken against a dark
    homogeneous background with the subjects in an upright, frontal position
    (with tolerance for some side movement).

**Data Set Characteristics:**

=================   =====================
Classes                                40
Samples total                         400
Dimensionality                       4096
Features            real, between 0 and 1
=================   =====================

The image is quantized to 256 grey levels and stored as unsigned 8-bit
integers; the loader will convert these to floating point values on the
interval [0, 1], which are easier to work with for many algorithms.

The "target" for this database is an integer from 0 to 39 indicating the
identity of the person pictured; however, with only 10 examples per class, this
relatively small dataset is more interesting from an unsupervised or
semi-supervised perspective.

The original dataset consisted of 92 x 112, while the version available here
consists of 64x64 images.

When using these images, please give credit to AT&T Laboratories Cambridge.
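
The description mentions that 8-bit grey levels are converted to floating point values in [0, 1]. The sketch below shows one way such a conversion can be done, assuming a simple min-max rescaling. The loader's exact steps are not shown in this guide, and the tiny array is made-up data.

```python
import numpy as np

# Hypothetical 8-bit grey levels for a tiny 2 x 3 "image" (not real face data).
pixels_uint8 = np.array([[0, 64, 128], [192, 255, 32]], dtype=np.uint8)

# Convert to float32 first, then rescale so the darkest value maps to 0
# and the brightest value maps to 1 (min-max scaling, an assumption here).
pixels_float = pixels_uint8.astype(np.float32)
value_range = pixels_float.max() - pixels_float.min()
pixels_float = (pixels_float - pixels_float.min()) / value_range

print(pixels_float.dtype)  # float32
print(pixels_float.min(), pixels_float.max())  # 0.0 1.0
```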

OpenML and dataset fetching

Listing 5. Install the openml package

!pip install openml
Expected text output or note
Collecting openml
  Downloading openml-0.15.1-py3-none-any.whl.metadata (10 kB)
Collecting liac-arff>=2.4.0 (from openml)
  Downloading liac-arff-2.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting xmltodict (from openml)
  Downloading xmltodict-1.0.4-py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from openml) (2.32.4)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.12/dist-packages (from openml) (1.6.1)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.12/dist-packages (from openml) (2.9.0.post0)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from openml) (2.2.2)
Requirement already satisfied: scipy>=0.13.3 in /usr/local/lib/python3.12/dist-packages (from openml) (1.16.3)
Requirement already satisfied: numpy>=1.6.2 in /usr/local/lib/python3.12/dist-packages (from openml) (2.0.2)
Collecting minio (from openml)
  Downloading minio-7.2.20-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: pyarrow in /usr/local/lib/python3.12/dist-packages (from openml) (18.1.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from openml) (4.67.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from openml) (26.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.0.0->openml) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.0.0->openml) (2026.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil->openml) (1.17.0)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=0.18->openml) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=0.18->openml) (3.6.0)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.12/dist-packages (from minio->openml) (25.1.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from minio->openml) (2026.5.20)
Collecting pycryptodome (from minio->openml)
  Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.12/dist-packages (from minio->openml) (4.15.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.12/dist-packages (from minio->openml) (2.5.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->openml) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->openml) (3.15)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.12/dist-packages (from argon2-cffi->minio->openml) (25.1.0)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from argon2-cffi-bindings->argon2-cffi->minio->openml) (2.0.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->minio->openml) (3.0)
Downloading openml-0.15.1-py3-none-any.whl (160 kB)
Downloading minio-7.2.20-py3-none-any.whl (93 kB)
Downloading xmltodict-1.0.4-py3-none-any.whl (13 kB)
Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Building wheels for collected packages: liac-arff
  Building wheel for liac-arff (setup.py) ... done
  Created wheel for liac-arff: filename=liac_arff-2.5.0-py3-none-any.whl size=11717 sha256=1712308773e768b14802e06924a1641aeb578eba6492ef6349c74d591139c5bf
  Stored in directory: /root/.cache/pip/wheels/a9/ac/cf/c2919807a5c623926d217c0a18eb5b457e5c19d242c3b5963a
Successfully built liac-arff
Installing collected packages: xmltodict, pycryptodome, liac-arff, minio, openml
Successfully installed liac-arff-2.5.0 minio-7.2.20 openml-0.15.1 pycryptodome-3.23.0 xmltodict-1.0.4

Listing 6. Remove the cached OpenML data so the next fetch downloads it again

!rm -rf ~/scikit_learn_data/openml

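Listing 7. Fetch a dataset from OpenML by name or numeric ID

# datasets are fetched from OpenML by name (plus version) or by numeric ID
# fetch_openml is part of scikit-learn, so this call does not depend on the separate openml package
from sklearn.datasets import fetch_openml
face_dataset = fetch_openml(name = "olivetti_faces", version = 1)
# a numeric ID also works: fetch_openml(data_id = <id>)
# check the ID on the OpenML website first; data_id = 61 is the iris dataset, not the faces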

Listing 8. Fallback: fetch the faces with scikit-learn because OpenML is not working here

# fetched without OpenML because OpenML is not working here
from sklearn.datasets import fetch_olivetti_faces
face_dataset = fetch_olivetti_faces()

Listing 9. List the keys of the fetched dataset

face_dataset.keys()
Expected text output or note
dict_keys(['data', 'images', 'target', 'DESCR'])

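Objects returned by fetch_openml expose the description through the same DESCR attribute. Evaluating it without print() shows the raw string: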

Listing 10. Show the dataset description as a raw string

face_dataset.DESCR
Expected text output or note
(The same description text as in Listing 4, returned as one string with \n escapes because it is not passed to print().)

Listing 11. Check the shape of the data: 400 images, 4096 pixel values each

face_dataset.data.shape
Expected text output or note
(400, 4096)

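Listing 12. Inspect the data; OpenML may return pandas form, this fallback returns a NumPy array

# with fetch_openml the data may come back as a pandas DataFrame; work with pandas or convert it to a NumPy array
# with fetch_olivetti_faces (used here) the data is already a NumPy array, as the output shows
face_dataset.data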
Expected text output or note
array([[0.30991736, 0.3677686 , 0.41735536, ..., 0.15289256, 0.16115703,
        0.1570248 ],
       [0.45454547, 0.47107437, 0.5123967 , ..., 0.15289256, 0.15289256,
        0.15289256],
       [0.3181818 , 0.40082645, 0.49173555, ..., 0.14049587, 0.14876033,
        0.15289256],
       ...,
       [0.5       , 0.53305787, 0.607438  , ..., 0.17768595, 0.14876033,
        0.19008264],
       [0.21487603, 0.21900827, 0.21900827, ..., 0.57438016, 0.59090906,
        0.60330576],
       [0.5165289 , 0.46280992, 0.28099173, ..., 0.35950413, 0.3553719 ,
        0.38429752]], dtype=float32)
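
When the data does arrive as a pandas DataFrame, it can be converted to a NumPy array in more than one way. The sketch below uses a tiny made-up table. The column names are placeholders, not the real OpenML column names.

```python
import numpy as np
import pandas as pd

# Small synthetic table standing in for a DataFrame of pixel columns.
frame = pd.DataFrame({"pixel_a": [0.1, 0.2], "pixel_b": [0.3, 0.4]})

as_array = frame.to_numpy()   # explicit pandas method
also_array = np.array(frame)  # NumPy constructor, gives the same values

print(type(as_array).__name__, as_array.shape)  # ndarray (2, 2)
print(np.array_equal(as_array, also_array))     # True
```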

Listing 13. Target contains the class label for each image: 40 different people, 10 images per person

face_dataset.target # target contains the class label for each image: 40 different people, 10 images per person
Expected text output or note
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11,
       11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
       13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
       15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
       18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20,
       20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,
       22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
       23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25,
       25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27,
       27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28,
       28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30,
       30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
       34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35,
       35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39,
       39, 39, 39, 39, 39, 39, 39, 39, 39])
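
The label layout can be verified with a few lines of NumPy. The sketch below builds a stand-in array with the same structure as the output above (labels 0 to 39, each repeated 10 times), so it runs without downloading anything. With the real data, y_demo would be face_dataset.target.

```python
import numpy as np

# Stand-in for face_dataset.target: labels 0..39, each repeated 10 times.
y_demo = np.repeat(np.arange(40), 10)

labels, counts = np.unique(y_demo, return_counts=True)
print(len(labels))           # 40
print(set(counts.tolist()))  # {10}

# The labels come in sorted blocks, so person k occupies rows 10*k to 10*k + 9.
person = 3
rows_for_person = np.where(y_demo == person)[0]
print(rows_for_person)  # [30 31 32 33 34 35 36 37 38 39]
```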

Converting data and displaying images

Listing 14. Convert to numpy

# convert to NumPy
import numpy as np
X = np.array(face_dataset.data) # X is the feature matrix; X = input data
y = np.array(face_dataset.target) # y contains the class labels; y = labels
print(X)
print("----------------------------------------------------------------------")
print(y)
Expected text output or note
[[0.30991736 0.3677686  0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
 [0.45454547 0.47107437 0.5123967  ... 0.15289256 0.15289256 0.15289256]
 [0.3181818  0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]
 ...
 [0.5        0.53305787 0.607438   ... 0.17768595 0.14876033 0.19008264]
 [0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
 [0.5165289  0.46280992 0.28099173 ... 0.35950413 0.3553719  0.38429752]]
----------------------------------------------------------------------
[ 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  2  2  2  2
  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9
  9  9  9  9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16
 16 16 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 19 19
 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21
 21 21 21 21 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23
 24 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 26 26 26 26
 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28
 28 28 29 29 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 31 31
 31 31 31 31 31 31 31 31 32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33
 33 33 33 33 34 34 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 35 35
 36 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37 37 37 38 38 38 38
 38 38 38 38 38 38 39 39 39 39 39 39 39 39 39 39]

Listing 15. Get the first image data and reshape it into 64x64 pixels

import matplotlib.pyplot as plt
import numpy as np

# get the first image data and reshape it into 64x64 pixels
first_face_image = X[0].reshape(64, 64)
plt.imshow(first_face_image, cmap = "gray") # for grayscale, reshape the vector into an image and set cmap
# or plt.imshow(first_face_image, cmap = "binary") for an inverted grayscale colormap (low values white, high values black)
# or plt.imshow(first_face_image, cmap = "cool") for the cyan-to-magenta "cool" colormap
Expected text output or note
<matplotlib.image.AxesImage at 0x7d813582bcb0>

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]
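
The same reshape can be applied to every row at once. The sketch below uses random stand-in data with the same shape as X. Passing -1 asks NumPy to work out the number of images from the array size.

```python
import numpy as np

# Random stand-in for the feature matrix X: 400 rows of 4096 values.
rng = np.random.default_rng(0)
X_demo = rng.random((400, 4096), dtype=np.float32)

# -1 lets NumPy infer the first dimension (400) from the total size.
images_demo = X_demo.reshape(-1, 64, 64)
print(images_demo.shape)  # (400, 64, 64)

# Each slice matches the single-row reshape used above.
print(np.array_equal(images_demo[0], X_demo[0].reshape(64, 64)))  # True
```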

Practice task. From the first 100 images, display every fifth image.

Listing 16. For subplot numbering, start from 1 when drawing multiple plots, but be careful with data indexes

i = 1    # for subplot numbering, start from 1 when drawing multiple plots, but be careful with data indexes

for row in X[0:100:5]:
  plt.subplot(4, 5, i) # creates a grid of 4 rows and 5 columns and selects the i-th position
  image = row.reshape(64, 64)
  plt.imshow(image, cmap = "gray")
  i += 1
Expected text output or note
<Figure size 640x480 with 20 Axes>

[visual output omitted; run the code to display the image or chart]
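
The slice and the grid have to agree: X[0:100:5] selects 20 rows, which is exactly 4 x 5 positions. The plain-Python sketch below checks that arithmetic and shows how a 1-based subplot position maps to a grid cell. The helper grid_cell is hypothetical and written only for this illustration.

```python
# Data indexes selected by the slice X[0:100:5] (0-based).
data_indexes = list(range(0, 100, 5))
print(len(data_indexes))  # 20
print(data_indexes[:4])   # [0, 5, 10, 15]

rows, columns = 4, 5
print(len(data_indexes) == rows * columns)  # True


def grid_cell(position, columns):
    """Return the 0-based (row, column) of a 1-based subplot position."""
    return divmod(position - 1, columns)


# Subplot positions fill the grid row by row.
print(grid_cell(1, columns))   # (0, 0)
print(grid_cell(6, columns))   # (1, 0)
print(grid_cell(20, columns))  # (3, 4)
```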

Listing 17. Second method

# second method:
for i, data in enumerate(X[0:100:5], start=1):  # if start is not defined, it starts from 0
  plt.subplot(4, 5, i)
  image = data.reshape(64, 64)
  plt.imshow(image, cmap = "gray")
# enumerate is useful when a loop needs both the index and the value
Expected text output or note
<Figure size 640x480 with 20 Axes>

[visual output omitted; run the code to display the image or chart]
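
The effect of the start argument is easy to see on a short list. The sketch below uses placeholder strings instead of image rows.

```python
faces = ["face_a", "face_b", "face_c"]  # placeholder items standing in for image rows

# Default counting starts at 0, which suits data indexes.
print(list(enumerate(faces)))
# [(0, 'face_a'), (1, 'face_b'), (2, 'face_c')]

# start=1 suits subplot positions, which begin at 1.
print(list(enumerate(faces, start=1)))
# [(1, 'face_a'), (2, 'face_b'), (3, 'face_c')]

# enumerate replaces the manual counter pattern (i = 1 ... i += 1).
manual = []
i = 1
for item in faces:
    manual.append((i, item))
    i += 1
print(manual == list(enumerate(faces, start=1)))  # True
```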

Practice task. Display images from the 11th to the 29th, every second image, in 2 rows and 5 columns.

Listing 18. Display every second image from the 11th to the 29th in a 2 x 5 grid

i = 1

for row in X[10:30:2]:
  plt.subplot(2, 5, i)
  image = row.reshape(64, 64)
  plt.imshow(image, cmap = "gray")
  i += 1
Expected text output or note
<Figure size 640x480 with 10 Axes>

[visual output omitted; run the code to display the image or chart]
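
An alternative sketch of the same grid uses plt.subplots, which returns all axes at once so no position counter is needed. This is not the method used in the listings above. It runs on random stand-in data, and the titles and figure size are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the feature matrix X (400 rows of 4096 values).
rng = np.random.default_rng(0)
X_demo = rng.random((400, 4096))

# Create the 2 x 5 grid in one call; axes is a 2D array of Axes objects.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))

# Pair each axis with a data index: 10, 12, ..., 28 (the 11th to the 29th image).
for ax, index in zip(axes.flat, range(10, 30, 2)):
    ax.imshow(X_demo[index].reshape(64, 64), cmap="gray")
    ax.set_title(f"row {index}")
    ax.axis("off")  # hide the tick marks around each image

fig.tight_layout()
```

With the real data, X_demo would simply be replaced by X.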
