Image Datasets and OpenML

This page covers image datasets, the Olivetti faces dataset, OpenML fetching, vector-to-image reshaping, and subplot grids.

What you should be able to do

Understand flat image vectors versus 2D image arrays.
Fetch datasets by name or ID.
Convert face data to NumPy and reshape a 4096-value vector into 64 x 64 pixels.
Display many images with subplot grids.

Reusable patterns

A 64 x 64 image becomes a vector of length 4096 when flattened.
plt.subplot(rows, columns, position) uses positions starting from 1.
enumerate is useful when a loop needs both a counter and each data row.
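
The flattening and enumerate patterns above can be checked on a small synthetic array, with no download. This is a minimal sketch: `fake_image` is a made-up stand-in for one face, not data from the Olivetti set.

```python
import numpy as np

# Hypothetical stand-in for one 64 x 64 grayscale image (values 0..4095).
fake_image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)

# Flattening concatenates the rows into a single vector of length 4096.
flat_vector = fake_image.reshape(-1)
print(flat_vector.shape)  # (4096,)

# Reshaping with the original height and width restores the 2D image.
restored_image = flat_vector.reshape(64, 64)
print(np.array_equal(restored_image, fake_image))  # True

# enumerate yields a counter together with each row;
# start=1 makes the counter usable as a subplot position.
for position, row in enumerate(fake_image[:3], start=1):
    print(position, row.shape)
```

Element 64 of the flat vector is the first pixel of the second row, because the rows are laid end to end.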

Datasets that contain images: Olivetti faces

Listing 1. Fetch the olivetti_faces dataset from scikit-learn

# fetch the olivetti_faces dataset from scikit-learn
from sklearn import datasets
olivetti_dataset = datasets.fetch_olivetti_faces()

Expected text output or note

downloading Olivetti faces from https://ndownloader.figshare.com/files/5976027 to /root/scikit_learn_data

The difference between load and fetch: load_iris() reads a small dataset that ships with the library, while fetch_olivetti_faces() downloads the dataset from the internet on first use and stores it in a local cache (here /root/scikit_learn_data), so later calls read it from disk.

Listing 2. Inspect the available Olivetti dataset fields

olivetti_dataset.keys()

Expected text output or note

dict_keys(['data', 'images', 'target', 'DESCR'])

data: images converted into flat vectors; one image has shape (4096,) because 64 x 64 = 4096.
images: images in their original 2D shape, 64 x 64.
target: person labels.
DESCR: dataset description.
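
The relationship between the data and images fields can be sketched on synthetic arrays. The names `fake_images` and `fake_data` are hypothetical stand-ins, and the sketch assumes that each row of data is the row-by-row flattening of the matching 64 x 64 image.

```python
import numpy as np

# Hypothetical stand-in for the `images` field: 5 images of 64 x 64 pixels.
rng = np.random.default_rng(0)
fake_images = rng.random((5, 64, 64), dtype=np.float32)

# Flatten every image into one row, giving the layout of the `data` field.
fake_data = fake_images.reshape(len(fake_images), -1)
print(fake_data.shape)  # (5, 4096)

# Row i of the flat matrix holds the same pixels as image i.
print(np.array_equal(fake_data[2].reshape(64, 64), fake_images[2]))  # True
```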

Listing 3. Print the flattened face pixel matrix

print(olivetti_dataset.data)

Expected text output or note

[[0.30991736 0.3677686  0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
 [0.45454547 0.47107437 0.5123967  ... 0.15289256 0.15289256 0.15289256]
 [0.3181818  0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]
 ...
 [0.5        0.53305787 0.607438   ... 0.17768595 0.14876033 0.19008264]
 [0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
 [0.5165289  0.46280992 0.28099173 ... 0.35950413 0.3553719  0.38429752]]

Listing 4. Read the Olivetti dataset description

print(olivetti_dataset.DESCR)

Expected text output or note

.. _olivetti_faces_dataset:

The Olivetti faces dataset
--------------------------

`This dataset contains a set of face images`_ taken between April 1992 and
April 1994 at AT&T Laboratories Cambridge. The
:func:`sklearn.datasets.fetch_olivetti_faces` function is the data
fetching / caching function that downloads the data
archive from AT&T.

.. _This dataset contains a set of face images: https://cam-orl.co.uk/facedatabase.html

As described on the original website:

    There are ten different images of each of 40 distinct subjects. For some
    subjects, the images were taken at different times, varying the lighting,
    facial expressions (open / closed eyes, smiling / not smiling) and facial
    details (glasses / no glasses). All the images were taken against a dark
    homogeneous background with the subjects in an upright, frontal position
    (with tolerance for some side movement).

**Data Set Characteristics:**

=================   =====================
Classes                                40
Samples total                         400
Dimensionality                       4096
Features            real, between 0 and 1
=================   =====================

The image is quantized to 256 grey levels and stored as unsigned 8-bit
integers; the loader will convert these to floating point values on the
interval [0, 1], which are easier to work with for many algorithms.

The "target" for this database is an integer from 0 to 39 indicating the
identity of the person pictured; however, with only 10 examples per class, this
relatively small dataset is more interesting from an unsupervised or
semi-supervised perspective.

The original dataset consisted of 92 x 112, while the version available here
consists of 64x64 images.

When using these images, please give credit to AT&T Laboratories Cambridge.

OpenML and dataset fetching

Listing 5. Install the OpenML helper package in Colab

!pip install openml

Expected text output or note

Collecting openml
  Downloading openml-0.15.1-py3-none-any.whl.metadata (10 kB)
Collecting liac-arff>=2.4.0 (from openml)
  Downloading liac-arff-2.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting xmltodict (from openml)
  Downloading xmltodict-1.0.4-py3-none-any.whl.metadata (14 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from openml) (2.32.4)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.12/dist-packages (from openml) (1.6.1)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.12/dist-packages (from openml) (2.9.0.post0)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from openml) (2.2.2)
Requirement already satisfied: scipy>=0.13.3 in /usr/local/lib/python3.12/dist-packages (from openml) (1.16.3)
Requirement already satisfied: numpy>=1.6.2 in /usr/local/lib/python3.12/dist-packages (from openml) (2.0.2)
Collecting minio (from openml)
  Downloading minio-7.2.20-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: pyarrow in /usr/local/lib/python3.12/dist-packages (from openml) (18.1.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from openml) (4.67.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from openml) (26.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.0.0->openml) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.0.0->openml) (2026.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil->openml) (1.17.0)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=0.18->openml) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn>=0.18->openml) (3.6.0)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.12/dist-packages (from minio->openml) (25.1.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from minio->openml) (2026.5.20)
Collecting pycryptodome (from minio->openml)
  Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.12/dist-packages (from minio->openml) (4.15.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.12/dist-packages (from minio->openml) (2.5.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->openml) (3.4.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->openml) (3.15)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.12/dist-packages (from argon2-cffi->minio->openml) (25.1.0)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from argon2-cffi-bindings->argon2-cffi->minio->openml) (2.0.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->minio->openml) (3.0)
Downloading openml-0.15.1-py3-none-any.whl (160 kB)
Downloading minio-7.2.20-py3-none-any.whl (93 kB)
Downloading xmltodict-1.0.4-py3-none-any.whl (13 kB)
Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Building wheels for collected packages: liac-arff
  Building wheel for liac-arff (setup.py) ... done
  Created wheel for liac-arff: filename=liac_arff-2.5.0-py3-none-any.whl size=11717 sha256=1712308773e768b14802e06924a1641aeb578eba6492ef6349c74d591139c5bf
  Stored in directory: /root/.cache/pip/wheels/a9/ac/cf/c2919807a5c623926d217c0a18eb5b457e5c19d242c3b5963a
Successfully built liac-arff
Installing collected packages: xmltodict, pycryptodome, liac-arff, minio, openml
Successfully installed liac-arff-2.5.0 minio-7.2.20 openml-0.15.1 pycryptodome-3.23.0 xmltodict-1.0.4

Listing 6. Clear the local OpenML cache before refetching

!rm -rf ~/scikit_learn_data/openml

Listing 7. Fetch a dataset from OpenML by name or numeric ID

# datasets are fetched from OpenML by name (plus version) or by numeric ID
import openml  # installed above; scikit-learn's fetch_openml does not require it
from sklearn.datasets import fetch_openml
face_dataset = fetch_openml(name = "olivetti_faces", version = 1)
# or pass data_id = <numeric OpenML ID of the faces dataset>; note that data_id = 61 is the iris dataset, not the faces

Listing 8. Fallback: fetch the faces with scikit-learn's own loader when OpenML is unavailable

# fallback used here because the OpenML fetch was not working
from sklearn.datasets import fetch_olivetti_faces
face_dataset = fetch_olivetti_faces()

Listing 9. Inspect the fields of the loaded face dataset

face_dataset.keys()

Expected text output or note

dict_keys(['data', 'images', 'target', 'DESCR'])

The keys above come from fetch_olivetti_faces(). An object returned by fetch_openml() has these fields instead:

data: input data.
target: class labels.
frame: tabular data, if available.
categories: categorical information.
feature_names: feature names.
target_names: target-variable names.
DESCR: description.
details: dataset metadata.
url: OpenML link.

Listing 10. Read the face dataset description

face_dataset.DESCR

Expected text output or note

'.. _olivetti_faces_dataset:\n\nThe Olivetti faces dataset\n--------------------------\n\n`This dataset contains a set of face images`_ taken between April 1992 and\nApril 1994 at AT&T Laboratories Cambridge. The\n:func:`sklearn.datasets.fetch_olivetti_faces` function is the data\nfetching / caching function that downloads the data\narchive from AT&T.\n\n.. _This dataset contains a set of face images: https://cam-orl.co.uk/facedatabase.html\n\nAs described on the original website:\n\n    There are ten different images of each of 40 distinct subjects. For some\n    subjects, the images were taken at different times, varying the lighting,\n    facial expressions (open / closed eyes, smiling / not smiling) and facial\n    details (glasses / no glasses). All the images were taken against a dark\n    homogeneous background with the subjects in an upright, frontal position\n    (with tolerance for some side movement).\n\n**Data Set Characteristics:**\n\n=================   =====================\nClasses                                40\nSamples total                         400\nDimensionality                       4096\nFeatures            real, between 0 and 1\n=================   =====================\n\nThe image is quantized to 256 grey levels and stored as unsigned 8-bit\nintegers; the loader will convert these to floating point values on the\ninterval [0, 1], which are easier to work with for many algorithms.\n\nThe "target" for this database is an integer from 0 to 39 indicating the\nidentity of the person pictured; however, with only 10 examples per class, this\nrelatively small dataset is more interesting from an unsupervised or\nsemi-supervised perspective.\n\nThe original dataset consisted of 92 x 112, while the version available here\nconsists of 64x64 images.\n\nWhen using these images, please give credit to AT&T Laboratories Cambridge.\n'

Listing 11. Check the flattened image matrix shape

face_dataset.data.shape

Expected text output or note

(400, 4096)

Listing 12. View the data; fetch_openml may return a pandas DataFrame, which can be converted to a NumPy array

# with fetch_openml the data may arrive as a pandas DataFrame; here it is already a NumPy array
face_dataset.data

Expected text output or note

array([[0.30991736, 0.3677686 , 0.41735536, ..., 0.15289256, 0.16115703,
        0.1570248 ],
       [0.45454547, 0.47107437, 0.5123967 , ..., 0.15289256, 0.15289256,
        0.15289256],
       [0.3181818 , 0.40082645, 0.49173555, ..., 0.14049587, 0.14876033,
        0.15289256],
       ...,
       [0.5       , 0.53305787, 0.607438  , ..., 0.17768595, 0.14876033,
        0.19008264],
       [0.21487603, 0.21900827, 0.21900827, ..., 0.57438016, 0.59090906,
        0.60330576],
       [0.5165289 , 0.46280992, 0.28099173, ..., 0.35950413, 0.3553719 ,
        0.38429752]], dtype=float32)

Listing 13. Target contains the class label for each image: 40 different people, 10 images per person

face_dataset.target # target contains the class label for each image: 40 different people, 10 images per person

Expected text output or note

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11,
       11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
       13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
       15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
       18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20,
       20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,
       22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
       23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25,
       25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27,
       27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 28,
       28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30,
       30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32,
       32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
       34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35,
       35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39,
       39, 39, 39, 39, 39, 39, 39, 39, 39])
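
The label layout shown above (40 people, 10 images each, stored in consecutive blocks) can be verified with np.unique. The sketch below uses a synthetic `fake_target` array built to match that layout, so it runs without the dataset; with the real data, `face_dataset.target` would take its place.

```python
import numpy as np

# Synthetic labels with the same layout as the output above:
# 0..39, each repeated 10 times in a row.
fake_target = np.repeat(np.arange(40), 10)

# Count how many images carry each label.
labels, counts = np.unique(fake_target, return_counts=True)
print(len(labels))                 # 40
print(counts.min(), counts.max())  # 10 10

# Because the labels come in blocks of 10, person k occupies rows 10*k .. 10*k + 9.
person = 3
rows_for_person = np.flatnonzero(fake_target == person)
print(rows_for_person)  # [30 31 32 33 34 35 36 37 38 39]
```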

Converting data and displaying images

Listing 14. Convert to NumPy

# convert to NumPy
import numpy as np
X = np.array(face_dataset.data) # X is the feature matrix; X = input data
y = np.array(face_dataset.target) # y contains the class labels; y = labels
print(X)
print("----------------------------------------------------------------------")
print(y)

Expected text output or note

[[0.30991736 0.3677686  0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
 [0.45454547 0.47107437 0.5123967  ... 0.15289256 0.15289256 0.15289256]
 [0.3181818  0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]
 ...
 [0.5        0.53305787 0.607438   ... 0.17768595 0.14876033 0.19008264]
 [0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
 [0.5165289  0.46280992 0.28099173 ... 0.35950413 0.3553719  0.38429752]]
----------------------------------------------------------------------
[ 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  2  2  2  2
  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9
  9  9  9  9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16
 16 16 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 19 19
 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21
 21 21 21 21 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23
 24 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 26 26 26 26
 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28
 28 28 29 29 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 31 31
 31 31 31 31 31 31 31 31 32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33
 33 33 33 33 34 34 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 35 35
 36 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37 37 37 38 38 38 38
 38 38 38 38 38 38 39 39 39 39 39 39 39 39 39 39]

Listing 15. Get the first image data and reshape it into 64x64 pixels

import matplotlib.pyplot as plt
import numpy as np

# get the first image data and reshape it into 64x64 pixels
first_face_image = X[0].reshape(64, 64)
plt.imshow(first_face_image, cmap = "gray") # for grayscale, reshape the vector into an image and set cmap
# or plt.imshow(first_face_image, cmap = "binary") for a black-and-white colormap
# or plt.imshow(first_face_image, cmap = "cool") for the cool colormap

Expected text output or note

<matplotlib.image.AxesImage at 0x7d813582bcb0>

<Figure size 640x480 with 1 Axes>

[visual output omitted; run the code to display the image or chart]

Practice task. From the first 100 images, display every fifth image.

Listing 16. For subplot numbering, start from 1 when drawing multiple plots, but be careful with data indexes

i = 1    # for subplot numbering, start from 1 when drawing multiple plots, but be careful with data indexes

for row in X[0:100:5]:
  plt.subplot(4, 5, i) # creates a grid of 4 rows and 5 columns and selects the i-th position
  image = row.reshape(64, 64)
  plt.imshow(image, cmap = "gray")
  i += 1

Expected text output or note

<Figure size 640x480 with 20 Axes>

[visual output omitted; run the code to display the image or chart]
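
The slice 0:100:5 used in the loop determines how large the grid has to be. A minimal sketch in plain Python, using a range of 400 row indices as a stand-in for the rows of X:

```python
# Indices chosen by the slice 0:100:5 over a dataset with 400 rows.
selected = list(range(400))[0:100:5]
print(selected[:4], selected[-1])  # [0, 5, 10, 15] 95
print(len(selected))               # 20

rows, columns = 4, 5
# The grid must have at least as many cells as there are selected images.
assert len(selected) <= rows * columns
```

Twenty images fill the 4 x 5 grid exactly; a slice that selects more rows than the grid has cells would make plt.subplot raise an error once the position exceeds rows * columns.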

Listing 17. Second method: number the subplots with enumerate

# second method: enumerate supplies the subplot position and the data row together
for i, data in enumerate(X[0:100:5], start=1):  # without start, counting begins at 0
  plt.subplot(4, 5, i)
  image = data.reshape(64, 64)
  plt.imshow(image, cmap="gray")

Expected text output or note

<Figure size 640x480 with 20 Axes>

[visual output omitted; run the code to display the image or chart]
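
plt.subplot numbers its positions row by row, left to right, starting at 1. The helper below, `grid_cell`, is a hypothetical illustration (not part of matplotlib) that shows which grid cell a given position refers to.

```python
def grid_cell(position, columns):
    """Return the 0-based (row, column) of a 1-based subplot position."""
    return divmod(position - 1, columns)

# Positions in a grid with 5 columns, as in plt.subplot(4, 5, position).
for position in (1, 5, 6, 20):
    print(position, grid_cell(position, columns=5))
# 1 (0, 0)
# 5 (0, 4)
# 6 (1, 0)
# 20 (3, 4)
```

Subtracting 1 before dividing is the same off-by-one adjustment that start=1 makes in the enumerate loop: data indexes begin at 0, subplot positions at 1.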

Practice task. Display images from the 11th to the 29th, every second image, in 2 rows and 5 columns.

Listing 18. Display every second selected face in a subplot grid

i = 1

for row in X[10:30:2]:
  plt.subplot(2, 5, i)
  image = row.reshape(64, 64)
  plt.imshow(image, cmap = "gray")
  i += 1

Expected text output or note

<Figure size 640x480 with 10 Axes>

[visual output omitted; run the code to display the image or chart]
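
The task counts images from 1, while array indexes start at 0, which is why the 11th image is X[10]. A small sketch of that conversion follows; `one_based_range_to_slice` is a hypothetical helper introduced only for illustration.

```python
def one_based_range_to_slice(first, last, step=1):
    """Convert an inclusive 1-based range (first..last) into a 0-based slice."""
    return slice(first - 1, last, step)

selection = one_based_range_to_slice(11, 29, step=2)
print(selection)  # slice(10, 29, 2)

# Apply the slice to a range of 400 row indices standing in for the rows of X.
indices = list(range(400))[selection]
print(indices)       # [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
print(len(indices))  # 10
```

The slice 10:29:2 selects the same rows as the 10:30:2 used in the listing, and the ten results fill the 2 x 5 grid.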
