Image dataset performance when using map

I am attempting a simple inference task on a collection of images stored in a Dataset that I have built using the approach below.

data = ["/paths/to/files.jpg"]
features = Features({"image": Image(), "image_path": Value(dtype="string")})
dataset = Dataset.from_dict(data, features=features)

The task is essentially modeled on this example: Image Similarity with Hugging Face Datasets and Transformers

I have two related questions about performance.

  1. When I call dataset.map(extract_embeddings, batched=True, batch_size=batch_size), there is an enormous performance penalty if I fail to specify remove_columns=["image"]. A profiler shows almost all of the additional time is spent in arrow_writer.py. Apparently something is being written after the embeddings are computed; perhaps the entire dataset is being rewritten incrementally to some staging/cache location? Passing keep_in_memory=True does not affect the performance. The difference is so stark that I’m surprised it isn’t called out in the example, and I wonder whether there is documentation on how a Dataset behaves when a new column is (incrementally?) added to it. Am I building the dataset wrong? Is it normal to have to add remove_columns?

  2. When I load the model onto a GPU, there is a large up-front delay (about 7 seconds, and apparently independent of the data size) before the map process starts that is not present when working on a CPU. A profiler attributes the difference to update_fingerprint. Why would that be different?

I’m interested in getting a better understanding of these and other performance issues, so any insight or recommended reading to understand the API better is appreciated!


Seems like this still isn’t solved.


For now, I’ve managed to reproduce and work around the issue.


0) First: your construction snippet is slightly off

Dataset.from_dict(...) expects a dict mapping column names to lists of values. For your schema you typically want:

from datasets import Dataset, Features, Image, Value

paths = ["/paths/to/file.jpg", ...]
features = Features({"image": Image(), "image_path": Value("string")})

ds = Dataset.from_dict({"image": paths, "image_path": paths}, features=features)

If you only need paths (often the fastest route for embedding inference), skip the Image() column entirely and open/decode inside map().


1) Why keeping "image" makes map() much slower (and why remove_columns=["image"] is “normal”)

1.1 What map() really does

Dataset.map() materializes a new dataset state. Concretely, it:

  • iterates over your dataset in batches,
  • runs your function,
  • then writes a new Arrow table (often to cache shards) containing the columns that remain plus any new columns.

This is why your profiler points at the writer path: after your embeddings are computed, the library still has to serialize the output table.

The docs emphasize that operations like removing columns in-place can be faster precisely because rebuilding/copying is expensive; Dataset.remove_columns() doesn’t copy the remaining columns, whereas map() generally does. (Hugging Face)

1.2 Why "image" is a special (expensive) column

An Image feature is designed to present images to you as decoded objects when accessed, but it can be stored internally as either:

  • { "path": ..., "bytes": None } (cheap), or
  • { "path": None, "bytes": ... } (potentially huge). (Hugging Face)

During map(), if the batch contains decoded PIL.Image objects, the writer needs to encode them back into an Arrow-storable form. The Image feature supports PIL inputs and can convert them to bytes (see the implementation around PIL→bytes conversion). (Hugging Face)

That encode step can dominate runtime, especially if it ends up embedding bytes for every row.

1.3 Why remove_columns=["image"] is so effective

If you drop "image" during map():

  • the output writer never needs to serialize that media payload,
  • the output cache shards stay small,
  • arrow_writer time collapses.

This is consistent with the official “Process image data” guide: it explicitly calls out that batch_size/writer_batch_size defaults can be expensive if you are storing images. (Hugging Face)

1.4 Why keep_in_memory=True didn’t change much

keep_in_memory=True changes whether the result is written to disk, but it doesn’t remove the need to:

  • build Arrow arrays for the output, and
  • encode Python objects (e.g., PIL images) into Arrow-compatible values.

So you can still pay the same “encode the image column” cost.

1.5 Practical “best” patterns for your use case (embeddings)

Pattern A (recommended): store only paths; decode/open inside map()

Fastest and simplest for embedding inference:

from datasets import Dataset, Features, Value
from PIL import Image as PILImage

paths = [...]
ds = Dataset.from_dict({"image_path": paths}, features=Features({"image_path": Value("string")}))

def extract_embeddings(batch):
    imgs = [PILImage.open(p).convert("RGB") for p in batch["image_path"]]
    # preprocess -> model -> embeddings
    return {"embeddings": ...}

ds_emb = ds.map(extract_embeddings, batched=True, batch_size=64)

Pattern B: keep an Image column but prevent expensive decoded-object roundtrips

Keep the column as non-decoded so it stays {path, bytes}:

from datasets import Image
ds = ds.cast_column("image", Image(decode=False))

Now ds[i]["image"] is typically a dict like { "path": "...", "bytes": None } (cheap), and you decode from path inside your function. (Hugging Face)

Pattern C: if you don’t need "image" after embeddings, drop it before mapping

Since Dataset.remove_columns() is in-place and avoids copying the remaining data, it can be cheaper than carrying "image" through a map() at all. (Hugging Face)


2) Why GPU use can add a fixed “startup” delay in update_fingerprint

2.1 What fingerprinting is doing

Datasets uses a fingerprint to identify dataset states for caching and reuse. The fingerprint is updated by hashing:

  • the function passed to map, and
  • the map parameters (e.g., batch_size, remove_columns, etc.). (Hugging Face)

Community explanations go one step further: the hash can include the code and variables used/captured by your map function, and hashing large captured objects is slow. (Hugging Face Forums)

2.2 Why this can change when the model is moved to CUDA

If your extract_embeddings is a closure that captures model (common pattern), then fingerprinting may end up hashing/serializing the model object or parts of it. Once the model is on GPU, that process can trigger:

  • CUDA runtime initialization,
  • synchronization points,
  • device↔host interactions during serialization/hashing.

That creates the “fixed ~seconds” overhead you’re observing: it’s function/closure hashing cost, not dataset-size-dependent work.

A recent GitHub issue (Feb 2026) shows the same class of problem: closures capturing self with non-deterministic/large state can cause fingerprint changes and cache misses—fingerprinting is sensitive to captured state. (GitHub)

2.3 Mitigations

Mitigation A (best): avoid capturing the CUDA model in the mapped callable

Make the hashed function “small”:

  • define it at module scope,
  • load/init the model lazily inside, or store it in a global that isn’t part of the closure.
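A sketch of that pattern (the heavy-init line is a placeholder): at fingerprint time the module-level cache is still None, so hashing stays cheap; the model is only materialized on the first batch. Note that with num_proc > 1 each worker process loads its own copy.

```python
_MODEL = None  # module-level cache; still empty when the map function is hashed

def get_model():
    """Load the model lazily, on first use."""
    global _MODEL
    if _MODEL is None:
        # placeholder for the heavy init, e.g.:
        # _MODEL = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to("cuda")
        _MODEL = object()
    return _MODEL

def extract_embeddings(batch):
    model = get_model()  # looked up at call time, not captured at definition time
    # ... preprocess batch["image_path"], run `model`, collect embeddings ...
    return {"emb": [[0.0] * 8 for _ in batch["image_path"]]}
```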

Mitigation B: pass new_fingerprint=... explicitly

Many transforms accept new_fingerprint; if you provide one, Datasets doesn’t need to derive it by hashing everything. (Hugging Face)

Tradeoff: you must update the fingerprint yourself when you change the model/logic, or you risk reusing stale cache.

Mitigation C: if you’re iterating quickly and don’t want caching behavior, understand what disabling cache does (and doesn’t do)

Disabling caching can prevent reuse of cache files, but fingerprinting logic still exists for dataset-state tracking; large captured objects can still cost time to hash (as noted in the “large objects” discussion). (Hugging Face Forums)


3) Good references and “similar cases” worth reading

Official docs (most directly relevant)

  • The cache / hashing: explains that fingerprints update by hashing the map function and parameters. (Hugging Face)
  • Main classes / transforms: documents new_fingerprint and column ops. (Hugging Face)
  • Process image data: explicitly warns that defaults can be expensive if you are storing images. (Hugging Face)
  • Dataset features (Image decode=False): shows path-vs-bytes representation. (Hugging Face)
  • Image feature implementation: shows supported input forms and PIL/bytes handling. (Hugging Face)

Issues/discussions that match your symptoms

  • Large objects in map() → slow fingerprinting: clear explanation of hashing captured variables. (Hugging Face Forums)
  • Closure capture causing fingerprint/cache issues (Feb 2026): current example of closure sensitivity. (GitHub)
  • Inconsistent caching/fingerprinting behavior: illustrates how fingerprinting interacts with caching and determinism. (GitHub)
  • Large image datasets / using transforms without materializing huge caches (with_transform): suggests loading images on the fly so only the current batch occupies memory. (Hugging Face Forums)
  • Image dataset storage best practices (Parquet/streaming perspective). (Hugging Face Forums)

4) A concrete “fast and predictable” setup for embedding inference

If the goal is “compute embeddings once, keep embeddings + path”:

  1. Keep only image_path in the dataset.
  2. Open images inside map().
  3. Ensure the mapped function does not close over a CUDA model (or set new_fingerprint).
  4. Write embeddings as a compact column (e.g., fixed-size list/array), and keep the dataset small.

If you later need the original image bytes again, don’t carry them through the embedding map; keep paths and re-open.

For larger-scale pipelines, a “stream and transform” approach can avoid repeated materialization costs.


"""
Repro vs fix: (1) large slowdown when keeping an Image column during Dataset.map (writer must rewrite heavy image payloads)
             (2) large startup delay in update_fingerprint when the mapped function captures a big CUDA model

Deps:
  pip install -U "datasets>=2.18" "transformers>=4.40" "torch" "pillow" "numpy"

Key docs (URLs):
  - Dataset.map / remove_columns / new_fingerprint / writer_batch_size:
    https://huggingface.co/docs/datasets/en/package_reference/main_classes
  - Image feature + decode=False:
    https://huggingface.co/docs/datasets/en/image_load
  - Image feature stores PIL from bytes; decode=False returns {"path","bytes"}:
    https://huggingface.co/docs/datasets/en/about_dataset_features
  - Fingerprinting hashes function + captured variables (large objects => slow):
    https://discuss.huggingface.co/t/dealing-with-large-objects-as-arguments-in-datasets-map/10946
  - Closure/fingerprint pitfalls (recent issue):
    https://github.com/huggingface/datasets/issues/7986

Notes:
  - This script intentionally makes (1) gap BIG by constructing an Image column backed by large *bytes* (PNG).
    Keeping that column forces map() to rewrite hundreds of MB into Arrow cache shards.
    Dropping it avoids that rewrite.
  - To isolate writer cost, the (1) mapping function uses input_columns="image_path" so it does NOT decode images.
    That mirrors the “I only want derived features, don’t touch raw payload” scenario.
  - For (2), we load CLIP on CUDA and show fingerprint cost when the map function captures the model.

No argparse; Colab-friendly; T4-safe (batch sizes small; fp16 on CUDA).
"""

import os
import time
import tempfile
from pathlib import Path

import numpy as np
import torch
from PIL import Image as PILImage
from datasets import Dataset, Features, Image, Value
from transformers import CLIPModel


# ----------------------------
# Helpers
# ----------------------------
def make_big_pngs(out_dir: Path, n: int = 128, size: int = 768, seed: int = 0) -> list[str]:
    """
    Generate n random-ish PNG images. Random noise compresses poorly => big bytes => big writer cost.
    size=768, n=128 typically yields ~100s of MB total bytes.
    """
    rng = np.random.default_rng(seed)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n):
        arr = rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8)
        p = out_dir / f"img_{i:05d}.png"
        PILImage.fromarray(arr, mode="RGB").save(p, format="PNG", optimize=True)
        paths.append(str(p))
    return paths


def build_ds_paths(paths: list[str]) -> Dataset:
    """Image column stores only paths (cheap to rewrite)."""
    features = Features({"image": Image(), "image_path": Value("string")})
    return Dataset.from_dict({"image": paths, "image_path": paths}, features=features)


def build_ds_bytes(paths: list[str]) -> Dataset:
    """
    Image column stores bytes (heavy to rewrite).
    Each example is {"path": None, "bytes": <PNG bytes>}.
    """
    images = []
    total = 0
    for p in paths:
        b = Path(p).read_bytes()
        total += len(b)
        images.append({"path": None, "bytes": b})

    print(f"[setup] total image bytes stored in dataset: {total/1024/1024:.1f} MiB")
    features = Features({"image": Image(), "image_path": Value("string")})
    return Dataset.from_dict({"image": images, "image_path": paths}, features=features)


def bench_map(tag: str, ds: Dataset, fn, **kwargs) -> Dataset:
    t0 = time.perf_counter()
    out = ds.map(
        fn,
        batched=True,
        load_from_cache_file=False,  # always execute
        **kwargs,
    )
    dt = time.perf_counter() - t0
    print(f"[{tag}] {dt:8.3f}s | cols={out.column_names}")
    return out


# ----------------------------
# (1) BIG writer-gap demo
# ----------------------------
def add_tiny_embedding_from_paths(image_path):
    """
    Called with batched=True and input_columns="image_path".
    image_path is a list[str].
    Return a tiny embedding to keep compute negligible so writer cost dominates.
    """
    n = len(image_path)
    # tiny embedding (8 dims) to avoid compute/memory; stored as float16
    emb = np.zeros((n, 8), dtype=np.float16)
    return {"emb": emb}


def section_1_big_gap(ds_bytes: Dataset, ds_paths: Dataset):
    print("\n=== (1) BIG gap demo: rewriting heavy image bytes vs dropping image column ===")
    # Make writes more frequent to amplify writer overhead (smaller = more overhead)
    writer_batch_size = 16
    batch_size = 64

    # A) Bytes-backed dataset (expect BIG difference when keeping vs dropping "image")
    _ = bench_map(
        "REPRO bytes-backed: keep image (rewrites big bytes)",
        ds_bytes,
        add_tiny_embedding_from_paths,
        input_columns="image_path",
        batch_size=batch_size,
        writer_batch_size=writer_batch_size,
    )
    _ = bench_map(
        "FIX bytes-backed: remove_columns=['image']",
        ds_bytes,
        add_tiny_embedding_from_paths,
        input_columns="image_path",
        batch_size=batch_size,
        writer_batch_size=writer_batch_size,
        remove_columns=["image"],
    )

    # B) Paths-backed dataset (expect smaller difference; rewriting paths is cheap)
    _ = bench_map(
        "CONTROL paths-backed: keep image (paths are cheap)",
        ds_paths,
        add_tiny_embedding_from_paths,
        input_columns="image_path",
        batch_size=batch_size,
        writer_batch_size=writer_batch_size,
    )
    _ = bench_map(
        "CONTROL paths-backed: remove_columns=['image']",
        ds_paths,
        add_tiny_embedding_from_paths,
        input_columns="image_path",
        batch_size=batch_size,
        writer_batch_size=writer_batch_size,
        remove_columns=["image"],
    )

    print(
        "\n[reading] Why this happens:\n"
        "  - map() materializes a new dataset state and writes kept columns into Arrow cache shards.\n"
        "  - If your Image column is bytes-backed (or becomes bytes-backed), keeping it means rewriting lots of data.\n"
        "  - Dropping it avoids rewriting raw payload.\n"
        "Docs: https://huggingface.co/docs/datasets/en/package_reference/main_classes\n"
        "      https://huggingface.co/docs/datasets/en/about_dataset_features\n"
    )


# ----------------------------
# (2) Fingerprint startup demo
# ----------------------------
def trivial_map_returns_zeros(image_path):
    # image_path is list[str] because input_columns="image_path"
    return {"x": [0] * len(image_path)}


def make_trivial_that_captures_model(model):
    # Captures model in a default arg (even if unused): common “gotcha”
    def f(image_path, _model=model):
        return {"x": [0] * len(image_path)}
    return f


def section_2_fingerprint_delay(paths: list[str]):
    print("\n=== (2) Fingerprint startup overhead demo ===")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Device: {device}")

    # Small dataset: isolate fingerprint cost
    ds_small = Dataset.from_dict(
        {"image_path": paths[:64]},
        features=Features({"image_path": Value("string")}),
    )

    # Load a big model and move to CUDA to amplify hashing cost (T4-safe in fp16)
    model_id = "openai/clip-vit-base-patch32"
    if device == "cuda":
        model = CLIPModel.from_pretrained(model_id, torch_dtype=torch.float16).eval().to(device)
    else:
        model = CLIPModel.from_pretrained(model_id).eval().to(device)

    f_captures_model = make_trivial_that_captures_model(model)

    # REPRO: function captures big model -> update_fingerprint can be slow
    _ = bench_map(
        "REPRO fingerprint (captures model)",
        ds_small,
        f_captures_model,
        input_columns="image_path",
        batch_size=64,
    )

    # FIX: provide new_fingerprint to skip expensive hashing
    _ = bench_map(
        "FIX new_fingerprint='demo_v1'",
        ds_small,
        f_captures_model,
        input_columns="image_path",
        batch_size=64,
        new_fingerprint="demo_v1",
    )

    # FIX: don’t capture model at all
    _ = bench_map(
        "FIX no capture",
        ds_small,
        trivial_map_returns_zeros,
        input_columns="image_path",
        batch_size=64,
    )

    print(
        "\n[reading] Fingerprinting notes:\n"
        "  - Fingerprints are used for caching; hashing can include captured objects.\n"
        "  - Capturing a large CUDA model can add seconds of fixed startup cost.\n"
        "Docs: https://huggingface.co/docs/datasets/en/about_cache\n"
        "      https://discuss.huggingface.co/t/dealing-with-large-objects-as-arguments-in-datasets-map/10946\n"
        "Issue: https://github.com/huggingface/datasets/issues/7986\n"
    )


# ----------------------------
# Main
# ----------------------------
def main():
    # Put the datasets cache in a temporary directory so this demo doesn’t bloat
    # persistent storage. Caveat: `datasets` reads HF_DATASETS_CACHE at import time,
    # so setting it here may not redirect every code path; set the env var before
    # importing datasets for a guaranteed effect.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        os.environ["HF_DATASETS_CACHE"] = str(tmp / "hf_datasets_cache")

        img_dir = tmp / "imgs"
        paths = make_big_pngs(img_dir, n=128, size=768, seed=0)

        # Build two datasets:
        #  - ds_bytes: image column stores bytes (big rewrite cost)
        #  - ds_paths: image column stores paths only (small rewrite cost)
        ds_bytes = build_ds_bytes(paths)
        ds_paths = build_ds_paths(paths)

        section_1_big_gap(ds_bytes, ds_paths)
        section_2_fingerprint_delay(paths)

        print("\nDone.")
        print("If (1) gap isn’t big enough on your runtime, increase either:")
        print("  - n (e.g., 256) or")
        print("  - size (e.g., 1024)")
        print("Expect higher RAM usage because ds_bytes stores all image bytes in memory.")


if __name__ == "__main__":
    main()