I can’t pinpoint a single cause, but the failure point is clear: this error happens in the commit phase, not in the data-transfer phase.
hf upload-large-folder (same as HfApi.upload_large_folder) does three separate things:
- Hash local files.
- Pre-upload file blobs (LFS or Xet backend).
- Commit: send an HTTP “create commit” request that tells the Hub “these N paths now exist and point to these uploaded blobs”.
Your logs say hashing + pre-upload succeed, then “Failed to commit: The read operation timed out”. That means the client waited for the Hub’s response to the commit request and timed out before it got one.
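For context, both entry points boil down to the same call. A minimal sketch of the Python form (the repo id and folder path below are placeholders, not taken from your setup):

```python
# Minimal sketch of the Python entry point; `hf upload-large-folder` drives the same logic.
# The repo id and local folder path are placeholders for your dataset.
from huggingface_hub import HfApi

api = HfApi()
api.upload_large_folder(
    repo_id="your-username/your-cfd-dataset",  # placeholder
    repo_type="dataset",
    folder_path="/path/to/local/dataset",      # placeholder
)
```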
The main underlying cause: commit requests can exceed the server timeout
Hugging Face explicitly documents that when pushing through HTTP, the Hub enforces a 60s timeout on the commit request, and that each file operation is validated server-side. They also note that in rare cases the server may still finish the commit even if the client times out, and recommend keeping commits to roughly 50–100 files per commit to reduce timeout risk. (Hugging Face)
So the failure mode is:
- Your blobs are already uploaded.
- The final “create commit” HTTP call is taking too long (Hub load, repo state, validation cost).
- The client reports a timeout and retries with smaller batches.
Important nuance: a client-side timeout does not always mean “nothing happened”. HF warns the server may complete the commit anyway. (Hugging Face)
A second likely factor in your specific case: you are over the “>1TB dataset” threshold
HF also documents that datasets bigger than 1TB require Team/Enterprise or an explicit storage grant, and they require you to email [email protected] when crossing that scale. (Hugging Face)
Your repo UI currently shows about 1.07 TB already stored. (Hugging Face)
So you are exactly at the point where storage policy and internal throttles can start to matter, even if you were able to upload most of it.
What to do first (fast sanity checks)
1) Check whether the “failed” commit actually landed anyway
Because HF says timeouts can still complete server-side, do this:
- Open the repo “Files and versions”.
- Look for the last batch of files you expected.
- If they appear, your retry loop may be re-attempting work that already succeeded.
This is specifically called out in HF docs: timeout can be raised client-side even if server completes, and you can verify by browsing the repo. (Hugging Face)
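If you prefer to check programmatically, here is a small sketch, assuming you know which paths were in the failing batch (the repo id and file names are placeholders):

```python
# Quick check: did the "timed out" commit actually land server-side?
# The repo id and the expected paths are placeholders for your dataset.
from huggingface_hub import HfApi

api = HfApi()
remote_files = set(api.list_repo_files("your-username/your-cfd-dataset", repo_type="dataset"))

expected = ["train/shard-00041.tar", "train/shard-00042.tar"]  # placeholder: the batch that "failed"
for path in expected:
    status = "present" if path in remote_files else "missing"
    print(f"{path}: {status}")
```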
2) Confirm you are not hitting the >1TB hosting gate
If you have not already: email [email protected] with dataset name, size, format, and intended community use. HF explicitly requires this for >1TB datasets. (Hugging Face)
If you do not, you can end up in a situation where uploads “mostly work” but commits become unreliable or blocked.
Practical fixes that usually work
Fix A: Reduce commit “work” by reducing file-count pressure (best long-term)
If your dataset is many files, the Hub has multiple scaling limits that can degrade reliability:
- ≤10k files per folder recommendation (use subdirectories). (Hugging Face)
- Repo UX degrades after thousands of commits. (Hugging Face)
- Commit requests can timeout when validation takes too long; HF suggests 50–100 files per commit. (Hugging Face)
For very large datasets, HF explicitly recommends using Parquet or WebDataset to share large data efficiently and keep the ecosystem tools working. (Hugging Face)
Concrete approach for CFD data:
- Pack samples into shards:
  - WebDataset: data-00000.tar, data-00001.tar, …
  - Or Parquet row groups / multiple Parquet files per split.
- Target shard sizes of roughly 1–10 GB (or bigger if you prefer fewer files), but avoid single huge objects.
- Keep directory fanout sane (e.g., train/000/, train/001/, …).
Even if your current upload finishes, this restructuring typically prevents future “commit timeout” loops and makes downloads more robust.
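If you go the repacking route, here is a rough sketch using only the standard-library tarfile module; the directory names and the shard-size threshold are illustrative, not taken from your setup:

```python
# Sketch: pack many small sample files into ~2 GB tar shards (WebDataset-style naming).
# Paths and the size threshold are placeholders.
import tarfile
from pathlib import Path

SRC = Path("raw_samples")   # placeholder: directory containing many small files
DST = Path("shards")        # placeholder: output directory for shards
SHARD_BYTES = 2 * 1024**3   # ~2 GB per shard

DST.mkdir(exist_ok=True)
shard_idx, shard_size, tar = 0, 0, None

for sample in sorted(SRC.rglob("*")):
    if not sample.is_file():
        continue
    # Start a new shard when the current one is full (or on the first file).
    if tar is None or shard_size >= SHARD_BYTES:
        if tar is not None:
            tar.close()
        tar = tarfile.open(DST / f"data-{shard_idx:05d}.tar", "w")
        shard_idx, shard_size = shard_idx + 1, 0
    tar.add(sample, arcname=sample.relative_to(SRC).as_posix())
    shard_size += sample.stat().st_size

if tar is not None:
    tar.close()
```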
Fix B: Use Xet-backed uploading properly (performance and robustness)
HF’s upload guide recommends hf_xet and notes you can enable high performance mode with HF_XET_HIGH_PERFORMANCE=1. (Hugging Face)
They also recommend putting the Xet cache on local disk (NVMe/SSD) when uploading from network/distributed filesystems, via HF_XET_CACHE, because the default cache is under HF_HOME which might live on slower network storage. (Hugging Face)
This does not directly remove the 60s commit timeout, but it reduces overall upload friction and can reduce “commit phase” lag indirectly by lowering contention and retry churn.
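A short sketch of that configuration, assuming you launch the upload from Python (the cache path is a placeholder; if you use the `hf` CLI, export the same variables in your shell instead):

```python
# Sketch: put the Xet cache on fast local disk and enable high performance mode
# before starting (or resuming) the upload. The cache path is a placeholder.
import os

os.environ["HF_XET_CACHE"] = "/local/nvme/hf-xet-cache"  # placeholder: local SSD/NVMe path
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"

# ...then call HfApi().upload_large_folder(...) as before; hf_xet reads these
# from the process environment when the upload starts.
```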
Fix C: Make sure resumability is actually working (do not delete the upload cache)
upload_large_folder is resumable because it caches task results locally in a .cache/huggingface directory inside the folder being uploaded. (Hugging Face)
If you:
- run from a different path,
- wipe that cache,
- or upload from different machines without shared cache,
then “resume” becomes much weaker and you can waste time re-hashing/re-preuploading.
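A quick sanity check before re-running, assuming the cache layout described above (the folder path is a placeholder):

```python
# Check that the resume metadata still exists inside the folder being uploaded.
from pathlib import Path

folder = Path("/path/to/local/dataset")  # placeholder: folder passed to upload_large_folder
cache = folder / ".cache" / "huggingface"

if cache.is_dir():
    print(f"Resume cache found at {cache} - re-running should skip completed work.")
else:
    print("No resume cache found - the upload will re-hash / re-pre-upload from scratch.")
```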
Fix D: If you are stuck in a commit-timeout loop, stop relying on one big repo state transition
When the commit step is the bottleneck, the most reliable pattern is:
- Commit fewer file operations per commit (even 1–5 at a time if needed).
- Or commit “by shard” after repacking.
Even HF’s own pain-point threads emphasize that large uploads can hit rate limits or commit-step failures and that chunked workflows are necessary at TB scale. (GitHub)
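If you end up driving commits yourself after repacking, a sketch of small per-commit batches using HfApi.create_commit (repo id, shard directory, and batch size are placeholders):

```python
# Sketch: commit a handful of already-packed shards per commit instead of one
# big repo state transition. All names below are placeholders.
from pathlib import Path
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
repo_id = "your-username/your-cfd-dataset"        # placeholder
shards = sorted(Path("shards").glob("data-*.tar"))  # placeholder shard directory
BATCH = 5  # small batches keep each commit fast to validate server-side

for i in range(0, len(shards), BATCH):
    batch = shards[i : i + BATCH]
    ops = [
        CommitOperationAdd(path_in_repo=f"train/{p.name}", path_or_fileobj=str(p))
        for p in batch
    ]
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=ops,
        commit_message=f"Add shards {batch[0].name}..{batch[-1].name}",
    )
```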
Why your batch is already “small” but still failing (what that implies)
You said it fails committing 20 files at once, then retries with fewer.
20 is not large. If even very small batches still fail, the likely explanations narrow to:
- Server-side slowness or transient infrastructure issues around commit endpoints (you will see similar reports as 503/504 during upload of very large datasets). (GitHub)
- Crossing an account/repo storage policy boundary (you are already at ~1.07 TB). (Hugging Face)
- Too many total repo entries / heavy repo state (many files, many commits), making each additional commit slower to validate and apply. HF warns UX degrades after thousands of commits. (Hugging Face)
Given your repo is already over 1TB, the second point (the storage policy boundary) is the first thing I would address.
Similar cases online (high-signal references) fall into three buckets: timeouts and errors during the commit step, repo scaling limits and “too many files” problems, and the official “how to upload large folders” mechanics.
What I would do for your repo, specifically
- Email [email protected] now with your repo link and size. You are past 1TB, and HF explicitly requires contact for support/storage at that scale. (Hugging Face)
- Verify whether the “timed out” commit actually landed by checking “Files and versions”. HF says it can. (Hugging Face)
- If commits truly are not landing: repack into shards (WebDataset tar shards or Parquet files) and upload those instead of many raw files. HF explicitly recommends Parquet/WebDataset for large dataset sharing. (Hugging Face)
- Ensure the upload runs with stable resumability:
  - same working directory,
  - do not wipe .cache/huggingface,
  - if on a cluster, point HF_XET_CACHE at a local SSD path. (Hugging Face)
Summary
- The error is from the final commit HTTP request timing out, not from uploading bytes. HF documents a 60s commit timeout and recommends limiting commit workload. (Hugging Face)
- Your repo is already ~1.07 TB, and HF requires contacting them for datasets >1TB. (Hugging Face)
- The most robust fix is usually sharding into fewer larger files (WebDataset/Parquet) plus making sure you have the proper >1TB storage grant. (Hugging Face)