Bug in the Google Colab "Assemble Everything" notebook (PyTorch)

Description:

  • Notebook link: section6_pt.ipynb (Colab)

  • Error: running the 2nd cell of the notebook raises the following traceback:

  • TypeError                                 Traceback (most recent call last)
    
    
    /tmp/ipykernel_6268/743095204.py in <cell line: 0>()
          2 
          3 checkpoint = "tblard/tf-allocine"
    ----> 4 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
          5 
          6 sequence = "J'ai attendu un cours d’HuggingFace toute ma vie."
    
    
    

    3 frames


    /usr/local/lib/python3.12/dist-packages/transformers/models/camembert/tokenization_camembert.py in __init__(self, bos_token, eos_token, sep_token, cls_token, unk_token, pad_token, mask_token, additional_special_tokens, add_prefix_space, vocab_file, vocab, **kwargs)
        117             self._vocab = vocab
        118             unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
    --> 119             self._tokenizer = Tokenizer(Unigram(self._vocab, unk_id=unk_index, byte_fallback=False))
        120         else:
        121             self._vocab = [
    
    
    
    TypeError: argument 'vocab': 'str' object cannot be converted to 'PyTuple'
    
  • Model affected: tblard/tf-allocine

Would it be possible to correct the notebook so that the code can be run and tested without errors?


The cause is likely version drift from the major Transformers v4-to-v5 update. This is a common failure mode, but this particular checkpoint hits more compatibility edges than most:


The main cause is version drift.

Your notebook installs transformers[sentencepiece] with no version pin, then immediately loads tblard/tf-allocine through AutoTokenizer.from_pretrained(...). That model repo is old, CamemBERT-based, and its files are legacy-shaped: sentencepiece.bpe.model, special_tokens_map.json, a 2-byte tokenizer_config.json, and TensorFlow weights only as tf_model.h5. Hugging Face released transformers v5 on January 26, 2026, and PyPI now serves 5.4.0 as latest, so a notebook that once worked can now pull a substantially different tokenizer stack than it originally expected. (Hugging Face)
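To see whether a Colab session has drifted onto the v5 stack before anything fails, a small stdlib-only check can be dropped into a cell. This is a sketch; the helper name `needs_pin` is mine, not part of any library:

```python
import importlib.metadata


def needs_pin(dist: str = "transformers", expected_major: int = 4) -> bool:
    """True when `dist` is missing or its major version differs from the
    one the notebook was originally written against."""
    try:
        installed = importlib.metadata.version(dist)
    except importlib.metadata.PackageNotFoundError:
        return True
    return int(installed.split(".")[0]) != expected_major


# Prints True when pip resolved transformers to something other than 4.x
# (or it is not installed in this kernel at all).
print(needs_pin("transformers", expected_major=4))
```

Running this right after the install cell makes the drift visible instead of leaving it to surface later as a tokenizer traceback.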

What is happening in your notebook

The first fragile step is not the PyTorch model load. It is the tokenizer load.

In your notebook, the first real model-related operation is:

from transformers import AutoTokenizer

checkpoint = "tblard/tf-allocine"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

So if the run fails there, the problem is already present before PyTorch inference logic matters. That fits the current upstream bug pattern almost exactly. In March 2026, Hugging Face had multiple reports where AutoTokenizer.from_pretrained(...) failed for older CamemBERT-family models inside transformers/models/camembert/tokenization_camembert.py with ValueError: too many values to unpack (expected 2), and the same reports say the models worked on transformers 4.57.x but failed on 5.x. (GitHub)

Why it breaks now

transformers v5 changed tokenizer internals in a major way.

The v5 release notes and migration guide say Hugging Face is moving away from the old slow/fast tokenizer split, consolidating to a single tokenizer file per model, preferring the tokenizers backend, and supporting SentencePiece through a lighter compatibility layer. The release notes also describe v5 as the first major release in five years and explicitly call out tokenization as one of the significant API changes. That is exactly the area touched by your checkpoint. (GitHub)

CamemBERT is also the right family to suspect here. The official CamemBERT docs say its tokenizer uses a SentencePiece vocab file and that the fast tokenizer is Unigram-based. Your model card labels the checkpoint as CamemBERT, and the files page shows a SentencePiece model file. So the upstream tokenizer refactor and the model’s storage format intersect directly in your case. (Hugging Face)
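The exception text ('str' object cannot be converted to 'PyTuple') also points at the shape mismatch: the new Unigram path expects the vocab as (piece, score) pairs, while a legacy serialization can surface bare strings. A self-contained sketch of that shape difference (illustrative only, not the library's actual code; `is_unigram_shaped` is my own helper):

```python
def is_unigram_shaped(vocab) -> bool:
    """True when every entry is a (piece, score) pair, the shape the
    Unigram constructor expects."""
    return all(
        isinstance(item, tuple) and len(item) == 2 and isinstance(item[0], str)
        for item in vocab
    )


good_vocab = [("<unk>", 0.0), ("▁le", -2.1), ("▁chat", -4.7)]  # (piece, score) pairs
bad_vocab = ["<unk>", "▁le", "▁chat"]                          # legacy bare strings

print(is_unigram_shaped(good_vocab))  # True
print(is_unigram_shaped(bad_vocab))   # False
```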

Why this checkpoint is extra fragile

This checkpoint is not a modern PyTorch-first repo.

The model card’s usage example is:

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")

And the files page shows tf_model.h5, not a native PyTorch weight file. That means the notebook is doing two things at once:

  1. loading an older tokenizer format, and
  2. asking PyTorch to load from TensorFlow weights later with from_tf=True.

The second part is supported: older AutoModel docs explicitly show from_tf=True for loading a TensorFlow checkpoint into a PyTorch auto-model. But it is still an extra compatibility layer stacked on top of the tokenizer problem. (Hugging Face)

So what is the actual root cause

For your case, I would rank the causes like this:

1. Primary cause

A current transformers v5 tokenizer regression or incompatibility with some older CamemBERT-family SentencePiece checkpoints. The closest public reports show the same code path and same exception, and both point to v5 breaking cases that worked on v4.57.x. (GitHub)

2. Enabler

The notebook’s install line is unpinned, so Colab fetches the latest library stack instead of the stack the lesson originally expected. (PyPI)

3. Secondary complication

The checkpoint is TensorFlow-native on the Hub, so the PyTorch version of the notebook depends on from_tf=True and conversion logic after tokenizer loading succeeds. (Hugging Face)

What it is not

It is probably not primarily a PyTorch bug.

Why: the strongest matching public failures break at AutoTokenizer.from_pretrained(...), not deep inside a model forward pass. Also, your notebook installs transformers[sentencepiece], so this is less likely to be the simple “you forgot SentencePiece” class of failure. A stale Colab kernel can still make optional dependencies invisible, but the bigger pattern here is the v5 tokenizer change plus an old checkpoint. (GitHub)
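If you do want to rule out the missing-dependency case explicitly, a stdlib check avoids guessing (the helper name is mine; a False after a pip install usually means the runtime was never restarted):

```python
import importlib.util


def optional_deps_present(names=("sentencepiece", "tokenizers")) -> dict:
    """Map each optional dependency to whether the current kernel can import it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print(optional_deps_present())
```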

Best fix for this notebook

Use a known-good v4 stack and restart the runtime.

Replace the install cell with:

!pip -q uninstall -y transformers tokenizers sentencepiece
!pip -q install "transformers==4.57.1" "tokenizers==0.22.1" "sentencepiece>=0.1.99,<0.3"

Then restart the Colab runtime and run the notebook again from the top.
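If you prefer restarting programmatically rather than through the Runtime menu, a common Colab idiom is to kill the kernel process so Colab respawns it (assumption: a Linux-hosted kernel; this is equivalent to Runtime > Restart runtime):

```python
import os
import signal


def restart_colab_runtime() -> None:
    """Kill the current kernel process; Colab restarts it automatically,
    so the freshly pinned packages are picked up on the next cell run."""
    os.kill(os.getpid(), signal.SIGKILL)


# Call restart_colab_runtime() only after the pinned install has finished.
```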

Why this is the safest fix:

  • a very similar public issue reports broken on transformers 5.2.0 / tokenizers 0.22.2 and working on transformers 4.57.1 / tokenizers 0.22.1;
  • another similar issue reports broken on 5.3.0 and working on 4.57.6. (GitHub)

Keep this line in the PyTorch cell

This part of the notebook is still correct:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, from_tf=True)

The reason is simple: the Hub repo exposes TensorFlow weights, not native PyTorch weights. If you remove from_tf=True, you would be fighting the wrong problem. (Hugging Face)

A cleaner alternative

If your goal is only to finish the lesson with minimal friction, there are two cleaner paths:

Path A

Stay with this checkpoint, but run it in the TensorFlow style shown on the model card with TFAutoModelForSequenceClassification. That matches how the checkpoint was published. (Hugging Face)

Path B

Keep the lesson in PyTorch, but switch to a newer PyTorch-native sentiment checkpoint instead of using an old TensorFlow-first repo. That is the more future-proof route, because it removes both the tokenizer-compatibility risk and the TF-to-PT conversion layer. This second option would need model selection work, so it is more of a migration choice than a quick fix.

Background lesson

This is a standard ML-notebook failure mode:

  • the notebook code still looks correct,
  • the model repo still exists,
  • the tutorial step is still logically fine,
  • but a floating dependency pulls in a major library release,
  • and an older checkpoint hits a new compatibility edge.

That is why this feels sudden. The notebook did not become conceptually wrong. Its environment assumptions expired. The v5 release notes explicitly frame v5 as a major release with significant API and tokenizer changes, and the current PyPI release cadence makes unpinned notebooks drift quickly. (GitHub)

Bottom line

For your notebook, the clearest explanation is:

  • old checkpoint
  • new tokenizer stack
  • unpinned Colab install
  • TensorFlow-first model used from a PyTorch notebook

The safest solution is to pin back to a v4.57.x environment, restart the runtime, and keep from_tf=True for the model load. That addresses the real failure point and matches the closest current upstream evidence. (GitHub)


Cells that truly need fixing

Cell 2

Current:

!pip install transformers[sentencepiece]

This is the main problem. It installs a floating latest version, and transformers v5 introduced major tokenization changes with weekly minor releases after the v5 launch. That makes old notebooks drift into new behavior. Current upstream reports show very similar CamemBERT-family tokenizer failures on v5 that work on v4.57.x. (GitHub)

Replace it with:

# Run once, then restart the Colab runtime.
!pip -q uninstall -y transformers tokenizers sentencepiece
!pip -q install "transformers==4.57.1" "tokenizers==0.22.1" "sentencepiece>=0.1.99,<0.3"

Cell 11

Current:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "tblard/tf-allocine"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, from_tf=True)
sequences = [
    "J'ai attendu un cours de HuggingFace toute ma vie.",
    "Moi aussi !",
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

This cell is mostly correct already. The important part is from_tf=True, because that checkpoint exposes TensorFlow weights (tf_model.h5) rather than native PyTorch weights. So this cell does not need a conceptual fix, only a light cleanup. (GitHub)

Cleaner version:

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "tblard/tf-allocine"

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    from_tf=True,
)

sequences = [
    "J'ai attendu un cours de HuggingFace toute ma vie.",
    "Moi aussi !",
]

# `tokenizer` is reused from the AutoTokenizer cell above (Cell 3).
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens)

print(outputs.logits)

Cells that do not need fixing

Cell 3

from transformers import AutoTokenizer

checkpoint = "tblard/tf-allocine"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "J'ai attendu un cours d’HuggingFace toute ma vie."

model_inputs = tokenizer(sequence)

This is fine once Cell 2 is fixed and the runtime is restarted. The failure is here, but the root cause is upstream of it: the environment. So this cell does not need a logic fix. It only needs the install cell above it to stop pulling the bad version range.

Cells 4 to 10

These are demo cells for:

  • single example tokenization
  • batch tokenization
  • padding
  • truncation
  • tensor return types
  • token IDs and decoding

They are not the cause of the break. They may be cleaned up, but they do not need fixing for compatibility.

Small nuance

If you want the notebook to be more robust and easier to read, then I would still rewrite Cells 3 to 10 for consistency. But that is optional cleanup, not a necessary fix.

So the minimum necessary patch is:

  • Fix Cell 2
  • Restart runtime
  • Keep Cell 3
  • Keep Cell 11, maybe with a light cleanup

Minimum-diff patch

If you want the least editing possible, use only these two replacements.

Replace Cell 2 with

# Run once, then restart the Colab runtime.
!pip -q uninstall -y transformers tokenizers sentencepiece
!pip -q install "transformers==4.57.1" "tokenizers==0.22.1" "sentencepiece>=0.1.99,<0.3"

Replace Cell 11 with

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "tblard/tf-allocine"

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    from_tf=True,
)

sequences = [
    "J'ai attendu un cours de HuggingFace toute ma vie.",
    "Moi aussi !",
]

# `tokenizer` is reused from the AutoTokenizer cell above (Cell 3).
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens)

print(outputs.logits)

Bottom line

Necessary fixes: Cell 2.
Recommended cleanup: Cell 11.
Optional cleanup only: Cells 3 to 10.