Lance-3B-Video-MLX

Video variant of Lance-3B-MLX. First native MLX port of ByteDance Research's Lance, a 3B-parameter unified multimodal model for image/video generation, editing, and understanding. Runs natively on Apple Silicon; no CUDA required.

The architecture is Qwen2.5-VL-3B + parallel MoE-gen experts + Wan 2.2 VAE. Lance uses "Mixture-of-Tokens" routing: every attention block and MLP has a parallel *_moe_gen branch. Text tokens go through the normal weights; VAE-latent (generation) tokens go through the _moe_gen weights, all in the same forward pass.

Quick start (self-contained, no external repo needed)

# 1. Download the model (one-time, ~30 GB total)
hf download RockTalk/Lance-3B-Video-MLX --local-dir Lance-3B-Video-MLX

# 2. Install runtime deps
cd Lance-3B-Video-MLX
pip install -r requirements.txt

# 3. Generate a 9-frame video (T_lat=3 → 9 output frames)
python inference.py --prompt "a calm ocean wave rolling onto a sandy beach"

# Longer video (29 frames):
python inference.py --prompt "..." --t-lat 8

# Pure image (T_lat=1):
python inference.py --prompt "..." --t-lat 1 --size 512 --steps 30

First run auto-fetches the companion VAE (RockTalk/Wan2.2-VAE-MLX, ~2.6 GB, cached as wan22_vae.safetensors) so all subsequent runs are fully offline.

CLI options

python inference.py \
  --prompt "..." \
  --out output.png \
  --size 256 \
  --t-lat 3 \
  --steps 24 \
  --cfg 4.0 \
  --seed 0 \
  --mp4 \
  --fps 8

# --out    frame strip + per-frame PNGs saved next to it
# --size   256 recommended for T2V
# --t-lat  latent frames; output = (t_lat-1)*4 + 1 frames
# --steps  24 typical for T2V
# --mp4    also emit output.mp4 (needs `pip install 'imageio[ffmpeg]'`)

Frame-count table: T_lat=1 → 1, T_lat=3 → 9, T_lat=8 → 29, T_lat=31 → 121 (max).
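The frame-count rule above can be sketched as a small helper. The function names `output_frames` and `t_lat_for_frames` are illustrative and are not part of inference.py; the cap of 31 latent frames is taken from the table.

```python
MAX_T_LAT = 31  # largest T_lat listed in the frame-count table


def output_frames(t_lat: int) -> int:
    """Decoded frame count for a given number of latent frames."""
    if not 1 <= t_lat <= MAX_T_LAT:
        raise ValueError(f"t_lat must be in [1, {MAX_T_LAT}], got {t_lat}")
    return (t_lat - 1) * 4 + 1


def t_lat_for_frames(frames: int) -> int:
    """Smallest T_lat whose output covers at least `frames` frames."""
    if frames < 1:
        raise ValueError("frames must be >= 1")
    t_lat = -(-(frames - 1) // 4) + 1  # ceil((frames - 1) / 4) + 1
    if t_lat > MAX_T_LAT:
        raise ValueError(f"{frames} frames exceeds the {MAX_T_LAT}-latent-frame limit")
    return t_lat


if __name__ == "__main__":
    for t in (1, 3, 8, 31):
        print(f"T_lat={t} -> {output_frames(t)} frames")
```

Because only frame counts of the form 4k + 1 are reachable, a request for, say, 10 frames would round up to the next T_lat and produce a few extra frames.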

Programmatic use

import json
from pathlib import Path

import mlx.core as mx
from transformers import AutoTokenizer

from inference import build_lance_config, ensure_vae_weights
from lance_mlx.lance import Lance
from lance_mlx.vae_wan22 import Wan2_2_VAE

# Run from inside the downloaded repo directory.
repo = Path(".")
cfg_json = json.loads((repo / "config.json").read_text())
lance_cfg = build_lance_config(cfg_json)

# Load everything except the bundled ViT tensors (vit_model.* prefix).
model = Lance(lance_cfg)
weights = mx.load(str(repo / "model.safetensors"))
non_vit = {k: v for k, v in weights.items() if not k.startswith("vit_model.")}
model.load_weights(list(non_vit.items()), strict=True)

# Build the Wan 2.2 VAE and load the companion weights (fetched on first use).
vae = Wan2_2_VAE(z_dim=48, c_dim=160, dim_mult=(1, 2, 4, 4),
                 temperal_downsample=(False, True, True))
vae.model.load_weights(list(mx.load(str(ensure_vae_weights(repo))).items()), strict=True)

# Tokenize the prompt without special tokens; the sampler adds them itself.
tok = AutoTokenizer.from_pretrained(str(repo))
text_ids = mx.array(tok("a calm ocean wave", add_special_tokens=False,
                        return_tensors="np").input_ids[0], dtype=mx.int32)

latent = model.sample_t2i(
    prompt_token_ids=text_ids,
    latent_shape=(3, 16, 16),    # T_lat=3 → 9 output frames @ 256×256
    special_token_ids={"bos": 151644, "eos": 151645,
                       "start_of_image": 151652, "end_of_image": 151653,
                       "image_token_id": 151655},
    num_steps=24, timestep_shift=3.5, cfg_scale=4.0, seed=0,
)
video = vae.decode(latent)        # (1, 9, 256, 256, 3) in [-1, 1]
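To turn the decoded tensor into viewable frames, something like the following sketch could be used. It assumes the output layout noted in the comment above, (1, T, H, W, 3) with values in [-1, 1], and works on a NumPy array; `to_uint8_frames` and `frame_strip` are hypothetical helpers, not functions shipped in lance_mlx, and the random array stands in for the real decoder output.

```python
import numpy as np


def to_uint8_frames(video: np.ndarray) -> np.ndarray:
    """Map a (1, T, H, W, 3) array in [-1, 1] to (T, H, W, 3) uint8."""
    if video.ndim != 5 or video.shape[0] != 1 or video.shape[-1] != 3:
        raise ValueError(f"expected shape (1, T, H, W, 3), got {video.shape}")
    frames = np.clip(video[0], -1.0, 1.0)   # guard against slight overshoot
    frames = (frames + 1.0) * 127.5         # [-1, 1] -> [0, 255]
    return np.round(frames).astype(np.uint8)


def frame_strip(frames: np.ndarray) -> np.ndarray:
    """Concatenate (T, H, W, 3) frames left to right into (H, T*W, 3)."""
    return np.concatenate(list(frames), axis=1)


# Stand-in for the decoder output (assumption: same layout and value range).
rng = np.random.default_rng(0)
fake_video = rng.uniform(-1.0, 1.0, size=(1, 9, 256, 256, 3)).astype(np.float32)

frames = to_uint8_frames(fake_video)
strip = frame_strip(frames)
```

With the real pipeline, the MLX array would presumably be converted first, for example with `np.array(video)`, before being passed to `to_uint8_frames`.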

What works

| Capability | Status |
|---|---|
| Text-to-video (T2V) | ✅ Working: verified at T_lat=3 (9 frames @ 256×256) using Wan 2.2 VAE v0.1.0 streaming cache |
| Text-to-image (T2I) | ✅ Working (same code path as T2V with T_lat=1) |
| X→T (image understanding) | ✅ Working: same code path as in Lance-3B-MLX, ViT weights bundled in model.safetensors under vit_model.* |
| TI2I (image editing) | ✅ Working: same code path as in Lance-3B-MLX |
| TIV2V (text + image → video edit) | ⚠ Architecture in place (extension of TI2I with T_lat>1), untested |
| Strict load of all 1021 LLM/adapter tensors (+ 390 ViT) | ✅ Working |
| Wan 2.2 VAE encode/decode (T=1 and T>1 streaming) | ✅ Working (uses RockTalk/Wan2.2-VAE-MLX) |
| Flow-matching + 3-component CFG + MoE-gen routing + 3D mrope | ✅ Working |

Sample generations

Text-to-video (T2V)

Verified on M4 Studio (128 GB). 24 steps, CFG=4, 256×256, T_lat=3 → 9 frames:

"a calm ocean wave rolling onto a sandy beach", 9-frame strip (left to right):

[Image: ocean wave strip]

Single frames (frame 0, 4, 8):

[Images: frame 0, frame 4, frame 8]

Performance

Measured on M4 Studio (128 GB) at CFG=4 (one conditional + one unconditional forward per step):

| Mode | Resolution × frames | Steps | Per-step | Total sample | VAE decode |
|---|---|---|---|---|---|
| T2I | 256×256 × 1 | 24 | ~400 ms | ~9.6 s | ~0.1 s |
| T2I | 512×512 × 1 | 30 | ~1.2 s | ~36 s | ~0.5 s |
| T2V | 256×256 × 9 (T_lat=3) | 24 | ~900 ms | ~22 s | ~0.9 s |

First-call kernel-compile penalty: a few seconds per new resolution.
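The "Total sample" column is simply steps times per-step latency. A rough estimator along those lines, with a hypothetical function name and the measured per-step figures from the table as inputs, might look like this:

```python
def estimate_total_seconds(steps: int, per_step_seconds: float,
                           vae_decode_seconds: float = 0.0) -> float:
    """Rough wall-clock estimate: sampling steps plus optional VAE decode.

    Ignores the one-time kernel-compile penalty for a new resolution.
    """
    return steps * per_step_seconds + vae_decode_seconds


if __name__ == "__main__":
    # (steps, per-step seconds) taken from the measurements above
    for name, steps, per_step in [("T2I 256", 24, 0.4),
                                  ("T2I 512", 30, 1.2),
                                  ("T2V 256x9", 24, 0.9)]:
        print(f"{name}: ~{estimate_total_seconds(steps, per_step):.1f} s")
```

These are the numbers measured on one machine; other Apple Silicon chips would need their own per-step measurements.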

Differences vs Lance-3B-MLX

This is the same architecture as the image variant, with two differences:

  • model.safetensors: 26.5 GB (vs 23 GB), extra weights for multi-frame attention
  • latent_pos_embed.pos_embed: 31 × 64 × 64 = 126,976 positions (vs 1 × 64 × 64 = 4,096), which supports up to 31 latent frames (≈ 121 video frames @ 4× temporal downsample)

T2I via this checkpoint works the same as Lance-3B-MLX. T2V is now live and uses the Wan 2.2 VAE v0.1.0 streaming cache under the hood. Pass latent_shape=(T_lat, H_lat, W_lat) with T_lat > 1 to sample_t2i to generate a video.
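A minimal sketch of how a latent_shape might be derived from a pixel size follows. The 16× spatial factor is an assumption inferred from the 256×256 → (3, 16, 16) example in this card, the 31 × 64 × 64 limits come from the position-embedding size above, and `latent_shape_for` is a hypothetical helper rather than part of lance_mlx.

```python
SPATIAL_DOWNSAMPLE = 16            # assumption: 256 px maps to 16 latent cells
MAX_T, MAX_H, MAX_W = 31, 64, 64   # from latent_pos_embed.pos_embed


def latent_shape_for(size: int, t_lat: int) -> tuple[int, int, int]:
    """Return (T_lat, H_lat, W_lat) for a square output of `size` pixels."""
    if size % SPATIAL_DOWNSAMPLE != 0:
        raise ValueError(f"size must be a multiple of {SPATIAL_DOWNSAMPLE}")
    side = size // SPATIAL_DOWNSAMPLE
    if not 1 <= t_lat <= MAX_T:
        raise ValueError(f"t_lat must be in [1, {MAX_T}]")
    if side > MAX_H or side > MAX_W:
        raise ValueError("latent grid exceeds the position-embedding table")
    return (t_lat, side, side)


def grid_cells(shape: tuple[int, int, int]) -> int:
    """Number of (t, h, w) grid cells covered by a latent shape."""
    t, h, w = shape
    return t * h * w


if __name__ == "__main__":
    print(latent_shape_for(256, 3))   # video at 256x256
    print(latent_shape_for(512, 1))   # single image at 512x512
```

Whether the model patchifies the latent grid further before attention is not stated here, so `grid_cells` counts grid positions, not necessarily sequence tokens.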

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 26.5 GB | LLM (Qwen2.5-VL with MoE-gen) + Lance adapters + bundled ViT (vit_model.* prefix), 1411 tensors total |
| vit.safetensors | 1.25 GB | Qwen2.5-VL ViT, also extractable from model.safetensors |
| vae.safetensors | 2.62 GB | Wan 2.2 VAE (older "nested-conv" keying, kept for archival; inference.py auto-fetches the cleanly keyed RockTalk/Wan2.2-VAE-MLX instead) |
| config.json | - | Distilled architecture config |
| tokenizer.json, vocab.json, merges.txt | - | Qwen2.5-VL tokenizer, verbatim |
| samples/ocean_wave_*.png | - | Verified 9-frame T2V outputs |
| lance_mlx/ | - | Bundled MLX implementation (model + VAE + utils) |
| inference.py | - | Self-contained T2V/T2I runner |
| requirements.txt | - | Pip dependencies |

How the MoE-gen routing is implemented in MLX

Lance's checkpoint contains two sets of weights per Qwen2 block:

self_attn.{q,k,v,o}_proj         self_attn.{q,k,v,o}_proj_moe_gen
self_attn.{q,k}_norm             self_attn.{q,k}_norm_moe_gen
mlp.{gate,down,up}_proj          mlp_moe_gen.{gate,down,up}_proj
input_layernorm                  input_layernorm_moe_gen
post_attention_layernorm         post_attention_layernorm_moe_gen

For T2I/T2V the sequence layout is:

<|im_start|> [prompt tokens] <|im_end|> <|vision_start|> [N latent placeholders] <|vision_end|>
                                                          └──── routed through moe_gen ────┘
                                                          ↑ everything else: normal weights

The MLX port (lance_mlx/qwen2_navit_mlx.py) routes by slicing the sequence into the latent slab vs the surrounding text, applying the appropriate expert to each slab, and concatenating. mrope position ids continue to flow normally across both slabs (with axis-T/H/W coordinates only varying inside the latent slab).
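The slab-based routing described above can be illustrated with a toy example. This is a sketch in NumPy rather than MLX, with a single linear projection standing in for a full attention or MLP branch; `build_sequence` and `route_by_slab` are hypothetical names, and the actual code in lance_mlx/qwen2_navit_mlx.py may be organized differently. The token ids are the ones listed in the programmatic example.

```python
import numpy as np

IM_START, IM_END = 151644, 151645          # <|im_start|>, <|im_end|>
VISION_START, VISION_END = 151652, 151653  # <|vision_start|>, <|vision_end|>
IMAGE_PAD = 151655                         # latent placeholder token


def build_sequence(prompt_ids, n_latent):
    """Lay out prompt tokens followed by a block of latent placeholders."""
    return np.array([IM_START, *prompt_ids, IM_END,
                     VISION_START, *[IMAGE_PAD] * n_latent, VISION_END],
                    dtype=np.int32)


def route_by_slab(x, latent_start, latent_end, w_text, w_gen):
    """Apply w_gen inside [latent_start, latent_end) and w_text elsewhere."""
    prefix = x[:latent_start] @ w_text           # text before the latents
    latent = x[latent_start:latent_end] @ w_gen  # latent slab -> moe_gen
    suffix = x[latent_end:] @ w_text             # text after the latents
    return np.concatenate([prefix, latent, suffix], axis=0)


seq = build_sequence(prompt_ids=[11, 12, 13], n_latent=4)
latent_idx = np.flatnonzero(seq == IMAGE_PAD)
start, end = int(latent_idx[0]), int(latent_idx[-1]) + 1

rng = np.random.default_rng(0)
hidden = rng.standard_normal((len(seq), 8)).astype(np.float32)
w_text = rng.standard_normal((8, 8)).astype(np.float32)  # "normal" weights
w_gen = rng.standard_normal((8, 8)).astype(np.float32)   # "_moe_gen" weights

out = route_by_slab(hidden, start, end, w_text, w_gen)
```

Slicing works here because the latent tokens form one contiguous block; a layout with several separate latent blocks would need a mask or multiple slices instead.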

Conversion source

Converted from bytedance-research/Lance/Lance_3B/* using a local conversion pipeline. Layout transforms:

  • Conv weights: PT (O, I, [T,] H, W) → MLX (O, [T,] H, W, I)
  • Embedding weights: shape preserved
  • lm_head.weight tied to embed_tokens.weight (Qwen default)
  • All *_moe_gen.* keys copied verbatim under the same names
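The conv-weight transform in the list above amounts to moving the input-channel axis to the end. A minimal NumPy sketch, with the hypothetical helper name `pt_conv_to_mlx` (the actual conversion pipeline is not published here), could be:

```python
import numpy as np


def pt_conv_to_mlx(weight: np.ndarray) -> np.ndarray:
    """Convert a PyTorch conv weight to MLX's channels-last layout.

    2D: (O, I, H, W)    -> (O, H, W, I)
    3D: (O, I, T, H, W) -> (O, T, H, W, I)
    """
    if weight.ndim not in (4, 5):
        raise ValueError(f"expected a 4D or 5D conv weight, got {weight.ndim}D")
    return np.moveaxis(weight, 1, -1)


# Toy weights: 8 output channels, 3 input channels.
w2d = np.arange(8 * 3 * 5 * 5, dtype=np.float32).reshape(8, 3, 5, 5)
w3d = np.zeros((8, 3, 2, 5, 5), dtype=np.float32)

w2d_mlx = pt_conv_to_mlx(w2d)
w3d_mlx = pt_conv_to_mlx(w3d)
```

Linear and embedding weights would pass through untouched, matching the "shape preserved" note in the list.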

Related ports

A parallel MLX port exists at mlx-community/Lance-3B-Video-bf16 (Apache-2.0). The two checkpoints have been verified numerically equivalent: remapping this repo's F32 weights into their layout and casting to bf16 produces byte-identical pixel output through their pipeline. Use whichever fits your workflow.

License

Apache 2.0, inherited from upstream bytedance-research/Lance. The Wan 2.2 VAE component is also Apache 2.0 from Alibaba's Wan team.

Acknowledgements

  • ByteDance Research: original Lance training + PT release
  • Qwen team: Qwen2.5-VL-3B-Instruct backbone
  • Alibaba Wan team: Wan 2.2 VAE training
  • Apple mlx and mlx-vlm teams: the underlying frameworks
  • mlx-community Lance porters: parallel bf16 port, numerically cross-checked against this one
  • This MLX port: RockTalk

Citation

@misc{lance_mlx,
  title  = {Lance-3B-MLX: First MLX port of ByteDance's Lance},
  author = {RockTalk},
  year   = {2026},
  url    = {https://huggingface.co/RockTalk/Lance-3B-MLX}
}