Lance-3B-MLX

First native MLX port of ByteDance Research's Lance, a 3B-parameter unified multimodal model for image/video generation, editing, and understanding. Runs natively on Apple Silicon, no CUDA required.

The architecture is Qwen2.5-VL-3B + parallel MoE-gen experts + Wan 2.2 VAE. Lance uses "Mixture-of-Tokens" routing: every attention block and MLP has a parallel *_moe_gen branch. Text tokens go through the normal weights; VAE-latent (generation) tokens go through the *_moe_gen weights, all in the same forward pass.

Quick start (self-contained, no external repo needed)

# 1. Download the model (one-time, ~27 GB total)
hf download RockTalk/Lance-3B-MLX --local-dir Lance-3B-MLX

# 2. Install runtime deps
cd Lance-3B-MLX
pip install -r requirements.txt

# 3. Generate
python inference.py --prompt "a photo of a sunset over mountains" --out sunset.png

The first run auto-fetches the companion VAE (RockTalk/Wan2.2-VAE-MLX, ~2.6 GB, cached as wan22_vae.safetensors) so all subsequent runs are fully offline.

CLI options

# --size accepts 256 or 512; --steps is typically 24-30
python inference.py \
  --prompt "..." \
  --out output.png \
  --size 512 \
  --steps 30 \
  --cfg 4.0 \
  --seed 0

Programmatic use

import json, mlx.core as mx
from lance_mlx.lance import Lance, LanceConfig
from lance_mlx.vae_wan22 import Wan2_2_VAE
from mlx_vlm.models.qwen2_5_vl.config import ModelConfig, TextConfig, VisionConfig
from transformers import AutoTokenizer

# Build LanceConfig from config.json. The full helper (`build_lance_config`)
# lives in inference.py and is reusable as a library function.
from inference import build_lance_config, ensure_vae_weights
from pathlib import Path

repo = Path(".")
cfg_json = json.loads((repo / "config.json").read_text())
lance_cfg = build_lance_config(cfg_json)

model = Lance(lance_cfg)
model.load_weights(list(mx.load("model.safetensors").items()), strict=True)

vae = Wan2_2_VAE(z_dim=48, c_dim=160, dim_mult=(1, 2, 4, 4),
                 temperal_downsample=(False, True, True))
vae.model.load_weights(list(mx.load(str(ensure_vae_weights(repo))).items()), strict=True)

tok = AutoTokenizer.from_pretrained(".")
text_ids = mx.array(tok("a sunset over mountains", add_special_tokens=False,
                        return_tensors="np").input_ids[0], dtype=mx.int32)

latent = model.sample_t2i(
    prompt_token_ids=text_ids,
    latent_shape=(1, 32, 32),                   # (T_lat=1, H_lat, W_lat) → 512×512
    special_token_ids={"bos": 151644, "eos": 151645,
                       "start_of_image": 151652, "end_of_image": 151653,
                       "image_token_id": 151655},
    num_steps=30, timestep_shift=3.5, cfg_scale=4.0, seed=0,
)
img = vae.decode(latent)                        # (1, 1, 512, 512, 3) in [-1, 1]
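
The decoded array is in [-1, 1], so it still needs to be mapped to 8-bit pixels before saving. The helper below is a minimal sketch, not part of the bundled code: `decoded_to_uint8` is a hypothetical name, and it assumes the decoder output has first been converted to NumPy (for example with `np.array(img)`) and is laid out as (batch, frame, H, W, 3), as the comment above suggests.

```python
import numpy as np

def decoded_to_uint8(img: np.ndarray) -> np.ndarray:
    """Map a (1, T, H, W, 3) array in [-1, 1] to an (H, W, 3) uint8 image.

    Hypothetical helper: only the first batch item and first frame are kept.
    """
    frame = np.asarray(img, dtype=np.float32)[0, 0]      # drop batch and frame axes
    frame = np.clip((frame + 1.0) * 127.5, 0.0, 255.0)   # [-1, 1] -> [0, 255]
    return np.round(frame).astype(np.uint8)
```

The resulting array can then be written out with any image library, for example Pillow's `Image.fromarray(...).save("out.png")`.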

What works

| Capability | Status |
| --- | --- |
| Text-to-image (T2I), single image, CFG | ✅ Working, verified |
| Strict load of all 1021 LLM/adapter tensors | ✅ Working |
| Wan 2.2 VAE encode/decode (T=1) | ✅ Working (uses RockTalk/Wan2.2-VAE-MLX) |
| Flow-matching denoising loop | ✅ Working |
| Classifier-free guidance | ✅ Working |
| 3D mrope position embeddings | ✅ Working |
| MoE-gen routing (per-token attention + MLP + layernorm) | ✅ Working |
| Text-to-video (T2V) | ✅ Working on Lance-3B-Video-MLX (verified at T_lat=3, 9 frames @ 256×256) |
| X→T (image understanding) | ✅ Working: accurate captioning at ~29 tok/s with KV cache |
| Image editing (TI2I) | ✅ Working: ViT + VAE dual conditioning, Lance chat template, three-component CFG (cfg_text + cfg_vit). Semantic edits verified (color change, object addition). |
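
The two shape facts in this document (a (1, 32, 32) latent for 512×512, and T_lat=3 for 9 frames) are consistent with a 16× spatial and 4× temporal compression in which the first frame is encoded on its own. The sketch below turns that into a small calculator. `latent_shape` is a hypothetical helper, and the compression factors are inferred from those two data points rather than read from the Lance code.

```python
def latent_shape(height: int, width: int, frames: int = 1,
                 spatial: int = 16, temporal: int = 4) -> tuple[int, int, int]:
    """Return (T_lat, H_lat, W_lat) for a pixel-space request.

    Assumes frames = 1 + temporal * (T_lat - 1), i.e. the first frame is
    encoded alone and every later latent frame covers `temporal` frames.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    if frames < 1 or (frames - 1) % temporal:
        raise ValueError("frames must have the form 1 + temporal * k")
    return ((frames - 1) // temporal + 1, height // spatial, width // spatial)
```

Under these assumptions, `latent_shape(512, 512)` gives the (1, 32, 32) used in the programmatic example, and `latent_shape(256, 256, frames=9)` gives (3, 16, 16) for the verified T2V setting.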

Sample generations

T2I: text to image

Verified on M4 Studio (128 GB). 30 steps, CFG=4, 512×512:

| Prompt | Output |
| --- | --- |
| "a photo of a sunset over mountains" | (image: sunset) |
| "a fluffy orange cat sitting on a wooden chair, photorealistic" | (image: cat) |
| "a majestic snowy mountain peak with a dramatic blue sky and clouds" | (image: mountain) |

TI2I: image editing

End-to-end edit pipeline: input image → ViT (UND tokens) + VAE-encode (cond latent) → Lance edit-mode chat template → three-component CFG flow-matching → VAE decode.

Three-component CFG (mirrors the PyTorch (PT) Lance implementation): v_final = v_tv_uncond + cfg_text * (v_full - v_t_uncond) + cfg_vit * (v_t_uncond - v_tv_uncond). CFG settings: cfg_text=3.0, cfg_vit=1.0. About 1.5 s/step at 256² (three forward passes per step), so 24 steps ≈ 37 s.
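
The combination rule above is a one-liner once the three velocity predictions are available. The following is a minimal NumPy sketch of just that arithmetic. The function name is hypothetical, and the reading of the three inputs (full conditioning, text dropped, text and ViT both dropped) is inferred from the variable names rather than confirmed against the PT code.

```python
import numpy as np

def three_component_cfg(v_full: np.ndarray,
                        v_t_uncond: np.ndarray,
                        v_tv_uncond: np.ndarray,
                        cfg_text: float = 3.0,
                        cfg_vit: float = 1.0) -> np.ndarray:
    """Combine three velocity predictions with separate text and ViT scales.

    v_full      : prediction with all conditioning present
    v_t_uncond  : prediction with the text condition dropped (assumed)
    v_tv_uncond : prediction with text and ViT conditions dropped (assumed)
    """
    return (v_tv_uncond
            + cfg_text * (v_full - v_t_uncond)
            + cfg_vit * (v_t_uncond - v_tv_uncond))
```

With cfg_text = cfg_vit = 1 the expression collapses to v_full, so guidance is effectively off. With cfg_vit = 1 and cfg_text = 3 (the settings quoted above) it reduces to v_t_uncond + 3 * (v_full - v_t_uncond).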

| Input | Instruction | Output |
| --- | --- | --- |
| (image: cat) | "Add a small red bow tie to the cat." | (image: bowtie) |
| (image: cat) | "Make the cat completely black, like a panther." | (image: panther) |

X→T: image understanding

Same M4 Studio. AR generation with KV cache, ~29 tok/s. Question: "Describe this image briefly."

| Image | Generated description |
| --- | --- |
| (image: cat) | "The image shows orange cats sitting closely together on a wooden surface. The wooden surface has a warm, orange hue that complements the color of the cats." |
| (image: mountain) | "A majestic, snow-covered mountain peak. The mountain is partially shrouded in clouds, creating a dramatic and ethereal atmosphere..." |
| (image: sunset) | "A stunning sunset over a mountain range, with the sky painted in rich hues of orange, red, and yellow. The sun is just below the horizon, casting a warm glow..." |

Performance

Measured on M4 Studio (128 GB) at CFG=4 (one conditional + one unconditional forward per step):

| Mode | Resolution × Frames | Steps | Per-step | Total | Notes |
| --- | --- | --- | --- | --- | --- |
| T2I | 256×256 × 1 | 24 | ~400 ms | ~9.6 s | CFG=4 |
| T2I | 512×512 × 1 | 30 | ~1.2 s | ~36 s | CFG=4 |
| TI2I | 256×256 × 1 | 24 | ~1.5 s | ~37 s | 3-component CFG (3 forwards/step) |
| X→T | 504×504 input | - | ~30 tok/s | ~2 s for 60 tokens | KV cache active |

First-call kernel-compile penalty: a few seconds for each new resolution.
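
For orientation, the T2I per-step cost covers two forward passes (conditional and unconditional) inside a flow-matching loop. The sketch below shows one common way such a loop is written: a shifted timestep schedule plus Euler steps from t=1 (noise) to t=0. It is an illustration under stated assumptions, not Lance's sampler: `shifted_timesteps`, `sample`, and the `velocity_fn(x, t, conditional)` callback are hypothetical, and both the shift formula t' = s·t / (1 + (s - 1)·t) and the sign convention are assumptions borrowed from other flow-matching samplers.

```python
import numpy as np

def shifted_timesteps(num_steps: int, shift: float) -> np.ndarray:
    """Timesteps from 1.0 down to 0.0, warped towards the noisy end by `shift`."""
    t = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * t / (1.0 + (shift - 1.0) * t)

def sample(velocity_fn, x: np.ndarray, num_steps: int = 30,
           shift: float = 3.5, cfg_scale: float = 4.0) -> np.ndarray:
    """Euler integration with two-pass classifier-free guidance.

    velocity_fn(x, t, conditional) is a stand-in for one model forward pass.
    """
    ts = shifted_timesteps(num_steps, shift)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(x, t, True)       # forward pass with the prompt
        v_uncond = velocity_fn(x, t, False)    # forward pass without the prompt
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + (t_next - t) * v               # Euler step towards t = 0
    return x
```

Each loop iteration calls the model twice, which is why the TI2I row, with three passes per step, is correspondingly slower at the same resolution.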

Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | 23 GB | LLM (Qwen2.5-VL with MoE-gen) + Lance adapters, 1021 tensors |
| vit.safetensors | 1.25 GB | Qwen2.5-VL ViT (used by X→T and TI2I) |
| vae.safetensors | 2.62 GB | Wan 2.2 VAE (older "nested-conv" keying, kept for archival; inference.py auto-fetches the cleanly keyed RockTalk/Wan2.2-VAE-MLX instead) |
| config.json | - | Distilled architecture config |
| tokenizer.json, vocab.json, merges.txt | - | Qwen2.5-VL tokenizer, verbatim |
| samples/*.png | - | Verified outputs (T2I + TI2I edit) from this checkpoint |
| lance_mlx/ | - | Bundled MLX implementation (model + VAE + utils) |
| inference.py | - | Self-contained T2I runner |
| requirements.txt | - | Pip dependencies |

How the MoE-gen routing is implemented in MLX

Lance's checkpoint contains two sets of weights per Qwen2 block:

self_attn.{q,k,v,o}_proj         self_attn.{q,k,v,o}_proj_moe_gen
self_attn.{q,k}_norm             self_attn.{q,k}_norm_moe_gen
mlp.{gate,down,up}_proj          mlp_moe_gen.{gate,down,up}_proj
input_layernorm                  input_layernorm_moe_gen
post_attention_layernorm         post_attention_layernorm_moe_gen

Each mode has its own sequence layout. The MLX port (lance_mlx/qwen2_navit_mlx.py) routes by slicing the sequence into the GEN slab vs the surrounding UND text/vision, applying the appropriate expert to each slab, and concatenating.
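
The slice, apply, concatenate idea can be illustrated with a single linear projection. The sketch below uses NumPy instead of MLX to stay dependency-light. `route_by_slab` and its arguments are hypothetical names, not the actual API of qwen2_navit_mlx.py, and it assumes one contiguous GEN slab per sequence.

```python
import numpy as np

def route_by_slab(x: np.ndarray, w_und: np.ndarray, w_gen: np.ndarray,
                  gen_start: int, gen_end: int) -> np.ndarray:
    """Project a (seq_len, dim) sequence with two experts.

    Tokens in [gen_start, gen_end) use the *_moe_gen weights (w_gen);
    everything before and after uses the normal weights (w_und).
    """
    und_head = x[:gen_start] @ w_und          # prompt / vision tokens before the slab
    gen_slab = x[gen_start:gen_end] @ w_gen   # latent (generation) tokens
    und_tail = x[gen_end:] @ w_und            # closing tokens after the slab
    return np.concatenate([und_head, gen_slab, und_tail], axis=0)
```

Slicing contiguous slabs gives the same result as a per-token mask followed by a select, but it avoids computing both experts for every token.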

T2I / T2V, text prompt then target noise:

<|im_start|> [prompt] <|im_end|> <|vision_start|> [N latent placeholders] <|vision_end|>
                                                  └──── routed through moe_gen ────┘
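
As a concrete illustration of this layout, the sketch below assembles the token ids and records where the GEN slab sits. `build_t2i_sequence` is a hypothetical helper, not the bundled implementation. The ids are the ones passed as `special_token_ids` in the programmatic example, and the choice to route only the N placeholders (not the surrounding vision markers) through moe_gen is an assumption.

```python
SPECIAL = {"bos": 151644, "eos": 151645,
           "start_of_image": 151652, "end_of_image": 151653,
           "image_token_id": 151655}

def build_t2i_sequence(prompt_ids, num_latent_tokens, special=SPECIAL):
    """Return (token_ids, (gen_start, gen_end)) for the T2I layout.

    Layout: bos, prompt, eos, start_of_image, N placeholders, end_of_image.
    The returned half-open range marks the placeholder slab.
    """
    ids = [special["bos"], *prompt_ids, special["eos"], special["start_of_image"]]
    gen_start = len(ids)                                   # first placeholder index
    ids += [special["image_token_id"]] * num_latent_tokens
    gen_end = len(ids)                                     # one past the last placeholder
    ids.append(special["end_of_image"])
    return ids, (gen_start, gen_end)
```

The returned range is the kind of boundary a slab-based router needs in order to decide which tokens see the *_moe_gen weights.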

X→T (understanding), image then text question, autoregressive answer:

<|im_start|>system\n[Lance sys]<|im_end|>\n<|im_start|>user\n
  <|vision_start|>[N_vit placeholders]<|vision_end|>[question]
<|im_end|>\n<|im_start|>assistant\n[AR generated tokens...]

All tokens route through normal weights. Per-layer KV cache for the AR loop. Image positions inside <|vision_start|>..<|vision_end|> use 3D mrope grid coords (h_patches/sms × w_patches/sms).
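
To make the grid size concrete, the sketch below computes the number of ViT placeholders for a given input and enumerates relative (t, h, w) coordinates for them. It assumes that "sms" stands for the ViT's spatial merge size and uses Qwen2.5-VL's defaults (patch size 14, merge size 2). Both function names are hypothetical, and the real position ids would also be offset by the text tokens that precede the image.

```python
def vit_token_grid(height: int, width: int,
                   patch_size: int = 14, spatial_merge_size: int = 2):
    """Return (grid_h, grid_w): patches per side divided by the merge size."""
    unit = patch_size * spatial_merge_size
    if height % unit or width % unit:
        raise ValueError("height and width must be multiples of patch_size * merge size")
    return height // unit, width // unit

def mrope_grid_positions(grid_h: int, grid_w: int, t_index: int = 0):
    """Enumerate relative (t, h, w) coordinates in row-major order."""
    return [(t_index, h, w) for h in range(grid_h) for w in range(grid_w)]
```

For the 504×504 input used in the performance table, this gives an 18 × 18 grid, i.e. 324 placeholders, under the stated defaults.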

TI2I (editing), input image (ViT + VAE-cond) + instruction → target noise:

<|im_start|>system\n[Lance edit-mode sys]<|im_end|>\n<|im_start|>user\n
  <|vision_start|>[N_vit placeholders]<|vision_end|>[instruction]
<|im_end|>\n<|im_start|>assistant\n
  <|vision_start|>[N_cond VAE-latent placeholders]<|vision_end|>
  <|vision_start|>[N_tgt noise placeholders]<|vision_end|>
                  └──── routed through moe_gen ────┘

ViT tokens and VAE-cond tokens use normal weights; only the target-noise block uses moe_gen. The Lance edit-mode system prompt is verbatim from PT and is required for the model to recognize edit intent. Three-component CFG (cfg_text, cfg_vit) gives separate control over text vs visual conditioning strength.

Conversion source

Converted from bytedance-research/Lance/Lance_3B/* using a local conversion pipeline. Layout transforms:

  • Conv weights: PT (O, I, [T,] H, W) → MLX (O, [T,] H, W, I)
  • Embedding weights: shape preserved
  • lm_head.weight tied to embed_tokens.weight (Qwen default)
  • All *_moe_gen.* keys copied verbatim under the same names
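
The conv-weight transform in the first bullet amounts to moving the input-channel axis to the end. The following is a minimal NumPy sketch of that step only. `pt_conv_to_mlx` is a hypothetical name, and the actual conversion pipeline is not part of this repo.

```python
import numpy as np

def pt_conv_to_mlx(weight: np.ndarray) -> np.ndarray:
    """Move the input-channel axis of a PyTorch conv weight to the end.

    2D conv: (O, I, H, W)    -> (O, H, W, I)
    3D conv: (O, I, T, H, W) -> (O, T, H, W, I)
    """
    if weight.ndim == 4:
        return np.transpose(weight, (0, 2, 3, 1))
    if weight.ndim == 5:
        return np.transpose(weight, (0, 2, 3, 4, 1))
    raise ValueError(f"expected a 4D or 5D conv weight, got {weight.ndim}D")
```

Non-conv tensors (embeddings, the *_moe_gen projections) would bypass this helper, in line with the other bullets.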

Related ports

A parallel MLX port exists at mlx-community/Lance-3B-Video-bf16 (Apache-2.0). The two checkpoints have been verified numerically equivalent: remapping this repo's F32 weights into their layout and casting to bf16 produces byte-identical pixel output through their pipeline. Use whichever fits your workflow.

License

Apache 2.0, inherited from upstream bytedance-research/Lance. The Wan 2.2 VAE component is also Apache 2.0 from Alibaba's Wan team.

Acknowledgements

  • ByteDance Research: original Lance training + PT release
  • Qwen team: Qwen2.5-VL-3B-Instruct backbone
  • Alibaba Wan team: Wan 2.2 VAE training
  • Apple mlx and mlx-vlm teams: the underlying frameworks
  • mlx-community Lance porters: parallel bf16 port, numerically cross-checked against this one
  • This MLX port: RockTalk

Citation

@misc{lance_mlx,
  title  = {Lance-3B-MLX: First MLX port of ByteDance's Lance},
  author = {RockTalk},
  year   = {2026},
  url    = {https://huggingface.co/RockTalk/Lance-3B-MLX}
}