OpenGVLab

community

https://github.com/opengvlab

AI & ML interests

Computer Vision

Recent Activity

kpzhang996 submitted a paper 2 days ago

Yume-1.5: A Text-Controlled Interactive World Generation Model

lll2343 updated a model 5 days ago

OpenGVLab/SDLM-32B-D4

lll2343 updated a model 5 days ago

OpenGVLab/SDLM-3B-D4

Papers

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

OpenGVLab's collections (34)

InternVideo-Next

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Paper • 2512.01342 • Published Dec 1, 2025 • 15
revliter/internvideo_next_base_p14_res224_f16

91M • Updated 14 days ago • 210 • 3
revliter/internvideo_next_large_p14_res224_f16

0.3B • Updated 14 days ago • 319 • 4
revliter/internvideo_next_large_p14_res224_f16_stage1

Updated 14 days ago • 11 • 1

NaViL

[NeurIPS 2025] Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Paper • 2510.08565 • Published Oct 9, 2025 • 19
OpenGVLab/NaViL-2B

4B • Updated Oct 10, 2025 • 90
OpenGVLab/NaViL-9B

16B • Updated Oct 10, 2025 • 67 • 1

InternVL3.5-Core

This collection includes only the InternVL3.5 checkpoints that have completed the full training pipeline (i.e., Pretraining, SFT, MPO, Cascade RL).

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper • 2508.18265 • Published Aug 25, 2025 • 211
OpenGVLab/InternVL3_5-241B-A28B-HF

Image-Text-to-Text • 241B • Updated Sep 8, 2025 • 81 • 11
OpenGVLab/InternVL3_5-38B-HF

Image-Text-to-Text • 38B • Updated Sep 8, 2025 • 559 • 6
OpenGVLab/InternVL3_5-30B-A3B-HF

Image-Text-to-Text • 31B • Updated Sep 8, 2025 • 177 • 5

ScaleCUA

OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 440 • 11
OpenGVLab/ScaleCUA-7B

Image-Text-to-Text • 8B • Updated Sep 18, 2025 • 261 • 8
OpenGVLab/ScaleCUA-32B

Image-Text-to-Text • 33B • Updated Sep 18, 2025 • 63 • 20
OpenGVLab/ScaleCUA-Data

Preview • Updated Sep 27, 2025 • 995 • 27

Docopilot

[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

OpenGVLab/Docopilot-2B

Image-Text-to-Text • 2B • Updated Jul 20, 2025 • 62 • 8
OpenGVLab/Docopilot-8B

Image-Text-to-Text • 8B • Updated Jul 20, 2025 • 60 • 3
OpenGVLab/Doc-750K

Preview • Updated Jul 22, 2025 • 532 • 15

InternVL3

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 306
OpenGVLab/InternVL3-1B

Image-Text-to-Text • 0.9B • Updated Sep 11, 2025 • 119k • 77
OpenGVLab/InternVL3-2B

Image-Text-to-Text • 2B • Updated Sep 11, 2025 • 24.9k • 43
OpenGVLab/InternVL3-8B

Image-Text-to-Text • 8B • Updated Sep 11, 2025 • 131k • 103

Mono-InternVL

A Pioneering Monolithic MLLM

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Paper • 2410.08202 • Published Oct 10, 2024 • 6
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Paper • 2507.12566 • Published Jul 16, 2025 • 14
OpenGVLab/Mono-InternVL-2B

Image-Text-to-Text • 3B • Updated Jul 22, 2025 • 7.92k • 37
OpenGVLab/Mono-InternVL-2B-S1-1

Image-Text-to-Text • 3B • Updated Jul 22, 2025 • 57

VideoChat-R1

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

OpenGVLab/VideoChat-R1-thinking_7B

Video-Text-to-Text • 8B • Updated Apr 13, 2025 • 31
OpenGVLab/VideoChat-R1_7B

Video-Text-to-Text • 8B • Updated Apr 22, 2025 • 276 • 8
OpenGVLab/VideoChat-R1_7B_caption

Video-Text-to-Text • 8B • Updated Apr 22, 2025 • 36 • 4
OpenGVLab/VideoChat-R1_5-7B

Video-Text-to-Text • 8B • Updated Oct 2, 2025 • 13.2k • 10

VideoMAE-v2

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Paper • 2303.16727 • Published Mar 29, 2023
OpenGVLab/VideoMAEv2-Base

Video Classification • 86.2M • Updated Jan 14, 2025 • 14.5k • 9
OpenGVLab/VideoMAEv2-Large

Video Classification • 0.3B • Updated Jan 14, 2025 • 3.68k • 1
OpenGVLab/VideoMAEv2-Huge

Video Classification • 0.6B • Updated Feb 25, 2025 • 622 • 1

InternVL2.5

Better than InternVL 2.0

InternVL

Space • Running • Featured • 504 • Upload images or text for analysis and responses
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
OpenGVLab/InternVL2_5-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 180 • 192
OpenGVLab/InternVL2_5-78B-AWQ

Image-Text-to-Text • Updated Sep 11, 2025 • 152 • 14

InternVL2.0

Expanding Performance Boundaries of Open-Source MLLM

OpenGVLab/InternVL2-Pretrain-Models

Image-Text-to-Text • Updated Mar 25, 2025 • 11
OpenGVLab/InternVL2-Llama3-76B

Image-Text-to-Text • 76B • Updated Mar 25, 2025 • 198 • 211
OpenGVLab/InternVL2-Llama3-76B-AWQ

Image-Text-to-Text • Updated Mar 25, 2025 • 144 • 25
OpenGVLab/InternVL2-40B

Image-Text-to-Text • 40B • Updated Mar 25, 2025 • 276 • 93

InternVL1.0

Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Paper • 2312.14238 • Published Dec 21, 2023 • 20
OpenGVLab/InternViT-6B-224px

Image Feature Extraction • Updated Dec 9, 2024 • 1.62k • 24
OpenGVLab/InternVL-14B-224px

Image Feature Extraction • 14B • Updated Dec 9, 2024 • 106 • 35
OpenGVLab/InternVL-Chat-V1-2-Plus

Image-Text-to-Text • 40B • Updated Mar 25, 2025 • 108 • 34

InternVL Adaptation

Adaptation Models for Specific Domains

OpenGVLab/Mini-InternVL2-4B-DA-DriveLM

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 100 • 3
OpenGVLab/Mini-InternVL2-4B-DA-Medical

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 95 • 6
OpenGVLab/Mini-InternVL2-4B-DA-BDD

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 55
OpenGVLab/Mini-InternVL2-2B-DA-DriveLM

Image-Text-to-Text • 2B • Updated Mar 26, 2025 • 70 • 2

VideoChat

Chat-Centric Video Understanding

OpenGVLab/VideoChat2_stage3_Mistral_7B

Updated May 22, 2024 • 4
OpenGVLab/VideoChat2_stage2_Mistral_7B

Updated May 22, 2024 • 2
OpenGVLab/VideoChat2-IT

Viewer • Updated Jun 29, 2024 • 1.82M • 389 • 51
VideoChat2

Space • Running • 31 • Display a web page in an iframe

InternVid

A Large-Scale Video-Text Dataset

OpenGVLab/InternVid

Viewer • Updated Aug 13, 2024 • 21.3M • 412 • 90
OpenGVLab/InternVid-10M-FLT-INFO

Viewer • Updated Jan 24, 2024 • 10.6M • 87 • 7
OpenGVLab/ViCLIP

Updated Jun 7, 2024 • 46
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Paper • 2307.06942 • Published Jul 13, 2023 • 23

All-Seeing Project

OpenGVLab/ASMv2

Text Generation • Updated Feb 29, 2024 • 195 • 16
OpenGVLab/ASM-FT

Updated Feb 21, 2024 • 34 • 6
OpenGVLab/AS-Core

Preview • Updated Mar 21, 2024 • 47 • 10
OpenGVLab/ASM-Pretrain

Updated Feb 21, 2024 • 32 • 3

PVT v2

Improved Baselines with Pyramid Vision Transformer

PVT v2: Improved Baselines with Pyramid Vision Transformer

Paper • 2106.13797 • Published Jun 25, 2021
OpenGVLab/pvt_v2_b1

Image Classification • 14M • Updated Mar 12, 2024 • 49 • 1
OpenGVLab/pvt_v2_b2

Image Classification • 25.4M • Updated Mar 12, 2024 • 410 • 1
OpenGVLab/pvt_v2_b2_linear

Image Classification • 22.6M • Updated Mar 12, 2024 • 47 • 1

Vlaser

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

OpenGVLab/Vlaser-2B

2B • Updated Oct 11, 2025 • 76 • 1
OpenGVLab/Vlaser-8B

8B • Updated Oct 11, 2025 • 57 • 2
OpenGVLab/Vlaser-2B-VLA

Updated Oct 11, 2025 • 3
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Paper • 2510.11027 • Published Oct 13, 2025 • 21

InternVL3.5-Flash

InternVL3.5-Flash is a fast variant of InternVL3.5 that uses a semantic-aware dynamic high-resolution strategy.

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Paper • 2510.12793 • Published Oct 14, 2025 • 3
OpenGVLab/InternVL3_5-241B-A28B-Flash

Image-Text-to-Text • 242B • Updated Sep 28, 2025 • 104 • 4
OpenGVLab/InternVL3_5-38B-Flash

Image-Text-to-Text • 40B • Updated Sep 28, 2025 • 122 • 5
OpenGVLab/InternVL3_5-30B-A3B-Flash

Image-Text-to-Text • 31B • Updated Sep 28, 2025 • 193 • 6

InternVL3.5

This collection includes all released checkpoints of InternVL3.5, covering different training stages (e.g., Pretraining, SFT, MPO, Cascade RL).

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper • 2508.18265 • Published Aug 25, 2025 • 211
OpenGVLab/InternVL3_5-241B-A28B-HF

Image-Text-to-Text • 241B • Updated Sep 8, 2025 • 81 • 11
OpenGVLab/InternVL3_5-38B-HF

Image-Text-to-Text • 38B • Updated Sep 8, 2025 • 559 • 6
OpenGVLab/InternVL3_5-30B-A3B-HF

Image-Text-to-Text • 31B • Updated Sep 8, 2025 • 177 • 5

SDLM

Sequential Diffusion Language Models

Sequential Diffusion Language Models

Paper • 2509.24007 • Published Sep 28, 2025 • 45
OpenGVLab/SDLM-32B-D4

Text Generation • 33B • Updated 5 days ago • 97 • 14
OpenGVLab/SDLM-3B-D4

Text Generation • 3B • Updated 5 days ago • 118 • 5

ZeroGUI

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

OpenGVLab/ZeroGUI-AndroidLab-7B

Image-Text-to-Text • 8B • Updated May 30, 2025 • 36 • 5
OpenGVLab/ZeroGUI-OSWorld-7B

Image-Text-to-Text • 8B • Updated Jun 20, 2025 • 42 • 6
ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Paper • 2505.23762 • Published May 29, 2025 • 45

VisualPRM

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13, 2025 • 36
OpenGVLab/VisualPRM-8B

Image-Text-to-Text • 8B • Updated May 6, 2025 • 527 • 17
OpenGVLab/VisualPRM-8B-v1_1

Image-Text-to-Text • 8B • Updated May 29, 2025 • 69 • 9
OpenGVLab/VisualPRM400K

Preview • Updated Apr 15, 2025 • 117 • 15

PIIP

[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Paper • 2501.07783 • Published Jan 14, 2025 • 8
OpenGVLab/PIIP

Object Detection • Updated Apr 16, 2025 • 5
OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B

Image-Text-to-Text • 7B • Updated Apr 20, 2025 • 30
OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B

Image-Text-to-Text • 7B • Updated Apr 20, 2025 • 32

InternVideo2.5

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 21.1k • 87
OpenGVLab/InternVL_2_5_HiCo_R16

Video-Text-to-Text • 8B • Updated Feb 13, 2025 • 302 • 6
OpenGVLab/InternVL_2_5_HiCo_R64

Video-Text-to-Text • 8B • Updated May 13, 2025 • 70 • 3
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Paper • 2501.12386 • Published Jan 21, 2025 • 1

VideoChat-Flash

Faster and more powerful VideoChat.

OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448

Video-Text-to-Text • 2B • Updated Mar 16, 2025 • 1.76k • 26
OpenGVLab/VideoChat-Flash-Qwen2-7B_res224

Video-Text-to-Text • 8B • Updated Mar 16, 2025 • 122 • 7
OpenGVLab/VideoChat-Flash-Qwen2-7B_res448

Video-Text-to-Text • 8B • Updated Mar 16, 2025 • 834 • 12
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Paper • 2501.00574 • Published Dec 31, 2024 • 6

InternVL2.5-MPO

Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 87
OpenGVLab/InternVL2_5-78B-MPO

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 99 • 54
OpenGVLab/InternVL2_5-38B-MPO

Image-Text-to-Text • 38B • Updated Sep 11, 2025 • 175 • 20
OpenGVLab/InternVL2_5-26B-MPO

Image-Text-to-Text • 26B • Updated Mar 25, 2025 • 478 • 14

InternVL1.5

A Pioneering Open-Source Alternative to GPT-4V

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25, 2024 • 58
OpenGVLab/InternVL-Chat-V1-5

Image-Text-to-Text • 26B • Updated Mar 25, 2025 • 6.96k • 416
OpenGVLab/InternViT-6B-448px-V1-5

Image Feature Extraction • 6B • Updated Dec 9, 2024 • 1.24k • 77
OpenGVLab/InternViT-300M-448px

Image Feature Extraction • 0.3B • Updated Jan 8, 2025 • 3.65k • 61

V2PE

Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

OpenGVLab/V2PE

Updated Dec 13, 2024 • 4
OpenGVLab/V2PE-Data

Preview • Updated Dec 14, 2024 • 662 • 7
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Paper • 2412.09616 • Published Dec 12, 2024 • 1

InternVideo2

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Paper • 2403.15377 • Published Mar 22, 2024 • 27
OpenGVLab/InternVideo2-Chat-8B

Video-Text-to-Text • 8B • Updated Oct 10, 2024 • 549 • 23
OpenGVLab/InternVideo2_chat_8B_HD

Video-Text-to-Text • 8B • Updated Dec 18, 2024 • 214 • 18
OpenGVLab/InternVideo2_Chat_8B_InternLM2_5

Video-Text-to-Text • 9B • Updated Sep 19, 2024 • 46 • 7

VideoMamba

State Space Model for Efficient Video Understanding

VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11, 2024 • 29
OpenGVLab/VideoMamba

Video Classification • Updated Apr 14, 2024 • 29
VideoMamba

Space • Runtime error • Featured • 98 • Identify actions and objects in videos and images
Andy1621/VideoMamba

Updated Mar 13, 2024 • 2

OmniCorpus

A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Paper • 2406.08418 • Published Jun 12, 2024 • 31
OpenGVLab/OmniCorpus-CC

Viewer • Updated Mar 20, 2025 • 872M • 8.64k • 22
OpenGVLab/OmniCorpus-CC-210M

Viewer • Updated Mar 20, 2025 • 208M • 203 • 33
OpenGVLab/OmniCorpus-YT

Updated Mar 20, 2025 • 686 • 13

InternImage

Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Paper • 2211.05778 • Published Nov 10, 2022
OpenGVLab/internimage_t_1k_224

Image Classification • 29.9M • Updated Mar 25, 2025 • 149 • 2
OpenGVLab/internimage_s_1k_224

Image Classification • 50.1M • Updated Mar 25, 2025 • 67 • 1
OpenGVLab/internimage_b_1k_224

Image Classification • 97.5M • Updated Mar 25, 2025 • 125 • 1

InternVL Data

OpenGVLab/InternVL-Chat-V1-2-SFT-Data

Viewer • Updated Sep 20, 2024 • 573k • 794 • 29
OpenGVLab/InternVL-SA-1B-Caption

Viewer • Updated Sep 21, 2024 • 8.63M • 175 • 19
OpenGVLab/ShareGPT-4o

Viewer • Updated Aug 17, 2024 • 59.4k • 1.65k • 191
OpenGVLab/MMPR

Preview • Updated Apr 11, 2025 • 81 • 51

Team members 117
