VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper: arXiv:2406.11816
Project page: https://showlab.github.io/videollm-online/
git clone https://github.com/showlab/videollm-online
Ensure you have Miniconda and Python version >= 3.10 installed, then run:
conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation
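To confirm the dependencies above installed cleanly, a quick stdlib-only check can be run (a minimal sketch; the package list mirrors the pip commands above):

```python
import importlib.util

# Report which of the required packages are importable. A name appearing in
# `missing` means the corresponding install step above did not succeed.
required = ["torch", "transformers", "accelerate", "deepspeed", "peft", "gradio"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing:", missing or "none")
```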
Installing PyTorch via conda also pulls in ffmpeg, but it is an old version that usually produces very low-quality preprocessing. Please install the latest static ffmpeg build as follows:
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-7.0.1-amd64-static ffmpeg
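If other tools on the machine still resolve the old conda ffmpeg first, one option (a minimal sketch, assuming the `ffmpeg` directory created by the `mv` step above sits in the current working directory) is to prepend it to PATH before launching the demo:

```python
import os

# Put the static ffmpeg build ahead of any older conda-provided binary on
# PATH, so subprocesses spawned later resolve the new one first.
ffmpeg_dir = os.path.abspath("ffmpeg")  # directory created by the `mv` step
os.environ["PATH"] = ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")
print(os.environ["PATH"].split(os.pathsep)[0])
```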
If you want to try our model with audio in real-time streaming, please also clone ChatTTS:
pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/
Launch the Gradio web demo:
python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
Or run the command-line demo:
python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
Citation:
@inproceedings{videollm-online,
author = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
title = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
booktitle = {CVPR},
year = {2024},
}