Yinan Chen 1★
·
Jiangning Zhang 1,2★
·
Teng Hu 3
·
Yuxiang Zeng 4
·
Zhucun Xue 1
·
Qingdong He 2
·
Chengjie Wang 2,3
·
Yong Liu 1†
·
Xiaobin Hu 2
·
Shuicheng Yan 5
1Zhejiang University
2YouTu Lab, Tencent
3Shanghai Jiao Tong University
4University of Auckland
5National University of Singapore
This repository is a comprehensive collection of resources for IVEBench. If you find any work missing or have any suggestions, feel free to open a pull request or contact us; we will promptly add the missing papers to this repository.
🔥 More up-to-date instruction-guided video editing methods will be added continuously.
📝 Update:
- [2026-01-27] IVEBench has been accepted by ICLR 2026. 🎉🎉🎉
- [2025-11-27] Supports adjusting weights for each dimension.
- [2025-11-26] Update evaluation results: Ditto.
- [2025-10-23] Update evaluation results: Lucy-Edit-Dev, Omni-Video, ICVE.
- [2025-10-16] Update evaluation results: InsV2V, StableV2V, AnyV2V, VACE.
🤓 You can view the scores and comparisons of each method on the IVEBench LeaderBoard.
Compared with existing video editing benchmarks, our proposed IVEBench offers the following key advantages:
- Comprehensive support for IVE methods: IVEBench is specifically designed to evaluate instruction-guided video editing (IVE) models while remaining compatible with traditional source-target prompt-based methods, ensuring broad applicability across editing paradigms;
- Diverse and semantically rich video corpus: The benchmark contains 600 high-quality source videos spanning seven semantic dimensions and thirty topics, with frame lengths ranging from 32 to 1,024, providing wide coverage of real-world scenarios;
- Comprehensive editing taxonomy: IVEBench includes eight major editing categories and thirty-five subcategories, encompassing diverse editing types such as style, attribute, subject motion, camera motion, and visual effect editing, to fully represent instruction-guided behaviors;
- Integration of MLLM-based and traditional metrics: The evaluation protocol combines conventional objective indicators with multimodal large language model (MLLM)-based assessments across three dimensions (video quality, instruction compliance, and video fidelity) for more human-aligned and holistic evaluation;
- Extensive benchmarking of state-of-the-art models: We conduct a thorough quantitative and qualitative evaluation of leading IVE models, including InsV2V, AnyV2V, and StableV2V, as well as the multi-conditional video editing framework VACE, establishing a unified and fair standard for future research.
- Introduction
- Highlight
- Data Pipeline
- Benchmark Statistics
- Installation
- Usage
- Experiments
- Citation
- Contact
Data acquisition and processing pipeline of IVEBench: 1) a curation process yielding 600 high-quality, diverse videos; 2) a well-designed pipeline for generating comprehensive editing prompts.
The source videos can be played back on the IVEBench website.
Statistical distributions of IVEBench DB
git clone git@github.com:RyanChenYN/IVEBench.git
cd IVEBench
conda create -n ivebench python=3.12
conda activate ivebench
pip install -r requirements.txt
Grounding DINO requires additional installation steps, which can be found in the Install section of the Grounding DINO repository.
All checkpoints utilized in this project are listed in matrics/path.yml.
Additionally, you may download the following pretrained models as referenced below:
- Qwen/Qwen2.5-VL-72B-Instruct
- Koala-36M/Training_Suitability_Assessment
- alibaba-pai/VideoCLIP-XL-v2
- baseline_offline.pth from facebook/cotracker3
- groundingdino_swinb_cogcoor.pth from Grounding DINO
After downloading the required checkpoints, you should replace the corresponding loading paths in matrics/path.yml with the local directories where the checkpoints are stored.
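If you prefer to fetch the Hugging Face checkpoints programmatically, the minimal sketch below uses huggingface_hub with the repository IDs listed above; the local target directories are placeholders of our own, and you still need to point the corresponding entries in matrics/path.yml to wherever the files actually land. The Grounding DINO weight file is released by the Grounding DINO project and is downloaded separately.

```python
# Hedged sketch: download the listed Hugging Face checkpoints to a local folder.
# The "./checkpoints/..." directories are placeholders; update matrics/path.yml to
# match wherever you store them.
from huggingface_hub import snapshot_download, hf_hub_download

MODEL_REPOS = [
    "Qwen/Qwen2.5-VL-72B-Instruct",
    "Koala-36M/Training_Suitability_Assessment",
    "alibaba-pai/VideoCLIP-XL-v2",
]

for repo_id in MODEL_REPOS:
    # Download the full model repository into a folder named after the repo.
    snapshot_download(repo_id=repo_id,
                      local_dir=f"./checkpoints/{repo_id.split('/')[-1]}")

# Single-file CoTracker3 checkpoint (file name taken from the list above).
hf_hub_download(repo_id="facebook/cotracker3",
                filename="baseline_offline.pth",
                local_dir="./checkpoints/cotracker3")

# groundingdino_swinb_cogcoor.pth is obtained from the Grounding DINO repository.
```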
This section provides access to the IVEBench Database, which contains the complete .mp4 video data of IVEBench together with a .csv file listing the original URL of each video in the IVEBench Database (videos from the OpenHumanVid subset have no corresponding URLs).
🥰You can download IVEBench DB to your local path using the following command:
huggingface-cli download --repo-type dataset --resume-download Coraxor/IVEBench --local-dir $YOUR_LOCAL_PATH
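After the download finishes, a quick sanity check like the one below can confirm the copy is complete. The exact file layout of the release (folder structure, .csv filename) is not fixed here, so treat the paths as placeholders.

```python
# Quick sanity check of a local IVEBench DB copy (paths are placeholders; adjust
# them to match the actual layout of the downloaded dataset).
from pathlib import Path
import csv

db_root = Path("./IVEBench_DB")                 # $YOUR_LOCAL_PATH from the command above
mp4_files = sorted(db_root.rglob("*.mp4"))
print(f"Found {len(mp4_files)} source videos")  # IVEBench DB contains 600 in total

csv_files = list(db_root.rglob("*.csv"))
if csv_files:
    with open(csv_files[0], newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    print(f"URL table: {len(rows)} rows, columns = {list(rows[0].keys())}")
```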
- You first need to run your own video editing model on the IVEBench DB to generate the corresponding Target Video dataset.
- For each source video, the associated source prompt, edit prompt, target prompt, target phrase, and target span are stored in the .json file provided within the IVEBench DB.
- The filenames of the videos in your generated Target Video dataset must match the corresponding source video names exactly.
- The metric computation of IVEBench requires both the source and target videos to be provided as folders of video frames. You therefore need to convert the .mp4 videos downloaded from the IVEBench DB into frame folders; if the target videos you generate are in .mp4 format, they need to be converted as well (see the sketch after this list).
  python data_process/mp42frames_batch.py --input_path $INPUT_PATH --output_path $OUTPUT_PATH
- The IVEBench DB contains videos ranging from 720P to 8K resolution, with frame counts between 32 and 1024. If your method has limitations on resolution or frame count, you can use data_process/resize_batch.py to downscale and subsample the frame folders converted from the IVEBench DB. This produces a source video dataset at the maximum resolution and frame count your method supports, making subsequent editing and evaluation more convenient.
  python data_process/resize_batch.py --input_path $INPUT_PATH --output_path $OUTPUT_PATH --size $WIDTH $HEIGHT --max_frame $MAX_FRAME
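For reference, the sketch below shows the kind of frame extraction that data_process/mp42frames_batch.py performs. It is not the repository's implementation; the output naming scheme (one sub-folder per video, zero-padded PNG frame indices) and the example input paths are assumptions.

```python
# Illustrative mp4 -> frame-folder conversion (assumptions: OpenCV is available,
# frames are written as zero-padded PNGs into one sub-folder per video; the actual
# data_process/mp42frames_batch.py may differ).
from pathlib import Path
import cv2

def mp4_to_frames(video_path: str, output_root: str) -> int:
    video_path = Path(video_path)
    out_dir = Path(output_root) / video_path.stem   # frame folder named after the video
    out_dir.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(str(video_path))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(out_dir / f"{idx:05d}.png"), frame)
        idx += 1
    cap.release()
    return idx  # number of frames written

# Example usage: convert every .mp4 under a (placeholder) directory.
for mp4 in Path("./IVEBench_DB/videos").glob("*.mp4"):
    n = mp4_to_frames(str(mp4), "./IVEBench_DB/frames")
    print(mp4.name, n, "frames")
```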
- After you have set up the environment, loaded the model weights, prepared the IVEBench DB, and generated the Target Video dataset with your editing method on the IVEBench DB, you can use the evaluation script below to compute the scores of every video in your Target Video dataset on all metrics. The evaluation results are exported as a CSV file.
  cd metrics
  python evaluate.py \
      --output_path $YOUR_TARGET_VIDEOS_DIR \
      --source_videos_path $IVEBENCHDB_SOURCE_VIDEOS_DIR \
      --target_videos_path $YOUR_TARGET_VIDEOS_DIR \
      --info_json_path $PROMPT_JSON_PATH \
      --metric $LIST_OF_METRICS_YOU_NEED
- After obtaining the per-video evaluation results, you can use metrics/get_average_score.py to compute the total score of your method on the IVEBench DB, as well as the average scores across the three dimensions and all individual metrics (an aggregation sketch is shown after this list).
  python get_average_score.py -i $INPUT_CSV -o $OUTPUT_CSV
- Note that IVEBench is divided into two subsets: the IVEBench DB Short subset and the IVEBench DB Long subset. The Short subset contains videos with 32–128 frames, while the Long subset contains videos with 129–1024 frames and represents a higher level of difficulty. To evaluate your method on the full IVEBench DB, you need to generate the Target Video dataset for both subsets separately and evaluate each subset independently.
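As a rough illustration of the averaging step (and of the per-dimension weighting mentioned in the updates), the sketch below aggregates a per-video results CSV into dimension-level and overall scores. The column names, the metric-to-dimension mapping, and the input filename here are hypothetical; metrics/get_average_score.py defines the actual grouping and weighting.

```python
# Hedged sketch of score aggregation: average each metric over all videos, group the
# metric averages into the three IVEBench dimensions (video quality, instruction
# compliance, video fidelity), then combine the dimensions with adjustable weights.
# Column names and the metric -> dimension mapping are hypothetical.
import csv
from statistics import mean

DIMENSIONS = {
    "video_quality":          ["aesthetic_quality", "temporal_consistency"],
    "instruction_compliance": ["edit_accuracy"],
    "video_fidelity":         ["background_preservation"],
}
WEIGHTS = {"video_quality": 1.0, "instruction_compliance": 1.0, "video_fidelity": 1.0}

with open("results.csv", newline="", encoding="utf-8") as f:   # per-video metric CSV
    rows = list(csv.DictReader(f))

# Per-metric average over all evaluated videos.
metric_avg = {
    m: mean(float(r[m]) for r in rows)
    for metric_list in DIMENSIONS.values() for m in metric_list
}

# Per-dimension average, then a weighted total score.
dim_avg = {d: mean(metric_avg[m] for m in ms) for d, ms in DIMENSIONS.items()}
total = sum(WEIGHTS[d] * dim_avg[d] for d in dim_avg) / sum(WEIGHTS.values())

print(dim_avg)
print("weighted total:", total)
```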
A continuously updated, sortable table of the latest IVE methods is available on the IVEBench website.
IVEBench Evaluation Results of Video Editing Models. We visualize the evaluation results of four IVE models on 12 IVEBench metrics. The results are normalized per dimension for clearer comparison.
Comparative demonstrations of the source videos and the target videos generated by different methods can be viewed on the IVEBench website.
If you find IVEBench useful for your research, please consider giving a star ⭐ and a citation 📝 :)
@inproceedings{chen2026ivebench,
title={IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment},
author={Chen, Yinan and Zhang, Jiangning and Hu, Teng and Zeng, Yuxiang and Xue, Zhucun and He, Qingdong and Wang, Chengjie and Liu, Yong and Hu, Xiaobin and Yan, Shuicheng},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
yinanchencs@outlook.com
186368@zju.edu.cn





