Yan Shu (舒言)

I'm a first-year PhD student at the University of Trento, advised by Nicu Sebe and Paolo Rota in the MHUG Group.

Previously, I was a research intern at the Beijing Academy of Artificial Intelligence (BAAI), supervised by Bo Zhao and Zheng Liu.

Before that, I received my Master's degree from Harbin Institute of Technology, supervised by Shaohui Liu. I also worked as a research assistant at the Institute of Information Engineering, Chinese Academy of Sciences, advised by Yu Zhou.

Email  /  Scholar  /  Twitter  /  Github  /  小红书

News

  • 12/2024: 😇😇 Started my PhD journey at the University of Trento, fighting step by step.
  • 11/2024: 🎉🎉 Finished my journey at BAAI. Many thanks to my advisors Zheng Liu and Bo Zhao.
  • 12/2023: 😄😄 Ended my RA at CAS. Many thanks to my advisor Yu Zhou.
  • 06/2023: 🎉🎉 Got my Master's degree at HIT. Many thanks to my advisor Shaohui Liu.

Research

I'm interested in computer vision, multimodal learning, video understanding, remote sensing, and OCR. Below are some selected publications. (* indicates equal contribution.)

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao
CVPR, 2025  
project page / arXiv

First-ever hour-scale video understanding models.

MLVU: Multi-task Long Video Understanding Benchmark
Junjie Zhou*, Yan Shu*, Bo Zhao*, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
CVPR, 2025  
project page / arXiv

First-ever comprehensive long video benchmark.

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou
NeurIPS, 2024   (Spotlight)
project page / arXiv

A diffusion-based scene text editing model as well as a real-world scene text editing benchmark.

First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending
Zhenhang Li, Yan Shu, Weichao Zeng, Dongbao Yang, Yu Zhou
ECAI, 2024  
project page / arXiv

A diffusion-based scene text generation model as well as a synthetic scene text detection dataset.

CLiF-VQA: Enhancing Video Quality Assessment by Incorporating High-Level Semantic Information related to Human Feelings
Yachun Mi, Yan Shu, Yu Li, Chen Hui, Puchao Zhou, Shaohui Liu
ACM MM, 2024  
project page / arXiv

Video quality assessment framework based on CLIP.

Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing
Yan Shu, Weichao Zeng, Zhenhang Li, Fangmin Zhao, Yu Zhou
arXiv, 2024  
project page / arXiv

Survey on low-level scene text processing methods.

Perceiving Ambiguity and Semantics without Recognition: An Efficient and Effective Ambiguous Scene Text Detector
Yan Shu, Wei Wang, Yu Zhou, Shaohui Liu, Aoting Zhang, Dongbao Yang, Weiping Wang
ACM MM, 2023   (Oral)
project page / arXiv

A model designed for ambiguous scene text detection.

Talks

Education and Working Experience

PhD student at UniTN (2024.12-2027.12, expected)
Research intern at BAAI (2024.03-2024.11)
Research Assistant at Chinese Academy of Sciences (2023.06-2023.12)
Master at Harbin Institute of Technology (2021.09-2023.06)
Bachelor at University of International Relations (2017.09-2021.06)

Services

Reviewer for ACM MM 2024, ICLR 2025, and CVPR 2025.