2014-2019 Ph.D. Candidate, Northwestern Polytechnical University
2010-2014 B.S. Honors College, Northwestern Polytechnical University
Work Experience
2020-Now Assistant Professor, Renmin University of China
2019-2020 Research Scientist, Baidu Research
RESEARCH INTERESTS
Machine Multimodal Perception and Learning: Exploring the open problems and methods of multimodal information (e.g., image, sound, and touch) for machine perception, reasoning, and understanding, with the goal of equipping machines with “multisensory cognitive ability”.
Prospective Students/Staff
Curious about the surrounding world, self-driven, and aiming to do interesting, meaningful, and valuable research
PUBLICATIONS
2025
Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition
Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu
Computer Vision and Pattern Recognition (CVPR)
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu
Computer Vision and Pattern Recognition (CVPR)
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu
Computer Vision and Pattern Recognition (CVPR)
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu
Computer Vision and Pattern Recognition (CVPR)
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang*, Di Hu*
International Conference on Learning Representations (ICLR)
2024
On-the-fly Modulation for Balanced Multimodal Learning
Yake Wei, Di Hu*, Henghui Du, and Ji-Rong Wen
IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI)
Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation (Oral)
Ruoxuan Feng, Di Hu*, Wenke Ma, Xuelong Li
Conference on Robot Learning (CoRL)
KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance
Jingxian Lu, Wenke Xia, Dong Wang, Zhigang Wang, Bin Zhao, Di Hu*, and Xuelong Li
Conference on Robot Learning (CoRL)
Diagnosing and Re-learning for Balanced Multimodal Learning
Yake Wei, Siwei Li, Ruoxuan Feng, and Di Hu*
European Conference on Computer Vision (ECCV)
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Juncheng Ma, Peiwen Sun, Yaoting Wang, and Di Hu*
European Conference on Computer Vision (ECCV)
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang†, Peiwen Sun†, Dongzhan Zhou, Guangyao Li, Honggang Zhang, and Di Hu*
European Conference on Computer Vision (ECCV)
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Yaoting Wang†, Peiwen Sun†, Yuanchao Li, Honggang Zhang, and Di Hu*
European Conference on Computer Vision (ECCV)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Guangyao Li, Henghui Du, and Di Hu
ACM Conference on Multimedia (ACMMM)
Unveiling and Mitigating Bias in Audio Visual Segmentation (Oral)
Peiwen Sun, Honggang Zhang, and Di Hu
ACM Conference on Multimedia (ACMMM)
Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection
Xincheng Pang†, Wenke Xia†, Zhigang Wang, Bin Zhao, Di Hu*, Dong Wang, and Xuelong Li
International Conference on Intelligent Robots and Systems (IROS)
Learning Manipulation by Predicting Interaction
Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, and Hongyang Li
Robotics: Science and Systems Conference (RSS)
MMPareto: Innocent Uni-modal Assistance for Enhanced Multi-modal Learning
Yake Wei, Di Hu
International Conference on Machine Learning (ICML)
Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation
Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu
Computer Vision and Pattern Recognition (CVPR)
Quantifying and Enhancing Multi-modal Robustness with Modality Preference
Zequn Yang, Yake Wei, Ce Liang, Di Hu
The Twelfth International Conference on Learning Representations (ICLR)
SphereDiffusion: Spherical Geometry-aware Distortion Resilient Diffusion Model
Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li
The 38th Annual AAAI Conference on Artificial Intelligence
Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer
Yaoting Wang†, Weisong Liu†, Guangyao Li, Jian Ding, Di Hu, Xi Li
The 38th Annual AAAI Conference on Artificial Intelligence