Publications

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Abstract We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts.

Yushu Wu, Zhixing Zhang, Yanyu Li, Yanwu Xu, Anil Kag, Yang Sui, Huseyin Coskun, Ke Ma, Aleksei Lebedev, Ju Hu, Dimitris Metaxas, Yanzhi Wang, Sergey Tulyakov, Jian Ren

SF-V: Single Forward Video Generation Model

Abstract Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs.

Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren

AVID: Any-Length Video Inpainting with Diffusion Model

Abstract Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into video domain, there has been fewer works regarding text-guided video inpainting.

Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

Abstract Language-based object detection is a promising direction towards building a natural interface to describe objects in images that goes far beyond plain category names. While recent methods show great progress in that direction, proper evaluation is lacking.

Samuel Schulter, Vijay Kumar.B.G, Yumin Suh, Konstantinos M. Dafnis, Zhixing Zhang, Shiyu Zhao, Dimitris Metaxas

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

SINE: Single Image Editing with Text-to-Image Diffusion Models

Abstract Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. Such success inspires many efforts trying to use large-scale pre-trained diffusion models for tackling a challenging problem–real image editing.

Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren

SINE: Single Image Editing with Text-to-Image Diffusion Models

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Abstract Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale.

Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar.B.G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Global Matching with Overlapping Attention for Optical Flow Estimation

Abstract Optical flow estimation is a fundamental task in computer vision. Recent direct-regression methods using deep neural networks achieve remarkable performance improvement. However, they do not explicitly capture long-term motion correspondences and thus cannot handle large motions effectively.

Shiyu Zhao, Long Zhao, Zhixing Zhang, Enyu Zhou, Dimitris Metaxas

Global Matching with Overlapping Attention for Optical Flow Estimation