Last Updated: 23rd November 2023

AVID: Any-Length Video Inpainting with Diffusion Models

Supplementary Material

Qualitative Results

In this section, we provide video inpainting results from our method on different sub-tasks, including uncropping, object swap, and re-texturing.
The first row shows the source videos. For uncropping, the dark region indicates the region to fill in, while for object swap and re-texturing, we apply a semi-transparent red mask to indicate the region to inpaint.
We also provide the caption we use for each video, as well as the number of frames in each video.
Note that, as discussed in the paper, structure guidance is only applied to re-texturing.

1. Uncropping

Source Video (48 frames): "A tiger walking near a waterfall."
Source Video (24 frames): "A tiger walking through a jungle."
Source Video (54 frames): "A man hiking in the snow."
Source Video (32 frames): "A car driving on the road."
Source Video (48 frames): "A cow in the water."
Source Video (48 frames): "A train traveling over a bridge in the mountains."
Source Video (64 frames): "A duck swimming in a lake."
Source Video (24 frames): "An elk walking in the grass near a river."


2. Object Swap

Source Video (32 frames): "A mini cooper driving down a road."
Source Video (32 frames): "A young boy wearing a cowboy hat standing in a wheat field."
Source Video (36 frames): "A Porsche car driving down a road."
Source Video (48 frames): "A black Jeep is driving down a road."
Source Video (32 frames): "A leopard walking on a rock."
Source Video (32 frames): "A can floating in the water."
Source Video (24 frames): "A van driving down a road near a forest clearing."
Source Video (16 frames): "An otter swimming in the water."


3. Re-texturing

Source Video (32 frames): "A white duck swimming in a lake."
Source Video (16 frames): "A woman walking in the woods with a leather backpack."
Source Video (16 frames): "A white tiger walking through the jungle."
Source Video (48 frames): "A woman in glossy leather red coat walking through a greenhouse."
Source Video (32 frames): "A frozen cocktail in front of fire."
Source Video (32 frames): "A metal bird bathing in a pond."
Source Video (16 frames): "A yellow maple leaf."
Source Video (44 frames): "A woman with blond hair walking through a wheat field."


More Inpainting Tasks

In this section, we explore how our method can be applied to other video inpainting applications, including content removal and environment swap.
Please refer to Appendix B for details.

Source Video (24 frames), content removal: "A wheat field."
Source Video (24 frames), environment swap: "A woman walking, Egyptian pyramids."
Source Video (24 frames), content removal: "A field of purple flowers."
Source Video (24 frames), environment swap: "A tiger walking on a beach."


Comparison

In this section, we provide video comparisons of our method with the other methods mentioned in the paper.
The first column shows the source videos, and the remaining columns show the results of the different methods.
Please refer to Sec. 4.2 in the main paper and Appendix D for more discussion. As shown below, Per-Frame and T2V0 obtain results with good visual quality, but they are not temporally coherent.
VideoComposer and TokenFlow generate temporally coherent results, but both suffer from loss of background detail.
In both cases, VideoComposer exhibits significant color shifting.
In the first case, TokenFlow misses details such as the lamps along the road, while in the second case it fails to preserve the background color.
Our method generates temporally coherent results with good visual quality and preserved background details.

Source Video (16 frames): "A purple car driving down a road."
Source Video: "A flamingo swimming in a lake."
Methods shown: Per-Frame, T2V0 [2], VideoComposer [3], TokenFlow [4], Ours


Ablation Studies

In this section, we provide the ablation study results of our method mentioned in the paper.

1. Structure guidance

In the following, we show the effect of structure guidance on object swap and re-texturing.
The first row shows re-texturing results with different scales of structure guidance, while the second row shows object swap results.
Please refer to Sec. 4.3 (Effect of structure guidance) in the main paper for more discussion.
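
To make the guidance scale concrete, below is a minimal PyTorch sketch of how a structure-guidance strength could modulate ControlNet-style features before they are added to the denoising UNet. The function and argument names are illustrative assumptions for this page, not our released implementation.

```python
import torch

def apply_structure_guidance(unet_residuals, control_residuals, scale=1.0):
    """Scale structure features before injecting them into the UNet.

    A minimal sketch, assuming a ControlNet-like encoder yields one
    residual feature map per UNet block from the source video's
    structure. `scale` is the guidance strength ablated below: 0.0
    turns structure guidance off, 1.0 applies it at full strength.
    """
    return [u + scale * c for u, c in zip(unet_residuals, control_residuals)]

# Tiny demo with dummy per-block features.
feats = [torch.randn(1, 8, 4, 4) for _ in range(3)]
ctrl = [torch.randn(1, 8, 4, 4) for _ in range(3)]
guided = apply_structure_guidance(feats, ctrl, scale=0.5)
```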

Source Video (16 frames)
Re-texturing prompt: "A golden furred panda eating bamboo."
Object swap prompt: "A raccoon eating bamboo."
Structure guidance scales: 0.0, 0.5, 1.0

2. Temporal MultiDiffusion

In this sub-section, we show the effectiveness of Temporal MultiDiffusion.
As discussed in the paper, middle-frame attention guidance is not applied in this sub-section for a fair comparison.

2.1. Versus per-clip generation

We first show the results of our method against per-clip generation.
As discussed in the main paper (Sec. 4.3, Temporal MultiDiffusion), for a 32-frame video, the per-clip method treats every 16 frames as a clip and generates the two clips independently.
As can be seen in the following, the results are not temporally coherent.
At the boundary between clips, the generated content changes abruptly (the neck of the generated goose and the pose of the raccoon).
In contrast, our method generates the whole video at once and is therefore temporally coherent; a sketch of the idea follows below.
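
The following is a minimal PyTorch sketch of the sliding-window idea described above: at each denoising step, the long latent video is split into overlapping fixed-length windows, each window is denoised by the base model, and the noise predictions are averaged on frames covered by more than one window. The `unet(clip, t)` interface and the window/stride values are assumptions for illustration, not the exact interface of our implementation.

```python
import torch

def temporal_multidiffusion_step(unet, latents, t, window=16, stride=8):
    """One denoising step over a long video via overlapping windows.

    Splits the (B, C, T, H, W) latent video into overlapping
    `window`-frame clips, denoises each clip with the fixed-length base
    model, and averages the noise predictions on frames covered by more
    than one window.
    """
    _, _, T, _, _ = latents.shape
    noise_sum = torch.zeros_like(latents)
    counts = torch.zeros(1, 1, T, 1, 1, device=latents.device)

    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:      # ensure the final frames are covered
        starts.append(T - window)

    for s in starts:
        clip = latents[:, :, s:s + window]
        noise_sum[:, :, s:s + window] += unet(clip, t)
        counts[:, :, s:s + window] += 1

    return noise_sum / counts         # average overlapping predictions
```

Because every frame's noise estimate is averaged over all windows that cover it, adjacent windows are tied together at every denoising step, which removes the abrupt changes seen at clip boundaries.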

Source Video (32 frames): "A goose swimming in a lake."
Source Video (32 frames): "A raccoon standing on a waterfall."
Methods shown: Per-clip, Ours

2.2. Versus larger chunk generation

As discussed in the supplementary material (Appendix E.1), due to the inherent properties of temporal self-attention, our model can be directly applied to generate videos with more frames.
However, this comes at a cost in generation quality.
We denote this as larger chunk generation and label it Baseline below.
As shown in the following, the baseline method sacrifices generation quality.
In the first case, the ears of the generated raccoon appear in the wrong position.

Source Video (24 frames): "A raccoon standing on a waterfall."
Source Video (24 frames): "A brown bear cub walking through a river."
Methods shown: Baseline, Ours

3. Middle-frame attention guidance

We present the effect of middle-frame attention guidance in this sub-section.
A detailed discussion can be found in Sec. 4.3 (Effect of middle-frame attention guidance).
We show object swap results with different scales of attention guidance.
When attention guidance is not applied (scale 0.0), the color of the car shifts gradually from red to black.
With attention guidance, the color of the car stays the same throughout the video.
However, when the guidance is too strong (scale 1.0), artifacts are introduced, as can be seen at the end of the video, specifically on the rear wheel of the car. A sketch of the mechanism follows below.
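
As a rough illustration, the sketch below blends each frame's self-attention keys and values with those of the middle frame, with the guidance scale controlling the blend. This is a simplified reading for illustration only; the function name and the exact blending are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def middle_frame_guided_attention(q, k, v, scale=0.5):
    """Self-attention with middle-frame guidance (illustrative sketch).

    q, k, v: (T, N, D) tensors of per-frame queries, keys, and values,
    where T is the number of frames and N the tokens per frame. Each
    frame's keys/values are blended with those of the middle frame
    before attention; `scale` in [0, 1] sets the guidance strength
    (0.0 recovers plain per-frame self-attention).
    """
    mid = q.shape[0] // 2
    k_mid = k[mid:mid + 1].expand_as(k)   # middle-frame keys for every frame
    v_mid = v[mid:mid + 1].expand_as(v)
    k_g = (1.0 - scale) * k + scale * k_mid
    v_g = (1.0 - scale) * v + scale * v_mid
    return F.scaled_dot_product_attention(q, k_g, v_g)
```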

Source Video (32 frames): "A mini cooper driving down a road."
Attention guidance scales: 0.0, 0.3, 0.5, 0.8, 1.0

We also conduct a further ablation to show the effectiveness of the middle-frame attention guidance mechanism we propose.
As discussed in Appendix E.2, we compare our method with three different attention-based approaches.
Please refer to the appendix for a description of each method.

Source Video (32 frames): "A mini cooper driving down a road."
Methods shown: SC_Attn, MSC_Attn, First-frame attention, Middle-frame attention (ours)


Limitations

Source Video (32 frames): "A lion walking through a jungle."
Source Video (16 frames): "A horse wandering in the woods."


Any-Length Text-to-Video Generation Results

A Potential Extension

"A teddy bear walks on the beach."
(80 frames)
"A teddy bear dancing in Times Square"
(256 frames)


References

[1] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

[2] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.

[3] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023.

[4] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.