In this section, we provide video inpainting results using our method on different sub-tasks, including uncropping, object swap, and retexturing.
The first row contains the source videos.
For uncropping, the dark region indicates the region to fill in,
while for object swap and retexturing, we apply a semi-transparent red mask to indicate the region to inpaint.
We also provide the captions we use for each video as well as the number of frames in each video.
Note that, as discussed in the paper, structure guidance is only applied to retexturing.
Source Video (32 frames) | Source Video (32 frames) | Source Video (36 frames) | Source Video (48 frames) |
---|---|---|---|
"A mini cooper driving down a road." | "A young boy wearing a cowboy hat standing in a wheat field." | "A Porsche car driving down a road." | "A black Jeep is driving down a road." |
Source Video (32 frames) | Source Video (32 frames) | Source Video (24 frames) | Source Video (16 frames) |
"A leopard walking on a rock." | "A can floating in the water." | "A van driving down a road near a forest clearing." | "An otter swimming in the water." |
Source Video (32 frames) | Source Video (16 frames) | Source Video (16 frames) | Source Video (48 frames) |
---|---|---|---|
"A white duck swimming in a lake." | "A woman walking in the woods with a leather backpack." | "A white tiger walking through the jungle." | "A women in glossy leather red coat walking through a greenhouse." |
Source Video (32 frames) | Source Video (32 frames) | Source Video (16 frames) | Source Video (44 frames) |
"A frozen cocktail in front of fire." | "A metal bird bathing in a pond." | "A yellow maple leaf." | "A woman with blond hair walking through a wheat field." |
In this section, we explore how our method can be applied to other video inpainting applications, including content removal and environment swap.
Please refer to Appendix B for details.
Source Video (24 frames) Content removal | "A wheat field." | Source Video (24 frames) Environment swap | "A woman walking, Egyptian pyramids." |
---|---|---|---|
Source Video (24 frames) Content removal | "A field of purple flowers." | Source Video (24 frames) Environment swap | "A tiger walking on a beach." |
In this section, we provide video comparisons between our method and the other methods mentioned in the paper.
The first column shows the source videos, and the remaining columns show the results of the different methods.
Please refer to Sec. 4.2 in the main paper and Appendix D for more discussion.
As shown below, Per-Frame and T2V0 can obtain results with good visual quality, but they are not temporally coherent.
VideoComposer and TokenFlow can generate temporally coherent results, but they both suffer from background detail loss.
In both cases, VideoComposer exhibits significant color shifting.
In the first case, TokenFlow misses details such as the lamp along the road,
while in the second case it fails to preserve the background color.
Our method can generate temporally coherent results with good visual quality and background details.
Source Video (16 frames) | "A purple car driving down a road." | ||||
---|---|---|---|---|---|
"A flamingo swimming in a lake." | |||||
Per-Frame | T2V0[2] | VideoComposer[3] | TokenFlow[4] | Ours |
In this section, we provide ablation study results of our method as mentioned in the paper.
In the following, we show the effect of structure guidance on object swap and retexturing.
In the first row, we showcase the results of retexturing with different scales of structure guidance,
while in the second row, we show the results of object swap.
Please refer to Sec. 4.3 Effect of structure guidance in the main paper for more discussion.
Source Video (16 frames) | "A golden furred panda eating bamboo." | ||
---|---|---|---|
"A raccoon eating bamboo." | |||
0.0 | 0.5 | 1.0 |
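As a concrete illustration of how the structure-guidance scales in the table above (0.0, 0.5, 1.0) could be applied, the sketch below scales ControlNet-style residuals from a structure branch before injecting them into the denoising UNet. The module interfaces and the residual-scaling rule are assumptions made for this illustration, not the exact implementation used in the paper.

```python
import torch

def predict_with_structure_guidance(unet, controlnet, latents, t, text_emb,
                                     structure_cond, scale):
    # Hypothetical sketch: residuals from a ControlNet-style structure branch
    # (e.g. conditioned on depth or edge maps) are weighted by the guidance
    # scale before being added to the base denoising UNet.
    down_res, mid_res = controlnet(
        latents, t,
        encoder_hidden_states=text_emb,
        controlnet_cond=structure_cond,
        return_dict=False,
    )
    # scale = 0.0 disables structure guidance; scale = 1.0 applies it fully.
    down_res = [r * scale for r in down_res]
    mid_res = mid_res * scale
    noise_pred = unet(
        latents, t,
        encoder_hidden_states=text_emb,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res,
    ).sample
    return noise_pred
```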
In this sub-section, we show the effectiveness of temporal MultiDiffusion.
As discussed in the paper, middle-frame attention guidance is not applied in this sub-section to ensure a fair comparison.
We first show results of our method with per-clip generation.
As discussed in the main paper (Sec. 4.3 Temporal MultiDiffusion), for a video with 32 frames, the per-clip method treats every 16 frames as a clip and generates the two clips independently.
As can be seen in the following, the results are not temporally coherent.
At the boundary of each clip, the generated content changes abruptly (the neck of the generated goose and the pose of the raccoon).
In contrast, our method generates the whole video at once, and thus is temporally coherent.
Source Video (32 frames) | "A goose swimming in a lake." | |
---|---|---|
Source Video (32 frames) | "A raccoon standing on a waterfall." | |
Per-clip | Ours |
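For a concrete picture of the difference from per-clip generation, below is a minimal sketch of a single denoising step in the MultiDiffusion style along the temporal axis: overlapping windows are denoised separately and their predictions are averaged on shared frames, so content cannot change abruptly at clip boundaries. The `denoise_fn` helper and the window/stride values are hypothetical placeholders, not the paper's exact configuration.

```python
import torch

def temporal_multidiffusion_step(denoise_fn, latents, window=16, stride=8):
    """One denoising step fused over overlapping temporal windows.

    `latents` has shape (B, C, F, H, W); `denoise_fn` is a hypothetical
    per-clip denoiser that accepts a clip of up to `window` frames.
    """
    num_frames = latents.shape[2]
    fused = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, device=latents.device)

    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:        # make sure the tail is covered
        starts.append(num_frames - window)

    for s in starts:
        clip = latents[:, :, s:s + window]
        pred = denoise_fn(clip)                 # per-clip noise prediction
        fused[:, :, s:s + window] += pred
        counts[s:s + window] += 1

    # Average the predictions wherever windows overlap.
    return fused / counts.view(1, 1, -1, 1, 1)
```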
As discussed in the supplementary material (Appendix E.1), due to the inherent properties of temporal self-attention,
our model can be directly applied to generate videos with more frames.
However, this comes at the cost of generation quality.
We denote this as larger-chunk generation and use Baseline below to represent it.
As shown in the following, the baseline method sacrifices generation quality.
In the first case, the ears of the generated raccoon appear in the wrong position.
Source Video (24 frames) | "A raccoon standing on a waterfall." | |
---|---|---|
Source Video (24 frames) | "A brown bear cub walking through a river." | |
Baseline | Ours |
We present the effect of middle-frame attention guidance in this sub-section.
A detailed discussion can be found in Sec. 4.3 Effect of middle-frame attention guidance.
We show the results of object swap with different scales of attention guidance.
When attention guidance is not applied (a scale of 0.0),
the color of the car gradually shifts from red to black.
With attention guidance, the color of the car remains the same throughout the video.
However, when the guidance is too strong (a scale of 1.0), certain artifacts are introduced,
as can be seen at the end of the video, specifically around the rear wheel of the car.
Source Video (32 frames) | "A mini cooper driving down a road." | ||||
---|---|---|---|---|---|
0.0 | 0.3 | 0.5 | 0.8 | 1.0 |
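As a reference for how the guidance scales in the table above could act inside attention, here is a minimal sketch of a middle-frame attention variant in which every frame's keys and values are blended with those of the middle frame. The tensor layout and the linear-blending rule are assumptions made for this illustration, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def middle_frame_attention(q, k, v, num_frames, scale=0.5):
    """Blend each frame's keys/values with those of the middle frame.

    `scale` controls how strongly the middle frame anchors the appearance:
    0.0 = plain per-frame self-attention, 1.0 = attend only to the middle frame.
    Tensors are assumed to be shaped (B * F, L, D).
    """
    bf, seq_len, dim = q.shape
    b = bf // num_frames
    k = k.view(b, num_frames, seq_len, dim)
    v = v.view(b, num_frames, seq_len, dim)

    mid = num_frames // 2
    k_mid = k[:, mid:mid + 1].expand_as(k)      # broadcast middle-frame keys
    v_mid = v[:, mid:mid + 1].expand_as(v)      # broadcast middle-frame values

    k = ((1.0 - scale) * k + scale * k_mid).reshape(bf, seq_len, dim)
    v = ((1.0 - scale) * v + scale * v_mid).reshape(bf, seq_len, dim)

    return F.scaled_dot_product_attention(q, k, v)
```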
We also conduct further ablation analysis to show the effectiveness of the middle-frame attention guidance mechanism we propose.
As discussed in Appendix E.2, we compare our method with three different attention-based approaches.
Please refer to the appendix for a more detailed description of each method.
Source Video (32 frames) | "A mini cooper driving down a road." | |||
---|---|---|---|---|
SC_Attn | MSC_Attn | First-frame attention | Middle-frame attention (ours) |
Source Video (32 frames) | "A lion walking through a jungle." | Source Video (16 frames) | "A horse wandering in the woods." |
---|---|---|---|
A Potential Extension
"A teddy bear walks on the beach." (80 frames) |
"A teddy bear dancing in Times Square" (256 frames) |
---|---|
[1] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
[2] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[3] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023.
[4] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.