Recent works on diffusion models have demonstrated a strong capability for conditional image generation, e.g., text-guided image synthesis. Such success inspires many efforts to use large-scale pre-trained diffusion models to tackle a challenging problem: real image editing. Works in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Fine-tuning a pre-trained diffusion model on a single image with existing methods causes severe overfitting, and information leakage from the pre-trained model prevents the edited result from preserving the content of the given image while creating the new features described by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon classifier-free guidance, so that the knowledge from a model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with only one given image. Additionally, we propose a patch-based fine-tuning strategy that effectively helps the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including style change, content addition, and object manipulation.
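As a concrete illustration of the model-based guidance, the sketch below blends the noise prediction of the single-image fine-tuned model with that of the pre-trained model during the first K of T denoising steps, using an interpolation weight v; K and v are the hyperparameters quoted in the sample captions below. This is a minimal sketch under assumptions: the linear-blend rule, the names `guided_noise`, `eps_pretrained`, and `eps_finetuned`, and their signatures are illustrative, not the released implementation.

```python
def guided_noise(eps_pretrained, eps_finetuned, z_t, t, edit_prompt,
                 source_token, T=1000, K=400, v=0.7, w=7.5):
    """Blend single-image and pre-trained noise predictions at step t.

    t counts down from T; for the first K steps (t > T - K) the
    fine-tuned model injects the source image's content, after which
    only the pre-trained model steers sampling toward the edit prompt.
    """
    def cfg(eps_fn, prompt):
        # Standard classifier-free guidance with scale w; `None`
        # stands in for the unconditional (null-prompt) branch.
        e_cond = eps_fn(z_t, t, prompt)
        e_uncond = eps_fn(z_t, t, None)
        return e_uncond + w * (e_cond - e_uncond)

    e_edit = cfg(eps_pretrained, edit_prompt)
    if t > T - K:
        e_source = cfg(eps_finetuned, source_token)
        return v * e_source + (1.0 - v) * e_edit
    return e_edit
```

Larger K and v keep the result closer to the source image, while smaller values give the edit prompt more freedom, which matches how the captions below vary these two knobs per example.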
We employ our method on various images and edit each with two target prompts at 512 × 512 resolution. The results illustrate the wide range of edits our approach supports, including but not limited to style transfer, content addition, posture change, and breed change.
Our method achieves higher-resolution image editing without artifacts such as duplicated objects, even in cases where the aspect ratio changes drastically.
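The arbitrary-resolution capability comes from the patch-based fine-tuning mentioned in the abstract. Since the exact scheme is not spelled out on this page, the following is a hypothetical sketch: random patches are cropped from the single training image and paired with normalized coordinates as extra conditioning, so the model can later be queried over a grid covering a larger canvas. The name `sample_patch` and the (top, left, bottom, right) encoding are assumptions for illustration.

```python
import random

# Hypothetical sketch of patch-based fine-tuning: crop one random
# training patch and record where it sits inside the full image.
def sample_patch(image, patch=512):
    """Crop a random (patch x patch) window from a (C, H, W) image.

    Returns the crop plus its normalized (top, left, bottom, right)
    coordinates, which would serve as extra conditioning during
    fine-tuning. Assumes H and W are at least `patch`.
    """
    _, H, W = image.shape
    top = random.randint(0, H - patch)
    left = random.randint(0, W - patch)
    crop = image[:, top:top + patch, left:left + patch]
    coords = (top / H, left / W, (top + patch) / H, (left + patch) / W)
    return crop, coords
```

At sampling time, the same coordinate conditioning can be swept over an arbitrary target canvas, consistent with the H × W settings quoted in the captions below.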
We perform various edits on human face photos, both locally and globally. The models are trained and edited at a resolution of 512 × 512.
We show how our approach can be applied to various tasks in image editing, such as content removal (a), style generation (b), and style transfer (c).
A children’s painting of a castle. The output resolution is set to H = 768 and W = 1024. We use K = 400 and v = 0.7 in this example.
A painting of a castle in the style of Claude Monet. The output resolution is set to H = 768 and W = 1024. We use K = 400 and v = 0.65 in this example.
A photo of a lake with many sailboats. The output resolution is set to H = 768 and W = 1024. We use K = 400 and v = 0.7 in this example.
A desert. The output resolution is set to H = 768 and W = 1024. We use K = 500 and v = 0.8 in this example.
A watercolor painting of a girl. The output resolution is set to H = 1024 and W = 768. We use K = 400 and v = 0.6 in this example.
@article{zhang2022sine,
  title={SINE: SINgle Image Editing with Text-to-Image Diffusion Models},
  author={Zhang, Zhixing and Han, Ligong and Ghosh, Arnab and Metaxas, Dimitris and Ren, Jian},
  journal={arXiv preprint arXiv:2212.04489},
  year={2022}
}