Text-driven video editing with generative diffusion models has garnered significant attention due to its wide range of potential applications. However, existing approaches are constrained by the limited word embeddings available from pre-training, which hinders nuanced editing of open concepts with specific attributes. Directly altering the keywords in target prompts often causes unintended disruptions to the attention mechanisms. To enable more flexible editing, this work proposes an improved concept-augmented video editing approach that generates diverse and stable target videos by devising abstract conceptual pairs. Specifically, the framework combines concept-augmented textual inversion with a dual prior supervision strategy. The former provides plug-and-play guidance for Stable Diffusion in video editing, effectively capturing target attributes for more stylized results; the latter significantly enhances video stability and fidelity. Comprehensive evaluations demonstrate that our approach generates more stable and lifelike videos, outperforming state-of-the-art methods.
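For background, the textual-inversion mechanism that the concept-augmented variant builds on can be sketched as follows. This is a minimal illustration using the Hugging Face transformers API, not our full method: it omits the concept augmentation and the dual prior supervision, and the placeholder token, initializer word, and learning rate are illustrative assumptions.

```python
# Minimal sketch of standard textual inversion (background only): a new concept
# token is added to the text encoder's vocabulary and only its embedding row is
# optimized, so the learned concept can be dropped into prompts for a frozen
# Stable Diffusion model. Token name, initializer word, and learning rate are
# illustrative assumptions, not values from the paper.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a placeholder token for the new concept.
placeholder = "<concept>"
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
concept_id = tokenizer.convert_tokens_to_ids(placeholder)

# Initialize the new row from a coarse class word (hypothetical choice).
embeddings = text_encoder.get_input_embeddings().weight
init_id = tokenizer.encode("cat", add_special_tokens=False)[0]
with torch.no_grad():
    embeddings[concept_id] = embeddings[init_id].clone()

# Freeze the text encoder; re-enable gradients only on the embedding table.
text_encoder.requires_grad_(False)
embeddings.requires_grad_(True)
optimizer = torch.optim.AdamW([embeddings], lr=5e-4)

# During training, the usual noise-prediction loss is backpropagated into
# `embeddings`, and all rows except `concept_id` are restored to their original
# values after each optimizer step, so only the concept embedding is learned.
```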
Our work focuses on diffusion-based video object editing, which we categorize into two scenarios: editing without concept videos and editing with concept videos as guidance. Our method maintains consistency in non-target regions between the original and edited videos.
Ours (far right) vs. Tune-A-Video, FateZero, MotionDirector, and RAVE
We conducted additional experiments with different temporal parameters. While our original experiments used a frame sampling stride of 8 and a sequence length of 6 frames (I), we also tested our approach with a reduced stride of 3 frames and an extended sequence length of 14 frames (II). The comparative analysis reveals the following:
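For concreteness, the two sampling configurations above can be reproduced with a simple stride-based sampler. The helper below and the 120-frame clip length are illustrative assumptions, not the released training code.

```python
# Hypothetical helper illustrating the two temporal settings (I) and (II);
# the released training code may sample frames differently.
def sample_frame_indices(num_frames: int, stride: int, seq_len: int, start: int = 0) -> list[int]:
    """Return `seq_len` frame indices spaced `stride` frames apart, starting at `start`."""
    indices = [start + i * stride for i in range(seq_len)]
    if indices[-1] >= num_frames:
        raise ValueError("source clip is too short for this stride / sequence length")
    return indices

# Setting (I): stride 8, 6 frames -> covers a 41-frame window of the source clip.
print(sample_frame_indices(num_frames=120, stride=8, seq_len=6))   # [0, 8, 16, 24, 32, 40]
# Setting (II): stride 3, 14 frames -> a comparable ~40-frame window at finer temporal resolution.
print(sample_frame_indices(num_frames=120, stride=3, seq_len=14))  # [0, 3, 6, ..., 39]
```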
We have released an extended set of ablation experiments to analyze the contribution of each component in our framework. The experiments include: (I) our complete method, (II) tuning without the concept video, (III) textual inversion without concept augmentation, (IV) w/ SCAM & w/o TCAM, (V) w/o SCAM & w/ TCAM, and (VI) w/o DPS.