ControlVideo: Training-free Controllable Text-to-Video Generation
Hugging Face Space: https://huggingface.co/spaces/fffiloni/ControlVideo
Hugging Face Paper: https://huggingface.co/papers/2305.13077
Code (official implementation): https://github.com/YBYBZhang/ControlVideo
ControlVideo, adapted from ControlNet, leverages coarse structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti.
Let us know what you think!
ControlNet for Video is here, and it’s not anything like the Automatic1111 plugins
that claim to use ControlNet to give you video.
Those use frame interpolation and ControlNet looking at individual frames, and this model
takes a whole new approach to this and has absolutely blown my mind.
So we’re going to talk about the paper, we’re going to show the hugging face space where
you can play with this right away, and yeah, let’s get into it.
So the source for this is Sylvain Filoni on Twitter; follow him for more updates.
He’s the one who produced this implementation.
And to be clear, this is an implementation of a paper that was published in May.
Yeah, and Sylvain was not the author or co-author of this paper.
So this is his interpretation.
It’s working, and it’s very cool to see another incredible adaptation of this technology.
We’ve seen a lot of it with text-to-video in the last week or so, and we’re moving into
even more complex forms of video going forward.
So the paper is called ControlVideo: Training-free Controllable Text-to-Video Generation.
So it’s not a huge surprise that this model is being referred to casually as ControlVideo.
And what I like here is that AK has actually started summarizing abstracts from academic
papers with an internally tuned GPT.
So we’re going to go over that.
The abstract here is written in pretty broken English.
So I’m going to pick a few things from here that I think are interesting.
Basically, what this paper is doing is saying: we want to create a new kind of text-to-video
ControlNet implementation that isn’t as hamstrung by the amount of video training
you would have needed previously, or by how difficult temporal inference is.
ControlVideo, adapted from ControlNet — so it is based on ControlNet — leverages coarse
structural consistency from input motion sequences,
and relies less on the kind of video-shape mapping you’d see in text-to-video tools like RunwayML,
Pika, or Zeroscope.
And they do a good job at movement, but curiously, text-to-video models, just because of the
way they’re made, will at times struggle with motion.
And Pika got really close to understanding this motion.
But ControlNet takes a different approach and uses motion, or pose, as one of its core inputs.
Firstly, to ensure appearance coherence between frames, which previously was done
with very rough inference.
You can also use this to smooth video from a lot of other text-to-video tools.
So they say ControlVideo adds fully cross-frame interaction in its self-attention modules.
So they’re doing something pretty similar to what those Automatic1111 plugins did,
which was just applying Stable Diffusion or ControlNet to every frame and then stitching
the frames back together.
Actually, Zeroscope pretty much works this way.
However, ControlVideo adds way more context to the calculation when you’re going
from frame to frame: how much something should change, over what distance
something should change, how focus might change, those kinds of things.
That’s what happens in these self-attention modules.
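To make that distinction concrete, here’s a toy NumPy sketch — my own illustration, not the paper’s code — of the difference between per-frame self-attention (each frame only attends to its own tokens, so nothing ties the frames together) and fully cross-frame attention (every token can attend to tokens from every frame):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def per_frame_attention(q, k, v):
    # Naive per-frame approach: each frame attends only to itself,
    # so appearance can drift between frames.
    out = np.empty_like(v)
    for f in range(q.shape[0]):
        scores = q[f] @ k[f].T / np.sqrt(q.shape[-1])
        out[f] = softmax(scores) @ v[f]
    return out

def cross_frame_attention(q, k, v):
    # Fully cross-frame interaction: flatten keys/values across all
    # frames so every token sees every frame's content.
    frames, tokens, dim = q.shape
    k_all = k.reshape(frames * tokens, dim)
    v_all = v.reshape(frames * tokens, dim)
    scores = q.reshape(frames * tokens, dim) @ k_all.T / np.sqrt(dim)
    return (softmax(scores) @ v_all).reshape(frames, tokens, dim)

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8, 16))  # 4 frames, 8 tokens, dim 16
print(per_frame_attention(q, k, v).shape)    # same shape either way
print(cross_frame_attention(q, k, v).shape)
```

The shapes are identical either way; the difference is that in the cross-frame version, each frame’s output is a mixture over all frames, which is what keeps appearance coherent.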
Secondly, to mitigate the flicker effect, which again, I’ve done some content on how
to reduce this with some existing AI tools.
To mitigate the flicker effect, it introduces an interleaved frame smoother that employs
frame interpolation on alternated frames.
This basically means that it can look as far as five frames ahead, then randomizes
that so you get a nice distribution that in the end looks like smooth video.
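Here’s a toy sketch of the “interpolation on alternated frames” idea — my own simplification, not the authors’ exact smoother: on one step the even-indexed interior frames are replaced by an average of their neighbours, on the next step the odd-indexed ones are, so flicker gets damped without blurring every frame at once.

```python
import numpy as np

def interleaved_smooth(frames, step):
    # Toy interleaved-frame smoother (a simplification, not the
    # paper's exact operator): on a given step, replace alternating
    # interior frames with the average of their two neighbours.
    # Even- and odd-indexed frames are smoothed on alternating steps.
    out = frames.copy()
    for i in range(1, len(frames) - 1):
        if i % 2 == step % 2:
            out[i] = 0.5 * (frames[i - 1] + frames[i + 1])
    return out

# A "flickering" 1-D stand-in for per-frame brightness:
flicker = np.array([0.0, 10.0, 0.0, 10.0, 0.0, 10.0])
smoothed = interleaved_smooth(flicker, step=0)
print(smoothed)  # even interior frames pulled toward their neighbours
```

In the real model, `frames` would be latent video frames inside the denoising loop, and the interpolation would be a proper frame interpolator rather than a plain average.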
Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately
synthesizes each short clip with holistic coherency.
It’ll be interesting to see how they define that.
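My reading of the hierarchical sampler, sketched with hypothetical helper names — just the clip-planning logic, not the actual sampling code: key frames sit at clip boundaries and adjacent clips share a key frame, so if the key frames are generated jointly first, each short clip can then be filled in separately while staying coherent with the whole video.

```python
def plan_hierarchical_clips(num_frames, clip_len):
    # Hypothetical planner for hierarchical sampling of a long video:
    # pick key frames at clip boundaries, then form short clips whose
    # first and last frames are those shared key frames. The key frames
    # would be synthesized jointly first (global coherence), and each
    # clip filled in afterwards, conditioned on its two key frames.
    key_frames = list(range(0, num_frames, clip_len - 1))
    if key_frames[-1] != num_frames - 1:
        key_frames.append(num_frames - 1)
    clips = [list(range(a, b + 1)) for a, b in zip(key_frames, key_frames[1:])]
    return key_frames, clips

keys, clips = plan_hierarchical_clips(num_frames=13, clip_len=5)
print(keys)   # [0, 4, 8, 12]
print(clips)  # three 5-frame clips, each sharing a key frame with its neighbour
```

The point of the shared boundary frames is that no clip is synthesized in isolation, which is presumably how the paper gets “holistic coherency” without holding the whole long video in memory at once.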