sieve improvements #214

Open
cparello opened this issue Feb 28, 2024 · 4 comments
@cparello

https://www.sievedata.com/blog/fast-high-quality-ai-lipsyncing

Can the improvements made by sieve be done here?

@savleharshad

They did some optimizations. How can we get an idea of which optimizations we could apply to improve the model here?

@cparello
Author

They explain it all in the doc, and the models can be updated from the 512 to the 1024 or the 2048. That part I already did, but I haven't had the chance to attempt the Sieve code improvements.

@jryebread
Copy link

Where do you see that they explained what changes they made? It says they would open source the results, but I don't see anything :(

@cparello
Author

Our Improvements
To improve this, we’ve introduced a series of optimizations on the original repository that greatly improve speed and performance.

The first optimization is smartly cropping around the face of the target speaker to avoid unnecessarily processing most of the video. Along with the ML models, there are a lot of computer vision operations like warping, inverse transforms, etc. in Video Retalking that are expensive to perform on the entire frame. We quickly identify the target speaker using batched RetinaFace, a very lightweight face detector. In many scenarios there are multiple faces, or even multiple predictions of the same face, so we have to isolate the largest face. For now, we treat that as the target speaker. Then, we crop the video around the union of all detections of the face. This allows us to process a much smaller subsection of the video, which speeds up inference by up to 4x, especially on videos where the face is small and doesn’t move much. In addition, establishing the target speaker crop allows us to enhance only that part of the video, rather than potentially generating artifacts around other sections of the frame.
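
As a rough illustration of the cropping idea (not Sieve's actual code), here is a minimal sketch. `detect_faces` is a hypothetical helper standing in for batched RetinaFace, returning `(x1, y1, x2, y2)` boxes:

```python
import numpy as np

def union_crop_box(frames, detect_faces, pad=0.1):
    """Find the largest face per frame, then crop to the union of all detections.

    `detect_faces(frame)` is a hypothetical detector returning a list of
    (x1, y1, x2, y2) boxes; the real pipeline uses batched RetinaFace.
    """
    boxes = []
    for frame in frames:
        faces = detect_faces(frame)
        if not faces:
            continue  # no face in this frame; skip it when estimating the box
        # treat the largest detection as the target speaker
        x1, y1, x2, y2 = max(faces, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
        boxes.append((x1, y1, x2, y2))
    if not boxes:
        return None  # no faces anywhere; fall back to full-frame processing
    boxes = np.array(boxes)
    # union of all per-frame boxes, padded slightly so warps have context
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    h, w = frames[0].shape[:2]
    px, py = int((x2 - x1) * pad), int((y2 - y1) * pad)
    return (max(0, int(x1) - px), max(0, int(y1) - py),
            min(w, int(x2) + px), min(h, int(y2) + py))

# All downstream ML and CV ops then run on frame[y1:y2, x1:x2]
# instead of the full frame.
```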

Second, we added batching to the stabilization step, making this step much faster when combined with the cropping above. We also removed enhancement of the stabilized video, as we found that its inclusion did not affect quality after we performed the cropping above.
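
A minimal sketch of what the batching change could look like, assuming a PyTorch-style stabilization network (`model` here is a placeholder, not the repo's actual module):

```python
import torch

def stabilize_batched(frames, model, batch_size=16, device="cuda"):
    """Run the stabilization network over frames in batches instead of one by one.

    `frames` is a list of CHW tensors; `model` stands in for the stabilization
    network, whose exact inputs live in the Video Retalking repository.
    """
    out = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = torch.stack(frames[i:i + batch_size]).to(device)
            out.append(model(batch).cpu())
    return torch.cat(out, dim=0)
```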

When detecting facial landmarks, the original repository reinitialized the keypoint extractor multiple times and, because of input resizing, recomputed the same landmarks during multiple steps of the process. We initialize the keypoint extractor once and allow previously calculated landmarks to be resized and reused during facial alignment. On low-resolution inputs where the face is really small, we bypass parts of the alignment that actually made the output look worse, as the feature detection was much less accurate.
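
Sketching the caching idea under the assumption of an `extractor.get_landmarks(frame)` interface (hypothetical; the repo's extractor API may differ):

```python
class LandmarkCache:
    """Initialize the keypoint extractor once and reuse rescaled landmarks.

    `extractor.get_landmarks(frame)` is a stand-in for the repo's keypoint
    extractor and is assumed to return an (N, 2) array in pixel coordinates.
    """
    def __init__(self, extractor):
        self.extractor = extractor  # created once, not once per step
        self.cache = {}

    def landmarks(self, frame_idx, frame, scale=1.0):
        if frame_idx not in self.cache:
            self.cache[frame_idx] = self.extractor.get_landmarks(frame)
        # rescale previously computed landmarks instead of re-detecting
        # after an input resize
        return self.cache[frame_idx] * scale
```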

Finally, we made the code more robust to edge cases where no faces are detected (by ignoring these frames), more than one face is detected (by selecting the largest face), or there is a lot of movement from the speaker (by being smart about cropping).
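
A rough sketch of how that per-frame dispatch might look; the box format and the movement heuristic are illustrative, not the repo's exact logic:

```python
def select_target_face(detections, prev_box=None, move_thresh=0.25):
    """Pick the face to process for one frame, covering the edge cases above.

    `detections` is a list of (x1, y1, x2, y2) boxes for one frame.
    """
    if not detections:
        return None  # no face detected: pass this frame through unchanged
    # multiple faces, or duplicate detections of one face: keep the largest
    box = max(detections, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    if prev_box is not None:
        w = max(1, prev_box[2] - prev_box[0])
        # a big jump between frames means lots of movement: grow the crop to
        # the union of the two boxes so the face stays inside the working region
        if abs(box[0] - prev_box[0]) / w > move_thresh:
            box = (min(box[0], prev_box[0]), min(box[1], prev_box[1]),
                   max(box[2], prev_box[2]), max(box[3], prev_box[3]))
    return box
```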

In addition, we’ve optimized CPU and GPU memory usage throughout the code so that it can fit on an L4 GPU with 8 vCPUs and 32 GB RAM, making it very cost-effective. We also added a low-resolution and low-FPS option to allow for up to an additional 4x speedup in scenarios where speed matters more than quality.
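
A sketch of a low-resolution / low-FPS loader using OpenCV; the 480p and 12.5 fps values are placeholders, since the actual settings aren't stated:

```python
import cv2

def load_frames(path, max_height=480, target_fps=12.5):
    """Read a video at reduced resolution and frame rate for the fast path."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(src_fps / target_fps))  # keep every Nth frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h, w = frame.shape[:2]
            if h > max_height:  # downscale to cap memory and GPU load
                scale = max_height / h
                frame = cv2.resize(frame, (int(w * scale), max_height))
            frames.append(frame)
        idx += 1
    cap.release()
    return frames, src_fps / step  # frames plus the effective output fps
```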
