
Refactor: Modularize Inpaint Pipeline #6120

Draft
wants to merge 6 commits into main
Conversation

dunkeroni
Contributor

Summary

Keeping this as Draft until tested with more models and settings.

This is a rewrite of the inpaint mask code in the diffusers pipeline. It centralizes all the mask logic to a single object that gets called from multiple points. The calls are based on location in the process as a proof of concept for a more elaborate modular extension system of the denoise process. Baby steps first.

Our previous process had mask-based decisions and code in five locations, across multiple object levels, and the mask was inverted partway through. Dragging it all together was necessary for debugging.
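To make the intent concrete, here is a rough sketch of the shape of that object (class and method names are illustrative, not the actual ones in this PR): a single extension-style object owns the mask state and is invoked at known points in the denoise loop, instead of mask logic being scattered across the pipeline.

```python
import torch


class InpaintMaskExt:
    """Owns the denoise mask and is called at fixed points in the denoise loop (sketch only)."""

    def __init__(self, mask: torch.Tensor, masked_latents: torch.Tensor | None = None):
        # Convention used in this sketch: mask == 1 where the original image is kept,
        # mask == 0 where new content is generated.
        self.mask = mask
        self.masked_latents = masked_latents

    def extend_unet_input(self, latents: torch.Tensor) -> torch.Tensor:
        # Inpaint-trained UNets expect extra input channels: the mask and the masked latents.
        if self.masked_latents is not None:
            return torch.cat([latents, self.mask, self.masked_latents], dim=1)
        return latents

    def apply_mask(self, latents: torch.Tensor, noised_init_latents: torch.Tensor) -> torch.Tensor:
        # Called after each denoising step: re-impose the preserved region so it
        # cannot drift away from the (appropriately noised) original latents.
        return noised_init_latents * self.mask + latents * (1.0 - self.mask)
```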

Related Issues / Discussions

This (mostly) fixes the issues with inpaint models that were introduced with the gradient mask changes. If an inpaint model is used and the provided denoise mask does not include a masked_latents field (normally generated by providing the original image and VAE to the Create Denoise Mask node), the pipeline now synthesizes one by filling the masked area with predetermined values based on the model architecture. Doing so avoids a second VAE encode in all inpainting/canvas processes.
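For illustration, a minimal sketch of that fallback, using a hypothetical synthesize_masked_latents helper and placeholder fill values (the tuned per-architecture values live in the PR itself): the region to be regenerated is overwritten with a constant latent fill, so an inpaint-trained UNet still receives the extra channels it expects without a second VAE encode.

```python
import torch

# Placeholder per-architecture fill values; the actual tuned values are in the PR.
SYNTHETIC_FILL = {
    "sd-1": torch.tensor([0.0, 0.0, 0.0, 0.0]),
    "sdxl": torch.tensor([0.0, 0.0, 0.0, 0.0]),
}


def synthesize_masked_latents(latents: torch.Tensor, mask: torch.Tensor, arch: str) -> torch.Tensor:
    """Replace the region to be regenerated with a constant per-architecture fill (sketch)."""
    # mask == 1 where the original content is kept, 0 where content is regenerated.
    fill = SYNTHETIC_FILL[arch].to(latents).view(1, -1, 1, 1)
    return latents * mask + fill * (1.0 - mask)
```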

This is not one-to-one with the standard setup. The new method has been tuned to reduce artifacts as much as possible, but there can still be some unusual highlighting and occasionally edges of inpainted objects get chopped off.
[image: inpainting comparison, VAE-encoded masked_latents vs. synthetic fill]
The dog on the left was inpainted with a VAE-encoded masked_latents, and the dog on the right was from a synthetic fill. The mask edge of the synthetic dog has lost part of the hindquarters and added an errant sparkle between the bench slats.

Oddly, using pure synthetic fill produced a significantly higher success rate for inpainting objects from the prompt, nearing 100% before adjusting for artifacts. With the fixes in place to avoid edge artifacts, the success rate for synthetic fill is much closer to VAE, but still slightly higher. At the current settings, VAE will ignore the prompt and just fill in the bench a bit more often than synthetic fill does.
[image]

Code is included for synthetic fill of SD-2 models, but it does not work at this time. Something is different about the way SD-2 expects masked_latents to look.

QA Instructions

Since this affects all of the inpainting pipelines, we need to test a variety of settings on all model architectures to make sure nothing strange or unexpected is happening.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

github-actions bot added the python (PRs that change python files), invocations (PRs that change invocations), and backend (PRs that change backend files) labels on Apr 3, 2024
@psychedelicious
Collaborator

IMO an extra VAE encode and some additional support code for inpainting is an acceptable tradeoff for no possibility of artifacts.

I'm imagining a scenario where you inpaint and it's almost perfect, but there are some artifacts. So you inpaint the artifacts, but this then creates more artifacts, and so on. We were OK with a whole extra denoising step before - reducing that overhead to a single VAE encode seems very reasonable.

@dunkeroni
Contributor Author

IMO an extra VAE encode and some additional support code for inpainting is an acceptable tradeoff for no possibility of artifacts.

I'm imagining a scenario where you inpaint and it's almost perfect, but there are some artifacts. So you inpaint the artifacts, but this then creates more artifacts, and so on. We were OK with a whole extra denoising step before - reducing that overhead to a single VAE encode seems very reasonable.

I was hoping it could get to a perfect result, but I think it will have to just live as a "compatibility fallback" procedure if no encoded image is provided.

The issue is how to avoid the extra encode when we definitely don't need it. It is only necessary when a user is inpainting on canvas with an inpaint-trained model. I would like to avoid what we had before, where every canvas inpaint created the masked image and sent it through an extra VAE encode, and then the overwhelming majority of the time the result went unused because normal models don't need it. We have a few options for that:

  1. Find a way for Canvas to adjust the graph based on inpaint models (preferably not, since if it is possible it's more complexity than it's worth)
  2. Package image and VAE info alongside the Denoise Mask data into the Denoise Latents invocation, and then encode if necessary once everything is together in the pipeline (loses the benefits of caching when you use the same mask multiple times)
  3. Feed image, UNet, and VAE info into the Mask nodes so that they can decide internally whether masked_latents is required (a few added connections in Canvas graphs, plus some minor confusion for most users as to what is going on in the mask node; a rough sketch follows below)
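A rough sketch of what option 3 could look like, with invented field names rather than the actual invocation signatures, and vae.encode standing in for the project's image-to-latents path: the mask node checks whether the UNet is inpaint-trained (9 input channels) and only pays for the extra VAE encode in that case.

```python
def create_denoise_mask(mask, image=None, unet_config=None, vae=None):
    # mask == 1 marks the region to regenerate in this sketch.
    masked_latents = None
    is_inpaint_model = unet_config is not None and unet_config.get("in_channels") == 9
    if is_inpaint_model and image is not None and vae is not None:
        masked_image = image * (1.0 - mask)        # black out the region to regenerate
        masked_latents = vae.encode(masked_image)  # only pay the encode cost when needed
    return {"mask": mask, "masked_latents": masked_latents}
```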
