Suggestion: Allow specifying a different model at different zoom levels #17

Open
ttulttul opened this issue Dec 7, 2023 · 0 comments

ttulttul commented Dec 7, 2023

During "shifted crop sampling with dilated sampling," small "foreground" objects can be injected into sharply focused background areas, as you point out in Figure 7 of the paper. You note correctly that, "the priors of current LDMs regarding image crops are solely derived from the general training scheme, which has already resulted in impressive performance. Training a bespoke LDM for a DemoFusion-like framework may be a promising direction to explore."

I'd like to suggest a possible alternative to training a bespoke model, and I wonder whether you (or anyone) have tried it yet.

During shifted crop sampling with dilated sampling, you could apply an IPAdapter to effectively re-condition diffusion on just the background portion of the global image that the sliding window is currently diffusing over. Although the diffusion model may not have been trained on many samples of pure background imagery, applying IPAdapter to each patch may nonetheless guide it toward generating background features rather than foreground features. IPAdapter is very fast: it needs only a single 224x224 patch of pixels, which is passed through CLIPVision and then a small projection network to produce four 1024-wide vectors; those vectors are injected via cross-attention into the U-Net layers to guide diffusion.
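To make that concrete, here is a rough, untested sketch of the per-patch conditioning side. The `ImageProjModel` mirrors the small projection described in the IP-Adapter paper (dimensions here follow the "four 1024-wide vectors" description and would really depend on the base model), `denoise_window` is a placeholder for DemoFusion's shifted-crop denoising step, and the actual injection would go through IP-Adapter's decoupled cross-attention processors rather than anything shown here:

```python
# Sketch only: encode each sliding-window crop of the upsampled global image
# into IP-Adapter-style image-prompt tokens. The projection weights and the
# `denoise_window` call are placeholders, not part of the DemoFusion code.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection


class ImageProjModel(nn.Module):
    """Maps one CLIP image embedding to a handful of prompt tokens,
    following the small projection network from the IP-Adapter paper."""

    def __init__(self, clip_embed_dim=1024, cross_attn_dim=1024, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attn_dim = cross_attn_dim
        self.proj = nn.Linear(clip_embed_dim, cross_attn_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_attn_dim)

    def forward(self, image_embeds):
        tokens = self.proj(image_embeds).reshape(-1, self.num_tokens, self.cross_attn_dim)
        return self.norm(tokens)


# ViT-H image encoder shipped with IP-Adapter; the projection below would need
# the pretrained IP-Adapter weights loaded into it to be meaningful.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder"
)
processor = CLIPImageProcessor()
proj = ImageProjModel()


def background_tokens(global_image: Image.Image, box):
    """Encode the window's own region of the upsampled global image
    (the single 224x224 crop mentioned above) into image-prompt tokens."""
    crop = global_image.crop(box).resize((224, 224))
    pixels = processor(images=crop, return_tensors="pt").pixel_values
    with torch.no_grad():
        embeds = image_encoder(pixels).image_embeds  # (1, 1024)
    return proj(embeds)  # (1, 4, cross_attn_dim)


# Inside the shifted-crop sampling loop (pseudocode for the DemoFusion side):
# for box in sliding_window_boxes:
#     ip_tokens = background_tokens(upsampled_global_image, box)
#     latents[box] = denoise_window(latents[box], text_embeds, ip_tokens)
```

The decoding/encoding cost per window is what motivates the LoRA alternative below.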

An alternative to the IPAdapter approach would be to apply, at each zoom level, a LoRA specifically trained on zoomed samples to steer diffusion toward background-appropriate imagery. Loading a small LoRA per zoom level might be less expensive than applying IPAdapter to every patch, which requires running the latents through the VAE decoder to obtain an image for CLIPVision.
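Roughly, that would look like the sketch below. The zoom-level LoRA checkpoints don't exist yet (they are the proposal), and `run_zoom_stage` is a stand-in for DemoFusion's shifted-crop + dilated sampling pass at one scale; only the diffusers LoRA-loading calls are real API:

```python
# Sketch only: swap in a zoom-level-specific LoRA before each upscaling stage.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# One hypothetical LoRA per zoom level, each trained on crops at that magnification.
zoom_loras = {
    2: "zoom-2x-background.safetensors",
    3: "zoom-3x-background.safetensors",
    4: "zoom-4x-background.safetensors",
}


def run_zoom_stage(pipe, latents, zoom):
    """Placeholder for DemoFusion's shifted-crop + dilated sampling pass
    at one zoom level; the real loop lives in the DemoFusion pipeline."""
    raise NotImplementedError


latents = None
for level, lora_path in zoom_loras.items():
    pipe.unload_lora_weights()         # drop the previous level's adapter
    pipe.load_lora_weights(lora_path)  # standard diffusers LoRA loading
    latents = run_zoom_stage(pipe, latents, zoom=level)
```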
