This extension offers a new Apply-Style node for Redux that lets you control how strongly the conditioning image influences the final outcome. This effectively allows you to change the style or content of an image with a prompt while using Redux.
As many of you might have noticed, the recently released Redux model is primarily a model for generating multiple variants of an image; it does not allow you to change an image based on a prompt.
If you use Redux with an image and add a prompt, your prompt is simply ignored. In general, Redux has no strength slider or similar control for how much the conditioning image should determine the final outcome of your image.
For this purpose I wrote this little custom node that allows you to change the strength of the Redux effect.
I used the following Pexels image as an example conditioning image: https://www.pexels.com/de-de/foto/29455324/
Let's say we want a similar image, but as a comic/cartoon. The prompt I use is "comic, cartoon, vintage comic".
Using Redux on Flux1-dev I obtain the following image.
original Redux setting
As you can see, the prompt is largely ignored. Using the custom node with the "medium" setting I obtain:
Redux medium strength
Let's do the same with anime. The prompt is "anime drawing in anime style. Studio Ghibli, Makoto Shinkai."
As the anime keyword has a strong effect in Flux, we see better prompt adherence at the default setting than with the comic prompt.
original Redux setting
Still, it's far from perfect. With the "medium" setting we get an image that is much closer to anime or Studio Ghibli.
Redux medium strength
You can also mix more than one image together. Here is an example adding a second image: https://www.pexels.com/de-de/foto/komplizierte-bogen-der-mogul-architektur-in-jaipur-29406307/
Mixing both together and using the anime prompt above gives me
Finally, we try a very challenging prompt: "Marble statues, sculptures, stone statues. stone and marble texture. Two sculptures made out of marble stone." As you can see, I repeated the prompt multiple times to increase its strength. But despite the repeats, the default Redux workflow just gives us back the input image, Reduxed: our prompt is totally ignored.
original Redux setting
With "medium" we get back an image that looks more like porcelain than marble, but at least the two women are sculptures now.
Redux medium strength
Further decreasing the Redux strength will finally transform the women into statues, but it will also further decrease their likeness to the conditioning image. In almost all my experiments, it was better to retry multiple seeds with the "medium" setting than to decrease the strength further.
With v2 you can now also add a mask to the conditioning image.
In this example I just masked the flower pattern on the clothing of the right woman. I then prompt for "Man walking in New York, smiling, holding a smart phone in his hand." As you can see, his shirt adapts to the flower pattern, while nothing outside the mask has any impact on the resulting image.
When the masked area is very small, you have to increase the strength of the conditioning image, as "less of the image" is used to condition your generation.
Redux (or rather, CLIP) cannot deal with non-square images by default. It simply center-crops your conditioning image to a square resolution. Now that the node supports masking, we can easily support non-square images, too: we make the image square by adding black borders along the shorter edge. Of course, we do not want these borders in the generated image, so we add a mask that covers the original image but not the black padding border.
You do not have to do this yourself. There is a "keep aspect ratio" option that automatically generates the padding and adjusts the mask for you.
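The sketch below illustrates what this option does under the hood. It is a minimal standalone version assuming ComfyUI's usual (batch, height, width, channels) image layout, not the node's actual implementation:

```python
import torch

def pad_to_square_with_mask(image: torch.Tensor):
    """Pad a non-square image to square with black borders and build a mask
    that covers only the real image content, not the padding."""
    # image: (batch, height, width, channels), as ComfyUI stores images
    _, h, w, _ = image.shape
    side = max(h, w)
    pad_top = (side - h) // 2
    pad_left = (side - w) // 2
    padded = torch.zeros(image.shape[0], side, side, image.shape[3], dtype=image.dtype)
    padded[:, pad_top:pad_top + h, pad_left:pad_left + w, :] = image
    mask = torch.zeros(image.shape[0], side, side)
    mask[:, pad_top:pad_top + h, pad_left:pad_left + w] = 1.0  # keep only the real image
    return padded, mask
```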
Here is an example. The input image (again from Pexels) is this one: https://www.pexels.com/photo/man-wearing-blue-gray-and-black-crew-neck-shirt-1103832/
To make a point, I cropped the image to make it maximally non-square.
With the normal workflow and the prompt "comic, vintage comic, cartoon" we would get this image back:
With the "keep aspect ratio" option enabled, we get this instead:
As with masks, the conditioning effect will be weaker when we use only a small mask (or here: when the aspect ratio is extremely unbalanced). Thus, I would still recommend avoiding images with aspect ratios as extreme as the example image above.
I was told that the images above for some reason do not contain the workflow. So I uploaded the workflow files to the GitHub repository. simple_workflow.json is the workflow with a single setting; advanced_workflow.json has several customization options as well as masking and aspect ratio handling.
This node is a replacement for the ComfyUI StyleModelApply node. It has a single option that controls the influence of the conditioning image on the generation. The example images above were all generated with the "medium" strength option. However, when using masking, you might have to use "strongest" or "strong" instead.
This node allows for more customization. As input it takes the conditioning (prompt), the Redux style model, the CLIP vision model and, optionally(!), a mask. Its parameters are:
The node outputs the conditioning, as well as the cropped and resized image and its mask. You need neither the image nor the mask; they are just for debugging. Play around with the cropping option and use the "Image preview" node to see how it affects the cropped image and mask.
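For orientation, here is a hedged sketch of what such a node's interface looks like in ComfyUI. The class name and parameter names are hypothetical placeholders, not the actual implementation; only the general INPUT_TYPES/RETURN_TYPES/FUNCTION structure follows ComfyUI's real node convention:

```python
class ReduxAdvancedSketch:  # hypothetical name, for illustration only
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "conditioning": ("CONDITIONING",),  # your T5 prompt
                "style_model": ("STYLE_MODEL",),    # the Redux model
                "clip_vision": ("CLIP_VISION",),
                "image": ("IMAGE",),
                # hypothetical strength parameter
                "downsampling_factor": ("INT", {"default": 3, "min": 1, "max": 9}),
            },
            "optional": {
                "mask": ("MASK",),                  # optional conditioning mask
            },
        }

    RETURN_TYPES = ("CONDITIONING", "IMAGE", "MASK")
    FUNCTION = "apply_style"

    def apply_style(self, conditioning, style_model, clip_vision, image,
                    downsampling_factor, mask=None):
        ...  # encode the image, shrink the Redux tokens, append them to the prompt
```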
Redux works in two steps. First, a CLIP vision model crops your input image to a square aspect ratio and reduces it to 384x384 pixels. It then splits this image into a 27x27 grid of small patches, and each patch is projected into CLIP space.
Redux itself is just a very small linear function that projects these CLIP image patches into the T5 latent space. The resulting tokens are then appended to your T5 prompt.
Intuitively, Redux is translating your conditioning input image into "a prompt" that is added at the end of your own prompt.
So why does Redux dominate the final prompt? It's because the user prompt is usually very short (at most 255 or 512 tokens). Redux, in contrast, adds 729 new tokens to your prompt. That can be three times as much as your original prompt. Also, the Redux prompt might contain much more information than a user-written prompt that just contains the word "anime".
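As a rough sketch of this mechanism (the dimensions are illustrative: 1152 for the vision model's patch embeddings, 4096 for the T5 latent space; the real projection weights come from the Redux model):

```python
import torch

t5_tokens = torch.randn(1, 256, 4096)     # user prompt, e.g. 256 T5 tokens
patch_embeds = torch.randn(1, 729, 1152)  # 27x27 image patch embeddings
redux_proj = torch.nn.Linear(1152, 4096)  # Redux: a small linear projection
redux_tokens = redux_proj(patch_embeds)   # 729 "image prompt" tokens
cond = torch.cat([t5_tokens, redux_tokens], dim=1)  # prompt + image tokens
print(cond.shape)  # torch.Size([1, 985, 4096]) -- the image part dominates
```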
So there are two solutions here: Either we shrink the strength of the Redux prompt, or we shorten the Redux prompt.
The next sections are a bit chaotic: I changed the method several times and much of what I tried is already outdated. The best technique I found so far is described under Interpolation methods.
To shrink the Redux prompt and increase the influence of the user prompt, we can use a simple trick: we take the 27x27 image patches and split them into 9x9 blocks, each containing 3x3 patches. We then merge each block of 3x3 tokens into one by averaging their latent embeddings. So instead of a very long prompt with 27x27=729 tokens, we now only have 9x9=81 tokens. Our newly added prompt is now much shorter than the user-provided prompt and therefore has less influence on the image generation.
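A minimal sketch of this downsampling in PyTorch, assuming the 729 tokens are laid out row by row over the 27x27 patch grid (dimensions illustrative):

```python
import torch

redux_tokens = torch.randn(1, 729, 4096)   # (batch, 27*27, dim)
grid = redux_tokens.view(1, 27, 27, 4096)  # restore the 2D patch grid
grid = grid.permute(0, 3, 1, 2)            # (batch, dim, 27, 27) for pooling
pooled = torch.nn.functional.avg_pool2d(grid, kernel_size=3)  # average 3x3 blocks
downsampled = pooled.permute(0, 2, 3, 1).reshape(1, 81, 4096)
print(downsampled.shape)  # torch.Size([1, 81, 4096])
```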
Downsampling is what happens when you use the "medium" setting. Of all three techniques I tried to decrease the Redux effect, downsampling worked best. ~~However, there are no further customization options. You can only downsample to 81 tokens (downsampling more is too much)~~.
Instead of averaging over small blocks of tokens, we can use an interpolation function to shrink our 27x27 patch grid to an arbitrary size. There are different modes available, which most of you probably know from image resizing (it's the same procedure). The averaging method above is "area", but other modes such as "bicubic" are available as well.
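The same sketch with torch.nn.functional.interpolate, which exposes these resizing modes directly:

```python
import torch
import torch.nn.functional as F

redux_tokens = torch.randn(1, 729, 4096)
grid = redux_tokens.view(1, 27, 27, 4096).permute(0, 3, 1, 2)  # (1, dim, 27, 27)
# mode="area" reproduces the block averaging above; "bicubic" is smoother
resized = F.interpolate(grid, size=(9, 9), mode="bicubic")
shrunk = resized.permute(0, 2, 3, 1).reshape(1, -1, 4096)      # (1, 81, dim)
```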
The idea here is to shrink the Redux prompt length by merging similar tokens together. Just consider that large parts of your input image contain more or less the same content anyway, so why always keep all 729 tokens? My implementation here is extremely simple and stupid and not very efficient, but anyway: I just go over all Redux tokens and merge two tokens if their cosine similarity is above a user-defined threshold.
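A naive sketch of this merging, written as a greedy running-average merge (the node's actual implementation may differ in details):

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """Merge each token into the most similar already-kept token if their
    cosine similarity exceeds the threshold; otherwise keep it as a new token."""
    kept = [tokens[0]]
    counts = [1]
    for tok in tokens[1:]:
        sims = torch.stack([F.cosine_similarity(tok, k, dim=0) for k in kept])
        best = int(sims.argmax())
        if sims[best] > threshold:
            counts[best] += 1
            # running average keeps the merged token at the mean of its members
            kept[best] = kept[best] + (tok - kept[best]) / counts[best]
        else:
            kept.append(tok)
            counts.append(1)
    return torch.stack(kept)

redux_tokens = torch.randn(729, 4096)
merged = merge_similar_tokens(redux_tokens, threshold=0.8)
```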
Even a threshold of 0.9 already removes half of the Redux tokens. A threshold of 0.8 often reduces the Redux tokens so much that they are of similar length to the user prompt.
I would start with a threshold of 0.8. If the image is blurry, increase the value a bit. If your prompt has no effect, decrease the threshold slightly.
We can also just multiply the tokens by a certain strength value. The lower the strength, the closer the values are to zero. This is similar to prompt weighting, which was quite popular for earlier Stable Diffusion versions but never really worked that well for T5. Nevertheless, this technique seems to work well enough for Flux.
If you use downscaling, you have to use a very low weight. You can start directly with 0.3 and go down to 0.1 if you want a stronger effect. High weights like 0.6 usually have no impact.
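The technique itself is a one-liner; a sketch (dimensions illustrative):

```python
import torch

redux_tokens = torch.randn(1, 729, 4096)
weight = 0.3                       # start around 0.3; go down toward 0.1 for more prompt influence
weighted = redux_tokens * weight   # scale the Redux tokens toward zero before appending
```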
My current feeling is that downsampling works best by far. So I would first try downsampling with a 1:3 ratio and only use the other options if the effect is too weak or too strong.