Directly upscaling inside the latent space. Some models are for SD1.5 and some are for SDXL; all models are trained for drawn content. I might add new architectures or update models at some point. I recommend the SwinFIR or DRCT models. This took heavy inspiration from city96/SD-Latent-Upscaler and Ttl/ComfyUi_NNLatentUpscale.
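To make the idea concrete, here is a minimal sketch of what latent-space upscaling means in terms of tensor shapes, using a toy stand-in network (the released checkpoints are real DAT/SwinFIR/DRCT architectures; this placeholder is illustrative only, not the repo's code):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real upscaler networks (illustrative only).
model = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 4 * 4, 3, padding=1),
    nn.PixelShuffle(2),  # 2x spatial upscale, back to 4 latent channels
)
model.eval()

latent = torch.randn(1, 4, 64, 64)  # SD1.5 latent for a 512x512 image
with torch.inference_mode():
    upscaled = model(latent)
print(upscaled.shape)  # torch.Size([1, 4, 128, 128]), still in latent space
```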
1.5 comparison:
SDXL comparison:
First row: for RGB models, the upscaled RGB image before it is passed to VAE encode; for latent models, the VAE-decoded image. Second row: the final output after the second KSampler.
I tried to take promising networks from existing papers and apply more exotic loss functions (see the loss sketch after the model list below).
- DAT12x6_l1_eV2-b0_contextual_315k_1.5 / DAT6x6_l1_eV2-b0_265k_1.5
- SwinFIR4x6_fft_l1_94k_sdxl / SwinFIR4x6_mse_64k_sdxl
- DRCT-l_12x6_325k_l1_sdxl / DRCTFIR-l_12x6_215k_l1_sdxl
- DRCT-l_12x6_160k_l1_vaeDecode_l1_hfen_sdxl
- DRCT-l_12x6_170k_l1_vaeDecode_l1_fft_sdxl
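As an example of the loss tags in the names above, here is a sketch of what an `fft` term can look like: an L1 distance between the 2D Fourier spectra of prediction and target. This is my reading of the tag, not the exact training code:

```python
import torch

def fft_l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # rfft2 keeps only the non-redundant half of the 2D spectrum
    pred_f = torch.fft.rfft2(pred, norm="ortho")
    target_f = torch.fft.rfft2(target, norm="ortho")
    # complex difference penalizes both magnitude and phase errors
    return (pred_f - target_f).abs().mean()

pred = torch.randn(1, 4, 128, 128, requires_grad=True)
target = torch.randn(1, 4, 128, 128)
loss = fft_l1_loss(pred, target) + torch.nn.functional.l1_loss(pred, target)
loss.backward()
```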
Ideas I might test in the future:
Things I noticed that hurt training stability:
- SSIM loss. Setting `nonnegative_ssim=True` does not seem to help either; avoid SSIM to retain stability.
- Setting `vae.config.scaling_factor = 0.13025`. Do not set a scaling factor; nnlatent used it and city96 didn't, and I do not recommend using it.
- Encoding images in the 0 to 1 range. The image tensor is supposed to be in the -1 to 1 range prior to encoding with the VAE.
- Not using `torch.inference_mode()` while creating the dataset.

A combination of these can make training a lot less stable. Even if the loss goes down during training and seemingly converges, the final model won't be able to generate properly. Here is a correct example:

```python
import cv2
import torch
from diffusers import AutoencoderKL

device = "cuda"

vae = AutoencoderKL.from_single_file("vae.pt").to(device)
vae.eval()

with torch.inference_mode():
    # f is the path to an image file from the dataset
    image = cv2.imread(f)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # HWC uint8 -> NCHW float in 0..1
    image = (
        torch.from_numpy(image.transpose(2, 0, 1))
        .float()
        .unsqueeze(0)
        .to(device)
        / 255.0
    )
    # shift to -1..1 before encoding, then sample the latent distribution
    image = vae.encode(image * 2.0 - 1.0).latent_dist.sample()
```
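As a follow-up (my suggestion, continuing the snippet above), you can decode the latent back to pixel space to verify the round trip; since no scaling factor was applied during encoding, none is needed for decoding:

```python
with torch.inference_mode():
    decoded = vae.decode(image).sample  # DecoderOutput.sample, in -1..1
    decoded = ((decoded + 1.0) / 2.0).clamp(0.0, 1.0)  # undo the *2-1 shift
    out = (decoded[0].permute(1, 2, 0).cpu().numpy() * 255.0).astype("uint8")
    cv2.imwrite("roundtrip.png", cv2.cvtColor(out, cv2.COLOR_RGB2BGR))
```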
DITN and OmniSR produced liquid-looking outputs at their official sizes. I do not recommend using small or efficient networks.
HAT looked promising, but seemingly always had some kind of blur effect. I haven't managed to get a proper model yet.