Latent Diffusion Models (High-Resolution Image Synthesis with Latent Diffusion Models)

The goal is to generate high-resolution images.

The noise predictor is trained in the latent space of an autoencoder rather than in pixel space.
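
A minimal sketch of one training step in latent space, to make the idea concrete. Everything named here is an assumption for illustration: toy_encode is an average-pooling stand-in for a trained autoencoder's encoder, toy_noise_predictor is a placeholder for the real network, and the linear beta schedule with 1000 steps is just one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encode(image, factor=8):
    """Hypothetical stand-in for a trained encoder: average-pool H and W by `factor`."""
    c, h, w = image.shape
    return image.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def add_noise(z0, alpha_bar_t, eps):
    """DDPM forward process, applied to the latent instead of the pixels."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_noise_predictor(z_t, t):
    """Placeholder for the noise-prediction network; always predicts zero noise."""
    return np.zeros_like(z_t)

image = rng.standard_normal((3, 256, 256))   # fake image, channels first
z0 = toy_encode(image)                       # latent of shape (3, 32, 32)

betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

t = 500
eps = rng.standard_normal(z0.shape)          # the noise the network should recover
z_t = add_noise(z0, alpha_bar[t], eps)

# Training objective: mean squared error between predicted and true noise.
loss = np.mean((toy_noise_predictor(z_t, t) - eps) ** 2)
print(z0.shape, z_t.shape)
```

The point of the sketch is the shapes: the predictor only ever sees the 32x32 latent, not the 256x256 image, which is where the compute saving comes from.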

Earlier models used a U-Net with an attention module at each layer as the noise predictor.

DiT: Diffusion Models with Transformers

The diffusion process in DDPM does not have to use a U-Net; DiT replaces the U-Net with a ViT.

DiT is based on the Vision Transformer (ViT) architecture, which operates on sequences of patches.
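
A rough sketch of how a spatial input becomes a sequence of patches. The function name patchify, the (C, H, W) layout, the 4x32x32 input size, and the patch size of 2 are assumptions for illustration; a real model would also apply a learned linear projection and add positional embeddings to each patch.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) array into a (num_patches, patch_size * patch_size * C) sequence."""
    c, h, w = latent.shape
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Separate each spatial axis into (grid index, offset inside the patch).
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    # Reorder to (grid row, grid col, patch row, patch col, channel).
    x = x.transpose(1, 3, 2, 4, 0)
    # One row per patch, scanned row by row over the grid.
    return x.reshape(gh * gw, patch_size * patch_size * c)

latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)  # (256, 16)
```

A smaller patch size gives a longer sequence: here a 32x32 grid with 2x2 patches yields 16 * 16 = 256 tokens, each holding 2 * 2 * 4 = 16 values.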