Aarav Patel
In Project 5, we get an introduction to diffusion models and some of their interesting applications. We begin in Project 5A by leveraging the pre-trained DeepFloyd model to explore sampling, denoising, classifier-free guidance, image-to-image translation, inpainting, visual anagrams, and more. In Project 5B, we implement and train our own diffusion model on the MNIST dataset.
During the setup, we downloaded the DeepFloyd model and gave it sample text prompts to generate images. I experimented with changing num_inference_steps, trying 10 steps and 100 steps; the 100-step trials clearly produce more detailed, sharper images. I seeded my project with 6174.
In this step, I wrote the forward-process function, which adds noise to a clean image according to a given noise level (a function parameter).
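As a sketch, the forward process can be written directly from its equation; `alphas_cumprod` is my placeholder name for the precomputed cumulative alpha schedule:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process sketch: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps,
    with eps ~ N(0, I). `alphas_cumprod` holds the cumulative alpha schedule."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```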
In this part, I implemented classical denoising by convolving the noisy input with a 2D Gaussian, with the aim of removing high-frequency noise from the image.
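A short sketch of this baseline using torchvision (the kernel size and sigma here are illustrative, not necessarily the values I used):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Convolve with a 2D Gaussian to suppress high-frequency noise.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```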
In this part, I used the DeepFloyd diffusion model to denoise images generated through the forward process. Because the model was trained with text conditioning, we used "a high quality photo" as the prompt embedding. The stage-1 UNet is the actual denoiser; given its noise estimate, we solved the forward-process equation for the clean image. This method significantly outperformed classical denoising.
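Given the UNet's noise estimate, solving the forward-process equation for the clean image looks like this (a sketch; `eps` stands for the stage-1 UNet's predicted noise):

```python
def one_step_denoise(x_t, t, eps, alphas_cumprod):
    """Invert x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps for x_0."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
```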
The goal of this section was to iteratively denoise an image back to a clean state. Denoising one timestep at a time would quickly become computationally expensive, so instead we used strided timesteps to "skip around." For this part and the rest of the project, I used a stride of 30 steps.
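Each strided step blends the current clean-image estimate with the noisy image. A sketch of one update, following the interpolation formula from the project spec (the added-variance term is omitted here for brevity):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One strided update from timestep t down to t_prev (t_prev < t).
    x0_hat is the current one-step estimate of the clean image."""
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev   # effective alpha for this stride
    beta = 1 - alpha
    return (a_bar_prev.sqrt() * beta / (1 - a_bar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
```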
I used the iterative denoising from the last section to generate new images for this part. I passed in pure random noise (with the same dimensions as the test image) and ran it through the iterative denoiser, again using "a high quality photo" as the text embedding. The results were unimpressive, as the samples below show (a short sampling sketch follows them); we address this in the next section.
Samples 1–5: images generated from pure noise.
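A minimal sampling sketch, assuming a `denoise_step` callable that wraps one strided denoising update as in the previous part:

```python
import torch

def sample_from_noise(denoise_step, strided_timesteps, shape=(1, 3, 64, 64)):
    """Start from pure noise and run the iterative denoiser to completion.
    `denoise_step(x, t, t_prev)` is one update as sketched earlier;
    the 64x64 shape matches DeepFloyd's stage-1 output."""
    x = torch.randn(shape)
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        x = denoise_step(x, t, t_prev)
    return x
```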
In this section, we implemented classifier-free guidance (CFG) to improve the output from the last part. In CFG, we run the denoiser twice, once with the text prompt and once with a null (empty) prompt, and combine the two noise estimates to "guide" the final estimate used for denoising. The downside is that this reduces the variety of our outputs. I used the default parameters from the project spec for this section and all subsequent ones. Even so, the results were significantly better.
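The CFG combination itself is one line; a sketch (the default guidance scale of 7 is what I recall from the spec, so treat it as an assumption):

```python
def cfg_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the unconditional estimate
    toward the conditional one. gamma = 1 recovers plain conditional sampling."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```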
This part involved adding noise to an existing image and using our model to denoise it back toward its original state. The reversal is usually imperfect, which often produces interesting edits of the original image.
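A sketch of the procedure (the SDEdit algorithm), assuming the `forward` and iterative-denoising routines from the earlier parts; a smaller start index means more noise and therefore a larger edit:

```python
def sdedit(im, i_start, forward, iterative_denoise, strided_timesteps):
    """Noise a real image to an intermediate timestep, then denoise from there.
    `forward` and `iterative_denoise` are the routines from earlier parts
    (their exact signatures here are assumptions)."""
    x_t = forward(im, strided_timesteps[i_start])
    return iterative_denoise(x_t, i_start=i_start)
```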
Here we repeated the last section's exercise with some new images: one from the web (Johnny Test, my favorite cartoon character) and two that I drew (a car and a house). As the results show, this process seems to work better on less realistic images.
In this section, we used the diffusion model to generate new content for a selected part of an image. A binary mask marks the region to regenerate; at every denoising step, the pixels outside the mask are forced back to the (appropriately noised) original image so that region is preserved. Here are some examples:
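The key step, as a sketch: after every denoising update, everything outside the masked region is reset to the original image noised to the current timestep (here mask == 1 marks the region to fill in):

```python
def inpaint_step(x_t, x_orig, mask, t, forward):
    """Keep generated content inside the mask; force the rest back to the
    (noised) original so it is preserved. `forward` is the forward process."""
    return mask * x_t + (1 - mask) * forward(x_orig, t)
```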
In this part, we repeat the image-to-image translation, but we also provide a text prompt to the model, which gives us some control over the result. For the Campanile example, the prompt was "a rocket ship."
The aim of this section was to generate visual anagrams: images with different interpretations depending on their orientation (in our case, flipping the image upside down reveals something different). In the following example, "an oil painting of people around a campfire" is displayed upright, and when flipped, "an oil painting of an old man" can be seen.
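A sketch of the anagram noise estimate, assuming `unet_eps(x, t, emb)` returns the model's noise prediction: denoise the upright image under one prompt and the flipped image under the other, then average.

```python
import torch

def anagram_eps(unet_eps, x_t, t, emb_upright, emb_flipped):
    eps1 = unet_eps(x_t, t, emb_upright)
    # Flip the image vertically, denoise under the second prompt, flip back.
    eps2 = torch.flip(unet_eps(torch.flip(x_t, dims=[-2]), t, emb_flipped), dims=[-2])
    return (eps1 + eps2) / 2
```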
Finally, in this section, we created hybrid images using factorized diffusion. As in our previous project, hybrid images are formed by blending the low-frequency components of one image with the high-frequency components of another; here we blend two noise estimates instead.
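A sketch of the hybrid noise estimate: low-pass one prompt's estimate with a Gaussian and add the other's high-pass residual (the blur parameters here are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_eps(unet_eps, x_t, t, emb_low, emb_high, k=33, sigma=2.0):
    eps_low = unet_eps(x_t, t, emb_low)      # prompt visible from afar
    eps_high = unet_eps(x_t, t, emb_high)    # prompt visible up close
    return TF.gaussian_blur(eps_low, k, sigma) \
         + (eps_high - TF.gaussian_blur(eps_high, k, sigma))
```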
I simulated the noising process on clean MNIST digits by progressively adding noise. The visualization shows the noise accumulating, with a gradual transition from a clean digit to pure noise.
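This noising process is a one-liner; a sketch with illustrative sigma values:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps for clean MNIST digits x in [0, 1]."""
    return x + sigma * torch.randn_like(x)

# e.g. visualize increasing corruption:
# [add_noise(x, s) for s in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0)]
```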
I trained the UNet to denoise noisy MNIST images using generated pairs of clean and noised images. Over five epochs the network steadily improved, as shown by the loss curve and the visual results, with clear progress from the 1st epoch to the 5th.
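A training-loop sketch under the defaults I remember from the spec (fixed sigma = 0.5, Adam at 1e-4, five epochs); the `unet(z)` signature is an assumption:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                       # MNIST labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)   # noised input
            loss = F.mse_loss(unet(z), x)         # L2 against the clean image
            opt.zero_grad(); loss.backward(); opt.step()
```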
I tested the denoiser on MNIST digits at noise levels it was not trained on. The results show how well the model generalizes: performance is good at low noise levels and degrades as the noise increases.
I trained the time-conditioned UNet and tracked the loss over all epochs to monitor learning. After 5 and 20 epochs, I generated samples to visualize its performance. The samples improved significantly with more training; the epoch-20 outputs are sharper and less noisy than those at epoch 5.
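Sampling from the time-conditioned model is standard DDPM ancestral sampling; a sketch, assuming `unet(x, t)` takes a normalized timestep and T = 300 steps:

```python
import torch

@torch.no_grad()
def sample(unet, betas, T=300, shape=(1, 1, 28, 28), device="cuda"):
    alphas = 1 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        eps = unet(x, torch.tensor([t / T], device=device))
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        # DDPM posterior mean plus noise.
        x = (x - (1 - alphas[t]) / (1 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```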
I trained the class-conditioned UNet and sampled outputs after 5 and 20 epochs. At both stages, I generated 4 versions of each digit to show the quality of the results. The samples after 20 epochs were noticeably less noisy than those at 5 epochs.
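Class conditioning enters as a one-hot vector, and sampling applies CFG against a zeroed class vector (valid because conditioning is dropped a fraction of the time during training); a sketch with an assumed guidance scale of 5:

```python
import torch
import torch.nn.functional as F

def class_cond_eps(unet, x, t, digit, gamma=5.0, num_classes=10):
    """CFG noise estimate for a class-conditioned UNet; `unet(x, t, c)` and
    the guidance scale are assumptions based on my setup."""
    c = F.one_hot(torch.tensor([digit], device=x.device), num_classes).float()
    eps_cond = unet(x, t, c)
    eps_uncond = unet(x, t, torch.zeros_like(c))
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```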