Aarav Patel
In Project 5, we get an introduction to diffusion models and some of their interesting applications. We begin in Project 5A by leveraging the pre-trained DeepFloyd model to explore sampling, denoising, classifier-free guidance, image-to-image translation, inpainting, visual anagrams, and more. In Project 5B, we implement and train our own diffusion model on the MNIST dataset.
During the setup, we downloaded the DeepFloyd model and gave it sample text prompts to generate images. I experimented with changing num_inference_steps, trying 10 steps and 100 steps; the 100-step trials clearly produce more detailed, sharper images. I seeded my project with 6174.
In this step, I wrote the forward-process function, which adds noise to a clean image according to a given noise level (a function parameter).
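As a sketch, the forward process can be written directly from its equation; `alphas_cumprod` is my placeholder name for the precomputed cumulative alpha schedule:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process sketch: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps,
    with eps ~ N(0, I). `alphas_cumprod` holds the cumulative alpha schedule."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```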
In this part, I implemented classical denoising by convolving the noisy input with a 2D Gaussian, with the aim of removing high-frequency noise from the image.
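A short sketch of this baseline using torchvision (the kernel size and sigma here are illustrative, not necessarily the values I used):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Convolve with a 2D Gaussian to suppress high-frequency noise.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```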
In this part, I used the DeepFloyd diffusion model to denoise images generated through the forward process. Because the model was trained with text conditioning, we used "a high quality photo" as the prompt embedding. The stage-1 UNet is the actual denoiser; given its noise estimate, we solved the forward-process equation for the clean image. This method significantly outperformed classical denoising.
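Given the UNet's noise estimate, solving the forward-process equation for the clean image looks like this (a sketch; `eps` stands for the stage-1 UNet's predicted noise):

```python
def one_step_denoise(x_t, t, eps, alphas_cumprod):
    """Invert x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps for x_0."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
```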
The goal of this section was to iteratively denoise an image back to a clean state. Denoising one timestep at a time would quickly become computationally expensive, so instead we used strided timesteps to "skip around." For this part and the rest of the project, I used a stride of 30 steps.
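Each strided step blends the current clean-image estimate with the noisy image. A sketch of one update, following the interpolation formula from the project spec (the added-variance term is omitted here for brevity):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One strided update from timestep t down to t_prev (t_prev < t).
    x0_hat is the current one-step estimate of the clean image."""
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev   # effective alpha for this stride
    beta = 1 - alpha
    return (a_bar_prev.sqrt() * beta / (1 - a_bar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
```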
I used the iterative denoising from the last section to generate new images for this part. I passed in pure random noise (with the same dimensions as the test image) and ran it through the iterative denoiser, again using "a high quality photo" as the text embedding. The results were unimpressive, as the samples below show (a short sampling sketch follows them); we address this in the next section.
Samples 1–5: images generated from pure noise.
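A minimal sampling sketch, assuming a `denoise_step` callable that wraps one strided denoising update as in the previous part:

```python
import torch

def sample_from_noise(denoise_step, strided_timesteps, shape=(1, 3, 64, 64)):
    """Start from pure noise and run the iterative denoiser to completion.
    `denoise_step(x, t, t_prev)` is one update as sketched earlier;
    the 64x64 shape matches DeepFloyd's stage-1 output."""
    x = torch.randn(shape)
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        x = denoise_step(x, t, t_prev)
    return x
```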
In this section, we implemented classifier-free guidance (CFG) to improve the output from the last part. In CFG, we run the denoiser twice, once with the text prompt and once with a null (empty) prompt, and combine the two noise estimates to "guide" the final estimate used for denoising. The downside is that this reduces the variety of our outputs. I used the default parameters from the project spec for this section and all subsequent ones. Even so, the results were significantly better.
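The CFG combination itself is one line; a sketch (the default guidance scale of 7 is what I recall from the spec, so treat it as an assumption):

```python
def cfg_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the unconditional estimate
    toward the conditional one. gamma = 1 recovers plain conditional sampling."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```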
This part involved adding noise to an existing image and using our model to denoise it back toward its original state. The reversal is usually imperfect, which often produces interesting edits of the original image.
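A sketch of the procedure (the SDEdit algorithm), assuming the `forward` and iterative-denoising routines from the earlier parts; a smaller start index means more noise and therefore a larger edit:

```python
def sdedit(im, i_start, forward, iterative_denoise, strided_timesteps):
    """Noise a real image to an intermediate timestep, then denoise from there.
    `forward` and `iterative_denoise` are the routines from earlier parts
    (their exact signatures here are assumptions)."""
    x_t = forward(im, strided_timesteps[i_start])
    return iterative_denoise(x_t, i_start=i_start)
```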
Here we repeated the last section's exercise with some new images: one from the web (Johnny Test, my favorite cartoon character) and two that I drew (a car and a house). As the results show, this process seems to work better on less realistic images.
In this section, we used the diffusion model to generate new content for a selected part of an image. A binary mask marks the region to regenerate; at every denoising step, the pixels outside the mask are forced back to the (appropriately noised) original image so that region is preserved. Here are some examples:
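The key step, as a sketch: after every denoising update, everything outside the masked region is reset to the original image noised to the current timestep (here mask == 1 marks the region to fill in):

```python
def inpaint_step(x_t, x_orig, mask, t, forward):
    """Keep generated content inside the mask; force the rest back to the
    (noised) original so it is preserved. `forward` is the forward process."""
    return mask * x_t + (1 - mask) * forward(x_orig, t)
```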
In this part, we repeat the image-to-image translation, but we also provide a text prompt to the model, which gives us some control over the result. For the Campanile example, the prompt was "a rocket ship."
The aim of this section was to generate visual anagrams: images with different interpretations depending on their orientation (in our case, flipping the image upside down reveals something different). In the following example, "an oil painting of people around a campfire" is displayed upright, and when flipped, "an oil painting of an old man" can be seen.
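A sketch of the anagram noise estimate, assuming `unet_eps(x, t, emb)` returns the model's noise prediction: denoise the upright image under one prompt and the flipped image under the other, then average.

```python
import torch

def anagram_eps(unet_eps, x_t, t, emb_upright, emb_flipped):
    eps1 = unet_eps(x_t, t, emb_upright)
    # Flip the image vertically, denoise under the second prompt, flip back.
    eps2 = torch.flip(unet_eps(torch.flip(x_t, dims=[-2]), t, emb_flipped), dims=[-2])
    return (eps1 + eps2) / 2
```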
Finally, in this section, we created hybrid images using factorized diffusion. As in our previous project, hybrid images are formed by blending the low-frequency components of one image with the high-frequency components of another; here we blend two noise estimates instead.
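A sketch of the hybrid noise estimate: low-pass one prompt's estimate with a Gaussian and add the other's high-pass residual (the blur parameters here are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_eps(unet_eps, x_t, t, emb_low, emb_high, k=33, sigma=2.0):
    eps_low = unet_eps(x_t, t, emb_low)      # prompt visible from afar
    eps_high = unet_eps(x_t, t, emb_high)    # prompt visible up close
    return TF.gaussian_blur(eps_low, k, sigma) \
         + (eps_high - TF.gaussian_blur(eps_high, k, sigma))
```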
I simulated the noising process on clean MNIST digits by progressively adding noise. The visualization shows the noise accumulating, with a gradual transition from a clean digit to pure noise.
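This noising process is a one-liner; a sketch with illustrative sigma values:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps for clean MNIST digits x in [0, 1]."""
    return x + sigma * torch.randn_like(x)

# e.g. visualize increasing corruption:
# [add_noise(x, s) for s in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0)]
```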
I trained the UNet to denoise noisy MNIST images using generated pairs of clean and noised images. Over five epochs the network steadily improved, as shown by the loss curve and the visual results, with clear progress from the 1st epoch to the 5th.
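A training-loop sketch under the defaults I remember from the spec (fixed sigma = 0.5, Adam at 1e-4, five epochs); the `unet(z)` signature is an assumption:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                       # MNIST labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)   # noised input
            loss = F.mse_loss(unet(z), x)         # L2 against the clean image
            opt.zero_grad(); loss.backward(); opt.step()
```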
I tested the denoiser on MNIST digits at noise levels it was not trained on. The results show how well the model generalizes: performance is good at low noise levels and degrades as the noise increases.
I trained the time-conditioned UNet and tracked the loss over all epochs to monitor learning. After 5 and 20 epochs, I generated samples to visualize its performance. The samples improved significantly with more training; the epoch-20 outputs are sharper and less noisy than those at epoch 5.
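Sampling from the time-conditioned model is standard DDPM ancestral sampling; a sketch, assuming `unet(x, t)` takes a normalized timestep and T = 300 steps:

```python
import torch

@torch.no_grad()
def sample(unet, betas, T=300, shape=(1, 1, 28, 28), device="cuda"):
    alphas = 1 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        eps = unet(x, torch.tensor([t / T], device=device))
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        # DDPM posterior mean plus noise.
        x = (x - (1 - alphas[t]) / (1 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```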
I trained the class-conditioned UNet and sampled outputs after 5 and 20 epochs. At both stages, I generated 4 versions of each digit to show the quality of the results. The samples after 20 epochs were noticeably less noisy than those at 5 epochs.
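Class conditioning enters as a one-hot vector, and sampling applies CFG against a zeroed class vector (valid because conditioning is dropped a fraction of the time during training); a sketch with an assumed guidance scale of 5:

```python
import torch
import torch.nn.functional as F

def class_cond_eps(unet, x, t, digit, gamma=5.0, num_classes=10):
    """CFG noise estimate for a class-conditioned UNet; `unet(x, t, c)` and
    the guidance scale are assumptions based on my setup."""
    c = F.one_hot(torch.tensor([digit], device=x.device), num_classes).float()
    eps_cond = unet(x, t, c)
    eps_uncond = unet(x, t, torch.zeros_like(c))
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```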