Diffusion Models
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Diffusion models are a class of generative AI models that have achieved state-of-the-art results in image, audio, video, and molecular generation. Systems like DALL-E 3, Stable Diffusion, Midjourney, Sora, and AudioLDM are all built on diffusion model foundations. The core idea is elegant: learn to reverse a process of gradually adding noise to data, training the model to denoise step by step until a coherent sample emerges from pure random noise.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Generative model''' – A model that learns the underlying distribution of training data and can generate new samples from that distribution.
* '''Forward process (diffusion)''' – The process of gradually adding Gaussian noise to data over T time steps until it becomes pure noise.
* '''Reverse process (denoising)''' – The learned process of iteratively removing noise from a noisy sample to recover a clean data point.
* '''Noise schedule''' – The function controlling how much noise is added at each time step. Common schedules: linear, cosine, sigmoid.
* '''U-Net''' – The neural network architecture originally used as the denoising backbone in diffusion models; it processes images at multiple scales via an encoder-decoder with skip connections.
* '''Score function''' – The gradient of the log probability density, which points in the direction of higher data density; diffusion models implicitly learn to estimate this.
* '''DDPM (Denoising Diffusion Probabilistic Models)''' – The foundational 2020 paper that established the modern diffusion model framework (Ho et al.).
* '''DDIM (Denoising Diffusion Implicit Models)''' – A faster sampling method that achieves similar quality in far fewer steps (50 instead of 1000) by using a deterministic sampling formula.
* '''Latent diffusion''' – Performing the diffusion process in a compressed latent space (using a VAE encoder/decoder) rather than pixel space. This is how Stable Diffusion works.
* '''VAE (Variational Autoencoder)''' – The compression model used in latent diffusion to encode images into a compact latent representation.
* '''Classifier-Free Guidance (CFG)''' – A technique to improve sample quality and text-image alignment by interpolating between conditional and unconditional model predictions.
* '''Guidance scale''' – A hyperparameter controlling the strength of CFG; higher values produce samples more aligned with the conditioning signal but less diverse.
* '''Text-to-image''' – Generating images conditioned on natural language prompts.
* '''ControlNet''' – An architecture that adds spatial conditioning (e.g., edge maps, depth maps, pose skeletons) to pre-trained diffusion models without retraining.
* '''Inpainting''' – Using a diffusion model to fill in a masked region of an image coherently with its surroundings.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Diffusion models learn by training on data that has been corrupted with noise. At each training step, the model is shown a noisy version of a real image (with a known noise level t) and must predict the noise that was added. Over millions of training examples, the model learns: "given a partially noisy image at noise level t, here's how to denoise it."

'''The forward process''' is fixed and mathematically defined:

<math>x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)</math>

This means any noisy version of an image can be computed in one step directly from the original.

'''The reverse process''' is what the model learns: <math>p_\theta(x_{t-1} \mid x_t)</math> – given a noisy image at step t, predict the slightly less noisy image at step t-1.

'''Why not just use a GAN?''' GANs (Generative Adversarial Networks) were the previous state of the art for image generation. They train a generator and a discriminator in adversarial competition. Diffusion models have several advantages: more stable training (no mode collapse), better coverage of the data distribution (more diverse samples), and more principled theoretical grounding. The trade-off is slower sampling.

'''Latent diffusion''' solves the speed problem: instead of working in pixel space (512×512×3 = 786,432 dimensions), the VAE encodes images into a much smaller latent space (64×64×4 = 16,384 dimensions). The diffusion process runs in this compressed space – roughly 50× fewer dimensions – making training and inference dramatically faster.
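The forward process and the noise-prediction objective above translate directly into a few lines of code. The following is a minimal PyTorch sketch, not taken from any library: the linear β schedule values, the <code>denoiser</code> callable (standing in for the U-Net), and the function names are illustrative assumptions.

<syntaxhighlight lang="python">
import torch

# Assumed linear beta schedule over T steps (values chosen for illustration)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha-bar_t for every timestep

def forward_diffuse(x0, t, noise):
    """Sample x_t directly from x_0 in one step: x_t = sqrt(a_bar_t)*x_0 + sqrt(1 - a_bar_t)*eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)      # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(denoiser, x0):
    """One DDPM-style training step: predict the added noise, minimise MSE."""
    t = torch.randint(0, T, (x0.shape[0],))          # random timestep for each example
    noise = torch.randn_like(x0)                     # eps ~ N(0, I)
    x_t = forward_diffuse(x0, t, noise)
    noise_pred = denoiser(x_t, t)                    # the denoiser (U-Net) predicts eps
    return torch.nn.functional.mse_loss(noise_pred, noise)
</syntaxhighlight>

Sampling runs this in reverse: starting from pure Gaussian noise at t = T, the trained denoiser is applied step by step (or with a fast sampler such as DDIM or DPM-Solver) until a clean sample remains at t = 0.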
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Generating images with Stable Diffusion XL using diffusers:'''

<syntaxhighlight lang="python">
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# Load Stable Diffusion XL
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Use the fast DPM-Solver++ scheduler (25 steps instead of 50+)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A futuristic library with glowing holographic books, cinematic lighting, 8K",
    negative_prompt="blurry, low quality, distorted, watermark",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

image.save("output.png")
</syntaxhighlight>

; Key diffusion model parameters
: '''num_inference_steps''' – More steps = higher quality but slower. 20-50 is typical; use DDIM or DPM-Solver for efficiency.
: '''guidance_scale''' – 7–12 for good prompt adherence. Too high → oversaturation and artifacts. Too low → the prompt is ignored.
: '''negative_prompt''' – Tells the model what to avoid. Useful for removing common quality issues.
: '''seed''' – Set for reproducibility; vary for diversity.
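The '''guidance_scale''' setting above is the weight used in classifier-free guidance (see Remembering): at each denoising step the model is run twice – once with the prompt, once without – and the two noise predictions are combined. A minimal sketch of that combination, with <code>noise_model</code> and the argument names as illustrative placeholders rather than the actual diffusers internals:

<syntaxhighlight lang="python">
def cfg_noise_prediction(noise_model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one (illustrative sketch, not library code)."""
    eps_uncond = noise_model(x_t, t, uncond_emb)   # prediction with an empty prompt
    eps_cond = noise_model(x_t, t, cond_emb)       # prediction with the text prompt
    # scale 0 -> unconditional, 1 -> plain conditional; typical text-to-image values are 7-12
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
</syntaxhighlight>

A scale of 0 gives the unconditional prediction (the prompt is ignored), 1.0 gives the plain conditional prediction, and typical values of 7–12 extrapolate beyond it for stronger prompt adherence, at the cost of diversity and, at high values, oversaturation.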
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Generative Model Comparison
! Model Type !! Training Stability !! Sample Quality !! Sample Diversity !! Speed
|-
| GAN || Poor (mode collapse) || Very high (when tuned) || Low (mode collapse) || Fast (single forward pass)
|-
| VAE || Stable || Moderate (blurry) || Good || Fast
|-
| Diffusion (pixel) || Stable || State-of-the-art || Excellent || Slow (100-1000 steps)
|-
| Diffusion (latent) || Stable || State-of-the-art || Excellent || Moderate (20-50 steps in latent space)
|-
| Flow Matching || Stable || State-of-the-art || Excellent || Fast (8-20 steps)
|}

'''Failure modes and limitations:'''
* '''Text rendering''' – Diffusion models notoriously struggle to generate coherent text within images. The denoising process doesn't understand character shapes.
* '''Consistent identities''' – Generating the same face or character across multiple images is difficult without techniques like DreamBooth or IP-Adapter.
* '''Prompt sensitivity''' – Small prompt changes can cause large output variation. "A cat sitting on a red chair" vs "A red chair with a cat" may produce very different images.
* '''NSFW and copyright concerns''' – Models trained on internet data have absorbed copyrighted images and potentially harmful content. Content filtering and fine-tuning on curated data are important mitigations.
* '''Computational cost''' – Running a 50-step reverse diffusion process is expensive. Inference optimization (distillation, SDXL-Turbo, LCM) reduces steps to 1–4.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert evaluation of diffusion models uses both automatic metrics and human studies:

'''FID (Fréchet Inception Distance)''': The standard metric for image generation quality and diversity. Lower is better. Computes the distance between the distributions of generated and real images in a feature space. FID captures both quality (are samples sharp and realistic?) and diversity (do they cover the real data distribution?).

'''CLIP score''': Measures text-image alignment using CLIP's joint embedding space. Higher means the generated image better matches the text prompt.

'''Human evaluation''': For production systems, human raters assess photorealism, prompt adherence, aesthetic quality, and safety. Tools like RLHF for diffusion models (using human feedback to steer the model toward preferred outputs) are emerging.

'''Perceptual studies''': Present raters with real vs generated images and measure discrimination accuracy. Models that fool humans reliably are considered high-quality.

Expert practitioners also evaluate '''controllability''' – does the model reliably respond to specific attributes in the prompt? Does it understand spatial relationships ("a cat to the left of a dog")? Composition benchmarks like T2I-CompBench evaluate these capabilities.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a diffusion-based image generation system:

'''1. Choose the right base model'''
<syntaxhighlight lang="text">
Use case assessment:
├── General creative images   → SDXL 1.0 or Flux.1
├── Photorealistic portraits  → Realistic Vision, Deliberate
├── Anime/illustration style  → Anything-v5, DreamShaper
├── Product/e-commerce        → Fine-tune on product images
└── Video generation          → AnimateDiff, Sora, Stable Video Diffusion
</syntaxhighlight>

'''2. Customization pipeline (DreamBooth/LoRA)'''
<syntaxhighlight lang="text">
3-30 reference images (subject or style)
        ↓
[DreamBooth fine-tune: ~1000-2000 steps with rare token identifier]
   OR
[LoRA training: lightweight adapter, 500 MB vs 4 GB full model]
        ↓
[Inference: base model + LoRA adapter merged at generation time]
        ↓
Consistent identity/style preservation in generations
</syntaxhighlight>
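At generation time, attaching a trained LoRA adapter to the base pipeline is straightforward with diffusers. A minimal sketch, assuming SDXL as the base model; the adapter repo id <code>your-account/example-style-lora</code> and the <code>sks</code> identifier token are placeholders for whatever the training step above actually produced:

<syntaxhighlight lang="python">
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

# Attach a trained LoRA adapter (placeholder repo id)
pipe.load_lora_weights("your-account/example-style-lora")

# Optionally merge the adapter into the base weights to remove the small
# per-step overhead of applying it separately
pipe.fuse_lora()

image = pipe(
    prompt="portrait of sks person reading a holographic book",  # rare-token identifier from training
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("lora_output.png")
</syntaxhighlight>

Keeping adapters unfused (skipping <code>fuse_lora()</code>) is what enables the multi-adapter serving mentioned in step 3 below: adapters can be loaded and swapped per request on a single base model.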
'''3. Production inference stack'''
* Use xFormers or Flash Attention for memory-efficient attention
* Enable torch.compile() for a 20-40% throughput improvement
* Batch requests across users to maximize GPU utilization
* Use model distillation (SDXL-Turbo, LCM-LoRA) for real-time applications
* Serve multiple LoRA adapters on one base model via dynamic loading

'''4. Safety layer'''
* Safety checker on outputs (NSFW classifier)
* Prompt filtering for prohibited content
* Watermarking generated images for provenance tracking
</div>

[[Category:Artificial Intelligence]]
[[Category:Generative AI]]
[[Category:Deep Learning]]