Diffusion Models
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Diffusion models are a class of generative AI models that have achieved state-of-the-art results in image, audio, video, and molecular generation. Systems like DALL-E 3, Stable Diffusion, Midjourney, Sora, and AudioLDM are all built on diffusion model foundations. The core idea is elegant: learn to reverse a process of gradually adding noise to data, training the model to denoise step by step until a coherent sample emerges from pure random noise.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Generative model''' – A model that learns the underlying distribution of training data and can generate new samples from that distribution.
* '''Forward process (diffusion)''' – The process of gradually adding Gaussian noise to data over T time steps until it becomes pure noise.
* '''Reverse process (denoising)''' – The learned process of iteratively removing noise from a noisy sample to recover a clean data point.
* '''Noise schedule''' – The function controlling how much noise is added at each time step. Common schedules: linear, cosine, sigmoid.
* '''U-Net''' – The neural network architecture originally used as the denoising backbone in diffusion models; it processes images at multiple scales via an encoder-decoder with skip connections.
* '''Score function''' – The gradient of the log probability density, which points in the direction of higher data density; diffusion models implicitly learn to estimate this.
* '''DDPM (Denoising Diffusion Probabilistic Models)''' – The foundational 2020 paper that established the modern diffusion model framework (Ho et al.).
* '''DDIM (Denoising Diffusion Implicit Models)''' – A faster sampling method that achieves similar quality in far fewer steps (50 instead of 1000) by using a deterministic sampling formula.
* '''Latent diffusion''' – Performing the diffusion process in a compressed latent space (using a VAE encoder/decoder) rather than pixel space. This is how Stable Diffusion works.
* '''VAE (Variational Autoencoder)''' – The compression model used in latent diffusion to encode images into a compact latent representation.
* '''Classifier-Free Guidance (CFG)''' – A technique to improve sample quality and text-image alignment by interpolating between conditional and unconditional model predictions.
* '''Guidance scale''' – A hyperparameter controlling the strength of CFG; higher values produce samples more aligned with the conditioning signal but less diverse.
* '''Text-to-image''' – Generating images conditioned on natural language prompts.
* '''ControlNet''' – An architecture that adds spatial conditioning (e.g., edge maps, depth maps, pose skeletons) to pre-trained diffusion models without retraining.
* '''Inpainting''' – Using a diffusion model to fill in a masked region of an image coherently with its surroundings.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Diffusion models learn by training on data that has been corrupted with noise. At each training step, the model is shown a noisy version of a real image (with a known noise level t) and must predict the noise that was added. Over millions of training examples, the model learns: "given a partially noisy image at noise level t, here's how to denoise it."

'''The forward process''' is fixed and mathematically defined:

<math>x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)</math>

This means any noisy version of an image can be computed in one step directly from the original.

'''The reverse process''' is what the model learns: <math>p_\theta(x_{t-1} \mid x_t)</math> – given a noisy image at step t, predict the slightly less noisy image at step t-1.

'''Why not just use a GAN?''' GANs (Generative Adversarial Networks) were the previous state of the art for image generation. They train a generator and a discriminator in adversarial competition. Diffusion models have several advantages: more stable training (no mode collapse), better coverage of the data distribution (more diverse samples), and more principled theoretical grounding. The trade-off is slower sampling.

'''Latent diffusion''' solves the speed problem: instead of working in pixel space (512×512×3 = 786,432 dimensions), the VAE encodes images into a much smaller latent space (64×64×4 = 16,384 dimensions). The diffusion process runs in this compressed space – roughly 50× fewer dimensions – making training and inference dramatically faster.
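The forward process and the noise-prediction objective above translate directly into a few lines of code. The following is a minimal PyTorch sketch, not taken from any library: the linear β schedule values, the <code>denoiser</code> callable (standing in for the U-Net), and the function names are illustrative assumptions.

<syntaxhighlight lang="python">
import torch

# Assumed linear beta schedule over T steps (values chosen for illustration)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha-bar_t for every timestep

def forward_diffuse(x0, t, noise):
    """Sample x_t directly from x_0 in one step: x_t = sqrt(a_bar_t)*x_0 + sqrt(1 - a_bar_t)*eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)      # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(denoiser, x0):
    """One DDPM-style training step: predict the added noise, minimise MSE."""
    t = torch.randint(0, T, (x0.shape[0],))          # random timestep for each example
    noise = torch.randn_like(x0)                     # eps ~ N(0, I)
    x_t = forward_diffuse(x0, t, noise)
    noise_pred = denoiser(x_t, t)                    # the denoiser (U-Net) predicts eps
    return torch.nn.functional.mse_loss(noise_pred, noise)
</syntaxhighlight>

Sampling runs this in reverse: starting from pure Gaussian noise at t = T, the trained denoiser is applied step by step (or with a fast sampler such as DDIM or DPM-Solver) until a clean sample remains at t = 0.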
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Generating images with Stable Diffusion XL using diffusers:'''

<syntaxhighlight lang="python">
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# Load Stable Diffusion XL
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Use the fast DPM-Solver++ scheduler (25 steps instead of 50+)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A futuristic library with glowing holographic books, cinematic lighting, 8K",
    negative_prompt="blurry, low quality, distorted, watermark",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

image.save("output.png")
</syntaxhighlight>

; Key diffusion model parameters
: '''num_inference_steps''' – More steps = higher quality but slower. 20-50 is typical; use DDIM or DPM-Solver for efficiency.
: '''guidance_scale''' – 7–12 for good prompt adherence. Too high → oversaturation and artifacts. Too low → the prompt is ignored.
: '''negative_prompt''' – Tells the model what to avoid. Useful for removing common quality issues.
: '''seed''' – Set for reproducibility; vary for diversity.
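The '''guidance_scale''' setting above is the weight used in classifier-free guidance (see Remembering): at each denoising step the model is run twice – once with the prompt, once without – and the two noise predictions are combined. A minimal sketch of that combination, with <code>noise_model</code> and the argument names as illustrative placeholders rather than the actual diffusers internals:

<syntaxhighlight lang="python">
def cfg_noise_prediction(noise_model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one (illustrative sketch, not library code)."""
    eps_uncond = noise_model(x_t, t, uncond_emb)   # prediction with an empty prompt
    eps_cond = noise_model(x_t, t, cond_emb)       # prediction with the text prompt
    # scale 0 -> unconditional, 1 -> plain conditional; typical text-to-image values are 7-12
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
</syntaxhighlight>

A scale of 0 gives the unconditional prediction (the prompt is ignored), 1.0 gives the plain conditional prediction, and typical values of 7–12 extrapolate beyond it for stronger prompt adherence, at the cost of diversity and, at high values, oversaturation.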
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Generative Model Comparison
! Model Type !! Training Stability !! Sample Quality !! Sample Diversity !! Speed
|-
| GAN || Poor (mode collapse) || Very high (when tuned) || Low (mode collapse) || Fast (single forward pass)
|-
| VAE || Stable || Moderate (blurry) || Good || Fast
|-
| Diffusion (pixel) || Stable || State-of-the-art || Excellent || Slow (100-1000 steps)
|-
| Diffusion (latent) || Stable || State-of-the-art || Excellent || Moderate (20-50 steps in latent space)
|-
| Flow Matching || Stable || State-of-the-art || Excellent || Fast (8-20 steps)
|}

'''Failure modes and limitations:'''
* '''Text rendering''' – Diffusion models notoriously struggle to generate coherent text within images. The denoising process doesn't understand character shapes.
* '''Consistent identities''' – Generating the same face or character across multiple images is difficult without techniques like DreamBooth or IP-Adapter.
* '''Prompt sensitivity''' – Small prompt changes can cause large output variation. "A cat sitting on a red chair" vs "A red chair with a cat" may produce very different images.
* '''NSFW and copyright concerns''' – Models trained on internet data have absorbed copyrighted images and potentially harmful content. Content filtering and fine-tuning on curated data are important mitigations.
* '''Computational cost''' – Running a 50-step reverse diffusion process is expensive. Inference optimization (distillation, SDXL-Turbo, LCM) reduces steps to 1–4.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert evaluation of diffusion models uses both automatic metrics and human studies:

'''FID (Fréchet Inception Distance)''': The standard metric for image generation quality and diversity. Lower is better. Computes the distance between the distributions of generated and real images in a feature space. FID captures both quality (are samples sharp and realistic?) and diversity (do they cover the real data distribution?).

'''CLIP score''': Measures text-image alignment using CLIP's joint embedding space. Higher means the generated image better matches the text prompt.

'''Human evaluation''': For production systems, human raters assess photorealism, prompt adherence, aesthetic quality, and safety. Tools like RLHF for diffusion models (using human feedback to steer the model toward preferred outputs) are emerging.

'''Perceptual studies''': Present raters with real vs generated images and measure discrimination accuracy. Models that fool humans reliably are considered high-quality.

Expert practitioners also evaluate '''controllability''' – does the model reliably respond to specific attributes in the prompt? Does it understand spatial relationships ("a cat to the left of a dog")? Composition benchmarks like T2I-CompBench evaluate these capabilities.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a diffusion-based image generation system:

'''1. Choose the right base model'''
<syntaxhighlight lang="text">
Use case assessment:
├── General creative images   → SDXL 1.0 or Flux.1
├── Photorealistic portraits  → Realistic Vision, Deliberate
├── Anime/illustration style  → Anything-v5, DreamShaper
├── Product/e-commerce        → Fine-tune on product images
└── Video generation          → AnimateDiff, Sora, Stable Video Diffusion
</syntaxhighlight>

'''2. Customization pipeline (DreamBooth/LoRA)'''
<syntaxhighlight lang="text">
3-30 reference images (subject or style)
        ↓
[DreamBooth fine-tune: ~1000-2000 steps with rare token identifier]
   OR
[LoRA training: lightweight adapter, 500 MB vs 4 GB full model]
        ↓
[Inference: base model + LoRA adapter merged at generation time]
        ↓
Consistent identity/style preservation in generations
</syntaxhighlight>
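At generation time, attaching a trained LoRA adapter to the base pipeline is straightforward with diffusers. A minimal sketch, assuming SDXL as the base model; the adapter repo id <code>your-account/example-style-lora</code> and the <code>sks</code> identifier token are placeholders for whatever the training step above actually produced:

<syntaxhighlight lang="python">
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

# Attach a trained LoRA adapter (placeholder repo id)
pipe.load_lora_weights("your-account/example-style-lora")

# Optionally merge the adapter into the base weights to remove the small
# per-step overhead of applying it separately
pipe.fuse_lora()

image = pipe(
    prompt="portrait of sks person reading a holographic book",  # rare-token identifier from training
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("lora_output.png")
</syntaxhighlight>

Keeping adapters unfused (skipping <code>fuse_lora()</code>) is what enables the multi-adapter serving mentioned in step 3 below: adapters can be loaded and swapped per request on a single base model.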
'''3. Production inference stack'''
* Use xFormers or Flash Attention for memory-efficient attention
* Enable torch.compile() for a 20-40% throughput improvement
* Batch requests across users to maximize GPU utilization
* Use model distillation (SDXL-Turbo, LCM-LoRA) for real-time applications
* Serve multiple LoRA adapters on one base model via dynamic loading

'''4. Safety layer'''
* Safety checker on outputs (NSFW classifier)
* Prompt filtering for prohibited content
* Watermarking generated images for provenance tracking
</div>

[[Category:Artificial Intelligence]]
[[Category:Generative AI]]
[[Category:Deep Learning]]