Generative Adversarial Networks


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Generative Adversarial Networks (GANs) are a class of deep learning models in which two neural networks — a generator and a discriminator — are trained simultaneously in a competitive game. The generator learns to produce realistic synthetic data (images, audio, video, text), while the discriminator learns to distinguish real data from generated fakes. This adversarial dynamic drives both networks to improve: the generator becomes better at fooling the discriminator, and the discriminator becomes better at detecting fakes. Introduced by Ian Goodfellow and colleagues in 2014, GANs powered the first wave of AI image generation and remain foundational to understanding generative models.

Remembering[edit]

  • Generator (G) — A neural network that takes a random noise vector as input and produces synthetic data (images, audio, etc.) designed to fool the discriminator.
  • Discriminator (D) — A neural network that takes a data sample (real or generated) as input and outputs a probability that it is real (not fake).
  • Latent space — The space of random noise vectors (z) that the generator maps to data space. Interpolating in latent space produces smooth transitions between generated samples.
  • Adversarial training — The min-max game between G and D: G minimizes and D maximizes the same loss function simultaneously.
  • Nash equilibrium — The theoretical ideal endpoint of GAN training, where G generates samples indistinguishable from real data and D outputs 0.5 for all inputs.
  • Mode collapse — A common GAN failure where the generator learns to produce only a small variety of outputs, ignoring most of the real data distribution.
  • Training instability — GANs are notoriously difficult to train; the generator and discriminator can fail to converge or collapse.
  • DCGAN (Deep Convolutional GAN) — An early influential GAN using convolutional layers; established architectural best practices.
  • Conditional GAN (cGAN) — A GAN conditioned on additional information (class label, image, text) to control what is generated.
  • StyleGAN — A high-quality face generation GAN (NVIDIA) known for its disentangled latent space and photorealistic outputs.
  • CycleGAN — A GAN for unpaired image-to-image translation (e.g., photos ↔ paintings) without paired training examples.
  • Pix2Pix — A conditional GAN for paired image-to-image translation (e.g., sketches → photos, day → night).
  • Wasserstein GAN (WGAN) — A GAN variant using Wasserstein distance as the loss, dramatically improving training stability.
  • FID (Fréchet Inception Distance) — The standard metric for GAN image quality; measures the distance between real and generated image distributions.
  • Progressive growing — A training technique where the resolution of generated images increases gradually during training.

Understanding[edit]

The GAN training objective is a minimax game:

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

  • D wants to maximize this: output high probabilities for real data x and low for G(z)
  • G wants to minimize this: produce G(z) that D assigns high probability to

Think of it as a forger (G) and an art expert (D). The forger gets better at creating convincing fakes; the expert gets better at detecting them. Both improve through competition. In theory, the game converges when the forger is so good that the expert can't tell real from fake — the Nash equilibrium.

Why is training hard? The minimax objective is non-convex in the parameters of both networks, so simultaneous gradient updates carry no convergence guarantee. Several failure modes are common:

  • If D is too strong early, G receives near-zero gradients and cannot learn (vanishing gradient; the standard fix is sketched after this list)
  • If G is stronger, D cannot discriminate and provides no useful training signal
  • Mode collapse: G finds one or a few "safe" outputs that always fool D and gets stuck
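
One standard mitigation for the vanishing-gradient failure above (a second, the Wasserstein loss, is discussed next) is the non-saturating generator loss from the original GAN paper: instead of minimizing log(1 - D(G(z))), G maximizes log D(G(z)), which keeps gradients strong precisely when D confidently rejects fakes. A minimal sketch (function names are illustrative; the training code in the Applying section uses this same trick via BCE against "real" labels):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def generator_loss_saturating(d_fake_logits):
    # Original minimax form: minimize log(1 - D(G(z))).
    # Gradient vanishes when D(G(z)) is near 0, i.e. D easily spots fakes.
    return torch.log1p(-torch.sigmoid(d_fake_logits)).mean()

def generator_loss_nonsaturating(d_fake_logits):
    # Non-saturating form: maximize log D(G(z)).
    # Equivalent to BCE-with-logits against "real" (all-ones) targets.
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
</syntaxhighlight>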

Wasserstein distance addresses vanishing gradients. Instead of a probability (0–1), WGAN trains D (called the "critic") to output a real number representing how real the sample is, using the Wasserstein-1 distance as the objective. This provides a smooth, meaningful gradient even when the distributions are far apart — fixing the vanishing gradient problem.
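
A minimal sketch of the WGAN-GP critic objective (Gulrajani et al., 2017), assuming a critic network that maps image batches to unbounded scalar scores; the names and the NCHW image shapes are illustrative:

<syntaxhighlight lang="python">
import torch

def critic_loss_wgan_gp(critic, real, fake, lambda_gp=10.0):
    # Wasserstein estimate: the critic should score real high, fake low.
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty: push the critic toward 1-Lipschitz by penalizing
    # gradient norms along random real/fake interpolates.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True)
    gp = (grad.view(grad.size(0), -1).norm(2, dim=1) - 1).pow(2).mean()
    return loss + lambda_gp * gp
</syntaxhighlight>

As in the standard GAN training step, fake should be detached from the generator graph when training the critic.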

Conditional generation lets you control what the GAN produces. By feeding both G and D a conditioning signal (e.g., a class label "cat" or a source image), the generator learns to produce outputs matching that condition, enabling text-to-image, image-to-image, and class-conditional generation.
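
A common implementation, sketched below with illustrative dimensions, embeds the class label and concatenates it with the noise vector before the first generator layer (the discriminator gets the same embedding concatenated with its input):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, latent_dim=100, num_classes=10, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 784),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        # Condition by concatenating noise with a learned label embedding.
        cond = torch.cat([z, self.embed(labels)], dim=1)
        return self.net(cond).view(-1, 1, 28, 28)

# Usage: generate a batch of images of class 7.
# G = ConditionalGenerator()
# imgs = G(torch.randn(16, 100), torch.full((16,), 7, dtype=torch.long))
</syntaxhighlight>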

Applying[edit]

Building a simple GAN for MNIST digit generation (fully connected layers for clarity; a true DCGAN uses convolutional layers, as the name implies):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Generator: noise z → fake image
class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 784),   # 28×28 image
            nn.Tanh()              # Output range [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Discriminator: image → real/fake probability
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

G = Generator()
D = Discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

def train_step(real_images, latent_dim=100):
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Train Discriminator: real → 1, fake → 0
    z = torch.randn(batch_size, latent_dim)
    fake_images = G(z).detach()        # detach: don't backprop into G yet
    loss_D = criterion(D(real_images), real_labels) + \
             criterion(D(fake_images), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train Generator: G wants D to say "real" for its fakes
    z = torch.randn(batch_size, latent_dim)
    loss_G = criterion(D(G(z)), real_labels)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
</syntaxhighlight>
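
A sketch of how train_step might be driven, assuming torchvision's MNIST loader with pixels normalized to [-1, 1] to match the generator's Tanh output:

<syntaxhighlight lang="python">
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                   # pixels in [0, 1]
    transforms.Normalize((0.5,), (0.5,)),    # rescale to [-1, 1]
])
loader = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transform),
    batch_size=128, shuffle=True)

for epoch in range(5):
    for real_images, _ in loader:            # labels unused (unconditional)
        loss_D, loss_G = train_step(real_images)
    print(f"epoch {epoch}: loss_D={loss_D:.3f}, loss_G={loss_G:.3f}")
</syntaxhighlight>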

GAN application landscape:

  • Face generation → StyleGAN3 (NVIDIA) — photorealistic faces at 1024px
  • Image-to-image → Pix2Pix (paired), CycleGAN (unpaired)
  • Super resolution → SRGAN, ESRGAN — upscale low-res images
  • Video synthesis → Vid2Vid, StyleGAN-V
  • Data augmentation → generate synthetic training data for rare classes
  • Medical imaging → synthesize rare pathology images for training classifiers

Analyzing[edit]

GAN vs. Diffusion Models vs. VAEs:

Property               GAN                               Diffusion Model         VAE
Sample quality         Very high (when stable)           State-of-the-art        Moderate (blurry)
Training stability     Poor (adversarial)                Stable                  Stable
Mode coverage          Poor (mode collapse)              Excellent               Good
Sampling speed         Very fast (single forward pass)   Slow (many steps)       Fast
Latent space quality   Good (disentangled in StyleGAN)   Implicit                Explicit, smooth
Controllability        Moderate (cGAN)                   High (guidance scale)   Moderate

Failure modes:

  • Mode collapse — G always generates the same output (e.g., always "8" for digit generation). Fix: mini-batch discrimination, Wasserstein loss, spectral normalization (sketched after this list).
  • Training oscillation — Loss curves oscillate wildly; G and D never converge. Fix: reduce learning rates, increase batch size, gradient penalty (WGAN-GP).
  • Checkerboard artifacts — Upsampling with transposed convolutions creates grid-pattern artifacts. Fix: bilinear upsampling followed by standard convolution.
  • Discriminator overfitting — D memorizes training data rather than learning a general real/fake boundary. Fix: discriminator dropout, data augmentation on real samples.
  • Evaluation metric gaming — Optimizing for FID specifically (rather than for perceptual quality) can produce images that score well (low FID) yet look poor. Use multiple metrics.
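
Spectral normalization, listed above as a stability fix, ships with PyTorch; a minimal sketch of wrapping discriminator layers (layer sizes are illustrative):

<syntaxhighlight lang="python">
import torch.nn as nn
from torch.nn.utils import spectral_norm

# spectral_norm rescales each weight matrix by its largest singular value,
# bounding the discriminator's Lipschitz constant and smoothing its gradients.
discriminator = nn.Sequential(
    nn.Flatten(),
    spectral_norm(nn.Linear(784, 512)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(512, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)
</syntaxhighlight>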

Evaluating[edit]

Expert GAN evaluation is multi-faceted:

FID (Fréchet Inception Distance): Computes the Fréchet distance between the distribution of Inception v3 features for 50k real and 50k generated samples. Lower is better. FID captures both quality (sharpness, realism) and diversity (mode coverage). It is the standard metric but has known limitations: it is sensitive to the number of samples and the pre-trained feature extractor.
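
A minimal sketch of the Fréchet distance computation itself, assuming real_feats and fake_feats are NumPy arrays of already-extracted Inception v3 features (in practice a maintained implementation such as pytorch-fid is preferable, since preprocessing details shift scores):

<syntaxhighlight lang="python">
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    # Fit a Gaussian to each feature set, then compute
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2)).
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):     # drop tiny imaginary numerical noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2 * covmean))
</syntaxhighlight>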

Precision and Recall for generative models: Kynkäänniemi et al. (2019) proposed separate precision (sample quality) and recall (mode coverage) metrics. A GAN with high precision but low recall has mode collapse. This is more informative than FID alone.

IS (Inception Score): Measures both quality (samples should be classifiable) and diversity (class distribution should be uniform). Less reliable than FID because it doesn't compare to real data.

Perceptual user studies: Human raters are shown real and generated images; measure discrimination accuracy (lower = more realistic). This remains the gold standard for applications where human perception is the target.

Expert practitioners also perform interpolation tests: sample two latent vectors z1 and z2, interpolate between them, and verify that the generated images transition smoothly and meaningfully — indicating a well-structured latent space rather than memorization.
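
A sketch of such an interpolation test, using the Generator class from the Applying section (linear interpolation for simplicity; spherical interpolation is often preferred because Gaussian latent vectors concentrate near a shell):

<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def interpolate(G, steps=10, latent_dim=100):
    G.eval()   # BatchNorm uses running stats, safe for any batch size
    z1, z2 = torch.randn(1, latent_dim), torch.randn(1, latent_dim)
    ts = torch.linspace(0, 1, steps).view(-1, 1)
    z = (1 - ts) * z1 + ts * z2      # one batch of interpolated latents
    return G(z)                      # (steps, 1, 28, 28) for the MNIST G
</syntaxhighlight>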

Creating[edit]

Designing a GAN-based image synthesis system:

1. Architecture selection by task

<syntaxhighlight lang="text">
Task classification:
├── Unconditional image synthesis → StyleGAN3
├── Class-conditional generation  → BigGAN, StyleGAN-XL
├── Text-to-image (GAN-based)     → GigaGAN
├── Paired image translation      → Pix2Pix (paired data)
├── Unpaired image translation    → CycleGAN (no paired data)
├── Super resolution              → ESRGAN
└── Video generation              → MoCoGAN, DIGAN
</syntaxhighlight>

2. Training stabilization recipe (for custom GAN)

  • Use WGAN-GP or StyleGAN's R1 gradient penalty loss (R1 is sketched after this list)
  • Spectral normalization on discriminator weights
  • LeakyReLU (0.2) in discriminator; ReLU in generator
  • Adam optimizer with β1=0.5, β2=0.999; learning rate 1e-4 to 2e-4
  • Exponential moving average (EMA) of generator weights for smoother evaluation
  • Progressive growing or batch size ramp-up for high-resolution targets
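
A minimal sketch of the R1 gradient penalty (Mescheder et al., 2018) mentioned in the recipe: penalize the discriminator's gradient norm on real samples only. Names are illustrative, and the real batch must have requires_grad enabled before the discriminator forward pass:

<syntaxhighlight lang="python">
import torch

def r1_penalty(d_real_logits, real_images, gamma=10.0):
    # Gradient of D's output w.r.t. the real inputs;
    # create_graph=True keeps the penalty itself differentiable.
    grad, = torch.autograd.grad(
        outputs=d_real_logits.sum(), inputs=real_images, create_graph=True)
    return 0.5 * gamma * grad.view(grad.size(0), -1).pow(2).sum(1).mean()

# Usage inside the discriminator step:
#   real_images.requires_grad_(True)
#   d_real = D(real_images)
#   loss_D = adversarial_loss(d_real, ...) + r1_penalty(d_real, real_images)
</syntaxhighlight>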

3. Data preparation

<syntaxhighlight lang="text">
Collect dataset (minimum 10k images; 100k+ for high quality)

Crop + align (for face generation: align to landmarks)

Resize to target resolution (power of 2: 64, 128, 256, 512, 1024)

Normalize to [-1, 1] (matches Tanh output activation)

[Optional] ADA (Adaptive Discriminator Augmentation) for small datasets
</syntaxhighlight>
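
A sketch of the resize-and-normalize steps in torchvision, assuming an ImageFolder-style RGB dataset and a 256×256 target; cropping/alignment to landmarks and ADA are separate tooling:

<syntaxhighlight lang="python">
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),            # shorter side → 256
    transforms.CenterCrop(256),        # square, power-of-2 resolution
    transforms.ToTensor(),             # pixels in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # → [-1, 1]
])
dataset = datasets.ImageFolder("data/train", transform=transform)
</syntaxhighlight>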

4. Monitoring training health

  • Plot G loss and D loss separately; they should remain in rough balance
  • Sample fixed noise vectors (z_fixed) each epoch → visualize how G evolves (sketched after this list)
  • Compute FID every 5k–10k iterations on 10k samples
  • Alert if G loss spikes dramatically (mode collapse indicator)
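
A sketch of the fixed-noise probe, assuming the MNIST generator from the Applying section and torchvision for saving image grids:

<syntaxhighlight lang="python">
import torch
from torchvision.utils import save_image

z_fixed = torch.randn(64, 100)   # sample once, reuse every epoch

@torch.no_grad()
def snapshot(G, epoch):
    G.eval()                     # BatchNorm uses running statistics
    fake = G(z_fixed)
    save_image(fake, f"samples_epoch_{epoch:04d}.png",
               nrow=8, normalize=True)
    G.train()
</syntaxhighlight>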