Editing Gans

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Generative Adversarial Networks (GANs) are a class of deep learning models in which two neural networks — a generator and a discriminator — are trained simultaneously in a competitive game. The generator learns to produce realistic synthetic data (images, audio, video, text), while the discriminator learns to distinguish real data from generated fakes. This adversarial dynamic drives both networks to improve: the generator becomes better at fooling the discriminator, and the discriminator becomes better at detecting fakes. Introduced by Ian Goodfellow in 2014, GANs powered the first wave of AI image generation and remain foundational to understanding generative models.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Generator (G)''' — A neural network that takes a random noise vector as input and produces synthetic data (images, audio, etc.) designed to fool the discriminator.
* '''Discriminator (D)''' — A neural network that takes a data sample (real or generated) as input and outputs a probability that it is real (not fake).
* '''Latent space''' — The space of random noise vectors (z) that the generator maps to data space. Interpolating in latent space produces smooth transitions between generated samples.
* '''Adversarial training''' — The min-max game between G and D: G minimizes and D maximizes the same loss function simultaneously.
* '''Nash equilibrium''' — The theoretical ideal endpoint of GAN training, where G generates samples indistinguishable from real data and D outputs 0.5 for all inputs.
* '''Mode collapse''' — A common GAN failure where the generator learns to produce only a small variety of outputs, ignoring most of the real data distribution.
* '''Training instability''' — GANs are notoriously difficult to train; the generator and discriminator can fail to converge or collapse.
* '''DCGAN (Deep Convolutional GAN)''' — An early influential GAN using convolutional layers; established architectural best practices.
* '''Conditional GAN (cGAN)''' — A GAN conditioned on additional information (class label, image, text) to control what is generated.
* '''StyleGAN''' — A high-quality face generation GAN (NVIDIA) known for its disentangled latent space and photorealistic outputs.
* '''CycleGAN''' — A GAN for unpaired image-to-image translation (e.g., photos ↔ paintings) without paired training examples.
* '''Pix2Pix''' — A conditional GAN for paired image-to-image translation (e.g., sketches → photos, day → night).
* '''Wasserstein GAN (WGAN)''' — A GAN variant using Wasserstein distance as the loss, dramatically improving training stability.
* '''FID (Fréchet Inception Distance)''' — The standard metric for GAN image quality; measures the distance between real and generated image distributions.
* '''Progressive growing''' — A training technique where the resolution of generated images increases gradually during training.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The GAN training objective is a minimax game:

min''G max''D  E[log D(x)] + E[log(1 - D(G(z)))]

* D wants to maximize this: output high probabilities for real data x and low for G(z)
* G wants to minimize this: produce G(z) that D assigns high probability to

Think of it as a forger (G) and an art expert (D). The forger gets better at creating convincing fakes; the expert gets better at detecting them. Both improve through competition. In theory, the game converges when the forger is so good that the expert can't tell real from fake — the Nash equilibrium.

'''Why is training hard?''' The minimax game is not convex — there's no guarantee of convergence. Several failure modes are common:
* If D is too strong early, G receives near-zero gradients and cannot learn (vanishing gradient)
* If G is stronger, D cannot discriminate and provides no useful training signal
* Mode collapse: G finds one or a few "safe" outputs that always fool D and gets stuck

'''Wasserstein distance''' addresses vanishing gradients. Instead of a probability (0–1), WGAN trains D (called the "critic") to output a real number representing how real the sample is, using the Wasserstein-1 distance as the objective. This provides a smooth, meaningful gradient even when the distributions are far apart — fixing the vanishing gradient problem.

'''Conditional generation''' lets you control what the GAN produces. By feeding both G and D a conditioning signal (e.g., a class label "cat" or a source image), the generator learns to produce outputs matching that condition, enabling text-to-image, image-to-image, and class-conditional generation.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Building a simple DCGAN for MNIST digit generation:'''

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Generator: noise z → fake image
class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 784),   # 28×28 image
            nn.Tanh()              # Output range [-1, 1]
        )
    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Discriminator: image → real/fake probability
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        return self.net(x)

G = Generator(); D = Discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

def train_step(real_images, latent_dim=100):
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Train Discriminator
    z = torch.randn(batch_size, latent_dim)
    fake_images = G(z).detach()        # detach: don't backprop into G yet
    loss_D = criterion(D(real_images), real_labels) + \
             criterion(D(fake_images), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train Generator
    z = torch.randn(batch_size, latent_dim)
    loss_G = criterion(D(G(z)), real_labels)  # G wants D to say "real"
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    return loss_D.item(), loss_G.item()
</syntaxhighlight>

; GAN application landscape
: '''Face generation''' → StyleGAN3 (NVIDIA) — photorealistic faces at 1024px
: '''Image-to-image''' → Pix2Pix (paired), CycleGAN (unpaired)
: '''Super resolution''' → SRGAN, ESRGAN — upscale low-res images
: '''Video synthesis''' → Vid2Vid, StyleGAN-V
: '''Data augmentation''' → Generate synthetic training data for rare classes
: '''Medical imaging''' → Synthesize rare pathology images for training classifiers
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ GAN vs. Diffusion Models vs. VAEs
! Property !! GAN !! Diffusion Model !! VAE
|-
| Sample quality || Very high (when stable) || State-of-the-art || Moderate (blurry)
|-
| Training stability || Poor (adversarial) || Stable || Stable
|-
| Mode coverage || Poor (mode collapse) || Excellent || Good
|-
| Sampling speed || Very fast (single forward pass) || Slow (many steps) || Fast
|-
| Latent space quality || Good (disentangled in StyleGAN) || Implicit || Explicit, smooth
|-
| Controllability || Moderate (cGAN) || High (guidance scale) || Moderate
|}

'''Failure modes:'''
* '''Mode collapse''' — G always generates the same output (e.g., always "8" for digit generation). Fix: mini-batch discrimination, Wasserstein loss, spectral normalization.
* '''Training oscillation''' — Loss curves oscillate wildly; G and D never converge. Fix: reduce learning rates, increase batch size, gradient penalty (WGAN-GP).
* '''Checkerboard artifacts''' — Upsampling with transposed convolutions creates grid-pattern artifacts. Fix: bilinear upsampling followed by standard convolution.
* '''Discriminator overfitting''' — D memorizes training data rather than learning a general real/fake boundary. Fix: discriminator dropout, data augmentation on real samples.
* '''Evaluation metric gaming''' — Optimizing FID specifically (rather than actual quality) can produce high FID-scoring but visually poor images. Use multiple metrics.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert GAN evaluation is multi-faceted:

'''FID (Fréchet Inception Distance)''': Computes the Fréchet distance between the distribution of Inception v3 features for 50k real and 50k generated samples. Lower is better. FID captures both quality (sharpness, realism) and diversity (mode coverage). It is the standard metric but has known limitations: it is sensitive to the number of samples and the pre-trained feature extractor.

'''Precision and Recall for generative models''': Kynkäänniemi et al. (2019) proposed separate precision (sample quality) and recall (mode coverage) metrics. A GAN with high precision but low recall has mode collapse. This is more informative than FID alone.

'''IS (Inception Score)''': Measures both quality (samples should be classifiable) and diversity (class distribution should be uniform). Less reliable than FID because it doesn't compare to real data.

'''Perceptual user studies''': Human raters are shown real and generated images; measure discrimination accuracy (lower = more realistic). This remains the gold standard for applications where human perception is the target.

Expert practitioners also perform '''interpolation tests''': sample two latent vectors z1 and z2, interpolate between them, and verify that the generated images transition smoothly and meaningfully — indicating a well-structured latent space rather than memorization.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a GAN-based image synthesis system:

'''1. Architecture selection by task'''
<syntaxhighlight lang="text">
Task classification:
├── Unconditional image synthesis → StyleGAN3
├── Class-conditional generation → BigGAN, StyleGAN-XL
├── Text-to-image (GAN-based) → GigaGAN
├── Paired image translation → Pix2Pix (paired data)
├── Unpaired image translation → CycleGAN (no paired data)
├── Super resolution → ESRGAN
└── Video generation → MoCoGAN, DIGAN
</syntaxhighlight>

'''2. Training stabilization recipe (for custom GAN)'''
* Use WGAN-GP or StyleGAN's R1 gradient penalty loss
* Spectral normalization on discriminator weights
* LeakyReLU (0.2) in discriminator; ReLU in generator
* Adam optimizer with β1=0.5, β2=0.999; learning rate 1e-4 to 2e-4
* Exponential moving average (EMA) of generator weights for smoother evaluation
* Progressive growing or batch size ramp-up for high-resolution targets

'''3. Data preparation'''
<syntaxhighlight lang="text">
Collect dataset (minimum 10k images; 100k+ for high quality)
    ↓
Crop + align (for face generation: align to landmarks)
    ↓
Resize to target resolution (power of 2: 64, 128, 256, 512, 1024)
    ↓
Normalize to [-1, 1] (matches Tanh output activation)
    ↓
[Optional] ADA (Adaptive Discriminator Augmentation) for small datasets
</syntaxhighlight>

'''4. Monitoring training health'''
* Plot G loss and D loss separately; they should remain in rough balance
* Sample fixed noise vectors (z_fixed) each epoch → visualize how G evolves
* Compute FID every 5k–10k iterations on 10k samples
* Alert if G loss spikes dramatically (mode collapse indicator)

[[Category:Artificial Intelligence]]
[[Category:Deep Learning]]
[[Category:Generative AI]]
</div>