Synthetic Data Generation

== <span style="color: #FFFFFF;">Understanding</span> ==
**Why synthetic data?** Four main use cases:

**Privacy**: Healthcare, finance, and legal data are highly sensitive. Synthetic data can preserve statistical patterns without containing real patient or customer records, which enables sharing, collaboration, and ML development with lower regulatory risk. The protection is not automatic: a generator that memorizes its training data can still leak real records.

**Data scarcity**: Some events are rare: industrial faults, rare diseases, fraud patterns, crash scenarios. Real datasets may contain only dozens of examples. A generative model or simulator can produce thousands of realistic rare-event examples for training.

**Data augmentation**: Standard image augmentation applies random crops, flips, and color jitter to existing images. Modern approaches use diffusion models to generate entirely new training images, expanding the effective dataset size beyond what transformations of existing images can provide.
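
The classic transformations can be sketched in a few lines. The following is a minimal, dependency-free illustration, assuming a grayscale image stored as a list of rows with pixel values in [0, 1]; the function names and the use of brightness jitter as a stand-in for color jitter are assumptions made for this example, not any particular library's API. Real pipelines would normally use an image library instead.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a crop_h x crop_w window taken at a random position."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def horizontal_flip(image):
    """Mirror every row left-to-right."""
    return [row[::-1] for row in image]


def brightness_jitter(image, max_delta, rng):
    """Shift all pixels by one random offset, clamped to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]


def augment(image, crop_h, crop_w, rng, flip_prob=0.5, max_delta=0.1):
    """Apply crop, optional flip, and brightness jitter to one image."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return brightness_jitter(out, max_delta, rng)
```

Calling `augment(image, 3, 3, random.Random(0))` repeatedly with different seeds yields different views of the same image. Every output is still derived from an existing image, which is the limitation that generative approaches aim to remove.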

**Simulation**: Autonomous vehicle companies generate billions of synthetic driving scenarios from physics simulators (CARLA, AirSim) to train perception and planning models, because collecting every relevant scenario in the real world is impossible.

**The fidelity-privacy-utility triangle**: You cannot simultaneously maximize all three. High-fidelity synthetic data closely resembles the original, but may expose private information. Applying differential privacy (DP) to synthesis provides a formal, quantifiable privacy guarantee, but the added noise reduces fidelity and utility. Finding the right operating point for a specific use case is the key challenge.
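
To make the trade-off concrete, here is a minimal sketch of one simple DP synthesis scheme for a single categorical column: add Laplace noise to the histogram counts, then sample synthetic values from the noisy histogram. The function names are hypothetical, and the sketch assumes that the category list is public and fixed in advance and that neighboring datasets differ by adding or removing one record (so each count has sensitivity 1). It is an illustration of the idea, not a production-grade mechanism.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_histogram(values, categories, epsilon, rng):
    """Noisy counts per category; smaller epsilon means more noise."""
    counts = Counter(values)
    scale = 1.0 / epsilon  # assumes sensitivity 1 (add/remove one record)
    return {
        c: max(0.0, counts.get(c, 0) + laplace_noise(scale, rng))
        for c in categories  # iterate the public list, not the observed values
    }


def sample_synthetic(noisy_counts, n, rng):
    """Sample n synthetic values in proportion to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    if sum(weights) == 0:  # all counts clipped to zero: fall back to uniform
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n)
```

Clipping and sampling are post-processing steps, so they do not weaken the guarantee obtained from the noisy counts. Lowering `epsilon` strengthens privacy but pushes the sampled distribution further from the real one, which is the fidelity and utility cost described above.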

**Evaluation gap**: A common failure mode is synthetic data that looks statistically similar to the real data but fails as training data. Low-order statistics (means, correlations) may match while higher-order structure (rare combinations, causal relationships) does not. Always evaluate with TSTR (Train on Synthetic, Test on Real): does a model trained on synthetic data achieve test performance on real data comparable to a model trained on real data?
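
A TSTR check (train on synthetic, test on real) can be sketched as follows. The example trains the same model twice, once on real data and once on synthetic data, and scores both on a held-out real test set. The nearest-centroid classifier and all function names are assumptions chosen to keep the sketch dependency-free; in practice you would substitute the model you actually intend to deploy.

```python
def fit_centroids(X, y):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of examples classified correctly."""
    correct = sum(predict(centroids, f) == label for f, label in zip(X, y))
    return correct / len(y)


def tstr_report(real_train, synthetic_train, real_test):
    """Compare train-on-real and train-on-synthetic, both tested on real data."""
    X_test, y_test = real_test
    trtr = accuracy(fit_centroids(*real_train), X_test, y_test)
    tstr = accuracy(fit_centroids(*synthetic_train), X_test, y_test)
    return {"trtr": trtr, "tstr": tstr, "gap": trtr - tstr}
```

Each dataset is passed as an `(X, y)` pair. A small `gap` suggests the synthetic data carries the signal the model needs, while a large gap indicates that matching summary statistics did not translate into usable training data.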
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">