<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://bloomwiki.org/index.php?action=history&amp;feed=atom&amp;title=Synthetic_Data_Generation</id>
	<title>Synthetic Data Generation - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://bloomwiki.org/index.php?action=history&amp;feed=atom&amp;title=Synthetic_Data_Generation"/>
	<link rel="alternate" type="text/html" href="http://bloomwiki.org/index.php?title=Synthetic_Data_Generation&amp;action=history"/>
	<updated>2026-05-06T14:17:10Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>http://bloomwiki.org/index.php?title=Synthetic_Data_Generation&amp;diff=4861&amp;oldid=prev</id>
		<title>Wordpad: BloomWiki: Synthetic Data Generation</title>
		<link rel="alternate" type="text/html" href="http://bloomwiki.org/index.php?title=Synthetic_Data_Generation&amp;diff=4861&amp;oldid=prev"/>
		<updated>2026-04-25T01:58:51Z</updated>

		<summary type="html">&lt;p&gt;BloomWiki: Synthetic Data Generation&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 01:58, 25 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style=&quot;background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{BloomIntro}}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{BloomIntro}}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Synthetic data generation is the process of programmatically creating artificial data that mimics the statistical properties of real-world data, without containing actual sensitive records. As AI systems require ever-larger training datasets while privacy regulations restrict data sharing, synthetic data has become a critical tool. It enables training on data that doesn&amp;#039;t exist yet (rare events, edge cases), augmenting scarce real data, testing AI systems safely, and sharing datasets across organizations without exposing private information.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Synthetic data generation is the process of programmatically creating artificial data that mimics the statistical properties of real-world data, without containing actual sensitive records. As AI systems require ever-larger training datasets while privacy regulations restrict data sharing, synthetic data has become a critical tool. It enables training on data that doesn&amp;#039;t exist yet (rare events, edge cases), augmenting scarce real data, testing AI systems safely, and sharing datasets across organizations without exposing private information.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Remembering ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;__TOC__&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Remembering&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Synthetic data&amp;#039;&amp;#039;&amp;#039; — Artificially generated data that preserves statistical properties of real data without containing actual records.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Synthetic data&amp;#039;&amp;#039;&amp;#039; — Artificially generated data that preserves statistical properties of real data without containing actual records.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Data augmentation&amp;#039;&amp;#039;&amp;#039; — Creating modified versions of existing samples (rotations, flips, noise) to expand training data; a simple form of synthetic data.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Data augmentation&amp;#039;&amp;#039;&amp;#039; — Creating modified versions of existing samples (rotations, flips, noise) to expand training data; a simple form of synthetic data.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot;&gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 22:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Train-on-synthetic, test-on-real (TSTR)&amp;#039;&amp;#039;&amp;#039; — The standard utility evaluation: train a model on synthetic data, test on real data.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Train-on-synthetic, test-on-real (TSTR)&amp;#039;&amp;#039;&amp;#039; — The standard utility evaluation: train a model on synthetic data, test on real data.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Simulation-based synthesis&amp;#039;&amp;#039;&amp;#039; — Generating synthetic data from physics or domain simulations rather than generative models.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Simulation-based synthesis&amp;#039;&amp;#039;&amp;#039; — Generating synthetic data from physics or domain simulations rather than generative models.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Understanding ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Understanding&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;**Why synthetic data?** Four main use cases:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;**Why synthetic data?** Four main use cases:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l32&quot;&gt;Line 32:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 39:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;**Evaluation gap**: A common failure mode — synthetic data looks statistically similar but fails as training data. Low-order statistics (means, correlations) may match while high-order structure (rare combinations, causal relationships) does not. Always evaluate with TSTR: does a model trained on synthetic data achieve comparable test performance on real data?&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;**Evaluation gap**: A common failure mode — synthetic data looks statistically similar but fails as training data. Low-order statistics (means, correlations) may match while high-order structure (rare combinations, causal relationships) does not. Always evaluate with TSTR: does a model trained on synthetic data achieve comparable test performance on real data?&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Applying ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Applying&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Tabular synthetic data generation with SDV:&amp;#039;&amp;#039;&amp;#039;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Tabular synthetic data generation with SDV:&amp;#039;&amp;#039;&amp;#039;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l87&quot;&gt;Line 87:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 96:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;: &amp;#039;&amp;#039;&amp;#039;Text&amp;#039;&amp;#039;&amp;#039; → LLM generation with domain prompts, backtranslation&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;: &amp;#039;&amp;#039;&amp;#039;Text&amp;#039;&amp;#039;&amp;#039; → LLM generation with domain prompts, backtranslation&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;: &amp;#039;&amp;#039;&amp;#039;Simulation&amp;#039;&amp;#039;&amp;#039; → CARLA (autonomous driving), NVIDIA Isaac (robotics), Unity ML-Agents&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;: &amp;#039;&amp;#039;&amp;#039;Simulation&amp;#039;&amp;#039;&amp;#039; → CARLA (autonomous driving), NVIDIA Isaac (robotics), Unity ML-Agents&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Analyzing ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Analyzing&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{| class=&amp;quot;wikitable&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{| class=&amp;quot;wikitable&amp;quot;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|+ Synthetic Data Approach Tradeoffs&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|+ Synthetic Data Approach Tradeoffs&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l105&quot;&gt;Line 105:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 116:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Failure modes&amp;#039;&amp;#039;&amp;#039;: Mode collapse in GAN synthesis — the generator produces limited variety. Privacy leakage — even without direct record memorization, synthetic data can allow membership inference. Distribution mismatch — synthetic data looks similar in aggregate but fails on edge cases the model learned from real data. Overfitting to synthetic data — model learns synthetic-data-specific artifacts.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Failure modes&amp;#039;&amp;#039;&amp;#039;: Mode collapse in GAN synthesis — the generator produces limited variety. Privacy leakage — even without direct record memorization, synthetic data can allow membership inference. Distribution mismatch — synthetic data looks similar in aggregate but fails on edge cases the model learned from real data. Overfitting to synthetic data — model learns synthetic-data-specific artifacts.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Evaluating ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Evaluating&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Comprehensive synthetic data evaluation: (1) **Statistical fidelity**: column-level distributions (KS test), pairwise correlations, mutual information. (2) **Utility (TSTR)**: train on synthetic, test on real — compare to train-on-real baseline. Good synthetic data achieves &amp;gt;90% of real-data model performance. (3) **Privacy**: run membership inference attacks; ensure attack success rate ≈ random guess. (4) **Rare event coverage**: verify rare categories and edge cases are proportionally represented. (5) **Domain expert review**: have subject matter experts inspect samples for plausibility.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Comprehensive synthetic data evaluation: (1) **Statistical fidelity**: column-level distributions (KS test), pairwise correlations, mutual information. (2) **Utility (TSTR)**: train on synthetic, test on real — compare to train-on-real baseline. Good synthetic data achieves &amp;gt;90% of real-data model performance. (3) **Privacy**: run membership inference attacks; ensure attack success rate ≈ random guess. (4) **Rare event coverage**: verify rare categories and edge cases are proportionally represented. (5) **Domain expert review**: have subject matter experts inspect samples for plausibility.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Creating ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;div style&lt;/ins&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&quot;background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;&quot;&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;= &amp;lt;span style=&quot;color: #FFFFFF;&quot;&amp;gt;&lt;/ins&gt;Creating&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/span&amp;gt; &lt;/ins&gt;==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Designing a synthetic data pipeline: (1) Characterize real data: distributions, correlations, constraints (valid ranges, foreign keys). (2) Choose method: CTGAN for tabular, diffusion for images, TimeGAN for sequences. (3) Apply constraints: post-generation filtering to enforce domain rules (age ≥ 0, income &amp;gt; 0, valid date ranges). (4) Evaluate fidelity and TSTR on held-out real data split. (5) Privacy audit: run membership inference; if needed, apply DP noise (SDMetrics dpsynth). (6) Data validation: automated schema checks before synthetic data enters any training pipeline.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Designing a synthetic data pipeline: (1) Characterize real data: distributions, correlations, constraints (valid ranges, foreign keys). (2) Choose method: CTGAN for tabular, diffusion for images, TimeGAN for sequences. (3) Apply constraints: post-generation filtering to enforce domain rules (age ≥ 0, income &amp;gt; 0, valid date ranges). (4) Evaluate fidelity and TSTR on held-out real data split. (5) Privacy audit: run membership inference; if needed, apply DP noise (SDMetrics dpsynth). (6) Data validation: automated schema checks before synthetic data enters any training pipeline.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l115&quot;&gt;Line 115:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 130:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Synthetic Data]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Synthetic Data]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff:1.41:old-167:rev-4861:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>Wordpad</name></author>
	</entry>
	<entry>
		<id>http://bloomwiki.org/index.php?title=Synthetic_Data_Generation&amp;diff=167&amp;oldid=prev</id>
		<title>Wordpad: New BloomWiki article: Synthetic Data Generation</title>
		<link rel="alternate" type="text/html" href="http://bloomwiki.org/index.php?title=Synthetic_Data_Generation&amp;diff=167&amp;oldid=prev"/>
		<updated>2026-04-23T08:12:40Z</updated>

		<summary type="html">&lt;p&gt;New BloomWiki article: Synthetic Data Generation&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{BloomIntro}}&lt;br /&gt;
Synthetic data generation is the process of programmatically creating artificial data that mimics the statistical properties of real-world data, without containing actual sensitive records. As AI systems require ever-larger training datasets while privacy regulations restrict data sharing, synthetic data has become a critical tool. It enables training on data that doesn&amp;#039;t exist yet (rare events, edge cases), augmenting scarce real data, testing AI systems safely, and sharing datasets across organizations without exposing private information.&lt;br /&gt;
&lt;br /&gt;
== Remembering ==&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Synthetic data&amp;#039;&amp;#039;&amp;#039; — Artificially generated data that preserves statistical properties of real data without containing actual records.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Data augmentation&amp;#039;&amp;#039;&amp;#039; — Creating modified versions of existing samples (rotations, flips, noise) to expand training data; a simple form of synthetic data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Generative model&amp;#039;&amp;#039;&amp;#039; — A model that learns the distribution of training data and can sample new examples from it.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;GAN (for synthesis)&amp;#039;&amp;#039;&amp;#039; — Using Generative Adversarial Networks to create realistic synthetic tabular, image, or text data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Variational Autoencoder (VAE)&amp;#039;&amp;#039;&amp;#039; — A generative model that encodes data to a latent distribution and samples new data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Differential privacy (DP)&amp;#039;&amp;#039;&amp;#039; — A mathematical guarantee that synthetic data cannot be used to identify individuals in the original dataset.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Fidelity&amp;#039;&amp;#039;&amp;#039; — How statistically similar the synthetic data is to the real data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Utility&amp;#039;&amp;#039;&amp;#039; — How useful the synthetic data is for training models (does a model trained on synthetic data work on real data?).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Privacy (for synthetic data)&amp;#039;&amp;#039;&amp;#039; — The degree to which the synthetic data protects the privacy of individuals in the original dataset.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Membership inference attack&amp;#039;&amp;#039;&amp;#039; — A privacy attack testing whether a specific record was in the training data; used to evaluate synthetic data privacy.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;CTGAN&amp;#039;&amp;#039;&amp;#039; — Conditional Tabular GAN; one of the most widely used methods for synthetic tabular data generation.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;SDV (Synthetic Data Vault)&amp;#039;&amp;#039;&amp;#039; — An open-source library for synthetic tabular data generation using multiple methods.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Train-on-synthetic, test-on-real (TSTR)&amp;#039;&amp;#039;&amp;#039; — The standard utility evaluation: train a model on synthetic data, test on real data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Simulation-based synthesis&amp;#039;&amp;#039;&amp;#039; — Generating synthetic data from physics or domain simulations rather than generative models.&lt;br /&gt;
&lt;br /&gt;
== Understanding ==&lt;br /&gt;
**Why synthetic data?** Four main use cases:&lt;br /&gt;
&lt;br /&gt;
**Privacy**: Healthcare, finance, and legal data are highly sensitive. Synthetic data preserves statistical patterns without containing real patient or customer records, enabling sharing, collaboration, and ML development without regulatory risk.&lt;br /&gt;
&lt;br /&gt;
**Data scarcity**: Some events are rare — industrial faults, rare diseases, fraud patterns, crash scenarios. Real datasets may contain only dozens of examples. Synthetic data can generate thousands of realistic rare-event examples for training.&lt;br /&gt;
&lt;br /&gt;
**Data augmentation**: Standard image training uses random crops, flips, and color jitter. Modern approaches use diffusion models to generate entirely new training images, dramatically expanding effective dataset size.&lt;br /&gt;
&lt;br /&gt;
**Simulation**: Autonomous vehicle companies generate billions of synthetic driving scenarios from physics simulators (CARLA, AirSim) to train perception and planning models — impossible to collect all scenarios in the real world.&lt;br /&gt;
&lt;br /&gt;
**The fidelity-privacy-utility triangle**: You cannot simultaneously maximize all three. High-fidelity synthetic data closely resembles the original — but may expose private information. Applying differential privacy (DP) to synthesis guarantees privacy but reduces fidelity and utility. Finding the optimal operating point for a specific use case is the key challenge.&lt;br /&gt;
&lt;br /&gt;
**Evaluation gap**: A common failure mode — synthetic data looks statistically similar but fails as training data. Low-order statistics (means, correlations) may match while high-order structure (rare combinations, causal relationships) does not. Always evaluate with TSTR: does a model trained on synthetic data achieve comparable test performance on real data?&lt;br /&gt;
&lt;br /&gt;
== Applying ==&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Tabular synthetic data generation with SDV:&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
from sdv.tabular import CTGAN&lt;br /&gt;
from sdv.evaluation import evaluate&lt;br /&gt;
from sdv.metrics.tabular import KSComplement, TVComplement&lt;br /&gt;
&lt;br /&gt;
# Real data (e.g., customer transactions)&lt;br /&gt;
real_data = pd.read_csv(&amp;quot;customer_data.csv&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
# Fit CTGAN generative model&lt;br /&gt;
model = CTGAN(&lt;br /&gt;
    epochs=300,&lt;br /&gt;
    batch_size=500,&lt;br /&gt;
    generator_dim=(256, 256),&lt;br /&gt;
    discriminator_dim=(256, 256),&lt;br /&gt;
)&lt;br /&gt;
model.fit(real_data)&lt;br /&gt;
&lt;br /&gt;
# Generate synthetic dataset&lt;br /&gt;
synthetic_data = model.sample(num_rows=10000)&lt;br /&gt;
print(synthetic_data.head())&lt;br /&gt;
&lt;br /&gt;
# Evaluate fidelity (statistical similarity)&lt;br /&gt;
results = evaluate(&lt;br /&gt;
    synthetic_data, real_data,&lt;br /&gt;
    metrics=[KSComplement, TVComplement],&lt;br /&gt;
    aggregate=False&lt;br /&gt;
)&lt;br /&gt;
print(results)&lt;br /&gt;
&lt;br /&gt;
# TSTR evaluation: train classifier on synthetic, test on real&lt;br /&gt;
from sklearn.ensemble import GradientBoostingClassifier&lt;br /&gt;
from sklearn.metrics import accuracy_score&lt;br /&gt;
&lt;br /&gt;
target = &amp;#039;churn&amp;#039;&lt;br /&gt;
X_syn = synthetic_data.drop(target, axis=1)&lt;br /&gt;
y_syn = synthetic_data[target]&lt;br /&gt;
X_real_test = real_data.drop(target, axis=1)&lt;br /&gt;
y_real_test = real_data[target]&lt;br /&gt;
&lt;br /&gt;
clf = GradientBoostingClassifier().fit(X_syn, y_syn)&lt;br /&gt;
score = accuracy_score(y_real_test, clf.predict(X_real_test))&lt;br /&gt;
print(f&amp;quot;TSTR accuracy: {score:.3f}&amp;quot;)  # Compare against train-on-real baseline&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
; Synthetic data method by data type&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Tabular&amp;#039;&amp;#039;&amp;#039; → CTGAN, GaussianCopula, TVAE (SDV library)&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Images (augmentation)&amp;#039;&amp;#039;&amp;#039; → Albumentations, torchvision.transforms&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Images (generative)&amp;#039;&amp;#039;&amp;#039; → Stable Diffusion inpainting, DreamBooth for novel objects&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Time series&amp;#039;&amp;#039;&amp;#039; → TimeGAN, PAR (SDV), Temporal Fusion Transformer synthesis&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Text&amp;#039;&amp;#039;&amp;#039; → LLM generation with domain prompts, backtranslation&lt;br /&gt;
: &amp;#039;&amp;#039;&amp;#039;Simulation&amp;#039;&amp;#039;&amp;#039; → CARLA (autonomous driving), NVIDIA Isaac (robotics), Unity ML-Agents&lt;br /&gt;
&lt;br /&gt;
== Analyzing ==&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+ Synthetic Data Approach Tradeoffs&lt;br /&gt;
! Method !! Fidelity !! Privacy !! Utility || Complexity&lt;br /&gt;
|-&lt;br /&gt;
| Rule-based simulation || Medium || High || High (domain-specific) || High&lt;br /&gt;
|-&lt;br /&gt;
| Statistical sampling || Low-medium || High || Low-medium || Low&lt;br /&gt;
|-&lt;br /&gt;
| CTGAN/TVAE || High || Medium || High || Medium&lt;br /&gt;
|-&lt;br /&gt;
| Diffusion model || Very high || Low || Very high || High&lt;br /&gt;
|-&lt;br /&gt;
| DP-GAN || Medium || Guaranteed (ε,δ) || Lower || High&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Failure modes&amp;#039;&amp;#039;&amp;#039;: Mode collapse in GAN synthesis — the generator produces limited variety. Privacy leakage — even without direct record memorization, synthetic data can allow membership inference. Distribution mismatch — synthetic data looks similar in aggregate but fails on edge cases the model learned from real data. Overfitting to synthetic data — model learns synthetic-data-specific artifacts.&lt;br /&gt;
&lt;br /&gt;
== Evaluating ==&lt;br /&gt;
Comprehensive synthetic data evaluation: (1) **Statistical fidelity**: column-level distributions (KS test), pairwise correlations, mutual information. (2) **Utility (TSTR)**: train on synthetic, test on real — compare to train-on-real baseline. Good synthetic data achieves &amp;gt;90% of real-data model performance. (3) **Privacy**: run membership inference attacks; ensure attack success rate ≈ random guess. (4) **Rare event coverage**: verify rare categories and edge cases are proportionally represented. (5) **Domain expert review**: have subject matter experts inspect samples for plausibility.&lt;br /&gt;
&lt;br /&gt;
== Creating ==&lt;br /&gt;
Designing a synthetic data pipeline: (1) Characterize real data: distributions, correlations, constraints (valid ranges, foreign keys). (2) Choose method: CTGAN for tabular, diffusion for images, TimeGAN for sequences. (3) Apply constraints: post-generation filtering to enforce domain rules (age ≥ 0, income &amp;gt; 0, valid date ranges). (4) Evaluate fidelity and TSTR on held-out real data split. (5) Privacy audit: run membership inference; if needed, apply DP noise (SDMetrics dpsynth). (6) Data validation: automated schema checks before synthetic data enters any training pipeline.&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial Intelligence]]&lt;br /&gt;
[[Category:Synthetic Data]]&lt;br /&gt;
[[Category:Machine Learning]]&lt;/div&gt;</summary>
		<author><name>Wordpad</name></author>
	</entry>
</feed>