Causal Inference

From BloomWiki

{{BloomIntro}}
Causal inference in AI is the study and application of methods for reasoning about cause-and-effect relationships, not merely statistical correlations. Traditional machine learning excels at finding patterns — "X correlates with Y" — but causal inference asks a different and deeper question: "Does X cause Y, and if I change X, what will happen to Y?" This distinction is crucial for decision-making, policy evaluation, fairness analysis, and building AI systems that can reason reliably about interventions in the world. Causal inference bridges statistics, computer science, economics, and philosophy.


== Remembering ==
* '''Causal inference''' — The branch of statistics concerned with identifying cause-and-effect relationships.
* '''Causation''' — A relationship where one event (the cause) brings about another event (the effect). Distinct from correlation.
* '''Correlation''' — A statistical association between two variables that does not imply causation ("correlation is not causation").
* '''Confounding variable''' — A hidden variable that influences both the apparent cause and effect, creating a spurious association (e.g., heat drives both ice cream sales and shark attacks).
* '''Intervention''' — Actively changing the value of a variable (rather than just observing it), denoted do(X=x) in do-calculus.
* '''Observational data''' — Data collected without intervening; correlations in observational data may not reflect causal relationships.
* '''Randomized Controlled Trial (RCT)''' — The gold standard for establishing causation: randomly assign units to treatment or control, then measure outcomes.
* '''Natural Experiment''' — A study where exposure to treatment and control conditions is determined by nature or other factors outside the investigators' control (e.g., a change in law in one state but not a neighboring one).
* '''Counterfactual''' — A hypothetical: "What would have happened if X had been different?" e.g., "Would this patient have survived if they had received the drug?"
* '''Potential outcomes framework''' — A formalization of causal inference using Y(1) (outcome if treated) and Y(0) (outcome if not treated) for each unit.
* '''Average Treatment Effect (ATE)''' — The average causal effect of a treatment across a population: E[Y(1) - Y(0)].
* '''DAG (Directed Acyclic Graph)''' — A graphical model where nodes are variables and directed edges represent causal relationships; no cycles.
* '''Backdoor criterion''' — A graphical criterion for identifying which variables to condition on to block spurious correlations (confounding paths) in a causal DAG.
* '''Do-calculus''' — A set of rules (developed by Judea Pearl) for computing the effect of interventions from observational data and a causal DAG.
* '''Instrumental variable''' — A variable that affects the treatment but has no direct effect on the outcome except through the treatment; used to estimate causal effects when confounding is present.
* '''Propensity Score Matching''' — A technique that estimates a treatment effect by matching treated and untreated units on the covariates that predict receiving the treatment.
* '''Causal discovery''' — Algorithms for inferring causal structure (the DAG) from observational data.
* '''Selection bias''' — Bias arising when the sample used for analysis is not representative of the population of interest (e.g., people who take vitamins are already more health-conscious).


== Understanding ==
The fundamental problem of causal inference is what statistician Donald Rubin called the '''Fundamental Problem of Causal Inference''': we can never observe both potential outcomes for the same unit at the same time. Either a patient received the drug (Y(1) observed, Y(0) unobserved) or they didn't (Y(0) observed, Y(1) unobserved). We can never know what would have happened to the same person under the alternative treatment — the counterfactual.
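A small simulation makes the missing-data character of the problem concrete; only in synthetic data can we peek at both potential outcomes (all names and numbers below are illustrative):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 5
y0 = rng.normal(10, 2, n)        # potential outcome without treatment
y1 = y0 + 3.0                    # potential outcome with treatment (true effect: +3)
treated = rng.integers(0, 2, n)  # coin-flip treatment assignment

observed = np.where(treated == 1, y1, y0)
counterfactual = np.where(treated == 1, y0, y1)  # never observable outside a simulation
for t, obs, cf in zip(treated, observed, counterfactual):
    print(f"treated={t}  observed={obs:5.1f}  counterfactual={cf:5.1f}")
</syntaxhighlight>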


Judea Pearl's '''Ladder of Causation''' describes three levels of causal reasoning:
# '''Association''' (rung 1): "What is?" — Observing and predicting correlations. Standard ML lives here.
# '''Intervention''' (rung 2): "What if I do X?" — Reasoning about the effect of deliberate actions. Requires a causal model.
# '''Counterfactual''' (rung 3): "What if I had done X instead?" — Imagining alternate histories. Requires a complete structural causal model.


Most ML systems operate only on rung 1. To make reliable decisions and avoid discrimination, AI systems often need rung 2 or 3.
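A minimal sketch of the gap between rung 1 and rung 2, using the backdoor adjustment formula P(y | do(x)) = Σ_z P(y | x, z) · P(z) on illustrative synthetic data:

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000
z = rng.integers(0, 2, n)                                      # binary confounder
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)   # z pushes x up
y = (rng.random(n) < 0.2 + 0.2 * x + 0.5 * z).astype(int)      # both x and z raise y
df = pd.DataFrame({"z": z, "x": x, "y": y})

# Rung 1 (seeing): P(y=1 | x=1) is inflated, because x=1 units mostly have z=1
print(df[df.x == 1].y.mean())          # about 0.80

# Rung 2 (doing): P(y=1 | do(x=1)) via backdoor adjustment over z
p_do = sum(df[(df.x == 1) & (df.z == v)].y.mean() * (df.z == v).mean()
           for v in (0, 1))
print(p_do)                            # about 0.65, the true interventional probability
</syntaxhighlight>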


'''Why this matters for AI''':
* '''Spurious correlations''': A model that classifies "pneumonia" as lower risk may have learned that pneumonia patients sent to the ICU have lower final mortality — confusing treatment effect with baseline risk.
* '''Fairness''': Is a model discriminating based on race, or is it using variables that are correlated with race but causally related to the outcome? Causal fairness criteria give precise answers.
* '''Policy decisions''': If we deploy an AI to recommend interventions, we must understand the causal effect of those interventions — not just their correlation with past outcomes.
* '''Robustness''': Models that learn causal relationships rather than spurious correlations generalize better when the environment changes (distribution shift).


== Applying ==
'''Simulating a Confounder (Spurious Correlation):'''

<syntaxhighlight lang="python">
import numpy as np

def simulate_spurious_correlation(n_samples):
    """
    Shows how 'Heat' causes both ice cream sales and shark attacks.
    Without knowing about 'Heat', we might think ice cream
    causes shark attacks.
    """
    # The confounder (the true cause)
    heat = np.random.normal(25, 5, n_samples)

    # Effects of the confounder
    ice_cream = 2 * heat + np.random.normal(0, 2, n_samples)
    sharks = 0.5 * heat + np.random.normal(0, 1, n_samples)

    # Correlation between ice cream and sharks
    return np.corrcoef(ice_cream, sharks)[0, 1]

print(f"Correlation between ice cream and sharks: {simulate_spurious_correlation(1000):.3f}")
# This is a 'spurious' correlation: controlling for 'Heat'
# would bring it to near zero.
</syntaxhighlight>

'''Estimating causal treatment effects with DoWhy:'''

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Generate synthetic data: drug → recovery, age → both drug and recovery (confounder)
np.random.seed(42)
n = 1000
age = np.random.normal(50, 15, n)
drug = (0.3 * age + np.random.normal(0, 10, n) > 30).astype(int)  # older → more likely to receive drug
recovery = 0.5 * drug - 0.02 * age + np.random.normal(0, 1, n)    # drug helps, but age hurts
df = pd.DataFrame({'age': age, 'drug': drug, 'recovery': recovery})

# Step 1: Define the causal model as a DAG
model = CausalModel(
    data=df,
    treatment="drug",
    outcome="recovery",
    common_causes=["age"],  # age is a confounder
)

# Step 2: Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

# Step 3: Estimate the causal effect (controlling for age via backdoor adjustment)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    control_value=0,
    treatment_value=1,
)
print(f"Estimated ATE: {estimate.value:.3f}")
# Should recover ~0.5 (the true causal effect)

# Naive difference in means, ignoring confounding:
naive_diff = df[df.drug == 1]['recovery'].mean() - df[df.drug == 0]['recovery'].mean()
print(f"Naive (biased) difference: {naive_diff:.3f}")
# Biased because older people receive the drug more often but recover less

# Step 4: Refute the estimate (robustness checks)
refutation = model.refute_estimate(estimate, method_name="random_common_cause")
print(refutation)  # A good estimate barely changes when a random confounder is added
</syntaxhighlight>


; Causal inference methods by scenario
: '''RCT available''' → Compute difference in means (no adjustment needed — randomization handles confounding)
: '''Observational, confounders known''' → Propensity score matching, IPW, doubly-robust estimators (see the IPW sketch after this list)
: '''Observational, confounders unknown''' → Instrumental variables (IV), regression discontinuity
: '''Time series, sequential treatments''' → G-computation, marginal structural models
: '''Mechanism questions''' → Mediation analysis (does X cause Y directly, or via a mediator M?)
: '''Causal structure unknown''' → Causal discovery: PC algorithm, FCI, LiNGAM, NOTEARS
: '''Heterogeneous effects''' → Causal forests, meta-learners (S-, T-, X-learner)
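As a sketch of the IPW option above, here is the textbook Horvitz-Thompson estimator applied to the synthetic drug/age data from the DoWhy example (variable names and model choice are illustrative):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same synthetic drug/age data as the DoWhy example above
rng = np.random.default_rng(42)
n = 1000
age = rng.normal(50, 15, n)
drug = (0.3 * age + rng.normal(0, 10, n) > 30).astype(int)
recovery = 0.5 * drug - 0.02 * age + rng.normal(0, 1, n)

# Step 1: estimate the propensity score e(x) = P(treated | age)
X = age.reshape(-1, 1)
e = LogisticRegression().fit(X, drug).predict_proba(X)[:, 1]

# Step 2: weight each unit by the inverse probability of the treatment
# it actually received (Horvitz-Thompson form of the ATE)
ate = np.mean(drug * recovery / e) - np.mean((1 - drug) * recovery / (1 - e))
print(f"IPW ATE estimate: {ate:.3f}")  # should land near the true effect, 0.5
# Note: treatment is rare in this data, so weights can be noisy;
# always check overlap (positivity) before trusting IPW.
</syntaxhighlight>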


== Analyzing ==
{| class="wikitable"
{| class="wikitable"
|+ Correlation vs. Causation
|+ Causal Inference Methods Comparison
! Feature !! Correlation !! Causation
! Method !! Assumptions !! When to Use !! Python Library
|-
| Regression adjustment || No unmeasured confounding, correct functional form || Known confounders, sufficient data || statsmodels, DoWhy
|-
| Propensity score matching || No unmeasured confounding || Binary treatment, observational data || DoWhy, CausalML
|-
|-
| Symmetry || Symmetric (If A correlates with B, B correlates with A) || Asymmetric (A causes B, but B doesn't necessarily cause A)
| Instrumental variables || Valid instrument exists || Hidden confounders, instrument available || DoWhy, linearmodels
|-
|-
| Prediction || Good for "What usually happens?" || Good for "What happens if I change things?"
| Difference-in-differences || Parallel trends assumption || Panel data, natural experiment || CausalPy, statsmodels
|-
|-
| Math || Covariance, Pearson's r || Do-calculus, DAGs, Structural Equations
| Causal forest || No unmeasured confounding || Heterogeneous treatment effects || EconML, GRF (R)
|-
|-
| Requirement || Observation || Intervention or clever identification
| Regression discontinuity || Local continuity at threshold || Sharp threshold in treatment assignment || RDD (R), DoWhy
|}
|}
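The difference-in-differences row above lends itself to a tiny worked example; the panel numbers below are made up purely for illustration:

<syntaxhighlight lang="python">
import pandas as pd

# Illustrative panel: outcome y for treated and control groups, before and after
df = pd.DataFrame({
    'group':  ['treat'] * 4 + ['control'] * 4,
    'period': ['pre', 'pre', 'post', 'post'] * 2,
    'y':      [10, 11, 15, 16, 9, 10, 11, 12],
})
means = df.groupby(['group', 'period'])['y'].mean()

# DiD: (treated change over time) minus (control change over time)
did = (means['treat', 'post'] - means['treat', 'pre']) \
    - (means['control', 'post'] - means['control', 'pre'])
print(did)  # 3.0: the treatment effect, under the parallel-trends assumption
</syntaxhighlight>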


'''Key pitfalls and failure modes:'''
* '''Conditioning on colliders''' — Incorrectly conditioning on a variable that is a common effect (not cause) of treatment and outcome opens spurious paths rather than blocking them. For example, among famous actors, acting talent and physical beauty can look negatively correlated, not because they are in the general population, but because either one suffices for fame (simulated in the sketch after this list). Using a DAG is essential to identify what to condition on.
* '''Positivity violation''' — If some subgroups never receive (or always receive) treatment, causal effects for those subgroups cannot be estimated from data. Check overlap in propensity score distributions.
* '''Model misspecification''' — Parametric methods (regression adjustment) assume a specific functional form. Use doubly-robust or non-parametric methods (causal forests) to reduce this risk.
* '''Weak instruments''' — IV estimation with a weak instrument (low correlation with treatment) produces large, unreliable estimates. Test for instrument strength (F-statistic > 10 rule of thumb).
* '''Extrapolation beyond support''' — Causal effect estimates are only reliable within the range of the observed data. Be cautious about extrapolating to new populations or intervention levels.
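The collider pitfall above can be reproduced in a few lines; the famous-actors setup below is purely illustrative:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
talent = rng.normal(size=n)
beauty = rng.normal(size=n)              # independent of talent by construction
famous = talent + beauty > 1.5           # collider: fame is caused by both

print(np.corrcoef(talent, beauty)[0, 1])                  # near 0 in the full population
print(np.corrcoef(talent[famous], beauty[famous])[0, 1])  # clearly negative among the famous
</syntaxhighlight>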


== Evaluating ==
Causal inference evaluation is uniquely challenging because we can never observe the true counterfactual. Evaluating a specific causal claim starts with its assumptions: (1) '''Exogeneity''': was the treatment assigned randomly (or "as-if" randomly)? (2) '''SUTVA''': does one unit's treatment affect another unit's outcome (spillover)? (3) '''Internal validity''': does the causal effect hold for the group studied? (4) '''External validity (transportability)''': will the effect transfer to a different city or a different decade? Evaluating estimation '''methods''' relies on several complementary tools:
 
'''Simulation studies (synthetic data)''': Generate data from a known causal model where the true ATE is known. Evaluate whether each estimator recovers the true ATE. This is the standard way to compare methods.
 
'''Semi-synthetic benchmarks''': Use real covariate data but simulate outcomes from a known causal model. ACIC (Atlantic Causal Inference Conference) benchmarks provide standardized challenges.
 
'''Sensitivity analysis''': Test how robust the estimate is to violations of key assumptions (e.g., unmeasured confounding). E-values quantify the minimum strength of unmeasured confounding that would overturn the result.
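The E-value has a simple closed form for an observed risk ratio; the snippet below sketches the VanderWeele & Ding (2017) formula for RR > 1 and is for intuition only:

<syntaxhighlight lang="python">
import math

def e_value(rr):
    """Minimum strength of association (risk-ratio scale) an unmeasured
    confounder must have with both treatment and outcome to fully
    explain away an observed risk ratio rr > 1."""
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))  # 3.0: a confounder would need to triple both risks
</syntaxhighlight>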
 
'''Refutation tests (DoWhy)''': Specific tests designed to detect estimation problems:
* Adding a random confounder shouldn't change a good estimate
* Replacing treatment with a random variable should produce ATE ≈ 0
* Placebo treatment (observed but causally irrelevant) should produce ATE ≈ 0
 
Expert practitioners present causal estimates with explicit assumption documentation, sensitivity analyses, and refutation test results — not just a point estimate. An ATE that fails refutation tests or is sensitive to unmeasured confounding should be reported with appropriate uncertainty.


== Creating ==
'''Future frontiers''': (1) '''Causal AI''': moving beyond pattern recognition (large language models) to causal reasoning systems that can answer "Why?" and "What if?". (2) '''Synthetic controls''': constructing a simulated control group for situations where no real control exists. (3) '''Causal discovery''': algorithms that infer the DAG (the map of arrows) from a dataset automatically. (4) '''Precision policy''': predicting which specific individual will benefit from a specific intervention (heterogeneous treatment effects).

Designing a causal inference analysis pipeline for business decision-making:
 
'''1. Problem formulation'''
<syntaxhighlight lang="text">
Define: What is the treatment? What is the outcome? What is the population?
    ↓
Draw the causal DAG: variables, causal paths, potential confounders
    ↓
Apply backdoor criterion: which variables block all confounding paths?
    ↓
Check identifiability: can the causal effect be estimated from available data?
    ↓
Assess data availability: are the required conditioning variables measured?
</syntaxhighlight>
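The backdoor step in this checklist can be mechanized. Below is a minimal networkx illustration on a hypothetical three-node DAG matching the drug example; the path-labeling logic is a teaching sketch, not a full backdoor-criterion implementation:

<syntaxhighlight lang="python">
import networkx as nx

# Hypothetical DAG: age confounds drug and recovery
g = nx.DiGraph([("age", "drug"), ("age", "recovery"), ("drug", "recovery")])
assert nx.is_directed_acyclic_graph(g)

# Enumerate undirected paths from treatment to outcome; a path is a
# backdoor path if its first edge points INTO the treatment
for path in nx.all_simple_paths(g.to_undirected(), "drug", "recovery"):
    kind = "backdoor" if g.has_edge(path[1], path[0]) else "causal"
    print(path, "->", kind)
# Conditioning on {age} blocks the backdoor path drug <- age -> recovery
</syntaxhighlight>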
 
'''2. Estimation pipeline'''
<syntaxhighlight lang="text">
Data collection: ensure confounders are measured, check positivity
    ↓
Covariate balance check: compare treatment/control distributions
    ↓
Propensity score modeling (if observational): logistic regression or GBM
    ↓
Effect estimation: doubly-robust AIPW or causal forest for heterogeneous effects
    ↓
Sensitivity analysis: E-value, Rosenbaum bounds
    ↓
Refutation tests: DoWhy refute suite
    ↓
Report: ATE ± CI, assumptions, sensitivity, subgroup effects
</syntaxhighlight>
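A minimal sketch of the doubly-robust AIPW step named in this pipeline, reusing the synthetic drug/age setup from the Applying section (model choices are illustrative, not prescriptive):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(42)
n = 5000
age = rng.normal(50, 15, n)
X = age.reshape(-1, 1)
t = (0.3 * age + rng.normal(0, 10, n) > 30).astype(int)
y = 0.5 * t - 0.02 * age + rng.normal(0, 1, n)

# Nuisance models: propensity e(x) and outcome regressions m1(x), m0(x)
e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
m1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
m0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)

# AIPW estimator: consistent if either the propensity model
# or the outcome model is correctly specified ("doubly robust")
ate = np.mean(m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e))
print(f"AIPW ATE: {ate:.3f}")  # should land near the true effect, 0.5
</syntaxhighlight>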
 
'''3. Causal ML in production (uplift modeling)'''
* Estimate heterogeneous treatment effects: which users benefit most from intervention?
* Use X-learner or causal forest to estimate CATE (Conditional ATE) per user (a minimal T-learner sketch follows this list)
* Target intervention to users with highest predicted CATE
* Monitor actual vs. predicted uplift in A/B tests post-deployment
* Continuously retrain causal model as new experimental data arrives
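Below is a minimal T-learner sketch of the CATE-based targeting described above; the T-learner is a simpler cousin of the X-learner, and the data and model choices are illustrative:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 4000
x = rng.uniform(-1, 1, (n, 1))           # user feature
t = rng.integers(0, 2, n)                # randomized intervention
tau = np.maximum(x[:, 0], 0)             # true CATE: only x > 0 users benefit
y = x[:, 0] + t * tau + rng.normal(0, 0.1, n)

# T-learner: fit separate outcome models for treated and control,
# then take the difference of predictions as the CATE estimate
m1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])
cate = m1.predict(x) - m0.predict(x)

# Target the intervention at the users with the highest predicted uplift
top = np.argsort(cate)[-5:]
print(x[top, 0], cate[top])  # high-x users, with CATE near their true uplift
</syntaxhighlight>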


[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Causal Inference]]
[[Category:Statistics]]
[[Category:Science]]
[[Category:Economics]]
