AI for Epidemiology
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
AI for epidemiology applies machine learning to the study and control of disease patterns in populations. Epidemiology uses data to understand what causes disease, who is at risk, and how interventions work. AI expands this toolkit: ML models identify disease risk factors in large electronic health record datasets, natural language processing extracts outbreak signals from news and social media, computer vision analyzes satellite imagery for environmental health, and agent-based models simulate how interventions will change disease trajectories. COVID-19 demonstrated both the promise (rapid genomic surveillance, vaccine development support) and pitfalls (poorly validated risk models causing harm) of epidemiology AI.
</div>
__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Epidemiology''' – The study of the distribution and determinants of health and disease in populations.
* '''Outbreak detection''' – Identifying unusual disease clustering in time and space; now augmented by AI surveillance systems.
* '''Nowcasting''' – Estimating current disease burden before surveillance data is complete; ML corrects for reporting delays.
* '''Forecasting (epidemiology)''' – Predicting future disease incidence, hospital burden, and mortality.
* '''Syndromic surveillance''' – Monitoring health indicators (emergency visits, pharmacy sales, absenteeism) for early outbreak signals.
* '''R<sub>0</sub> (basic reproduction number)''' – Average number of secondary infections from one case in a fully susceptible population; estimated by ML.
* '''SIR model''' – Compartmental epidemic model: Susceptible → Infected → Recovered; foundational mathematical framework.
* '''Digital epidemiology''' – Using digital data (search trends, social media, mobile phones) for disease surveillance.
* '''Google Flu Trends''' – Google's attempt to predict flu using search data; famously failed, revealing pitfalls of digital epidemiology.
* '''Wastewater epidemiology''' – Detecting pathogens in wastewater to estimate community infection levels; AI improves trend detection.
* '''Contact tracing''' – Identifying contacts of infected individuals; AI systems automated this during COVID-19.
* '''Causal inference (epidemiology)''' – Methods distinguishing correlation from causation in observational data; propensity scores, instrumental variables.
* '''Electronic Health Records (EHR)''' – Digitized patient health data; a massive resource for epidemiological ML.
* '''Genomic surveillance''' – Sequencing pathogen genomes to track variants, transmission chains, and evolution; SARS-CoV-2 Nextstrain.
* '''ProMED''' – Global infectious disease outbreak monitoring system; now augmented by AI text analysis.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Epidemiology AI operates at multiple scales: individual risk prediction, community-level disease monitoring, and global outbreak response.

'''Disease surveillance''': Traditional disease surveillance relies on reported cases, which are slow, incomplete, and biased toward severe cases. Digital surveillance uses proxy signals: Google search trends (flu-like searches predict flu incidence), Twitter/Reddit posts (symptom language), pharmacy sales (OTC medication patterns), and HealthMap (news article NLP). These provide near-real-time signals weeks before official reports. COVID-19 demonstrated the power of wastewater surveillance, detecting SARS-CoV-2 RNA in wastewater 7–10 days before clinical case increases.
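The early-warning idea behind syndromic surveillance can be illustrated with a minimal anomaly detector: establish a baseline from historical counts, track an exponentially weighted moving average (EWMA), and flag days that exceed the baseline by several standard deviations. This is an illustrative sketch on simulated data, not a production surveillance algorithm; the function name, parameters, and thresholds are assumptions for the example.

```python
import numpy as np

def ewma_outbreak_alerts(counts, lam=0.3, k=4.0, baseline=28):
    """Flag days where counts exceed an EWMA baseline by k sigma.

    A simplified syndromic-surveillance detector (illustrative only):
    the first `baseline` days establish the mean and standard deviation,
    then an EWMA tracks the expected level going forward.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts[:baseline].mean()
    sigma = counts[:baseline].std(ddof=1)
    ewma = mu
    alerts = []
    for t in range(baseline, len(counts)):
        if counts[t] > ewma + k * sigma:
            alerts.append(t)
        else:
            # Update the baseline only on non-alert days,
            # so the detector does not "chase" an ongoing outbreak.
            ewma = lam * counts[t] + (1 - lam) * ewma
    return alerts

# Simulated daily ED visit counts: stable baseline, then an outbreak at day 50
rng = np.random.default_rng(0)
series = rng.poisson(20, 60).astype(float)
series[50:] += 40  # injected outbreak signal
print(ewma_outbreak_alerts(series))
```

Real systems (e.g., CDC's EARS algorithms) use more robust variants of this idea, but the core trade-off is the same: a lower threshold `k` gives earlier warning at the cost of more false alarms.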
'''Epidemic forecasting''': ML models trained on historical surveillance data, mobility data, climate, and vaccination rates predict epidemic trajectories. The COVID-19 Forecast Hub aggregated predictions from 40+ teams; ensemble models outperformed individual models. Key challenges: epidemics have non-stationary dynamics; behavioral change (lockdowns, masking) shifts transmission; data quality degrades during surges.

'''The Google Flu Trends lesson''': Google's 2009 paper predicted flu 1–2 weeks ahead using search query data with impressive early accuracy. By 2012, it was systematically overestimating flu by 2×. The lesson: big data correlations are fragile; search behavior changes (media panic during flu season causes flu-related searches even in healthy people); models trained in one regime fail in another. This is the canonical cautionary tale for digital epidemiology.

'''EHR-based risk models''': ML models trained on large EHR databases can predict individual patient risk for flu complications, sepsis, hospital readmission, and chronic disease development. The challenge: many COVID-19 risk models published in 2020 were methodologically flawed (data leakage, inadequate validation, biased training data), and several were explicitly shown to be harmful or useless when deployed. TRIPOD guidelines for prediction model reporting were widely violated.
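A core methodological fix for the flawed risk models described above is chronological validation: train on earlier patients, evaluate on later ones, and never let future information leak into training. A minimal sketch on synthetic data (the feature names and coefficients are hypothetical, chosen only to make the example self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic EHR-style cohort, rows ordered by admission date.
# Hypothetical standardized features, e.g. age_z, crp_z, spo2_z.
n = 2000
X = rng.normal(size=(n, 3))
logits = 1.2 * X[:, 0] + 0.8 * X[:, 1] - 0.5 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Chronological split: train on earlier admissions, test on later ones.
# Shuffling here would leak future information into training -- one of
# the mistakes behind many flawed 2020 COVID-19 risk models.
split = int(0.8 * n)
model = LogisticRegression().fit(X[:split], y[:split])
auc = roc_auc_score(y[split:], model.predict_proba(X[split:])[:, 1])
print(f"Held-out (future) AUROC: {auc:.3f}")
```

On real EHR data the held-out period should also be checked for distribution shift (new variants, changed admission criteria), since a model validated on one epidemic phase may fail in the next.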
</div>
<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Epidemic curve forecasting with LSTM:'''
<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

class EpidemicLSTM(nn.Module):
    """LSTM for epidemic incidence forecasting."""
    def __init__(self, input_dim, hidden_dim=128, n_layers=2, output_horizon=14):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, n_layers,
                            batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_dim, output_horizon)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # Forecast 14 days ahead

# Features: new_cases, hospitalizations, mobility, vaccination_rate, temperature
# Use CDC FluView, WHO FluNet, or COVID-19 surveillance data
def prepare_sequences(df, window=28, horizon=14):
    X, y = [], []
    for i in range(len(df) - window - horizon + 1):
        X.append(df.iloc[i:i+window].values)
        y.append(df['new_cases'].iloc[i+window:i+window+horizon].values)
    return np.array(X), np.array(y)

# Load surveillance data
df = pd.read_csv("surveillance_data.csv", parse_dates=['date']).sort_values('date')
cols = ['new_cases', 'hospitalizations', 'mobility', 'vaccination_rate']
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols)

X, y = prepare_sequences(scaled, window=28, horizon=14)
X = torch.FloatTensor(X); y = torch.FloatTensor(y)

# Train/test split: chronological (never shuffle time series!)
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = EpidemicLSTM(input_dim=4, output_horizon=14)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()
</syntaxhighlight>
; Epidemiology AI tools
: '''Surveillance''' – HealthMap (NLP news), ProMED AI, Nextstrain (genomic)
: '''Wastewater''' – Biobot Analytics, WastewaterSCAN + ML trend detection
: '''Forecasting''' – CDC Forecast Hub, EU COVID-19 Forecast Hub, FluSight
: '''Contact tracing''' – TraceTogether, NOVID, state health department apps
: '''EHR analytics''' – TriNetX, TrialSpark, Aetion (causal inference)
</div>
<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Epidemiology AI Performance vs. Traditional Methods
! Application !! AI Advantage !! Key Limitation !! Reliability
|-
| Digital surveillance (flu) || 1–2 week lead time || Fragile correlations || Moderate
|-
| Wastewater surveillance || 7–10 day early warning || Catchment area complexity || High
|-
| Epidemic forecasting (short-term) || Comparable to statistical || Non-stationary dynamics || Moderate
|-
| Genomic variant tracking || Automated, fast || Sequencing bias || High
|-
| Individual risk models (EHR) || Personalized prediction || Validation quality varies || Variable
|}
'''Failure modes''': Distribution shift across pandemic phases (a model trained pre-Omicron fails post-Omicron). Data quality collapses during surges (under-reporting, delayed reporting). Algorithmic amplification of disparities: risk models perform worse for historically under-served populations, precisely where intervention is most needed. Overconfident point predictions without uncertainty quantification lead to poor public health decisions.
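The last failure mode above, overconfident intervals, is directly measurable: an 80% prediction interval should contain the observed value roughly 80% of the time. A minimal sketch with simulated incidence data (the forecaster and interval widths are hypothetical, constructed to contrast an overconfident model with a calibrated one):

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Empirical coverage: fraction of observed values inside [lower, upper].

    A well-calibrated 80% prediction interval should cover ~80% of outcomes.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

rng = np.random.default_rng(1)
truth = rng.normal(100, 10, size=500)   # observed daily incidence, sd = 10
point = np.full(500, 100.0)             # point forecasts at the true mean
z80 = 1.2816                            # half-width of an 80% normal interval, in sigmas

# Overconfident forecaster: intervals half as wide as they should be
overconfident = interval_coverage(truth, point - 0.5 * z80 * 10,
                                  point + 0.5 * z80 * 10)
# Correctly calibrated intervals
calibrated = interval_coverage(truth, point - z80 * 10, point + z80 * 10)
print(f"overconfident 80% PI covers {overconfident:.0%}, "
      f"calibrated covers {calibrated:.0%}")
```

The overconfident intervals cover only about half of the outcomes despite being labeled "80%", which is exactly the kind of gap forecast hubs check before trusting a model's uncertainty statements.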
</div>
<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Epidemiology AI evaluation:
# '''Prospective validation''': epidemic forecasts must be tested on future data, never on past data the model could implicitly have learned from.
# '''Calibration''': do 80% prediction intervals actually contain 80% of true values?
# '''Ensemble performance''': compare individual models against the ensemble; the ensemble typically outperforms.
# '''Real-time vs. revised data''': evaluate on real-time surveillance data (with delays and revisions), not finalized data.
# '''TRIPOD guidelines''': for clinical prediction models, report transparency, reproducibility, and validation.
</div>
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Building an epidemic surveillance system:
# Data streams: integrate official surveillance reports, hospital admissions, lab positivity rates, wastewater signals, and mobility data.
# Nowcasting: account for reporting delays using a negative-binomial model or ML correction.
# Forecasting: ensemble of LSTM + statistical baselines (ARIMA, Prophet); report uncertainty intervals, not just point estimates.
# Alert system: automated detection of significant deviations from trend; dashboard for public health officials.
# Equity lens: track disease burden and model performance by demographic group; ensure early warning reaches all communities equally.
[[Category:Artificial Intelligence]]
[[Category:Epidemiology]]
[[Category:Public Health]]
</div>
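The nowcasting step in the Creating checklist, correcting recent counts for reporting delays, can be sketched with a simple inverse-reporting-fraction estimator. This assumes the delay distribution is known and fixed; a real system would estimate it jointly with the counts (e.g., via a negative-binomial model), so treat this as an illustrative toy, with all names and numbers invented for the example.

```python
import numpy as np

def nowcast(reported, delay_pmf):
    """Scale up partially reported recent counts.

    reported[t]  -- cases with event day t reported so far ("today" = last index)
    delay_pmf[d] -- probability a case is reported d days after its event day

    For each day t, the expected reported fraction is the delay CDF evaluated
    at the number of reporting days observed so far; dividing the observed
    count by that fraction gives a simple point nowcast.
    """
    reported = np.asarray(reported, dtype=float)
    cdf = np.cumsum(delay_pmf)
    T = len(reported)
    est = reported.copy()
    for t in range(T):
        horizon = T - 1 - t  # days of reporting observed for event day t
        frac = cdf[min(horizon, len(cdf) - 1)]
        est[t] = reported[t] / frac
    return est

# Assumed delay distribution: 50% reported same day, 30% next day, 20% after two
delay_pmf = [0.5, 0.3, 0.2]
reported = [100, 100, 80, 50]  # most recent days are increasingly incomplete
print(nowcast(reported, delay_pmf))  # -> [100. 100. 100. 100.]
```

The apparent recent decline (80, 50) disappears once reporting incompleteness is accounted for: all four days are nowcast at 100 cases, which is why acting on raw recent counts during a surge can be misleading.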