<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Tabular deep learning applies neural network architectures to structured tabular data — the spreadsheet-format data that dominates enterprise applications, business analytics, and scientific databases. For decades, gradient boosting (XGBoost, LightGBM, CatBoost) has dominated tabular ML competitions and production systems, consistently outperforming neural networks. Recent tabular deep learning research challenges this status quo: architectures like TabNet, TabTransformer, FT-Transformer, and foundation models for tabular data (TabPFN, SAINT) are closing the gap. Understanding when deep learning beats gradient boosting — and when it doesn't — is essential knowledge for ML practitioners.
</div>


__TOC__
 
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Tabular data''' — Structured data organized in rows (samples) and columns (features); the dominant format in enterprise ML.
* '''Heterogeneous features''' — Tabular data typically mixes numerical and categorical features of varying scales and semantics; a unique challenge vs. images/text.
* '''Feature interactions''' — Relationships between features that jointly predict the target; gradient boosting discovers these via trees; DL via attention.
* '''Entity embedding''' — Representing categorical variables as learned dense vectors; a key technique enabling neural networks to handle high-cardinality categoricals.
* '''TabNet''' — An attention-based neural network for tabular data with built-in feature selection; Arik & Pfister (2021).
* '''TabTransformer''' — A transformer applying self-attention to categorical embeddings; Huang et al. (2020).
* '''FT-Transformer (Feature Tokenizer + Transformer)''' — Embeds all features (numerical + categorical) as tokens; applies a transformer; Gorishniy et al. (2021).
* '''TabPFN''' — A pre-trained transformer that performs in-context learning on small tabular datasets; prior-fitted networks.
* '''SAINT''' — Self-Attention and Intersample Attention Transformer; applies attention both within and across samples.
* '''XGBoost / LightGBM / CatBoost''' — The dominant gradient boosting frameworks; still the baseline to beat on most tabular benchmarks.
* '''Prior-Data Fitted Networks (PFN)''' — Models pre-trained on synthetic tabular datasets that can perform few-shot inference on new datasets.
* '''Hyperparameter sensitivity''' — Neural networks for tabular data require careful tuning; GBDTs are more robust to hyperparameter choices.
* '''Large Language Models for tables''' — Using LLMs for tabular tasks via serialization; surprisingly competitive on certain tasks.
* '''AutoML''' — Automated ML pipeline search including architecture selection; FLAML, AutoGluon, H2O AutoML.
</div>


<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The "tabular gap": neural networks excel at images, text, and audio because these have spatial/sequential structure that convolutions and attention exploit efficiently. Tabular data lacks this structure — features are semantically heterogeneous, with no natural ordering. A column called "age" is fundamentally different from a column called "revenue" in ways that have no analog in pixels.


'''Why GBDTs win''': Gradient boosted trees handle heterogeneous features natively, discover complex feature interactions via splits, are robust to irrelevant features (automatic feature selection), require minimal preprocessing, and train quickly. They're hard to beat on tabular benchmarks because they solve exactly the problems posed by tabular data without the overhead of deep learning.
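The "minimal preprocessing" point can be made concrete with a small sketch. It uses scikit-learn's <code>HistGradientBoostingClassifier</code> (a LightGBM-style histogram GBDT) as a stand-in; the synthetic data and the 10% missing-value rate are illustrative assumptions, not from any benchmark.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # target depends on two features
X[rng.random(X.shape) < 0.1] = np.nan          # knock out 10% of cells; no imputation

# A histogram GBDT trains directly on the raw, unscaled matrix with NaNs present
clf = HistGradientBoostingClassifier(max_iter=100, random_state=0)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
</syntaxhighlight>

A neural network fed the same matrix would fail at the first NaN; the tree model's split logic simply routes missing values to one side.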


'''Where DL can win on tabular data''':
# '''Large datasets''' (>100K samples): neural networks improve with scale where GBDTs plateau.
# '''High-cardinality categoricals''': entity embeddings for user IDs, product IDs with millions of values.
# '''Multi-modal inputs''': when tabular data is combined with text, images, or other modalities.
# '''End-to-end learning''': when the tabular model is part of a larger differentiable system.
# '''Online learning''': neural networks update incrementally more easily than tree ensembles.
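Item (2) above can be sketched with a minimal entity-embedding model. <code>EntityEmbeddingMLP</code> and its dimensions are hypothetical choices for illustration, not a published architecture: the key idea is that a million-valued ID column becomes a small learned vector instead of a million-wide one-hot.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class EntityEmbeddingMLP(nn.Module):
    """Illustrative sketch: a dense embedding for one high-cardinality categorical."""
    def __init__(self, n_categories, emb_dim=16, n_num=4):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)  # e.g. user IDs
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + n_num, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, x_cat, x_num):
        # look up the learned vector and concatenate with the numeric features
        return self.mlp(torch.cat([self.emb(x_cat), x_num], dim=-1))

model = EntityEmbeddingMLP(n_categories=1_000_000)  # a million IDs, 16 params each
out = model(torch.tensor([42, 7]), torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 1])
</syntaxhighlight>

The embedding table is trained end-to-end with the rest of the network, so similar IDs end up with similar vectors.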


'''FT-Transformer — current SOTA''': The Feature Tokenizer + Transformer embeds each feature as a token (using a linear layer for numerical, embedding table for categorical), prepends a CLS token, and applies standard transformer layers. It consistently outperforms TabNet and approaches GBDT performance on many benchmarks — while being a clean, generalizable architecture.
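The tokenizer step can be sketched in a few lines. This is a simplified illustration of the idea, not the rtdl implementation; the class name and dimensions are assumptions for the example.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """Sketch of FT-Transformer's tokenizer: one token per feature, plus [CLS]."""
    def __init__(self, n_num, cat_cardinalities, d_token=64):
        super().__init__()
        # one learned (weight, bias) pair per numerical feature
        self.num_weight = nn.Parameter(torch.randn(n_num, d_token))
        self.num_bias = nn.Parameter(torch.zeros(n_num, d_token))
        self.cat_embs = nn.ModuleList(nn.Embedding(c, d_token) for c in cat_cardinalities)
        self.cls = nn.Parameter(torch.randn(1, 1, d_token))  # prepended CLS token

    def forward(self, x_num, x_cat):
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias  # (B, n_num, d)
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embs)], dim=1
        )  # (B, n_cat, d)
        cls = self.cls.expand(x_num.shape[0], -1, -1)
        return torch.cat([cls, num_tokens, cat_tokens], dim=1)  # (B, 1+n_num+n_cat, d)

tok = FeatureTokenizer(n_num=3, cat_cardinalities=[5, 10])
tokens = tok(torch.randn(8, 3), torch.randint(0, 5, (8, 2)))
print(tokens.shape)  # torch.Size([8, 6, 64]); ready for a standard transformer encoder
</syntaxhighlight>

After tokenization, ordinary transformer layers process the sequence and the final CLS representation feeds the prediction head.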


'''TabPFN — few-shot tabular ML''': Pre-trained on millions of synthetic tabular datasets, TabPFN uses in-context learning to make predictions on new small datasets (up to ~1000 samples) with a single forward pass — no training required. On small datasets it frequently matches or beats XGBoost with minutes of tuning. This is a fundamentally different paradigm from standard ML.
</div>


<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''FT-Transformer on tabular benchmark:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from rtdl import FTTransformer  # rtdl: Revisiting Tabular Deep Learning (pip install rtdl)

# Separate numerical and categorical features
n_num_features = 8                # number of continuous features
cat_cardinalities = [5, 100, 10]  # cardinality of each categorical feature

model = FTTransformer.make_default(
    n_num_features=n_num_features,
    cat_cardinalities=cat_cardinalities,
    d_out=1,  # 1 for binary classification or regression; n_classes for multiclass
)

optimizer = model.make_default_optimizer()  # AdamW with the standard tabular LR schedule

# Training loop
def train_epoch(model, loader, optimizer, task='classification'):
    model.train()
    total_loss = 0
    for X_num, X_cat, y in loader:
        logits = model(X_num, X_cat).squeeze(1)
        if task == 'classification':
            loss = nn.BCEWithLogitsLoss()(logits, y.float())
        else:
            loss = nn.MSELoss()(logits, y.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# Quick comparison: TabPFN for small datasets (no training!)
from tabpfn import TabPFNClassifier
from sklearn.metrics import roc_auc_score

clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)
clf.fit(X_train_small, y_train_small)  # instant — no gradient descent
preds = clf.predict_proba(X_test_small)
print(f"TabPFN AUC: {roc_auc_score(y_test_small, preds[:, 1]):.3f}")
# Often matches XGBoost on small datasets without any hyperparameter tuning!
</syntaxhighlight>

'''Tabular DL framework selection guide'''
: '''Small data (<1K samples)''' → TabPFN (no training needed), XGBoost with Bayesian HPO
: '''Medium data (1K–100K)''' → XGBoost/LightGBM baseline; try FT-Transformer
: '''Large data (>100K)''' → FT-Transformer, SAINT; potentially beats GBDT
: '''High-cardinality categoricals''' → Entity embeddings + any DL model; CatBoost also strong
: '''Multi-modal (tabular + text)''' → TabTransformer/FT-Transformer + BERT fusion
: '''AutoML''' → AutoGluon-Tabular (tests multiple models); strong default
</div>


<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Tabular ML Method Comparison (2024)
! Method !! Small Data !! Large Data !! Categorical Handling !! Training Speed !! Interpretability
|-
| XGBoost/LightGBM || Excellent || Good || Good (ordinal encoding) || Fast || Medium (SHAP)
|-
| CatBoost || Excellent || Good || Excellent (native) || Fast || Medium (SHAP)
|-
| TabNet || Good || Good || Medium || Slow || High (attention masks)
|-
| FT-Transformer || Good || Excellent || Excellent (embeddings) || Slow || Low
|-
| TabPFN || Excellent (≤1K) || N/A || Good || Instant (inference) || Low
|-
| Random Forest || Good || Good || Poor || Medium || Medium
|}


'''Failure modes''': Overfitting to small tabular datasets with deep learning (more parameters than samples). Forgetting to normalize numerical features for neural networks. Missing value handling — NNs require explicit imputation; GBDTs handle natively. Hyperparameter sensitivity of tabular NNs (learning rate, weight decay require tuning). Scale mismatch between features causing slow convergence.
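The missing-value and scale-mismatch failure modes are avoidable with a standard preprocessing pipeline before any tabular NN. A minimal scikit-learn sketch, with hypothetical column indices chosen for the example:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical layout: columns 0-2 numeric, column 3 categorical
num_cols, cat_cols = [0, 1, 2], [3]
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),  # NNs cannot eat NaN
                      ("scale", StandardScaler())]), num_cols),      # fix scale mismatch
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
X = np.array([[1.0, 200.0, np.nan, 0],
              [2.0, 150.0, 3.0, 1],
              [np.nan, 180.0, 4.0, 0]])
Xt = pre.fit_transform(X)
print(Xt.shape)  # 3 scaled numeric columns + one column per category
</syntaxhighlight>

A GBDT would need none of this; for a neural network it is the difference between converging and crashing.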
</div>


<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Tabular deep learning evaluation:
# '''Benchmark against GBDT baseline''': always compare to tuned XGBoost or LightGBM — if DL doesn't beat it, use GBDT.
# '''Multiple random seeds''': tabular DL has higher variance; report mean ± std over 5+ seeds.
# '''Cross-validation''': strict k-fold with stratification for classification.
# '''Calibration''': are predicted probabilities well-calibrated?
# '''Computation budget''': account for DL training time vs. GBDT; ROI of DL must exceed the extra cost.
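Points (2) and (3) combine into one protocol. A minimal sketch on synthetic data, using scikit-learn's <code>MLPClassifier</code> as a stand-in for a tabular neural network:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratified k-fold

# Same protocol under 5 seeds: tabular NNs vary run to run, so report mean and std
scores = [
    cross_val_score(
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=seed),
        X, y, cv=cv, scoring="roc_auc",
    ).mean()
    for seed in range(5)
]
print(f"AUC: {np.mean(scores):.3f} +/- {np.std(scores):.3f} over {len(scores)} seeds")
</syntaxhighlight>

If the reported std is a large fraction of the gap to the GBDT baseline, the "improvement" from deep learning is likely seed noise.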
</div>


<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Tabular ML production pipeline:
# Baseline: train XGBoost/LightGBM with default hyperparameters; measure AUC/RMSE.
# AutoML: run AutoGluon-Tabular for 1 hour; assess best model.
# If DL worth pursuing: FT-Transformer with AdamW, learning rate warmup, cosine annealing.
# Feature engineering: log-transform skewed numeric features; frequency encoding for very high-cardinality categoricals.
# Ensembling: stack GBDT + FT-Transformer predictions; often beats either alone.
# Deployment: export XGBoost as ONNX or LightGBM native; FT-Transformer as TorchScript.
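Step 5 can be sketched with scikit-learn's <code>StackingClassifier</code>. Here <code>HistGradientBoostingClassifier</code> and <code>MLPClassifier</code> stand in for the GBDT and FT-Transformer, and the dataset is synthetic:

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Stack a GBDT with a neural net; a linear meta-learner blends their predictions.
# cv=5 means the meta-learner sees out-of-fold predictions, avoiding label leakage.
stack = StackingClassifier(
    estimators=[
        ("gbdt", HistGradientBoostingClassifier(random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"stacked accuracy: {stack.score(X_te, y_te):.3f}")
</syntaxhighlight>

The stack helps most when the two base models make uncorrelated errors, which trees and neural nets often do on tabular data.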


[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Tabular Data]]
</div>

Latest revision as of 01:58, 25 April 2026
