Tabular Deep Learning - Revision history

Wordpad: BloomWiki: Tabular Deep Learning

2026-04-25T01:58:54Z

BloomWiki: Tabular Deep Learning

← Older revision		Revision as of 01:58, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Tabular deep learning applies neural network architectures to structured tabular data — the spreadsheet-format data that dominates enterprise applications, business analytics, and scientific databases. For decades, gradient boosting (XGBoost, LightGBM, CatBoost) has dominated tabular ML competitions and production systems, consistently outperforming neural networks. Recent tabular deep learning research challenges this status quo: architectures like TabNet, TabTransformer, FT-Transformer, and foundation models for tabular data (TabPFN, SAINT) are closing the gap. Understanding when deep learning beats gradient boosting — and when it doesn't — is essential knowledge for ML practitioners.		Tabular deep learning applies neural network architectures to structured tabular data — the spreadsheet-format data that dominates enterprise applications, business analytics, and scientific databases. For decades, gradient boosting (XGBoost, LightGBM, CatBoost) has dominated tabular ML competitions and production systems, consistently outperforming neural networks. Recent tabular deep learning research challenges this status quo: architectures like TabNet, TabTransformer, FT-Transformer, and foundation models for tabular data (TabPFN, SAINT) are closing the gap. Understanding when deep learning beats gradient boosting — and when it doesn't — is essential knowledge for ML practitioners.
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''Tabular data''' — Structured data organized in rows (samples) and columns (features); the dominant format in enterprise ML.		* '''Tabular data''' — Structured data organized in rows (samples) and columns (features); the dominant format in enterprise ML.
	* '''Heterogeneous features''' — Tabular data typically mixes numerical and categorical features of varying scales and semantics; unique challenge vs. images/text.		* '''Heterogeneous features''' — Tabular data typically mixes numerical and categorical features of varying scales and semantics; unique challenge vs. images/text.
Line 17:		Line 22:
	* '''Large Language Models for tables''' — Using LLMs for tabular tasks via serialization; surprisingly competitive on certain tasks.		* '''Large Language Models for tables''' — Using LLMs for tabular tasks via serialization; surprisingly competitive on certain tasks.
	* '''AutoML''' — Automated ML pipeline search including architecture selection; FLAML, AutoGluon, H2O AutoML.		* '''AutoML''' — Automated ML pipeline search including architecture selection; FLAML, AutoGluon, H2O AutoML.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	The "tabular gap": neural networks excel at images, text, and audio because these have spatial/sequential structure that convolutions and attention exploit efficiently. Tabular data lacks this structure — features are semantically heterogeneous, with no natural ordering. A column called "age" is fundamentally different from a column called "revenue" in ways that have no analog in pixels.		The "tabular gap": neural networks excel at images, text, and audio because these have spatial/sequential structure that convolutions and attention exploit efficiently. Tabular data lacks this structure — features are semantically heterogeneous, with no natural ordering. A column called "age" is fundamentally different from a column called "revenue" in ways that have no analog in pixels.

Line 28:		Line 35:

	TabPFN — few-shot tabular ML: Pre-trained on millions of synthetic tabular datasets, TabPFN uses in-context learning to make predictions on new small datasets (up to ~1000 samples) with a single forward pass — no training required. On small datasets it frequently matches or beats XGBoost with minutes of tuning. This is a fundamentally different paradigm from standard ML.		TabPFN — few-shot tabular ML: Pre-trained on millions of synthetic tabular datasets, TabPFN uses in-context learning to make predictions on new small datasets (up to ~1000 samples) with a single forward pass — no training required. On small datasets it frequently matches or beats XGBoost with minutes of tuning. This is a fundamentally different paradigm from standard ML.
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''FT-Transformer on tabular benchmark:'''		'''FT-Transformer on tabular benchmark:'''
	<syntaxhighlight lang="python">		<syntaxhighlight lang="python">
Line 84:		Line 93:
	: '''Multi-modal (tabular + text)''' → TabTransformer/FT-Transformer + BERT fusion		: '''Multi-modal (tabular + text)''' → TabTransformer/FT-Transformer + BERT fusion
	: '''AutoML''' → AutoGluon-Tabular (tests multiple models); strong default		: '''AutoML''' → AutoGluon-Tabular (tests multiple models); strong default
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{| class="wikitable"		{| class="wikitable"
	|+ Tabular ML Method Comparison (2024)		|+ Tabular ML Method Comparison (2024)
Line 104:		Line 115:

	'''Failure modes''': Overfitting to small tabular datasets with deep learning (more parameters than samples). Forgetting to normalize numerical features for neural networks. Missing value handling — NNs require explicit imputation; GBDTs handle natively. Hyperparameter sensitivity of tabular NNs (learning rate, weight decay require tuning). Scale mismatch between features causing slow convergence.		'''Failure modes''': Overfitting to small tabular datasets with deep learning (more parameters than samples). Forgetting to normalize numerical features for neural networks. Missing value handling — NNs require explicit imputation; GBDTs handle natively. Hyperparameter sensitivity of tabular NNs (learning rate, weight decay require tuning). Scale mismatch between features causing slow convergence.
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Tabular deep learning evaluation: (1) Benchmark against GBDT baseline: always compare to tuned XGBoost or LightGBM — if DL doesn't beat it, use GBDT. (2) Multiple random seeds: tabular DL has higher variance; report mean ± std over 5+ seeds. (3) Cross-validation: strict k-fold with stratification for classification. (4) Calibration: are predicted probabilities well-calibrated? (5) Computation budget: account for DL training time vs. GBDT; ROI of DL must exceed the extra cost.		Tabular deep learning evaluation: (1) Benchmark against GBDT baseline: always compare to tuned XGBoost or LightGBM — if DL doesn't beat it, use GBDT. (2) Multiple random seeds: tabular DL has higher variance; report mean ± std over 5+ seeds. (3) Cross-validation: strict k-fold with stratification for classification. (4) Calibration: are predicted probabilities well-calibrated? (5) Computation budget: account for DL training time vs. GBDT; ROI of DL must exceed the extra cost.
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Tabular ML production pipeline: (1) Baseline: train XGBoost/LightGBM with default hyperparameters; measure AUC/RMSE. (2) AutoML: run AutoGluon-Tabular for 1 hour; assess best model. (3) If DL worth pursuing: FT-Transformer with AdamW, learning rate warmup, cosine annealing. (4) Feature engineering: log-transform skewed numeric features; frequency encoding for very high-cardinality categoricals. (5) Ensembling: stack GBDT + FT-Transformer predictions; often beats either alone. (6) Deployment: export XGBoost as ONNX or LightGBM native; FT-Transformer as TorchScript.		Tabular ML production pipeline: (1) Baseline: train XGBoost/LightGBM with default hyperparameters; measure AUC/RMSE. (2) AutoML: run AutoGluon-Tabular for 1 hour; assess best model. (3) If DL worth pursuing: FT-Transformer with AdamW, learning rate warmup, cosine annealing. (4) Feature engineering: log-transform skewed numeric features; frequency encoding for very high-cardinality categoricals. (5) Ensembling: stack GBDT + FT-Transformer predictions; often beats either alone. (6) Deployment: export XGBoost as ONNX or LightGBM native; FT-Transformer as TorchScript.

Line 114:		Line 129:
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Tabular Data]]		[[Category:Tabular Data]]
			</div>

Wordpad: BloomWiki: Tabular Deep Learning

2026-04-23T12:25:37Z

BloomWiki: Tabular Deep Learning

New page

{{BloomIntro}}
Tabular deep learning applies neural network architectures to structured tabular data, the spreadsheet-format data that dominates enterprise applications, business analytics, and scientific databases. For decades, gradient boosting (XGBoost, LightGBM, CatBoost) has dominated tabular ML competitions and production systems, consistently outperforming neural networks. Recent tabular deep learning research challenges this status quo: architectures like TabNet, TabTransformer, FT-Transformer, and SAINT, along with pre-trained foundation models for tabular data such as TabPFN, are closing the gap. Understanding when deep learning beats gradient boosting, and when it doesn't, is essential knowledge for ML practitioners.

== Remembering ==
* '''Tabular data''' — Structured data organized in rows (samples) and columns (features); the dominant format in enterprise ML.
* '''Heterogeneous features''' — Tabular data typically mixes numerical and categorical features of varying scales and semantics; unique challenge vs. images/text.
* '''Feature interactions''' — Relationships between features that jointly predict the target; gradient boosting discovers these via trees; DL via attention.
* '''Entity embedding''' — Representing categorical variables as learned dense vectors; a key technique enabling neural networks to handle high-cardinality categoricals.
* '''TabNet''' — An attention-based neural network for tabular data with built-in feature selection; Arik & Pfister (2021).
* '''TabTransformer''' — A transformer applying self-attention to categorical embeddings; Huang et al. (2020).
* '''FT-Transformer (Feature Tokenizer + Transformer)''' — Embeds all features (numerical + categorical) as tokens; applies transformer; Gorishniy et al. (2021).
* '''TabPFN''' — A pre-trained transformer that performs in-context learning on small tabular datasets; prior-fitted networks.
* '''SAINT''' — Self-Attention and Intersample Attention Transformer; applies attention both within and across samples.
* '''XGBoost / LightGBM / CatBoost''' — The dominant gradient boosting frameworks; still the baseline to beat on most tabular benchmarks.
* '''Prior-Data Fitted Networks (PFN)''' — Models pre-trained on synthetic tabular datasets that can perform few-shot inference on new datasets.
* '''Hyperparameter sensitivity''' — Neural networks for tabular data require careful tuning; GBDTs are more robust to hyperparameter choices.
* '''Large Language Models for tables''' — Using LLMs for tabular tasks via serialization; surprisingly competitive on certain tasks.
* '''AutoML''' — Automated ML pipeline search including architecture selection; FLAML, AutoGluon, H2O AutoML.

== Understanding ==
The "tabular gap": neural networks excel at images, text, and audio because these have spatial/sequential structure that convolutions and attention exploit efficiently. Tabular data lacks this structure — features are semantically heterogeneous, with no natural ordering. A column called "age" is fundamentally different from a column called "revenue" in ways that have no analog in pixels.

'''Why GBDTs win''': Gradient-boosted trees handle heterogeneous features natively, discover complex feature interactions via splits, are robust to irrelevant features (automatic feature selection), require minimal preprocessing, and train quickly. They are hard to beat on tabular benchmarks because they address exactly the problems that tabular data poses, without the overhead of deep learning.

'''Where DL can win on tabular data''': (1) '''Large datasets''' (>100K samples): neural networks can keep improving with scale where GBDTs tend to plateau. (2) '''High-cardinality categoricals''': entity embeddings for user IDs or product IDs with millions of values. (3) '''Multi-modal inputs''': when tabular data is combined with text, images, or other modalities. (4) '''End-to-end learning''': when the tabular model is part of a larger differentiable system. (5) '''Online learning''': neural networks update incrementally more easily than tree ensembles.

'''FT-Transformer: a leading tabular DL architecture''': The Feature Tokenizer + Transformer embeds each feature as a token (a per-feature linear layer for numerical features, an embedding table for categorical ones), prepends a CLS token, and applies standard transformer layers. It generally outperforms TabNet and approaches GBDT performance on many benchmarks, while remaining a clean, general-purpose architecture.
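
The tokenization step described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the rtdl implementation: the parameter layout, the function names <code>init_tokenizer</code> and <code>tokenize</code>, the random initialisation, and the token width of 16 are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tokenizer(n_num, cat_cardinalities, d_token):
    """Create randomly initialised tokenizer parameters (hypothetical layout)."""
    return {
        "num_weight": rng.normal(size=(n_num, d_token)),  # one vector per numerical feature
        "num_bias": np.zeros((n_num, d_token)),
        "cat_tables": [rng.normal(size=(c, d_token)) for c in cat_cardinalities],
        "cls": rng.normal(size=(d_token,)),
    }

def tokenize(params, x_num, x_cat):
    """Map a batch of rows to a (batch, 1 + n_num + n_cat, d_token) token array."""
    batch = x_num.shape[0]
    # Numerical feature j becomes x_j * w_j + b_j: a per-feature linear layer.
    num_tokens = x_num[:, :, None] * params["num_weight"][None] + params["num_bias"][None]
    # Categorical feature j becomes a row lookup in its own embedding table.
    cat_tokens = np.stack(
        [table[x_cat[:, j]] for j, table in enumerate(params["cat_tables"])], axis=1
    )
    # The same CLS vector is prepended to every row's token sequence.
    cls = np.broadcast_to(params["cls"], (batch, 1, params["cls"].shape[0]))
    return np.concatenate([cls, num_tokens, cat_tokens], axis=1)

cardinalities = [5, 100, 10]
params = init_tokenizer(n_num=8, cat_cardinalities=cardinalities, d_token=16)
x_num = rng.normal(size=(4, 8))
x_cat = np.stack([rng.integers(0, c, size=4) for c in cardinalities], axis=1)
tokens = tokenize(params, x_num, x_cat)
print(tokens.shape)
```

The resulting array has one token per feature plus the CLS token, which is the sequence a standard transformer encoder would then consume; in a trained model these parameters would be learned rather than sampled.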

'''TabPFN: few-shot tabular ML''': Pre-trained on millions of synthetic tabular datasets, TabPFN uses in-context learning to make predictions on new small datasets (up to ~1000 samples) in a single forward pass, with no gradient-based training on the new data. On small datasets it frequently matches or beats an XGBoost model that has had minutes of hyperparameter tuning. This is a fundamentally different paradigm from standard ML.

== Applying ==
'''FT-Transformer on tabular benchmark:'''
<syntaxhighlight lang="python">
import torch.nn as nn
from rtdl import FTTransformer  # pip install rtdl (written against the rtdl 0.0.x API)

# Separate numerical and categorical features
n_num_features = 8                # Number of continuous features
cat_cardinalities = [5, 100, 10]  # Cardinality of each categorical feature

model = FTTransformer.make_default(
    n_num_features=n_num_features,
    cat_cardinalities=cat_cardinalities,
    d_out=1,  # 1 for binary classification or regression; n_classes for multiclass
)

# AdamW with the library's default learning rate and weight decay (no LR schedule)
optimizer = model.make_default_optimizer()

# Training loop for binary classification or regression (d_out=1)
def train_epoch(model, loader, optimizer, task='classification'):
    """Run one epoch over `loader`, which yields (X_num, X_cat, y) batches."""
    model.train()
    loss_fn = nn.BCEWithLogitsLoss() if task == 'classification' else nn.MSELoss()
    total_loss = 0.0
    for X_num, X_cat, y in loader:
        logits = model(X_num, X_cat).squeeze(1)
        loss = loss_fn(logits, y.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# Quick comparison: TabPFN for small datasets (no gradient-based training)
from tabpfn import TabPFNClassifier
from sklearn.metrics import roc_auc_score

# X_train_small, y_train_small, X_test_small, y_test_small are assumed to be
# a pre-split small binary-classification dataset.
# N_ensemble_configurations is the argument name in the original TabPFN release;
# later releases may use a different name.
clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)
clf.fit(X_train_small, y_train_small)  # Stores the data as context; no gradient descent
preds = clf.predict_proba(X_test_small)
print(f"TabPFN AUC: {roc_auc_score(y_test_small, preds[:, 1]):.3f}")
# Often matches XGBoost on small datasets without any hyperparameter tuning
</syntaxhighlight>

; Tabular DL framework selection guide
: '''Small data (<1K samples)''' → TabPFN (no training needed), XGBoost with Bayesian HPO
: '''Medium data (1K–100K)''' → XGBoost/LightGBM baseline; try FT-Transformer
: '''Large data (>100K)''' → FT-Transformer, SAINT; potentially beats GBDT
: '''High-cardinality categoricals''' → Entity embeddings + any DL model; CatBoost also strong
: '''Multi-modal (tabular + text)''' → TabTransformer/FT-Transformer + BERT fusion
: '''AutoML''' → AutoGluon-Tabular (tests multiple models); strong default
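
The selection guide above can also be written as a small rule-based helper. This is only a sketch: the function name <code>suggest_models</code> and its arguments are hypothetical, the thresholds are the ones listed in the guide, and real model selection should still be confirmed by benchmarking.

```python
def suggest_models(n_samples, high_cardinality=False, has_text=False):
    """Return candidate models following the selection guide (illustrative helper)."""
    if has_text:
        # Multi-modal (tabular + text) case from the guide.
        return ["TabTransformer/FT-Transformer + BERT fusion"]
    if n_samples < 1_000:
        candidates = ["TabPFN", "XGBoost with Bayesian HPO"]
    elif n_samples <= 100_000:
        candidates = ["XGBoost/LightGBM baseline", "FT-Transformer"]
    else:
        candidates = ["FT-Transformer", "SAINT"]
    if high_cardinality:
        candidates += ["Entity embeddings + DL model", "CatBoost"]
    return candidates

small = suggest_models(500)
medium = suggest_models(20_000, high_cardinality=True)
large = suggest_models(2_000_000)
print(small, medium, large, sep="\n")
```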

== Analyzing ==
{| class="wikitable"
|+ Tabular ML Method Comparison (2024)
! Method !! Small Data !! Large Data !! Categorical Handling !! Training Speed !! Interpretability
|-
| XGBoost/LightGBM || Excellent || Good || Good (ordinal encoding) || Fast || Medium (SHAP)
|-
| CatBoost || Excellent || Good || Excellent (native) || Fast || Medium (SHAP)
|-
| TabNet || Good || Good || Medium || Slow || High (attention masks)
|-
| FT-Transformer || Good || Excellent || Excellent (embeddings) || Slow || Low
|-
| TabPFN || Excellent (≤1K) || N/A || Good || Instant (inference) || Low
|-
| Random Forest || Good || Good || Poor || Medium || Medium
|}

'''Failure modes''': Overfitting to small tabular datasets with deep learning (more parameters than samples). Forgetting to normalize numerical features for neural networks. Missing value handling — NNs require explicit imputation; GBDTs handle natively. Hyperparameter sensitivity of tabular NNs (learning rate, weight decay require tuning). Scale mismatch between features causing slow convergence.
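
Two of the failure modes above (missing values and unscaled numerical features) can be handled with a small preprocessing step fitted on the training split only. The sketch below assumes median imputation with missing-value indicator columns and z-score standardisation; these are common choices rather than the only valid ones, and the function names and toy data are made up for illustration.

```python
import numpy as np

def fit_preprocessor(X_train):
    """Learn per-column medians, means and stds from the training split only."""
    medians = np.nanmedian(X_train, axis=0)
    filled = np.where(np.isnan(X_train), medians, X_train)
    means = filled.mean(axis=0)
    stds = filled.std(axis=0)
    stds[stds == 0] = 1.0  # avoid division by zero for constant columns
    return {"medians": medians, "means": means, "stds": stds}

def transform(prep, X):
    """Impute with training medians, standardise, and append missing-value indicators."""
    missing = np.isnan(X)
    filled = np.where(missing, prep["medians"], X)
    scaled = (filled - prep["means"]) / prep["stds"]
    return np.concatenate([scaled, missing.astype(float)], axis=1)

# Toy data: columns are (age, revenue), with one missing value in each column.
X_train = np.array([
    [25.0, 40_000.0],
    [35.0, np.nan],
    [45.0, 90_000.0],
    [np.nan, 65_000.0],
])
prep = fit_preprocessor(X_train)
X_ready = transform(prep, X_train)
print(X_ready.shape)
```

Validation and test rows would be passed through <code>transform</code> with the same <code>prep</code> object, so that no statistics leak from held-out data.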

== Evaluating ==
Tabular deep learning evaluation: (1) '''Benchmark against GBDT baseline''': always compare to a tuned XGBoost or LightGBM; if DL doesn't beat it, use GBDT. (2) '''Multiple random seeds''': tabular DL has higher variance; report mean ± std over 5+ seeds. (3) '''Cross-validation''': strict k-fold with stratification for classification. (4) '''Calibration''': are predicted probabilities well-calibrated? (5) '''Computation budget''': account for DL training time vs. GBDT; ROI of DL must exceed the extra cost.
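
Points (2) and (4) above lend themselves to a short sketch using only the Python standard library. The helper names are hypothetical, the AUC values are made-up placeholders rather than measured results, and the calibration metric shown is one simple equal-width binned variant of expected calibration error for binary classification.

```python
from statistics import mean, stdev

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of one metric across random seeds."""
    return mean(scores), stdev(scores)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE: weighted mean of |observed positive rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = mean(y for y, _ in members)
            predicted = mean(p for _, p in members)
            ece += len(members) / n * abs(observed - predicted)
    return ece

# Placeholder AUC scores from five hypothetical seeds.
auc_mean, auc_std = summarize_seeds([0.912, 0.905, 0.918, 0.909, 0.915])
print(f"AUC: {auc_mean:.3f} ± {auc_std:.3f}")

# Toy predictions: each bin is off by 0.1 from its observed positive rate.
ece = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
print(f"ECE: {ece:.3f}")
```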

== Creating ==
Tabular ML production pipeline: (1) Baseline: train XGBoost/LightGBM with default hyperparameters; measure AUC/RMSE. (2) AutoML: run AutoGluon-Tabular for 1 hour; assess best model. (3) If DL worth pursuing: FT-Transformer with AdamW, learning rate warmup, cosine annealing. (4) Feature engineering: log-transform skewed numeric features; frequency encoding for very high-cardinality categoricals. (5) Ensembling: stack GBDT + FT-Transformer predictions; often beats either alone. (6) Deployment: export XGBoost as ONNX or LightGBM native; FT-Transformer as TorchScript.
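
Steps (3) and (4) of the pipeline above can be sketched with the standard library alone. The function names, the warmup length, and the choice to anneal to zero are assumptions for illustration; in practice the schedule would usually be applied through the training framework's own scheduler.

```python
import math
from collections import Counter

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Step (3): linear warmup to base_lr, then cosine annealing towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def log_transform(values):
    """Step (4): log1p compresses the long right tail of non-negative skewed features."""
    return [math.log1p(v) for v in values]

def fit_frequency_encoding(train_categories):
    """Step (4): map each category to its relative frequency in the training split."""
    counts = Counter(train_categories)
    total = len(train_categories)
    return {cat: n / total for cat, n in counts.items()}

def apply_frequency_encoding(mapping, categories, unseen=0.0):
    """Encode categories; values never seen in training get `unseen`."""
    return [mapping.get(cat, unseen) for cat in categories]

schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3) for s in range(101)]
revenue_log = log_transform([0.0, 99.0, 9_999.0])
freq = fit_frequency_encoding(["a", "b", "a", "c", "a"])
encoded = apply_frequency_encoding(freq, ["a", "z"])
print(schedule[0], schedule[10], schedule[100])
print(revenue_log, encoded)
```

Fitting the frequency mapping on the training split and reusing it for validation and test data keeps the encoding free of leakage, and the <code>unseen</code> default gives categories that first appear at inference time a well-defined value.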

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Tabular Data]]