AI Infrastructure and MLOps
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
AI Infrastructure and MLOps (Machine Learning Operations) encompass the engineering systems, tools, practices, and culture required to develop, deploy, monitor, and maintain AI and machine learning systems at scale in production. Building a model that works in a Jupyter notebook is the easy part; getting it to serve millions of users reliably, at low latency, within a budget, with reproducible experiments and robust monitoring — that is MLOps. As AI becomes central to business operations, MLOps has emerged as a critical engineering discipline bridging data science and software engineering.
Remembering
- MLOps — Machine Learning Operations; practices and tools for deploying and maintaining ML models in production reliably and efficiently.
- ML pipeline — An automated sequence of steps: data ingestion → preprocessing → training → evaluation → deployment → monitoring.
- Feature store — A centralized repository for computing, storing, and serving ML features consistently across training and inference.
- Model registry — A central catalog tracking model versions, metadata, performance metrics, and deployment status.
- Experiment tracking — Recording hyperparameters, metrics, artifacts, and code for each training run to enable comparison and reproducibility.
- Model serving — The infrastructure for running trained models and serving predictions via API or batch processing.
- Inference server — A system optimized for serving model predictions at high throughput and low latency (Triton, TorchServe, vLLM).
- Continuous Training (CT) — Automatically retraining models on fresh data to prevent performance degradation.
- Data drift — A change in the statistical distribution of input data over time, degrading model performance.
- Concept drift — A change in the relationship between input features and the target variable, requiring model retraining.
- CI/CD for ML — Automated testing and deployment pipelines adapted for ML workflows.
- GPU cluster — A collection of GPUs used for distributed model training and inference.
- Kubernetes — A container orchestration platform used to deploy and scale ML services.
- Ray — A distributed computing framework for Python ML workloads.
- Weights & Biases (W&B) — A popular experiment tracking and model management platform.
- Kubeflow — An ML workflow orchestration platform built on Kubernetes.
Understanding
The "death valley of ML" is the gap between a model that works in a notebook and one that runs in production. MLOps bridges this gap by applying software engineering rigor to ML workflows.
The ML lifecycle has distinct phases, each requiring different infrastructure:
Data management: Raw data must be collected, validated, versioned, and transformed into features. Feature engineering is expensive and error-prone without systematic tooling. A feature store ensures that the same features computed at training time are served at inference time — preventing training-serving skew.
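As a minimal sketch of the consistency idea behind a feature store: define each feature transformation once and call the same function from both the training pipeline and the serving path. The column names and the rolling-average feature below are hypothetical; a real feature store adds versioning, backfills, and an online/offline split on top of this.

<syntaxhighlight lang="python">
import pandas as pd

def add_rolling_spend_features(df: pd.DataFrame) -> pd.DataFrame:
    # Single definition of the feature logic, reused by training and serving.
    # 'date' and 'daily_spend' are hypothetical column names for illustration.
    df = df.sort_values("date").copy()
    df["spend_30d_avg"] = df["daily_spend"].rolling(30, min_periods=1).mean()
    return df

# Toy data standing in for a real source table
history = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "daily_spend": range(60),
})

train_features = add_rolling_spend_features(history)            # offline / training path
online_features = add_rolling_spend_features(history.tail(30))  # online / serving path
</syntaxhighlight>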
Experimentation: Data scientists run hundreds of experiments varying hyperparameters, architectures, and datasets. Without experiment tracking, it's impossible to reproduce results or understand what caused improvements.
Training infrastructure: Large models require distributed training across many GPUs. This involves data parallelism (split batches across GPUs), model parallelism (split model layers across GPUs), and pipeline parallelism (combine both). Frameworks like PyTorch FSDP, DeepSpeed, and Megatron-LM enable this.
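As a rough illustration of data parallelism only, the sketch below wraps a toy PyTorch model in DistributedDataParallel; it assumes the script is launched with torchrun so each process owns one GPU. FSDP and DeepSpeed follow the same launch pattern but additionally shard parameters, gradients, and optimizer state.

<syntaxhighlight lang="python">
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")             # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])           # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank)     # toy model standing in for a real network
    model = DDP(model, device_ids=[local_rank])          # gradients all-reduced across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    inputs = torch.randn(32, 128, device=local_rank)     # each rank would load its own data shard
    targets = torch.randint(0, 2, (32,), device=local_rank)

    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()                                       # gradient synchronization happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
</syntaxhighlight>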
Deployment: A model must be packaged, versioned, and deployed to production infrastructure. Serving requirements differ radically: a real-time API needs <100ms latency; a batch processing job can run for hours.
Monitoring: In production, models decay. Data drift, concept drift, and distribution shifts silently degrade performance. Without monitoring, you don't know your model is broken until users complain. MLOps monitoring tracks prediction distributions, feature drift, upstream data quality, and business outcome metrics.
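A minimal drift check, as a sketch: compare the distribution of a feature (or of model scores) in recent traffic against a reference window with a two-sample Kolmogorov-Smirnov test. The window sizes and significance threshold here are placeholder choices.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # e.g., feature values seen at training time
live = rng.normal(loc=0.3, scale=1.0, size=5000)       # e.g., the last 24h of production values

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:                                      # placeholder alert threshold
    print(f"Drift suspected: KS statistic={stat:.3f}, p={p_value:.2e}")
</syntaxhighlight>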
The maturity of MLOps in an organization is a spectrum: from ad-hoc scripts to fully automated, continuously trained, monitored production systems.
Applying
End-to-end ML pipeline with MLflow and FastAPI:
<syntaxhighlight lang="python">
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# === EXPERIMENT TRACKING ===
mlflow.set_experiment("customer-churn-prediction")

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

with mlflow.start_run(run_name="GBM-v3"):
    # Log parameters
    params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.1}
    mlflow.log_params(params)

    # Train model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, y_prob))

    # Log model artifact and register it
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="churn-predictor")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

# === MODEL SERVING with FastAPI ===
from fastapi import FastAPI
import mlflow.sklearn

app = FastAPI()
# Load the registry version promoted to the Production stage; the sklearn
# flavor exposes predict_proba, which the generic pyfunc interface does not.
model = mlflow.sklearn.load_model("models:/churn-predictor/Production")

@app.post("/predict")
async def predict(features: dict):
    df = pd.DataFrame([features])
    churn_probability = float(model.predict_proba(df)[0, 1])
    return {"churn_probability": churn_probability,
            "will_churn": churn_probability > 0.5}
</syntaxhighlight>
MLOps tool ecosystem:
- Experiment tracking → MLflow, Weights & Biases, Neptune, CometML
- Pipeline orchestration → Apache Airflow, Prefect, Kubeflow Pipelines, ZenML
- Feature store → Feast, Tecton, Hopsworks, Databricks Feature Store
- Model registry → MLflow Model Registry, Hugging Face Hub, W&B Model Registry
- Model serving (traditional) → BentoML, Seldon, TorchServe, MLflow Serving
- LLM serving → vLLM, Text Generation Inference (TGI), Triton Inference Server
- Monitoring → Evidently AI, WhyLabs, Arize, Fiddler
- GPU training infrastructure → SLURM + GPUs, AWS SageMaker, Azure ML, Google Vertex AI
Analyzing

MLOps maturity levels:
| Level | Description | Automation | Typical Organization |
|---|---|---|---|
| Level 0 | Manual process: scripts, Jupyter notebooks | None | Startups, research teams |
| Level 1 | ML pipelines automated; manual deployment | Data pipeline CT | Small ML teams |
| Level 2 | Full CI/CD for ML; automated retraining | Full CT + CD | Mature ML teams |
| Level 3 | Continuous monitoring, auto-retraining, auto-deployment | Fully automated | Large-scale ML platforms |
Common failures and anti-patterns:
- Training-serving skew — Features computed differently at training time vs. inference time. Example: training used the full historical average; inference uses a 30-day rolling average. Results in silent model degradation. A feature store solves this.
- No monitoring — Model is deployed and forgotten. Performance degrades silently as data distribution shifts. Always implement prediction monitoring on day one.
- Irreproducible experiments — Data and code versions not tracked; can't reproduce the best model from 3 months ago. Use data versioning (DVC, Delta Lake) and code pinning from the start.
- Model versioning chaos — Multiple model versions in production with no tracking of which version serves which traffic. Model registry + blue/green deployment is the solution.
- GPU waste — Training jobs that reserve GPUs but run at 10% utilization due to data loading bottlenecks. Profile GPU utilization; use multiple data loader workers, DALI for GPU-accelerated preprocessing.
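To address the data-loading bottleneck in the last item, a sketch of the usual first fix in PyTorch: move loading and preprocessing into parallel DataLoader workers and use pinned memory with non-blocking host-to-device copies. The worker count and prefetch settings are placeholders to tune per machine.

<syntaxhighlight lang="python">
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for a real disk- or object-store-backed dataset
dataset = TensorDataset(torch.randn(8192, 128), torch.randint(0, 10, (8192,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # parallel CPU workers for loading/preprocessing
    pin_memory=True,          # page-locked host memory speeds host-to-device copies
    prefetch_factor=4,        # batches each worker prepares ahead of time
    persistent_workers=True,  # keep workers alive between epochs
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for inputs, labels in loader:
    inputs = inputs.to(device, non_blocking=True)   # overlap copy with GPU compute
    labels = labels.to(device, non_blocking=True)
    break  # the training step would run here
</syntaxhighlight>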
Evaluating
Expert MLOps evaluation assesses the entire system, not just the model:
Deployment reliability: What is the rollback time if a new model causes a regression? Blue/green and canary deployments allow traffic shifting with instant rollback.
Latency SLA compliance: Track p50/p95/p99 inference latency. Alert if p95 exceeds target. Profile inference to identify bottlenecks (tokenization, model forward pass, post-processing).
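A sketch of the percentile bookkeeping, assuming per-request latencies are already being logged: compute p50/p95/p99 over a window and compare p95 against the SLO target (the target and the synthetic latency sample are placeholders).

<syntaxhighlight lang="python">
import numpy as np

# Stand-in for a window of logged per-request latencies, in milliseconds
latencies_ms = np.random.default_rng(1).lognormal(mean=3.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
SLO_P95_MS = 100  # placeholder latency target

print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")
if p95 > SLO_P95_MS:
    print("p95 latency SLO breached; alert and profile the serving path")
</syntaxhighlight>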
Data pipeline health: Feature pipeline reliability (what % of features are computed on time?), data freshness lag, schema validation failure rates. Upstream data quality directly impacts model quality.
Model decay tracking: Plot key prediction metrics (PSI for probability distribution shift, prediction mean, business KPIs) over time. Statistical tests (Kolmogorov-Smirnov, Population Stability Index) detect distribution drift before it degrades business metrics.
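A compact Population Stability Index computation, as a sketch: bin the reference (expected) distribution, apply the same bins to the live (actual) distribution, and sum the per-bin divergence. The commonly quoted alert points (roughly 0.1 for moderate shift, 0.25 for significant shift) are heuristics, not standards.

<syntaxhighlight lang="python">
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D samples."""
    # Quantile bin edges from the reference distribution avoid empty buckets
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)

    # Small floor keeps the log finite when a bucket is empty
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
score = psi(rng.normal(0, 1, 10_000), rng.normal(0.25, 1.1, 10_000))
print(f"PSI = {score:.3f}")
</syntaxhighlight>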
Expert practitioners define SLOs (Service Level Objectives) for ML systems just as they would for any software service: model accuracy > X% (monitored weekly), prediction latency p95 < Yms, feature pipeline freshness < Z minutes.
Creating
Designing a production ML infrastructure system:
1. Data infrastructure layer

<syntaxhighlight lang="text">
Raw data sources (databases, streaming, APIs)
    ↓
[Data lake: S3, GCS, ADLS — raw storage]
    ↓
[Data processing: Spark / dbt / Flink — transform to features]
    ↓
[Feature store: online (Redis/DynamoDB) + offline (Parquet/Delta Lake)]
    ↓
[Data versioning: DVC, Delta Lake time travel]
    ↓
[Data quality: Great Expectations, Deequ — validate before training]
</syntaxhighlight>
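A minimal stand-in for the data-quality gate at the end of this layer, assuming training rows arrive as a pandas DataFrame (the column names and rules are hypothetical); Great Expectations and Deequ express the same kinds of checks declaratively, with richer reporting.

<syntaxhighlight lang="python">
import pandas as pd

def validate_training_frame(df: pd.DataFrame) -> list[str]:
    """Return human-readable validation failures; an empty list means the data passes."""
    failures = []
    required = {"customer_id", "daily_spend", "label"}   # hypothetical schema
    missing = required - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
    if "label" in df and not df["label"].isin([0, 1]).all():
        failures.append("label column contains values outside {0, 1}")
    if "daily_spend" in df and (df["daily_spend"] < 0).any():
        failures.append("negative daily_spend values")
    if df.empty:
        failures.append("empty training frame")
    return failures

df = pd.DataFrame({"customer_id": [1, 2], "daily_spend": [10.0, 25.5], "label": [0, 1]})
problems = validate_training_frame(df)
assert not problems, f"Block training: {problems}"
</syntaxhighlight>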
2. Training infrastructure

<syntaxhighlight lang="text">
Experiment definition (config file: model, hyperparameters, dataset)
    ↓
[Orchestrator: Kubeflow / Airflow — schedule training job]
    ↓
[Distributed training: GPU cluster with FSDP/DeepSpeed]
    ↓
[Experiment tracking: MLflow / W&B — log metrics, artifacts]
    ↓
[Model evaluation: automated test suite + holdout evaluation]
    ↓
[Model registry: promote to Staging if metrics pass thresholds]
</syntaxhighlight>
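A sketch of the final promotion gate using the MLflow client, continuing the churn example from the Applying section; the accuracy threshold is a placeholder, and recent MLflow releases favor model aliases over the stage-transition API shown here, though the gating logic is the same.

<syntaxhighlight lang="python">
import mlflow
from mlflow.tracking import MlflowClient

ACCURACY_GATE = 0.85   # placeholder promotion threshold
client = MlflowClient()

# Find the best recent run in the experiment logged during training
runs = mlflow.search_runs(experiment_names=["customer-churn-prediction"],
                          order_by=["metrics.accuracy DESC"], max_results=1)
best_accuracy = runs.loc[0, "metrics.accuracy"]

if best_accuracy >= ACCURACY_GATE:
    # Promote the newest registered version of the model to Staging
    latest = client.get_latest_versions("churn-predictor", stages=["None"])[0]
    client.transition_model_version_stage(name="churn-predictor",
                                          version=latest.version, stage="Staging")
</syntaxhighlight>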
3. Serving infrastructure

<syntaxhighlight lang="text">
Model from registry
    ↓
[Container image build: Docker + model artifact]
    ↓
[Canary deployment: 5% traffic → new model]
    ↓
[A/B test: monitor business KPIs + latency]
    ↓
[Promote to 100% if no regression; rollback if regression detected]
    ↓
[Inference API: FastAPI / Triton / vLLM behind load balancer]
    ↓
[Auto-scaling: scale replicas on GPU utilization / queue depth]
</syntaxhighlight>
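The promote-or-rollback decision in the canary step boils down to comparing the same metrics for the two model versions; a sketch of that check follows, with metric names and tolerances as placeholders.

<syntaxhighlight lang="python">
def canary_decision(baseline: dict, canary: dict,
                    max_latency_regression_ms: float = 10.0,
                    max_kpi_drop: float = 0.01) -> str:
    """Compare canary metrics against the current production baseline."""
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] + max_latency_regression_ms:
        return "rollback: latency regression"
    if canary["conversion_rate"] < baseline["conversion_rate"] - max_kpi_drop:
        return "rollback: business KPI regression"
    return "promote to 100% traffic"

baseline = {"p95_latency_ms": 82.0, "conversion_rate": 0.041}
canary = {"p95_latency_ms": 85.5, "conversion_rate": 0.043}
print(canary_decision(baseline, canary))   # -> promote to 100% traffic
</syntaxhighlight>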
4. Monitoring and retraining loop
- Real-time prediction logging with sampling (100% is too expensive)
- Statistical drift tests run daily on sampled prediction distributions
- Alert on: latency SLO breach, drift detected, business KPI degradation
- Automated retraining triggered by drift alerts or scheduled (weekly)
- Human approval gate before promoting retrained model to production
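Tying the loop together, a sketch of the retraining trigger described above: fire when drift or KPI degradation is detected, or when the scheduled interval has elapsed, while keeping the human approval gate before promotion. The weekly interval and flag names are placeholders.

<syntaxhighlight lang="python">
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=7)   # placeholder: scheduled weekly retraining

def should_retrain(drift_detected: bool, kpi_degraded: bool,
                   last_trained_at: datetime, now: datetime | None = None) -> bool:
    """Trigger retraining on drift, KPI degradation, or schedule."""
    now = now or datetime.utcnow()
    return drift_detected or kpi_degraded or (now - last_trained_at >= RETRAIN_INTERVAL)

if should_retrain(drift_detected=True, kpi_degraded=False,
                  last_trained_at=datetime(2026, 4, 1)):
    print("Queue retraining job; hold the new model for human approval before promotion")
</syntaxhighlight>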