Fine-tuning LLMs
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Fine-tuning is the process of taking a large language model (LLM) that has already been pre-trained on a vast corpus and continuing its training on a smaller, task-specific dataset to specialize its capabilities. It is one of the most powerful techniques in practical AI deployment, enabling organizations to adapt frontier models to domain-specific language, formats, reasoning styles, or behaviors — often with only thousands of examples. Fine-tuning sits at the intersection of deep learning theory and production engineering.
Remembering
- Pre-training — The initial phase where a model is trained on massive, general-purpose datasets to develop broad language capabilities. This is done once and is extremely expensive.
- Fine-tuning — Continuing training of a pre-trained model on a smaller dataset to specialize behavior. The model's weights are adjusted, typically starting from the pre-trained state.
- Supervised Fine-Tuning (SFT) — Fine-tuning on labeled input-output pairs, teaching the model to follow instructions or produce specific response formats.
- Instruction tuning — A form of SFT where the model is trained on instruction-following examples to make it more helpful and controllable.
- RLHF (Reinforcement Learning from Human Feedback) — A multi-stage process: SFT, then reward model training, then RL optimization — used to align model outputs with human preferences.
- LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen base model weights, drastically reducing compute and memory requirements.
- QLoRA — LoRA applied to a quantized base model (typically 4-bit), enabling fine-tuning of large models on consumer GPUs.
- PEFT (Parameter-Efficient Fine-Tuning) — An umbrella term for methods like LoRA, Prefix Tuning, and Adapter layers that update only a small fraction of model parameters.
- Catastrophic forgetting — The tendency of a model to lose previously learned capabilities when trained extensively on new data.
- Learning rate — Typically much lower during fine-tuning than pre-training (e.g., 1e-5 to 2e-4) to avoid destroying pre-trained representations.
- Chat template — A structured format for instruction-tuned models defining how system prompts, user turns, and assistant turns are delimited (a short template sketch follows this list).
- Prompt template — The format used to structure training examples, which must match the format used at inference time.
- Validation loss — The key metric monitored during fine-tuning to detect overfitting and determine when to stop.
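
Getting chat and prompt templates right by hand is error-prone; modern tokenizers can render them for you. Here is a minimal sketch using the HuggingFace apply_chat_template API; the checkpoint name and message contents are placeholder assumptions:

<syntaxhighlight lang="python">
from transformers import AutoTokenizer

# Placeholder checkpoint; any instruction-tuned model that ships a chat template works
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain LoRA in one sentence."},
]

# Renders the conversation with the model's own special tokens and delimiters,
# keeping training and inference formats in sync
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)
</syntaxhighlight>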
Understanding
Fine-tuning works because pre-trained LLMs have already learned rich representations of language, facts, and reasoning patterns. Fine-tuning doesn't teach the model new knowledge so much as it reconfigures how the model accesses and expresses what it already knows.
Analogy: A pre-trained LLM is like a broadly educated graduate. Fine-tuning is like a specialized internship — they don't forget everything they learned in university; they learn how to apply their knowledge in a specific context, following specific conventions and communicating in specific ways.
Full fine-tuning updates all model parameters. It is most powerful but requires enormous compute (multiple GPUs, hours to days) and is prone to catastrophic forgetting of general capabilities.
LoRA (Low-Rank Adaptation) is the dominant technique in practice. It freezes the original weights and adds small trainable matrices A and B to each attention layer such that the effective weight update is W + ΔW = W + AB, where A is d×r and B is r×d, with rank r ≪ d. With r=16, a 7B model might add only ~20M trainable parameters (0.3% of total). This dramatically reduces compute, memory, and overfitting risk.
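
To make the algebra concrete, here is a minimal LoRA-style linear layer in PyTorch. This is an illustrative sketch of the W + AB idea, not how the peft library implements it internally (details such as initialization, dropout, and scaling conventions differ):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update AB."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weight W (and bias)
        d_in, d_out = base.in_features, base.out_features
        # A is d_in x r and B is r x d_out, so AB has the shape of a weight update
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(r, d_out))        # zero init: ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A) @ self.B)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 16 = 131,072 per adapted projection
</syntaxhighlight>

The total number of added parameters scales with the rank and with how many projection matrices are adapted, which is why reported counts vary between setups.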
The data format matters enormously. Fine-tuning teaches the model a specific input-output pattern. If training examples don't precisely match the inference format (including chat templates, special tokens, and prompt structures), the model will underperform.
Applying
LoRA fine-tuning with HuggingFace + PEFT:
<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import datasets

# Load base model (quantized for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,  # QLoRA: quantize to 4-bit
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Training setup
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_steps=100,
    logging_steps=25,
)

# Dataset: each sample has a "text" field with the full formatted prompt+response
dataset = datasets.load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
</syntaxhighlight>
Data format for instruction tuning (Llama chat template); a string-level sketch follows this list:
- System → Defines the model's role and constraints
- User turn → The instruction or question
- Assistant turn → The desired response (what the model learns to produce)
- Special tokens → [INST], [/INST], <<SYS>> etc. must exactly match the model's chat template
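
At the string level, a Llama-2-style SFT example looks roughly like the following. This is a sketch: the exact delimiters belong to the model's chat template, so prefer generating them with apply_chat_template rather than hard-coding strings; the prompts here are invented:

<syntaxhighlight lang="python">
def format_llama2_example(system: str, user: str, assistant: str) -> str:
    # Llama-2 convention: <<SYS>> wraps the system prompt inside the first [INST] block
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST] {assistant} </s>"
    )

example = format_llama2_example(
    "You are a support agent for Acme Corp.",  # hypothetical system prompt
    "How do I reset my password?",
    "Go to Settings > Security and choose Reset password.",
)
print(example)
</syntaxhighlight>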
Analyzing
Fine-tuning Method Comparison:

| Method | Params Updated | GPU Memory | Risk of Forgetting | Quality |
|---|---|---|---|---|
| Full fine-tuning | 100% | Very high (multiple GPUs) | High | Highest |
| LoRA | 0.1–1% | Low (1 GPU possible) | Low | Near-full for most tasks |
| QLoRA | 0.1–1% (on 4-bit model) | Very low (fits on 24GB GPU) | Low | Slightly below LoRA |
| Prefix tuning | ~0.1% | Low | Very low | Moderate |
| Prompt tuning | ~0.01% | Very low | Very low | Lower than LoRA |
Failure modes:
- Overfitting on small datasets — With <500 examples, the model can memorize rather than generalize. Monitor validation loss and stop early (see the early-stopping sketch after this list).
- Format mismatch — Training on incorrectly formatted examples causes the model to generate malformed outputs or include spurious tokens.
- Instruction following collapse — Aggressive fine-tuning can make the model rigid, losing the flexibility to handle instructions it wasn't trained on.
- Reward hacking (RLHF) — The model learns to produce responses that score well according to the reward model without actually being more helpful — for example, becoming verbose without substance.
- Capability regression — Fine-tuning on a narrow task can degrade performance on other tasks. Evaluate on a broad benchmark before and after.
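
For the overfitting failure mode above, the HuggingFace Trainer can stop on stalled validation loss. A sketch, assuming a recent transformers version (argument names have shifted slightly across releases):

<syntaxhighlight lang="python">
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./finetuned_model",
    evaluation_strategy="steps",        # evaluate on the validation split during training
    eval_steps=50,
    save_steps=50,                      # align with eval_steps so best-model loading works
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Pass an eval_dataset and this callback to the Trainer / SFTTrainer:
# callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
</syntaxhighlight>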
Evaluating
Expert practitioners treat fine-tuning evaluation as multi-dimensional:
Task-specific metrics: Whatever the downstream task demands — ROUGE for summarization, exact match for QA, pass@k for code generation, human preference rates for chat.
General capability retention: Run the fine-tuned model on standard benchmarks (MMLU, HellaSwag, HumanEval) to verify general capabilities weren't degraded. A model fine-tuned for customer service shouldn't lose its ability to reason.
Alignment and safety evaluation: Does fine-tuning introduce new failure modes? Run adversarial prompts, jailbreak attempts, and harmful content evaluations on the fine-tuned model.
Human preference evaluation (A/B testing): For conversational models, human raters compare base model vs. fine-tuned model outputs on real user queries. This is the ground truth for whether fine-tuning achieved its goal.
Expert practitioners maintain a regression test suite — a fixed set of prompts with expected behaviors — and run it after every fine-tuning run to catch regressions automatically.
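
A minimal sketch of such a suite; the file name, record schema, and generate_fn are hypothetical stand-ins for your own harness:

<syntaxhighlight lang="python">
import json

def run_regression_suite(generate_fn, suite_path="regression_suite.jsonl"):
    """Each line: {"prompt": "...", "must_contain": ["..."]} (hypothetical schema)."""
    failures = []
    with open(suite_path) as f:
        for line in f:
            case = json.loads(line)
            output = generate_fn(case["prompt"])
            missing = [s for s in case["must_contain"] if s not in output]
            if missing:
                failures.append({"prompt": case["prompt"], "missing": missing})
    return failures

# Run after every fine-tuning job and fail the pipeline if the list is non-empty:
# failures = run_regression_suite(lambda p: my_model_generate(p))
</syntaxhighlight>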
Creating
Designing a full fine-tuning pipeline:
1. Dataset curation (most important step); a runnable sketch of this pipeline follows the diagram.

<syntaxhighlight lang="text">
Source data collection (domain documents, logs, demonstrations)
        ↓
Quality filtering (deduplication, length filtering, toxic content removal)
        ↓
Formatting (convert to chat template, add system prompt)
        ↓
Review sample (manually inspect 100+ examples)
        ↓
Train/validation split (90/10 or 95/5)
</syntaxhighlight>
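
A sketch of the filtering and splitting stages using only the standard library; the input file, field names, and length thresholds are assumptions:

<syntaxhighlight lang="python">
import json
import hashlib
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

examples = load_jsonl("raw.jsonl")  # hypothetical source file with a "text" field

# Exact-match deduplication via content hashes
seen, deduped = set(), []
for ex in examples:
    h = hashlib.sha256(ex["text"].encode("utf-8")).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(ex)

# Length filtering (character thresholds are illustrative)
filtered = [ex for ex in deduped if 50 < len(ex["text"]) < 8000]

# 90/10 train/validation split with a fixed seed for reproducibility
random.seed(0)
random.shuffle(filtered)
cut = int(0.9 * len(filtered))
train, val = filtered[:cut], filtered[cut:]
</syntaxhighlight>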
2. Training configuration decision tree
- <1k examples and 1 GPU → QLoRA with early stopping (see the 4-bit loading sketch after this list)
- 1k–100k examples and 2–8 GPUs → LoRA with gradient checkpointing
- >100k examples and production budget → Full fine-tune with DDP/FSDP
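
For the first branch, a hedged sketch of loading a 4-bit base model for QLoRA with a BitsAndBytesConfig (the config object is the more explicit alternative to the load_in_4bit shortcut used earlier; bitsandbytes must be installed):

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA paper's data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on top of 4-bit storage
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade recompute for memory during training
</syntaxhighlight>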
3. Iterative refinement loop (a preference-record sketch for the v3 stage follows the diagram).

<syntaxhighlight lang="text">
v1: SFT on demonstrations
    ↓ evaluate → identify failure cases
v2: Add failure case examples to dataset, retrain
    ↓ evaluate → identify preference gaps
v3: Collect human preference data → train reward model → PPO/DPO fine-tune
</syntaxhighlight>
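
For the v3 stage, DPO-style trainers in trl consume preference pairs rather than single demonstrations. A sketch of one record (the field names follow trl's documented prompt/chosen/rejected convention; the content is invented):

<syntaxhighlight lang="python">
# One preference record for DPO training (stored as JSONL, one object per line)
preference_example = {
    "prompt": "Summarize our refund policy in two sentences.",
    "chosen": "Refunds are available within 14 days of purchase with proof of payment. "
              "Contact support to start the process.",
    "rejected": "Our refund policy is great! You will love it.",
}
</syntaxhighlight>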
4. Serving the fine-tuned model
- Merge LoRA adapters into base model: model.merge_and_unload() (see the merge sketch after this list)
- Export to GGUF format for llama.cpp (local/edge deployment)
- Push to HuggingFace Hub or deploy with vLLM for API serving
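
A sketch of the merge-and-save step with peft; the paths are placeholders. merge_and_unload folds the low-rank update into the base weights, so serving needs no adapter-specific code:

<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./finetuned_model")  # adapter checkpoint path

merged = model.merge_and_unload()          # W <- W + (alpha/r) * AB, adapters removed
merged.save_pretrained("./merged_model")   # ready for vLLM or conversion to GGUF
</syntaxhighlight>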