Fine-tuning LLMs


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Fine-tuning is the process of taking a large language model (LLM) that has already been pre-trained on a vast corpus and continuing its training on a smaller, task-specific dataset to specialize its capabilities. It is one of the most powerful techniques in practical AI deployment, enabling organizations to adapt frontier models to domain-specific language, formats, reasoning styles, or behaviors — often with only thousands of examples. Fine-tuning sits at the intersection of deep learning theory and production engineering.

Remembering

  • Pre-training — The initial phase where a model is trained on massive, general-purpose datasets to develop broad language capabilities. This is done once and is extremely expensive.
  • Fine-tuning — Continuing training of a pre-trained model on a smaller dataset to specialize behavior. The model's weights are adjusted, typically starting from the pre-trained state.
  • Supervised Fine-Tuning (SFT) — Fine-tuning on labeled input-output pairs, teaching the model to follow instructions or produce specific response formats.
  • Instruction tuning — A form of SFT where the model is trained on instruction-following examples to make it more helpful and controllable.
  • RLHF (Reinforcement Learning from Human Feedback) — A multi-stage process: SFT, then reward model training, then RL optimization — used to align model outputs with human preferences.
  • LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen base model weights, drastically reducing compute and memory requirements.
  • QLoRA — LoRA applied to a quantized base model (typically 4-bit), enabling fine-tuning of large models on consumer GPUs.
  • PEFT (Parameter-Efficient Fine-Tuning) — An umbrella term for methods like LoRA, Prefix Tuning, and Adapter layers that update only a small fraction of model parameters.
  • Catastrophic forgetting — The tendency of a model to lose previously learned capabilities when trained extensively on new data.
  • Learning rate — Typically much lower during fine-tuning than pre-training (e.g., 1e-5 to 2e-4) to avoid destroying pre-trained representations.
  • Chat template — A structured format for instruction-tuned models defining how system prompts, user turns, and assistant turns are delimited.
  • Prompt template — The format used to structure training examples, which must match the format used at inference time.
  • Validation loss — The key metric monitored during fine-tuning to detect overfitting and determine when to stop.

Understanding

Fine-tuning works because pre-trained LLMs have already learned rich representations of language, facts, and reasoning patterns. Fine-tuning doesn't teach the model new knowledge so much as it reconfigures how the model accesses and expresses what it already knows.

Analogy: A pre-trained LLM is like a broadly educated graduate. Fine-tuning is like a specialized internship — they don't forget everything they learned in university; they learn how to apply their knowledge in a specific context, following specific conventions and communicating in specific ways.

Full fine-tuning updates all model parameters. It is most powerful but requires enormous compute (multiple GPUs, hours to days) and is prone to catastrophic forgetting of general capabilities.

LoRA (Low-Rank Adaptation) is the dominant technique in practice. It freezes the original weights and adds small trainable matrices A and B to each attention layer such that the effective weight update is W + ΔW = W + AB, where A is d×r and B is r×d, with rank r ≪ d. With r=16, a 7B model might add only ~20M trainable parameters (0.3% of total). This dramatically reduces compute, memory, and overfitting risk.
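To make the mechanics concrete, the sketch below implements a single LoRA-style linear layer in PyTorch. It is illustrative only (the class name LoRALinear and its initialization choices are ours, not taken from the peft library), but it shows how the frozen weight and the trainable factors A and B combine, and how few parameters the factors add for a single 4096×4096 projection.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: output = frozen_linear(x) + (alpha/r) * x @ A @ B."""
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze the pre-trained weight W
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # d x r
        self.B = nn.Parameter(torch.zeros(r, d_out))         # r x d, zero-init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A) @ self.B

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")   # 131,072 of 16,908,288 for this one layer
</syntaxhighlight>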

The data format matters enormously. Fine-tuning teaches the model a specific input-output pattern. If training examples don't precisely match the inference format (including chat templates, special tokens, and prompt structures), the model will underperform.
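One practical way to keep the training and inference formats identical is to build the training text from the model's own chat template rather than hand-writing delimiters. A minimal sketch, assuming a chat/instruct checkpoint whose tokenizer ships a chat template (the model name and messages are illustrative):

<syntaxhighlight lang="python">
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # illustrative checkpoint

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings > Account > Reset password."},
]

# Render the exact string the model expects; use the same call (without the assistant
# turn and with add_generation_prompt=True) when building prompts at inference time.
training_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(training_text)
</syntaxhighlight>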

Applying

LoRA fine-tuning with HuggingFace + PEFT:

<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import datasets

# Load base model (quantized for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,      # QLoRA: quantize to 4-bit
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    r=16,                           # Rank
    lora_alpha=32,                  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Training setup
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_steps=100,
    logging_steps=25,
)

# Dataset: each sample has a "text" field with the full formatted prompt + response
dataset = datasets.load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
</syntaxhighlight>
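After training, a quick generation check with the adapter still attached helps confirm the model actually produces the expected format before merging or deployment. A minimal sketch reusing the model and tokenizer from the block above (the prompt string is illustrative and must follow the same template as the training data):

<syntaxhighlight lang="python">
# Smoke test: generate from the fine-tuned (adapter-attached) model.
prompt = "[INST] How do I reset my password? [/INST]"   # illustrative; match your training format

model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
</syntaxhighlight>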

Data format for instruction tuning (Llama chat template):

  • System → Defines the model's role and constraints
  • User turn → The instruction or question
  • Assistant turn → The desired response (what the model learns to produce)
  • Special tokens → [INST], [/INST], <<SYS>>, etc. must exactly match the model's chat template
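As a concrete (and invented) example, one line of train.jsonl for the Llama-2 template above might be assembled like this; the system prompt, question, and answer are placeholders:

<syntaxhighlight lang="python">
import json

# Invented example; the Llama-2 template wraps the system prompt in <<SYS>> tags
# and the user turn in [INST] ... [/INST], with the assistant reply following.
record = {
    "text": (
        "<s>[INST] <<SYS>>\nYou are a concise support assistant.\n<</SYS>>\n\n"
        "How do I reset my password? [/INST] "
        "Open Settings > Account > Reset password. </s>"
    )
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
</syntaxhighlight>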

Analyzing

Fine-tuning Method Comparison

{| class="wikitable"
! Method !! Params Updated !! GPU Memory !! Risk of Forgetting !! Quality
|-
| Full fine-tuning || 100% || Very high (multiple GPUs) || High || Highest
|-
| LoRA || 0.1–1% || Low (1 GPU possible) || Low || Near-full for most tasks
|-
| QLoRA || 0.1–1% (on 4-bit model) || Very low (fits on 24GB GPU) || Low || Slightly below LoRA
|-
| Prefix tuning || ~0.1% || Low || Very low || Moderate
|-
| Prompt tuning || ~0.01% || Very low || Very low || Lower than LoRA
|}

Failure modes:

  • Overfitting on small datasets — With <500 examples, the model can memorize rather than generalize. Monitor validation loss and stop early (see the early-stopping sketch after this list).
  • Format mismatch — Training on incorrectly formatted examples causes the model to generate malformed outputs or include spurious tokens.
  • Instruction following collapse — Aggressive fine-tuning can make the model rigid, losing the flexibility to handle instructions it wasn't trained on.
  • Reward hacking (RLHF) — The model learns to produce responses that score well according to the reward model without actually being more helpful — for example, becoming verbose without substance.
  • Capability regression — Fine-tuning on a narrow task can degrade performance on other tasks. Evaluate on a broad benchmark before and after.
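Early stopping on validation loss can be wired directly into the trainer from the Applying section. A minimal sketch, assuming the model, train/validation datasets, and tokenizer from that section already exist; argument names follow recent transformers releases and may differ slightly across versions:

<syntaxhighlight lang="python">
from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./finetuned_model",
    num_train_epochs=3,
    evaluation_strategy="steps",        # evaluate on the validation set during training
    eval_steps=50,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                        # PEFT-wrapped model from the Applying section
    args=training_args,
    train_dataset=train_dataset,        # e.g. 90% of the JSONL data
    eval_dataset=val_dataset,           # held-out 10%
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)
trainer.train()
</syntaxhighlight>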

Evaluating

Expert practitioners treat fine-tuning evaluation as multi-dimensional:

Task-specific metrics: Whatever the downstream task demands — ROUGE for summarization, exact match for QA, pass@k for code generation, human preference rates for chat.

General capability retention: Run the fine-tuned model on standard benchmarks (MMLU, HellaSwag, HumanEval) to verify general capabilities weren't degraded. A model fine-tuned for customer service shouldn't lose its ability to reason.

Alignment and safety evaluation: Does fine-tuning introduce new failure modes? Run adversarial prompts, jailbreak attempts, and harmful content evaluations on the fine-tuned model.

Human preference evaluation (A/B testing): For conversational models, human raters compare base model vs. fine-tuned model outputs on real user queries. This is the ground truth for whether fine-tuning achieved its goal.

Expert practitioners maintain a regression test suite — a fixed set of prompts with expected behaviors — and run it after every fine-tuning run to catch regressions automatically.
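A regression suite can be as simple as a fixed list of prompts paired with lightweight checks. The sketch below is illustrative; the prompts, checks, and the generate_fn callable are placeholders for whatever your deployment actually uses:

<syntaxhighlight lang="python">
# Minimal regression harness: fixed prompts with simple behavioral checks.
# generate_fn is a placeholder for however you call the fine-tuned model.

REGRESSION_CASES = [
    {"prompt": "Summarize: The meeting was moved to Friday.",
     "check": lambda out: "Friday" in out},
    {"prompt": "Respond in valid JSON with a 'status' key.",
     "check": lambda out: '"status"' in out},
]

def run_regression(generate_fn):
    failures = []
    for case in REGRESSION_CASES:
        output = generate_fn(case["prompt"])
        if not case["check"](output):
            failures.append((case["prompt"], output))
    print(f"{len(REGRESSION_CASES) - len(failures)}/{len(REGRESSION_CASES)} cases passed")
    return failures
</syntaxhighlight>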

Creating

Designing a full fine-tuning pipeline:

1. Dataset curation (most important step)

<syntaxhighlight lang="text">
Source data collection (domain documents, logs, demonstrations)
        ↓
Quality filtering (deduplication, length filtering, toxic content removal)
        ↓
Formatting (convert to chat template, add system prompt)
        ↓
Review sample (manually inspect 100+ examples)
        ↓
Train/validation split (90/10 or 95/5)
</syntaxhighlight>
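A compressed sketch of these curation steps in Python, using exact-match deduplication, a crude length filter, and a random 90/10 split (the file names and thresholds are placeholders):

<syntaxhighlight lang="python">
import json
import random

with open("raw_examples.jsonl") as f:                  # placeholder source file
    records = [json.loads(line) for line in f]

# Quality filtering: exact-duplicate removal and a simple length filter
seen, cleaned = set(), []
for r in records:
    text = r["text"].strip()
    if text and text not in seen and 20 < len(text) < 8000:
        seen.add(text)
        cleaned.append({"text": text})

# Train/validation split (90/10)
random.seed(0)
random.shuffle(cleaned)
split = int(0.9 * len(cleaned))
for name, rows in [("train.jsonl", cleaned[:split]), ("val.jsonl", cleaned[split:])]:
    with open(name, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)
</syntaxhighlight>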

2. Training configuration decision tree

  • <1k examples and 1 GPU → QLoRA with early stopping
  • 1k–100k examples and 2–8 GPUs → LoRA with gradient checkpointing
  • >100k examples and production budget → Full fine-tune with DDP/FSDP
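As an example of the first branch, a QLoRA run on a single GPU might load the base model like this; the BitsAndBytesConfig values shown are common defaults rather than requirements:

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()        # trade compute for memory on the LoRA/QLoRA branches
</syntaxhighlight>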

3. Iterative refinement loop

<syntaxhighlight lang="text">
v1: SFT on demonstrations
    ↓ evaluate → identify failure cases
v2: Add failure case examples to dataset, retrain
    ↓ evaluate → identify preference gaps
v3: Collect human preference data → train reward model → PPO/DPO fine-tune
</syntaxhighlight>
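For the v3 stage, preference data is usually stored as (prompt, chosen, rejected) triples. The record below is an invented example of the format commonly expected by DPO-style trainers:

<syntaxhighlight lang="python">
import json

# Invented example of one preference pair for DPO-style training.
pair = {
    "prompt": "Explain what a LoRA adapter is in one sentence.",
    "chosen": "A LoRA adapter is a small pair of low-rank matrices trained on top of a frozen base model.",
    "rejected": "LoRA is a type of GPU.",
}

with open("preferences.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")
</syntaxhighlight>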

4. Serving the fine-tuned model

  • Merge LoRA adapters into base model: model.merge_and_unload()
  • Export to GGUF format for llama.cpp (local/edge deployment)
  • Push to HuggingFace Hub or deploy with vLLM for API serving
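A minimal sketch of the merge-and-export step, assuming the adapter was saved to the output directory used in the Applying section (paths and the Hub repository id are placeholders):

<syntaxhighlight lang="python">
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in full precision and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./finetuned_model")   # adapter path from training

# Fold the adapter weights into the base weights for standalone serving.
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./merged_model")

# Optionally push to the Hub (placeholder repo id) or point vLLM at ./merged_model.
# merged.push_to_hub("your-org/your-finetuned-model")
</syntaxhighlight>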