Fine-tuning LLMs


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Fine-tuning is the process of taking a large language model (LLM) that has already been pre-trained on a vast corpus and continuing its training on a smaller, task-specific dataset to specialize its capabilities. It is one of the most powerful techniques in practical AI deployment, enabling organizations to adapt frontier models to domain-specific language, formats, reasoning styles, or behaviors — often with only thousands of examples. Fine-tuning sits at the intersection of deep learning theory and production engineering.

Remembering

  • Pre-training — The initial phase where a model is trained on massive, general-purpose datasets to develop broad language capabilities. This is done once and is extremely expensive.
  • Fine-tuning — Continuing training of a pre-trained model on a smaller dataset to specialize behavior. The model's weights are adjusted, typically starting from the pre-trained state.
  • Supervised Fine-Tuning (SFT) — Fine-tuning on labeled input-output pairs, teaching the model to follow instructions or produce specific response formats.
  • Instruction tuning — A form of SFT where the model is trained on instruction-following examples to make it more helpful and controllable.
  • RLHF (Reinforcement Learning from Human Feedback) — A multi-stage process: SFT, then reward model training, then RL optimization — used to align model outputs with human preferences.
  • LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen base model weights, drastically reducing compute and memory requirements.
  • QLoRA — LoRA applied to a quantized base model (typically 4-bit), enabling fine-tuning of large models on consumer GPUs.
  • PEFT (Parameter-Efficient Fine-Tuning) — An umbrella term for methods like LoRA, Prefix Tuning, and Adapter layers that update only a small fraction of model parameters.
  • Catastrophic forgetting — The tendency of a model to lose previously learned capabilities when trained extensively on new data.
  • Learning rate — Typically much lower during fine-tuning than pre-training (e.g., 1e-5 to 2e-4) to avoid destroying pre-trained representations.
  • Chat template — A structured format for instruction-tuned models defining how system prompts, user turns, and assistant turns are delimited.
  • Prompt template — The format used to structure training examples, which must match the format used at inference time.
  • Validation loss — The key metric monitored during fine-tuning to detect overfitting and determine when to stop.

Understanding

Fine-tuning works because pre-trained LLMs have already learned rich representations of language, facts, and reasoning patterns. Fine-tuning doesn't teach the model new knowledge so much as it reconfigures how the model accesses and expresses what it already knows.

Analogy: A pre-trained LLM is like a broadly educated graduate. Fine-tuning is like a specialized internship — they don't forget everything they learned in university; they learn how to apply their knowledge in a specific context, following specific conventions and communicating in specific ways.

Full fine-tuning updates all model parameters. It is most powerful but requires enormous compute (multiple GPUs, hours to days) and is prone to catastrophic forgetting of general capabilities.

LoRA (Low-Rank Adaptation) is the dominant technique in practice. It freezes the original weights and adds small trainable matrices A and B to each attention layer such that the effective weight update is W + ΔW = W + AB, where A is d×r and B is r×d, with rank r ≪ d. With r=16, a 7B model might add only ~20M trainable parameters (0.3% of total). This dramatically reduces compute, memory, and overfitting risk.
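To make the mechanics concrete, the sketch below implements a single LoRA-style linear layer in PyTorch. It is illustrative only (the class name LoRALinear and its initialization choices are ours, not taken from the peft library), but it shows how the frozen weight and the trainable factors A and B combine, and how few parameters the factors add for a single 4096×4096 projection.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: output = frozen_linear(x) + (alpha/r) * x @ A @ B."""
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze the pre-trained weight W
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # d x r
        self.B = nn.Parameter(torch.zeros(r, d_out))         # r x d, zero-init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A) @ self.B

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")   # 131,072 of 16,908,288 for this one layer
</syntaxhighlight>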

The data format matters enormously. Fine-tuning teaches the model a specific input-output pattern. If training examples don't precisely match the inference format (including chat templates, special tokens, and prompt structures), the model will underperform.
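One practical way to keep the training and inference formats identical is to build the training text from the model's own chat template rather than hand-writing delimiters. A minimal sketch, assuming a chat/instruct checkpoint whose tokenizer ships a chat template (the model name and messages are illustrative):

<syntaxhighlight lang="python">
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # illustrative checkpoint

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings > Account > Reset password."},
]

# Render the exact string the model expects; use the same call (without the assistant
# turn and with add_generation_prompt=True) when building prompts at inference time.
training_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(training_text)
</syntaxhighlight>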

Applying

LoRA fine-tuning with HuggingFace + PEFT:

<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import datasets

# Load base model (quantized for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,      # QLoRA: quantize to 4-bit
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    r=16,                           # Rank
    lora_alpha=32,                  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

# Training setup
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_steps=100,
    logging_steps=25,
)

# Dataset: each sample has a "text" field with the full formatted prompt + response
dataset = datasets.load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
</syntaxhighlight>
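After training, a quick generation check with the adapter still attached helps confirm the model actually produces the expected format before merging or deployment. A minimal sketch reusing the model and tokenizer from the block above (the prompt string is illustrative and must follow the same template as the training data):

<syntaxhighlight lang="python">
# Smoke test: generate from the fine-tuned (adapter-attached) model.
prompt = "[INST] How do I reset my password? [/INST]"   # illustrative; match your training format

model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
</syntaxhighlight>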

Data format for instruction tuning (Llama chat template):

  • System → Defines the model's role and constraints
  • User turn → The instruction or question
  • Assistant turn → The desired response (what the model learns to produce)
  • Special tokens → [INST], [/INST], <<SYS>>, etc. must exactly match the model's chat template
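As a concrete (and invented) example, one line of train.jsonl for the Llama-2 template above might be assembled like this; the system prompt, question, and answer are placeholders:

<syntaxhighlight lang="python">
import json

# Invented example; the Llama-2 template wraps the system prompt in <<SYS>> tags
# and the user turn in [INST] ... [/INST], with the assistant reply following.
record = {
    "text": (
        "<s>[INST] <<SYS>>\nYou are a concise support assistant.\n<</SYS>>\n\n"
        "How do I reset my password? [/INST] "
        "Open Settings > Account > Reset password. </s>"
    )
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
</syntaxhighlight>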

Analyzing

Fine-tuning Method Comparison

{| class="wikitable"
! Method !! Params Updated !! GPU Memory !! Risk of Forgetting !! Quality
|-
| Full fine-tuning || 100% || Very high (multiple GPUs) || High || Highest
|-
| LoRA || 0.1–1% || Low (1 GPU possible) || Low || Near-full for most tasks
|-
| QLoRA || 0.1–1% (on 4-bit model) || Very low (fits on 24GB GPU) || Low || Slightly below LoRA
|-
| Prefix tuning || ~0.1% || Low || Very low || Moderate
|-
| Prompt tuning || ~0.01% || Very low || Very low || Lower than LoRA
|}

Failure modes:

  • Overfitting on small datasets — With <500 examples, the model can memorize rather than generalize. Monitor validation loss and stop early (see the early-stopping sketch after this list).
  • Format mismatch — Training on incorrectly formatted examples causes the model to generate malformed outputs or include spurious tokens.
  • Instruction following collapse — Aggressive fine-tuning can make the model rigid, losing the flexibility to handle instructions it wasn't trained on.
  • Reward hacking (RLHF) — The model learns to produce responses that score well according to the reward model without actually being more helpful — for example, becoming verbose without substance.
  • Capability regression — Fine-tuning on a narrow task can degrade performance on other tasks. Evaluate on a broad benchmark before and after.
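Early stopping on validation loss can be wired directly into the trainer from the Applying section. A minimal sketch, assuming the model, train/validation datasets, and tokenizer from that section already exist; argument names follow recent transformers releases and may differ slightly across versions:

<syntaxhighlight lang="python">
from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./finetuned_model",
    num_train_epochs=3,
    evaluation_strategy="steps",        # evaluate on the validation set during training
    eval_steps=50,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                        # PEFT-wrapped model from the Applying section
    args=training_args,
    train_dataset=train_dataset,        # e.g. 90% of the JSONL data
    eval_dataset=val_dataset,           # held-out 10%
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)
trainer.train()
</syntaxhighlight>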

Evaluating

Expert practitioners treat fine-tuning evaluation as multi-dimensional:

Task-specific metrics: Whatever the downstream task demands — ROUGE for summarization, exact match for QA, pass@k for code generation, human preference rates for chat.

General capability retention: Run the fine-tuned model on standard benchmarks (MMLU, HellaSwag, HumanEval) to verify general capabilities weren't degraded. A model fine-tuned for customer service shouldn't lose its ability to reason.

Alignment and safety evaluation: Does fine-tuning introduce new failure modes? Run adversarial prompts, jailbreak attempts, and harmful content evaluations on the fine-tuned model.

Human preference evaluation (A/B testing): For conversational models, human raters compare base model vs. fine-tuned model outputs on real user queries. This is the ground truth for whether fine-tuning achieved its goal.

Expert practitioners maintain a regression test suite — a fixed set of prompts with expected behaviors — and run it after every fine-tuning run to catch regressions automatically.
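A regression suite can be as simple as a fixed list of prompts paired with lightweight checks. The sketch below is illustrative; the prompts, checks, and the generate_fn callable are placeholders for whatever your deployment actually uses:

<syntaxhighlight lang="python">
# Minimal regression harness: fixed prompts with simple behavioral checks.
# generate_fn is a placeholder for however you call the fine-tuned model.

REGRESSION_CASES = [
    {"prompt": "Summarize: The meeting was moved to Friday.",
     "check": lambda out: "Friday" in out},
    {"prompt": "Respond in valid JSON with a 'status' key.",
     "check": lambda out: '"status"' in out},
]

def run_regression(generate_fn):
    failures = []
    for case in REGRESSION_CASES:
        output = generate_fn(case["prompt"])
        if not case["check"](output):
            failures.append((case["prompt"], output))
    print(f"{len(REGRESSION_CASES) - len(failures)}/{len(REGRESSION_CASES)} cases passed")
    return failures
</syntaxhighlight>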

Creating

Designing a full fine-tuning pipeline:

1. Dataset curation (most important step)

<syntaxhighlight lang="text">
Source data collection (domain documents, logs, demonstrations)
        ↓
Quality filtering (deduplication, length filtering, toxic content removal)
        ↓
Formatting (convert to chat template, add system prompt)
        ↓
Review sample (manually inspect 100+ examples)
        ↓
Train/validation split (90/10 or 95/5)
</syntaxhighlight>
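A compressed sketch of these curation steps in Python, using exact-match deduplication, a crude length filter, and a random 90/10 split (the file names and thresholds are placeholders):

<syntaxhighlight lang="python">
import json
import random

with open("raw_examples.jsonl") as f:                  # placeholder source file
    records = [json.loads(line) for line in f]

# Quality filtering: exact-duplicate removal and a simple length filter
seen, cleaned = set(), []
for r in records:
    text = r["text"].strip()
    if text and text not in seen and 20 < len(text) < 8000:
        seen.add(text)
        cleaned.append({"text": text})

# Train/validation split (90/10)
random.seed(0)
random.shuffle(cleaned)
split = int(0.9 * len(cleaned))
for name, rows in [("train.jsonl", cleaned[:split]), ("val.jsonl", cleaned[split:])]:
    with open(name, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)
</syntaxhighlight>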

2. Training configuration decision tree

  • <1k examples and 1 GPU → QLoRA with early stopping
  • 1k–100k examples and 2–8 GPUs → LoRA with gradient checkpointing
  • >100k examples and production budget → Full fine-tune with DDP/FSDP
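As an example of the first branch, a QLoRA run on a single GPU might load the base model like this; the BitsAndBytesConfig values shown are common defaults rather than requirements:

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()        # trade compute for memory on the LoRA/QLoRA branches
</syntaxhighlight>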

3. Iterative refinement loop

<syntaxhighlight lang="text">
v1: SFT on demonstrations
    ↓ evaluate → identify failure cases
v2: Add failure case examples to dataset, retrain
    ↓ evaluate → identify preference gaps
v3: Collect human preference data → train reward model → PPO/DPO fine-tune
</syntaxhighlight>
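For the v3 stage, preference data is usually stored as (prompt, chosen, rejected) triples. The record below is an invented example of the format commonly expected by DPO-style trainers:

<syntaxhighlight lang="python">
import json

# Invented example of one preference pair for DPO-style training.
pair = {
    "prompt": "Explain what a LoRA adapter is in one sentence.",
    "chosen": "A LoRA adapter is a small pair of low-rank matrices trained on top of a frozen base model.",
    "rejected": "LoRA is a type of GPU.",
}

with open("preferences.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")
</syntaxhighlight>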

4. Serving the fine-tuned model

  • Merge LoRA adapters into base model: model.merge_and_unload()
  • Export to GGUF format for llama.cpp (local/edge deployment)
  • Push to HuggingFace Hub or deploy with vLLM for API serving
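A minimal sketch of the merge-and-export step, assuming the adapter was saved to the output directory used in the Applying section (paths and the Hub repository id are placeholders):

<syntaxhighlight lang="python">
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in full precision and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./finetuned_model")   # adapter path from training

# Fold the adapter weights into the base weights for standalone serving.
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./merged_model")

# Optionally push to the Hub (placeholder repo id) or point vLLM at ./merged_model.
# merged.push_to_hub("your-org/your-finetuned-model")
</syntaxhighlight>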