
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Evaluation Method Comparison
! Method !! Scale !! Objectivity !! Ground Truth !! Cost
|-
| Automated metrics (accuracy, F1) || Unlimited || High || Required || Very low
|-
| Reference-based generation (BLEU, ROUGE) || Unlimited || Medium || Required || Very low
|-
| LLM-as-judge || Large || Medium (model bias) || Not required || Medium
|-
| Human evaluation || Small-medium || High || Not required || High
|-
| Red teaming || Small || Depends on expert judgment || Not required || Very high
|}

'''Key evaluation pitfalls:'''
* '''Test set contamination''': If the training data contains benchmark test items or close paraphrases of them, scores are inflated because the model may be recalling answers rather than solving the task. Decontamination analyses are now standard practice.
* '''Prompt sensitivity''': Rewording the same question can change a model's score by 5-20%. Robust evaluation reports results across multiple prompt templates rather than a single one.
* '''Task framing bias''': Multiple-choice benchmarks may reward skill at process of elimination rather than the underlying capability.
* '''LLM-judge bias''': LLM judges tend to favor longer, more confident-sounding answers and show strong position bias (preferring the first answer in pairwise comparisons). Mitigate by randomizing presentation order and calibrating the judge against human-labeled examples.
* '''Narrow benchmarks''': High scores on academic benchmarks may not correlate with real-world task performance. Always include a domain-specific evaluation.
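
Three of the pitfalls above lend themselves to small automated checks. The following is a minimal Python sketch, not a standard implementation: the function names, the word-level n-gram overlap measure, and the convention that the judge returns "first" or "second" are all assumptions made for illustration. Note also that it queries the judge in both orders and treats a flipped verdict as a tie, which is a stricter variant of the random ordering mentioned above.

```python
from statistics import mean
from typing import Callable, Dict


def ngram_overlap(test_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test item's word n-grams that also occur in training_text.

    A crude contamination signal (hypothetical helper): values near 1.0
    suggest the test item appears almost verbatim in the training data.
    """
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    test_grams = ngrams(test_item)
    if not test_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(test_grams & ngrams(training_text)) / len(test_grams)


def template_spread(scores_by_template: Dict[str, float]) -> Dict[str, float]:
    """Summarize how far a score moves across prompt templates."""
    values = list(scores_by_template.values())
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


def judge_both_orders(
    judge: Callable[[str, str], str], answer_a: str, answer_b: str
) -> str:
    """Ask the judge twice with the answers swapped.

    Assumption: judge(first, second) returns "first" or "second".
    Returns "A" or "B" only when both orderings agree, otherwise "tie".
    """
    a_wins_forward = judge(answer_a, answer_b) == "first"    # A shown first
    a_wins_backward = judge(answer_b, answer_a) == "second"  # B shown first
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # verdict depended on position, so it is not trusted
```

In this sketch a judge that always picks whichever answer is shown first yields "tie" for every pair, which makes position bias visible instead of silently crediting one answer. A large "range" from template_spread signals that a single-template score should not be reported on its own.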
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">