Editing AI Evaluation and Benchmarking (section)

== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Benchmark''' — A standardized test or dataset used to compare AI systems on specific capabilities. Examples: MMLU, HumanEval, GLUE, ImageNet.
* '''Metric''' — A quantitative measure of performance. Examples: accuracy, F1 score, BLEU, pass@k, perplexity.
* '''Accuracy''' — The fraction of test examples classified correctly. Simple but can be misleading with imbalanced classes.
* '''F1 score''' — The harmonic mean of precision and recall; better than accuracy for imbalanced datasets.
* '''Precision''' — Of all items predicted positive, what fraction were actually positive?
* '''Recall''' — Of all actual positive items, what fraction were predicted positive?
* '''Leaderboard''' — A ranking of AI systems by performance on a shared benchmark.
* '''Held-out test set''' — Data that is kept separate from training and validation, used only for final evaluation to prevent overfitting.
* '''Contamination''' — When test data appears in a model's training data, inflating benchmark scores.
* '''Capability evaluation''' — Measuring what a model can do (intelligence, reasoning, knowledge).
* '''Safety evaluation''' — Measuring a model's tendency to produce harmful, biased, or misaligned outputs.
* '''Human evaluation''' — Using human raters to assess model outputs, the gold standard for subjective tasks.
* '''LLM-as-judge''' — Using a capable LLM (e.g., GPT-4) to evaluate the outputs of another model at scale.
* '''MMLU''' — Massive Multitask Language Understanding; a benchmark testing knowledge across 57 academic subjects.
* '''HumanEval''' — A benchmark for code generation; tests whether generated code passes unit tests.
* '''MT-Bench''' — A benchmark for multi-turn conversation and instruction following.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">