
== <span style="color: #FFFFFF;">Evaluating</span> ==
Experts evaluate RL systems along dimensions that casual practitioners often overlook:

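'''Sample efficiency vs. wall-clock time''': Sample efficiency is the number of environment interactions required to reach a target performance level; wall-clock time is how long training actually takes. The two can diverge: a method that converges in 1M steps may be preferred over one that converges in 500k if the latter needs so much more compute per step that its total training time or cost ends up higher.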
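
The trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper <code>wall_clock_hours</code> and both per-step costs are assumptions chosen for the example, not measurements of any real method.

```python
def wall_clock_hours(env_steps: int, seconds_per_step: float) -> float:
    """Total training time in hours for a given step count and per-step cost."""
    return env_steps * seconds_per_step / 3600.0


# Hypothetical per-step costs, chosen only to illustrate the trade-off.
# Method A: less sample-efficient (1M steps) but cheap per step.
method_a_hours = wall_clock_hours(env_steps=1_000_000, seconds_per_step=0.002)
# Method B: more sample-efficient (500k steps) but 5x more expensive per step.
method_b_hours = wall_clock_hours(env_steps=500_000, seconds_per_step=0.010)

print(f"Method A: {method_a_hours:.2f} h, Method B: {method_b_hours:.2f} h")
```

Under these assumed costs, the method that needs twice as many environment steps still finishes sooner, which is why both numbers should be reported together.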

'''Stability and reproducibility''': RL training is notoriously sensitive to random seeds, hyperparameters, and implementation details. Expert-level evaluation runs multiple seeds and reports mean ± standard deviation, not just the best run.
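
A minimal sketch of multi-seed reporting, using only the Python standard library. The function name <code>summarize_runs</code> and the per-seed returns are hypothetical placeholders; in practice the values would come from separate training runs.

```python
import statistics


def summarize_runs(final_returns):
    """Return (mean, sample standard deviation) over per-seed final returns."""
    if len(final_returns) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(final_returns), statistics.stdev(final_returns)


# Hypothetical final returns from five seeds; seed 2 is a failed run.
returns_by_seed = {0: 210.0, 1: 195.0, 2: 60.0, 3: 205.0, 4: 200.0}

mean, std = summarize_runs(list(returns_by_seed.values()))
best = max(returns_by_seed.values())

# Reporting only `best` would hide the failed seed; mean and std expose it.
print(f"mean ± std: {mean:.1f} ± {std:.1f} (best run: {best:.1f})")
```

With one failed seed in five, the best run alone would overstate typical performance, whereas the spread makes the instability visible.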

'''Policy interpretability''': For safety-critical applications, can you explain why the agent takes a given action? Experts use visualization, attention maps, or mechanistic analysis to build trust.

'''Transfer and generalization''': Does the policy hold up in environments slightly different from training? Evaluate on held-out environment variants. Domain randomization during training is a key technique for robustness.
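
The sketch below shows one way to separate training-time domain randomization from held-out evaluation. Everything in it is an assumption made for illustration: the toy 1-D dynamics, the <code>friction</code> parameter and its ranges, and the fixed controller <code>pd_policy</code> standing in for a trained policy.

```python
import random


def sample_training_params(rng):
    """Domain randomization: draw a physics parameter for each training episode."""
    return {"friction": rng.uniform(0.1, 0.5)}


# Held-out variants lie outside the training range on purpose.
HELD_OUT_VARIANTS = [{"friction": 0.05}, {"friction": 0.7}]


def run_episode(policy, friction, steps=200):
    """Toy 1-D task: drive position x toward 0. Return is minus accumulated |x|."""
    x, v, total = 1.0, 0.0, 0.0
    for _ in range(steps):
        a = policy(x, v)
        v = (1.0 - friction) * v + 0.1 * a  # friction damps velocity
        x = x + 0.1 * v
        total -= abs(x)
    return total


def pd_policy(x, v):
    """Stand-in for a trained policy: a clipped proportional-derivative controller."""
    return max(-1.0, min(1.0, -2.0 * x - 1.0 * v))


def evaluate(policy, variants):
    """Run one episode per environment variant and collect the returns."""
    return [run_episode(policy, p["friction"]) for p in variants]


rng = random.Random(0)
train_variants = [sample_training_params(rng) for _ in range(5)]

in_dist_returns = evaluate(pd_policy, train_variants)
held_out_returns = evaluate(pd_policy, HELD_OUT_VARIANTS)

print("in-distribution:", [round(r, 2) for r in in_dist_returns])
print("held-out:       ", [round(r, 2) for r in held_out_returns])
```

Comparing the two lists gives a rough picture of the generalization gap; a large drop on the held-out variants suggests the policy has overfit to the training range.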

A common mistake, even among experts, is overlooking '''Goodhart's Law''' in reward design: "When a measure becomes a target, it ceases to be a good measure." An agent optimizes the reward as written, not as intended, so the reward specification must be treated as rigorously as any other design document.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">