A training approach in which a model learns by receiving rewards or penalties for its outputs rather than by studying labeled examples. A key benefit of RL is that, in some settings, it allows synthetic data or self-play to stand in for human-labeled data. RL is the foundation of RLHF and is used in game-playing AI and robotics; understanding RL helps explain how models learn to follow instructions.
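A minimal sketch of the idea, using a two-armed bandit: the agent receives only a scalar reward after each action, never a labeled "correct answer," yet its value estimates converge toward the better action. The reward probabilities and the epsilon-greedy strategy here are illustrative choices, not part of any particular system.

```python
import random

random.seed(0)

REWARD_PROB = {"a": 0.2, "b": 0.8}  # hidden environment: action "b" pays off more often
values = {"a": 0.0, "b": 0.0}       # agent's running value estimates
counts = {"a": 0, "b": 0}

for step in range(2000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < 0.1:
        action = random.choice(["a", "b"])
    else:
        action = max(values, key=values.get)
    # The only feedback is a reward signal, not a labeled example.
    reward = 1.0 if random.random() < REWARD_PROB[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    values[action] += (reward - values[action]) / counts[action]

print(max(values, key=values.get))
```

After enough trials the agent's estimates separate and it settles on the higher-reward action, despite never being told which action was correct.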
See: Reward function; Training