Training a model with reinforcement learning against a reward signal learned from human preference data, so that its behavior better aligns with human values and instructions. RLHF (reinforcement learning from human feedback) is the dominant technique for aligning LLMs and is key to understanding how models are trained to follow instructions and refuse harmful requests.
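A minimal sketch of the standard two-stage formulation (the notation here is illustrative, not drawn from this glossary): first, a reward model $r_\phi$ is fit to pairwise human preferences, where $y_w$ is the preferred and $y_l$ the dispreferred response to prompt $x$, using the Bradley-Terry loss
$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right];$$
second, the policy $\pi_\theta$ is optimized with reinforcement learning (commonly PPO) to maximize that reward while a KL penalty keeps it close to a reference model $\pi_{\mathrm{ref}}$:
$$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).$$
The KL term prevents the policy from drifting into degenerate outputs that exploit the learned reward.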
See: Alignment; Post-training; Reward model