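A reinforcement learning technique for training reasoning models: several responses are sampled for each prompt, and each response is rewarded according to how it scores relative to the others in its group rather than on an absolute scale. Introduced in late 2024 and used in many frontier models.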
See: Direct Preference Optimization (training); Reasoning model
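The "relative performance within groups" idea can be illustrated with a small sketch. It assumes one common form of the calculation, in which each response's reward is normalized by the mean and standard deviation of its group; the function name `group_relative_advantages` and the `eps` parameter are hypothetical, chosen for illustration, and actual training recipes may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    `eps` avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive value and the rest a negative one, so the training signal depends only on comparisons within the group. If every response in a group earns the same reward, all values are zero and that group contributes no signal.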