From RLHF to Direct Preference Learning
It’s well known that state-of-the-art LLMs are trained with massive amounts of human feedback on response quality. This feedback comes either from a large pool of raters or from end users, implicitly or sometimes explicitly (remember when ChatGPT presents you with two responses to choose from?). However, I found that there are many subtleties to work through before one truly understands what’s going on under the hood. This blog post captures my understanding of the RLHF algorithm and how it evolves into methods that need no explicit reward model, such as DPO and IPO...