Chatbot Arena in a nutshell

Disclaimer: All of the following content is public information.

Academic Evals are Cracked

It all starts with a question: how do you evaluate the quality of something as strong as GPT-4 (or Gemini)? Before the era of large language models, researchers spent a great deal of time constructing evaluation benchmarks to measure a model's capability progress. I'd argue that good benchmarks are what drive progress in the field of NLP, and claiming the lead on a benchmark usually comes with fame and fortune, driving researchers and companies to compete with each other to create better models....

August 7, 2024 · 11 min · Weilun Chen