DIY LLM Evaluation: A Case Study of Rhyming in ABBA Schema

Originally posted on the blog of Xebia, my employer at the time of writing.

It's becoming common knowledge: You should not choose your LLMs based on static benchmarks.

As Andrej Karpathy, a founding member of OpenAI, once said on Twitter: "I pretty much only trust two LLM evals right now: Chatbot Arena and the r/LocalLlama comments section". Chatbot Arena is a website where you can submit a prompt, see the responses of two models, and then choose the better one. All votes are then aggregated into a ranking of the models. On the r/LocalLlama subreddit, people discuss finetuning LLMs for custom use cases.

The lesson is: only trust people evaluating LLMs on the tasks they themselves care about.

But there's something better: evaluate LLMs yourself on tasks you care about! Not only do you get the most relevant scoring metrics for your task, but in the process you will also learn a whole lot more about the problem you're actually trying to solve.

In this blog post, I will share my journey into evaluating LLMs on a ridiculous task that I've been obsessed with for almost a year: rhyming in ABBA schema. For some reason, most LLMs can't create a 4-line poem where the first line rhymes with the last and the second rhymes with the third.
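To make the task concrete, here is a minimal sketch of what an automated ABBA check could look like. It is not the evaluation used in this post: the function names are made up for illustration, and the rhyme test is a crude spelling-based assumption (two different words sharing their last two letters). A serious check would compare pronunciations, for example with a pronunciation dictionary such as CMUdict.

```python
import string


def last_word(line: str) -> str:
    """Return the final word of a line, lowercased and without punctuation."""
    words = line.strip().split()
    if not words:
        return ""
    return words[-1].strip(string.punctuation).lower()


def naive_rhyme(a: str, b: str, suffix_len: int = 2) -> bool:
    """Assumption: two *different* words rhyme if their last letters match.

    This is a spelling heuristic, not real phonetics: it misses pairs like
    "blue"/"shoe" and wrongly accepts pairs like "though"/"cough".
    """
    if len(a) < suffix_len or len(b) < suffix_len or a == b:
        return False
    return a[-suffix_len:] == b[-suffix_len:]


def is_abba(poem: str) -> bool:
    """Check that line 1 rhymes with line 4 and line 2 rhymes with line 3."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    if len(lines) != 4:
        return False
    w = [last_word(ln) for ln in lines]
    return naive_rhyme(w[0], w[3]) and naive_rhyme(w[1], w[2])


# Hypothetical example poems, written for this sketch.
ABBA_POEM = """The sun went down behind the hill,
A bird sang softly in the night,
The moon arrived with silver light,
And all the world was calm and still."""

AABB_POEM = """The sun went down behind the hill,
And all the world was calm and still,
A bird sang softly in the night,
The moon arrived with silver light."""
```

Calling `is_abba(ABBA_POEM)` accepts the first poem, while `is_abba(AABB_POEM)` rejects the second, because "hill" and "light" do not share an ending. Even a toy checker like this shows where the difficulty lies: deciding what counts as a rhyme is the hard part of the evaluation.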

Curious to know why this is the case? In the rest of this blog post, I will share with you:

Why rhyming in ABBA schema is an interesting task
What the results of my analysis were
What lessons I learned from going through this exercise


Rens' Blog
