Ever used a top-ranked LLM that just... felt wrong for you?
You’re not alone. Instead of relying on leaderboards, many of us turn to "vibe-testing": manually comparing models against our own needs. But can we turn these feelings into a structured evaluation?
New paper: "From Feelings to Metrics" 🧵