In our evaluation, we find that people agree with labels from off-the-shelf LLMs less often than they agree with a randomly chosen other person! However, fine-tuning and then applying our personalization method yields a 66% relative improvement in agreement over the human-human agreement rate, leading to state-of-the-art performance.
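
To make the "relative improvement" figure concrete, here is a minimal sketch of how such a number is commonly computed, assuming it means the fractional gain of the model-human agreement rate over the human-human agreement rate. The function name and the two agreement rates are hypothetical, chosen only so the arithmetic is easy to follow; they are not figures reported in this work.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional change of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical agreement rates (illustrative only, not results from the text).
human_human = 0.50    # assumed rate at which two random people agree
personalized = 0.83   # assumed rate at which a person agrees with the model

print(f"{relative_improvement(human_human, personalized):.0%}")  # prints "66%"
```

Under this reading, a relative improvement is distinct from an absolute one: in the illustrative numbers above, the absolute gain is 33 percentage points, while the relative gain is 66%.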