How does GENIE compare to existing novelty metrics? We tested this using minimal pairs of creative writing responses differing by one feature (e.g. plot).
Key findings:
🧱 Many holistic metrics are not paraphrase-robust
🎯 GENIE is sensitive to interventions & paraphrase-robust