Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features.
Can embedding models capture this? We study this in the context of fanfiction!
I’ll be presenting this work in **2 hours** at EMNLP’s Gather Session 3. Come by to chat about fanfiction, literary notions of similarity, long-context modeling, and consent-focused data collection!
We introduce FicSim, a dataset of 90 recently written long-form fanfics from Archive of Our Own. We *reach out to the authors for permission* to use each work and prioritize continual, informed author consent. Fics range in length from 10K to 400K+ words.
Natasha Johnson
Looking back and forth between Barthes, Sedgwick, and Hirsch trying to interpret a Star Trek scene when I'm 90% sure the explanation is just "the actor had a crush on his costar"
All selected fanfiction has detailed metadata and author-generated tags describing the fanfic content. Informed by fan studies and digital humanities literature, we classify these into 12 categories to construct gold labels for a fine-grained semantic similarity task.
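To make the idea of per-category gold labels concrete, here is a minimal sketch that scores two fics by tag overlap within each category. Everything in it is an assumption for illustration: the category names, the example tags, and the use of Jaccard overlap are hypothetical and may not match how FicSim actually builds its labels.

```python
def jaccard(a, b):
    """Jaccard overlap between two tag collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def gold_similarity(tags_x, tags_y):
    """Per-category overlap between two fics' tags.

    Each argument maps a category name to a list of tags. The categories
    here are placeholders, not the 12 categories used in the dataset.
    """
    categories = sorted(set(tags_x) | set(tags_y))
    return {c: jaccard(tags_x.get(c, []), tags_y.get(c, [])) for c in categories}


# Hypothetical tags for two fics, grouped by hypothetical categories.
fic_a = {"tone": ["angst", "hurt/comfort"], "theme": ["found family"]}
fic_b = {"tone": ["angst", "fluff"], "theme": ["found family"]}

scores = gold_similarity(fic_a, fic_b)
```

Under this sketch, two fics can be close on one category (here, theme) and further apart on another (tone), which is the kind of fine-grained distinction the task is after.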
Even strong embedding models over-index on surface features: for every model tested, similarity scores are more reflective of author or fandom than of semantic aspects like theme or characterization. This holds even when models are explicitly instructed to focus on these aspects!
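One rough way to check for this kind of over-indexing is to compare how much higher embedding similarity is for pairs that share a label than for pairs that don't, once for a surface attribute and once for a semantic one. The sketch below is illustrative only: the vectors and labels are toy values, `similarity_gap` is a hypothetical helper, and this is not the evaluation used in the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_gap(embeddings, labels):
    """Mean similarity of same-label pairs minus mean of different-label pairs."""
    same, diff = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (same if labels[i] == labels[j] else diff).append(sim)
    if not same or not diff:
        raise ValueError("need at least one same-label and one different-label pair")
    return sum(same) / len(same) - sum(diff) / len(diff)


# Toy 2-D "embeddings" for four fics (made up, not model outputs).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
authors = ["a", "a", "b", "b"]  # surface attribute
themes = ["x", "y", "x", "y"]   # semantic attribute

author_gap = similarity_gap(embeddings, authors)
theme_gap = similarity_gap(embeddings, themes)
```

In this toy setup the author gap is larger than the theme gap, meaning the vectors group fics by author rather than by theme. A model that captured the semantic aspect would show the opposite pattern.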
Unsurprising: Using longer words makes female authors more “literary”
Surprising: The opposite is true for male authors
For more cool plots + findings, take a look at my #CHR2025 paper exploring the role of form vs gender in the classification of genre & literary fiction
doi.org/10.63744/Ztw...
This was joint work with @abertsch.bsky.social, Maria-Emil Deal, and @strubell.bsky.social
Paper: arxiv.org/abs/2510.20926
Dataset: huggingface.co/datasets/fic...