Digital humanities researchers often care about fine-grained similarity based on narrative elements like plot or tone, which don’t necessarily correlate with surface-level textual features. Can embedding models capture this kind of similarity? We study the question in the context of fanfiction!

I’ll be presenting this work in **2 hours** at EMNLP’s Gather Session 3. Come by to chat about fanfiction, literary notions of similarity, long-context modeling, and consent-focused data collection!

7mo

We introduce FicSim, a dataset of 90 recently written long-form fanfics from Archive of Our Own. We *reach out to the authors for permission* to use each work and prioritize continual, informed author consent. Fics range in length from 10K to 400K+ words.

Natasha Johnson

Looking back and forth between Barthes, Sedgwick, and Hirsch trying to interpret a Star Trek scene when I'm 90% sure the explanation is just "the actor had a crush on his costar"

7mo

All selected fanfiction has detailed metadata and author-generated tags describing the fanfic content. Informed by fan studies and digital humanities literature, we classify these into 12 categories to construct gold labels for a fine-grained semantic similarity task.

Even strong embedding models over-index on surface features—for every model tested, similarity scores are more reflective of author or fandom than semantic aspects like theme or characterization. This is true even if models are explicitly instructed to focus on these aspects!
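To make the comparison above concrete, here is a minimal sketch of one way to check whether similarity scores track a surface label (fandom) more than a semantic one (theme). It is not the paper's evaluation: the vectors are toy values, the labels are made up, and `similarity_gap` is a hypothetical helper introduced only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_gap(embeddings, labels):
    """Mean similarity of same-label pairs minus mean similarity of
    different-label pairs. A larger gap means the embedding space
    separates this label more strongly.

    Hypothetical helper: assumes at least one same-label pair and at
    least one different-label pair exist.
    """
    same, diff = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            same.append(score)
        else:
            diff.append(score)
    return sum(same) / len(same) - sum(diff) / len(diff)


# Toy data (assumption): four works, two fandoms, two themes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fandom = ["A", "A", "B", "B"]  # surface-level label
theme = ["x", "y", "x", "y"]   # semantic label

fandom_gap = similarity_gap(embeddings, fandom)
theme_gap = similarity_gap(embeddings, theme)
print(f"fandom gap: {fandom_gap:.3f}, theme gap: {theme_gap:.3f}")
```

In this toy setup the fandom gap comes out larger than the theme gap, which is the pattern the post describes: works that share a fandom sit closer together than works that share a theme.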

3mo

Unsurprising: Using longer words makes female authors more “literary”. Surprising: The opposite is true for male authors. For more cool plots + findings, take a look at my #CHR2025 paper exploring the role of form vs gender in the classification of genre & literary fiction: doi.org/10.63744/Ztw...

7mo

This was joint work with @abertsch.bsky.social, Maria-Emil Deal, and @strubell.bsky.social.
Paper: arxiv.org/abs/2510.20926
Dataset: huggingface.co/datasets/fic...
