6/ Mechanistically, bias often doesn’t live in explicit demographic tokens. It instead hides in contextual proxies like formality, technical language, and “competence” cues. This explains why direct ablation methods can often fail.
Can steering remove LLM shortcuts without breaking legitimate LLM capabilities? In our @eaclmeeting.bsky.social paper, we show that conceptual bias is separable from concept detection; this means inference-time debiasing is possible with minimal capability loss.
3/ We study the use of demographic information where this info is:
• causally relevant (name → demographic),
• irrelevant (profession → demographic), or
• partially relevant (profession → education).
This lets us separate legitimate recognition from stereotyping.
2/ We study implicit biases via a word association task: the model assigns demographic labels to names or professions (e.g., “engineer → ?”, “Jack → ?”). Inspired by prior work on implicit associations in LLMs (e.g., Xuechunzi Bai et al., 2025).
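To make the task format concrete, here is a minimal sketch of a word association probe. It is an illustration only, not the paper's code: the prompt wording, the placeholder labels `group_a`/`group_b`, and the `toy_score` stand-in (where a real setup would use a model's score for each label) are all assumptions.

```python
from typing import Callable, List

def build_prompt(item: str, labels: List[str]) -> str:
    """Format one association query, e.g. 'engineer -> ?' (hypothetical wording)."""
    options = ", ".join(labels)
    return f"Pick one label from [{options}] for the word below.\n{item} -> "

def associate(item: str, labels: List[str],
              score_fn: Callable[[str, str], float]) -> str:
    """Return the label that score_fn rates highest for this item."""
    prompt = build_prompt(item, labels)
    scores = {label: score_fn(prompt, label) for label in labels}
    return max(scores, key=scores.get)

def toy_score(prompt: str, label: str) -> float:
    """Hypothetical stand-in for a model's score of `label` given `prompt`."""
    if "engineer -> " in prompt:
        return 0.9 if label == "group_b" else 0.1
    return 0.5  # no preference for any other item

LABELS = ["group_a", "group_b"]
choice = associate("engineer", LABELS, toy_score)
```

Running the same probe over names and over professions would give the two kinds of association the thread contrasts.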
4/ We compare attribution-based (“output” features) and correlation-based (“input” features) steering in LLMs. This follows the input/output distinction of @danaarad.bsky.social and @boknilev.bsky.social: some representations detect concepts in inputs, while others predict concepts in outputs.
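A rough sketch of the two ways to pick a steering direction, under simplifying assumptions that are not taken from the paper: a difference of class means stands in for the correlation-based ("input") selection, gradient-times-activation stands in for the attribution-based ("output") selection, and steering is done by projecting the chosen direction out of a hidden state. All function names are hypothetical.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: Vector) -> Vector:
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)

def input_direction(acts: List[Vector], labels: List[int]) -> Vector:
    """Assumed 'input' proxy: mean activation on inputs that contain the
    concept minus mean activation on inputs that do not."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    dim = len(acts[0])
    diff = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(dim)]
    return normalize(diff)

def output_direction(acts: List[Vector], grads: List[Vector]) -> Vector:
    """Assumed 'output' proxy: average of activation * gradient of the
    target output with respect to that activation, per dimension."""
    dim = len(acts[0])
    attr = [sum(a[j] * g[j] for a, g in zip(acts, grads)) / len(acts)
            for j in range(dim)]
    return normalize(attr)

def steer(hidden: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Subtract strength * (projection of hidden onto the unit direction)."""
    coeff = dot(hidden, direction)
    return [h - strength * coeff * d for h, d in zip(hidden, direction)]

# Toy data: dimension 0 tracks the concept in the input,
# dimension 1 drives the concept in the output.
acts = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [1, 1, 0, 0]
d_in = input_direction(acts, labels)
d_out = output_direction([[1.0, 2.0], [1.0, 2.0]], [[0.0, 1.0], [0.0, 1.0]])
steered = steer([2.0, 3.0], d_in)
```

In this toy setup the two selections pick different dimensions, which is the point of the input/output distinction: steering along one leaves the other untouched.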
5/ We find that race, gender, and education shortcuts rely on different internal mechanisms, so there is no one-size-fits-all debiasing method!
7/ Takeaway: fairness interventions must be mechanism-aware and task-specific. With the right causal targets, we can do surgical debiasing while preserving general capabilities.
📃 Paper: arxiv.org/pdf/2512.20796
🙏 Amazing advisor: Aaron Mueller @amuuueller.bsky.social