Profile
PhD @ UT Linguistics Semantics/Pragmatics/NLP https://asherz720.github.io/ Prev. @UoEdinburgh @Hanyang
Asher Zheng
Try it yourself 👇 ▎🎮 Step through quests & explore the leaderboard: asherz720.github.io/HerosJourney ▎📄 arXiv: arxiv.org/abs/2606.02556 ▎💻 pip install herosjourney ▎ w/ @kanishka.bsky.social @jessyjli.bsky.social @David Beaver
Spotting the rule from past experience is one thing; acting on it correctly is another. To find out whether LLMs can do both, we introduce HERO'S JOURNEY🦸‍♀️, which tests LLMs' inductive reasoning in multi-step setups. We found that models show signs of rule induction, but only scratch the surface.😮
We explore how strategic effectiveness can be quantified by a set of discourse properties, and we evaluate a suite of LLMs on how well they understand such strategic dialogues in adversarial settings.
Do they actually induce? There's a correlation: models that can state the rule tend to solve the task, and success climbs only once the past trajectories are sufficient to pin the rule down. But it's uneven across rule types, and some wins look more like copying seen answers than genuine reasoning.
Does acting it out cost extra? Clearly. Even when a model names the rule correctly, that often doesn't carry over to executing it step by step in the world: knowing ≠ doing, and that execution gap is the real bottleneck.
HERO'S JOURNEY lets you inject the rules you're interested in and how they interact, then builds tasks around them. By default: 4 rule-interaction types × 2 families, keyed to a foe's attributes. 1️⃣ Attribute induction: which item to buy. 2️⃣ Procedural induction: which action, in what order.
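To make "enough past trajectories to pin the rule down" concrete, here is a toy sketch of attribute induction. It is not the herosjourney API: the elements, items, and function names below are all hypothetical, made up for illustration. It lists every candidate rule mapping a foe's attribute to an item and keeps only the ones that agree with the quests seen so far.

```python
from itertools import product

# Hypothetical toy world; none of these names come from the herosjourney package.
ELEMENTS = ["fire", "ice"]          # a foe attribute
ITEMS = ["water_flask", "torch"]    # items the hero could buy


def candidate_rules():
    """Every possible mapping from a foe's element to the item to buy."""
    return [
        dict(zip(ELEMENTS, choice))
        for choice in product(ITEMS, repeat=len(ELEMENTS))
    ]


def consistent_rules(trajectories):
    """Keep only the rules that agree with every observed (element, item) pair."""
    return [
        rule
        for rule in candidate_rules()
        if all(rule[element] == item for element, item in trajectories)
    ]


# One past quest leaves the rule for "ice" foes undetermined.
few = [("fire", "water_flask")]
# A second quest covering the other element pins the rule down.
enough = few + [("ice", "torch")]

print(len(consistent_rules(few)))     # 2
print(len(consistent_rules(enough)))  # 1
```

With a single past quest, two rules still fit, so a correct answer on an "ice" foe could be a lucky guess. Only the second quest leaves exactly one rule standing.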