Profile
crumb | she / xe / it / fae | https://hf.co/crumb | modelling cognitive systems @ my house
(prefix)
<|im_start|>generator
<think> (model deliberates about completion) </think>
(model completes text)

(full text)
<|im_start|>discriminator
<think> (model deliberates about score) </think>
Score: (float between 0 & 1)
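A minimal sketch of how that single-model generator/discriminator format might be assembled and parsed. The post only specifies the token layout, so every name here (build_generator_prompt, parse_score, and so on) is hypothetical, not crumb's actual code:

```python
import re

GEN_TAG = "<|im_start|>generator"
DISC_TAG = "<|im_start|>discriminator"

def build_generator_prompt(prefix: str) -> str:
    # The model is expected to emit <think>...</think>
    # followed by its completion of the prefix.
    return f"{prefix}{GEN_TAG} "

def build_discriminator_prompt(full_text: str) -> str:
    # The same model, prompted as a discriminator, deliberates
    # in <think>...</think> and then emits "Score: <float>".
    return f"{full_text}{DISC_TAG} "

def parse_score(discriminator_output: str) -> float | None:
    # Pull the float after "Score:" and clamp to [0, 1],
    # since the format promises a score in that range.
    m = re.search(r"Score:\s*([0-9]*\.?[0-9]+)", discriminator_output)
    if m is None:
        return None
    return min(max(float(m.group(1)), 0.0), 1.0)
```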
always thought those papers where one task's reward depends on another task's reward were super messy and gross when you have to do group-mean, because it literally multiplies task 2's compute when you don't really have to, and learned value functions are annoying and lag behind
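Toy accounting for that complaint (my numbers, not from the post): when task 1's reward comes from task 2's output, a group-mean baseline needs a whole group of task-2 rollouts to score each task-1 sample, so task 2's compute gets multiplied by the group size:

```python
G1 = 8  # task-1 (generator) samples per prompt
G2 = 8  # task-2 (discriminator) rollouts to group-mean-score one sample

group_mean_cost = G1 * G2  # 64 task-2 calls per prompt
single_score_cost = G1     # 8 task-2 calls if one score per sample suffices

print(f"group-mean baseline: {group_mean_cost} task-2 calls")
print(f"one score + learned baseline: {single_score_cost} task-2 calls")
```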
this (A) oscillated wildly and the divergence was freaking out at me (policy collapse), but i switched to the value function i talked about yesterday and it (B) is just kinda nice, just a little noisy, mostly just stable learning 😇 still have to do more tests & write it (the vf) up
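The post doesn't describe what that value function actually looks like, so this is just the textbook learned-baseline pattern it would replace the group mean with: fit V(prefix) to observed scores, then use A = score - V(prefix), so one scored rollout per prompt is enough:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: last-layer state at the final prompt token
        return self.proj(hidden).squeeze(-1)

def advantage(score: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # Baseline-subtracted reward; no sibling group of rollouts needed.
    return score - value.detach()

def value_loss(value: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    # Regress the value head toward the discriminator scores it lags behind.
    return nn.functional.mse_loss(value, score)
```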
it's so exciting to watch this train and see it get better and better at reasoning about <everything>
this is a perfect baseline for self-play setups i think