uri: at://crumb.bsky.social/app.bsky.actor.profile/self
name: crumb | she / xe / it / fae | https://hf.co/crumb | modelling cognitive systems @ my house
restarted again to go from 0.5b to 2b. this is an example of the format im using, these are outputs from step 64 (out of 2k). it was trained on truncated reasoning chains so it's gonna have to unlearn that, but otherwise this is what I'm working with
(prefix) <|im_start|>generator <think> (model deliberates about completion) </think> (model completes text)
(full text) <|im_start|>discriminator <think> (model deliberates about score) </think> Score: (float between 0 & 1)
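a rough sketch of how the format above could be assembled and parsed, not the actual training code. the helper names (`generator_prompt`, `discriminator_prompt`, `parse_score`) are made up for illustration, and the newline separators between pieces are an assumption, since the post only shows the order of the pieces.

```python
import re

# Role tags copied from the format description; everything else here is hypothetical.
GEN_TAG = "<|im_start|>generator"
DISC_TAG = "<|im_start|>discriminator"

# Matches "Score: <number>" after the closing think tag.
_SCORE_RE = re.compile(r"</think>\s*Score:\s*([0-9]*\.?[0-9]+)")


def generator_prompt(prefix: str) -> str:
    """Prompt that asks the model to deliberate, then complete the prefix."""
    # Assumption: pieces are newline-separated; the post does not specify this.
    return f"{prefix}\n{GEN_TAG}\n<think>"


def discriminator_prompt(full_text: str) -> str:
    """Prompt that asks the model to deliberate, then score the full text."""
    return f"{full_text}\n{DISC_TAG}\n<think>"


def parse_score(output: str):
    """Return the score as a float in [0, 1], or None if missing or out of range."""
    match = _SCORE_RE.search(output)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 1.0 else None
```

returning `None` for a malformed or out-of-range score (instead of clamping) lets the caller decide how to penalize a rollout that broke the format.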
its so exciting to watch this train and see it get better and better at reasoning about <everything>
this is a perfect baseline for self-play setups i think
always thought those papers where one task's reward depends on another task's reward were super messy and gross when you have to do group-mean because it's literally multiplying task 2's compute when you don't really have to, and learned value functions are annoying and lag behind
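for reference, a minimal sketch of a group-mean baseline and one reading of the compute complaint. this is an assumption about what is meant, not the setup from the post: `group_mean_advantages` and `dependent_rollouts` are hypothetical names, and the cost model assumes every task-1 sample needs its own full group of task-2 rollouts.

```python
def group_mean_advantages(rewards):
    """Advantage of each sample = its reward minus the mean reward of its group."""
    if not rewards:
        raise ValueError("need at least one reward in the group")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def dependent_rollouts(task1_group_size: int, task2_group_size: int) -> int:
    """Task-2 rollouts needed if each task-1 sample requires its own task-2 group.

    Assumed cost model: the groups nest, so the counts multiply.
    """
    return task1_group_size * task2_group_size


# Example: 8 task-1 samples, each scored with a group of 8 task-2 rollouts.
advantages = group_mean_advantages([1.0, 0.0, 0.5, 0.5])
total_task2_rollouts = dependent_rollouts(8, 8)
```

under that assumed cost model the task-2 cost grows with the product of the two group sizes, which is the multiplication a per-sample value estimate would avoid.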
this (A) oscillated wildly and the divergence was freaking out at me, policy collapse, but i switched to the value function i talked about yesterday and it (B) is just kinda nice, just a little noisy, mostly just stable learning 😇 still have to do more tests & write it (the vf) up
llama2-34B actually defeated the red team and has been running meta ai for the past 3 years
do you think they've had time to sufficiently redteam Llama2-34B yet because im still waiting