name: crumb | she / xe / it / fae | https://hf.co/crumb | modelling cognitive systems @ my house
(prefix)
<|im_start|>generator
<think>
(model deliberates about completion)
</think>
(model completes text)
(full text)
<|im_start|>discriminator
<think>
(model deliberates about score)
</think>
Score: (float between 0 & 1)
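a minimal sketch of how the two-role template above could be assembled and how the discriminator's score line could be read back. the helper names (`generator_prompt`, `discriminator_prompt`, `parse_score`) are hypothetical and not from the original setup, and the exact whitespace between segments is an assumption; only the role tags, the `<think>` block, and the `Score:` line come from the template.

```python
import re

IM_START = "<|im_start|>"  # role marker taken from the template above

def generator_prompt(prefix: str) -> str:
    # Assumed layout: prefix, then the generator turn opened inside <think>.
    return f"{prefix}\n{IM_START}generator\n<think>\n"

def discriminator_prompt(full_text: str) -> str:
    # Assumed layout: the full text, then the discriminator turn.
    return f"{full_text}\n{IM_START}discriminator\n<think>\n"

_SCORE_RE = re.compile(r"Score:\s*([0-9]*\.?[0-9]+)")

def parse_score(output: str):
    """Return the score in [0, 1] found after the last </think>, else None."""
    # Only look after the deliberation so a "Score:" inside <think> is ignored.
    tail = output.rsplit("</think>", 1)[-1]
    match = _SCORE_RE.search(tail)
    if match is None:
        return None
    value = float(match.group(1))
    # Out-of-range values are treated as malformed rather than clipped.
    return value if 0.0 <= value <= 1.0 else None
```

returning `None` for a missing or out-of-range score (instead of clipping) is a design choice here, so a malformed output can be handled explicitly by whatever computes the reward.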
always thought those papers where one task's reward depends on another task's reward were super messy and gross when you have to do group-mean, because it's literally multiplying task 2's compute when you don't really have to, and learned value functions are annoying and lag behind
this (A) oscillated wildly and the divergence was freaking out at me (policy collapse), but i switched to the value function i talked about yesterday and it (B) is just kinda nice: a little noisy, but mostly just stable learning 😇
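a rough sketch of the group-mean baseline being complained about, plus one reading of the compute point: if every task-1 sample needs its own group of task-2 samples to get a baseline, the task-2 call count is the product of the two group sizes. the function names and the group sizes are assumptions for illustration, and this is the standard group-mean baseline, not the value function mentioned in this thread.

```python
def group_mean_advantages(rewards):
    """Advantage of each sample = its reward minus the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def task2_calls(task1_group_size: int, task2_group_size: int) -> int:
    # Hypothetical accounting: one task-2 group per task-1 sample.
    return task1_group_size * task2_group_size

# Example: three samples for the same prompt, scored 1.0, 0.0 and 0.5.
advantages = group_mean_advantages([1.0, 0.0, 0.5])
```

with, say, 8 completions per prefix and 8 discriminator samples per completion, `task2_calls(8, 8)` is 64 discriminator rollouts per prefix, which is the multiplication a single shared baseline would avoid.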
still have to do more tests & write it (the vf) up
it's so exciting to watch this train and see it get better and better at reasoning about <everything>
this is a perfect baseline for self-play setups i think