We propose a protocol that combines evaluation datasets with persona dialogue prefixes to measure the effect of conversation length on model behavior.
We then use it to measure the impact of length on:
š Persona Fidelity
ā Instruction Following
š Safety