Well, sometimes it's just dumb stuff, like finally deciding to disassemble the ROM, looking at the assembly, and realizing that the compiler had inlined a function into code outside IWRAM even though it was explicitly declared to live in IWRAM.
At least __attribute__((noinline)) came to the rescue and cut the cycle count drastically.
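
For anyone hitting the same thing, here is a rough sketch of the pattern, not my actual code. It assumes a GCC-based toolchain and a linker script that maps the ".iwram" input section into IWRAM (devkitARM's GBA setup does this, as far as I know). The names IWRAM_CODE_NOINLINE, sum_buffer and frame_checksum are made up for the example.

```c
#include <stdint.h>

/* Assumption: the linker script places ".iwram" sections in IWRAM.
   On non-ARM hosts the section/long_call attributes are dropped so the
   sketch still compiles; only noinline is kept there. */
#if defined(__arm__)
#define IWRAM_CODE_NOINLINE \
    __attribute__((section(".iwram"), long_call, noinline))
#else
#define IWRAM_CODE_NOINLINE __attribute__((noinline))
#endif

/* Hypothetical hot routine that is supposed to run from IWRAM.
   Without noinline, the optimizer may copy its body into the caller,
   and the copy then executes from wherever the caller lives (ROM). */
IWRAM_CODE_NOINLINE
uint32_t sum_buffer(const uint16_t *src, uint32_t count)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < count; i++) {
        total += src[i];
    }
    return total;
}

/* Hypothetical caller located in ROM. With noinline in place this
   compiles to a real call into the IWRAM copy of sum_buffer. */
uint32_t frame_checksum(const uint16_t *frame, uint32_t count)
{
    return sum_buffer(frame, count);
}
```

The section attribute only decides where the standalone copy of the function ends up; it doesn't stop the compiler from inlining the body somewhere else, which is why noinline is needed on top. Disassembling again afterwards is the easiest way to confirm the caller really branches into IWRAM.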