Until now I've used a simple a=a[a] pattern to test GPU memory latency, but the indexed addressing penalty always bothered me, since every hop pays for address arithmetic on top of the load itself. I finally got around to making the compiler spit out a chain of dependent loads and nothing else.
Good start on AMD: I save ~4 ns on scalar accesses and ~12 ns on vector accesses.
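
For illustration, here is a rough sketch of the two access patterns in plain C. It is not the actual GPU kernel, just the idea on the CPU side. The function names and the stride-based chain layout are assumptions made up for this example. In the indexed version each hop has to turn an index into an address before it can load. In the pointer version the loaded value already is the next address, so the loop body can boil down to one dependent load per hop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill both arrays with the same chain, where element i
   links to element (i + stride) % n. The chain visits all n elements only
   when stride and n are coprime. */
void build_chains(uint32_t *idx, void **ptr, uint32_t n, uint32_t stride) {
    assert(n > 0);
    for (uint32_t i = 0; i < n; i++) {
        uint32_t next = (i + stride) % n;
        idx[i] = next;        /* indexed chain stores the next index */
        ptr[i] = &ptr[next];  /* pointer chain stores the next address */
    }
}

/* Indexed chase (the a=a[a] pattern): each hop computes
   base + cur * sizeof(element) and then loads from that address. */
uint32_t chase_indexed(const uint32_t *arr, uint32_t start, size_t hops) {
    uint32_t cur = start;
    for (size_t i = 0; i < hops; i++) {
        cur = arr[cur];
    }
    return cur;
}

/* Pointer chase: the loaded value is used directly as the next address,
   so no index-to-address conversion sits between dependent loads. */
void **chase_pointers(void **start, size_t hops) {
    void **cur = start;
    for (size_t i = 0; i < hops; i++) {
        cur = (void **)*cur;
    }
    return cur;
}
```

Both functions walk the same chain, so a timing loop around either one divided by the hop count would give a per-load latency. How much the indexed form actually costs depends on the target and what its addressing modes can fold in for free.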