clamtech.org?dest=intel_u... - Sharing some thoughts on Intel's possible unified core project. Basically, I think the easiest route is a Zen 4c/5c-style shrink of their P-Core. But of course, Intel has more options than that.
Intel's desktop Arrow Lake always keeps the SNCU (die-to-die interface and some other parts of the uncore) at 2.6 GHz. On Meteor Lake, it goes up to 2.4 GHz but varies a lot, probably to save power.
So far I've used a simple a=a[a] pattern to test GPU memory latency, but that indexed addressing penalty always bothered me. I finally got around to making the compiler spit out a chain of dependent loads and nothing else.
Good start on AMD. I save ~4 or ~12 ns for scalar and vector accesses, respectively.
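For anyone unfamiliar with the two access patterns, here's a rough CPU-side sketch in plain C. It is not the actual OpenCL kernels, and all the function names are made up for illustration. The indexed version has to turn each loaded value into an address (base + index * element size) before the next load can issue. The pointer version stores the next address directly, so each step is just a dependent load. Timing code is left out, and whether the address math costs anything depends on the hardware and compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill idx[] with a single random cycle over all n slots (Sattolo's
 * algorithm), so a chase visits every element before it repeats. */
void build_index_chain(uint32_t *idx, size_t n, unsigned seed)
{
    assert(n > 1);
    srand(seed);
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j is always below i */
        uint32_t tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
}

/* Same chain, but slot i holds the address of the next slot. */
void build_pointer_chain(void **ptrs, const uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ptrs[i] = (void *)&ptrs[idx[i]];
}

/* a = a[a]: every step computes an address from the loaded index. */
uint32_t chase_indexed(const uint32_t *idx, size_t steps)
{
    uint32_t a = 0;
    for (size_t i = 0; i < steps; i++)
        a = idx[a];
    return a;
}

/* p = *p: the loaded value already is the next address. */
void *chase_pointer(void **ptrs, size_t steps)
{
    void **p = ptrs;
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;
    return (void *)p;
}
```

Both chases walk the same sequence of slots, so any difference in time per step comes from the address generation rather than the memory access pattern. Returning the final value keeps the compiler from throwing the loop away.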
Intel's Arrow Lake is impressively efficient when running throughput-bound stuff on 16 underclocked E-Cores. I'm running all Geekbench 6 workloads in parallel through Intel SDE (so emulated, to get exact instruction counts), and this chip is getting over 3.7G instructions/watt.
Sharing a piece I wrote a while ago on Zen 1. I mostly did this to test my site design, with plenty of pagination, tables, and captioned images. It's a pretty complete article by itself though, and I hope y'all find it a fun read!
clamtech.org?dest=zen1
I wanted to bring in RDNA4 since I have an example of that card now, but never found the time. That's stuck on the back of a long todo list :/
Sharing another piece I wrote last year, comparing hardware AV1 encoding on Intel's Arc B580 and AMD's Hawk Point, at clamtech.org?dest=av1hwenc
Should be a good test of image handling on the site, with sliders for quality comparisons
Time for a little site with some multi-page support! I plan to write random thoughts on hardware there. To start, here's some commentary on drilling down GPU cache latency using very funny OpenCL kernels: clamtech.org?dest=gpudire...
clamtech.org?dest=gpuwrite
Here's a look at GPU cache/memory write bandwidth across a variety of hardware
Looking good on Intel too, improving measured latency by ~6.8 ns, or 19-20 cycles