AVX-512: First Impressions on Performance and Programmability

125 points24 daysshihab-shahriar.github.io

physicsguy • 19 days ago

A few gentle points:

(a) You mention that the NVidia docs push people to use libraries, etc. to really get high performance CUDA kernels, rather than writing them themselves. My argument would be that SIMD is exactly the same - they're something really that are perfect if you're writing a BLAS implementation but are too low level for most developers thinking about a problem to make use of.

(b) You show a problem where autovectorisation fails because of branching, and jump straight to intrinsics as the solution which you basically say are ugly. Looking at the intrinsic code, you're using a mask to deal with the branching. But there's a middle ground - almost always you would want to try restructuring the problem, e.g. splitting up loops and adding masks where there's conditions - i.e. lean into the SIMD paradigm. This would also be the same advice in CUDA.

(c) As you've found, GCC actually performs quite poorly for x86-64 optimisations compared to Intel. It's not always clear cut though, the Intel Compiler for e.g. sacrifices IEEE 764 float precision and go down to ~14 digits of precision in it's defaults, because it sets the flag `-fp-model=fast -fma`. This is true of both the legacy and new Intel compiler. If you switch to `-fp-model=strict` then you may find that the results are closer.

(d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

mort96 • 19 days ago

Regarding (b), I would never rely on auto vectorization because I have no insight into it. The only way to check if my code is written such that auto-vectorization can do its thing is to compile my program with every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect. That's a terrible developer experience; writing intrinsics by hand is much easier, and more robust. I'd need to re-check every piece of autovectorized code after every change and after every compiler upgrade.

I just treat autovectorization like I treat every other fancy optimization that's not just constant folding and inlining: nice when it happens to work out, it probably happens to make my code a couple percent faster on average, but I absolutely can't rely on it in any place where I depend on the performance.

physicsguy • 19 days ago

> every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect

I just used to fire up VTune and inspect the hot loops... typically if you care about this you're only really working on hardware targeting the latest instruction sets anyway in my experience. It's only if you're working on low level libraries I would bother doing intrinsics all over the place.

For most consumer software you want to be able to fall back to some lowest-common-denominator hardware anyway otherwise people using it run into issues - same reason that Debian, Conda, etc. only go up to really old instructions sets.

mort96 • 19 days ago

I work on games sometimes, where the goal is: "run as fast as possible on everyone's computer, whether it's a 15 year old netbook with an Intel Atom or a freshly built beast of a gaming desktop". As a result, the best approach is to discover supported instructions at runtime and dispatch to a function that's using those instructions (maybe populating a global vector function table at launch?). Next best is to assume some base level vector support (maybe the original AVX for x86, Neon for ARM) and unconditionally use those. Targeting only the latest instruction sets is a complete non-starter.

grumbelbart2 • 19 days ago

> (d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

To re-iterate, this is our observation as well. The first AVX512 processors would execute such code quite fast for a short time, then overheat and throttle, leading to a worse wall-time performance than the corresponding AVX256 code.

I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.

adrian_b • 19 days ago

The processors with severe throttling from AVX-512 were server CPUs, i.e. Skylake Server and its derivatives, like Cascade Lake and Cooper Lake.

Only few of those CPUs have been used in workstations, i.e. high-end desktop computers.

The vast majority of the CPUs with AVX-512 that can be encountered at the general population are either AMD Zen 4 and Zen 5 CPUs or some old Intel CPUs from the Tiger Lake, Ice Lake and Rocket Lake families. All these do not have AVX-512 throttling problems.

The owners of server computers are more likely to be knowledgeable about them and choose programs compiled with an appropriate target CPU model.

Therefore I believe that nowadays, when the percentage of computers with good AVX-512 support is increasing, and even Intel is expected to introduce by the end of the year Nova Lake with AVX-512 support, an application should be compiled such that whenever it detects AVX-512 support it should dispatch to the corresponding branch.

On the computers with AVX-512 support, using it can provide a significant increase in performance, while the computers where this could be harmful are more and more unlikely to be encountered outside datacenters that have failed to update their servers.

Skylake Server was introduced 9 years ago and Ice Lake Server, which corrected the behavior, was introduced 6 years ago. Therefore, wherever performance matters, the Skylake Server derivatives would have been replaced by now, as a single Epyc server can replace a cluster of servers with Skylake Server CPUs, at a much lower power consumption and with a higher performance.

jandrewrogers • 19 days ago

There is a large variance in AVX-512 performance across the microarchitectures, particularly the early ones. In addition to throttling, which was mitigated relatively early, some relatively feature-complete microarchitectures (e.g. Ice Lake) were sensitive to alignment. The microarchitectures with these issues are approaching obsolescence at this point. AVX-512 runs very well with predictable performance on most vaguely recent microarchitectures.

In my case I find AVX-512 to be usable in practice because the kinds of users that are obsessed with performance-engineered systems are typically not running creaky hardware.

Joker_vD • 19 days ago

> I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.

Yeah, and it's incredibly frustrating because there is almost zero theory on how to write performant code. Will caching things in memory be faster than re-requesting them over network? Who knows! Sometimes it won't! But you can't predict what those times will be beforehand which turns this whole field into pure black magic instead of anything remotely similar to engineering or science, since theoretical knowledge has no relation to reality.

At my last job we had one of the weirdest "memory access is slo-o-o-ow" scenarios I've ever seen (and it would reproduce pretty reliably... after about 20 hours of the service's continuous execution): somehow, due to peculiarities of the GC and Linux physical memory manager, almost all of the working set of our application would end up in a single physical DDR stick, as opposed to being evenly spread across 4 stick the server actually has. Since a single memory stick literally can't cope with such high data throughput, the performance tanked. And it took quite some time us to figure out what the hell was going on because nothing came up on the perf graphs or metrics or whatever: it's just that almost everything in the the application's userspace became slower. No, the CPU is definitely not throttled, it's actually 20–30% idle. No, there is almost zero disk activity, and the network is fine. Huh?!

physicsguy • 19 days ago

You do have NUMA to control memory placement, it's not that easy to use though: https://blog.rwth-aachen.de/itc-events/files/2021/02/13-open...

Joker_vD • 18 days ago

Well, we didn't, for obvious reasons, patch JVM to manually finagle with physical memory allocation — which it probably wouldn't be able to do anyway, being run in a container.

shihab • 19 days ago

Hi, thanks for reading.

Re (b) I'm curious what that middle ground is. Is there any simple refactor to help GCC to get rid of this `if`? (Note, ISPC did fine here)

Regarding (a), one of the points I wanted to get across was that it didn't feel that complicated to program in the end as I had thought. Porting to AVX-512 felt mechanical (hence the success of LLMs in one-shotting the whole thing).

This is a subjective opinion, depends on programmer's experience etc- so I won't dwell on it. I just wish more CPU programmers gave it a try.

Remnant44 • 19 days ago

Fort what it's worth, I had the exact same experience you did when I started writing SIMD code explicitly with intrinsics.

I avoided it for a long time because, well, it was so damn ugly and verbose to do simple things. However, in actual practice it's not nearly as painful as it looks, and you get used to it quickly.

physicsguy • 17 days ago

The typical way would be to unroll the inner loop manually; often you can get away with:

    for (int i = 0; i < N; i += SIMD_WIDTH) {
        for (int j = 0; j < SIMD_WIDTH) {
            // do code
        }
    }

but failing the compiler optimising that you can do it more like:

    for(int i = 0; i < N; i+= SIMD_WIDTH) {
        float mask[8];
        // do work into mask, find max of the mask
    }

That's effectively what you're doing anyway in the SIMD code, but it keeps it more readable for mere mortals, and because you can define SIMD_WIDTH as a constant, it's also slightly easier to change if a new instruction set comes along; you're not maintaining multiple kernels.

nnevatie • 19 days ago

I found this a weird article.

If you wish to see some speedups using AVX512, without limiting yourself to C or C++, you might want to try ISPC (https://ispc.github.io/index.html).

You'll get sane aliasing rules from the perspective of performance, multi-target binaries with dynamic dispatching and a lot more control over the code generated.

theowaway • 19 days ago

ispc is something that deserves to be much more widely known about- it does an excellent job of bringing the cuda programming model to cpus

grumbelbart2 • 19 days ago

Is there a way to compile it to something else than x86, like arm/aarch64?

nnevatie • 19 days ago

> It currently supports multiple flavours of x86 (SSE2, SSE4, AVX, AVX2, and AVX512), ARM (NEON), and Intel® GPU architectures (Xe family).

shihab • 19 days ago

Hi, I actually mentioned ISPC several times there. And although I strenuously avoided crowning one approach "better" over the other, it is worth pointing out that 1) Many of these benefits of ISPC can be had from explicit SIMD libraries like Google's Highway, and 2) ISPC (or any SIMT model) is a departure from how the underlying hardware works, and as the AI community is discovering with GPU, this abstraction can sometimes be lot more headache than its worth.

majke • 19 days ago

Ispc looks interesting. Does it work with amd? They hint on gpu’s , i guess mostly intel ones?

dataking • 19 days ago

Yes, it works with AMD CPUs as well as various ARM ones, e.g. Apple silicon.

See for instance https://github.com/ispc/ispc/pull/2160

nnevatie • 19 days ago

Yes, works well with AMD. You can compile multi-target so that you'll have e.g. SSE4.2, AVX2, AVX512 support built to your binaries and the best (widest) version is picked by the runtime automatically.

alecco • 19 days ago

> CUDA architects [...] happily exposed every ugly details of underlying hardware to the programmer if that allows a bit more performance.

After spending more than a decade dancing around all the underlying x86 hidden stuff for low-level optimization, I appreciate CUDA a lot. Everything is there under your total control. No more one size fits all. Higher barrier of entry but no surprises and less time spent debugging to figure out what landmine your code stepped into.

user3939382 • 19 days ago

Then you’re locked into the ecosystem and whims of signed proprietary drivers so in a way you have no control whatsoever.

alecco • 19 days ago

Sure. But Intel's beancounter board is way more scary [1] and moving from CUDA to AMD's ROCm isn't that hard, anyway.

[1] "Intel Officially Introduces Pay-As-You-Go Chip Licensing - Intel's Xeon Sapphire Rapids CPUs to activate additional features on demand" https://www.tomshardware.com/news/intel-officially-introduce... (2022)

pjmlp • 24 days ago

> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

The problem is that not all programming languages expose SIMD, and even if they do it is only a portable subset, additionally the kind of skills that are required to be able to use SIMD properly isn't something everyone is confortable doing.

I certainly am not, still managed to get around with MMX and early SSE, can manage shading languages, and that is about it.

adgjlsfhk1 • 19 days ago

The good news is that the portable subset of SIMD is all you really need anyway. If you go beyond the portable subset, you need per-architecture code writing and testing, and you're mostly talking about pretty small gains relative to the cost.

camel-cdr • 24 days ago

> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.

shihab • 24 days ago

Yeah N is big enough that entire data isn't in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.

praptak • 19 days ago

> Not if the data is small and in cache.

Isn't it another way of saying what the author says in the previous paragraph, namely that "ideal SIMD speedup can only come from problems that are compute bound"?

If the cost of getting the input data into the cache is already large compared to processing it with the non-vectorized code, then SIMD cannot achieve meaningful speedup. The opposite of this condition (processing is expensive compared to the cost of data into the cache) is basically the definition "compute bound".

ecesena • 19 days ago

If you have the opportunity, try out a zen5. Significant improvements.

DeathArrow • 19 days ago

>On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

There are lots of people using Javascript frameworks to build slow desktop and mobile software.

user_7832 • 19 days ago

I wonder if the excess CO2 emitted by devices around the world using bloated software that has no need to be so (hullo MS Teams) could be calculated in terms of # of cross atlantic voyages of jets.

chillitom • 19 days ago

Initial example takes array pointers without the __restrict__ keyword/extension so compiler might assume they could be aliased to same address space and will code defensively.

Would be interesting to see if auto vec performs better with that addition.

chillitom • 19 days ago

Also trying to let the compilers know that the float* are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p)

magicalhippo • 19 days ago

> let the compilers know that the float* are aligned

Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.

I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.

Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?

In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...

Manually aligned the buffers and voila, the crashes stopped. Good times.

Remnant44 • 19 days ago

which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.

No reason for the compiler to balk at vectorizing unaligned data these days.

dmpk2k • 19 days ago

> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput

Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...

Remnant44 • 19 days ago

There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.

A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!

A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it to be worth doing a head/body/tail situation where you do one misaligned iteration before doing the bulk of the work in alignment, but honestly for that to be worth it you have to be working almost completely out of L1 cache which is rare... otherwise you're going to be slowed down to L2 or memory speed anyways, at which point the half rate penalty doesn't really matter.

The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.

Sesse__ • 19 days ago

Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.

Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!

jandrewrogers • 19 days ago

With the older microarchitectures there was a large penalty for crossing a cache line with AVX-512. In some cases, the performance could be worse than AVX2!

In older microarchitectures like Ice Lake it was pretty bad, so you wanted to avoid unaligned loads if you could. This penalty has rapidly shrunk across subsequent generations of microarchitectures. The penalty is still there but on recent microarchitectures it is small enough that the unaligned case often isn't a showstopper.

The main reason to use aligned loads in code is to denote cases where you expect the address to always be aligned i.e. it should blow up if it isn't. Forcing alignment still makes sense if you want predictable, maximum performance but it isn't strictly necessary for good performance on recent hardware in the way it used to be.

Remnant44 • 19 days ago

fithisux • 24 days ago

What I get in these article is that the original intent on C language stands true.

Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize, C gives you the tools to call assembly (or use compiler some intrinsic or even inline assembly).

Complex compiler doing crazy optimizations, in my opinion, is not worth it.

vjerancrnjak • 19 days ago

I remember being quite surprised that my implementation which uses manual stack updates is much slower than what compiler had with recursion.

Turns out, I was pushing and popping from stack on every conceptual "recursive call", but compiler figured out it can keep 2-3 recursive levels in registers and pop/push 30% of the time, had more stuff in memory than my version as well.

Even when I reduced memory read/writes to ~50% of the recursive program, kept most of the state in registers, the recursive program was faster anyway due to just using more registers than me.

I realized then that I cannot reason about the microoptimizations at all if I'm coding in a high-level language like C or C++.

Hard to predict the CPU pipeline, sometimes profile guided optimization gets me there faster than my own silliness of assuming I can reason about it.

kergonath • 19 days ago

> Complex compiler doing crazy optimizations, in my opinion, is not worth it.

For these optimisations that are in the back-end, they are used for other languages that can be higher-level or that cannot drop to assembler as easily. C is just one of the front-ends of modern compiler suites.

ziml77 • 18 days ago

Some of them even give you access to this stuff. C# has Vector<T> which picks implementations of various operations based on the current CPU's capabilities. But they also let you have complete control if you want by providing functions in the System.Runtime.Instrinsics namespace.

eru • 19 days ago

Well, C is a lie anyway: it's not how computers work any more (and I'm not sure it's how they ever worked).

woooooo • 19 days ago

Assembly isnt how they work under the hood, but its the highest fidelity interface we have. C as "portable assembler" is targeting the surface that chip designers give us and the one that they try to make fast via all their microcode tricks.

saagarjha • 19 days ago

Isn’t k-means memory bandwidth bound? What was the arithmetic intensity of the final code?

shihab • 19 days ago

No. Assuming `k` is small enough, which in practice often is, the arithmetic intensity of this kernel is 25-90 Flops/Byte, way above the roofline knee of any modern CPU.

NohatCoder • 19 days ago

I assume that the image would at least fit in L3.

BoredomIsFun • 19 days ago

A factoid - earlier batches of Alder Lake 12th gen consumer CPU has a rare AVX512 _FP16_ extension. Afaik it was very very fast.

randomint64 • 19 days ago

For those who want to get started with SIMD programming in Rust, here is a great resource: https://kerkour.com/introduction-rust-simd