The MM registers aren't the lower 64 bits of the XMM registers [and this is where I think Intel made a BIG mistake]; they alias the x87 FPU registers, so you couldn't mix MMX and FPU instructions without suffering a HUGE performance penalty for switching modes! The XMM, YMM and ZMM registers do, as you said, all share the lowest bits.
@alex_d_lee
16 days ago
You're 100% right, and that was a complete oversight when making that part of the video. Thanks for the correction; I'll pin this for other people 👍
@BenjaminWheeler0510
8 days ago
This isn’t “abusing” vector instructions. It’s using them for their intended purpose!
@nand3kudasai
16 days ago
+short +no fluff +no ads +no distracting music +clear voice +well tested on hardware +code available +not a hype vid advocating on the latest buzzword +actually explains something +very interesting Well done. I loved it!
@alex_d_lee
15 days ago
@@nand3kudasai Thank you! Really glad you found this useful :)
@crackedemerald4930
18 hours ago
you made computers feel like actual physical machines instead of a magic box that heats your room and confounds the mind.
@tolkienfan1972
7 days ago
One thing I didn't see you address is unnecessary dependencies, which cause pipeline stalls. E.g. the classic is a dot product: instead of for(k=0;k<n;k++) s += a[k]*b[k]; with a single accumulator, split the sum across several independent accumulators so consecutive adds don't have to wait on each other.
@Asdayasman
27 days ago
Dang if only there was a giant floating point vector co-pro in my computer right now that is way better at doing this than my CPU so I could free up the wasted AVX chip space there for much much more useful L1 cache. If only. (Excellent video).
@alex_d_lee
26 days ago
Haha if only 🤔. Thanks!
@chickenbobbobba
10 days ago
In all seriousness, there are a lot of cases where you don't want to use the GPU, mostly when latency is an issue. The time it takes to upload the data and read back the result often exceeds the time to compute it with SIMD, even if the GPU's compute itself were free.
@alex_d_lee
10 days ago
@@chickenbobbobba yup, very true. Considering most x64 vector instructions take ~1-15 CPU cycles, it's pretty unreasonable to send relatively small workloads to the GPU. Obviously in the case of a ray tracer the volume of work makes a GPU the better choice, but you're completely right for the majority of cases you would ever consider using SIMD for.
@balala7567
5 days ago
The runtime overhead of actually using the GPU is typically very large, not to mention the latency of getting input into the GPU and reading the output back from memory. Embedded and real-time operating systems will thus want to use the CPU for this, as will homebrew operating systems that don't yet have GPU drivers.
@novantha1
1 day ago
Apparently unpopular opinion: AVX did nothing wrong. It sounds great to have a larger dedicated GPU for parallel operations, and in some ways it is, but the cycle overhead of moving data between CPU and GPU is prohibitive, even on the same die, and there are a lot of advantages to having SIMD directly in the CPU itself. You can more easily do conditional / branching compute; the programming model is a lot easier to fully optimize (there's a reason CUDA devs at AI startups can make seven figures ATM); and a lot of the space used by AVX units would generally be "dead silicon" anyway, in the sense that modern CPUs have to leave large areas of the die blank for heat dissipation, so it's really not that expensive to put some extra vector hardware in those areas. Adding a moderate degree of AVX acceleration multiplies the value of your CPU for those workloads at a relatively small die-area cost, so it's incredible value for what you get out of it.

Plus, there are some workloads where you want SIMD and need a lot of RAM. Generative AI comes to mind, but there are others. In those cases it can be really beneficial to have access to a huge pool of very affordable RAM (even if it's not as fast as VRAM), because Nvidia seems hellbent on ensuring people spend the absolute maximum possible to go beyond 8GB of VRAM.

Finally, deploying AVX code is way simpler than deploying GPU compute shaders: to deploy AVX, you literally just send someone the binary. With the GPU, it's way harder to package the appropriate compute shader libraries with your program, and it adds another headache in managing dependencies.
Having walked something like a dozen people through installing ROCm, (and non-technical people through installing CUDA, which actually ended up being about the same difficulty), I would honestly rather spend an extra eight hours optimizing SIMD code on CPU than spend one hour each with a hundred people who can’t be trusted to install their own GPU compute shaders.
@Speechrezz
25 days ago
Great video, we love SIMD!
@egonkirchof
4 days ago
SIMD is the grandfather of the modern GPU.
@nates9778
15 days ago
HELL YEAH!!! SIMD BITTCCHHH!!!!!!!!!!
@alex_d_lee
15 days ago
@@nates9778 YEEE
@anon_y_mousse
12 days ago
This is definitely an experiment I can get behind, as I've got onboard graphics in my potato. I don't know if anyone else has this problem building it, but I had to lower the cmake version it "required" to get it to attempt building, then add pthreads to the link command to actually get it to build. Possibly the latter stemmed from the former, but I don't know, because I'm not going to upgrade my system any time soon. However, it might be interesting to design a CPU-heavy game engine. It wouldn't be as fast, it couldn't be, but it would still be an interesting experiment.
@alex_d_lee
12 days ago
@@anon_y_mousse oh awesome, thanks for letting me know! If you want, open a pull request with those changes, including the older cmake version you found to work. That would be really cool, and we've seen semi-3D games done on older CPUs before, so no doubt it would be even better now.
@anon_y_mousse
12 days ago
@@alex_d_lee For sure, if you don't go for photorealism in real time, modern CPUs can handle things better than older ones could. But like I always say, graphics aren't everything, and what we had 10 years ago is more than enough if you've got a good story and good gameplay.
@Steam_VR
11 days ago
Thanks for the explanation :)
@axltrain838
25 days ago
great video! very interesting
@alex_d_lee
25 days ago
Thank you!
@csaki01
13 days ago
Thanks.
@harold2718
4 days ago
For the PRNG I'd recommend something else if possible (it can still be an LCG, just with different constants): VPMULLD (_mm256_mullo_epi32) has not-great throughput on Intel and horrible latency.
@user-og6hl6lv7p
42 minutes ago
Why assembly and not a more common language? Hard to translate this into my own projects tbh...
@elitegarbageman
25 days ago
What are some resources for learning SIMD programming you'd recommend?
@alex_d_lee
25 days ago
Check this out: en.algorithmica.org/hpc/simd/ You can look through this to understand what's going on. That's only going to get you so far though, and I would really recommend building a small practice project on your own to actually get comfortable with it!
@BryceDixonDev
2:55 "a plus 'or' equals b" 🤔 I've never heard someone read it like that before. I guess I can see where it would come from given >= is "greater or equals," but (obviously) += is a modification ("add and assign"), not a comparison.
@alex_d_lee
9 days ago
@@BryceDixonDev haha that was a mistake I didn’t realize I said until now. Meant to say “plus equals”. I read ± a lot so my brain is used to saying “plus or minus”
@stickguy9109
3 days ago
I am too dumb to understand this 😭
@alex_d_lee
2 days ago
@@stickguy9109 no!
@detoxifiedplant
1 day ago
@@alex_d_lee then please enlighten us on where to get started _/\_
Comments: 35