2024-06-12-FEX-2406.md

FEX-Emu · Jun 13, 2024 · d3e3400 · d3e3400
1 parent 1438bc1
commit d3e3400
Show file tree

Hide file tree

Showing 2 changed files with 100 additions and 0 deletions.
diff --git a/_posts/2024-06-12-FEX-2406.md b/_posts/2024-06-12-FEX-2406.md
@@ -0,0 +1,100 @@
+---
+layout: post
+title: FEX 2406 Tagged!
+author: FEX-Emu Maintainers
+---
+
+A little late this month but we have a new FEX release has finally landed. This month we have some good optimization and fixes so let's get right in to it.
+
+## A bunch of JIT optimizations
+This last month is finally the culmination of preparation work over the past few months of cleanups in the FEX JIT. The new register allocator has
+landed in FEX which is significantly better than our previous RA. Our prior implementation was meant to be a temporary solution when FEX initially
+started as a project and as with most temporary code, it became permanent. It was excessively slow, best case it ran in quadratic time, worst case it
+could take **INFINITE** time which resulted in significant stutters or hangs. This new implementation by Alyssa now runs in two passes in linear time,
+significantly improving performance and also removing a ton of bad design decisions from the first implementation.
+
+In addition to the new RA, we also have a bunch of little optimizations spread around that improves performance all over the place. One of the bigger
+performance improvements for people with new hardware is enabling the AFP extension and RPRES if supported. Apple supports these in their latest SoC
+and the newer Cortex also supports them. This improves scalar SSE performance by quite a bit. We won't dive in to these too much but the various
+optimizations can improve performance from 2% to 12% in testing. We're marching ever closer to running applications at near native speeds now.
+
+## Add support for 32-bit OpenGL thunking
+This is a big feature! 32-bit thunking has been a long time coming and has crossed some significant hurdles towards actually working! One of the
+biggest CPU time sinks with games is the amount of time we need to spend in the video driver when running a game. "Thunking" allows us to remove that
+overhead and jump directly in to the AArch64 libGL directly and remove a bunch of emulation overhead. We have done a bunch of testing with this but we
+expect there will still be some bugs that need to be worked out. As for fun performance improvements, we have seen one game go from 150FPS up to
+270FPS, so it's worth trying in some cases.
+
+As a note though, this is only 32-bit OpenGL thunking. 32-bit Vulkan drivers still need to go down the emulation path, so things like DXVK in older
+32-bit games won't get these performance improvements.
+
+## Default TSO emulation options changed
+Over the course of the past couple months we have been testing the new TSO memory model emulation toggles and during this time we have determined the
+cost of accurately emulating Vector and Memory copy memory atomics to be too high for most hardware. The good news is that from all the testing we
+have done, this doesn't actually cause any problems in any known games. So from this release onward we are by default disabling TSO emulation on these
+operations. We may come back and visit this once hardware ships that has **FEAT_LRCPC3** which adds new instructions for Vector TSO loadstores.
+
+Users with an older configuration can go in to FEXConfig to toggle these options off and enjoy the free speed benefits of not doing accurate emulation
+today! As a note, Apple Silicon's TSO hardware emulation bit doesn't suffer the same performance degradation so once Asahi Linux supports this for
+users then they get accurate emulation and speed!
+<img width="50%" src="{{site.baseurl}}/images/posts/2406-06-12/TSOOptions.png">
+
+**Unaligned Half-barrier TSO Enabled is still recommended to keep enabled as that can cause significant bugs**
+
+## Fix fstatat/statx with NOFOLLOW And JIT bugs
+During a livestream one of our users encountered a bug in FEX-Emu that was breaking [Darwinia](https://store.steampowered.com/app/1500/Darwinia/).
+After diving in to the game to figure out what it was doing, it actually turned out to be three separate bugs that broke the game. The first bug fix
+with fstatat and statx syscalls were around edge case behaviour with the **NOFOLLOW** flag. The game was attempting to find the directory that the
+executable was living in and being smart in a way that broke FEX.
+
+The other bugs were behaviour in our optimization passes where we broke x86 **SIB** addressing in a couple ways. We have since added unittests for
+these two bugs but if you would like to read more you can check out [ConstProp fixes for Darwinia](https://github.com/FEX-Emu/FEX/pull/3669)
+
+With these bugs fixed the game now runs correctly under FEX-Emu without issue!
+
+## More ARM64EC improvements
+This was some cleanup work for helping more easily integrating with what upstream WINE is doing for ARM64EC support. While still not entirely usable
+for end-users yet, it is steadily improving and can run real games if the environment is setup correctly. A lot of good work here and we're hoping
+for more testing going forward.
+
+# NVIDIA Orin CPU errata!
+Over the past month or two we had noticed that the NVIDIA Orin platform with its Cortex-A78AE CPU cores were running games markedly worse than our
+Snapdragon 8cx Gen 3 platform with Cortex-A78C cores. While these CPU cores are not identical between platforms, they are both based on the Cortex-A78
+CPU core design so they should be relatively close. The NVIDIA Orin runs its cores at a 2.2Ghz clock frequency, while the Snapdragon runs its cores at
+2.4Ghz. Nearly a 9% clock speed difference wasn't accounting for the performance delta we were seeing!
+
+The game we were testing was the PC port of Sonic Adventure 2: Battle; On Orin the board could only achieve 18FPS, while on Snapdragon we were easily
+hitting 60FPS with headroom to go higher if VSync was disabled. We were stunned by this absolute performance difference and couldn't nail down the
+difference being due to different drivers.
+
+Turns out we only needed to look at the [Cortex-A78AE Software Developer Errata Notice](https://developer.arm.com/documentation/SDEN-1707912/1800/?lang=en) to find out why.
+
+<blockquote>
+<p>1951502</p>
+<b>Atomic instructions with acquire semantics might not be ordered with respect to older stores with release semantics</b>
+<p>Under certain conditions, atomic instructions with acquire semantics might not be ordered with respect to older instructions with release semantics. The older instruction could either be a store or store atomic.</p>
+<p>This erratum can be avoided by <b>inserting a DMB ST before acquire atomic instructions</b> without release semantics. This can be implemented through execution of the following code at EL3 as soon as possible after boot:</p>
+</blockquote>
+
+This then goes on to talking about some code that programs the CPU so that it injects these <b>DMB ST</b> instructions before atomic acquires automatically!
+This is why this platform has been so weird for performance testing for years! This massive hardware errata basically deletes any advantage that the
+<b>FEAT_LRCPC</b> extension gives FEX and goes back to emulating atomics using half-barriers similar to how FEX already does it around unaligned
+atomics!
+
+We are now looking to move off of this NVIDIA Orin platform as quickly as possible, it was already old and now that we have identified a significant
+problem around atomic performance it is higher priority. Luckily over the last few months we have great new hardware announcements. The Snapdragon X
+Elite devices are shipping soon, NVIDIA has announced a new Jetson AGX Thor platform, The NVIDIA Grace server platform is starting to become available, and
+Apple has some new M4 devices that will be interesting! Ideally we will get a new platform that we can plug a Radeon GPU in since it is a huge boon to
+our testing performance, but depending we may not have that luxury. We'll see as we move on to new and better platforms!
+
+# Video game showcase
+Instead of a video showcase from FEX this month, go checkout [Asahi Lina's Youtube Page](https://www.youtube.com/@AsahiLina/). She recently did a
+couple of live streams fixing issues with the Asahi Linux MicroVM solution for running FEX-Emu on on Apple Silicon! She showcases a bunch of games
+while covering some of the more technical problems involved with getting FEX-Emu running on that platform.
+
+Be warned, these are very long streams.
+
+<iframe width="280" height="158" src="https://www.youtube-nocookie.com/embed/JT9a_MrFV18" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+<iframe width="280" height="158" src="https://www.youtube-nocookie.com/embed/qTj60x9eMu4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+
+See the [2406 Release Notes](https://github.com/FEX-Emu/FEX/releases/tag/FEX-2406) or the [detailed change log](https://github.com/FEX-Emu/FEX/compare/FEX-2405...FEX-2406) in Github.
diff --git a/images/posts/2406-06-12/TSOOptions.png b/images/posts/2406-06-12/TSOOptions.png