Skip to content

Commit

Permalink
2024-06-12-FEX-2406.md
Browse files Browse the repository at this point in the history
  • Loading branch information
Sonicadvance1 committed Jun 13, 2024
1 parent 1438bc1 commit d3e3400
Show file tree
Hide file tree
Showing 2 changed files with 100 additions and 0 deletions.
100 changes: 100 additions & 0 deletions _posts/2024-06-12-FEX-2406.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
---
layout: post
title: FEX 2406 Tagged!
author: FEX-Emu Maintainers
---

A little late this month but we have a new FEX release has finally landed. This month we have some good optimization and fixes so let's get right in to it.

## A bunch of JIT optimizations
This last month is finally the culmination of preparation work over the past few months of cleanups in the FEX JIT. The new register allocator has
landed in FEX which is significantly better than our previous RA. Our prior implementation was meant to be a temporary solution when FEX initially
started as a project and as with most temporary code, it became permanent. It was excessively slow, best case it ran in quadratic time, worst case it
could take **INFINITE** time which resulted in significant stutters or hangs. This new implementation by Alyssa now runs in two passes in linear time,
significantly improving performance and also removing a ton of bad design decisions from the first implementation.

In addition to the new RA, we also have a bunch of little optimizations spread around that improves performance all over the place. One of the bigger
performance improvements for people with new hardware is enabling the AFP extension and RPRES if supported. Apple supports these in their latest SoC
and the newer Cortex also supports them. This improves scalar SSE performance by quite a bit. We won't dive in to these too much but the various
optimizations can improve performance from 2% to 12% in testing. We're marching ever closer to running applications at near native speeds now.

## Add support for 32-bit OpenGL thunking
This is a big feature! 32-bit thunking has been a long time coming and has crossed some significant hurdles towards actually working! One of the
biggest CPU time sinks with games is the amount of time we need to spend in the video driver when running a game. "Thunking" allows us to remove that
overhead and jump directly in to the AArch64 libGL directly and remove a bunch of emulation overhead. We have done a bunch of testing with this but we
expect there will still be some bugs that need to be worked out. As for fun performance improvements, we have seen one game go from 150FPS up to
270FPS, so it's worth trying in some cases.

As a note though, this is only 32-bit OpenGL thunking. 32-bit Vulkan drivers still need to go down the emulation path, so things like DXVK in older
32-bit games won't get these performance improvements.

## Default TSO emulation options changed
Over the course of the past couple months we have been testing the new TSO memory model emulation toggles and during this time we have determined the
cost of accurately emulating Vector and Memory copy memory atomics to be too high for most hardware. The good news is that from all the testing we
have done, this doesn't actually cause any problems in any known games. So from this release onward we are by default disabling TSO emulation on these
operations. We may come back and visit this once hardware ships that has **FEAT_LRCPC3** which adds new instructions for Vector TSO loadstores.

Users with an older configuration can go in to FEXConfig to toggle these options off and enjoy the free speed benefits of not doing accurate emulation
today! As a note, Apple Silicon's TSO hardware emulation bit doesn't suffer the same performance degradation so once Asahi Linux supports this for
users then they get accurate emulation and speed!
<img width="50%" src="{{site.baseurl}}/images/posts/2406-06-12/TSOOptions.png">

**Unaligned Half-barrier TSO Enabled is still recommended to keep enabled as that can cause significant bugs**

## Fix fstatat/statx with NOFOLLOW And JIT bugs
During a livestream one of our users encountered a bug in FEX-Emu that was breaking [Darwinia](https://store.steampowered.com/app/1500/Darwinia/).
After diving in to the game to figure out what it was doing, it actually turned out to be three separate bugs that broke the game. The first bug fix
with fstatat and statx syscalls were around edge case behaviour with the **NOFOLLOW** flag. The game was attempting to find the directory that the
executable was living in and being smart in a way that broke FEX.

The other bugs were behaviour in our optimization passes where we broke x86 **SIB** addressing in a couple ways. We have since added unittests for
these two bugs but if you would like to read more you can check out [ConstProp fixes for Darwinia](https://github.com/FEX-Emu/FEX/pull/3669)

With these bugs fixed the game now runs correctly under FEX-Emu without issue!

## More ARM64EC improvements
This was some cleanup work for helping more easily integrating with what upstream WINE is doing for ARM64EC support. While still not entirely usable
for end-users yet, it is steadily improving and can run real games if the environment is setup correctly. A lot of good work here and we're hoping
for more testing going forward.

# NVIDIA Orin CPU errata!
Over the past month or two we had noticed that the NVIDIA Orin platform with its Cortex-A78AE CPU cores were running games markedly worse than our
Snapdragon 8cx Gen 3 platform with Cortex-A78C cores. While these CPU cores are not identical between platforms, they are both based on the Cortex-A78
CPU core design so they should be relatively close. The NVIDIA Orin runs its cores at a 2.2Ghz clock frequency, while the Snapdragon runs its cores at
2.4Ghz. Nearly a 9% clock speed difference wasn't accounting for the performance delta we were seeing!

The game we were testing was the PC port of Sonic Adventure 2: Battle; On Orin the board could only achieve 18FPS, while on Snapdragon we were easily
hitting 60FPS with headroom to go higher if VSync was disabled. We were stunned by this absolute performance difference and couldn't nail down the
difference being due to different drivers.

Turns out we only needed to look at the [Cortex-A78AE Software Developer Errata Notice](https://developer.arm.com/documentation/SDEN-1707912/1800/?lang=en) to find out why.

<blockquote>
<p>1951502</p>
<b>Atomic instructions with acquire semantics might not be ordered with respect to older stores with release semantics</b>
<p>Under certain conditions, atomic instructions with acquire semantics might not be ordered with respect to older instructions with release semantics. The older instruction could either be a store or store atomic.</p>
<p>This erratum can be avoided by <b>inserting a DMB ST before acquire atomic instructions</b> without release semantics. This can be implemented through execution of the following code at EL3 as soon as possible after boot:</p>
</blockquote>

This then goes on to talking about some code that programs the CPU so that it injects these <b>DMB ST</b> instructions before atomic acquires automatically!
This is why this platform has been so weird for performance testing for years! This massive hardware errata basically deletes any advantage that the
<b>FEAT_LRCPC</b> extension gives FEX and goes back to emulating atomics using half-barriers similar to how FEX already does it around unaligned
atomics!

We are now looking to move off of this NVIDIA Orin platform as quickly as possible, it was already old and now that we have identified a significant
problem around atomic performance it is higher priority. Luckily over the last few months we have great new hardware announcements. The Snapdragon X
Elite devices are shipping soon, NVIDIA has announced a new Jetson AGX Thor platform, The NVIDIA Grace server platform is starting to become available, and
Apple has some new M4 devices that will be interesting! Ideally we will get a new platform that we can plug a Radeon GPU in since it is a huge boon to
our testing performance, but depending we may not have that luxury. We'll see as we move on to new and better platforms!

# Video game showcase
Instead of a video showcase from FEX this month, go checkout [Asahi Lina's Youtube Page](https://www.youtube.com/@AsahiLina/). She recently did a
couple of live streams fixing issues with the Asahi Linux MicroVM solution for running FEX-Emu on on Apple Silicon! She showcases a bunch of games
while covering some of the more technical problems involved with getting FEX-Emu running on that platform.

Be warned, these are very long streams.

<iframe width="280" height="158" src="https://www.youtube-nocookie.com/embed/JT9a_MrFV18" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<iframe width="280" height="158" src="https://www.youtube-nocookie.com/embed/qTj60x9eMu4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

See the [2406 Release Notes](https://github.com/FEX-Emu/FEX/releases/tag/FEX-2406) or the [detailed change log](https://github.com/FEX-Emu/FEX/compare/FEX-2405...FEX-2406) in Github.
Binary file added images/posts/2406-06-12/TSOOptions.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit d3e3400

Please sign in to comment.