Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TrueColor: implement optional tablified translucency #1233

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

JNechaevsky
Copy link
Collaborator

@JNechaevsky JNechaevsky commented Oct 30, 2024

This is slightly faster than fully dynamic and may look more consistent when TrueColor mode is off. Few remarks:

  • Fast LUT generating, takes about 2-4 milliseconds at program startup.
  • No malloc needed.
  • 262144 (512x512), not 256 colors for slightly better, yet consistent look.
  • Not limited by PLAYPAL colors.

src/i_truecolor.c Outdated Show resolved Hide resolved
@@ -37,6 +37,100 @@ typedef union
};
} tcpixel_t;

// [JN] LUTs to store precomputed values for all possible 512 color combinations
static uint32_t blendAddLUT[512][512]; // Additive blending
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good already, thank you! Do you think we could get away with only two arrays, a "main" one and an "alt" one (which would be additive in Doom)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this:

diff --git a/src/i_truecolor.c b/src/i_truecolor.c
index 21f9128b..b3b4b95a 100644
--- a/src/i_truecolor.c
+++ b/src/i_truecolor.c
@@ -38,9 +38,8 @@ typedef union
 } tcpixel_t;
 
 // [JN] LUTs to store precomputed values for all possible 512 color combinations
-static uint32_t blendAddLUT[512][512];      // Additive blending
 static uint32_t blendOverLUT[512][512];     // Overlay blending
-static uint32_t blendOverAltLUT[512][512];  // Overlay "alt" blending
+static uint32_t blendAltLUT[512][512];      // Additive blending (Doom), Overlay "alt" blending
 
 const uint32_t (*I_BlendAddFunc) (const uint32_t bg_i, const uint32_t fg_i);
 const uint32_t (*I_BlendOverFunc) (const uint32_t bg_i, const uint32_t fg_i, const int amount);
@@ -104,23 +103,28 @@ void R_InitBlendMaps (GameMission_t mission)
             fg.g = ((fg_index >> 3) & 0x07) << 5;
             fg.b = (fg_index & 0x07) << 5;
 
-            // Additive blending calculation
-            retAdd.r = MIN(bg.r + fg.r, 0xFFU);
-            retAdd.g = MIN(bg.g + fg.g, 0xFFU);
-            retAdd.b = MIN(bg.b + fg.b, 0xFFU);
-            blendAddLUT[bg_index][fg_index] = retAdd.i;
-
             // Overlay blending calculation
             retOver.r = (overlay_alpha * fg.r + (0xFFU - overlay_alpha) * bg.r) >> 8;
             retOver.g = (overlay_alpha * fg.g + (0xFFU - overlay_alpha) * bg.g) >> 8;
             retOver.b = (overlay_alpha * fg.b + (0xFFU - overlay_alpha) * bg.b) >> 8;
             blendOverLUT[bg_index][fg_index] = retOver.i;
 
-            // Overlay "alt" blending calculation
-            retOver.r = (overlay_alt_alpha * fg.r + (0xFFU - overlay_alt_alpha) * bg.r) >> 8;
-            retOver.g = (overlay_alt_alpha * fg.g + (0xFFU - overlay_alt_alpha) * bg.g) >> 8;
-            retOver.b = (overlay_alt_alpha * fg.b + (0xFFU - overlay_alt_alpha) * bg.b) >> 8;
-            blendOverAltLUT[bg_index][fg_index] = retOver.i;
+            if (overlay_alt_alpha == 0)
+            {
+                // Additive blending calculation
+                retAdd.r = MIN(bg.r + fg.r, 0xFFU);
+                retAdd.g = MIN(bg.g + fg.g, 0xFFU);
+                retAdd.b = MIN(bg.b + fg.b, 0xFFU);
+                blendAltLUT[bg_index][fg_index] = retAdd.i;
+            }
+            else
+            {
+                // Overlay "alt" blending calculation
+                retOver.r = (overlay_alt_alpha * fg.r + (0xFFU - overlay_alt_alpha) * bg.r) >> 8;
+                retOver.g = (overlay_alt_alpha * fg.g + (0xFFU - overlay_alt_alpha) * bg.g) >> 8;
+                retOver.b = (overlay_alt_alpha * fg.b + (0xFFU - overlay_alt_alpha) * bg.b) >> 8;
+                blendAltLUT[bg_index][fg_index] = retOver.i;
+            }
         }
     }
 }
@@ -157,7 +161,7 @@ const uint32_t I_BlendAddLow (const uint32_t bg_i, const uint32_t fg_i)
     bg_index = PixelToLUTIndex(bg);
     fg_index = PixelToLUTIndex(fg);
 
-    return blendAddLUT[bg_index][fg_index];
+    return blendAltLUT[bg_index][fg_index];
 }
 
 const uint32_t I_BlendDark (const uint32_t bg_i, const int d)
@@ -214,7 +218,7 @@ const uint32_t I_BlendOverAltLow (const uint32_t bg_i, const uint32_t fg_i, cons
     bg_index = PixelToLUTIndex(bg);
     fg_index = PixelToLUTIndex(fg);
 
-    return blendOverAltLUT[bg_index][fg_index];
+    return blendAltLUT[bg_index][fg_index];
 }
 
 // [crispy] TRANMAP blending emulation, used for Doom

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was thinking how to divide tablifying per game, but in the end chickened out because of:

  1. Possible hit 512x512 times of checking if/else conditions.
  2. To avoid bloating for loop.

But looks likes it's not that scary, so sure, let's go this way. Most important question for this PR: how should we name menu item? Maybe something like:

  • Translucency quality: 8bit / 32bit (or 8bpp / 32bpp? or low/high?)
  • Blending mode: speed / quality

Hot-switch is fairly simple, we just have to kick R_InitBlendQuality, even post-rendering doesn't seems to be not needed.

Finally, I have an itch that something else could be done. Current implementation helps to improve performance and get higher average framerate in -timedemo runs, but results are not as high as expected. I'm thinking to play around with R_DrawTLColumn by unrolling by four, @rfomin recommending to go farther by turning blending functions to inline's, but it's tricky since we are using function pointers, except for BlendDark for fuzz drawing.

So maybe you can give few recommendations regarding where to think or dig? 🙂

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Translucency quality: 8bit / 32bit (or 8bpp / 32bpp? or low/high?)

That'd be 9bit, strictly speaking. 😉

  • Blending mode: speed / quality

Translucency mode: speed / quality

I'd avoid using the term "blending" if we only call it "translucency" just one line above.

Finally, I have an itch that something else could be done. Current implementation helps to improve performance and get higher average framerate in -timedemo runs, but results are not as high as expected. I'm thinking to play around with R_DrawTLColumn by unrolling by four,

Sure, there's so much left for optimization in Doom's source code.

@rfomin recommending to go farther by turning blending functions to inline's, but it's tricky since we are using function pointers, except for BlendDark for fuzz drawing.

Nope, we cannot inline function pointers, obviously.

I'm fine with the current implementation. Heck, I was even too lazy for the last ~10 years to write the code that you just contributed in this PR. 😁

If you find some code path that could use some further optimization, sure, please feel free to investigate it. But this is probably topic for another PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rfomin recommending to go farther by turning blending functions to inline's, but it's tricky since we are using function pointers, except for BlendDark for fuzz drawing.

Nope, we cannot inline function pointers, obviously.

I suggest making more functions (e.g. R_DrawTLColumnAdd, R_DrawTLColumnOver). Function calls on every pixel are expensive, I checked when implementing Woof's "function factory" (about ~30% performance hit).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dammit... There is just a small difference in performance comparing between 512x512 and 256x256 LUTs. Something really have to be done with column drawing functions.

Just in case, 256x256 LUT code, still much better than just 256 colors:

--- R:/GITHUB/crispy-doom/src/i_truecolor.c	Thu Oct 31 22:15:20 2024
+++ R:/MSYS/home/Julia/crispy-doom/src/i_truecolor.c	Fri Nov  1 23:33:47 2024
@@ -38,8 +38,8 @@
 } tcpixel_t;
 
 // [crispy] LUTs to store precomputed values for all possible 512x512 color combinations
-static uint32_t blendOverLUT[512][512];  // Overlay blending
-static uint32_t blendAltLUT[512][512];   // Additive blending (Doom), Overlay "alt" blending
+static uint32_t blendOverLUT[256][256];  // Overlay blending
+static uint32_t blendAltLUT[256][256];   // Additive blending (Doom), Overlay "alt" blending
 
 const uint32_t (*I_BlendAddFunc) (const uint32_t bg_i, const uint32_t fg_i);
 const uint32_t (*I_BlendOverFunc) (const uint32_t bg_i, const uint32_t fg_i, const int amount);
@@ -90,18 +90,18 @@
     retAdd.a = 0xFFU;
     retOver.a = 0xFFU;
 
-    for (int bg_index = 0; bg_index < 512; ++bg_index)
+    for (int bg_index = 0; bg_index < 256; ++bg_index)
     {
-        for (int fg_index = 0; fg_index < 512; ++fg_index)
+        for (int fg_index = 0; fg_index < 256; ++fg_index)
         {
             // Convert LUT indices back to RGB values with reduced bit depth
-            bg.r = (bg_index >> 6) << 5;
-            bg.g = ((bg_index >> 3) & 0x07) << 5;
-            bg.b = (bg_index & 0x07) << 5;
-
-            fg.r = (fg_index >> 6) << 5;
-            fg.g = ((fg_index >> 3) & 0x07) << 5;
-            fg.b = (fg_index & 0x07) << 5;
+            bg.r = (bg_index >> 5) << 5;
+            bg.g = ((bg_index >> 2) & 0x07) << 5;
+            bg.b = (bg_index & 0x03) << 6;
+
+            fg.r = (fg_index >> 5) << 5;
+            fg.g = ((fg_index >> 2) & 0x07) << 5;
+            fg.b = (fg_index & 0x03) << 6;
 
             // Overlay blending calculation
             retOver.r = (overlay_alpha * fg.r + (0xFFU - overlay_alpha) * bg.r) >> 8;
@@ -130,9 +130,9 @@
 }
 
 // [crispy] Helper function to convert a pixel color to a LUT index
-static inline uint16_t PixelToLUTIndex (const tcpixel_t color)
+static inline uint8_t PixelToLUTIndex (const tcpixel_t color)
 {
-    return ((color.r & 0xE0) << 1) | ((color.g & 0xE0) >> 2) | (color.b >> 5);
+    return ((color.r & 0xE0) | ((color.g & 0xE0) >> 3) | (color.b >> 6));
 }
 
 const uint32_t I_BlendAdd (const uint32_t bg_i, const uint32_t fg_i)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dammit... There is just a small difference in performance comparing between 512x512 and 256x256 LUTs.

Array access time doesn't depend on array size. 🤷

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, good to know - that's important. I'm playing around with inline-ing, at least for additive blending in R_DrawTLColumn and it actually helps, -timedemo differences before and after inlining are about ~466 w/o inlining and ~522 with inlining.

Looks likes it is have to be done, maybe really in another PR, since we have to divide R_DrawTLColumn to blend/alt functions (and to high/low detail + high/low quality, oh God!), and unlikely I will handle Format Factory approach. Otherwise, this whole idea and implementation is nothing more than cosmetical change with tiny performance improvement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, tongue was ahead of thoughts. So the idea is put TrueColor TL drawing functions as macroses to i_truecolor.h where we can have inlining without overbloating existing code in r_draw.c of all four games and we don't need to care about tranmap[], tinttab[] and xlatab[] there. This will still require small separating of colfunc pointers for TrueColor only, but in theory, this will do the trick.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't overload this PR, though. Let's do one step after the other...

src/i_truecolor.c Outdated Show resolved Hide resolved
JNechaevsky and others added 5 commits October 31, 2024 12:10
Co-Authored-By: Fabian Greffrath <[email protected]>
Also, fix nasty typo: it's 512x512, not 512 colors, i.e. 262144 in total.

Co-Authored-By: Fabian Greffrath <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants