kram-profile - update README
alecazam committed Feb 22, 2024
1 parent e2ceec1 commit 0dc07ff
Showing 1 changed file with 89 additions and 7 deletions.
96 changes: 89 additions & 7 deletions kram-profile/README.md
@@ -3,17 +3,19 @@ kram-profile

This profiler currently wraps SwiftUI atop a WKWebView running the Perfetto TraceViewer. Directories are searched, and files are opened. Supported files are added to a file list, and you can then quickly view them in Perfetto. The app is multi-document.

Flamegraphs are key to all profiling. Why look at a giant table of numbers when you can see them visually? This also needs to be a dynamic graph that you can zoom in on and hover over. Fortunately there are several tools now supporting flamegraphs.

This is also a discussion of profilers and techniques for profiling.

Supported files

* .vmatrace - memory report generated by Kram scripts folder.
* .trace - performance timings in the form of Catapult trace json files
* .json - clang timing output generated using -ftime-trace

There are pre-built versions of kram-profile for macOS 13.0 and higher.

References. See for more details:

* https://ui.perfetto.dev
* https://perfetto.dev/docs/visualization/deep-linking-to-perfetto-ui
* https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview#heading=h.yr4qxyxotyw
----------------

TODO:
* Fix document support, so can double click and have app open files
@@ -23,15 +25,67 @@ TODO:
* Scale specific traces to a single duration. That way the next file comes in at that scale.
* Preserve timeline duration across traces

----------------

# Profilers


Cpu Profilers. See for more details

* Catapult
* Perfetto
* Perfetto Deep Link - https://perfetto.dev/docs/visualization/deep-linking-to-perfetto-ui
* Flutter (using perfetto) https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview#heading=h.yr4qxyxotyw
* Optick - https://github.com/bombomby/optick
* Tracy - https://github.com/wolfpld/tracy
* Xcode Instruments
* AMD CodeAnalyst
* Intel VTune
* ClangBuildAnalyzer - https://github.com/aras-p/ClangBuildAnalyzer

Gpu Profilers. See for more details

* Xcode Gpu Capture
* Android Gpu Inspector - https://developer.android.com/agi
* Nvidia Nsight
* Mali Shader Compiler
* Pix Profiler

Catapult
---------

This was the tracing system that Perfetto replaced. Originally designed for Chrome profiling. Flamegraph and track-based. It also had a nice json API for recording thread names and profile scopes.
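
As a rough illustration of that json format, here is a minimal sketch that writes one thread-name metadata event and a list of profile scopes. The `TraceEvent` struct and `writeTrace` helper are hypothetical names for this example, not part of kram or Catapult. The field names (`ph`, `ts`, `dur`, `pid`, `tid`) follow the Trace Event Format, with times in microseconds, and the sketch assumes names need no json escaping.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One complete ("X") event in the Trace Event Format.
// Timestamps and durations are in microseconds.
struct TraceEvent {
    std::string name;
    int64_t startUs;
    int64_t durationUs;
    int threadId;
};

// Hypothetical helper: serialize a thread name plus its scopes to trace json.
// Assumes names contain no characters that need json escaping.
std::string writeTrace(const std::string& threadName, int threadId,
                       const std::vector<TraceEvent>& events)
{
    std::ostringstream out;
    out << "{\"traceEvents\":[\n";
    // Metadata ("M") event that labels the thread's track.
    out << "{\"name\":\"thread_name\",\"ph\":\"M\",\"pid\":1,\"tid\":" << threadId
        << ",\"args\":{\"name\":\"" << threadName << "\"}}";
    // One complete event per profile scope.
    for (const TraceEvent& e : events) {
        out << ",\n{\"name\":\"" << e.name << "\",\"ph\":\"X\",\"pid\":1,\"tid\":"
            << e.threadId << ",\"ts\":" << e.startUs
            << ",\"dur\":" << e.durationUs << "}";
    }
    out << "\n]}\n";
    return out.str();
}
```

Writing the returned string to a `.trace` file should give something the file list can pick up, assuming the app accepts this wrapper-object form of the json.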

Perfetto
---------
* https://ui.perfetto.dev

This is a web-based profiling and flame-graph tool. It's fast on desktop, and continues to evolve. It only has second and timecode granularity, which isn't enough. For example, performance profiling for games is in milliseconds. The team is mostly focused on Chrome profiling, which apparently is in seconds. But the visuals are nice, it now has hover tips with size/name, and it has an Issues list that the devs are responsive to. Flutter is using this profiler, and kram-profile does too.

Perfetto lives inside a sandbox due to the browser, so feeding files to Perfetto is its weakness. As a result, kram-profile's file list is a nice complement, and it can send the file data across via JavaScript. This is not unlike an Electron wrapper, but uses much less memory.

One limitation is that traces must be nested, so timestamps cannot partially overlap. Make sure to honor this, or traces will overlap vertically and become confusing. There is a C++ SDK to help with writing out traces, and that is a much more compact format than the json. But more languages can write to the json format. The Perfetto team is doing no further work on the json format. Fields like "color" are unsupported, and Perfetto uses its own coloration for blocks instead. This coloration is nice and consistent and tied to name.
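
The nesting rule can be checked before a trace is written. This is a minimal sketch, assuming all scopes belong to one thread and carry start/end times in microseconds. `Scope` and `isProperlyNested` are hypothetical names for this example, not part of Perfetto or kram.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Scope {
    int64_t startUs;
    int64_t endUs;
};

// Returns true if every pair of scopes on one thread is either
// disjoint or fully nested (no partial overlap).
bool isProperlyNested(std::vector<Scope> scopes)
{
    // Sort by start time; for equal starts put the longer (outer) scope first.
    std::sort(scopes.begin(), scopes.end(), [](const Scope& a, const Scope& b) {
        if (a.startUs != b.startUs) return a.startUs < b.startUs;
        return a.endUs > b.endUs;
    });

    std::vector<int64_t> openEnds; // end times of the currently open scopes
    for (const Scope& s : scopes) {
        // Close any scopes that ended before this one starts.
        while (!openEnds.empty() && openEnds.back() <= s.startUs)
            openEnds.pop_back();
        // A scope that outlives its parent partially overlaps it.
        if (!openEnds.empty() && s.endUs > openEnds.back())
            return false;
        openEnds.push_back(s.endUs);
    }
    return true;
}
```

Running a check like this per thread in a debug build is one way to catch scopes that would otherwise stack confusingly in the viewer.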

Orbit
---------
* https://orbitprofiler.com/

This profiler uses dynamic instrumentation of code via dtrace and trampolines. Note that Win and macOS can use this sort of system. Apple blocks access to dtrace on iOS, but there are mentions of ktrace. So you inject/remove traces dynamically by patching the dll sources directly. This used to run on macOS, Win, and Linux. Google Stadia adopted this project, and now it is limited to Linux support.

This avoids the need to blindly instrument code or inject scopes into high-frequency routines. But this patching may not be compatible with the security theater adopted by some mobile devices.

ClangBuildAnalyzer
--------
* https://github.com/aras-p/ClangBuildAnalyzer

A nice build profile aggregator. It runs through the json timings that Clang generates, and details which headers, templates, and optimizations are slowing down builds. Then go back and review the json files to validate the results. It uses hierarchical and not self time, so the timings do overlap. And timings across threads total up to more time than the overall build takes.

It has an incremental system to snapshot and compare modstamps, and only does work on newer files. This is some great open-source. Aras optimized Unity builds with this, and that's a huge codebase. I've used this to optimize kram.


# Use Cases

Memory profiling
---------

@@ -47,12 +101,40 @@ Have app write out time and duration events using the Catapult json format. The
Build profiling
---------

Clang supports -ftime-trace across all platforms. Set that to dump the Perfetto trace files into the build directories alongside the .o files. Then use kram-profile to open these folders. Also see scripts/cba.sh to run ClangBuildAnalyzer on these folders and identify where build timings are slow. Then address these by optimizing includes and using pch where possible. A good timescale is 1s. Files that take longer than this to build should be targeted.

SIMD libraries, and especially headers like the STL with heavy template generation, will often be at the top of the list. PCH will reduce parsing time for templates, but not the instantiation. See the Optimization section for more details.

Ideally run the traces, run CBA, reduce headers and identify pch candidates. Then repeat, until overall timings go down. Remember that PCH is per link, so one per DLL or app. It also breaks isolation of headers in files, so you may want a CI build that does not use it to catch unspecified headers.

# Optimization

Unity builds
-----------

Not to be confused with the Unity game engine. Unity builds combine several .cpp files into a single .cpp. This works around problems with slow linkers, and multiple template and inline code instantiations. But code and macros from one .cpp spill into the next. To mitigate this, be careful to undef macros at the bottoms of files. kram also uses a common namespace across headers and source files. This allows "using namespace" in both, and keeps the sources compartmentalized.
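
The macro spill is easiest to see in a small sketch. Here a single file stands in for a unity .cpp, with comments marking where two hypothetical sources (`a.cpp` and `b.cpp`) would normally be pulled in via `#include`. The `mylib` namespace, the `SCALE` macro, and the function names are invented for this example and are not taken from kram.

```cpp
// unity.cpp: hypothetical unity file. In a real build each section below
// would live in its own .cpp and be pulled in with #include "a.cpp" etc.

// ---- a.cpp ----
namespace mylib {
#define SCALE 2 // helper macro meant to be local to a.cpp
int scaleA(int x) { return x * SCALE; }
} // namespace mylib
#undef SCALE // undef at the bottom so it cannot leak into b.cpp

// ---- b.cpp ----
namespace mylib {
#define SCALE 10 // safe to reuse the name, since a.cpp cleaned up
int scaleB(int x) { return x * SCALE; }
} // namespace mylib
#undef SCALE
```

Without the first `#undef`, the second `#define` would be a conflicting macro redefinition, which compilers typically diagnose only once the files are combined.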

Precompiled headers (PCH)
-----------

These are a precursor to C++ modules. pch are universally supported across compilers, whereas we may never see C++ modules. You get one pch per library. So if your app is a DLL and an exe, then each could have its own pch. You need one pch per platform and config. Force include the pch since it must be the first file in each source file, or explicitly include it if you want to be explicit about which files get the pch.

pch spread headers into files. So the build can break if some files don't use it, or configs skip it. Occasionally fix up missing headers by disabling it. Templates are parsed, but only specializations are instantiated. So it may be worth defining specializations in the pch. STL is always a top offender, with vector, unordered_map, function, and others at the top.
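
One related, standard C++ technique for paying the instantiation cost once is an explicit instantiation declaration (`extern template`) in the shared header, paired with a single explicit instantiation definition in one source file. This is a sketch of that general technique, not necessarily what kram does, and whether it helps a given build should be confirmed with -ftime-trace. `Vertex` and `countVerts` are hypothetical names, and both halves are shown in one file only to keep the example self-contained.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical user type shared across many source files.
struct Vertex {
    float x, y, z;
};

// In the shared header (or pch): declare that an instantiation exists
// elsewhere, so including files can skip instantiating the members.
extern template class std::vector<Vertex>;

// In exactly one .cpp: the explicit instantiation definition,
// which instantiates the members a single time.
template class std::vector<Vertex>;

// Any file can then use the container as usual.
inline std::size_t countVerts(const std::vector<Vertex>& verts)
{
    return verts.size();
}
```

A user-defined element type is used here on purpose, since explicitly instantiating standard library templates is only well-defined when the instantiation depends on a user-defined type.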

SIMD
-----------

Vector instructions are universal now via SIMD. For 16B SIMD, ARM has Neon and x64 has SSE4.2. AVX/AVX2 introduce 32B registers, and AVX-512 has 64B registers, but Intel has stripped that from newer consumer chips, and is introducing AVX10. So AVX2 is as safe as it gets. Note that Apple's Rosetta 2 emulator only supports SSE4.2 at the time of this writing. x64 SSE is always 16B in size and 16B aligned, whereas Neon has an 8B float32x2 and a 16B float32x4. The default allocator for macOS is 16B aligned. x64 is 16B aligned, but x86 was 8B aligned.

Apple has a very nice SIMD (simd/simd.h) library. This uses the gcc vector extensions, so swizzles and math operators are built into the compiler. This makes the code look more HLSL-like, which is a good thing. This ships with all calls inline, but optimized 2/3/4-way transcendental calls are buried in the Accelerate library, and the implementation just calls the C stdlib functions multiple times as a fallback. It has a nice abstraction for int, uint, float, double simd math. One defines the maximum SIMD level supported by the app, and the library then uses the largest register size that it can for that platform. The higher size registers work with 16B alignment, so that is what Apple uses.
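
To show what compiler-provided operators look like, here is a minimal sketch using the `vector_size` attribute that GCC and Clang both accept. It does not use Apple's simd.h, and the `float4` typedef plus the `mulAdd` and `sum` helpers are hypothetical names for this example. Swizzles are left out, since as far as I know they need Clang's separate `ext_vector_type` attribute.

```cpp
// GCC/Clang vector extension: a 16-byte vector of four floats.
typedef float float4 __attribute__((vector_size(16)));

static_assert(sizeof(float4) == 16, "expected a 16B register size");
static_assert(alignof(float4) == 16, "expected 16B alignment");

// Hypothetical helper: multiply-add written with plain operators.
// The compiler maps these onto SSE or Neon instructions when optimizing.
inline float4 mulAdd(float4 a, float4 b, float4 c)
{
    return a * b + c;
}

// Hypothetical helper: horizontal sum using per-lane indexing.
inline float sum(float4 v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

The same source compiles for x64 and ARM without intrinsics, which is much of the appeal of this style.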

Optimized debug builds
-----------

One nice aspect of C++ is that specific files can be optimized. But to do so, calls become functions instead of inlines. Setting this up on a SIMD library takes a bit of work, but then callers are running optimized SIMD math even in debug.

Microsoft also has various debug build flags that can optimize force_inline calls. Need to find out the details for clang.



