Add option to use PDB MSDIA instead of PDB Universal #65

Open
justanotheranonymoususer opened this issue Dec 16, 2023 · 16 comments

Comments

@justanotheranonymoususer

For large binaries, Universal fails with OOM. See:
NationalSecurityAgency/ghidra#2485

For this reason I couldn't try this tool with my binary.

Please add a command line option to switch to MSDIA.

image

@clearbluejar
Owner

This request points to a broader need: being able to pass custom analyzer options to ghidriff. I have been meaning to add that, and it shouldn't be too hard, since ghidriff already sets some custom options itself.

For example, if you save the options shown in the screenshot above, Ghidra writes a custom options file like this:

{
  "SAVE_STATE_NAME": "File_Options",
  "VALUES": {
    "WindowsPE x86 Propagate External Parameters": true,
    "Aggressive Instruction Finder": true,
    "PDB Universal.Search remote symbol servers": true,
    "Condense Filler Bytes": true,
    "Decompiler Parameter ID": true,
    "Variadic Function Signature Override": true,
    "PDB MSDIA": true
  },
  "TYPES": {
    "WindowsPE x86 Propagate External Parameters": "boolean",
    "Aggressive Instruction Finder": "boolean",
    "PDB Universal.Search remote symbol servers": "boolean",
    "Condense Filler Bytes": "boolean",
    "Decompiler Parameter ID": "boolean",
    "Variadic Function Signature Override": "boolean",
    "PDB MSDIA": "boolean"
  },
  "ENUM_CLASSES": {}
}

I think I could support that in ghidriff fairly quickly, as a command-line option that takes a custom analysis options file. What do you think?
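As a rough illustration of what consuming such a file could involve, here is a minimal Python sketch that parses the saved options and checks each value against its declared type. It is not ghidriff's implementation: the function name `load_analysis_options` is hypothetical, and only the `boolean` type seen in the example above is checked.

```python
import json

# Hypothetical helper: parse a saved options file like the one above
# and return a flat {option name: value} mapping.
def load_analysis_options(text):
    doc = json.loads(text)
    values = doc.get("VALUES", {})
    types = doc.get("TYPES", {})
    # Only "boolean" appears in the example; other types pass through unchecked.
    checkers = {"boolean": bool}
    options = {}
    for name, value in values.items():
        declared = types.get(name)
        if declared is None:
            raise ValueError(f"option {name!r} has no entry in TYPES")
        expected = checkers.get(declared)
        if expected is not None and not isinstance(value, expected):
            raise ValueError(f"option {name!r} should be of type {declared}")
        options[name] = value
    return options

# A shortened copy of the example file, keeping two of its options.
EXAMPLE = """
{
  "SAVE_STATE_NAME": "File_Options",
  "VALUES": {
    "PDB Universal.Search remote symbol servers": true,
    "PDB MSDIA": true
  },
  "TYPES": {
    "PDB Universal.Search remote symbol servers": "boolean",
    "PDB MSDIA": "boolean"
  },
  "ENUM_CLASSES": {}
}
"""

opts = load_analysis_options(EXAMPLE)
for name, value in sorted(opts.items()):
    print(f"{name} = {value}")
```

The resulting mapping would still have to be applied to each program's analysis options inside Ghidra, which is outside the scope of this sketch.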

Alternatively, for now, if you want to diff files you have already analyzed in Ghidra, export each binary in the Ghidra Zipped File format (see the picture in the latest release). For example, export the binaries to my_large_bin1.gzf and my_large_bin2.gzf, then pass the already analyzed files to ghidriff for diffing.

ghidriff my_large_bin1.gzf my_large_bin2.gzf

I only just released this, so I am curious about the results. Let me know if you try it and whether it works for you. Based on your feedback, I'll likely create a ticket to support custom analysis options generally.

@justanotheranonymoususer
Author

justanotheranonymoususer commented Dec 17, 2023 via email

@justanotheranonymoususer
Author

Downloading PDBs always fails for me, so I had to use another tool to download them:

INFO | ghidriff | Setting up Symbol Server for symbols...
INFO | ghidriff | path: ghidriffs\symbols level: 1
INFO | ghidriff | Symbol Server Configured path: SymbolServerService:
        symbolStore: LocalSymbolStore: [ rootDir: C:\Users\User\Desktop\diff2\ghidriffs\symbols, storageLevel: -1],
        symbolServers:
                HttpSymbolServer: [ url: https://msdl.microsoft.com/download/symbols/, storageLevel: -1]
                HttpSymbolServer: [ url: https://chromium-browser-symsrv.commondatastorage.googleapis.com/, storageLevel: -1]
                HttpSymbolServer: [ url: https://symbols.mozilla.org/, storageLevel: -1]
                HttpSymbolServer: [ url: https://software.intel.com/sites/downloads/symbols/, storageLevel: -1]
                HttpSymbolServer: [ url: https://driver-symbols.nvidia.com/, storageLevel: -1]
                HttpSymbolServer: [ url: https://download.amd.com/dir/bin/, storageLevel: -1]
INFO  Connecting to https://msdl.microsoft.com/download/symbols/ (ConsoleTaskMonitor)
INFO  Success (ConsoleTaskMonitor)
INFO  Storing <XXX>.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)
WARN  SymbolServerService: error copying file https://msdl.microsoft.com/download/symbols/<XXX>.pdb/<YYY>/<XXX>.pdb to C:\Users\User\Desktop\diff2\ghidriffs\symbols: closed (SymbolServerService)
INFO  Connecting to https://msdl.microsoft.com/download/symbols/ (ConsoleTaskMonitor)
INFO  Success (ConsoleTaskMonitor)
INFO  Storing <XXX>.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)
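For anyone who needs to fetch a large PDB by hand, the log above shows the URL layout the symbol server uses: server, file name, an identifier directory, then the file name again. The Python sketch below builds such a URL and streams the download to a temporary `.part` file, so that a connection closed mid-transfer does not leave a truncated PDB behind. The function names and the flat destination directory are assumptions made for illustration; this is not what Ghidra or ghidriff does internally.

```python
import http.client
import os
import shutil
import urllib.request

# Build a URL in the layout seen in the log:
#   <server>/<pdb name>/<identifier>/<pdb name>
def symbol_url(server, pdb_name, identifier):
    return f"{server.rstrip('/')}/{pdb_name}/{identifier}/{pdb_name}"

# Hypothetical helper: stream the file in 1 MiB chunks and retry on failure.
# The file only appears under its final name once it is complete.
def download_pdb(url, dest_dir, attempts=3, timeout=60):
    pdb_name = url.rsplit("/", 1)[-1]
    os.makedirs(dest_dir, exist_ok=True)
    final_path = os.path.join(dest_dir, pdb_name)
    part_path = final_path + ".part"
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                    open(part_path, "wb") as out:
                shutil.copyfileobj(resp, out, length=1024 * 1024)
            os.replace(part_path, final_path)
            return final_path
        except (OSError, http.client.HTTPException) as exc:
            last_error = exc
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

A call would look like `download_pdb(symbol_url(server, name, identifier), "symbols")`, with the identifier taken from the failing URL in the log (the `<YYY>` part).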

Then I got this assert:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Scripts\ghidriff.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\__main__.py", line 82, in main
    pdiff = d.diff_bins(diff[0], diff[1])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 1170, in diff_bins
    assert sym_count_diff < 4000, f'Symbols counts between programs ({p1.name} and {p2.name}) are too high {sym_count_diff}! Likely bad analyiss or only one binary has symbols! Check Ghidra analysis or pdb! Add --force-diff to ignore this assert'
           ^^^^^^^^^^^^^^^^^^^^^
AssertionError: Symbols counts between programs (<XXX>_1.dll and <XXX>-2.dll) are too high 82149! Likely bad analyiss or only one binary has symbols! Check Ghidra analysis or pdb! Add --force-diff to ignore this assert

BTW typo: analyiss

I added --force-diff, now it seems to be working, I'm waiting for it to complete.

@clearbluejar
Owner

Symbols counts between programs (<XXX>_1.dll and <XXX>-2.dll) are too high 82149!

If one version has symbols and the other doesn't, it becomes difficult to match functions, because Ghidra will produce a different set of functions for each binary and some functions won't line up. The assertion is there to warn you that you are stepping into a diff that might not work.

That said, I have seen even partial diffs be useful. There is also an option to run without symbols, which is sometimes the better choice when analysis with and without symbols differs a lot. It depends on the binaries.
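The guard discussed here can be pictured with a small Python sketch. The traceback only shows the `sym_count_diff < 4000` check itself, so computing the value as the absolute difference of the two symbol counts is an assumption, as are the function and parameter names.

```python
# Hypothetical stand-in for the symbol-count guard.
# Assumption: the difference is the absolute gap between the two counts.
def check_symbol_counts(old_count, new_count, threshold=4000, force_diff=False):
    diff = abs(old_count - new_count)
    if diff >= threshold and not force_diff:
        raise ValueError(
            f"symbol counts differ by {diff}; likely bad analysis or only "
            "one binary has symbols (use force_diff=True to continue anyway)"
        )
    return diff

# A small gap passes without complaint.
print(check_symbol_counts(100_000, 100_500))

# A gap like the one reported above only passes when forced.
print(check_symbol_counts(100_000, 182_149, force_diff=True))
```

Raising an explicit exception instead of using `assert` keeps the check active even when Python runs with optimizations enabled.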

@clearbluejar
Owner

Did the diff finish?

@justanotheranonymoususer
Author

If one version has symbols and the other doesn't

I don't think that's the case; the file sizes are similar. Here are both files:
old: https://msdl.microsoft.com/download/symbols/windows.ui.xaml.dll/9C04CA1E1226000/windows.ui.xaml.dll
new: https://msdl.microsoft.com/download/symbols/windows.ui.xaml.dll/A6D203221226000/windows.ui.xaml.dll

Did the diff finish?

It failed with:

...
INFO | ghidriff | Completed 5111 at 95%
WARNING| ghidriff | Code diff type not appended for ?close_reset@?$close_invoke_helper@$00P6AXPEAX@_E$1?ReleaseMutex@details@wil@@YAX0@ZPEAX@details@wil@@SAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?close_reset@?$close_invoke_helper@$00P6AXPEAX@_E$1?CloseHandle@details@wil@@YAX0@ZPEAX@details@wil@@SAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?OSMemoryFree@XcpAllocation@@YAXPEAX@Z due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?ReleaseWeak@control_block@details@xref@@QEAAIXZ due to jumptable decomp issue
WARNING| ghidriff | Code diff type not appended for ?_Tidy@?$vector@Vxstring_ptr@@V?$allocator@Vxstring_ptr@@@std@@@std@@AEAAXXZ due to jumptable decomp issue
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Scripts\ghidriff.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\__main__.py", line 82, in main
    pdiff = d.diff_bins(diff[0], diff[1])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 1446, in diff_bins
    pdiff['old_pe_url'] = self.get_pe_download_url(old, pdiff['old_meta'][pe_key])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\ghidra_diff_engine.py", line 820, in get_pe_download_url
    pe_info = get_pe_extra_data(path)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\ghidriff\utils.py", line 41, in get_pe_extra_data
    machine = unpack('<H', word)[0]
              ^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 2 bytes

@clearbluejar
Owner

Ah, it seems the pe_url generation is failing for that binary.

That isn't a critical function; it just gives you a handy wget command line for downloading the original binary.
Like this:
image

Which seems like another issue to resolve. :)
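The `struct.error` in the traceback means fewer than 2 bytes were available where the header field was expected. A defensive version of the idea could look like the Python sketch below, which returns `None` on a short or malformed header instead of raising. It is not the code in ghidriff's `utils.py`: the function names are hypothetical. The key layout (TimeDateStamp as 8 upper-case hex digits followed by SizeOfImage in hex) follows the usual symbol-server convention and is consistent with the two download URLs posted above.

```python
import struct

# Hypothetical helper: derive the symbol-server key from raw PE bytes.
# Returns None instead of raising when the header is short or malformed.
def pe_symbol_key(data):
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # Signature (4 bytes) + COFF header (20 bytes) precede the optional header.
    opt_offset = pe_offset + 24
    if len(data) < opt_offset + 60:
        return None
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    # TimeDateStamp sits 8 bytes after the signature starts.
    (timestamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    # SizeOfImage is at offset 56 of the optional header (PE32 and PE32+).
    (size_of_image,) = struct.unpack_from("<I", data, opt_offset + 56)
    return f"{timestamp:08X}{size_of_image:x}"

# Build the download URL in the same layout as the links in this thread.
def pe_download_url(server, file_name, key):
    return f"{server.rstrip('/')}/{file_name}/{key}/{file_name}"
```

With this shape, the caller can simply skip the optional download link when `pe_symbol_key` returns `None`, rather than aborting a diff that has otherwise completed.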

Storing windows.ui.xaml.pdb in local symbol store (338.91MB) (ConsoleTaskMonitor)
The PDB for the binary is 350 MB!
wow.

And the binary is 18MB...

I just kicked off a local test. I will see if it survives it.

@justanotheranonymoususer
Author

@clearbluejar
Owner

This is how analysis is going:
image

I ran out of heap and actually crashed the JVM. This happened during Ghidra analysis, before ghidriff does any work of its own. I can bump up the JVM heap, but how much will I need? How much RAM are you working with? I can also turn off threading with --no-threaded so that only one binary is analyzed at a time. Trying again.

@justanotheranonymoususer
Author

@clearbluejar
Owner

Ah no, I'm just using the command line on Linux with the regular PDB Universal. Maybe it can't handle it...

@justanotheranonymoususer
Author

@clearbluejar
Owner

Full circle. 🤦‍♂️ Sorry.

I have yet to use MSDIA with Ghidra. Besides enabling the analysis option and running on Windows (that is a requirement for MSDIA, right?), is there anything else you need to run on the PDB to make it work? Or is MSDIA just another PDB parser that handles large files better, so no preprocessing is needed and it can run on the original PDB?

@justanotheranonymoususer
Author

@clearbluejar
Owner

I will need to get back to you once I can test on Windows. I will try to add the options JSON import so that all the Ghidra analysis settings can be enabled.

@justanotheranonymoususer
Author

Ghidra 11 has now been released with some PDB improvements, so maybe it won't OOM anymore. Worth trying.
