
Rz buffer optimize #4779

Open · wants to merge 2 commits into base: dev

Conversation

@tushar3q34 (Contributor):

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not, I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

As described in the issue, RzBuffer is inefficient because methods like read, write, get_whole_buf, etc. fail to be inlined. The changes I made include:

  • Avoiding the RzBufferMethods function-pointer table for RZ_BUFFER_BYTES and RZ_BUFFER_MMAP
  • Replacing the function-pointer calls with direct calls to the actual functions to support potential inlining. For example, instead of using b->methods->init for a byte-type buffer, buf_bytes_init() is called directly, which should get inlined (sketch below).
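A minimal sketch of the idea (simplified; buf_bytes_init stands for the static per-type implementation in buf_bytes.c):

// Sketch only: because the call is direct instead of going through
// b->methods->init, the compiler is free to inline buf_bytes_init.
static bool buf_init(RzBuffer *b, const void *user) {
	switch (b->type) {
	case RZ_BUFFER_BYTES:
	case RZ_BUFFER_MMAP:
		return buf_bytes_init(b, user); // direct, inlinable call
	default:
		rz_return_val_if_fail(b->methods, false);
		return b->methods->init ? b->methods->init(b, user) : true;
	}
}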

Test plan

...

Closing issues

closes #4769
...

@Rot127 (Member) left a comment:

Have you tested the changes with perf or similar already?
I am not sure this is enough, since the compiler does not necessarily know the b->type value; it depends on how the user of RzBuffer calls the buffer API.

It might require an early if (b->type != RZ_BUF_....) return in the calling function, to signal to the compiler which type the RzBuffer object is.
But we should try this out.

@tushar3q34 (Contributor, Author):

How should I test it against test cases that contain hot paths?

If this doesn't work, then one solution could be the following (although I'm not sure):

static bool buf_init(RzBuffer *b, const void *user) {
	rz_return_val_if_fail(b, false); // check b before dereferencing b->type
	if (b->type != RZ_BUFFER_BYTES && b->type != RZ_BUFFER_MMAP) {
		rz_return_val_if_fail(b->methods, false);
		return b->methods->init ? b->methods->init(b, user) : true;
	}
	// Otherwise handle RZ_BUFFER_BYTES / RZ_BUFFER_MMAP directly,
	// e.g. via buf_bytes_init(), so the call can be inlined.
	return buf_bytes_init(b, user);
}

We could also create a wrapper function which handles RZ_BUFFER_BYTES and the others separately.
We could also use a switch statement, since compilers find it easier to inline functions in switch statements.
I will test it once and try to implement these and the other suggested changes.

@Rot127 (Member) commented Dec 18, 2024:

I will test it once and try to implement these and the other suggested changes.

I think the easiest is to write a small C program which you link against librz_util.
In this program, fill the buffer with a lot (50 MB+) of random data, iterate over it, and compare it against some constant.

You have to install Rizin every time you make a change to RzBuffer, of course.
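Something along these lines could serve as the harness (an untested sketch; the pkg-config module name, the data size, and the 0xa1 constant are arbitrary choices):

// bench.c - hypothetical micro-benchmark linked against librz_util.
// Build (module name is an assumption; adjust to your install):
//   gcc -O2 bench.c -o bench $(pkg-config --cflags --libs rz_util)
#include <rz_util.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DATA_SIZE (64 * 1024 * 1024) /* 64 MB of random data */
#define CHUNK 4096

int main(void) {
	ut8 *data = malloc(DATA_SIZE);
	if (!data) {
		return 1;
	}
	for (size_t i = 0; i < DATA_SIZE; i++) {
		data[i] = (ut8)rand();
	}
	RzBuffer *b = rz_buf_new_with_bytes(data, DATA_SIZE);

	ut8 chunk[CHUNK];
	ut64 hits = 0;
	clock_t start = clock();
	rz_buf_seek(b, 0, RZ_BUF_SET);
	st64 r;
	while ((r = rz_buf_read(b, chunk, sizeof(chunk))) > 0) {
		for (st64 i = 0; i < r; i++) {
			hits += (chunk[i] == 0xa1); /* compare against a constant */
		}
	}
	double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
	printf("hits: %" PFMT64u " in %.3f s\n", hits, secs);

	rz_buf_free(b);
	free(data);
	return 0;
}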

@XVilka XVilka requested a review from wargio December 19, 2024 04:45
Comment on lines +1416 to +1417
ut64 sz;
const ut8 *buf;
A Member commented:

Define these at the beginning of the function and remove the other definitions within the if/else.

@tushar3q34 (Contributor, Author):

@Rot127 I ran tests on an input of 100.8 MB, writing it to the buffer once and reading through it 10,000 times, and here is what I found:

  • The CPU time used is comparable whether we call rz_buf_read or buf_bytes_read directly on the RzBuffer, so I do not think the issue is with inlining: calling buf_bytes_read directly should have given slightly better performance, but it didn't.
  • The only case where the two reads showed a big difference was when we try to read more bytes than the buffer's size, or more bytes than remain after the cursor. The heavy extra CPU time of rz_buf_read in this case comes from the memset which fills the remaining memory with a constant:

if (len > result) {
	memset(buf + result, b->Oxff_priv, len - result);
}

Can you please check whether the case where you measured rz_buf_read falls under this second point?

@Rot127 (Member) commented Dec 20, 2024:

@tushar3q34 Thank you for the preliminary results! Can you share your test code, please, so we have the same basis to work with?

@tushar3q34 (Contributor, Author) commented Dec 20, 2024:

Yeah, sure: Test code for RzBuffer

Cases I tested:

  • Multiple reads with the original rz_buf_read
  • Multiple reads with the original buf_bytes_read
  • Multiple reads with rz_buf_read modified according to this PR (buf_read inside rz_buf_read)
  • Multiple reads with rz_buf_read modified in the following way:

switch (b->type) {
case RZ_BUFFER_BYTES:
case RZ_BUFFER_MMAP:
	return buf_bytes_read(b, buf, len);
default:
	rz_return_val_if_fail(b->methods, -1);
	return b->methods->read ? b->methods->read(b, buf, len) : -1;
}

  • Multiple reads with rz_buf_read modified in the following way (which is incorrect, but just for testing):

return buf_bytes_read(b, buf, len);

  • Multiple reads with rz_buf_read modified in the following way (inlining the whole body of buf_bytes_read; again incorrect, but just for testing):

struct buf_bytes_priv *priv = get_priv_bytes(b);
if (!priv->buf) {
	// This can happen when called from buf_mmap.c, when you open a 0-length file.
	return -1;
}
ut64 real_len = priv->length < priv->offset ? 0 : RZ_MIN(priv->length - priv->offset, len);
memmove(buf, priv->buf + priv->offset, real_len);
priv->offset += real_len;
return real_len;

  • For my system, all of the above gave similar results.
  • The difference comes when we remove rz_buf_seek(buf, 0, 0). After removing it, the cursor isn't reset to the beginning, which makes buf_bytes_read significantly faster than rz_buf_read (due to the memset).

@Rot127 (Member) commented Dec 20, 2024:

Ok, this is pretty funny. I rewrote your test a little to be more similar to my slow use case from before, but I couldn't replicate it.
The thing is, I know it was RzBuffer and the callbacks, because when I replaced it with a plain ut8 *, everything got ~3-4 times faster. perf also showed the runtime increase in the callback functions.

Thanks so far for checking it. I will implement my working minimal example tomorrow and send it here.

@wargio (Member) commented Dec 21, 2024:

Just a comment: please be careful when using memmove, since the buffer needs to be immutable (unless it is a copy).

@Rot127 (Member) commented Dec 21, 2024:

@tushar3q34 Ok, so here is the branch with the problem:
https://github.com/Rot127/rizin/tree/rz_buf_runtime
The branch refactors the hexadecimal byte search.

You can build it and run the following command on some larger file:

cat /dev/urandom > input.data
# Press Ctrl + C after a few seconds
rizin -qc "/x a158f101" ./input.data

The relevant functions are:

  • search.c::rz_search_on_io() - It searches on the IO layer. It prepares a list of chunks/windows into memory. Each of those chunks/windows is searched for the bytes in a separate thread (via rz_th_iterate_list()).
  • search.c::search_iterator_io_map_cb() - This function runs concurrently in separate threads. It reads the bytes of a chunk from the file and passes them on to find().
  • find() - A callback to bytes_search.c::bytes_pattern_compare(), which does a simple memcmp against the byte pattern being searched for.

If you simply time the command you get:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m1.618s
user	0m5.596s
sys	0m0.104s

Now, this is a branch which uses ut8 * instead of RzBuffer. The runtime for it is roughly twice as fast:

time rizin -qc "/x a158f101" ./input.data 
Searching... hits: 0
real	0m0.835s
user	0m2.376s
sys	0m0.105s

I don't know how people do performance testing on Windows or macOS, but on Linux you usually use perf. perf shows you how much CPU time (in percent) each function took.

You use it like this:

perf record rizin -qc "/x a158f101" ./input.data
perf report # Shows how much CPU time each function took

The RzBuffer (slow) version gets these results:

  35.68%  rizin    librz_util.so.0.8.0    [.] rz_buf_fwd_scan
  17.83%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  17.32%  rizin    librz_search.so.0.8.0  [.] bytes_pattern_compare
  11.32%  rizin    librz_search.so.0.8.0  [.] bytes_find
  10.96%  rizin    librz_util.so.0.8.0    [.] buf_bytes_get_whole_buf
   2.78%  rizin    librz_search.so.0.8.0  [.] memcmp@plt
   1.43%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.36%  rizin    libc.so.6              [.] _int_free
   0.19%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.18%  rizin    libc.so.6              [.] malloc_consolidate

The (faster) ut8 * version has these statistics:

  64.64%  rizin    libc.so.6              [.] __memcmp_avx2_movbe
  26.04%  rizin    librz_search.so.0.8.0  [.] bytes_find
   3.64%  rizin    libc.so.6              [.] __memmove_avx_unaligned_erms
   0.60%  rizin    libc.so.6              [.] pthread_mutex_lock@@GLIBC_2.2.5
   0.40%  rizin    libc.so.6              [.] __memset_avx2_unaligned_erms
   0.32%  rizin    libc.so.6              [.] _int_free
...

In general it is OK if RzBuffer methods have some overhead, but they should never consume 30%+, no matter what. The largest part of the runtime should be spent in __memcmp.

I hope you can work from there. If you have questions, please let me know. Also, feel free to question my assumption that it is indeed the fetching of function pointers which leads to the performance hit.

Also please note that the "fixed" faster branch has some other edits, but it still does logically more or less the same.

@XVilka XVilka added the performance A performance problem/enhancement label Dec 22, 2024
@tushar3q34 (Contributor, Author):

@Rot127 I ran the performance tests on the branch and modified the code in different ways to get the following results. I ran them using the Time Profiler instrument in Xcode, since my system runs macOS.
For all the tests, the CPU time % is on the left and, for the relevant tests, the call tree is on the right (which also shows which functions were inlined). And yes, your claim was the correct reason for the inefficiency.

  1. RzBuffer without any changes (rz_buf_runtime branch) - similar results to your system
    SCR-20241222-shrt
  • buf_bytes_get_whole_buf was not inlined, as expected, due to the use of function pointers.
  • Another thing causing inefficiency was that bytes_pattern_compare is called from inside rz_buf_fwd_scan; ideally the fwd_scan function pointer (which in this case points to bytes_pattern_compare) would be invoked from bytes_find itself, with rz_buf_fwd_scan inlined away.
  2. With a ut8* + size type buffer (rz-search branch) - similar results to your system
    SCR-20241222-skxr
  3. RzBuffer modified according to the changes in this PR
    SCR-20241222-sltw
  • buf_bytes_get_whole_buf gets inlined.
  • bytes_pattern_compare is called from bytes_find, which is better than the original RzBuffer.
  • This gives < 30% CPU time for rz_buf_fwd_scan. The only issue is that it is still more than __memcmp. The reason is that rz_buf_fwd_scan is large (> 50 lines), while only the first ~20 lines are relevant when b->type == RZ_BUFFER_BYTES.
  4. RzBuffer modified according to the changes in this PR, with rz_buf_fwd_scan changed in this manner (see the sketch below)
    SCR-20241222-sqvd
  • buf_bytes_get_whole_buf gets inlined and bytes_pattern_compare is called from bytes_find.
  • Since the function body shrinks for the more frequent b->type == RZ_BUFFER_BYTES path, it is better optimized (< 30%, and rz_buf_fwd_scan takes less than __memcmp).
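The restructuring I tested looks roughly like this (a sketch; buf_fwd_scan_fallback is a hypothetical helper holding the generic handling for the remaining buffer types):

// Sketch: keep the hot RZ_BUFFER_BYTES/RZ_BUFFER_MMAP path short and
// push the generic handling into a cold helper, so the body the
// compiler has to consider for inlining stays small.
RZ_API ut64 rz_buf_fwd_scan(RzBuffer *b, ut64 start, ut64 amount, RzBufferFwdScan fwd_scan, void *user) {
	if (b->type == RZ_BUFFER_BYTES || b->type == RZ_BUFFER_MMAP) {
		ut64 sz;
		const ut8 *buf = buf_bytes_get_whole_buf(b, &sz);
		if (!buf || sz <= start) {
			return 0;
		}
		return fwd_scan(buf + start, RZ_MIN(sz - start, amount), user);
	}
	return buf_fwd_scan_fallback(b, start, amount, fwd_scan, user); // cold path
}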

How would you like to proceed from here? Should I push the latest changes to the PR for review, or would you prefer a discussion or further refinements?

@XVilka (Member) commented Dec 22, 2024:

@tushar3q34 looks like the 5th option is the best one.

@Rot127 (Member) commented Dec 22, 2024:

@tushar3q34 Nice work! FYI: inlining bytes_pattern_compare() is out of scope, because we are refactoring the search anyway and will have to deal with it in another PR.

It looks like we would get around 5 more percentage points of runtime spent in memcmp with your changes, right? Which is already nice, but not enough for hot paths IMHO. For hot paths it should be similar to the ut8* + size runtime.

Those ~23% on rz_buf_fwd_scan are a lot, but as an API it is really convenient to use. Do you think rz_buf_fwd_scan can be inlined somehow as well? Maybe by adding all buffer types to the switch case, and by giving the compiler a hint about the buffer type with an assert(buf->type == RZ_BUFFER_...) before calling rz_buf_fwd_scan?

This is the only idea I have currently. Of course, we could replace rz_buf_fwd_scan with something like const ut8 *rz_buf_get_raw(RZ_OUT size_t *raw_buf_size), but this defeats the whole purpose of RzBuffer, so it is not an option IMO.

It is an interesting problem, but I honestly don't know how to design it nicely. If you have any ideas, please drop them.

@tushar3q34 (Contributor, Author):

Do you think rz_buf_fwd_scan can be inlined somehow as well?

There is very little chance that this is possible, because of RZ_API (__attribute__((visibility("default")))). But I will try changes and tests according to the suggestions you gave, or any other ideas I come up with.

If you have any ideas, please drop them.

I will think and try a few more things and get back to you. Thanks.

@XVilka (Member) commented Dec 23, 2024:

It's possible to inline an API if it's done like in librz/include/rz_vector.h. It makes using Rizin from other languages harder, though; e.g. with Rust you need to write additional functions in Rust itself to do the same on your own.
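Roughly the pattern, hypothetically adapted to RzBuffer (rz_buf_read_inline is a made-up name; the catch is that the fast path needs buf_bytes.c's private state, which would have to move into the header as well):

// Sketch of the rz_vector.h-style header inline: the wrapper lives in
// the public header, so every caller sees the type check and the fast
// path; the exported symbol remains as a fallback (and for bindings).
static inline st64 rz_buf_read_inline(RzBuffer *b, ut8 *buf, ut64 len) {
	if (b->type != RZ_BUFFER_BYTES && b->type != RZ_BUFFER_MMAP) {
		return rz_buf_read(b, buf, len); // exported fallback for other types
	}
	// Fast path, visible to every caller that includes the header.
	// buf_bytes_read and its private struct would have to move here
	// for this to compile; that is the main trade-off.
	return buf_bytes_read(b, buf, len);
}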

@XVilka (Member) commented Dec 23, 2024:

Another option: you could try the GCC/Clang function attribute __attribute__((always_inline)) to see if it works in this particular case:

#ifdef _MSC_VER
    #define forceinline __forceinline
#elif defined(__clang__) /* check Clang first: it also defines __GNUC__ */
    #if __has_attribute(__always_inline__)
        #define forceinline inline __attribute__((__always_inline__))
    #else
        #define forceinline inline
    #endif
#elif defined(__GNUC__)
    #define forceinline inline __attribute__((__always_inline__))
#else
    #define forceinline inline
#endif

@tushar3q34 (Contributor, Author):

Another option: you could try the GCC/Clang function attribute __attribute__((always_inline)) to see if it works in this particular case

Yes, I tried that, but the issue is that we need the definition of rz_buf_fwd_scan to be emitted, as external linkage is required. If we use this attribute, the compiler omits the out-of-line definition of rz_buf_fwd_scan, which causes a compilation error.

If I use the attribute __attribute__((always_inline, visibility("default"))), there is no improvement in performance, as rz_buf_fwd_scan does not get inlined in our use case.

@tushar3q34 (Contributor, Author) commented Dec 25, 2024:

Hi @Rot127, I tried these methods, but none of them led to inlining of rz_buf_fwd_scan:

  • Using switch statements, as well as assert statements
  • Using attributes (I explained the issue with this above) - it is possible to inline APIs, as @XVilka mentioned, but in this case it is not happening (or, when it does inline, it causes compile errors)
  • Making rz_buf_fwd_scan as small as possible, containing only the RZ_BUFFER_BYTES case and returning directly:

buf = buf_bytes_get_whole_buf(b, &sz);
if (sz <= start) {
	return 0;
}
return fwd_scan(buf + start, RZ_MIN(sz - start, amount), user);

In all of these, rz_buf_fwd_scan did not get inlined. A possible reason could be the passing of the function pointer fwd_scan.
And since it doesn't get inlined, the CPU time % of rz_buf_fwd_scan and __memcmp will be comparable.
However, we do not need to fall back to ut8* + size: we can get similar results with RzBuffer by doing what the ut8* + size branch does. Instead of using rz_buf_fwd_scan, we can do the following:

buf = buffer->methods->get_whole_buf(buffer, &sz); // or other methods to read the whole buffer
bytes_pattern_compare(buf + offset, RZ_MIN(sz - offset, leftovers), user);

Performance in this case:
SCR-20241226-bduo

So we can use RzBuffer to get similar results on hot paths, but rz_buf_fwd_scan is not the best way.
Let me know your thoughts, and how you'd like to proceed further. Thanks.

Labels: performance (A performance problem/enhancement), RzUtil

Successfully merging this pull request may close these issues: RzBuffer is inefficient (only on hot paths)