
Bifrost pipeline doesn't seem to be very fast #50

Closed
telegraphic opened this issue Nov 4, 2016 · 7 comments
@telegraphic (Collaborator) commented Nov 4, 2016

I am finally playing around with the bifrost numpy interface, and it seems to be pretty slow. I made a script, your_first_pipeline.py, which takes a test vector and does FFT -> IFFT -> FFT -> IFFT (so the output should be equal to the input):

import time
import sys
import numpy as np
from bifrost.block import Pipeline, NumpyBlock, NumpySourceBlock


# Read data size and iteration count from the command line
arr_size = int(sys.argv[1])
n_iter   = int(sys.argv[2])


# Our simple test-bench
def generate_test_vectors():
    for ii in xrange(n_iter):
        yield np.arange(0, arr_size, 1, dtype='complex64')


def validate_test_vectors(x):
    v = np.arange(0, arr_size, 1, dtype='complex64')
    assert np.allclose(x, v)


# Benchmark with Python only
t0 = time.time()
for xarr in generate_test_vectors():
    xfft = np.fft.fft(xarr)
    xifft = np.fft.ifft(xfft)
    xfft = np.fft.fft(xifft)
    xifft = np.fft.ifft(xfft)
    validate_test_vectors(xifft)

t1 = time.time()
print "Time taken: ", t1-t0

# Benchmark with bifrost
t0 = time.time()
blocks = [
  (NumpySourceBlock(generate_test_vectors), {'out_1': 'raw'}),
  (NumpyBlock(np.fft.fft),  {'in_1': 'raw',   'out_1': 'fft0'}),
  (NumpyBlock(np.fft.ifft), {'in_1': 'fft0',  'out_1': 'ifft0'}),
  (NumpyBlock(np.fft.fft),  {'in_1': 'ifft0', 'out_1': 'fft1'}),
  (NumpyBlock(np.fft.ifft), {'in_1': 'fft1',  'out_1': 'ifft1'}),
  (NumpyBlock(validate_test_vectors, outputs=0), {'in_1': 'ifft1'})]

Pipeline(blocks).main()
t1 = time.time()
print "Bifrost time: ", t1-t0

And I am doing basic speed tests:

dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 1000 1000
Time taken:  0.0867929458618
Bifrost time:  5.81598901749
dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 10000 100
Time taken:  0.0630059242249
Bifrost time:  0.601629018784
dancpr@bldcpr:/bldata/bifrost/examples$ python your_first_pipeline.py 10000 1000
Time taken:  0.624031066895
Bifrost time:  5.99617195129

But it seems like bifrost is slower in most circumstances... What's up?

@MilesCranmer (Collaborator)

I am finding similar results when I run this script. From reading through the code, everything looks set up exactly as it should be, so it does look like there is an internal problem with the Pipeline or NumpyBlock classes.
The speed tests I ran two months ago, which showed a 2.5x speedup over the serial equivalent (10,000-element complex64 arrays going through an 8-block FFT/IFFT pipeline), did use NumpyBlock but not NumpySourceBlock, so that might be one contributor to the slow speed. I can't think of any other recent changes that would cause the issue.

I will look a bit more for the cause of this before the telecon, and will try to study it more over the weekend.

Which branch are you using btw?

@MilesCranmer (Collaborator)

Hi Danny,

I figured part of the problem out. The speed tests I did in September did use the NumpySourceBlock, but had the following flag set: changing=False (this flag exists on the branches /example/fft-speed-test and /feature/multi-sequence-support). However, I don't think this will completely solve the speed problem even on those branches (see my next point below).

This flag defaults to True, and what it does is force the NumpySourceBlock to decide for itself when it has to create a new sequence/header on the output ring. It does this by checking the generated arrays' shape/size/dtype for changes at every single iteration, which I can see makes it quite slow. I made the default True because I thought it was safer to assume the user's generator might start outputting arrays of a different size, but that default behaviour is clearly expensive.
Would it be a good idea to change the default to changing=False instead? It comes down to safety vs. speed as the default behaviour (though a hidden bottleneck can be harder to localize than a broken block, which becomes evident immediately).
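For concreteness, here is a sketch of what that would look like in Danny's benchmark, assuming the changing keyword as it exists on those two branches (it reuses generate_test_vectors and validate_test_vectors from the script above):

import numpy as np
from bifrost.block import Pipeline, NumpyBlock, NumpySourceBlock

blocks = [
    # changing=False promises that every generated array has the same
    # shape/size/dtype, so the source block can skip the per-iteration
    # header comparison and keep writing to a single output sequence.
    (NumpySourceBlock(generate_test_vectors, changing=False), {'out_1': 'raw'}),
    (NumpyBlock(np.fft.fft),  {'in_1': 'raw',  'out_1': 'fft0'}),
    (NumpyBlock(np.fft.ifft), {'in_1': 'fft0', 'out_1': 'ifft0'}),
    (NumpyBlock(validate_test_vectors, outputs=0), {'in_1': 'ifft0'})]

Pipeline(blocks).main()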

The other part of the problem is that I don't think the current versions of /master or the other most up-to-date branches are actually parallelizing blocks on my machine (this should happen via the core-affinity setting in the _main() method of MultiTransformBlock). Even when I increase the array size to 1,000,000, to minimize the impact of Bifrost's overhead on pipeline performance, the Bifrost pipeline only approaches the serial speed. I will have to look at this more next week.
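One quick diagnostic, sketched below (the traced wrapper is hypothetical, not part of Bifrost, and os.sched_getaffinity is Linux-only): wrap each block's callback so it reports which thread and core set it runs on. If every FFT block reports the same core set, the per-block affinity pinning isn't taking effect.

import os
import threading
import numpy as np
from bifrost.block import Pipeline, NumpyBlock, NumpySourceBlock

def traced(fn, label):
    # Wrap a block callback so it logs the thread and CPU set it runs on.
    def wrapper(x):
        cores = getattr(os, 'sched_getaffinity', lambda pid: 'n/a')(0)
        print("%s: thread=%s cores=%s" % (label, threading.current_thread().name, cores))
        return fn(x)
    return wrapper

blocks = [
    (NumpySourceBlock(generate_test_vectors), {'out_1': 'raw'}),
    (NumpyBlock(traced(np.fft.fft, 'fft0')),   {'in_1': 'raw',  'out_1': 'fft0'}),
    (NumpyBlock(traced(np.fft.ifft, 'ifft0')), {'in_1': 'fft0', 'out_1': 'ifft0'}),
    (NumpyBlock(validate_test_vectors, outputs=0), {'in_1': 'ifft0'})]
Pipeline(blocks).main()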

Cheers,

Miles

@jaycedowell (Member)

Since I'm going through old tickets, I decided to try this out under the autoconf branch and Py3.6:

$ python your_first_pipeline.py 1000 1000
Time taken:  0.3951113224029541
Bifrost time:  10.06708550453186
$ python your_first_pipeline.py 10000 100
Time taken:  0.19725704193115234
Bifrost time:  1.342881441116333
$ python your_first_pipeline.py 10000 1000
Time taken:  1.9654483795166016
Bifrost time:  9.767495155334473

I had similar results when setting changing=False on the NumpySourceBlock.

If I go to larger array sizes, the trend starts to reverse:

$ python your_first_pipeline.py 100000 1000
Time taken:  18.668033123016357
Bifrost time:  8.693476915359497
$ python your_first_pipeline.py 1000000 100
Time taken:  22.8755042552948
Bifrost time:  9.941920518875122

Digging into this a little more, it looks like the small-array Bifrost times are dominated by a slowdown in the input ring. For example, with python your_first_pipeline.py 1000 10, the NumPy time is ~4 ms and the Bifrost time is ~630 ms. Of that, ~550 ms can be attributed to a single slow span acquire. If you ignore that one gulp, the runtime drops to ~80 ms, which I think is much more in line with the expected overheads for Bifrost. What do you do about it? No idea. It always seems to be there, even if I have a pipeline that consists of only a single NumpySourceBlock.
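A minimal reproducer sketch for that last observation, reusing the block.py API from the script above (the generator name and sizes here are just placeholders):

import time
import numpy as np
from bifrost.block import Pipeline, NumpySourceBlock

def tiny_source():
    # Ten small arrays, mirroring python your_first_pipeline.py 1000 10
    for _ in range(10):
        yield np.arange(0, 1000, 1, dtype='complex64')

t0 = time.time()
# A pipeline consisting of only the source block should still show the
# ~0.5 s hit from the single slow span acquire described above.
Pipeline([(NumpySourceBlock(tiny_source), {'out_1': 'raw'})]).main()
print("Source-only Bifrost time: %.3f s" % (time.time() - t0))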

@telegraphic (Collaborator, Author)

I think this is using an older version of Block -- not sure if anyone is actually using this? I've been using
https://github.com/ledatelescope/bifrost/blob/d9d23d2967c824f8c09158b4e9c81ca46b6db41e/python/bifrost/pipeline.py#L319

I think we should deprecate the block.py version in favor of supporting the pipeline.py version ...

@telegraphic (Collaborator, Author)

PS: on the topic of slow performance, I noted that if CUDA_CACHE_DIR points to an NFS mount, pipeline startup is really slow. (I can't recall if I documented this elsewhere.)

(The broader suggestion is to set CUDA_CACHE_DIR, and/or symlink the ~/.nv directory, to somewhere on /dev/shm on compute nodes whose home directories live on a shared filesystem.)

@jaycedowell (Member) commented Nov 12, 2021

Yeah, the whole block.py vs. pipeline.py thing seems to be part of a theme in Bifrost right now of having two ways to do the same thing. I'll open an issue on that (see #156).

On your second point, we should try to collect this kind of wisdom somewhere. I'm not sure where, though. The documentation? The tutorials?

@jaycedowell (Member)

After some additional digging, it looks like the slowdown comes from the closeout of the innermost ExitStack used in the write method of the base block. Given that I now understand where/why it happens, and that I don't think anyone is using block.py, I am inclined to close this. @telegraphic, any objections?
