Timings for PFAC_matchFromDeviceReduce vs PFAC_matchFromHostReduce #1

Open
GoogleCodeExporter opened this issue May 1, 2015 · 1 comment

@GoogleCodeExporter

What version of the product are you using? On what operating system?
PFAC 1.0, on RHEL 6

Please provide any additional information below.

I measured the time taken by PFAC_matchFromHostReduce and by the equivalent
steps built around PFAC_matchFromDeviceReduce. For a 100 MB input string,
both take about the same time to complete.

Timing for PFAC_matchFromHostReduce: 56 ms

Timing for equivalent steps using PFAC_matchFromDeviceReduce:
cudaMalloc: 0.3 ms
cudaMemcpy(d_input_string, h_input_string, input_size, cudaMemcpyHostToDevice): 18 ms
PFAC_matchFromDeviceReduce: 26 ms
cudaMemcpy of d_pos and d_match_result back to CPU: 0.3 ms
cudaFree of d_input_string, d_pos and d_match_result: 11 ms
Total: 57 ms
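
As a quick check on the breakdown above, the itemized steps can be summed directly. The sketch below is plain C with the reported timings hard-coded in microseconds; the array and function names are illustrative and not part of PFAC. The itemized steps add up to 55.6 ms, so the reported total of 57 ms presumably includes rounding or overhead that is not itemized, and about 29.6 ms of the itemized time is spent outside the matching call itself.

```c
#include <stddef.h>

/* Step timings in microseconds, copied from the report above. */
static const long step_us[] = {
    300,    /* cudaMalloc */
    18000,  /* cudaMemcpy, host to device */
    26000,  /* PFAC_matchFromDeviceReduce */
    300,    /* cudaMemcpy of results, device to host */
    11000,  /* cudaFree */
};

/* Sum of all itemized steps, in microseconds. */
long itemized_total_us(void) {
    long total = 0;
    for (size_t i = 0; i < sizeof step_us / sizeof step_us[0]; ++i)
        total += step_us[i];
    return total;
}

/* Everything except the matching call itself (index 2 above). */
long overhead_us(void) {
    return itemized_total_us() - step_us[2];
}
```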

Original issue reported on code.google.com by [email protected] on 29 Apr 2011 at 1:49

@GoogleCodeExporter

PFAC_matchFromHostReduce() has to allocate and free its own working space
(d_input_string, d_matched_result and d_pos), so it performs the same steps
you timed:
[code]
    /* allocate working space on the device */
    cudaMalloc((void **) &d_input_string, n_hat*sizeof(int) );
    cudaMalloc((void **) &d_matched_result, input_size*sizeof(int) );
    cudaMalloc((void **) &d_pos, input_size*sizeof(int) );

    /* copy the input string to the device */
    cudaMemcpy(d_input_string, h_input_string, input_size, cudaMemcpyHostToDevice);

    /* ... same steps as PFAC_matchFromDeviceReduce() ... */

    /* copy the matched positions and results back to the host */
    cudaMemcpy(h_pos,          d_pos,              (*h_num_matched)*sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_match_result, d_match_result_zip, (*h_num_matched)*sizeof(int), cudaMemcpyDeviceToHost);

    /* release working space */
    cudaFree(d_input_string);
    cudaFree(d_matched_result);
    cudaFree(d_pos);
[/code]   

In my tests, cudaFree() takes about 12 ms for a 100 MB input string and about
24 ms for a 200 MB input string.

If you are not a beginner, I suggest using PFAC_matchFromDeviceReduce(), since it lets you manage the device buffers yourself.
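
To see why managing the buffers yourself can pay off, here is a rough cost estimate in plain C. It uses the timings reported in this thread for a 100 MB input and assumes they stay constant from call to call, and that every call uses the same input size so the device buffers can be allocated once and reused. The constant and function names are hypothetical; nothing here calls PFAC or CUDA.

```c
/* Timings in microseconds, taken from the figures reported in this thread.
 * Assumption: they do not change from one call to the next. */
enum {
    HOST_REDUCE_US = 56000, /* one PFAC_matchFromHostReduce call */
    MALLOC_US      = 300,   /* cudaMalloc, paid once if buffers are reused */
    FREE_US        = 11000, /* cudaFree, paid once if buffers are reused */
    COPY_IN_US     = 18000, /* host to device copy of the input string */
    MATCH_US       = 26000, /* PFAC_matchFromDeviceReduce */
    COPY_OUT_US    = 300    /* device to host copy of the results */
};

/* Estimated total when every call allocates and frees internally. */
long host_reduce_total_us(long calls) {
    return calls * HOST_REDUCE_US;
}

/* Estimated total when buffers are allocated once, reused for every
 * call, and freed once at the end. */
long device_reduce_reuse_total_us(long calls) {
    if (calls <= 0)
        return 0;
    return MALLOC_US + FREE_US
         + calls * (COPY_IN_US + MATCH_US + COPY_OUT_US);
}
```

Under these assumptions a single call costs about the same either way (56 ms versus 55.6 ms), while ten calls come to roughly 560 ms with PFAC_matchFromHostReduce and roughly 454 ms with reused buffers.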

Original comment by [email protected] on 29 Apr 2011 at 2:18
