rex is >7x slower than hsds #169

Open
ssolson opened this issue Jan 10, 2024 · 5 comments

@ssolson
Contributor

ssolson commented Jan 10, 2024

I noticed that rex takes significantly longer to run than hsds for the same call, so I wrote the script at the bottom of this issue to spot-check the performance of each.

This issue is really just a question: do you have an idea why this happens, or is it expected for some reason?

Comparison of Execution Times (in seconds):

On average, the HSDS method is faster by a factor of 7.62.

HSDS Method:

  • Minimum Time: 0.681
  • Maximum Time: 0.762
  • Average Time: 0.722

Rex Method:

  • Minimum Time: 4.958
  • Maximum Time: 6.380
  • Average Time: 5.498

from rex import WindX
import h5pyd
import pandas as pd
import time

def measure_hsds_execution_time():
    start_time = time.time()

    f = h5pyd.File("/nrel/wtk/conus/wtk_conus_2014.h5", 'r')
    time_index = pd.to_datetime(f['time_index'][...].astype(str))
    print(time_index)

    return time.time() - start_time

def measure_rex_execution_time():
    start_time = time.time()
    wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
    with WindX(wtk_file, hsds=True) as f:
        time_index = f.time_index
        print(time_index)

    return time.time() - start_time

# Function to calculate min, max, and average times
def calculate_stats(times):
    min_time = min(times)
    max_time = max(times)
    avg_time = sum(times) / len(times)
    return min_time, max_time, avg_time

# Pause for 5 seconds between calls
def wait():
    time.sleep(5)

# Running the script 5 times and recording execution times
hsds_execution_times = []
rex_execution_times = []

for _ in range(5):
    hsds_execution_times.append(measure_hsds_execution_time())
    wait()
    rex_execution_times.append(measure_rex_execution_time())
    wait()

# Calculating stats for each method
hsds_min, hsds_max, hsds_avg = calculate_stats(hsds_execution_times)
rex_min, rex_max, rex_avg = calculate_stats(rex_execution_times)

# Printing comparison
print("\nComparison of Execution Times (in seconds):\n")
print(f"HSDS Method:")
print(f"  Minimum Time: {hsds_min:.3f}")
print(f"  Maximum Time: {hsds_max:.3f}")
print(f"  Average Time: {hsds_avg:.3f}")

print(f"\nRex Method:")
print(f"  Minimum Time: {rex_min:.3f}")
print(f"  Maximum Time: {rex_max:.3f}")
print(f"  Average Time: {rex_avg:.3f}")

# Comparing the average times and calculating the speed difference
if hsds_avg < rex_avg:
    speed_difference = rex_avg / hsds_avg
    print(f"\nOn average, the HSDS method is faster by a factor of {speed_difference:.2f}.")
else:
    speed_difference = hsds_avg / rex_avg
    print(f"\nOn average, the Rex method is faster by a factor of {speed_difference:.2f}.")

@grantbuster
Member

grantbuster commented Jan 10, 2024

Not sure off the top of my head... At first glance, yes, this is surprising. The rex resource classes should not be doing anything too fancy here. Some ideas (none of which I am fully convinced by):

  1. h5pyd caches some data behind the scenes; it is possible that WindX.__exit__() is clearing the cache
  2. Possible that gracefully closing the file handler in the with statement takes extra time
  3. Possible that the preflight / verification checks in WindX.__init__ or WindX.time_index are taking a long time
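
One way to tell these three hypotheses apart is to time opening, reading, and closing as separate phases instead of as one block. The following is a minimal sketch, not a tested diagnosis: `timed`, `phase_times`, and `time_phases` are hypothetical helpers that do not exist in rex or h5pyd, and the sketch assumes the opened object exposes a `close()` method.

```python
import time
from contextlib import contextmanager

# Hypothetical helper state (not part of rex or h5pyd): accumulated
# wall-clock seconds per named phase.
phase_times = {}


@contextmanager
def timed(phase):
    """Add the elapsed wall-clock time of the with-block to phase_times."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        phase_times[phase] = phase_times.get(phase, 0.0) + elapsed


def time_phases(open_file, read_time_index):
    """Time open, read, and close separately.

    open_file: zero-argument callable returning an object with close().
    read_time_index: callable taking that object and returning the data.
    """
    with timed("open"):
        handle = open_file()
    with timed("read"):
        result = read_time_index(handle)
    with timed("close"):
        handle.close()
    return result
```

For the direct route, `open_file` could be `lambda: h5pyd.File(path, 'r')` and `read_time_index` could be `lambda f: f['time_index'][...]`, where `path` stands for the file path used in the script. A rex variant would construct `WindResource(path, hsds=True)` and read `.time_index`, assuming the resource object can be closed with `close()`. Comparing the entries in `phase_times` afterwards would show whether the extra time sits in initialization, the read itself, or the close.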

Ideas for a more direct comparison:

  1. Do a formal code profile with cProfile or something similar
  2. Use the WindResource class instead of WindX
  3. Use a with statement in both cases
  4. For the WindResource class, try something simple outside of the property like pd.to_datetime(WindResource['time_index'].astype(str)) (I think this will work, maybe not)
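
The first idea, a formal profile, can be sketched with the standard-library `cProfile` and `pstats` modules. `profile_call` below is a hypothetical helper, not something rex provides. It runs any callable under the profiler and returns a text report sorted by cumulative time, so the slowest call paths appear first.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, report_text).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Applied to the script in this issue, the call would look like `_, report = profile_call(measure_rex_execution_time)` followed by `print(report)`, and the same for `measure_hsds_execution_time`. Entries that appear only in the rex report would be the first candidates to investigate.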

@ssolson
Contributor Author

ssolson commented Jan 10, 2024

Thanks for the suggestions Grant.

Looking at

  1. https://github.com/NREL/rex/tree/main/examples/WIND
  2. https://nrel.github.io/rex/misc/examples.wind.html

WindResource is not mentioned in either. To check my understanding: one should still access these resources via WindX in production, but the WindResource class would be slightly more optimized and could help us figure out whether there is overhead in using the WindX class?

@grantbuster
Member

The "extraction" classes add some quality-of-life features (e.g., lat/lon lookup and SAM dataframe extraction) but are ultimately just wrappers of the base resource classes (e.g., WindResource). We typically advertise the extraction classes to the public because of the nice features but the base resource classes have less overhead.

@ssolson
Copy link
Contributor Author

ssolson commented Jan 10, 2024

Grant,

Using WindResource and a with statement in both methods did not improve the results (current script below).

A formal code profile is beyond the time I have available to help with this issue. I was not sure what you meant by suggestion 4 above, so I did not try it, but I would not expect it to make much of a difference.

from rex import WindX, WindResource
import h5pyd
import pandas as pd
import time

def measure_hsds_execution_time():
    start_time = time.time()

    with h5pyd.File("/nrel/wtk/conus/wtk_conus_2014.h5", 'r') as f:
        time_index = pd.to_datetime(f['time_index'][...].astype(str))
        print(time_index)

    return time.time() - start_time

def measure_rex_execution_time():
    start_time = time.time()
    wtk_file = '/nrel/wtk/conus/wtk_conus_2014.h5'
    with WindResource(wtk_file, hsds=True) as f:
        time_index = f.time_index
        print(time_index)

    return time.time() - start_time

# Function to calculate min, max, and average times
def calculate_stats(times):
    min_time = min(times)
    max_time = max(times)
    avg_time = sum(times) / len(times)
    return min_time, max_time, avg_time

# Pause for 5 seconds between calls
def wait():
    time.sleep(5)

# Running the script 5 times and recording execution times
hsds_execution_times = []
rex_execution_times = []

for _ in range(5):
    hsds_execution_times.append(measure_hsds_execution_time())
    wait()
    rex_execution_times.append(measure_rex_execution_time())
    wait()

# Calculating stats for each method
hsds_min, hsds_max, hsds_avg = calculate_stats(hsds_execution_times)
rex_min, rex_max, rex_avg = calculate_stats(rex_execution_times)

# Printing comparison
print("\nComparison of Execution Times (in seconds):\n")
print(f"HSDS Method:")
print(f"  Minimum Time: {hsds_min:.3f}")
print(f"  Maximum Time: {hsds_max:.3f}")
print(f"  Average Time: {hsds_avg:.3f}")

print(f"\nRex Method:")
print(f"  Minimum Time: {rex_min:.3f}")
print(f"  Maximum Time: {rex_max:.3f}")
print(f"  Average Time: {rex_avg:.3f}")

# Comparing the average times and calculating the speed difference
if hsds_avg < rex_avg:
    speed_difference = rex_avg / hsds_avg
    print(f"\nOn average, the HSDS method is faster by a factor of {speed_difference:.2f}.")
else:
    speed_difference = hsds_avg / rex_avg
    print(f"\nOn average, the Rex method is faster by a factor of {speed_difference:.2f}.")

@grantbuster
Member

Okay, well, thanks for the heads up about the possible performance issues!
