
better estimating the file size #275

Merged: 11 commits merged into main from estimate-size on Feb 10, 2025
Conversation

uriii3 (Collaborator) commented Jan 23, 2025

We discovered some discrepancies between the actual size of the downloaded dataset and our estimation. Although the proposed new method is not bullet-proof either, it looks like it will be more consistent in its estimates.

Trying to solve issues CMT-175 and CMT-171.


📚 Documentation preview 📚: https://copernicusmarine--275.org.readthedocs.build/en/275/

uriii3 (Collaborator, Author) commented Jan 23, 2025

Some examples with their discrepancies:

Example 1:

copernicusmarine subset --dataset-id cmems_mod_glo_phy_myint_0.083deg_P1D-m --variable thetao --start-datetime 2024-10-22T00:00:00 --end-datetime 2024-10-22T00:00:00 --minimum-longitude -180 --maximum-longitude 179.9166717529297 --minimum-latitude -80 --maximum-latitude 90 --minimum-depth 0.49402499198913574 --maximum-depth 5727.9169921875

Estimated: 3365 MB
Real final size: 881 MB -> this is the weirdest case; a thorough investigation may be warranted.

Example 2:

copernicusmarine subset --dataset-id cmems_mod_glo_phy_anfc_0.083deg-climatology-uncertainty_P1M-m -t 1990

Estimated: 1211 MB -> this one has improved from an old estimation of 2400 MB! (so we are doing better now)
Real final size: 1270 MB

Example 3:

copernicusmarine subset -i baltic_omi_health_codt_volume -t 2020     

Estimated: 16 kB -> this has improved from 5 kB, which means we are doing better now!
Real final size: 23 kB

uriii3 (Collaborator, Author) commented Jan 23, 2025

Maybe (seeing that all checks have passed) we could add some checks on files that we are already downloading, to verify that the estimated size and the final size don't diverge by more than a threshold value? What do you think @renaudjester?
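A minimal sketch of such a check, assuming a relative tolerance (size_variance) plus an absolute slack (offset_size); the helper name and the tolerance values are illustrative, not part of this PR:

# Hypothetical helper: flag downloads whose pre-download estimate
# diverges from the size of the file actually written by more than
# a relative variance plus a small absolute offset.
def assert_estimate_within_threshold(
    estimated_mb: float,
    real_mb: float,
    size_variance: float = 0.5,  # assumed 50% relative tolerance
    offset_size: float = 1.0,  # assumed 1 MB absolute slack for tiny files
) -> None:
    assert estimated_mb >= real_mb * (1 - size_variance) - offset_size
    assert estimated_mb <= real_mb * (1 + size_variance) + offset_size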

uriii3 (Collaborator, Author) commented Jan 23, 2025

I created an issue in xarray; hopefully they also have some insights into the problem: pydata/xarray#9979.

renaudjester self-requested a review January 23, 2025 15:43

renaudjester (Collaborator) left a comment:

So it's improved but not perfect, right? And it doesn't solve all the use cases of CMT-175, I guess.

uriii3 (Collaborator, Author) commented Jan 29, 2025

I would also like to add some tests that use tmp_path, as the function is quite hand-made.
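A sketch of what such a tmp_path test could look like; run_subset is a hypothetical stand-in for whatever helper performs the download and returns the pre-download estimate:

# Hedged sketch: download a small subset into pytest's tmp_path fixture
# and compare the estimated size against the file that was really written.
def test_estimated_size_tracks_real_size(tmp_path):
    estimated_bytes, output_file = run_subset(  # hypothetical helper
        dataset_id="baltic_omi_health_codt_volume",
        year=2020,
        output_directory=tmp_path,
    )
    real_bytes = output_file.stat().st_size
    # reuse the threshold helper sketched earlier, converting to MB
    assert_estimate_within_threshold(estimated_bytes / 1048e3, real_bytes / 1048e3)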

uriii3 requested a review from renaudjester February 5, 2025 15:17
>= response["file_size"] * (1 - size_variance) - offset_size
)
assert response["file_size"] <= response["data_transfer_size"]
return
Collaborator: do you need this return?

Collaborator Author: I don't know actually

Collaborator Author: I checked, and it is more of a style thing. I also checked the repo, and it looks like all three styles are in it. I'm used to this one, but we can standardise the usage if you want!

Collaborator: okay then, good for me!

Comment on lines 239 to 244
# compressed = False
for variable in dataset.data_vars:
variables_size += dataset[variable].encoding["dtype"].itemsize
# compressed = True if "add_offset" in dataset[variable].encoding else False
# if not compressed:
# return dataset.nbytes / 1048e3
Collaborator: Comments
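For context, a minimal sketch of the itemsize-based estimate the hunk above is building; the function name and the assumption that every variable shares the subset grid are illustrative, not the PR's actual implementation:

import xarray as xr

def estimate_file_size_mb(dataset: xr.Dataset) -> float:
    # Sum the on-disk itemsize per point: encoding["dtype"] is the packed
    # dtype written to disk (e.g. int16 with scale_factor/add_offset),
    # which can be much smaller than the in-memory float64.
    bytes_per_point = 0
    for variable in dataset.data_vars:
        bytes_per_point += dataset[variable].encoding["dtype"].itemsize
    # Multiply by the point count of the largest variable and convert with
    # the same 1048e3 divisor as the commented-out fallback above.
    n_points = max(dataset[v].size for v in dataset.data_vars)
    return bytes_per_point * n_points / 1048e3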

uriii3 merged commit f4f722c into main on Feb 10, 2025 (3 checks passed)
uriii3 deleted the estimate-size branch February 10, 2025 14:34