
better estimating the file size #275

Merged: 11 commits merged into main from estimate-size on Feb 10, 2025
Conversation

uriii3 (Collaborator) commented Jan 23, 2025

We discovered some discrepancies between the actual size of the downloaded dataset and our estimation. Although the proposed new method is not bullet-proof either, it looks like it will be more consistent in its estimates.

Trying to solve issues CMT-175 and CMT-171.


📚 Documentation preview 📚: https://copernicusmarine--275.org.readthedocs.build/en/275/

uriii3 (Collaborator, Author) commented Jan 23, 2025

Some examples with their discrepancies:

Example 1:

copernicusmarine subset --dataset-id cmems_mod_glo_phy_myint_0.083deg_P1D-m --variable thetao --start-datetime 2024-10-22T00:00:00 --end-datetime 2024-10-22T00:00:00 --minimum-longitude -180 --maximum-longitude 179.9166717529297 --minimum-latitude -80 --maximum-latitude 90 --minimum-depth 0.49402499198913574 --maximum-depth 5727.9169921875

Estimated: 3365 MB
Real final size: 881 MB -> this is the weirdest case; a thorough investigation may be warranted.

Example 2:

copernicusmarine subset --dataset-id cmems_mod_glo_phy_anfc_0.083deg-climatology-uncertainty_P1M-m -t 1990

Estimated: 1211 MB -> this one has improved from an old estimation of 2400 MB! (so we are doing better now)
Real final size: 1270 MB

Example 3:

copernicusmarine subset -i baltic_omi_health_codt_volume -t 2020     

Estimated: 16 kB -> this has improved from 5 kB, which means we are doing better now!
Real final size: 23 kB

uriii3 (Collaborator, Author) commented Jan 23, 2025

Maybe (seeing that all checks have passed) we could add some checks on files that we are already downloading, to verify that the estimated size and the final size don't diverge by more than a threshold value? What do you think @renaudjester?
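A minimal sketch of such a check, assuming a relative tolerance (size_variance) plus an absolute slack (offset_size); the helper name and the tolerance values are illustrative, not part of this PR:

# Hypothetical helper: flag downloads whose pre-download estimate
# diverges from the size of the file actually written by more than
# a relative variance plus a small absolute offset.
def assert_estimate_within_threshold(
    estimated_mb: float,
    real_mb: float,
    size_variance: float = 0.5,  # assumed 50% relative tolerance
    offset_size: float = 1.0,  # assumed 1 MB absolute slack for tiny files
) -> None:
    assert estimated_mb >= real_mb * (1 - size_variance) - offset_size
    assert estimated_mb <= real_mb * (1 + size_variance) + offset_size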

uriii3 (Collaborator, Author) commented Jan 23, 2025

I created an issue in xarray; hopefully they also have some insights into the problem: pydata/xarray#9979.

renaudjester self-requested a review January 23, 2025 15:43

renaudjester (Collaborator) left a comment:

So it's improved but not perfect, right? And it doesn't solve all the use cases of CMT-175, I guess.

uriii3 (Collaborator, Author) commented Jan 29, 2025

I would also like to add some tests that use tmp_path, as the function is quite hand-made.
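A sketch of what such a tmp_path test could look like; run_subset is a hypothetical stand-in for whatever helper performs the download and returns the pre-download estimate:

# Hedged sketch: download a small subset into pytest's tmp_path fixture
# and compare the estimated size against the file that was really written.
def test_estimated_size_tracks_real_size(tmp_path):
    estimated_bytes, output_file = run_subset(  # hypothetical helper
        dataset_id="baltic_omi_health_codt_volume",
        year=2020,
        output_directory=tmp_path,
    )
    real_bytes = output_file.stat().st_size
    # reuse the threshold helper sketched earlier, converting to MB
    assert_estimate_within_threshold(estimated_bytes / 1048e3, real_bytes / 1048e3)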

uriii3 requested a review from renaudjester February 5, 2025 15:17
>= response["file_size"] * (1 - size_variance) - offset_size
)
assert response["file_size"] <= response["data_transfer_size"]
return
Collaborator: do you need this return?

Collaborator Author: I don't know actually

Collaborator Author: I checked, and it is more of a style thing. I also checked the repo, and it looks like all three styles are in it. I'm used to this one, but we can standardise the usage if you want!

Collaborator: okay then, good for me!

Comment on lines 239 to 244
# compressed = False
for variable in dataset.data_vars:
variables_size += dataset[variable].encoding["dtype"].itemsize
# compressed = True if "add_offset" in dataset[variable].encoding else False
# if not compressed:
# return dataset.nbytes / 1048e3
Collaborator: Comments
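For context, a minimal sketch of the itemsize-based estimate the hunk above is building; the function name and the assumption that every variable shares the subset grid are illustrative, not the PR's actual implementation:

import xarray as xr

def estimate_file_size_mb(dataset: xr.Dataset) -> float:
    # Sum the on-disk itemsize per point: encoding["dtype"] is the packed
    # dtype written to disk (e.g. int16 with scale_factor/add_offset),
    # which can be much smaller than the in-memory float64.
    bytes_per_point = 0
    for variable in dataset.data_vars:
        bytes_per_point += dataset[variable].encoding["dtype"].itemsize
    # Multiply by the point count of the largest variable and convert with
    # the same 1048e3 divisor as the commented-out fallback above.
    n_points = max(dataset[v].size for v in dataset.data_vars)
    return bytes_per_point * n_points / 1048e3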

uriii3 merged commit f4f722c into main on Feb 10, 2025 (3 checks passed)
uriii3 deleted the estimate-size branch February 10, 2025 14:34