
Set time arrays explicitly to datetime64[ns] in parsers; remove ZarrCombine & duplicate / old ping_time stuff [all tests ci] #1117

Merged: 11 commits, Aug 19, 2023

Conversation

emiliom
Collaborator

@emiliom emiliom commented Aug 5, 2023

We're now getting these run-time warnings on open_raw:

.../OSOceanAcoustics/echopype/echopype/convert/set_groups_ek80.py:161: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
ds = xr.Dataset(

There don't seem to be any actual, negative consequences, but it's off-putting and unnecessarily concerning to users. For EK80, there are about 20 of these warnings being generated!

The warnings likely result from the unpinning of Pandas, which now allows Pandas 2 to be used.

The solution is to recast or explicitly set every parsed time array to datetime64[ns], in the parsers.

  • AZFP
  • EK60
  • EK80
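A minimal sketch of the fix described above: cast a parsed time array to nanosecond precision before it reaches the xarray constructors. The variable names are illustrative, not actual echopype parser code.

```python
import numpy as np

# A parser might naturally produce ms-precision timestamps (illustrative data).
timestamps_ms = np.array(["2023-08-05T01:02:03.123"], dtype="datetime64[ms]")

# Explicitly recast to datetime64[ns] so xr.Dataset/xr.DataArray construction
# no longer triggers the non-nanosecond precision UserWarning.
timestamps_ns = timestamps_ms.astype("datetime64[ns]")
```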

@emiliom emiliom added this to the 0.8.0 milestone Aug 5, 2023
@codecov-commenter

codecov-commenter commented Aug 5, 2023

Codecov Report

Merging #1117 (8b6c6ca) into dev (ad17757) will increase coverage by 2.12%.
Report is 1 commit behind head on dev.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1117      +/-   ##
==========================================
+ Coverage   78.27%   80.40%   +2.12%     
==========================================
  Files          65       64       -1     
  Lines        6265     6037     -228     
==========================================
- Hits         4904     4854      -50     
+ Misses       1361     1183     -178     
Flag Coverage Δ
unittests 80.40% <100.00%> (+2.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
echopype/convert/parse_azfp.py 92.97% <ø> (ø)
echopype/convert/set_groups_ek60.py 100.00% <ø> (+6.97%) ⬆️
echopype/convert/parse_base.py 87.42% <100.00%> (ø)

... and 1 file with indirect coverage changes


@emiliom
Collaborator Author

emiliom commented Aug 6, 2023

@leewujung let's see if the CIs all pass. Based on a limited EK80 test I ran, all the ek80 warnings are gone. But I'm puzzled why changing this "[ms]" occurrence below results in an error:

```python
self.config_datagram["timestamp"] = np.datetime64(
    self.config_datagram["timestamp"].replace(tzinfo=None), "[ms]"
)
```

Please take a look at both the changes I made and whether any more changes need to be made.

@emiliom emiliom closed this Aug 6, 2023
@emiliom emiliom changed the title Set time arrays explicitly to datetime64[ns] in parsers Set time arrays explicitly to datetime64[ns] in parsers [all tests ci] Aug 6, 2023
@emiliom emiliom reopened this Aug 6, 2023
@leewujung
Member

@emiliom: so for that timestamp-setting issue related to config_datagram, the actual issue lies in parse_base.py::_print_status, where we print out the first timestamp of the raw file. The .tolist is an ugly way of getting to the underlying numbers (identical to the result of .astype(datetime), but I find that one a little confusing, so I opted for tolist), which can then be formatted by strftime.
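The .tolist()/strftime trick mentioned above can be illustrated with a stand-alone snippet (not echopype code): a scalar np.datetime64 at ms or us precision converts to a stdlib datetime.datetime, which supports strftime. Note this only holds for units of microseconds or coarser; a ns-precision scalar's .tolist() returns an int instead.

```python
import numpy as np

# Scalar ms-precision timestamp, as parsed from a raw file (illustrative value)
ts = np.datetime64("2023-08-05T12:34:56.789", "ms")

# .tolist() on a scalar datetime64 (ms/us unit) yields a datetime.datetime
dt = ts.tolist()

# which can then be formatted with strftime
formatted = dt.strftime("%Y-%m-%d %H:%M:%S")
```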

@leewujung
Member

I ran all the convert tests locally, and there is one failing test, test_convert_ek60_duplicate_ping_times, where "old_ping_time" does not exist in ed["Provenance"]:

assert "old_ping_time" in ed["Provenance"]

@emiliom : do you know what this is? I don't remember anything about it and haven't tried to look into that file, but it seems that there's a duplicated ping_time that gets corrected automatically (??), and then the original timestamps are preserved in the Provenance group?

@emiliom
Collaborator Author

emiliom commented Aug 13, 2023

Hmm. I've confirmed the test failure locally. I've been looking into it and making some headway in understanding the problem.

But I'm also puzzled that in this PR, the test CIs for 3.9 and 3.10 are displayed as successful, but when you actually drill into the test details (e.g., https://github.com/OSOceanAcoustics/echopype/actions/runs/5843062232/job/15844826484?pr=1117#step:14:490), the error is actually there! I specifically added the [all tests ci] string to the PR title when I submitted the PR to ensure all tests were run.

@leewujung
Member

leewujung commented Aug 13, 2023

Ok, I think I know what's going on. What I was able to trace includes the following:

  • in set_groups_ek60.py::set_provenance: the data variable duplicate_ping_times is set to 0 or 1 depending on whether self.old_ping_time is None.
  • but duplicate_ping_times only appears in zarr_combine.py, which is deprecated now that the new combine_echodata code exists
  • self.old_ping_time only appears in SetGroupsEK60.__init__, and the operations are somewhat convoluted (though it boils down to finding whether there is a duplicated ping_time timestamp and, if so, adding +1ns to the duplicated timestamp if the backscatter data is not identical, or dropping the duplicated timestamp if the backscatter data is indeed duplicated for all values). Code section:
```python
for ch in self.sorted_channel.keys():
    ping_time = self.parser_obj.ping_time[ch]
    _, unique_idx = np.unique(ping_time, return_index=True)
    duplicates = np.invert(np.isin(np.arange(len(ping_time)), unique_idx))
    if duplicates.any():
        if self.old_ping_time is None:
            if (
                len({arr.shape for arr in self.parser_obj.ping_time.values()}) == 1
                and np.unique(np.stack(self.parser_obj.ping_time.values()), axis=0).shape[0]
                == 1
            ):
                self.old_ping_time = self.parser_obj.ping_time[ch]
            else:
                ping_times = [
                    xr.DataArray(arr, dims="ping_time")
                    for arr in self.parser_obj.ping_time.values()
                ]
                self.old_ping_time = xr.concat(ping_times, dim="ping_time")
        backscatter_r = self.parser_obj.ping_data_dict["power"][ch]
        # indexes of duplicates including the originals
        # (if there are 2 times that are the same, both will be included)
        (all_duplicates_idx,) = np.where(np.isin(ping_time, ping_time[duplicates][0]))
        if np.array_equal(
            backscatter_r[all_duplicates_idx[0]],
            backscatter_r[all_duplicates_idx[1]],
        ):
            logger.warning(
                "duplicate pings with identical values detected; the duplicate pings will be removed"  # noqa
            )
            for v in self.parser_obj.ping_data_dict.values():
                if v[ch] is None or len(v[ch]) == 0:
                    continue
                if isinstance(v[ch], np.ndarray):
                    v[ch] = v[ch][unique_idx]
                else:
                    v[ch] = [v[ch][i] for i in unique_idx]
            self.parser_obj.ping_time[ch] = self.parser_obj.ping_time[ch][unique_idx]
        else:
            logger.warning(
                "duplicate ping times detected; the duplicate times will be incremented by 1 nanosecond and remain in the ping_time coordinate. The original ping times will be preserved in the Provenance group"  # noqa
            )
            deltas = duplicates * np.timedelta64(1, "ns")
            new_ping_time = ping_time + deltas
            self.parser_obj.ping_time[ch] = new_ping_time
```
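The two dedup strategies traced above can be sketched on a toy array (this is an illustration, not echopype code): detect duplicated timestamps via np.unique with return_index, then either drop the duplicates or bump them by 1 ns.

```python
import numpy as np

# Toy ping_time array with one duplicated timestamp (illustrative values)
ping_time = np.array(
    ["2017-09-12T00:00:00", "2017-09-12T00:00:01", "2017-09-12T00:00:01"],
    dtype="datetime64[ns]",
)

# Mark entries that are NOT the first occurrence of their timestamp
_, unique_idx = np.unique(ping_time, return_index=True)
duplicates = np.invert(np.isin(np.arange(len(ping_time)), unique_idx))

# Strategy 1 (backscatter identical): drop the duplicated timestamps
dropped = ping_time[unique_idx]

# Strategy 2 (backscatter differs): increment each duplicate by 1 ns
bumped = ping_time + duplicates * np.timedelta64(1, "ns")
```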

In terms of actions, my thoughts are:

  • remove zarr_combine completely in this release, since it is not used anymore; this module is all internal, so there is no actual impact in user-facing things
  • remove self.old_ping_time and all the related mechanisms in SetGroupsEK60.__init__
  • remove duplicate_ping_times from the Provenance group

Also, I am still not sure why the two assert statements are in test_convert_ek60.py::test_convert_ek60_duplicate_ping_times, since there are no duplicated ping_time timestamps in this file. Due to the various changes we just introduced in dev among xarray/pandas/netcdf4, I tested 2 environments, one "old" (pandas=1.5.3, netcdf4=1.5.8, xarray=2023.7.0) and one "new" (pandas=2.0.3, netcdf4=1.6.4, xarray=2023.7.0), and in both environments the parsed ping_time does not contain duplicated timestamps. I wonder if we should just remove this test altogether.


I'll next look into the various ping_time errors/warnings revealed in the existing tests...

@leewujung
Member

leewujung commented Aug 13, 2023

Some more messing around revealed that the different timestamps result from the ns vs. ms unit. It does make sense that these are different, but we need to decide what we want to do...

Using the file in test_plot.py::test_vertical_offset_echodata as an example, the first ping_time timestamp looks like the following:

  • With everything set to ns unit (ie with the changes made in this PR):
(screenshot: first ping_time value shown at ns resolution)

  • With the timestamps set to ms unit (ie before changes made in this PR):
(screenshot: first ping_time value shown at ms resolution)

These values do not have practical differences, and we can make the test pass by either using index-based selection or doing method="nearest".
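The two ways of making the test robust can be illustrated with plain numpy (the actual tests would use xarray's .isel / .sel(method="nearest"); the values here are illustrative):

```python
import numpy as np

# ns-resolution ping_time coordinate (illustrative values)
ping_time = np.array(
    ["2017-09-12T00:00:00.923654000", "2017-09-12T00:00:01.923654000"],
    dtype="datetime64[ns]",
)

# Option 1: index-based selection, robust to ms-vs-ns precision differences
first_by_index = ping_time[0]

# Option 2: "nearest" matching against an ms-precision expected value,
# analogous to xarray's .sel(ping_time=..., method="nearest")
query = np.datetime64("2017-09-12T00:00:00.923", "ms").astype("datetime64[ns]")
nearest = ping_time[np.abs(ping_time - query).argmin()]
```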

The puzzling result of the duplicated-timestamp test test_convert_ek60.py::test_convert_ek60_duplicate_ping_times stems from the same changes:

  • With everything set to the ns unit (ie with the changes made in this PR): there are no duplicated timestamps in ping_time
  • With the timestamps set to the ms unit (ie before the changes made in this PR): there are no duplicated timestamps in ping_time

I wonder what we should do in this case...


I decided to dig deeper into where the timestamp came from (in convert.utils.ek_date_conversion.py::nt_to_unix).

```python
def nt_to_unix(nt_timestamp_tuple, return_datetime=True):
    """
    :param nt_timestamp_tuple: Tuple of two longs representing the NT date
    :type nt_timestamp_tuple: (long, long)
    :param return_datetime: Return a datetime object instead of float
    :type return_datetime: bool

    Returns a datetime.datetime object w/ UTC timezone
    calculated from the nt time tuple

    lowDateTime, highDateTime = nt_timestamp_tuple
    The timestamp is a 64bit count of 100ns intervals since the NT epoch
    broken into two 32bit longs, least significant first:

    >>> dt = nt_to_unix((19496896L, 30196149L))
    >>> match_dt = datetime.datetime(2011, 12, 23, 20, 54, 3, 964000, pytz_utc)
    >>> assert abs(dt - match_dt) <= dt.resolution
    """
    lowDateTime, highDateTime = nt_timestamp_tuple
    sec_past_nt_epoch = ((highDateTime << 32) + lowDateTime) * 1.0e-7
    if return_datetime:
        return UTC_NT_EPOCH + datetime.timedelta(seconds=sec_past_nt_epoch)
    else:
        sec_past_unix_epoch = sec_past_nt_epoch - EPOCH_DELTA_SECONDS
        return sec_past_unix_epoch
```
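A self-contained rendering of the conversion above (with UTC_NT_EPOCH defined locally as 1601-01-01 UTC, the NT epoch) shows that the raw timestamp is a count of 100 ns ticks, i.e. finer than ms precision, which is why ms vs. ns casting produces slightly different values:

```python
import datetime

# NT (FILETIME) epoch: 1601-01-01 UTC
UTC_NT_EPOCH = datetime.datetime(1601, 1, 1, tzinfo=datetime.timezone.utc)

def nt_to_unix(nt_timestamp_tuple):
    # 64-bit count of 100 ns intervals since the NT epoch,
    # split into (low, high) 32-bit halves, least significant first
    low, high = nt_timestamp_tuple
    sec_past_nt_epoch = ((high << 32) + low) * 1.0e-7  # 100 ns ticks -> seconds
    return UTC_NT_EPOCH + datetime.timedelta(seconds=sec_past_nt_epoch)

# Values from the docstring's doctest
dt = nt_to_unix((19496896, 30196149))
```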

Taking the value returned by this function, the following illustrates what's going on:
(screenshot: the returned timestamp value compared at different resolutions)

Even though these are not really of practical difference, it seems that the change we're making here to change to using ns unit is a good idea.

@emiliom : I'll change all the tests to just grab the first ping time based on index, and wait for you to chime in here before any further actions.

…set to isel for ping_time instead of actual timestamp values
@emiliom
Collaborator Author

emiliom commented Aug 15, 2023

Thanks @leewujung ! I reviewed the issues and came to agreement with your recommendations. I opened a new issue, #1122, to give more visibility to the larger changes we're talking about here, for future reference.

I tried to do the removal of zarr_combine.py and the associated test in a separate PR, but ran into a problem of the duplicate-ping_time test no longer passing. I ended up including everything in this PR, though I only now see that I could've just included the deletion of zarr_combine.py and test_zarr_combine.py in that PR 😞. Oh well.

With respect to the deletion of the duplicate_ping_time and old_ping_time code, see my comments in #1122. I chose to remove them too, but included those changes in two commits for easier tracking in the future, if needed.

Regarding the impact of np.datetime64 unit (resolution), I had put together a stand-alone illustration similar to yours:

```python
dt_str = '2001-01-01T00:00:00.923654'

np.datetime64(dt_str, 's')
Out[27]: numpy.datetime64('2001-01-01T00:00:00')
np.datetime64(dt_str, 'ms')
Out[28]: numpy.datetime64('2001-01-01T00:00:00.923')
np.datetime64(dt_str, 'ns')
Out[29]: numpy.datetime64('2001-01-01T00:00:00.923654000')
```

Previously, with "ms", the original, finer resolution would've been clipped to "ms", then reset to "ns" with zero padding (+/- floating point error) in the backend to adhere to the Pandas < 2.0 limitation.
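The clip-then-zero-pad behavior described above can be reproduced directly (illustrative snippet): casting to "ms" first discards the sub-millisecond digits, and a subsequent cast to "ns" only zero-pads what remains.

```python
import numpy as np

dt_str = "2001-01-01T00:00:00.923654"

# Old path: clip to ms, then the backend recasts to ns with zero padding
ms_then_ns = np.datetime64(dt_str, "ms").astype("datetime64[ns]")

# New path: parse directly at ns precision, preserving the full value
direct_ns = np.datetime64(dt_str, "ns")
```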

@emiliom
Collaborator Author

emiliom commented Aug 15, 2023

This is now ready to merge.

@emiliom emiliom changed the title Set time arrays explicitly to datetime64[ns] in parsers [all tests ci] Set time arrays explicitly to datetime64[ns] in parsers; remove ZarrCombine & duplicate / old ping_time stuff [all tests ci] Aug 15, 2023
@emiliom emiliom removed the request for review from lsetiawan August 15, 2023 08:20
@emiliom emiliom self-assigned this Aug 15, 2023
Member

@leewujung leewujung left a comment


Awesome, I think we can merge this now!

@leewujung leewujung merged commit 2436f59 into OSOceanAcoustics:dev Aug 19, 2023
3 checks passed
@emiliom emiliom deleted the datetime64ns-warnings branch August 19, 2023 01:04