
Refactor I/O to rely on DataFrames & provide storage options #27

Open
aufdenkampe opened this issue Mar 5, 2021 · 9 comments
@aufdenkampe
Member

aufdenkampe commented Mar 5, 2021

Implement the plan we've discussed to abstract out model I/O, so that:

  • the model runs by interacting with a dictionary of DataFrames in memory
    • presently, the model reads/writes to HDF5 during model execution
  • reading/writing between the dictionary of DataFrames and storage is done with a separate set of functions.

This will enable us to read/write to Parquet (#23), Zarr, and other high performance file formats, in addition to HDF5.
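A minimal sketch of that separation might look like the following. All of the names here (`read_hdf5`, `write_parquet`, `run_model`, the `/TS39` key) are illustrative, not the eventual HSP2 API; `run_model` is a placeholder transform standing in for the real model logic.

```python
from typing import Dict, List

import pandas as pd


def read_hdf5(path: str, keys: List[str]) -> Dict[str, pd.DataFrame]:
    """Load the requested tables from an HDF5 file into one in-memory dict."""
    with pd.HDFStore(path, mode="r") as store:
        return {key: store[key] for key in keys}


def write_parquet(frames: Dict[str, pd.DataFrame], directory: str) -> None:
    """Persist each in-memory table as a separate Parquet file."""
    for key, df in frames.items():
        df.to_parquet(f"{directory}/{key.strip('/').replace('/', '_')}.parquet")


def run_model(frames: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
    """Model logic touches only the in-memory dict, never the storage layer.

    Placeholder transform: the real model would update these tables.
    """
    return {key: df.copy() for key, df in frames.items()}
```

Because the model sees only the dict, swapping Parquet or Zarr for HDF5 means adding another reader/writer pair, with no changes to the model code itself.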

For performance, consider using Dask DataFrames, which are very similar to, and integrated with, pandas DataFrames but have built-in parallelization and chunking that allow performant manipulation of very large datasets that are bigger than RAM. We might want to use Dask DataFrames instead of pandas DataFrames throughout the repo (or at least make it easy to switch between the two).

Dask Tutorial: http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html

@htaolimno

Forked a branch, develop_WDMclass, to develop a WDM class and separate the HDF5 handling from the reader.

@htaolimno

htaolimno commented Mar 9, 2021

@ptomasula I encountered an issue in the code from the branch develop_readWDM that I forked from. When I commented out the numba decorators for debugging purposes, the code ran into an infinite loop. It took me a while to figure out that the default NumPy integer is only 4 bytes. This short length caused lost bits in your datetime encoding function. The code seems to run fine when the numba decorators are restored; I guess the compiled code may allocate a longer integer size and therefore avoid the issue. Any comments?

@ptomasula
Member

ptomasula commented Mar 9, 2021

@htaolimno, I also encountered that behavior when I was developing the datetime functions. The datetime integer uses 26 bits for the month-through-second components, and any remaining bits store the year information. As you noted, any 4-byte integer will cause problems because we will lose bits related to the year component.

I thought I had solved that with the conversion to a Python int object on these lines. Without numba, I would expect whatever integer was input to be converted to the Python int object, which is a long of arbitrary length. It should expand the number of bits used to accommodate, even if the initial type of the integer was only 32-bit, so I'm surprised you encountered this issue when you commented out the numba decorators.

However, I did some testing, and there is also some problematic behavior in how the int conversion works when numba is applied. When using numba, if I explicitly define a numpy.int32 object and try to convert it to a Python int, I can still get a numpy.int32 object back (i.e. type(int(numpy.int32)) == numpy.int32), so numba is taking some liberties in interpreting the type for that conversion. Anyway, I think the solution to both of these problems is the same: let's explicitly define the integers used in these datetime functions as 64-bit using the numpy.int64 object. You can pull in commit 1a62f38, which should solve the issue you encountered.
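The failure mode is easy to demonstrate. The snippet below is a hypothetical stand-in for the encoding, not the actual HSP2 function: it only packs the year above the 26 date/time bits, which is enough to show why a 32-bit integer destroys the year.

```python
import numpy as np

YEAR_SHIFT = 26  # bits 0-25 hold the month-through-second fields


def encode_year(year):
    # Shift the year above the 26 packed date/time bits; for any realistic
    # year the result needs more than 32 bits (1970 << 26 is ~1.3e11).
    return int(year) << YEAR_SHIFT


packed = encode_year(1970)
wide = np.int64(packed)          # 64 bits: the year bits survive
narrow = wide.astype(np.int32)   # 32 bits: the high bits silently wrap

assert int(wide) >> YEAR_SHIFT == 1970
assert int(narrow) >> YEAR_SHIFT != 1970  # year information is destroyed
```

Note that the `astype` cast wraps silently rather than raising, which is why the bug surfaced as garbage dates (and the infinite loop) instead of an error.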

@htaolimno

@ptomasula I hesitate to pull in your most recent changes, since I am further along in my development. Instead, I've adopted a simple solution for the time being: I kept the numpy.int32 but used .item() to pass the underlying Python integer to the datetime encoding subroutine.
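For reference, `.item()` unwraps a NumPy scalar to the equivalent built-in Python object, and Python ints have arbitrary precision, so the subsequent bit shifts cannot overflow (the value 1970 here is just an example):

```python
import numpy as np

stored = np.int32(1970)      # WDM fields arrive as 4-byte integers
as_python = stored.item()    # unwrap to an arbitrary-precision Python int

assert type(as_python) is int
assert (as_python << 26).bit_length() > 32  # no silent overflow possible
```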

aufdenkampe referenced this issue Mar 26, 2021
The datetime functions added to support numba in commit e5d64a1 require that integers input into these functions be 64 bits, or year information will be lost during bit manipulations. The previous implementation left the integer type up to numba and in some instances could produce an int32 object. This commit makes the integer conversion explicitly int64 so that year information is not lost.
aufdenkampe referenced this issue Mar 26, 2021
Even with datetime conversions removed from the group processing loop, the conversion time using datetime.datetime() remains slow. After trying several datetime conversion approaches with pandas, I was still unable to achieve a significant performance boost.

Numba does not support the creation of datetime objects, but it does support datetime arithmetic. This commit adds a numba-compatible datetime conversion function which calculates a date's offset from the epoch and adds the appropriate timedelta64 objects to return a datetime64 object.
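The shape of that approach can be sketched as below. This is a simplified illustration, not the committed function: it assumes the day offset from the epoch has already been computed elsewhere from the packed year/month/day fields, and it is shown without the numba decorator.

```python
import numpy as np

EPOCH = np.datetime64("1970-01-01T00:00")


def to_datetime64(days_since_epoch, hours, minutes):
    # Build the timestamp purely with timedelta64 arithmetic, which numba
    # supports, instead of constructing datetime objects, which it does not.
    return (EPOCH
            + np.timedelta64(days_since_epoch, "D")
            + np.timedelta64(hours, "h")
            + np.timedelta64(minutes, "m"))
```

Because every operation is datetime64/timedelta64 arithmetic, the same body can sit inside an @njit-compiled loop.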
aufdenkampe referenced this issue Mar 26, 2021
I missed committing 3 line deletions which remove the old pandas.apply based datetime conversion approach.
@aufdenkampe
Member Author

We tried to merge in Paul's fixes with 22b457a, but stripped most of the fix while resolving the merge conflict.

@ptomasula, can you fix?

@ptomasula
Member

ptomasula commented Mar 29, 2021

@aufdenkampe @htaolimno I reset the branch prior to the merge, and then performed a new merge where I manually resolved the merge conflicts. See commit d64c25a.

This also gave me a chance to review the code, so I can provide some feedback on what I think the next steps are. I'll open a pull request shortly so I can type up the notes in the code-review format.

@aufdenkampe
Member Author

@ptomasula, awesome! That's a lot cleaner and much easier to understand. Thanks.

@ptomasula
Member

ptomasula commented Mar 29, 2021

@htaolimno
I did a code review of the WDMReader class. Thank you so much for the solid work on this first phase. I think the class is pretty close, and with some restructuring it should be good to merge in. Hua, when you have a chance, please look over the code review and see if you can start addressing some of the comments.

Meanwhile this week, I'll start putting together my thoughts on the IO abstraction class and maybe even write an outline of a class that we can build off of.

@aufdenkampe
Member Author

aufdenkampe commented Mar 29, 2021

@ptomasula, thanks for setting up PR #34 and doing an initial code review with suggestions.

As you ponder a new IO abstraction class, it might be useful to know that I did some further exploration of HDF5 this weekend, and I've changed my tune regarding PyTables. The short story is that we will continue to rely on PyTables for our HDF5 IO, which gives us nice integration with Pandas and Dask DataFrames.

So let's get the abstraction to a level where we're managing the tables in Pandas (and interchangeably with Dask).
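For anyone following along, pandas delegates its HDF5 I/O to PyTables under the hood, so "managing the tables in Pandas" can be as simple as a store roundtrip. The file name and table key below are hypothetical:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"flow": [1.2, 3.4]},
    index=pd.date_range("2021-03-01", periods=2, freq="D"),
)

# pd.HDFStore uses PyTables for the actual reads/writes.
path = os.path.join(tempfile.mkdtemp(), "results.h5")  # hypothetical file
with pd.HDFStore(path, mode="w") as store:
    store.put("/RESULTS/TS39", df, format="table")  # 'table' format is queryable

with pd.HDFStore(path, mode="r") as store:
    restored = store["/RESULTS/TS39"]
```

The same DataFrame could instead be handed to a Parquet or Zarr writer, which is the point of keeping the abstraction at the DataFrame level.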
