Refactor I/O to rely on DataFrames & provide storage options #27
Forked a branch develop_WDMclass for development of a WDM class and to separate HDF5 from the reader.
@ptomasula I encountered an issue in the code from the branch develop_readWDM that I forked from. When I commented out the numba decorators for debugging, the code ran into an infinite loop. It took me a while to figure out that the default NumPy integer is only 4 bytes. This short length caused lost bits in your datetime encoding function. The code seems to run fine when the numba decorators are restored; I guess the compiled code may allocate a longer integer size and therefore avoid the issue. Any comments?
@htaolimno, I also encountered that behavior when I was developing the datetime functions. The datetime integer uses 26 bits for the month-through-second components, and any remaining bits store year information. As you noted, any 4-byte integer will cause problems because we lose bits related to the year component. I thought I had solved that with the conversion to a Python int object on these lines. Without numba, I would expect whatever integer was input to be converted to a Python int, which is a long of arbitrary length. It should expand the number of bits used to accommodate the value even if the initial integer type was only 32-bit, so I'm surprised you encountered this issue when you commented out the numba decorators. However, I did some testing, and there is also some problematic behavior with how the int conversion works when numba is applied. When using numba, if I explicitly define a numpy.int32 object and try to convert it to a Python int, I can still get a numpy.int32 object back (i.e. type(int(numpy.int32)) == numpy.int32), so numba is taking some liberties in interpreting the type for that conversion. Anyway, I think the solution to both of these problems is the same: let's explicitly define the integers used in these datetime functions as 64-bit using the numpy.int64 object. You can pull in commit 1a62f38, which should solve the issue you encountered.
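For illustration, a minimal sketch of why the shift-based packing needs 64-bit integers; the function name and bit layout below are placeholders, not the actual HSP2 encoding:

```python
import numpy as np

def pack_year(year, low_bits):
    # Hypothetical packing: the 26 low bits hold month-through-second
    # information and the year is shifted above them. With an explicit
    # np.int64 the shifted year fits; with a 32-bit integer only ~6 bits
    # remain above bit 25, so most years would be truncated.
    return (np.int64(year) << 26) | np.int64(low_bits)

packed = pack_year(1984, 0b10101010101010101010101010)
assert int(packed) >> 26 == 1984  # year survives the round trip
```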
@ptomasula I hesitate to pull in your most recent changes, since I am further along in my development. Instead, I've developed a simple solution for the time being: I kept the numpy.int32 but used .item() to pass the underlying Python integer to the datetime encoding subroutine.
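A minimal sketch of that workaround (the variable names are illustrative):

```python
import numpy as np

raw = np.int32(12345)      # value as read from the WDM file
py_int = raw.item()        # plain Python int, arbitrary precision
# py_int can now be passed to the datetime encoding subroutine without
# risking 32-bit truncation during the bit shifts.
```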
The datetime functions added to support numba in commit e5d64a1 require that integers input into these functions be 64-bit, or year information will be lost during bit manipulations. The previous implementation left the integer type up to numba and in some instances could produce an int32 object. This commit makes the integer conversion explicitly int64 so that year information is not lost.
Even with datetime conversions removed from the group processing loop, the conversion time using datetime.datetime() remains slow. After attempting several datetime conversion approaches with pandas, I was still unable to achieve a significant performance boost. Numba does not support the creation of datetime objects, but it does support datetime arithmetic. This commit adds a numba-compatible datetime conversion function which calculates a date's offset from the epoch and adds the appropriate timedelta64 objects to return a datetime64 object.
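As a rough sketch of that approach (not the committed implementation; the real function would carry numba's @njit decorator and compute the offsets from the decoded date components):

```python
import numpy as np

EPOCH = np.datetime64('1970-01-01T00:00:00')

def to_datetime64(days_since_epoch, seconds_into_day):
    # Build the timestamp purely through timedelta64 arithmetic, which
    # numba can compile, instead of constructing datetime.datetime
    # objects, which it cannot.
    return (EPOCH
            + np.timedelta64(days_since_epoch, 'D')
            + np.timedelta64(seconds_into_day, 's'))

print(to_datetime64(18262, 3600))  # 2020-01-01T01:00:00
```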
I missed committing 3 line deletions which remove the old pandas.apply based datetime conversion approach.
We tried to merge in Paul's fixes with 22b457a, but stripped most of the fix while resolving the merge conflict. @ptomasula, can you fix?
@aufdenkampe @htaolimno I reset the branch to its state prior to the merge, and then performed a new merge where I manually resolved the conflicts. See commit d64c25a. This also gave me a chance to review the code so I can provide some feedback on what I think the next steps are. I'll open a pull request shortly so I can type up the notes in the code review format.
@ptomasula, awesome! That's a lot cleaner and much easier to understand. Thanks.
@htaolimno Meanwhile this week, I'll start putting together my thoughts on the IO abstraction class and maybe even write an outline of a class that we can build off of. |
@ptomasula, thanks for setting up PR #34 and doing an initial code review with suggestions. As you ponder a new IO abstraction class, it might be useful to know that I did some further exploration this weekend of HDF5, and I've changed my tune regarding using PyTables. The short story is that we will continue to rely on PyTables for our HDF5 IO, which gives us that nice integration with Pandas and Dask DataFrames. So let's get the abstraction to a level where we're managing the tables in Pandas (and interchangeably with Dask).
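To make the idea concrete, here is one hypothetical shape the abstraction could take; the class and method names below are placeholders for discussion, not a proposed final design:

```python
from abc import ABC, abstractmethod
import pandas as pd

class TimeseriesIO(ABC):
    """Hypothetical backend-agnostic I/O layer; callers only see DataFrames."""

    @abstractmethod
    def read_table(self, key: str) -> pd.DataFrame:
        """Return the table stored under `key` as a DataFrame."""

    @abstractmethod
    def write_table(self, key: str, df: pd.DataFrame) -> None:
        """Persist `df` under `key` in the backing store."""

class HDF5IO(TimeseriesIO):
    """HDF5 backend via pandas/PyTables."""

    def __init__(self, path: str):
        self.path = path

    def read_table(self, key: str) -> pd.DataFrame:
        return pd.read_hdf(self.path, key)

    def write_table(self, key: str, df: pd.DataFrame) -> None:
        df.to_hdf(self.path, key=key, format='table')
```

A Parquet or Zarr backend would implement the same two methods with the corresponding readers and writers, which is what would keep the rest of the code independent of the storage format.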
Implement the plan we've discussed to abstract out model I/O. This will enable us to read/write to Parquet (#23), Zarr, and other high-performance file formats, in addition to HDF5.
For performance, consider using Dask DataFrame, which is very similar to and integrated with Pandas DataFrames, but has built-in parallelization and chunking that allow for performant manipulation of very large datasets that are bigger than RAM. We might want to use Dask DataFrames instead of Pandas DataFrames throughout the repo (or at least be able to switch easily between the two), as in the sketch after the tutorial link below.
Dask Tutorial: http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html
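A minimal sketch of how interchangeable the two can be (the file and column names here are placeholders):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.read_parquet('timeseries.parquet')   # eager, in memory
ddf = dd.read_parquet('timeseries.parquet')   # lazy, chunked, parallel

# The same DataFrame expression works with either library; Dask just
# defers execution until .compute() is called.
pd_means = pdf.groupby('constituent')['value'].mean()
dd_means = ddf.groupby('constituent')['value'].mean().compute()
```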