Documentation on binary file serialisation #43

Open
tjgalvin opened this issue Dec 2, 2019 · 4 comments

@tjgalvin
tjgalvin commented Dec 2, 2019

I see in the FILE_FORMATS.md file that there are descriptions of the header information of each type of binary file.

Can we expand this to clearly specify the order in which the corresponding arrays (SOMs, neurons, transforms, etc.) are written to these files? Although there are example codes that show how to read them, they can be a little difficult to follow and to transplant into other codes.
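As an illustration of what such a specification would pin down, here is a minimal sketch of a header-then-data binary file with an explicit write order. The layout, the field names (`som_width`, `som_height`, `neuron_size`) and the helper functions are hypothetical and are not PINK's actual format; FILE_FORMATS.md remains the reference for the real headers.

```python
import struct

# Hypothetical layout for illustration only; this is NOT PINK's actual format.
# Header: three little-endian int32 values (som_width, som_height, neuron_size).
# Data:   som_width * som_height * neuron_size * neuron_size float32 values,
#         written in row-major order: SOM row, SOM column, neuron row, neuron column.
HEADER = struct.Struct("<3i")


def write_som(path, som_width, som_height, neuron_size, values):
    """Write the header followed by the flat list of neuron values."""
    expected = som_width * som_height * neuron_size * neuron_size
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    with open(path, "wb") as f:
        f.write(HEADER.pack(som_width, som_height, neuron_size))
        f.write(struct.pack(f"<{expected}f", *values))


def read_som(path):
    """Read the header, then exactly as many float32 values as it announces."""
    with open(path, "rb") as f:
        som_width, som_height, neuron_size = HEADER.unpack(f.read(HEADER.size))
        count = som_width * som_height * neuron_size * neuron_size
        values = struct.unpack(f"<{count}f", f.read(4 * count))
    return som_width, som_height, neuron_size, list(values)


def neuron_offset(som_width, neuron_size, row, col):
    """Index of the first value of the neuron at (row, col) in the flat list."""
    return (row * som_width + col) * neuron_size * neuron_size
```

With the order stated this explicitly, a reader in any language only needs the header sizes and the index formula in `neuron_offset` to locate a given neuron.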

@BerndDoser
Member

The current binary file format is a provisional solution; a stable format has not been a requirement so far. For the Python interface I am also using NumPy binary files (npy, npz), which is not the final solution either, as we have to handle more complex data structures than tensors. Even the hexagonal layout is not directly supported.
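A minimal sketch of the NumPy route for the simple Cartesian case, where the whole SOM fits in one 4-D array. The array names (`som`, `layout`) and the shapes are assumptions chosen for illustration; they are not taken from PINK's Python interface.

```python
import io

import numpy as np

# Assumed shapes for illustration: a 3x4 Cartesian SOM of 8x8 neurons.
som = np.zeros((3, 4, 8, 8), dtype=np.float32)

buffer = io.BytesIO()  # stands in for a file on disk
# Store the tensor together with a small piece of metadata in one npz archive.
np.savez(buffer, som=som, layout=np.array("cartesian"))
buffer.seek(0)

with np.load(buffer) as archive:
    restored = archive["som"]
    layout = str(archive["layout"])
```

This works because a Cartesian SOM is a plain tensor. A hexagonal layout would need extra conventions on top of the arrays, which is the kind of structure npy/npz does not describe by itself.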

Therefore, I would like to open the discussion to find a suitable data format. I would recommend XML, JSON, or HDF5. Maybe there is already an astronomy data format we can simply use or extend for our purpose.

Changes in the data format should only be done in a major release, once we have a stable specification.

@BerndDoser added the discussion and enhancement labels Dec 3, 2019
@BerndDoser added this to the 3.0 milestone Dec 3, 2019
@tjgalvin
Author

tjgalvin commented Dec 3, 2019 via email

@RafaelMostert

> Yes - the hexagonal scheme is a bit of a pain with the current infrastructure. I have taken to ignoring it completely. This makes my life a little simpler.

I did the same almost from the start. That makes life easier during debugging, during further visualization steps, and when the SOM BMUs are fed into further ML techniques (think random forests or auto-encoders).

> Therefore, I would like to open the discussion to find a suitable data format. I would recommend XML, JSON, or HDF5. Maybe there is already an astronomy data format we can simply use or extend for our purpose.

HDF5 is vastly superior to FITS (better Python/pandas integration, faster I/O, smaller files), although both are better than XML and JSON.
Having the option to also store the WCS parameters for the cutouts in the HDF5 as suggested by the paper that Tim links to would be nice.

@BerndDoser
Member

Thanks for your valuable notes and suggestions. I have added some more details of the memory layout to https://github.com/HITS-AIN/PINK/blob/master/FILE_FORMATS.md#cartesian-layout. Indeed, the hexagonal layout is a bit fiddly. Maybe the Python interface using NumPy can be helpful.

The suggested paper about HDFITS is very promising. I will think about how we can use it not only for the input files, but also for the result files (SOM and mapping).


4 participants