
How do you prevent a column from being rounded? #2

Open
stevans opened this issue Apr 23, 2021 · 5 comments

Comments

@stevans

stevans commented Apr 23, 2021

Expected behavior:
A column of floats has all decimal places preserved.

Actual behavior:
A column of floats is rounded to two digits and put into scientific notation (i.e. 0.951366 is displayed as 9.5e-1).

Steps to reproduce:
I followed the example in the README, and I've also tried explicitly defining the format with set_format, but that seems to be ignored.
I'm using an astropy table with ~1.6 million rows, so it'd be difficult to share the actual data. Not all float columns show this behavior, and it doesn't happen when I use a test set of only 100 rows; I only see the actual behavior when I load the entire dataset.

Version: 1.3.0

@gilleslandais
Collaborator

Hi -
Oops - the library was not built for such large volumes (the largest table I tested contained 400K records), but it's good news if it works!

The library reads the string representation of the astropy table. I suspect you have a column mixing plain float and scientific representations, perhaps because a limit value is reached somewhere in the column (e.g. a minimum value < 1e-38 or a maximum value > 1e+38).
Can you confirm?
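The limit-value theory is easy to check by hand. A minimal sketch, assuming the single-precision float bounds are the relevant threshold, and using a plain list in place of an astropy Column (the function name and sample data are made up for illustration):

```python
# Approximate bounds of a single-precision (float32) value.
FLOAT32_TINY = 1.0e-38
FLOAT32_HUGE = 3.4e+38

def out_of_float32_range(column):
    """Return nonzero values whose magnitude falls outside the float32 range."""
    return [v for v in column
            if v != 0 and (abs(v) < FLOAT32_TINY or abs(v) > FLOAT32_HUGE)]

# Made-up sample data: only 1e-40 is below the float32 lower bound.
column = [0.951366, -0.000034, 3e-6, 1e-40]
print(out_of_float32_range(column))  # [1e-40]
```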

@stevans
Author

stevans commented Apr 30, 2021

Hi @gilleslandais ,
I don't see any values <1e-38 or >1e+38 (after taking absolute values, since some are negative). The smallest absolute value is 3e-6.

@gilleslandais
Collaborator

Difficult to reproduce... would it be possible to get access to your file?

@emilyhunt

I encountered this bug too, and didn't notice it until the tables were already published at the CDS and a user pointed it out. I'd be happy to help with getting it fixed, especially because this module was otherwise so helpful for getting my data ready to publish!

It only happened in one column of a table with around 1.3 million rows (view on CDS): the "DE_ICRS" column was converted to scientific notation despite its values only ranging from -90 to +90. The value with the smallest absolute magnitude in the column is -0.000034, which was enough to trigger a conversion and rounding that reduced the column's precision enough to cause problems. An additional table with the same columns but fewer values did not hit this bug, suggesting it's triggered when the smallest absolute value falls below some threshold.
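For what it's worth, the threshold observed here matches how C-style "general" formatting behaves: `%g` switches to scientific notation once a magnitude drops below 1e-4, so a single tiny value can push a shared column format into scientific notation. A quick plain-Python illustration of that mechanism (not necessarily the library's actual code path):

```python
# %g uses fixed notation for moderate magnitudes but flips to scientific
# notation below 1e-4 -- so one small entry can end up formatted very
# differently from the rest of its column.
print("%g" % 45.123456)   # 45.1235
print("%g" % -0.000034)   # -3.4e-05
```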

This folder contains the data uploaded to the CDS. The csv_new and parquet_new folders contain the data in alternative formats (where this issue wasn't present). The cds_new folder contains the tables that were originally uploaded to the CDS and generated by cdspyreadme, which contain the issue in the DE_ICRS column.

> Hi - Oops - the library was not built for such large volumes (the largest table I tested contained 400K records), but it's good news if it works!
>
> The library reads the string representation of the astropy table. I suspect you have a column mixing plain float and scientific representations, perhaps because a limit value is reached somewhere in the column (e.g. a minimum value < 1e-38 or a maximum value > 1e+38). Can you confirm?

As an aside, if it isn't possible to modify how astropy converts an entire table to a string to fix this issue, then maybe this way of generating tables could be changed anyway? I found that the module had very high RAM usage and took a long time (30+ minutes) to generate the tables. I imagine the main reason for this is because the string object containing each table in its entirety was multiple gigabytes in size and took a long time to generate, as such a large string saved in RAM will inevitably be extremely unwieldy.

Maybe instead the module could switch to using Python's native file saving (using open(file, "a")) and simply iterating over rows in a for loop, appending them to the file one by one? That would offer another way to save the data, where control over the formatting of each entry in each row could be much better. In addition, it should dramatically reduce RAM usage on large tables, as the entire astropy table would not need to be converted into a string at once; instead, each row would be processed individually (a bit like how saving a .csv file works in most implementations).
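The row-by-row approach described above might look something like this sketch, with a plain list of tuples standing in for the astropy table (`fmt_row`, the fixed `%12.8f` format, and the file name are all made up for illustration):

```python
def fmt_row(row):
    """Format one row with a fixed-width float format (assumed F12.8 here)."""
    return " ".join("%12.8f" % value for value in row)

def write_table(rows, path):
    # Write rows one at a time, so the whole table is never held in RAM
    # as a single giant string.
    with open(path, "w") as out:
        for row in rows:
            out.write(fmt_row(row) + "\n")

# Made-up sample data.
rows = [(10.5, -0.000034), (45.0, 0.951366)]
write_table(rows, "table.dat")
```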

Even though the module wasn't originally designed for such large tables, I think it's actually really the only way to process them - I found that the CDS web interface wasn't able to handle my tables due to their size, and processing them locally with this module worked a lot better. It was much easier than writing out an entire ReadMe file by hand for such large tables containing so many columns and entries.

Thanks for your help!

@gilleslandais
Collaborator

  • Scientific notation issue:
    annoying!
    The library, which is based on astropy, selects the output format. To avoid the problem, you can specify the format explicitly:
import cdspyreadme

tablemaker = cdspyreadme.CDSTablesMaker()
table = tablemaker.addTable(...)
column = table.get_column("DE_ICRS")
column.set_format("F12.8")

The problem remains that this requires verification by the author. I will investigate a clean fix.

  • Large table issue: I share your RAM analysis.
    Improving efficiency would require developing a new library, not based on astropy.
    ... an interesting project that requires an investment in development.
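For reference, "F12.8" is a Fortran-style fixed-point format (12 characters wide, 8 decimal places); in Python terms it behaves like `%12.8f`, which, unlike a general format, never falls back to scientific notation:

```python
# F12.8 ~ a fixed-point field 12 characters wide with 8 decimals; small
# magnitudes keep their decimal representation instead of switching to
# scientific notation.
print("%12.8f" % -0.000034)  # ' -0.00003400'
print("%12.8f" % 0.951366)   # '  0.95136600'
```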
