
How do you prevent a column from being rounded? #2

Open
stevans opened this issue Apr 23, 2021 · 5 comments

Comments

@stevans

stevans commented Apr 23, 2021

Expected behavior:
A column of floats has all decimal places preserved.

Actual behavior:
A column of floats is rounded to two digits and put into scientific notation (i.e. 0.951366 is displayed as 9.5e-1).

Steps to reproduce:
I followed the example in the README, and I've also tried explicitly defining the format with set_format, but that seems to be ignored.
I'm using an astropy table with ~1.6 million rows, so it'd be difficult to share the actual data. Not all float columns show this behavior, and it doesn't happen when I use a test set of only 100 rows; I only see the actual behavior when I load the entire dataset.

Version: 1.3.0

@gilleslandais
Collaborator

Hi -
Oops - the library was not built for such large volumes (the largest table I tested contained 400K records), but it's good news if it works!

The library reads the string representation of the astropy table. I suspect you have a column mixing plain float and scientific representations, perhaps because a limit value is reached somewhere in the column (e.g. a minimum value < 1e-38 or a maximum value > 1e+38).
Can you confirm?
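The limit-value theory is easy to check by hand. A minimal sketch, assuming the single-precision float bounds are the relevant threshold, and using a plain list in place of an astropy Column (the function name and sample data are made up for illustration):

```python
# Approximate bounds of a single-precision (float32) value.
FLOAT32_TINY = 1.0e-38
FLOAT32_HUGE = 3.4e+38

def out_of_float32_range(column):
    """Return nonzero values whose magnitude falls outside the float32 range."""
    return [v for v in column
            if v != 0 and (abs(v) < FLOAT32_TINY or abs(v) > FLOAT32_HUGE)]

# Made-up sample data: only 1e-40 is below the float32 lower bound.
column = [0.951366, -0.000034, 3e-6, 1e-40]
print(out_of_float32_range(column))  # [1e-40]
```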

@stevans
Author

stevans commented Apr 30, 2021

Hi @gilleslandais ,
I don't see any values <1e-38 or >1e+38 (after taking absolute values, since some are negative). The smallest absolute value is 3e-6.

@gilleslandais
Collaborator

Difficult to reproduce... would it be possible to get access to your file?

@emilyhunt

I encountered this bug too, and didn't notice it until the tables were already published at the CDS and a user pointed it out. I'd be happy to help with getting it fixed, especially because this module was otherwise so helpful for getting my data ready to publish!

It only happened in one column of a table with around 1.3 million rows (view on CDS): the "DE_ICRS" column was converted to scientific notation despite its values only ranging from -90 to +90. The value with the smallest absolute magnitude in the column is -0.000034, which was enough to trigger a conversion and rounding that reduced the column's precision enough to cause problems. An additional table with the same columns but fewer values did not hit this bug, suggesting it's triggered when the smallest absolute value falls below some threshold.
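For what it's worth, the threshold observed here matches how C-style "general" formatting behaves: `%g` switches to scientific notation once a magnitude drops below 1e-4, so a single tiny value can push a shared column format into scientific notation. A quick plain-Python illustration of that mechanism (not necessarily the library's actual code path):

```python
# %g uses fixed notation for moderate magnitudes but flips to scientific
# notation below 1e-4 -- so one small entry can end up formatted very
# differently from the rest of its column.
print("%g" % 45.123456)   # 45.1235
print("%g" % -0.000034)   # -3.4e-05
```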

This folder contains the data uploaded to the CDS. The csv_new and parquet_new folders contain the data in alternative formats (where this issue wasn't present). The cds_new folder contains the tables that were originally uploaded to the CDS and generated by cdspyreadme, which contain the issue in the DE_ICRS column.

> Hi - Oops - the library was not built for such large volumes (the largest table I tested contained 400K records), but it's good news if it works!
>
> The library reads the string representation of the astropy table. I suspect you have a column mixing plain float and scientific representations, perhaps because a limit value is reached somewhere in the column (e.g. a minimum value < 1e-38 or a maximum value > 1e+38). Can you confirm?

As an aside, if it isn't possible to modify how astropy converts an entire table to a string to fix this issue, then maybe this way of generating tables could be changed anyway? I found that the module had very high RAM usage and took a long time (30+ minutes) to generate the tables. I imagine the main reason for this is because the string object containing each table in its entirety was multiple gigabytes in size and took a long time to generate, as such a large string saved in RAM will inevitably be extremely unwieldy.

Maybe instead the module could switch to using Python's native file saving (using open(file, "a")) and simply iterating over rows in a for loop, appending them to the file one by one? That would offer another way to save the data, where control over the formatting of each entry in each row could be much better. In addition, it should dramatically reduce RAM usage on large tables, as the entire astropy table would not need to be converted into a string at once; instead, each row would be processed individually (a bit like how saving a .csv file works in most implementations).
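The row-by-row approach described above might look something like this sketch, with a plain list of tuples standing in for the astropy table (`fmt_row`, the fixed `%12.8f` format, and the file name are all made up for illustration):

```python
def fmt_row(row):
    """Format one row with a fixed-width float format (assumed F12.8 here)."""
    return " ".join("%12.8f" % value for value in row)

def write_table(rows, path):
    # Write rows one at a time, so the whole table is never held in RAM
    # as a single giant string.
    with open(path, "w") as out:
        for row in rows:
            out.write(fmt_row(row) + "\n")

# Made-up sample data.
rows = [(10.5, -0.000034), (45.0, 0.951366)]
write_table(rows, "table.dat")
```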

Even though the module wasn't originally designed for such large tables, I think it's actually really the only way to process them - I found that the CDS web interface wasn't able to handle my tables due to their size, and processing them locally with this module worked a lot better. It was much easier than writing out an entire ReadMe file by hand for such large tables containing so many columns and entries.

Thanks for your help!

@gilleslandais
Collaborator

  • Scientific notation issue:
    annoying!
    The library, which is based on astropy, selects the output format. To avoid the problem, you can specify the format explicitly:
import cdspyreadme

tablemaker = cdspyreadme.CDSTablesMaker()
table = tablemaker.addTable(...)
column = table.get_column("DE_ICRS")
column.set_format("F12.8")

The problem remains that this requires verification by the author. I will investigate a clean fix.

  • Large table issue: I share your RAM analysis.
    Improving efficiency would require developing a new library, not based on astropy.
    ... an interesting project that requires an investment in development.
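For reference, "F12.8" is a Fortran-style fixed-point format (12 characters wide, 8 decimal places); in Python terms it behaves like `%12.8f`, which, unlike a general format, never falls back to scientific notation:

```python
# F12.8 ~ a fixed-point field 12 characters wide with 8 decimals; small
# magnitudes keep their decimal representation instead of switching to
# scientific notation.
print("%12.8f" % -0.000034)  # ' -0.00003400'
print("%12.8f" % 0.951366)   # '  0.95136600'
```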
